Baselining PassGAN: Adventures in the Rhubarb

Coalfire Cybersecurity Team

June 26, 2020

Cracking is a complex topic full of misunderstandings, confusing terminology and weird people. This blog post is front-loaded with some terminology, some explanations, and maybe some apologies. Hopefully, this will help calm your mind as we barrel into the weeds.

Password cracking: This is fundamentally one thing: guessing. We’re not reversing, or talking to spirits or anything—we are picking a password candidate, running it through a hash algorithm and comparing the output to a target hash. In other words, math.
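That guessing loop is tiny. Here's a minimal sketch in Python against a hypothetical unsalted MD5 target (real engagements involve saltier, slower algorithms, but the shape is the same):

```python
import hashlib

# Hypothetical target: the unsalted MD5 of a password we "don't know".
target = hashlib.md5(b"pancakes").hexdigest()

def crack(target_hash, candidates):
    """Hash each candidate and compare to the target; that's all cracking is."""
    for word in candidates:
        if hashlib.md5(word.encode()).hexdigest() == target_hash:
            return word
    return None

print(crack(target, ["letmein", "hunter2", "pancakes"]))  # -> pancakes
```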

Hashcat: An extremely fast password cracker. It supports hundreds of hashing algorithms and has many built-in functions for natively generating a candidate (guess) or modifying a candidate from a source. One of these options is rulesets.

Ruleset (hashcat): A series of candidate-altering commands. This easily allows a single candidate (pancakes) to explode into thousands, or millions, of unique guesses. (Pancakes, PANCAKES1!, P4nC4k3s2020!)
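As a rough illustration (this is not hashcat's actual rule engine, just a hand-rolled imitation of a few rule effects), one candidate fans out like so:

```python
# Illustrative only: mimics a few hashcat-style mangling rules by hand.
def mangle(word):
    leet = str.maketrans("aeos", "4305")
    variants = {
        word,                                          # : (passthrough)
        word.capitalize(),                             # c (capitalize)
        word.upper() + "1!",                           # u plus append
        word.translate(leet).capitalize() + "2020!",   # leet-substitute plus append
    }
    return variants

print(sorted(mangle("pancakes")))
```

A real ruleset like OneRuleToRuleThemAll applies tens of thousands of these transformations per candidate, which is where the "explosion" comes from.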

Source: In this post, source encompasses wordlists and tool output; a source of password candidates.

Wordlist/dictionary: A text file of password candidates (sometimes the literal dictionary). Used interchangeably and just what it sounds like.

Tool: For this discussion, tool refers to a program that takes a dictionary and creates a set of tool-specific rules or code from that dictionary. The tool then uses that trained set to generate password guesses.

Trained set (or set): The output produced by the analysis/training side of a tool. This data is then used by the generation side of the tool to produce password guesses. The methodology used and exact output can vary drastically between tools, but the sets serve the same purpose.

GAN (Generative Adversarial Network): A machine learning technique where two neural networks work against each other to make each other better at the set task, the result being one of the nets is a good generator of the training material. Like faces.

PCFG (Probabilistic Context-free Grammar): This has to do with statistical modeling of grammars and language. I swear to you, reader, I cannot explain this.

Markov Chain: A series of states where the probability of the next state depends on the current state. (E.g., if the current letter is 'a', there's a 20% chance the next will be 's'.)
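A throwaway sketch of that idea, training single-character transition probabilities on a toy wordlist:

```python
from collections import Counter, defaultdict

def train_markov(words):
    """Count which letter follows which, then normalize counts to probabilities."""
    counts = defaultdict(Counter)
    for w in words:
        for cur, nxt in zip(w, w[1:]):
            counts[cur][nxt] += 1
    return {c: {n: k / sum(f.values()) for n, k in f.items()}
            for c, f in counts.items()}

probs = train_markov(["password", "passion", "party"])
# After 'a', what usually comes next in this tiny corpus?
print(probs["a"])  # -> {'s': 0.666..., 'r': 0.333...}
```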

Aaron Jones: Coalfire Pen Tester and author of this paper; cracking enthusiast and definitely trying to get your password. STR: 8 DEX: 9 CON: 11 INT: 12 WIS: 2 CHA: 8; Weaknesses: Bad ideas, spelling; Strenks: Unknown; Alignment: Chaotic Good; Other attributes: Full of malarkey

Here we go
What I’ve done here, and I really doubt this hasn’t been done before, is take a bunch of dictionaries and a handful of password-generation tools, crack a bunch of hashes, and make a spreadsheet. It took weeks. Months? It took an unknown amount of late nights and so, so many kilowatts.

This is not what I meant to do.

Looking back though, it all seems inevitable.

What I originally meant to do was experiment with GANs for password cracking—which I eventually did—but not until after coming up with a way to compare it to other things. Then figure out what those things would be. Then get the data for all those other things.

Then make a spreadsheet.

The baseline.

This is not what I envisioned doing, but the seed had been planted, the roots had grown, the rock had cracked, and the mountain was moving.

The Baseline

Overall baseline process

After a lot of handwringing, I decided on a straightforward test that reflected what I do: I’d crack some hashes. It seemed that evaluating how well a GAN could crack hashes from the wild would be worth trying.

I would get a big pile of hashes sourced from real users, and attack those. That way I wouldn’t have to worry about any bias I might have introduced if I created the list myself or tried to use a model of what passwords supposedly looked like. The hashes came from sources with a wide variety of password policies, so I believe it represents a good sampling of what other pen testers might run into and reflects an overall effectiveness. Uncracked hashes were left in.

In my experience, every wordlist or tool by itself typically has mediocre to terrible results. But when you add a ruleset to a hashcat attack, things get interesting. Using a ruleset is part of my standard workflow, so I wanted the baseline to reflect a typical use case. However, since the idea of using GANs to generate passwords is to get away from the need for rulesets (which are themselves the result of attempts at extracting emergent patterns), using a ruleset on the output of the GAN seemed to defeat the entire point.

To address this, I added a passthrough rule to the ruleset and had hashcat debug-log its cracks. On a successful crack, the candidate and the active rule would be logged; any direct hits from the source would show up on the passthrough rule.

  1. Hashcat would use NotSoSecure’s OneRuleToRuleThemAll ruleset, saving each attack’s results to its own potfile and rule log for later analysis.
  2. Attack with each dictionary (6 dictionaries)
  3. Use each tool to generate trained sets from each dictionary (4 tools, 6 sets per tool)
  4. Attack with each tool and trained set combination (24 tool attacks)

If successful, this would provide comparable metrics on 30 different attack scenarios (60 if I’m milking it: 30 unique runs, plus debug analysis showing direct hits vs. onerule gains).
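That debug analysis, separating direct hits from rule-assisted cracks, is a one-pass count. A sketch, assuming a debug log where each line is `original_word:rule` (the shape of hashcat's `--debug-mode=3` output) and the passthrough rule is a lone colon:

```python
# Assumes log lines of the form "original_word:rule", where a rule of ":"
# means the candidate cracked the hash as-is (the passthrough rule).
def tally_hits(lines):
    direct, mangled = 0, 0
    for line in lines:
        word, _, rule = line.rstrip("\n").partition(":")
        if rule == ":":
            direct += 1   # source word hit directly
        else:
            mangled += 1  # a mangling rule produced the crack
    return direct, mangled

log = ["pancakes::", "summer:$2$0$2$0", "dragon::"]  # hypothetical log lines
print(tally_hits(log))  # -> (2, 1)
```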

The theory was: once a process was established, any other source could be compared objectively, apples to apples.

Operative word: theory.


These dictionaries were picked based on a general sense of their infamy, size variance and effectiveness; the big idea was to take a selection of lists that anyone could find and attempt to put them through the same paces.

Rockyou
134MB, 14344392 lines

The legendary classic, sourced from Kali.

Rocktastic12a
13GB, 1133849621 lines (1.1 billion)

A fantastic wordlist, and until these trials my go-to top performer. This wordlist along with onerule will crush a domain if its policy isn't up to snuff.

Linkedin Leak
619MB, 60635806 lines

Gathered in January 2020. Hashlists are a living project there, so there have been updates since I grabbed it.

Adult Friend Finder
230MB, 36885516 lines

Gathered in January 2020. Hashlists are a living project there, so there have been updates since I grabbed it.

Crackstation
15GB, 1212356398 lines (1.2 billion)

Legion. Legend. Highly effective despite some of its baffling content.
Because of its baffling content?

HIBP: Have I Been Pwned (v1-v5)
6.1GB, 552843281 lines

Gathered in January 2020. Hashlists are a living project there, so there have been updates since I grabbed it.

I collated and deduped all five releases and called the result ‘you have been p0wned’, aka yhbp. 
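The collate-and-dedupe step is conceptually just this (filenames here are hypothetical; at these sizes you'd realistically reach for `sort -u` and a lot of scratch disk):

```python
# Merge several wordlists into one, keeping the first occurrence of each line.
# Filenames below are placeholders, not the actual release names.
def collate(paths, out_path):
    seen = set()
    with open(out_path, "w", encoding="utf-8", errors="replace") as out:
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as f:
                for line in f:
                    word = line.rstrip("\n")
                    if word not in seen:
                        seen.add(word)
                        out.write(word + "\n")

# Example (hypothetical filenames):
# collate([f"hibp_v{i}.txt" for i in range(1, 6)], "yhbp.txt")
```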

Might be art

The tools

Delving into the differences and theory of each tool is beyond the scope of this writeup. Below you’ll find a brief description (mostly lifted from GitHub repositories) and links to the tools. You can find much more thorough descriptions of the theory and implementations at their respective repos.

PassGAN: A Deep Learning Approach for Password Guessing
With the use of GANs, a neural network should theoretically be able to intuit the properties that fingerprinting, raking, and analysis attempt to capture and directly generate appropriate candidates, obviating the need for said wordlists and rulesets.

NOTE: There wasn’t any code published with the paper, so I had to add my own blood, sweat, and tears to the fine work of others, likely corrupting anything good and decent about it. HERE and HERE

OMEN: Ordered Markov ENumerator
“OMEN” is a Markov model-based password guesser. It generates password candidates according to their occurrence probabilities (i.e., it outputs most likely passwords first.)

NOTE: Fancier Markov chains, using varying-size word chunks instead of a single character. Several tool options are available but were not explored. Worth a look!

Pcfg_cracker: Probabilistic Context Free Grammar
This project uses machine learning to identify password creation habits of users. A PCFG model is generated by training on a list of disclosed plaintext/cracked passwords ... this project also includes a PCFG guess generator that makes use of this ruleset to generate password guesses in probability order.

NOTE: PCFG; Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. (Wikipedia) Buckle up, this is a deep read.
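To make the grammar idea slightly less terrifying: a Weir-style PCFG models a password as a structure (say, letters-then-digits) filled with terminals, each carrying a learned probability, and emits guesses highest-probability first. A toy sketch with entirely made-up numbers:

```python
from itertools import product

# Toy PCFG: structure probabilities and terminal probabilities, as if learned
# from training data (all numbers invented for illustration).
structures = {("L", "D"): 0.6, ("L",): 0.4}
terminals = {
    "L": {"password": 0.5, "dragon": 0.3, "summer": 0.2},
    "D": {"123": 0.7, "2020": 0.3},
}

def guesses_in_order(structures, terminals):
    scored = []
    for struct, p_s in structures.items():
        for fill in product(*(terminals[part].items() for part in struct)):
            prob, word = p_s, ""
            for term, p_t in fill:
                prob *= p_t
                word += term
            scored.append((prob, word))
    # Highest-probability guesses first, like a PCFG guess generator.
    return [w for _, w in sorted(scored, reverse=True)]

print(guesses_in_order(structures, terminals)[:3])
# -> ['password123', 'password', 'dragon123']
```

The real pcfg_cracker learns the structures and terminals from cracked passwords and enumerates lazily instead of materializing everything, but the probability-ordered output is the same idea.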

Semantic Password Guesser
Tools for training probabilistic context-free grammars on password lists. The models encode syntactic and semantic linguistic patterns and can be used to generate guesses.

NOTE: Uses PCFG techniques, but also generalized rules from natural language studies via the Natural Language Toolkit (NLTK). This paper is a trip, awesome.

Hi, Alice, where’s the rabbit?


The training of some of these tools took a very long time, days per wordlist in some cases. It wasn’t uncommon for a training session to run for days, then crash, leaving me with nothing but broken dreams. The simple fact is, many of the tools I was trying to use weren’t made with billion-line wordlists in mind, and some of the wordlists had strange things in them: characters that were most definitely not ASCII.

I’m lucky enough to have a server with an inordinate amount of RAM installed, so I could bruteforce my way through when a training phase needed to soak up 100GB of RAM with jillions of arrays. I mention this in case you go looking to reproduce my experience.

Some tools were more sensitive than others and would break during training. Sometimes the training would complete, but then break in generation. Not all training and cracking combinations were successful.

The whole process took many, many weeks spread across several months. Many non-contiguous fortnights? That’s the one.

All of the tools have a selection of tunable options I didn’t have time to dig into, so I opted to use the defaults, trusting the author would know where to set the dials for a n00b.

I would highly recommend that anyone curious about password cracking take a look at them; I certainly plan to revisit them all.

The Gear

I used my own gear for this, hand-cobbled from recycling bins, Craigslist and pretentious names. If one were to use something more professional, or a pittance of Amazon’s unfathomable resources, who knows what one might be able to uncover? On the flip side of that sentiment, you don’t need ridiculous resources to crack effectively, experiment, or learn. (doesn’t hurt, but not required)


Semi-retired Ethereum rig. Cracking and GAN training were done here.

  • CPU: i3-7350K, 4 cores @ 4.2 GHz
  • RAM: 32 GB
  • GPUs: 6x GeForce GTX 1070
  • OS: Ubuntu 16
  • Docker version 19.03.8, build afacb8b7f0
  • TF docker: tensorflow/tensorflow:1.13.1-gpu-py3
  • NVIDIA-SMI 440.59, Driver Version 440.59, CUDA Version 10.2
  • hashcat v5.1.0


Proxmox-based VM host.

  • ProLiant DL380 G7
  • 2x Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz (6 cores each); 24 threads with hyperthreading
  • Proxmox 6.1-3

Finish the frakking story, man—what about the GANs?

Overall, the performance of the PassGAN was a bit of a letdown, although I think what I saw mostly just highlighted that I have a lot to learn when it comes to machine learning.

Not today, Jones

Training sessions would take 10-12 hours, and for cracking I would set the generator to run for sometimes just as long. The output from generation was jittery and didn’t keep hashcat very busy, even running on a single card.

I think PassGAN might actually be a lot more effective as a secondary attack against an organization once you hit the fingerprinting stage; otherwise it’s just over-generalized and super slow to boot. As far as speed goes, I’m sure a more capable coder could drastically speed things up.

I started seeing some better results after I took rocktastic12a and stripped out non-ascii, words more than 27 characters, and anything with more than four consecutive or repeating numbers.
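That cleanup pass amounts to a line filter. A sketch of my reading of it (the exact thresholds for "more than four consecutive or repeating numbers" are my interpretation):

```python
import re

# Drop non-ASCII lines, lines over 27 characters, long digit runs, and long
# runs of a repeated character. Thresholds are my reading of the rules above.
def keep(word):
    if len(word) > 27:
        return False
    if not word.isascii():
        return False
    if re.search(r"\d{5,}", word):      # five or more consecutive digits
        return False
    if re.search(r"(.)\1{4,}", word):   # same character five or more times in a row
        return False
    return True

words = ["pässword", "Summer2020!", "12345678901", "aaaaa1", "ok" * 20]
print([w for w in words if keep(w)])  # -> ['Summer2020!']
```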

Yeah, well, you know, that’s just, like, your opinion, man. — The Dude

I wasn’t very rigorous about keeping track of runtimes, but honestly the differences were so huge that practical usage precluded many of these tools. After all, why would I spend days training and hours running something when I could run a list+rule attack in 20 minutes that yields far more passwords?

But still, now I know because I tried.

The (successfully) trained sets are available for download, to save anyone else the headache of trying to do it themselves. Some general takeaways:

  1. PCFG guesser basically ruled the tools; great scores and very resilient to the ridiculous things I was doing to it. Go get it, post-haste.
  2. The larger the list you try to train with, the worse the results tend to be. If you’re using a tool that is intended to discover patterns, and you give it a pile of stuff riddled with Japanese, Greek, huge repeating sets of numbers… whatever it ‘learns’ will reflect that. Now, I suspect that if the huge input is actually appropriate for your target (i.e., trained on the first 70% of cracked passwords), it might learn something useful. Prime example: the PCFG results from the LinkedIn set outperformed nearly everything.
  3. Bigger isn’t always better. Crackstation and rocktastic are good lists, fantastic lists, but they both have a fair amount of ‘garbage’ (obvious hashes). Especially Crackstation—there is literally MOTD splash text and forum posts in there—it’s wild. The HIBP rollup far surpassed my expectations by beating Rocktastic AND Crackstation while being half the size.
  4. A lot of password analyzers don’t deal well with huge wordlists. Some handle the situation better than others, but I have the feeling that might be more a result of a use-case oversight than anything inherent to the particular theory. What madman is going to shove a billion lines through a password analysis tool? Seriously, it’s kind of nuts.
  5. Machine learning is hard.
  6. Do not write these tools off. I still intend to use them, just not as a first attack based on huge over-generalized dictionaries. I highly suspect their usage and performance will shine in smaller, more focused settings.

I would like to acknowledge and emphasize that my results, in the end, are still very specific to my use case. If one were to be cracking hashes from a non-English speaking source, the results would look far different. These numbers might not hold up for a company with a mature security stance. Cracking is where psychology meets statistics, and I have no doubt a skilled cracker with a solid understanding of their target can accomplish more with a laptop than I could with all of AWS.

That said, as specific as this may be, I think it provides some useful information.

Result numbers

Results are from a single, cold-start attack. Hashcat was used with the OneRuleToRuleThemAll ruleset in all cases.

  • Black: wordlist
  • Blue: pcfg
  • Green: semantic word guesser
  • Orange: OMEN
  • Red: GAN

Inception (see bottom of table) was feeding all harvested cracks from the hashset back into the GAN process to see what a cribbed network would do. I’m not sure why, out of all the trials, this ended up on the bottom; I was honestly expecting it to beat everything else. This is why we test.


[Results table: Crack count · % of Hashes (207683) · Direct hits]
This isn’t what I meant to do, but I’m glad I did it.

It started out as a simple, albeit vague, mission to investigate using GANs to crack passwords. The trip I ended up on was a much longer and painful journey than I expected, but in the end was far more interesting and valuable than I could have hoped for.

With any luck you’ve found this somewhat useful, either directly or as a cautionary tale.

See you in the weeds.