# Gwordlist

All the words from Google Books, sorted by frequency.
| ***** | NOTICE | ***** |
|-------|:------:|-------|
| | This repository serves large files using GitHub's LFS, which now charges for bandwidth. If you receive a quota error, download the tiny `1gramsbyfreq.sh` shell script. Running that on your own machine will download Google's entire corpus (over 15 GB) and then, after much processing, prune it down to 0.25 GB. | |
| ***** | | ***** |
This project includes wordlists derived from Google's ngram corpora, plus the programs used to download and derive the lists automatically, should you wish to regenerate them yourself.
The most important files:
- **`frequency-all.txt.gz`** (266 MB): Compressed list of all 29 billion words in the corpus, sorted by frequency. Decompresses to over 2 GB. Includes words with weird symbols, numbers, misspellings, OCR errors, and foreign languages.
- **`frequency-alpha-alldicts.txt`** (18 MB): List of the 246,591 alphabetical words that could be verified against various dictionaries (GCIDE/Webster's 1913, WordNet, and OED v2). Sorted by frequency.
- **`1gramsbyfreq.sh`**: The main shell script, which downloads the data from Google and extracts frequency information.
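For a quick first look at the big list without unpacking all of it, something like the following should work on any machine with `zcat` (assuming the file has already been fetched through git-lfs; see the LFS section below):

```bash
# Peek at the 20 most frequent entries without writing the 2 GB
# decompressed file to disk.
zcat frequency-all.txt.gz | head -n 20
```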
## What does the data look like?
Here's a sample of one of the files:
| #RANKING | WORD | COUNT | PERCENT | CUMULATIVE |
|---------:|------|------:|--------:|-----------:|
| 1 | , | 115,513,165,249 | 5.799422% | 5.799422% |
| 2 | the | 109,892,823,605 | 5.517249% | 11.316671% |
| 3 | . | 86,243,850,165 | 4.329935% | 15.646607% |
| 4 | of | 66,814,250,204 | 3.354458% | 19.001065% |
| 5 | and | 47,936,995,099 | 2.406712% | 21.407776% |
Interestingly, if this data is right, only five words make up 20% of all the words in books from 1880 to 2020. And two of those "words" are punctuation marks!! (Don't believe a comma is a word? I've also created wordlists that exclude punctuation; see the files named "alpha".)
## Why does this exist?
I needed my XKCD 936 compliant password generator to have a good list of words in order to make memorable passphrases. Most lists I've seen are not terribly good for my purposes as the words are often from extremely narrow domains. The best I found was SCOWL, but I didn't like that the words weren't sorted by frequency so I couldn't easily take a slice of, say, the top 4096 most frequent words.
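For what it's worth, slicing off such a top-N list is a one-liner once a frequency-sorted file exists. Here is a minimal sketch, assuming the word is the second whitespace-separated column of `frequency-alpha-alldicts.txt` (as in the sample above) and that `shuf` from GNU coreutils is available:

```bash
# Grab the 4096 most frequent dictionary-verified words (skipping the
# header line), then pick four at random for an XKCD-936 style passphrase.
awk 'NR > 1 { print $2 }' frequency-alpha-alldicts.txt | head -n 4096 > top4096.txt
shuf -n 4 top4096.txt | paste -sd ' ' -
```

(`shuf` is fine for illustration; a real passphrase generator should draw its randomness from a cryptographically strong source.)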
The obvious solution was to use Google's ngram corpus, which claims to have a trillion different words culled from all the books they've scanned for books.google.com (about 4% of all books ever published, they say). Unfortunately, while some people had posted small lists, nobody had the entire list of every word sorted by frequency. So I made this, and here it is.
## What can this data be used for?
Anything you want. While my programs are licensed under the GNU GPL ≥3, I'm explicitly releasing the data produced under the same license as Google granted me: Creative Commons Attribution 3.0.
## How many words does it really have?
There are 37,235,985 entries in the V3 (20200217) corpus, but it's a mistake to think there are 37 million different, useful words. For example, 6% of the words found are a single comma. Google used completely automated OCR techniques to find the words, and it made a lot of mistakes. Moreover, their definition of a "word" includes things like `s`, `A4oscow`, `IIIIIIIIIIIIIIIIIIIIIIIIIIIII`, `cuando`, `لاامش`, `ihm`, `SpecialMarkets@ThomasNelson`, `buisness` [sic], and `,`.
To compensate, Google only included words in the corpus that appeared at least 40 times, but even so there's so much dreck at the bottom of the list that it's really not worth bothering with. Personally, I found that words that appeared over 100,000 times tended to be worthwhile. In addition, I was getting so many obvious OCR errors that I decided to also create some cleaner lists by using `dict` to check every word against a dictionary. (IMPORTANT NOTE! If you run these scripts, be sure to set up your own dictd server so you're not pounding the public internet servers with a bazillion lookups.)
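As an illustration of the dictionary check (not the exact code used by the scripts), a loop like this would keep only the words a local dictd instance recognizes. `candidates.txt` and `verified.txt` are placeholder names, and `-h localhost` assumes dictd is running on the same machine:

```bash
# Keep only words that the local dictd server can find in at least one
# dictionary; dict exits non-zero when a word has no matches.
while read -r word; do
    dict -h localhost "$word" >/dev/null 2>&1 && echo "$word"
done < candidates.txt > verified.txt
```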
After pruning with dictionaries, I found that 65,536 words seemed like a more reasonable cut-off. However, the script currently does not limit the number of words, and because this part has not been optimized yet, it can take a very long time. For faster runs, set `maxcount=65536`.
## How big are the files?
If you run my scripts (which are tiny), they will download about 14 GiB of data from Google. However, if you simply want the final list, it uncompresses to over 350 MB. Alternatively, if you don't need so many words, consider downloading one of the smaller files I created, which have been cleaned up and limited to only the top words verified in dictionaries, such as `frequency-alpha-alldicts.txt`.
## What got thrown away in these subcorpora?
As you can guess, since the file size went down by 90%, I tossed a lot of information. The biggest changes came from losing the separate counts for each year, ignoring the part-of-speech tags (e.g., I used only the count for "watch", which includes the counts for `watch_VERB` and `watch_NOUN`), and from combining different capitalizations into a single term. (Each word is listed under its most frequent capitalization: for example, "London" instead of "london".) If you need that data, it's not hard to modify the scripts. Let me know if you have trouble.
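As a rough sketch of the capitalization-merging idea (the real work happens inside `1gramsbyfreq.sh` and the details may differ), an awk filter over hypothetical "word count" pairs that are already sorted by descending count could look like this:

```bash
# Merge case variants: keep the first spelling seen (the most frequent one,
# since the input is sorted by count), sum the counts, then re-sort.
awk '{
    key = tolower($1)
    if (!(key in spelling)) spelling[key] = $1   # remember most frequent form
    total[key] += $2
}
END { for (k in total) print spelling[k], total[k] }' counts.txt |
sort -k2,2nr
```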
## What got added?
I counted up the total number of words in all the books so I could get a rough percentage of how often each word is used in English. I also include a running total of the percentages so you can truncate the file wherever you want (e.g., to get a list covering 95% of all words used in English).
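That running total makes truncation easy. For example, assuming the column layout shown in the sample above (word in the second whitespace-separated field, cumulative percentage in the fifth), something like this keeps only the words that together account for roughly 95% of usage:

```bash
# Print words until the cumulative-percentage column reaches 95%.
awk 'NR == 1 { next }        # skip the header line
     { print $2 }
     $5+0 >= 95 { exit }     # numeric coercion ignores the trailing % sign
' frequency-alpha-alldicts.txt > top95.txt
```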
## Part-of-speech tags
The corpus includes words suffixed with an underscore and a tag marking the part of speech the word appears to have been used as. For example:
| #RANKING | WORD | COUNT | PERCENT | CUMULATIVE |
|---------:|------|------:|--------:|-----------:|
| #5101 | watch | 76,770,311 | 0.001284% | 85.124506% |
| #8225 | watch_VERB | 44,060,908 | 0.000737% | 88.174382% |
| #10464 | watch_NOUN | 32,697,074 | 0.000547% | 89.601624% |
- Words tagged with a part of speech appear to be simply duplicate counts of the root word. In the example of `watch` above, note that 76,770,311 ≈ 44,060,908 + 32,697,074. (A filtering sketch follows this list.)
- List of part-of-speech tags (from books.google.com/ngrams/info):
  - NOUN: noun (examples: `time_NOUN`, `State_NOUN`, `Mr._NOUN`)
  - VERB: verb (examples: `is_VERB`, `be_VERB`, `have_VERB`)
  - ADJ: adjective (examples: `other_ADJ`, `such_ADJ`, `same_ADJ`)
  - ADV: adverb (examples: `not_ADV`, `when_ADV`, `so_ADV`)
  - PRON: pronoun (examples: `it_PRON`, `I_PRON`, `he_PRON`)
  - DET: determiner or article (examples: `the_DET`, `a_DET`, `this_DET`)
  - ADP: adposition, either a preposition or a postposition (examples: `of_ADP`, `in_ADP`, `for_ADP`)
  - NUM: numeral (examples: `one_NUM`, `1_NUM`, `2001_NUM`)
  - CONJ: conjunction (examples: `and_CONJ`, `or_CONJ`, `but_CONJ`)
  - PRT: particle (examples: `to_PRT`, `'s_PRT`, `'_PRT`, `out_PRT`)
- Part-of-speech tags undocumented by Google:
  - `.`: punctuation (example: `,_.`)
  - X: ??? (examples: `[_X`, `*_X`, `=_X`, `etc._X`, `de_X`, `No_X`)
- Google uses these tags for searching, but they don't appear in the data (at least not in 1-grams): ROOT (root of the parse tree), START (start of a sentence), and END (end of a sentence). These tags must stand alone (e.g., START).
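Since the tagged entries are just duplicate counts, you may want to drop them when building your own list. A minimal sketch, again assuming the word is the second whitespace-separated column (the output filename is arbitrary):

```bash
# Drop entries carrying a part-of-speech suffix (watch_VERB, ,_., =_X, ...)
# and keep only the plain words.
zcat frequency-all.txt.gz |
awk '$2 !~ /_(NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X|\.)$/' > frequency-notags.txt
```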
## Bugs
- The script cannot run on a 32-bit machine, as it briefly requires more than 4 GiB of RAM to build a hash table of every word.
## To Do
- Use a Makefile for dependencies so that multiprocessing is built in (using `make -j`), instead of having to append `&` to commands.
- Use `comm` instead of `dict` to check wordlists against dictionaries (see the sketch after this list).
- The number of books a word occurs in should help determine popularity. Perhaps popularity = occurrences × books?
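The `comm` idea above could look roughly like this (bash process substitution; `dictionary-words.txt` is a stand-in for a flat word list extracted from whatever dictionaries are available, and both inputs must be sorted under the same collation):

```bash
# Intersect the candidate wordlist with a plain-text dictionary wordlist.
# comm -12 prints only the lines common to both (already-sorted) inputs.
comm -12 <(LC_ALL=C sort -u words.txt) \
         <(LC_ALL=C sort -u dictionary-words.txt) > verified-words.txt
```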
## LFS
GitHub does not allow files larger than 100 MB. The file `frequency-all.txt.gz` is 266 MB, so it has been placed on git-lfs.
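If you do want the LFS-hosted files (and have bandwidth quota to spare, per the notice at the top), a standard git-lfs fetch should work. This assumes the repository lives at github.com/hackerb9/gwordlist:

```bash
# Clone the repository and pull down the LFS-tracked data files.
git lfs install
git clone https://github.com/hackerb9/gwordlist.git
cd gwordlist
git lfs pull     # fetches the actual contents of the large files
```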
## Misc Notes
- Hyphenated words do not appear in the 1-gram list. Why not? Perhaps they are considered 2-grams?
- I may need a manually created "stopword" list due to all the obviously non-English words appearing in the list.
- Some of the 1-grams I'm turning up as quite popular should actually be 2-grams: e.g. "York" from "New York". Maybe I should add 2-grams to the list, since some of them will clearly be in the list of most common "words".
- Some words should be capitalized, such as "I" and "London". But it makes sense to accumulate "the" and "The" together, since otherwise both will be listed as one of the most common words.

  Solution: accumulate twice. The first time, be case-sensitive and sort by frequency. The second time, accumulate case-insensitively, outputting the first variation found.
- I'm currently getting some strange, or at least unexpected, results. While the top 100 words seem reasonably common, there are some strangely highly ranked words:

  | Rank | Word | Rank | Word | Rank | Word | Rank | Word |
  |-----:|------|-----:|------|-----:|------|-----:|------|
  | 124 | s | 147 | p | 151 | J | 165 | de |
  | 202 | M | 209 | general | 214 | B | 225 | S |
  | 226 | Mr | 228 | York | 238 | D | 241 | government |
  | 254 | R | 272 | et | 282 | E | 291 | John |
  | 292 | University | 294 | U | 309 | H | 325 | P |
  | 328 | pp | 359 | English | 365 | L | 371 | v |
  | 373 | London | 390 | W | 391 | Fig | 399 | e |
  | 405 | F | 422 | Figure | 426 | G | 444 | British |
  | 445 | T | 446 | c | 455 | N | 466 | II |
  | 472 | b | 478 | French | 479 | England | 508 | St |
  | 509 | General | | | | | | |

  Compare that with common words that are found much less frequently:

  | Rank | Word |
  |-----:|------|
  | 2124 | eat |
  | 4004 | TV |
  | 6040 | ate |
  | 6041 | bedroom |
  | 6138 | fool |
  | 10007 | foul |
  | 10012 | swim |
  | 10017 | sore |
  | 15013 | lone |
  | 15020 | doom |
- Certain dom
