Tidy
Combine and clean word lists
A command-line tool for combining and cleaning large word list files.
A throw of the dice will never abolish chance. — Stéphane Mallarmé
What this tool aims to help users do
Tidy aims to help users create "better" word lists -- generally word lists that will be used to create passphrases like "block insoluble cardinal discounts".
Tidy performs basic list-cleaning operations like removing duplicate words and blank lines by default. It additionally provides various optional standardizations and filters, like lowercasing all words (-l), or removing words with integers in them (-I), as well as protections against rare-but-possible passphrase pitfalls, such as prefix codes (-P) and low minimum word lengths (see below for explanations).
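To illustrate why prefix codes matter: if both "news" and "newspaper" are on a list, a passphrase written without separators can be ambiguous ("newspaper" could be one word or "news" followed by "paper"). The sketch below is not Tidy's implementation. It is a minimal Python illustration that assumes a "prefix word" is any word that is a proper prefix of another word on the list; `remove_prefix_words` is a hypothetical helper name.

```python
def remove_prefix_words(words):
    """Return the words that are not a proper prefix of any other word.

    Illustrative only: assumes a "prefix word" is any word that another
    word on the list starts with. This is not Tidy's actual code.
    """
    unique = sorted(set(words))
    kept = []
    for i, word in enumerate(unique):
        # In code-point sorted order, if any other word starts with `word`,
        # then the very next word does, so one comparison is enough.
        nxt = unique[i + 1] if i + 1 < len(unique) else None
        if nxt is not None and nxt.startswith(word):
            continue  # `word` is a prefix of `nxt`; drop it
        kept.append(word)
    return kept


print(remove_prefix_words(["news", "newspaper", "paper", "elephant"]))
# ['elephant', 'newspaper', 'paper']
```

Here "news" is dropped because "newspaper" starts with it, while "paper" stays because no other word begins with "paper".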
Tidy can also make word lists more "typo-resistant" by enforcing a minimum edit distance (-e), removing homophones, and/or enforcing a unique prefix length (-x), which can allow users to auto-complete words after a specified number of characters.
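As a rough illustration of these two properties, the following Python sketch measures the smallest edit distance between any two words on a list and checks whether every word is identified by its first few characters. It assumes Levenshtein distance as the edit-distance metric and treats a word shorter than the prefix length as its own prefix. Both are assumptions for illustration, and the function names are hypothetical rather than part of Tidy.

```python
from itertools import combinations


def levenshtein(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def min_edit_distance(words):
    """Smallest pairwise distance on the list (needs at least two words)."""
    return min(levenshtein(a, b) for a, b in combinations(words, 2))


def has_unique_prefixes(words, length):
    """True if the first `length` characters identify every word."""
    prefixes = [w[:length] for w in words]
    return len(set(prefixes)) == len(prefixes)


words = ["block", "blink", "cardinal", "cards"]
print(min_edit_distance(words))       # 2  ("block" vs "blink")
print(has_unique_prefixes(words, 4))  # False ("cardinal" and "cards" share "card")
print(has_unique_prefixes(words, 5))  # True
```

The pairwise comparison is quadratic in the list length, which is acceptable for a sketch but slow for very large lists.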
Tidy can be used to create new word lists with desirable qualities (for example, if given more than one list, Tidy will combine and de-duplicate them). It can also be used to edit existing word lists.
Other resources
- If you want to audit an existing word list without editing it, Tidy can do that, but I'd suggest using my related Word List Auditor.
- If you just want some word lists, you can check out my Orchard Street Wordlists.
Tidy's features
Given a text file with one word per line, this tool will create a new word list in which...
- duplicate lines (words) are removed
- empty lines are removed
- whitespace at the beginning and end of each word is deleted
- words are sorted alphabetically (though this can be optionally prevented -- see below)
and print that new word list to the terminal or to a text file.
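The default behavior described above can be sketched in a few lines of Python. This is only an approximation for illustration, not Tidy's code: `tidy_basic` is a hypothetical name, and Python's `sorted` orders by code point, whereas Tidy's sorting respects a locale.

```python
def tidy_basic(lines):
    """Strip whitespace, drop blank lines and duplicates, then sort."""
    seen = set()
    cleaned = []
    for line in lines:
        word = line.strip()            # remove leading/trailing whitespace
        if word and word not in seen:  # skip blank lines and duplicates
            seen.add(word)
            cleaned.append(word)
    return sorted(cleaned)             # code-point order, not locale-aware


raw = ["  apple\n", "banana\n", "\n", "apple\n", "cherry  \n", "banana"]
print(tidy_basic(raw))  # ['apple', 'banana', 'cherry']
```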
Optionally, the tool can...
- combine two or more inputted word lists
- make all characters lowercase (-l)
- set a minimum and maximum for word lengths
- handle words with integers and non-alphanumeric characters
- delete all characters before or after a delimiter (-d/-D)
- take lists of words to reject or allow
- remove homophones from a provided list of comma-separated pairs of homophones
- enforce a minimum edit distance between words
- remove prefix words (see below) (-P)
- remove suffix words (-S)
- remove all words with non-alphabetic characters from new list
- straighten curly/smart quotes, i.e. replacing them with their "straight" equivalents (-q)
- guarantee a maximum shared prefix length (see below) (-x)
- normalize Unicode of all characters of all words on list to a specified normalization form (NFC, NFKD, etc.) (-z)
- print corresponding dice rolls before words, separated by a tab. Dice can have 2 to 36 sides. (--dice)
- print information about the new list, such as entropy per word, to the terminal (-A, -AA, -AAA, or -AAAA depending on how much information you want printed)
and more!
NOTE: If you do NOT want Tidy to sort the list alphabetically, you can use the --no-sort option (-O).
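Two of the items in the feature list, dice rolls (--dice) and entropy per word (-A), come down to simple arithmetic. The Python sketch below shows one way to compute them, assuming every word is equally likely to be chosen (so entropy per word is log2 of the list length) and assuming 1-indexed dice faces. The function names are hypothetical, and Tidy's exact output format may differ.

```python
import math


def entropy_per_word(list_length):
    """Bits of entropy per word when each word is equally likely."""
    return math.log2(list_length)


def rolls_needed(list_length, sides):
    """Smallest number of rolls n such that sides**n >= list_length."""
    n = 1
    while sides ** n < list_length:
        n += 1
    return n


def dice_roll(index, list_length, sides):
    """Map a 0-based word index to a list of 1-indexed dice faces."""
    faces = []
    for _ in range(rolls_needed(list_length, sides)):
        index, face = divmod(index, sides)  # peel off one base-`sides` digit
        faces.append(face + 1)              # shift 0..sides-1 to 1..sides
    faces.reverse()                         # most significant roll first
    return faces


print(round(entropy_per_word(7776), 2))  # 12.92
print(dice_roll(0, 7776, 6))             # [1, 1, 1, 1, 1]
print(dice_roll(6, 7776, 6))             # [1, 1, 1, 2, 1]
print(dice_roll(7775, 7776, 6))          # [6, 6, 6, 6, 6]
```

A list of 7,776 words (6**5) matches five rolls of a six-sided die exactly. For other lengths, some roll combinations would have no corresponding word.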
Usage
Usage: tidy [OPTIONS] <Inputted Word Lists>...
Arguments:
<Inputted Word Lists>...
Word list input files. Can be more than one, in which case they'll be
combined and de-duplicated. Requires at least one file
Options:
-a, --approve <APPROVED_LIST>
Path(s) for optional list of approved words. Can accept multiple files
-A, --attributes...
Print attributes about new list to terminal. Can be used more than once to
print more attributes. Some attributes may take a nontrivial amount of time
to calculate
-j, --json
Print attributes and word samples in JSON format
--cards
Print playing card abbreviation next to each word. Strongly recommend only
using on lists with lengths that are powers of 26 (26^1, 26^2, 26^3, etc.)
--debug
Debug mode
-d, --delete-after <DELETE_AFTER_DELIMITER>
Delete all characters after the first instance of the specified delimiter
until the end of line (including the delimiter). Delimiter must be a single
character (e.g., ','). Use 't' for tab and 's' for space. May not be used
together with -g or -G options
-D, --delete-before <DELETE_BEFORE_DELIMITER>
Delete all characters before and including the first instance of the specified
delimiter. Delimiter must be a single character (e.g., ','). Use 't' for tab
and 's' for space. May not be used together with -g or -G options
-i, --delete-integers
Delete all integers from all words on new list
-n, --delete-nonalphanumeric
Delete all non-alphanumeric characters from all words on new list. Characters
with diacritics will remain
--dice <DICE_SIDES>
Print dice roll before word in output. Set number of sides of dice. Must be
between 2 and 36. Use 6 for normal dice
--dry-run
Dry run. Don't write new list to file or terminal
-f, --force
Force overwrite of output file if it exists
--homophones <HOMOPHONES_LIST>
Path(s) to file(s) containing homophone pairs. There must be one pair of
homophones per line, separated by a comma (sun,son). If BOTH words are found
on a list, the SECOND word is removed. File(s) can be a CSV (with no column
headers) or TXT file(s)
-g, --ignore-after <IGNORE_AFTER_DELIMITER>
Ignore characters after the first instance of the specified delimiter until the
end of line, treating anything before the delimiter as a word. Delimiter must be
a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful for
ignoring metadata like word frequencies. Works with attribute analysis and most
word removal options, but not with word modifications (like to lowercase).
May not be used together with -d, -D or -G options
-G, --ignore-before <IGNORE_BEFORE_DELIMITER>
Ignore characters before and including the first instance of the specified
delimiter, treating anything after the delimiter as a word. Delimiter must
be a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful
for ignoring metadata like word frequencies. Works with attribute analysis
and most word removal options, but not with word modifications (like to lowercase).
May not be used together with -d, -D or -g options
--locale <LOCALE>
Specify a locale for words on the list. Aids with sorting. Examples: en-US,
es-ES. Defaults to system LANG. If LANG environmental variable is not set,
uses en-US
-l, --lowercase
Lowercase all words on new list
-M, --maximum-word-length <MAXIMUM_LENGTH>
Set maximum word length
-x, --shared-prefix-length <MAXIMUM_SHARED_PREFIX_LENGTH>
Set number of leading characters to get to a unique prefix, which can aid
auto-complete functionality. Setting this value to say, 4, means that knowing
the first 4 characters of any word on the generated list is enough to know
which word it is
-e, --minimum-edit-distance <MINIMUM_EDIT_DISTANCE>
Set minimum edit distance between words, which can reduce the cost of typos
when entering words
-m, --minimum-word-length <MINIMUM_LENGTH>
Set minimum word length
--sort-by-length
Sort by word length, with longest words first. First sorts words
alphabetically, respecting inputted locale
--concat
If multiple word list files are given, concatenate word lists in order given.
Default behavior is to "blend" them, like dealing playing cards in reverse
-O, --no-sort
Do NOT sort outputted list alphabetically. Preserves original list order. Note
that duplicate lines and blank lines will still be removed
-z, --normalization-form <NORMALIZATION_FORM>
Normalize Unicode of all characters of all words. Accepts nfc, nfd, nfkc,
or nfkd (case insensitive)
-o, --output <OUTPUT>
Path for outputted list file. If none given, generated word list will be printed
to terminal
--sides-as-base
When printing dice roll before word in output, print dice values according to
the base selected through --dice option. Effectively this means that letters will
be used to represent numbers higher than 9. Note that this option also 0-indexes
the dice values. This setting defaults to `false`, which will print 1-indexed
dice values, and use double-digit numbers when necessary (e.g. 18-03-08)
--print-first <PRINT_FIRST>
Just before printing generated list, cut list down to a set number of
words. Can accept expressions in the form of base**exponent (helpful
for generating diceware lists). Words are selected from the beginning
of processed list, and before it is sorted alphabetically
--print-rand <PRINT_RAND>
Just before printing generated list, cut list down to a set number of words.
Can accept expressions in the form of base**exponent (helpful for generating
diceware lists). Cuts are done randomly
--quiet
Do not print any extra information
-I, --remove-integers
Remove all words with integers in them from list
-N, --remove-nonalphanumeric
Remove all words with non-alphanumeric characters from new list. Wor
