Tidy

A command-line tool for combining and cleaning large word list files.

A throw of the dice will never abolish chance. — Stéphane Mallarmé

What this tool aims to help users do

Tidy aims to help users create "better" word lists -- generally word lists that will be used to create passphrases like "block insoluble cardinal discounts".

Tidy performs basic list-cleaning operations like removing duplicate words and blank lines by default. It additionally provides various optional standardizations and filters, like lowercasing all words (-l), or removing words with integers in them (-I), as well as protections against rare-but-possible passphrase pitfalls, such as prefix codes (-P) and low minimum word lengths (see below for explanations).
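
To illustrate why prefix words can be a pitfall, here is a minimal Python sketch. It is not Tidy's actual implementation. It assumes passphrase words are joined without a separator, and the function name `find_prefix_words` is hypothetical.

```python
def find_prefix_words(words):
    """Return the words that are a proper prefix of another word on the list."""
    ordered = sorted(set(words))
    prefix_words = set()
    # In sorted order, if any word starts with a given word, then the word
    # immediately after it does too, so comparing neighbors is enough.
    for current, following in zip(ordered, ordered[1:]):
        if following.startswith(current):
            prefix_words.add(current)
    return prefix_words


words = ["news", "paper", "newspaper", "block"]

# Without a separator, two different word sequences produce the same passphrase.
assert "news" + "paper" == "newspaper"

prefix_words = find_prefix_words(words)
safe_list = sorted(w for w in words if w not in prefix_words)
print(prefix_words)
print(safe_list)
```

Removing "news" from this small list means no word is the start of another, so a passphrase built from it can only be split into words one way.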

Tidy can also make word lists more "typo-resistant" by enforcing a minimum edit distance (-e), removing homophones, and/or enforcing a unique prefix length (-x), which can allow users to auto-complete words after a specified number of characters.
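
The two ideas in the paragraph above can be sketched in Python. This is an illustration only, not Tidy's code: it assumes "edit distance" means Levenshtein distance, and the names `edit_distance` and `words_sharing_prefix` are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute if different
            ))
        previous = current
    return previous[-1]


def words_sharing_prefix(words, length):
    """Return words whose first `length` characters are not unique on the list."""
    counts = Counter(word[:length] for word in words)
    return [word for word in words if counts[word[:length]] > 1]


words = ["cardinal", "cardigan", "block", "discounts"]
print(edit_distance("block", "black"))
print(words_sharing_prefix(words, 4))
print(words_sharing_prefix(words, 6))
```

Here "block" and "black" are one edit apart, so a single typo could turn one into the other. "cardinal" and "cardigan" share their first four characters, so a list containing both could not guarantee auto-completion after four characters, though it could after six.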

Tidy can be used to create new word lists with desirable qualities (for example, if given more than one list, Tidy will combine and de-duplicate them). It can also be used to edit existing word lists.

Other resources

If you want to audit an existing word list without editing it, Tidy can do that, but I'd suggest using my related Word List Auditor.
If you just want some word lists, you can check out my Orchard Street Wordlists.

Tidy's features

Given a text file with one word per line, this tool will create a new word list in which...

duplicate lines (words) are removed
empty lines are removed
whitespace at the beginning and end of words is deleted
words are sorted alphabetically (though this can be optionally prevented -- see below)

and print that new word list to the terminal or to a text file.
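
The default steps listed above can be approximated in a few lines of Python. This is a sketch, not how Tidy is implemented: the function name `basic_clean` is hypothetical, and it sorts by Unicode code point, whereas Tidy's sorting respects a locale.

```python
def basic_clean(lines, sort=True):
    """Trim whitespace, drop blank lines and duplicates, then optionally sort."""
    seen = set()
    cleaned = []
    for line in lines:
        word = line.strip()  # remove leading and trailing whitespace
        if not word or word in seen:
            continue  # skip blank lines and words already kept
        seen.add(word)
        cleaned.append(word)
    return sorted(cleaned) if sort else cleaned


raw_lines = ["  insoluble\n", "", "block", "cardinal ", "block", "   "]
print(basic_clean(raw_lines))
print(basic_clean(raw_lines, sort=False))
```

With `sort=False` the first occurrence of each word keeps its original position, which mirrors the idea of preserving list order while still removing duplicates and blank lines.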

Optionally, the tool can...

combine two or more inputted word lists
make all characters lowercase (-l)
set a minimum and maximum for word lengths (-m, -M)
delete integers or non-alphanumeric characters from words (-i, -n), or remove words that contain them (-I, -N)
delete all characters after or before a delimiter (-d/-D)
take lists of words to reject or allow
remove homophones from a provided list of comma-separated pairs of homophones
enforce a minimum edit distance between words (-e)
remove prefix words (see below) (-P)
remove suffix words (-S)
remove all words with non-alphabetic characters from new list
straighten curly/smart quotes, i.e. replacing them with their "straight" equivalents (-q)
guarantee a maximum shared prefix length (see below) (-x)
normalize Unicode of all characters of all words on list to a specified normalization form (NFC, NFKD, etc.) (-z)
print corresponding dice rolls before words, separated by a tab. Dice can have 2 to 36 sides. (--dice)
print information about the new list, such as entropy per word, to the terminal (-A, -AA, -AAA, or -AAAA depending on how much information you want printed)

and more!

NOTE: If you do NOT want Tidy to sort the list alphabetically, you can use the --no-sort option.
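
For a sense of what "entropy per word" and "corresponding dice rolls" mean in the feature list above, here is a hedged Python sketch. It is not taken from Tidy's source. It assumes each word is chosen uniformly at random, and the exact output format for dice with fewer than 10 sides is a guess. The function names are hypothetical.

```python
import math


def entropy_per_word(list_length):
    """Bits of entropy from picking one word uniformly at random."""
    return math.log2(list_length)


def dice_roll(index, sides, dice_count):
    """Map a 0-based word position to 1-indexed dice values."""
    values = []
    for _ in range(dice_count):
        index, remainder = divmod(index, sides)
        values.append(remainder + 1)  # dice faces start at 1, not 0
    values.reverse()  # most significant die first
    if sides > 9:
        # Two-digit values joined by hyphens, e.g. 18-03-08
        return "-".join(f"{value:02d}" for value in values)
    return "".join(str(value) for value in values)


# A list of 6**5 = 7776 words fits five rolls of a six-sided die.
print(round(entropy_per_word(6**5), 2))
print(dice_roll(0, sides=6, dice_count=5))
print(dice_roll(6**5 - 1, sides=6, dice_count=5))
print(dice_roll(6847, sides=20, dice_count=3))
```

A list of 7,776 words gives about 12.92 bits per word, so a four-word passphrase drawn from it has roughly 51.7 bits of entropy.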

Usage

Usage: tidy [OPTIONS] <Inputted Word Lists>...

Arguments:
  <Inputted Word Lists>...
          Word list input files. Can be more than one, in which case they'll be
          combined and de-duplicated. Requires at least one file

Options:
  -a, --approve <APPROVED_LIST>
          Path(s) for optional list of approved words. Can accept multiple files

  -A, --attributes...
          Print attributes about new list to terminal. Can be used more than once to
          print more attributes. Some attributes may take a nontrivial amount of time
          to calculate

  -j, --json
          Print attributes and word samples in JSON format

      --cards
          Print playing card abbreviation next to each word. Strongly recommend only
          using on lists with lengths that are powers of 26 (26^1, 26^2, 26^3, etc.)

      --debug
          Debug mode

  -d, --delete-after <DELETE_AFTER_DELIMITER>
          Delete all characters after the first instance of the specified delimiter
          until the end of line (including the delimiter). Delimiter must be a single
          character (e.g., ','). Use 't' for tab and 's' for space. May not be used
          together with -g or -G options

  -D, --delete-before <DELETE_BEFORE_DELIMITER>
          Delete all characters before and including the first instance of the specified
          delimiter. Delimiter must be a single character (e.g., ','). Use 't' for tab
          and 's' for space. May not be used together with -g or -G options

  -i, --delete-integers
          Delete all integers from all words on new list

  -n, --delete-nonalphanumeric
          Delete all non-alphanumeric characters from all words on new list. Characters
          with diacritics will remain

      --dice <DICE_SIDES>
          Print dice roll before word in output. Set number of sides of dice. Must be
          between 2 and 36. Use 6 for normal dice

      --dry-run
          Dry run. Don't write new list to file or terminal

  -f, --force
          Force overwrite of output file if it exists

      --homophones <HOMOPHONES_LIST>
          Path(s) to file(s) containing homophone pairs. There must be one pair of
          homophones per line, separated by a comma (sun,son). If BOTH words are found
          on a list, the SECOND word is removed. File(s) can be a CSV (with no column
          headers) or TXT file(s)

  -g, --ignore-after <IGNORE_AFTER_DELIMITER>
          Ignore characters after the first instance of the specified delimiter until the
          end of line, treating anything before the delimiter as a word. Delimiter must be
          a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful for
          ignoring metadata like word frequencies. Works with attribute analysis and most
          word removal options, but not with word modifications (like to lowercase).
          May not be used together with -d, -D or -G options

  -G, --ignore-before <IGNORE_BEFORE_DELIMITER>
          Ignore characters before and including the first instance of the specified
          delimiter, treating anything after the delimiter as a word. Delimiter must
          be a single character (e.g., ','). Use 't' for tab and 's' for space. Helpful
          for ignoring metadata like word frequencies. Works with attribute analysis
          and most word removal options, but not with word modifications (like to lowercase).
          May not be used together with -d, -D or -g options

      --locale <LOCALE>
          Specify a locale for words on the list. Aids with sorting. Examples: en-US,
          es-ES. Defaults to system LANG. If the LANG environment variable is not set,
          uses en-US

  -l, --lowercase
          Lowercase all words on new list

  -M, --maximum-word-length <MAXIMUM_LENGTH>
          Set maximum word length

  -x, --shared-prefix-length <MAXIMUM_SHARED_PREFIX_LENGTH>
          Set number of leading characters to get to a unique prefix, which can aid
          auto-complete functionality. Setting this value to, say, 4 means that knowing
          the first 4 characters of any word on the generated list is enough to know
          which word it is

  -e, --minimum-edit-distance <MINIMUM_EDIT_DISTANCE>
          Set minimum edit distance between words, which can reduce the cost of typos
          when entering words

  -m, --minimum-word-length <MINIMUM_LENGTH>
          Set minimum word length

      --sort-by-length
          Sort by word length, with longest words first. First sorts words 
          alphabetically, respecting inputted locale

      --concat
          If multiple word list files are given, concatenate word lists in the order
          given. Default behavior is to "blend" them, like dealing playing cards in reverse

  -O, --no-sort
          Do NOT sort outputted list alphabetically. Preserves original list order. Note
          that duplicate lines and blank lines will still be removed

  -z, --normalization-form <NORMALIZATION_FORM>
          Normalize Unicode of all characters of all words. Accepts nfc, nfd, nfkc,
          or nfkd (case insensitive)

  -o, --output <OUTPUT>
          Path for outputted list file. If none given, generated word list will be printed
          to terminal

      --sides-as-base
          When printing dice roll before word in output, print dice values according to
          the base selected through --dice option. Effectively this means that letters will
          be used to represent numbers higher than 9. Note that this option also 0-indexes
          the dice values. This setting defaults to `false`, which produces 1-indexed
          dice values and uses double-digit numbers when necessary (e.g. 18-03-08)

      --print-first <PRINT_FIRST>
          Just before printing generated list, cut list down to a set number of
          words. Can accept expressions in the form of base**exponent (helpful
          for generating diceware lists). Words are selected from the beginning
          of the processed list, before it is sorted alphabetically

      --print-rand <PRINT_RAND>
          Just before printing generated list, cut list down to a set number of words.
          Can accept expressions in the form of base**exponent (helpful for generating
          diceware lists). Cuts are done randomly

      --quiet
          Do not print any extra information

  -I, --remove-integers
          Remove all words with integers in them from list

  -N, --remove-nonalphanumeric
          Remove all words with non-alphanumeric characters from new list. Wor
