Spellkit
Fast, safe typo correction for Ruby. SymSpell-based spell checker with Rust performance, term protection via regex patterns, and hot-reloadable dictionaries. Sub-millisecond latency, zero dependencies.
Fast, safe typo correction for search-term extraction. A Ruby gem with a native Rust implementation of the SymSpell algorithm.
SpellKit provides:
- Fast correction using SymSpell with configurable edit distance (1 or 2)
- Term protection - never alter protected terms using exact matches or regex patterns
- Hot reload - update dictionaries without restarting your application
- Sub-millisecond latency - p95 < 2µs on small dictionaries
- Thread-safe - built with Rust's Arc<RwLock> for safe concurrent access
Why a custom implementation? Existing Rust SymSpell crates require lowercase dictionary entries, but SpellKit preserves canonical forms (NASA stays NASA, iPhone stays iPhone). We also needed domain-specific guards, hot-reload, and Aspell-style skip patterns - features not available in existing implementations.
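The canonical-form behaviour described above can be pictured as a lookup keyed by a normalized spelling that returns the stored spelling. The following is a minimal sketch in plain Ruby, not SpellKit's Rust implementation; `CanonicalIndex` and its methods are hypothetical names used only for illustration.

```ruby
# Hypothetical illustration: index terms by their lowercase form,
# but hand back the spelling exactly as it appeared in the dictionary.
class CanonicalIndex
  def initialize(terms)
    @by_normalized = {}
    terms.each { |term| @by_normalized[term.downcase] = term }
  end

  # Returns the canonical spelling, or nil when the word is unknown.
  def canonical(word)
    @by_normalized[word.downcase]
  end
end

index = CanonicalIndex.new(%w[NASA iPhone hello])
index.canonical("nasa")   # => "NASA"
index.canonical("IPHONE") # => "iPhone"
index.canonical("xyz")    # => nil
```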
Why SpellKit?
No Runtime Dependencies
SpellKit is a Ruby gem with a native Rust extension. Just gem install spellkit and you're done. There is no need to install Aspell, Hunspell, or other system packages, which makes deployment simpler and more reliable across different environments.
Fast Performance
Built on the SymSpell algorithm with Rust, SpellKit delivers:
- 350,000+ operations/second for spell checking
- 3.7x faster than Aspell for correctness checks
- 40x faster than Aspell for generating suggestions
- p99 latency < 25µs even under load
See the Benchmarks section for detailed comparisons.
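Much of SymSpell's speed comes from its symmetric-delete approach: instead of generating every insertion, substitution, and transposition of a word, it only generates deletions, for both dictionary terms and the input, and looks for overlap. The snippet below is a simplified sketch of that idea in plain Ruby, not SpellKit's Rust code; `deletes` is a hypothetical helper.

```ruby
require "set"

# Hypothetical helper: every string reachable from `word` by removing
# up to `max_distance` characters (the word itself is included).
def deletes(word, max_distance)
  results = Set.new([word])
  frontier = [word]
  max_distance.times do
    frontier = frontier.flat_map do |w|
      (0...w.length).map { |i| w[0...i] + w[(i + 1)..] }
    end.uniq
    results.merge(frontier)
  end
  results
end

# A typo and its intended word share at least one delete variant,
# which is what makes the lookup a cheap set intersection.
shared = deletes("helllo", 1) & deletes("hello", 1)
shared.include?("hello") # => true
```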
Production Ready
- Thread-safe concurrent access
- Hot reload dictionaries without restarts
- Instance-based API for multi-domain support
- Comprehensive error handling
Installation
Add to your Gemfile:
gem "spellkit"
Or install directly:
gem install spellkit
Quick Start
SpellKit works with dictionaries from URLs or local files. Try it immediately:
require "spellkit"
# Load from URL (downloads and caches automatically)
SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
# Or use a configure block (recommended for Rails)
SpellKit.configure do |config|
  config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
  config.edit_distance = 1
end
# Or load from local file
# SpellKit.load!(dictionary: "path/to/dictionary.tsv")
# Check if a word is spelled correctly
puts SpellKit.correct?("hello")
# => true
# Get suggestions for a misspelled word
suggestions = SpellKit.suggestions("helllo", 5)
puts suggestions.inspect
# => [{"term"=>"hello", "distance"=>1, "freq"=>...}]
# Correct a typo
corrected = SpellKit.correct("helllo")
puts corrected
# => "hello"
# Batch correction
tokens = %w[helllo wrld ruby teset]
corrected_tokens = SpellKit.correct_tokens(tokens)
puts corrected_tokens.inspect
# => ["hello", "world", "ruby", "test"]
# Check stats
puts SpellKit.stats.inspect
# => {"loaded"=>true, "dictionary_size"=>..., "edit_distance"=>1, "loaded_at"=>...}
Usage
Basic Correction
require "spellkit"
# Load from URL (auto-downloads and caches)
SpellKit.load!(dictionary: "https://example.com/dict.tsv")
# Or from local file
SpellKit.load!(dictionary: "models/dictionary.tsv", edit_distance: 1)
# Check if a word is correct
SpellKit.correct?("hello")
# => true
# Get suggestions
SpellKit.suggestions("lyssis", 5)
# => [{"term"=>"lysis", "distance"=>1, "freq"=>2000}, ...]
# Correct a typo
SpellKit.correct("helllo")
# => "hello"
# Batch correction
tokens = %w[helllo wrld ruby]
SpellKit.correct_tokens(tokens)
# => ["hello", "world", "ruby"]
Term Protection
Protect specific terms from correction using exact matches or regex patterns:
# Load with exact-match protected terms
SpellKit.load!(
  dictionary: "models/dictionary.tsv",
  protected_path: "models/protected.txt" # file with terms to protect
)
# Protect terms matching regex patterns
SpellKit.load!(
  dictionary: "models/dictionary.tsv",
  protected_patterns: [
    /^[A-Z]{3,4}\d+$/,    # gene symbols like CDK10, BRCA1
    /^\d{2,7}-\d{2}-\d$/, # CAS numbers like 7732-18-5
    /^[A-Z]{2,3}-\d+$/    # SKU patterns like ABC-123
  ]
)
# Or combine both
SpellKit.load!(
  dictionary: "models/dictionary.tsv",
  protected_path: "models/protected.txt",
  protected_patterns: [/^[A-Z]{3,4}\d+$/]
)
# Protected terms are automatically respected
SpellKit.correct("CDK10")
# => "CDK10" # protected, never changed
# Batch correction with protection
tokens = %w[helllo wrld ABC-123 for CDK10]
SpellKit.correct_tokens(tokens)
# => ["hello", "world", "ABC-123", "for", "CDK10"]
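To make the two protection mechanisms concrete, here is a rough sketch in plain Ruby of how an exact-match list and a set of regex patterns might be combined into a single check. It is an illustration under assumptions, not SpellKit's internals: `protected_term?` is a hypothetical helper, and the case-insensitive exact match mirrors the behaviour documented for the protected terms file.

```ruby
require "set"

# Patterns taken from the examples above.
PATTERNS = [
  /^[A-Z]{3,4}\d+$/,    # gene symbols like CDK10, BRCA1
  /^\d{2,7}-\d{2}-\d$/, # CAS numbers like 7732-18-5
  /^[A-Z]{2,3}-\d+$/    # SKU patterns like ABC-123
].freeze

# Exact-match terms, stored lowercase so lookups are case-insensitive.
EXACT = Set.new(%w[MyBrand SpecialTerm].map(&:downcase)).freeze

# Hypothetical helper: true when a token should be left untouched.
def protected_term?(token)
  EXACT.include?(token.downcase) || PATTERNS.any? { |re| re.match?(token) }
end

protected_term?("CDK10")     # => true  (gene symbol pattern)
protected_term?("7732-18-5") # => true  (CAS number pattern)
protected_term?("mybrand")   # => true  (exact match, case-insensitive)
protected_term?("helllo")    # => false (eligible for correction)
```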
Multiple Instances
SpellKit supports multiple independent checker instances, useful for different domains or languages:
# Create separate instances for different domains
medical_checker = SpellKit::Checker.new
medical_checker.load!(
  dictionary: "models/medical_dictionary.tsv",
  protected_path: "models/medical_terms.txt"
)
legal_checker = SpellKit::Checker.new
legal_checker.load!(
  dictionary: "models/legal_dictionary.tsv",
  protected_path: "models/legal_terms.txt"
)
# Use them independently
medical_checker.suggestions("lyssis", 5)
legal_checker.suggestions("contractt", 5)
# Each maintains its own state
medical_checker.stats # Shows medical dictionary stats
legal_checker.stats # Shows legal dictionary stats
Configuration Block
Use the configure block pattern for Rails initializers:
SpellKit.configure do |config|
  config.dictionary = "models/dictionary.tsv"
  config.protected_path = "models/protected.txt"
  config.protected_patterns = [/^[A-Z]{3,4}\d+$/]
  config.edit_distance = 1
  config.frequency_threshold = 10.0
end
# This becomes the default instance
SpellKit.suggestions("word", 5) # Uses configured dictionary
Dictionary Format
Dictionary (required)
A whitespace-separated file with one term and its frequency per line. Both tab and space delimiters are supported. Tab-separated:
hello	10000
world	8000
lysis	2000
Or space-separated:
hello 10000
world 8000
lysis 2000
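When building your own file, the format can be sanity-checked with a few lines of Ruby. This is a sketch, assuming one term and one integer frequency per line, split on any run of whitespace. `parse_dictionary` is a hypothetical helper, not part of SpellKit's API, and skipping malformed lines is a choice made for this example rather than documented SpellKit behaviour.

```ruby
# Hypothetical helper: parse "term<whitespace>frequency" lines into a Hash.
# Lines that do not have exactly two fields, or whose frequency is not an
# integer, are skipped (an assumption made for this sketch).
def parse_dictionary(text)
  text.each_line.with_object({}) do |line, dict|
    fields = line.split
    next unless fields.size == 2

    term, raw_freq = fields
    freq = Integer(raw_freq, exception: false)
    dict[term] = freq if freq
  end
end

# Tabs and spaces can be mixed freely; blank lines are ignored.
sample = "hello\t10000\nworld 8000\n\nlysis   2000\n"
dict = parse_dictionary(sample)
dict.fetch("world") # => 8000
dict.size           # => 3
```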
Protected Terms (optional)
One term per line. Terms are matched case-insensitively:
protected.txt
# Product codes
ABC-123
XYZ-999
# Technical terms
CDK10
BRCA1
# Brand names
MyBrand
SpecialTerm
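As an illustration of the case-insensitive matching, the sketch below reads the text of such a file into a lowercase set. `parse_protected_terms` is a hypothetical helper, and treating lines that start with `#` as comments is an assumption based on the example file above.

```ruby
require "set"

# Hypothetical helper: turn the text of a protected-terms file into a
# lowercase Set. Blank lines and "#" comment lines are dropped.
def parse_protected_terms(text)
  lines = text.each_line.map(&:strip)
  terms = lines.reject { |line| line.empty? || line.start_with?("#") }
  Set.new(terms.map(&:downcase))
end

protected_terms = parse_protected_terms(<<~TXT)
  # Product codes
  ABC-123
  # Brand names
  MyBrand
TXT

protected_terms.include?("abc-123")          # => true
protected_terms.include?("MYBRAND".downcase) # => true
protected_terms.include?("# product codes")  # => false
```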
Dictionary Sources
SpellKit doesn't bundle dictionaries, but works with several sources:
Use the Default Dictionary (Recommended)
# English 80k word dictionary from SymSpell
SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
Public Dictionary URLs
- SymSpell English 80k: https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt
- SymSpell English 500k: https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-500k.txt
Build Your Own
See the "Building Dictionaries" section below for creating domain-specific dictionaries.
Caching
Dictionaries downloaded from URLs are cached in ~/.cache/spellkit/ for faster subsequent loads.
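If you want to check whether a dictionary is already cached, the general shape of a URL-to-file mapping looks something like the sketch below. The file naming is an assumption made purely for illustration: `cache_path_for` is a hypothetical helper, and hashing the URL is one common scheme, not necessarily the one SpellKit uses.

```ruby
require "digest"

# Hypothetical helper: map a dictionary URL to a file under the cache
# directory. Naming the file after a SHA-256 of the URL is an assumption;
# SpellKit's real naming scheme may differ.
def cache_path_for(url, cache_dir: File.join(Dir.home, ".cache", "spellkit"))
  File.join(cache_dir, "#{Digest::SHA256.hexdigest(url)}.tsv")
end

url = "https://example.com/dict.tsv"
path = cache_path_for(url, cache_dir: "/tmp/spellkit-cache")
File.exist?(path) # true only once a file has actually been stored there
```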
Configuration
SpellKit.load!(
  dictionary: "models/dictionary.tsv",      # required: path or URL
  protected_path: "models/protected.txt",   # optional
  protected_patterns: [/^[A-Z]{3,4}\d+$/],  # optional
  edit_distance: 1,                         # 1 (default) or 2
  frequency_threshold: 10.0,                # default: 10.0 (minimum frequency for corrections)
  # Skip pattern filters (all default to false)
  skip_urls: true,                          # Skip URLs (http://, https://, www.)
  skip_emails: true,                        # Skip email addresses
  skip_hostnames: true,                     # Skip hostnames (example.com)
  skip_code_patterns: true,                 # Skip code identifiers (camelCase, snake_case, etc.)
  skip_numbers: true                        # Skip numeric patterns (versions, IDs, measurements)
)
Frequency Threshold
The frequency_threshold parameter controls which corrections are accepted by correct and correct_tokens:
- For misspelled words (not in dictionary): Only suggest corrections with frequency ≥ frequency_threshold
- For dictionary words: Only suggest alternatives with frequency ≥ frequency_threshold × original_frequency
This prevents suggesting rare words as corrections for common typos.
Example:
# With default threshold (10.0), suggest any correction with freq ≥ 10
SpellKit.load!(dictionary: "dict.tsv")
SpellKit.correct("helllo") # => "hello" (if freq ≥ 10)
# With high threshold (1000.0), only suggest common corrections
SpellKit.load!(dictionary: "dict.tsv", frequency_threshold: 1000.0)
SpellKit.correct("helllo") # => "hello" (if freq ≥ 1000)
SpellKit.correct("rarword") # => "rarword" (no correction if freq < 1000)
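The two acceptance rules can also be written out as a small predicate. This is a sketch of the rule as described in this section, not SpellKit's actual code; `accept_correction?` and its parameter names are hypothetical.

```ruby
# Hypothetical helper mirroring the two rules above.
# original_freq is nil when the input word is not in the dictionary.
def accept_correction?(candidate_freq, threshold:, original_freq: nil)
  minimum = original_freq ? threshold * original_freq : threshold
  candidate_freq >= minimum
end

# Misspelled input: the candidate only needs to reach the threshold.
accept_correction?(10_000, threshold: 10.0) # => true
accept_correction?(5, threshold: 10.0)      # => false

# Input already in the dictionary with frequency 2000: an alternative
# must reach 10.0 * 2000 = 20000 before it replaces the original.
accept_correction?(10_000, threshold: 10.0, original_freq: 2000) # => false
accept_correction?(50_000, threshold: 10.0, original_freq: 2000) # => true
```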
Skip Patterns
SpellKit can skip certain patterns to avoid "correcting" technical terms, URLs, and other special content. Inspired by Aspell's filter modes, each filter is applied automatically once it is enabled.
Available skip patterns:
SpellKit.load!(
  dictionary: "dict.tsv",
  skip_urls: true,          # Skip URLs: https://example.com, www.example.com
  skip_emails: true,        # Skip emails: user@domain.com, admin+tag@example.com
  skip_hostnames: true,     # Skip hostnames: example.com, api.example.com
  skip_code_patterns: true, # Skip code: camelCase, snake_case, PascalCase, dotted.paths
  skip_numbers: true        # Skip numbers: 1.2.3, #123, 5kg, 100mb
)
# With skip patterns enabled, technical content is preserved
SpellKit.correct("https://example.com") # => "https://example.com"
SpellKit.correct("user@test.com") # => "user@test.com"
SpellKit.correct("getElementById") # => "getElementById"
SpellKit.correct("version-1.2.3") # => "version-1.2.3"
# Regular typos are still corrected
SpellKit.correct("helllo") # => "hello"
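For a feel of what such filters do, the sketch below classifies tokens with a few deliberately simple regular expressions in plain Ruby. The patterns and the `skip_reason` helper are hypothetical and only cover some of the cases listed above; SpellKit's built-in filters may be stricter or looser.

```ruby
# Illustrative patterns only, not SpellKit's built-in filters.
SKIP_PATTERNS = {
  urls:   %r{\A(?:https?://|www\.)\S+\z}i,
  emails: /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/,
  # camelCase (lowercase start, then capitalized parts) or snake_case
  code:   /\A(?:[a-z]+(?:[A-Z][a-z0-9]*)+|[a-z0-9]+(?:_[a-z0-9]+)+)\z/
}.freeze

# Hypothetical helper: name of the first filter that matches, or nil
# when the token should go through normal correction.
def skip_reason(token)
  SKIP_PATTERNS.each { |name, re| return name if re.match?(token) }
  nil
end

skip_reason("https://example.com") # => :urls
skip_reason("user@test.com")       # => :emails
skip_reason("getElementById")      # => :code
skip_reason("snake_case")          # => :code
skip_reason("helllo")              # => nil
```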
