TextAssociations.jl
🎯 Introduction
TextAssociations.jl is a Julia package for word association analysis and corpus-based research in linguistics, social sciences and the digital humanities. It provides a unified framework to quantify lexical relationships within texts and corpora using 51 association measures—spanning statistical, information-theoretic, epidemiological and lexical-gravity approaches—for transparent, data-driven analysis of how words co-occur and connect across discourse.
⚠️ Early Release Notice
This is an early, pre-registration release of TextAssociations.jl.
The package is fully functional but still evolving — documentation, tutorials, and examples are actively being expanded.
Even at this stage, it already offers functionality comparable to established corpus analysis tools:
- AntConc (but more programmable)
- SketchEngine (but open source)
- WordSmith Tools (but with more metrics)
It also adds the following advantages:
- Being fully programmable and extensible
- Integration with Julia's ecosystem
- Support for custom metrics
- Ability to process streaming data
- Parallel computing capabilities
This makes TextAssociations.jl a powerful tool for computational linguistics, digital humanities and any field requiring sophisticated text analysis!
Check out our documentation for a detailed overview of all available features and functionalities.
Why Word Association Metrics Still Matter
Even in the era of transformer models and word embeddings, association metrics remain valuable because they:
- 📊 Are interpretable: Provide transparent, statistical insights into word relationships
- 🔄 Complement neural models: Can be used alongside embeddings to improve performance, including in RAG pipelines
- 📏 Serve as benchmarks: Provide baselines for evaluating complex models
- 💾 Work with limited data: Perform well even with small corpora
✨ Core Features
📈 51 Association Metrics
Comprehensive suite including PMI, Log-likelihood, Dice, Jaccard, Lexical Gravity and many more specialized measures from corpus linguistics and information theory, plus several association metrics inspired by epidemiology.
📚 Corpus-Level Analysis
Process entire document collections with built-in support for:
- Large-scale corpus processing
- Temporal analysis (track changes over time)
- Subcorpus comparison with statistical tests
- Keyword extraction (TF-IDF; other methods coming soon)
🚀 Performance Optimized
- Lazy evaluation for memory efficiency
- Parallel processing support
- Streaming for massive corpora
- Caching system for repeated analyses
🔧 Flexible and Extensible
- Multiple input formats (text files, CSV, JSON, DataFrames)
- Easy to add custom metrics
- Comprehensive API for programmatic access
📦 Installation
You can install TextAssociations.jl directly from its GitHub repository using Julia's package manager. In the Julia REPL, run the two lines below. Alternatively, press ] to enter Pkg mode and run `add https://github.com/atantos/TextAssociations.jl`.
using Pkg
Pkg.add(url="https://github.com/atantos/TextAssociations.jl")
🚀 Quick Start
Basic Usage
using TextAssociations
# Simple analysis with a single text
text = "The cat sat on the mat. The cat played with the ball."
ct = ContingencyTable(text, "cat", windowsize=3, minfreq=1)
# Calculate PMI scores
pmi_scores = assoc_score(PMI, ct)
# Multiple metrics at once
results = assoc_score([PMI, LogDice, LLR], ct)
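To make the role of `windowsize` concrete, here is a minimal stand-alone sketch of windowed co-occurrence counting in plain Julia. The helper `window_cooccurrence` and the naive tokenization are hypothetical and only for illustration; the package's own tokenization and counting may differ.

```julia
# Hypothetical helper, not part of TextAssociations.jl: count how often
# `collocate` appears within `windowsize` tokens on either side of `node`.
function window_cooccurrence(tokens::Vector{String}, node::String,
                             collocate::String; windowsize::Int=3)
    hits = 0
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue                 # skip the node itself
            tokens[j] == collocate && (hits += 1)
        end
    end
    return hits
end

text = "The cat sat on the mat. The cat played with the ball."
# Naive tokenization (assumption): lowercase, strip punctuation, split on whitespace.
tokens = String.(split(replace(lowercase(text), r"[^\w\s]" => "")))

window_cooccurrence(tokens, "cat", "the"; windowsize=3)  # returns 5
```

Counts like this one fill the cells of the contingency table from which every association score is computed.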
Corpus Analysis
# Load a corpus from a directory
corpus = read_corpus("path/to/texts/", preprocess=true)
# Analyze word associations across the entire corpus
results = analyze_node(corpus, "innovation", PMI, windowsize=5, minfreq=10)
# Analyze multiple words with multiple metrics
nodes = ["technology", "innovation", "research"]
metrics = [PMI, LogDice, LLR, ChiSquare]
analysis = analyze_nodes(corpus, nodes, metrics, top_n=100)
📊 Supported Metrics
TextAssociations.jl supports 51 metrics organized by category:
Information-Theoretic Metrics
- PMI (Pointwise Mutual Information): $\log \frac{P(x,y)}{P(x)P(y)}$
- PMI², PMI³: Squared and cubed variants
- PPMI: Positive PMI (negative values set to 0)
- LLR: Log-likelihood ratio
- LexicalGravity: Asymmetric association measure
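As a rough illustration of the PMI and PPMI definitions above, the following sketch computes both from raw counts. The function names, the argument layout and the choice of base-2 logarithm are assumptions made for this example; they are not taken from the package's source.

```julia
# Illustrative only, not the package's implementation.
# f_xy = co-occurrence count, f_x and f_y = marginal counts, n = total observations.
pmi(f_xy, f_x, f_y, n) = log2((f_xy / n) / ((f_x / n) * (f_y / n)))

# PPMI clamps negative PMI values to zero.
ppmi(f_xy, f_x, f_y, n) = max(0.0, pmi(f_xy, f_x, f_y, n))

pmi(8, 16, 32, 1024)    # ratio is 16, so PMI = log2(16) = 4.0
ppmi(1, 64, 64, 1024)   # PMI is -2.0, so PPMI = 0.0
```

A positive PMI means the pair co-occurs more often than independence would predict; a negative value means less often.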
Statistical Metrics
- ChiSquare: Pearson's χ² test
- Tscore, Zscore: Statistical significance tests
- PhiCoef: Phi coefficient (φ)
- CramersV: Cramér's V
- YuleQ, YuleOmega: Yule's measures
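The sketch below shows how Pearson's χ² and the phi coefficient relate on a 2×2 table. It assumes the common cell layout a = node with collocate, b = node without collocate, c = collocate without node, d = neither; the function names are hypothetical and the package may arrange its table differently.

```julia
# Illustrative sketch under the assumed 2×2 layout [a b; c d].
function chisq_2x2(a, b, c, d)
    n = a + b + c + d
    observed = [a b; c d]
    rows = sum(observed, dims=2)       # row totals (2×1)
    cols = sum(observed, dims=1)       # column totals (1×2)
    expected = (rows * cols) ./ n      # expected counts under independence
    return sum((observed .- expected) .^ 2 ./ expected)
end

phi_coef(a, b, c, d) =
    (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chisq_2x2(10, 20, 30, 40)
phi  = phi_coef(10, 20, 30, 40)
isapprox(chi2, 100 * phi^2)   # true: for a 2×2 table, χ² = n·φ²
```

For a 2×2 table Cramér's V reduces to the absolute value of φ, so the three measures carry the same information at different scales.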
Similarity Coefficients
- Dice: $\frac{2a}{2a + b + c}$
- LogDice: Logarithmic Dice (more stable)
- JaccardIdx: Jaccard similarity
- CosineSim: Cosine similarity
- OverlapCoef: Overlap coefficient
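The coefficients above can be written directly from contingency cells. This sketch assumes a is the joint count, b the node-only count and c the collocate-only count; the function names are made up for the example and are not the package's API.

```julia
# Illustrative definitions, assuming cells a (joint), b (node only), c (collocate only).
dice(a, b, c)    = 2a / (2a + b + c)
logdice(a, b, c) = 14 + log2(dice(a, b, c))
jaccard(a, b, c) = a / (a + b + c)

d = dice(8, 4, 4)            # 16 / 24
j = jaccard(8, 4, 4)         # 8 / 16 = 0.5
isapprox(j, d / (2 - d))     # true: Jaccard and Dice are monotonically related
logdice(5, 0, 0)             # 14.0, the maximum, reached when b = c = 0
```

Because Dice is at most 1, LogDice has a fixed ceiling of 14, which makes scores easier to compare across corpora of different sizes.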
Epidemiological Metrics
- RelRisk, LogRelRisk: Relative risk measures
- OddsRatio, LogOddsRatio: Odds ratios
- RiskDiff: Risk difference
- AttrRisk: Attributable risk
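As a small worked example, the epidemiological measures can be computed from the same 2×2 cells, treating "node present" as the exposed group. The cell layout (a, b in the node row; c, d in the non-node row), the function names and the textbook risk-difference formula are assumptions for illustration; the package's definitions may differ in detail.

```julia
# Illustrative only: a, b = collocate present/absent with the node;
#                    c, d = collocate present/absent without the node.
rel_risk(a, b, c, d)   = (a / (a + b)) / (c / (c + d))
odds_ratio(a, b, c, d) = (a * d) / (b * c)
risk_diff(a, b, c, d)  = a / (a + b) - c / (c + d)

rel_risk(20, 80, 10, 90)     # ≈ 2.0: the collocate is twice as likely near the node
odds_ratio(20, 80, 10, 90)   # 2.25
risk_diff(20, 80, 10, 90)    # ≈ 0.1
```

Ratios of this kind are undefined or infinite when a cell is zero, which is one reason log-transformed and smoothed variants are commonly used.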
Complete Metric List
<details>
<summary>Click to see all 51 metrics with formulas</summary>

| Metric | Type | Formula |
| ----------------- | ------------- | ---------------------------------------------------- |
| PMI | PMI | $\log \frac{P(x,y)}{P(x)P(y)}$ |
| PMI² | PMI² | $(\log \frac{P(x,y)}{P(x)P(y)})^2$ |
| PMI³ | PMI³ | $(\log \frac{P(x,y)}{P(x)P(y)})^3$ |
| PPMI | PPMI | $\max(0, \log \frac{P(x,y)}{P(x)P(y)})$ |
| LLR | LLR | $2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$ |
| LLR² | LLR² | $\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ |
| Dice | Dice | $\frac{2a}{2a + b + c}$ |
| LogDice | LogDice | $14 + \log_2(\frac{2a}{2a + b + c})$ |
| Jaccard | JaccardIdx | $\frac{a}{a + b + c}$ |
| Cosine | CosineSim | $\frac{a}{\sqrt{(a + b)(a + c)}}$ |
| Overlap | OverlapCoef | $\frac{a}{\min(a + b, a + c)}$ |
| Relative Risk | RelRisk | $\frac{a/(a+b)}{c/(c+d)}$ |
| Odds Ratio | OddsRatio | $\frac{ad}{bc}$ |
| Chi-square | ChiSquare | $\sum_{i,j}\frac{(f_{ij}-\hat{f}_{ij})^2}{\hat{f}_{ij}}$ |
| Phi | PhiCoef | $\frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$ |
| Cramér's V | CramersV | $\sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$ |
| ...and 35+ more | | |

</details>
🎯 Advanced Features
Temporal Analysis
Track how word associations change over time:
temporal_analysis = analyze_temporal(
    corpus, ["pandemic", "vaccine"], :year, PMI, time_bins=5
)
Subcorpus Comparison
Compare associations across document groups with statistical tests:
comparison = compare_subcorpora(
    corpus, :category, "innovation", PMI
)
# Access statistical tests and effect sizes
tests = comparison.statistical_tests
Collocation Networks
Build and export word association networks with richer metadata:
network = colloc_graph(
    corpus, ["climate", "change"];  # seed terms
    metric=PMI,
    depth=2,
    min_score=2.5,
    direction=:undirected,
    include_frequency=true,
    weight_normalization=:minmax,
    compute_centrality=true,
    centrality_metrics=[:pagerank, :betweenness]
)
first(network.edges, 5) # includes Frequency / DocFrequency / NormalizedWeight
first(network.node_metrics, 5) # includes degrees, strengths & centrality scores
gephi_graph(network, "nodes.csv", "edges.csv")
Keyword Extraction
keywords = keyterms(corpus, method=:tfidf, num_keywords=50)
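For readers unfamiliar with TF-IDF, the following is a minimal stand-alone sketch of the idea. The function `tfidf`, the length-normalized term frequency and the unsmoothed natural-log IDF are assumptions chosen for this example; the weighting used by `keyterms` may differ.

```julia
# Hypothetical helper: tf-idf(t, d) = (count of t in d / length of d) * log(N / df(t)).
function tfidf(docs::Vector{Vector{String}})
    n_docs = length(docs)
    # Document frequency: number of documents containing each term.
    df = Dict{String, Int}()
    for doc in docs, term in unique(doc)
        df[term] = get(df, term, 0) + 1
    end
    scores = Vector{Dict{String, Float64}}()
    for doc in docs
        tf = Dict{String, Int}()
        for term in doc
            tf[term] = get(tf, term, 0) + 1
        end
        push!(scores, Dict{String, Float64}(
            t => (c / length(doc)) * log(n_docs / df[t]) for (t, c) in tf))
    end
    return scores
end

docs = [["cat", "sat"], ["cat", "ran"]]
scores = tfidf(docs)
scores[1]["cat"]   # 0.0: "cat" occurs in every document, so it is not distinctive
scores[1]["sat"]   # 0.5 * log(2): "sat" occurs in only one document
```

Terms that appear in every document receive a score of zero, which is what makes the measure useful for picking out document-specific keywords.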
Concordance (KWIC)
concordance = kwic(corpus, "innovation", context_size=50)
for line in concordance.lines
    println("...$(line.LeftContext) [$(line.Node)] $(line.RightContext)...")
end
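The idea behind a KWIC display can be sketched in a few lines of plain Julia. The helper `simple_kwic` below is hypothetical and measures context in tokens, which is an assumption; the package's `kwic` function and its `context_size` parameter may work differently.

```julia
# Hypothetical stand-alone helper, not the package's `kwic` implementation.
# Returns one (left, node, right) record per occurrence of `node`.
function simple_kwic(tokens::Vector{String}, node::String; context::Int=3)
    lines = NamedTuple{(:left, :node, :right), NTuple{3, String}}[]
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        left  = join(tokens[max(1, i - context):i-1], " ")
        right = join(tokens[i+1:min(length(tokens), i + context)], " ")
        push!(lines, (left=left, node=tok, right=right))
    end
    return lines
end

tokens = ["the", "cat", "sat", "on", "the", "mat"]
hits = simple_kwic(tokens, "cat"; context=2)
hits[1].left    # "the"
hits[1].right   # "sat on"
```

Occurrences near the start or end of the text simply get a shorter context, since the index ranges are clipped to the token vector.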
⚡ Performance Features
Parallel Processing
# Use multiple cores
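using Distributed
addprocs(4)
# Load the package on every worker process so each can run the analysis
@everywhere using TextAssociations
analysis = analyze_nodes(
    corpus, nodes, metrics, parallel=true
)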
Streaming for Large Corpora
# Process files without loading everything into memory
results = stream_corpus_analysis(
    "texts/*.txt", "word", PMI, chunk_size=1000
)
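The general principle behind streaming can be illustrated without the package: read one line at a time and keep only running counts in memory. The helper `stream_counts` below is hypothetical and uses whitespace tokenization as a simplifying assumption; it is not how `stream_corpus_analysis` is implemented.

```julia
# Hypothetical helper: accumulate token counts line by line, so only the
# current line and the running counts are held in memory at any time.
function stream_counts(io::IO)
    counts = Dict{String, Int}()
    for line in eachline(io)
        for tok in split(lowercase(line))
            counts[tok] = get(counts, tok, 0) + 1
        end
    end
    return counts
end

# An in-memory buffer stands in for a file here; for real data,
# use `open(stream_counts, "path/to/file.txt")` instead.
io = IOBuffer("the cat sat\nthe cat ran\n")
counts = stream_counts(io)
counts["the"]   # 2
counts["sat"]   # 1
```

Memory use then grows with the vocabulary size rather than with the size of the corpus.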
Batch Processing
# Process hundreds of node words efficiently
batch_process_corpus(
    corpus, "nodelist.txt", "output/",
    batch_size=100
)
🔬 Use Cases
TextAssociations.jl is ideal for:
- Corpus Linguistics: Collocation analysis, lexical patterns, semantic prosody
- Digital Humanities: Literary analysis, historical text mining, stylometry
- NLP Research: Feature extraction, baseline models, evaluation metrics
- Social Media Analysis: Trend detection, sentiment associations, hashtag networks
- Information Retrieval: Query expansion, document similarity, term weighting
📖 Documentation
- [Getting Started Guide](https://atantos.git
