🎯 Introduction

TextAssociations.jl is a Julia package for word association analysis and corpus-based research in linguistics, the social sciences and the digital humanities. It provides a unified framework for quantifying lexical relationships within texts and corpora using 51 association measures that span statistical, information-theoretic, epidemiological and lexical-gravity approaches, supporting transparent, data-driven analysis of how words co-occur and connect across discourse.

⚠️ Early Release Notice: This is an early, pre-registration release of TextAssociations.jl. The package is fully functional but still evolving; documentation, tutorials and examples are actively being expanded.

Even at this stage, it already offers functionality comparable to established corpus analysis tools:

AntConc (but more programmable)
SketchEngine (but open source)
WordSmith Tools (but with more metrics)

With added advantages of:

Being fully programmable and extensible
Integration with Julia's ecosystem
Support for custom metrics
Ability to process streaming data
Parallel computing capabilities

This makes TextAssociations.jl a powerful tool for computational linguistics, digital humanities and any field requiring sophisticated text analysis!

See the documentation for a detailed overview of all available features.

Why Word Association Metrics Still Matter

Even in the era of transformer models and word embeddings, association metrics remain valuable because they:

📊 Are interpretable: Provide transparent, statistical insights into word relationships
🔄 Complement neural models: Can be used alongside embeddings to improve model performance and to strengthen RAG pipelines
📏 Serve as benchmarks: Provide baselines for evaluating complex models
💾 Work with limited data: Perform well even with small corpora

✨ Core Features

📈 51 Association Metrics

Comprehensive suite including PMI, Log-likelihood, Dice, Jaccard, Lexical Gravity and many more specialized measures from corpus linguistics and information theory, plus several association metrics inspired by epidemiology.

📚 Corpus-Level Analysis

Process entire document collections with built-in support for:

Large-scale corpus processing
Temporal analysis (track changes over time)
Subcorpus comparison with statistical tests
Keyword extraction (TF-IDF, with other methods planned)

🚀 Performance Optimized

Lazy evaluation for memory efficiency
Parallel processing support
Streaming for massive corpora
Caching system for repeated analyses

🔧 Flexible and Extensible

Multiple input formats (text files, CSV, JSON, DataFrames)
Easy to add custom metrics
Comprehensive API for programmatic access

📦 Installation

You can install TextAssociations.jl directly from its GitHub repository using Julia’s package manager. In the Julia REPL, run the following (alternatively, press ] to enter Pkg mode and run add https://github.com/atantos/TextAssociations.jl):

using Pkg
Pkg.add(url="https://github.com/atantos/TextAssociations.jl")

🚀 Quick Start

Basic Usage

using TextAssociations

# Simple analysis with a single text
text = "The cat sat on the mat. The cat played with the ball."
ct = ContingencyTable(text, "cat", windowsize=3, minfreq=1)

# Calculate PMI scores
pmi_scores = assoc_score(PMI, ct)

# Multiple metrics at once
results = assoc_score([PMI, LogDice, LLR], ct)
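
To make the windowsize idea concrete, the sketch below counts co-occurrences around a node word in plain Julia. It is an illustration under simplifying assumptions, not the package's implementation: `window_cooccurrences` is a hypothetical helper, the tokenizer keeps only runs of ASCII letters, and the window ignores sentence boundaries, so the counts may differ from what `ContingencyTable` computes.

```julia
# Count how often each word appears within `windowsize` tokens of `node`.
# Hypothetical helper for illustration; not part of TextAssociations.jl.
function window_cooccurrences(text::AbstractString, node::AbstractString; windowsize::Int=3)
    # Crude tokenizer: lowercase the text and keep runs of ASCII letters.
    tokens = [String(m.match) for m in eachmatch(r"[a-z]+", lowercase(text))]
    counts = Dict{String,Int}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        lo = max(1, i - windowsize)
        hi = min(length(tokens), i + windowsize)
        for j in lo:hi
            j == i && continue  # skip the node occurrence itself
            counts[tokens[j]] = get(counts, tokens[j], 0) + 1
        end
    end
    return counts
end

text = "The cat sat on the mat. The cat played with the ball."
cooc = window_cooccurrences(text, "cat"; windowsize=3)
```

With a window of three tokens on each side, "ball" never falls inside the span of either occurrence of "cat", so it is absent from `cooc`. Widening the window would bring it in.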

Corpus Analysis

# Load a corpus from a directory
corpus = read_corpus("path/to/texts/", preprocess=true)

# Analyze word associations across the entire corpus
results = analyze_node(corpus, "innovation", PMI, windowsize=5, minfreq=10)

# Analyze multiple words with multiple metrics
nodes = ["technology", "innovation", "research"]
metrics = [PMI, LogDice, LLR, ChiSquare]
analysis = analyze_nodes(corpus, nodes, metrics, top_n=100)

📊 Supported Metrics

TextAssociations.jl supports 51 metrics organized by category:

Information-Theoretic Metrics

PMI (Pointwise Mutual Information): $\log \frac{P(x,y)}{P(x)P(y)}$
PMI², PMI³: Squared and cubed variants
PPMI: Positive PMI (negative values set to 0)
LLR: Log-likelihood ratio
LexicalGravity: Asymmetric association measure

Statistical Metrics

ChiSquare: Pearson's χ² test
Tscore, Zscore: Statistical significance tests
PhiCoef: Phi coefficient (φ)
CramersV: Cramér's V
YuleQ, YuleOmega: Yule's measures

Similarity Coefficients

Dice: $\frac{2a}{2a + b + c}$
LogDice: Logarithmic Dice, $14 + \log_2(\text{Dice})$; has a fixed maximum of 14 and does not depend on corpus size
JaccardIdx: Jaccard similarity
CosineSim: Cosine similarity
OverlapCoef: Overlap coefficient
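
A minimal sketch of three of these coefficients, written from the formulas listed in this README. The function names are hypothetical, and the reading of the cells (a = co-occurrences of the two words, b and c = occurrences of each word without the other) is an assumed convention.

```julia
# Similarity coefficients from cell counts a, b, c (assumed convention, see above).
dice(a, b, c) = 2a / (2a + b + c)
logdice(a, b, c) = 14 + log2(dice(a, b, c))
jaccard(a, b, c) = a / (a + b + c)

d = dice(8, 4, 4)
j = jaccard(8, 4, 4)
ld = logdice(8, 4, 4)

# Two words that only ever occur together reach the LogDice maximum.
ld_max = logdice(8, 0, 0)
```

Dice and Jaccard are monotonically related (J = D / (2 - D)), so they produce the same ranking of collocates. LogDice only rescales Dice to a more readable range.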

Epidemiological Metrics

RelRisk, LogRelRisk: Relative risk measures
OddsRatio, LogOddsRatio: Odds ratios
RiskDiff: Risk difference
AttrRisk: Attributable risk
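
The epidemiological measures treat the node word as an "exposure" and the collocate as an "outcome". The sketch below uses the relative risk and odds ratio formulas given in this README; the risk difference formula is the standard textbook definition and is an assumption about what RiskDiff computes. All function names are hypothetical.

```julia
# Cells (assumed convention): a = node with collocate, b = node without collocate,
# c = collocate without node, d = neither.
relrisk(a, b, c, d) = (a / (a + b)) / (c / (c + d))
oddsratio(a, b, c, d) = (a * d) / (b * c)
riskdiff(a, b, c, d) = a / (a + b) - c / (c + d)   # textbook definition, assumed

rr = relrisk(10, 20, 30, 40)
odds = oddsratio(10, 20, 30, 40)
rd = riskdiff(10, 20, 30, 40)
```

Values of relative risk or odds ratio below 1 (or a negative risk difference) indicate that the collocate is less likely near the node than elsewhere. A zero in cell b or c makes these ratios infinite or undefined, which is why log-transformed variants are often computed with a small continuity correction.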

Complete Metric List

<details> <summary>Click to see all 51 metrics with formulas</summary>

| Metric | Type | Formula |
| --- | --- | --- |
| PMI | PMI | $\log \frac{P(x,y)}{P(x)P(y)}$ |
| PMI² | PMI² | $(\log \frac{P(x,y)}{P(x)P(y)})^2$ |
| PMI³ | PMI³ | $(\log \frac{P(x,y)}{P(x)P(y)})^3$ |
| PPMI | PPMI | $\max(0, \log \frac{P(x,y)}{P(x)P(y)})$ |
| LLR | LLR | $2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$ |
| LLR² | LLR² | $\sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ |
| Dice | Dice | $\frac{2a}{2a + b + c}$ |
| LogDice | LogDice | $14 + \log_2(\frac{2a}{2a + b + c})$ |
| Jaccard | JaccardIdx | $\frac{a}{a + b + c}$ |
| Cosine | CosineSim | $\frac{a}{\sqrt{(a + b)(a + c)}}$ |
| Overlap | OverlapCoef | $\frac{a}{\min(a + b, a + c)}$ |
| Relative Risk | RelRisk | $\frac{a/(a+b)}{c/(c+d)}$ |
| Odds Ratio | OddsRatio | $\frac{ad}{bc}$ |
| Chi-square | ChiSquare | $\sum_{i,j}\frac{(f_{ij}-\hat{f}_{ij})^2}{\hat{f}_{ij}}$ |
| Phi | PhiCoef | $\frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$ |
| Cramér's V | CramersV | $\sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$ |
| ...and 35+ more | | |

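The formulas in this table use the cell counts $a$, $b$, $c$, $d$ without defining them. A common convention in collocation studies, assumed here rather than taken from the package documentation, is the 2×2 contingency table of a node word and a candidate collocate:

$$
\begin{array}{l|cc|c}
 & \text{collocate} & \text{other words} & \text{total} \\
\hline
\text{node} & a & b & a + b \\
\text{other words} & c & d & c + d \\
\hline
\text{total} & a + c & b + d & N
\end{array}
$$

Under that reading, $P(x,y) = a/N$, $P(x) = (a+b)/N$ and $P(y) = (a+c)/N$, so PMI can be rewritten in cell counts. Each expected frequency is the product of its row total $R_i$ and column total $C_j$ divided by $N$:

$$
\log \frac{P(x,y)}{P(x)P(y)} = \log \frac{aN}{(a+b)(a+c)}, \qquad E_{ij} = \frac{R_i C_j}{N}
$$
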
</details>

🎯 Advanced Features

Temporal Analysis

Track how word associations change over time:

temporal_analysis = analyze_temporal(
    corpus, ["pandemic", "vaccine"], :year, PMI, time_bins=5
)

Subcorpus Comparison

Compare associations across document groups with statistical tests:

comparison = compare_subcorpora(
    corpus, :category, "innovation", PMI
)
# Access statistical tests and effect sizes
tests = comparison.statistical_tests

Collocation Networks

Build and export word association networks with frequency and centrality metadata:

network = colloc_graph(
    corpus, ["climate", "change"];  # seed terms
    metric=PMI,
    depth=2,
    min_score=2.5,
    direction=:undirected,
    include_frequency=true,
    weight_normalization=:minmax,
    compute_centrality=true,
    centrality_metrics=[:pagerank, :betweenness]
)

first(network.edges, 5)          # includes Frequency / DocFrequency / NormalizedWeight
first(network.node_metrics, 5)    # includes degrees, strengths & centrality scores

gephi_graph(network, "nodes.csv", "edges.csv")

Keyword Extraction

keywords = keyterms(corpus, method=:tfidf, num_keywords=50)
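
For readers unfamiliar with TF-IDF, the sketch below shows one common variant in plain Julia: relative term frequency multiplied by log(N / document frequency). The function `tfidf_scores` and the toy documents are hypothetical, and the weighting used by `keyterms` may differ.

```julia
# One common TF-IDF variant: (count / doc length) * log(n_docs / doc frequency).
# Hypothetical helper for illustration; not the package's implementation.
function tfidf_scores(docs::Vector{Vector{String}})
    n = length(docs)
    # Document frequency: in how many documents does each term occur?
    df = Dict{String,Int}()
    for doc in docs, term in unique(doc)
        df[term] = get(df, term, 0) + 1
    end
    scores = Vector{Dict{String,Float64}}()
    for doc in docs
        tf = Dict{String,Int}()
        for term in doc
            tf[term] = get(tf, term, 0) + 1
        end
        push!(scores, Dict{String,Float64}(
            t => (c / length(doc)) * log(n / df[t]) for (t, c) in tf))
    end
    return scores
end

docs = [
    String.(split("the cat sat on the mat")),
    String.(split("the dog chased the cat")),
    String.(split("the bird sang")),
]
scores = tfidf_scores(docs)
```

A term that occurs in every document ("the") gets a score of zero, while a term confined to one document ("mat") outranks a term shared by two ("cat").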

Concordance (KWIC)

concordance = kwic(corpus, "innovation", context_size=50)
for line in concordance.lines
    println("...$(line.LeftContext) [$(line.Node)] $(line.RightContext)...")
end
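
The idea behind a keyword-in-context view can be sketched in a few lines of plain Julia. This is an illustration, not the package's `kwic`: `simple_kwic` is a hypothetical helper, and it measures context in tokens, whereas `context_size` above may be measured differently.

```julia
# Collect each occurrence of `node` with up to `context` tokens on either side.
# Hypothetical helper for illustration only.
function simple_kwic(tokens::Vector{String}, node::String; context::Int=3)
    lines = Vector{NamedTuple{(:left, :node, :right),NTuple{3,String}}}()
    for (i, tok) in enumerate(tokens)
        tok == node || continue
        left = join(tokens[max(1, i - context):i-1], " ")
        right = join(tokens[i+1:min(length(tokens), i + context)], " ")
        push!(lines, (left=left, node=tok, right=right))
    end
    return lines
end

tokens = String.(split("the cat sat on the mat the cat played with the ball"))
hits = simple_kwic(tokens, "cat"; context=2)
for h in hits
    println(h.left, " [", h.node, "] ", h.right)
end
```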

⚡ Performance Features

Parallel Processing

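# Use multiple cores
using Distributed
addprocs(4)
# Load the package on the new workers too
# (assumes parallel=true dispatches work to Distributed workers)
@everywhere using TextAssociations

analysis = analyze_nodes(
    corpus, nodes, metrics, parallel=true
)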

Streaming for Large Corpora

# Process files without loading everything into memory
results = stream_corpus_analysis(
    "texts/*.txt", "word", PMI, chunk_size=1000
)
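
The memory benefit of streaming comes from reading input incrementally and keeping only running counts. The sketch below illustrates that pattern in plain Julia with `eachline`; `stream_token_counts` is a hypothetical helper with an ASCII-only tokenizer, not the mechanism behind `stream_corpus_analysis`.

```julia
# Accumulate token counts one line at a time, so the full text is never held in memory.
# Hypothetical helper for illustration only.
function stream_token_counts(io::IO)
    counts = Dict{String,Int}()
    for line in eachline(io)
        for m in eachmatch(r"[a-z]+", lowercase(line))
            w = String(m.match)
            counts[w] = get(counts, w, 0) + 1
        end
    end
    return counts
end

# An in-memory buffer stands in for a large file; open("file.txt") works the same way.
io = IOBuffer("The cat sat on the mat.\nThe cat played with the ball.\n")
counts = stream_token_counts(io)
```

Only the `counts` dictionary grows with the data, and its size depends on the vocabulary rather than on the length of the corpus.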

Batch Processing

# Process hundreds of node words efficiently
batch_process_corpus(
    corpus, "nodelist.txt", "output/",
    batch_size=100
)

🔬 Use Cases

TextAssociations.jl is ideal for:

Corpus Linguistics: Collocation analysis, lexical patterns, semantic prosody
Digital Humanities: Literary analysis, historical text mining, stylometry
NLP Research: Feature extraction, baseline models, evaluation metrics
Social Media Analysis: Trend detection, sentiment associations, hashtag networks
Information Retrieval: Query expansion, document similarity, term weighting

📖 Documentation

[Getting Started Guide](https://atantos.git
