# Stickler: Structured Object Evaluation for GenAI
When in the course of human events, it becomes necessary to evaluate structured outputs from generative AI systems, we must acknowledge that traditional evaluation treats all fields equally. But not all fields are created equal.
Stickler is a Python library for complex, structured JSON comparison and evaluation. It lets you focus on the fields your customers actually care about, answering the question: "Is it doing a good job?"
Stickler uses specialized comparators for different data types: exact matching for critical identifiers, numeric tolerance for currency amounts, semantic similarity for text fields, and fuzzy matching for names and addresses. You can build custom comparators for domain-specific logic. The Hungarian algorithm ensures optimal list matching regardless of order, while the recursive evaluation engine handles unlimited nesting depth. Business-weighted scoring reflects actual operational impact, not just technical accuracy.
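To make the order-independent list matching concrete, here is a small, library-free sketch of what optimal assignment does. This is a brute-force stand-in for the Hungarian algorithm (which finds the same assignment in polynomial time), with `difflib` similarity standing in for Stickler's comparators:

```python
from difflib import SequenceMatcher
from itertools import permutations

def best_matching(truth, pred):
    """Pair each ground-truth item with the prediction that maximizes
    total similarity, regardless of list order."""
    # Pairwise similarity matrix between ground-truth and predicted items
    sim = [[SequenceMatcher(None, t, p).ratio() for p in pred] for t in truth]
    # Try every assignment and keep the one with the highest total score
    best = max(permutations(range(len(pred))),
               key=lambda perm: sum(sim[i][perm[i]] for i in range(len(truth))))
    return [(t, pred[j]) for t, j in zip(truth, best)]

# Reordered lists with a name variation still match up correctly:
print(best_matching(["Wireless Mouse", "USB Cable"],
                    ["USB Cord", "Wireless Mouse"]))
# [('Wireless Mouse', 'Wireless Mouse'), ('USB Cable', 'USB Cord')]
```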
Consider an invoice extraction agent that perfectly captures shipment numbers—which must be exact or packages get routed to the wrong warehouse—but sometimes garbles driver notes like "delivered to front door" vs "left at entrance." Those note variations don't affect logistics operations at all. Traditional evaluation treats both error types identically and reports your agent as "95% accurate" without telling you if that 5% error rate matters. Stickler tells you exactly where the errors are and whether they're actually problems.
Whether you're extracting data from documents, performing ETL transformations, evaluating ML model outputs, or simply trying to diff complex JSON structures, Stickler transforms evaluation from a technical afterthought into a business-aligned decision tool.
## Installation

```bash
pip install stickler-eval
```
## Get Started in 30 Seconds

```python
# pip install stickler-eval
from typing import List

from stickler import StructuredModel, ComparableField
from stickler.comparators import ExactComparator, NumericComparator, LevenshteinComparator

# Define your models
class LineItem(StructuredModel):
    product: str = ComparableField(comparator=LevenshteinComparator(), weight=1.0)
    quantity: int = ComparableField(weight=0.8)
    price: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=1.2)

class Invoice(StructuredModel):
    shipment_id: str = ComparableField(comparator=ExactComparator(), weight=3.0)  # Critical
    amount: float = ComparableField(comparator=NumericComparator(tolerance=0.01), weight=2.0)
    line_items: List[LineItem] = ComparableField(weight=2.0)  # Hungarian matching!

# JSON from your systems (agent output, ground truth, etc.)
ground_truth_json = {
    "shipment_id": "SHP-2024-001",
    "amount": 1247.50,
    "line_items": [
        {"product": "Wireless Mouse", "quantity": 2, "price": 29.99},
        {"product": "USB Cable", "quantity": 5, "price": 12.99}
    ]
}

prediction_json = {
    "shipment_id": "SHP-2024-001",  # Perfect match
    "amount": 1247.48,              # Within tolerance
    "line_items": [
        {"product": "USB Cord", "quantity": 5, "price": 12.99},       # Name variation
        {"product": "Wireless Mouse", "quantity": 2, "price": 29.99}  # Reordered
    ]
}

# Construct from JSON and compare
ground_truth = Invoice(**ground_truth_json)
prediction = Invoice(**prediction_json)
result = ground_truth.compare_with(prediction)

print(f"Overall Score: {result['overall_score']:.3f}")              # 0.693
print(f"Shipment ID: {result['field_scores']['shipment_id']:.3f}")  # 1.000 - exact match
print(f"Line Items: {result['field_scores']['line_items']:.3f}")    # 0.926 - Hungarian optimal matching
```
## Requirements
- Python 3.12+
- conda (recommended)
### Quick Install

```bash
# Create conda environment
conda create -n stickler python=3.12 -y
conda activate stickler

# Install the library
pip install -e .
```
### Development Install

```bash
# Install with testing dependencies
pip install -e ".[dev]"
```
### Quick Test

Run the example to verify the installation:

```bash
python examples/scripts/quick_start.py
```

Run the tests:

```bash
pytest tests/
```
## Basic Usage

### Static Model Definition

```python
from stickler import StructuredModel, ComparableField
from stickler.comparators.levenshtein import LevenshteinComparator

# Define your data structure
class Invoice(StructuredModel):
    invoice_number: str = ComparableField(
        comparator=LevenshteinComparator(),
        threshold=0.9
    )
    total: float = ComparableField(threshold=0.95)

# Compare objects (ground_truth and prediction are Invoice instances)
result = ground_truth.compare_with(prediction, evaluator_format=True)
print(f"Overall Score: {result['overall']['anls_score']:.3f}")
```
### Confidence Evaluation

Evaluate prediction confidence calibration with AUROC metrics:

```python
# Prediction with confidence scores
prediction = Invoice.from_json({
    "invoice_number": {"value": "INV-2024-001", "confidence": 0.95},
    "total": {"value": 1247.50, "confidence": 0.8}
})

# Enable confidence metrics
result = ground_truth.compare_with(
    prediction,
    add_confidence_metrics=True,
    document_field_comparisons=True
)

print(f"Overall Score: {result['overall_score']:.3f}")
print(f"Confidence AUROC: {result['auroc_confidence_metric']:.3f}")
```
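As a rough intuition for what the AUROC metric measures, the sketch below is a generic ranking-based AUROC over per-field correctness, not Stickler's internal computation:

```python
def auroc(confidences, correct):
    """Probability that a randomly chosen correct field received a higher
    confidence than a randomly chosen incorrect one (ties count half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly calibrated: every correct field outranks every incorrect one.
print(auroc([0.95, 0.80, 0.60, 0.30], [True, True, False, False]))  # 1.0
```

A score near 0.5 means the confidences carry no information about which fields are actually right.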
### Dynamic Model Creation (New!)

Create models from JSON configuration for maximum flexibility:

```python
from stickler.structured_object_evaluator.models.structured_model import StructuredModel

# Define model configuration
config = {
    "model_name": "Product",
    "match_threshold": 0.8,
    "fields": {
        "name": {
            "type": "str",
            "comparator": "LevenshteinComparator",
            "threshold": 0.8,
            "weight": 2.0
        },
        "price": {
            "type": "float",
            "comparator": "NumericComparator",
            "default": 0.0
        }
    }
}

# Create the dynamic model class
Product = StructuredModel.model_from_json(config)

# Use it like any Pydantic model
product1 = Product(name="Widget", price=29.99)
product2 = Product(name="Gadget", price=29.99)

# Full comparison capabilities
result = product1.compare_with(product2)
print(f"Similarity: {result['overall_score']:.2f}")
```
### Complete JSON-to-Evaluation Workflow (New!)

For maximum flexibility, load both the configuration AND the data from JSON:

```python
import json

from stickler.structured_object_evaluator.models.structured_model import StructuredModel

# Load model config from JSON
with open('model_config.json') as f:
    config = json.load(f)

# Load test data from JSON
with open('test_data.json') as f:
    data = json.load(f)

# Create the model and its instances from JSON
Model = StructuredModel.model_from_json(config)
ground_truth = Model(**data['ground_truth'])
prediction = Model(**data['prediction'])

# Evaluate - no Python object construction needed!
result = ground_truth.compare_with(prediction)
```
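For instance, a `test_data.json` compatible with the snippet above might look like this (field names taken from the `Product` config in the previous section; values illustrative):

```json
{
  "ground_truth": {"name": "Widget", "price": 29.99},
  "prediction": {"name": "Widgit", "price": 29.99}
}
```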
**Benefits of the JSON-driven approach:**

- Zero Python object construction required
- Configuration-driven model creation
- A/B testing of different field configurations
- Runtime model generation from external schemas
- Production-ready, JSON-based evaluation pipelines
- Full Pydantic compatibility with comparison capabilities

See `examples/scripts/json_to_evaluation_demo.py` for a complete working example and `docs/StructuredModel_Dynamic_Creation.md` for comprehensive documentation.
## JSON Schema Extensions: `x-aws-stickler-*` Complete Reference

Stickler supports standard JSON Schema (Draft 7+) with custom `x-aws-stickler-*` extensions for controlling comparison behavior. These extensions let you configure exactly how each field is evaluated without writing Python code.
### Why Use JSON Schema Extensions?
- Configuration-driven evaluation: Define models and comparison logic in JSON
- No Python code required: Perfect for non-Python systems or runtime configuration
- Version control friendly: Track evaluation logic changes alongside your schemas
- A/B testing: Easily test different comparison strategies
- Integration ready: Works with existing JSON Schema tooling and validators
### Field-Level Extensions

Add these to any property in your JSON Schema to control comparison behavior:
#### `x-aws-stickler-comparator`

- **Type:** string
- **Required:** No
- **Default:** Type-dependent (see table below)

Specifies the comparison algorithm for this field.

Available Comparators:
| Comparator Name | Best For | How It Works |
|-----------------|----------|--------------|
| "LevenshteinComparator" | Names, addresses, text with typos | Calculates edit distance between strings. Score = 1 - (edits / max_length) |
| "ExactComparator" | IDs, codes, booleans, exact matches | Returns 1.0 for exact match, 0.0 otherwise |
| "NumericComparator" | Prices, quantities, measurements | Compares numbers with configurable tolerance |
| "FuzzyComparator" | Flexible text, descriptions | Token-based fuzzy matching (order-independent) |
| "SemanticComparator" | Semantic similarity | Embedding-based comparison for meaning |
| "BertComparator" | Deep semantic understanding | BERT model for contextual similarity |
| "LLMComparator" | Complex semantic evaluation | LLM-powered comparison with reasoning |
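The Levenshtein score formula in the table can be reproduced with a few lines of standard edit-distance dynamic programming. This is a generic sketch of the technique, not Stickler's implementation:

```python
def levenshtein_score(a: str, b: str) -> float:
    """score = 1 - (edits / max_length), as described in the table above."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return 1 - d[m][n] / max(m, n, 1)

print(round(levenshtein_score("USB Cable", "USB Cord"), 3))  # 0.556
```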
Default Comparators by JSON Schema Type:
| JSON Schema Type | Default Comparator | Default Threshold | Rationale |
|------------------|-------------------|-------------------|-----------|
| "string" | LevenshteinComparator | 0.5 | Handles typos and minor variations |
| "number" | NumericComparator | 0.5 | Tolerates small numeric differences |
| "integer" | NumericComparator | 0.5 | Tolerates small numeric differences |
| "boolean" | ExactComparator | 1.0 | Must be exactly true or false |
| "array" (primitives) | Based on item type | Based on item type | Inherits the item type's behavior |
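As a usage sketch, here is a JSON Schema fragment annotated with the `x-aws-stickler-comparator` extension described above (property names illustrative):

```json
{
  "type": "object",
  "properties": {
    "shipment_id": {
      "type": "string",
      "x-aws-stickler-comparator": "ExactComparator"
    },
    "total": {
      "type": "number",
      "x-aws-stickler-comparator": "NumericComparator"
    }
  }
}
```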