
SSEM

SSEM is a semantic similarity-based evaluation library for natural language processing (NLP) text generation tasks. It supports various similarity metrics and evaluation levels, and is compatible with any Hugging Face pre-trained transformer model.


SSEM — Standardized Scoring and Evaluation Metrics

<p align="center"> <img src="https://raw.githubusercontent.com/TechyNilesh/SSEM/main/assets/ssem-logo.png" alt="SSEM Logo" width="300"> </p> <p align="center"> <img src="https://img.shields.io/badge/Python-3.10%2B-blue.svg" alt="Python"> <img src="https://img.shields.io/pypi/v/ssem.svg" alt="PyPI"> <a href="https://pepy.tech/project/ssem"><img src="https://static.pepy.tech/personalized-badge/ssem?period=total&units=none&left_color=grey&right_color=green&left_text=Downloads" alt="Downloads"></a> <img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License"> </p> <p align="center"><b>Lightweight evaluation metrics for LLMs and AI agents. No platform. No API keys. Just scores.</b></p>

SSEM provides 12 evaluation metrics covering text generation quality, factual consistency, hallucination detection, agentic AI evaluation, and safety — all with full scoring transparency and research citations.

Installation

# PyPI
pip install ssem

# uv
uv pip install ssem

# Latest from GitHub
pip install git+https://github.com/TechyNilesh/SSEM.git

Quick Start

from SSEM import SSEM

evaluator = SSEM()

# BERTScore — one line
result = evaluator.bertscore(
    ["The cat sat on the mat."],
    ["A cat was sitting on a mat."]
)
print(result.score)      # 0.87
print(result.explain())  # Full transparency report

Why SSEM?

| SSEM | DeepEval / Ragas |
|------|-----------------|
| 12 metrics in one lightweight package | Bundled with platforms, tracing, dashboards |
| No LLM-as-judge required: embedding and NLI based | Often requires GPT-4 API calls ($$$) |
| Agentic metrics built in: tool accuracy, reasoning chains | Focused on RAG; agents are an afterthought |
| Every score is transparent: method, model, citations | Black-box scores |
| Runs offline on CPU, no API keys needed | Many require cloud API keys |

Available Metrics

Text Generation Quality

| Metric | Method | Score Range | Citation |
|--------|--------|-------------|----------|
| semantic_similarity | Sentence embedding cosine/euclidean/pearson similarity | [-1, 1] or [0, 1] | Beken Fikri et al. (2021) |
| bertscore | Token-level precision, recall, F1 via contextual embeddings | [0, 1] | Zhang et al. (2020) |

Factual Consistency

| Metric | Method | Score Range | Citation |
|--------|--------|-------------|----------|
| faithfulness | Claim extraction + NLI/embedding entailment checking | [0, 1] | Kryscinski et al. (2020) |
| hallucination | Fraction of output claims NOT grounded in source | [0, 1] | Kryscinski et al. (2020); Manakul et al. (2023) |
| answer_relevancy | Question-answer embedding similarity | [0, 1] | Es et al. (2024) |

Agentic AI Evaluation

| Metric | Method | Score Range | Citation |
|--------|--------|-------------|----------|
| reasoning_coherence | Sequential + goal-aligned step similarity, contradiction detection | [0, 1] | Xia et al. (2024) |
| tool_accuracy | Tool selection + parameter + ordering accuracy (LCS) | [0, 1] | Liu et al. (2023) |
| task_completion | Checklist or reference-based graded completion | [0, 1] | Liu et al. (2023) |

Consistency & Safety

| Metric | Method | Score Range | Citation |
|--------|--------|-------------|----------|
| multi_turn_consistency | Cross-turn semantic consistency + contradiction detection | [0, 1] | Zheng et al. (2023) |
| selfcheck | Sampling consistency for hallucination detection | [0, 1] | Manakul et al. (2023) |
| toxicity | Classifier-based toxicity scoring | [0, 1] | Gehman et al. (2020) |

Code Evaluation

| Metric | Method | Score Range | Citation |
|--------|--------|-------------|----------|
| code_correctness | Execution-based Pass@k with unbiased estimator | [0, 1] | Chen et al. (2021) |

Scoring Transparency

Every SSEM metric returns a MetricResult — never a bare number. Each result includes:

result = evaluator.bertscore(outputs, references)

result.score           # 0.87 — the primary score
result.score_range     # (0.0, 1.0) — possible range
result.interpretation  # "Strong token-level overlap..."
result.method          # Step-by-step computation description
result.model_used      # "bert-base-multilingual-cased"
result.citations       # List of Citation objects
result.details         # Per-sample scores, intermediates
result.elapsed_sec     # Wall-clock time

# Full human-readable transparency report
print(result.explain())

Example output of result.explain():

Metric        : BERTScore
Score         : 0.8734
Score Range   : [0.0, 1.0]
Interpretation: Strong token-level overlap — output captures most reference content.
Model Used    : bert-base-multilingual-cased

How This Score Was Computed:
1. Encoded 1 sentence pairs into per-token contextual embeddings using 'bert-base-multilingual-cased'.
2. For each pair, built a cosine similarity matrix between output tokens and reference tokens.
3. Precision = mean of row-wise max similarities (each output token's best reference match).
4. Recall = mean of column-wise max similarities (each reference token's best output match).
5. F1 = harmonic mean of precision and recall.
6. Averaged across 1 pairs: P=0.8912, R=0.8561, F1=0.8734.

Research Citations:
  [1] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020. https://arxiv.org/abs/1904.09675
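
The precision, recall, and F1 steps in the report above can be sketched in plain Python. This is a minimal illustration, not SSEM's implementation: `greedy_prf` is a hypothetical helper, and the 2-d vectors are toy stand-ins for the contextual token embeddings a real model would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_prf(output_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 (hypothetical helper)."""
    # Rows are output tokens, columns are reference tokens.
    sim = [[cosine(o, r) for r in reference_vecs] for o in output_vecs]
    # Precision: each output token's best reference match, averaged.
    precision = sum(max(row) for row in sim) / len(output_vecs)
    # Recall: each reference token's best output match, averaged.
    recall = sum(max(col) for col in zip(*sim)) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Toy "token embeddings" (assumed values, for illustration only).
output_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f1 = greedy_prf(output_vecs, reference_vecs)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Here every output token has an exact match in the reference, so precision is 1.0, while the middle reference token is only partially covered, which pulls recall (and therefore F1) below 1.0.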

Usage Examples

Text Generation Evaluation

from SSEM import SSEM

evaluator = SSEM()

outputs = ["The cat sat on the mat.", "It was a sunny day."]
references = ["A cat was sitting on a mat.", "The weather was sunny."]

# Semantic similarity
result = evaluator.semantic_similarity(outputs, references)

# BERTScore (P/R/F1)
result = evaluator.bertscore(outputs, references)
print(result.details["precision"])  # 0.89
print(result.details["recall"])     # 0.85
print(result.details["f1"])         # 0.87
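
For intuition about the three similarity options named for `semantic_similarity` (cosine, euclidean, pearson), here is a small standard-library sketch over plain vectors. The function names are hypothetical, and the `1 / (1 + d)` mapping from Euclidean distance to a similarity is an assumption made for illustration; SSEM may normalize differently.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    denom = _norm(u) * _norm(v)
    if denom == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return sum(a * b for a, b in zip(u, v)) / denom

def euclidean_similarity(u, v):
    """Assumed mapping: distance d in [0, inf) becomes 1 / (1 + d) in (0, 1]."""
    return 1.0 / (1.0 + math.dist(u, v))

def pearson_similarity(u, v):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

# Toy vectors standing in for sentence embeddings.
u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]

print(cosine_similarity(u, v))
print(euclidean_similarity(u, v))
print(pearson_similarity(u, v))
```

The same pair of vectors can look fairly similar under cosine and perfectly anti-correlated under Pearson, which is why the score range in the metrics table depends on the measure chosen.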

Faithfulness & Hallucination

output = "Paris is the capital of France. The Eiffel Tower is in London."
source = "Paris is the capital of France. The Eiffel Tower is located in Paris."

# Faithfulness — are claims grounded?
result = evaluator.faithfulness(output, source)
print(result.score)    # 0.5 — one of two claims is unfaithful
print(result.details)  # Per-claim breakdown with individual scores

# Hallucination — what fraction is fabricated?
result = evaluator.hallucination(output, source)
print(result.score)    # 0.5 — half the claims are hallucinated
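
The arithmetic behind the two scores above can be sketched as follows. This assumes the per-claim entailment verdicts are already available (here they are assigned by hand), and that hallucination is the complement of faithfulness, which matches the 0.5 / 0.5 example but is not confirmed as SSEM's exact formula. `claim_scores` is a hypothetical helper.

```python
def claim_scores(verdicts):
    """Aggregate per-claim verdicts (True = supported by the source)."""
    if not verdicts:
        raise ValueError("at least one claim is required")
    supported = sum(1 for ok in verdicts if ok)
    total = len(verdicts)
    faithfulness = supported / total
    hallucination = (total - supported) / total
    return faithfulness, hallucination

# Claims from the example output, with hand-assigned verdicts (assumption:
# a real pipeline would obtain these from an NLI or embedding check).
verdicts = {
    "Paris is the capital of France.": True,
    "The Eiffel Tower is in London.": False,
}

faithfulness, hallucination = claim_scores(list(verdicts.values()))
print(faithfulness, hallucination)  # 0.5 0.5
```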

Agentic AI Evaluation

# Reasoning chain coherence
result = evaluator.reasoning_coherence(
    reasoning_steps=[
        "First, I need to find the user's order history.",
        "Next, I'll filter orders from the last 30 days.",
        "Then, I'll calculate the total spending.",
        "Finally, I'll generate a summary report.",
    ],
    goal="Generate a spending report for the last month."
)
print(result.score)                       # 0.82
print(result.details["contradictions"])   # [] — no contradictions

# Tool call accuracy
result = evaluator.tool_accuracy(
    predicted_calls=[
        {"tool": "database_query", "params": {"table": "orders", "days": 30}},
        {"tool": "calculate_sum", "params": {"column": "amount"}},
    ],
    expected_calls=[
        {"tool": "database_query", "params": {"table": "orders", "days": 30}},
        {"tool": "calculate_sum", "params": {"column": "amount"}},
    ],
)
print(result.score)  # 1.0 — perfect tool usage

# Task completion (checklist mode)
result = evaluator.task_completion(
    agent_output="I queried the database and found 15 orders totaling $1,234.",
    expected_criteria=[
        "Query the order database",
        "Calculate total spending",
        "Report the number of orders",
    ],
)
print(result.score)  # 0.67 — 2 of 3 criteria met
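
The metrics table lists LCS (longest common subsequence) as the basis for the ordering component of `tool_accuracy`. A minimal sketch of such an ordering score is shown below. The helper names and the normalization by the longer sequence are assumptions for illustration, and the sketch ignores the tool-selection and parameter components.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def ordering_score(predicted_tools, expected_tools):
    """Hypothetical ordering score: LCS length over the longer sequence length."""
    longest = max(len(predicted_tools), len(expected_tools))
    if longest == 0:
        return 1.0  # nothing expected, nothing called
    return lcs_length(predicted_tools, expected_tools) / longest

expected = ["database_query", "calculate_sum"]

print(ordering_score(["database_query", "calculate_sum"], expected))  # 1.0
print(ordering_score(["calculate_sum", "database_query"], expected))  # 0.5
```

Swapping the two calls keeps both tools but halves the score, because only one of them can be matched in order.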

Multi-Turn Consistency

result = evaluator.multi_turn_consistency(
    responses=[
        "I recommend Python for this project.",
        "Python has great ML libraries like scikit-learn.",
        "Actually, you should use Java instead.",  # contradiction!
    ]
)
print(result.score)                     # 0.61
print(result.details["contradictions"]) # Flags the Java contradiction

Code Correctness

result = evaluator.code_correctness(
    code_samples=[
        "def factorial(n):\n    if n <= 1: return 1\n    return n * factorial(n-1)",
        "def factorial(n):\n    return n * n",  # wrong
    ],
    test_code="assert factorial(5) == 120\nassert factorial(0) == 1",
    k_values=[1, 2],
)
print(result.details["pass_at_k"])  # {"pass@1": 0.5, "pass@2": 1.0}
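
The `pass_at_k` values above follow from the unbiased estimator in Chen et al. (2021): with `n` samples of which `c` pass, pass@k = 1 - C(n - c, k) / C(n, k). A standalone sketch of that formula is shown below; the function name is illustrative and this is not SSEM's internal code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for n samples with c passing.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Two samples, one correct, as in the factorial example.
scores = {f"pass@{k}": pass_at_k(n=2, c=1, k=k) for k in (1, 2)}
print(scores)  # {'pass@1': 0.5, 'pass@2': 1.0}
```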

Full Evaluation Report

report = evaluator.evaluate_all(
    output_sentences=["The cat sat on the mat."],
    reference_sentences=["A cat was sitting on a mat."],
    source_context="A cat was observed sitting on a mat in the room.",
    reasoning_steps=["Find the cat.", "Describe its position."],
)

print(report.summary())   # One-line-per-metric table
print(report.explain())   # Full transparency + bibliography
print(report.to_json())   # JSON export for pipelines

Research Citations

SSEM is grounded in peer-reviewed research. Every metric cites its origin:

| Metric | Paper | Venue |
|--------|-------|-------|
| BERTScore | Zhang et al. "BERTScore: Evaluating Text Generation with BERT" | ICLR 2020 |
| Semantic Similarity | Beken Fikri et al. "Semantic Similarity Based Evaluation for Abstractive News Summarization" | GEM @ ACL 2021 |
| Faithfulness | Kryscinski et al. "Evaluating the Factual Consistency of Abstractive Text Summarization" | EMNLP 2020 |
| SelfCheck | Manakul et al. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" | EMNLP 2023 |
| Answer Relevancy | Es et al. "RAGAS: Automated Evaluation of Retrieval Augmented Generation" | EACL 2024 |
| Reasoning Coherence | Xia, Li, Liu, Wu &
