TextMachina

A modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, and boundary detection.

<!--- Copyright 2023 Genaios Licensed under the CC BY-NC-ND 4.0 License You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes. If you remix, transform, or build upon the material, you may not distribute the modified material. You are free to copy and redistribute this material as it is in any medium or format You may obtain a copy of the License at https://creativecommons.org/licenses/by-nc-nd/4.0/ --> <p align="center"> <picture> <img alt="TextMachina" src="https://github.com/Genaios/TextMachina/blob/main/assets/title.png?raw=true" width="352" height="59" style="max-width: 100%;"> </picture> <br/> <br/> </p> <p align="center"> <a href="LICENSE"> <img alt="license" src="https://img.shields.io/badge/license-CC_BY_NC_ND_4.0-green"> </a> <a href="https://textmachina.readthedocs.io/en/latest/"> <img alt="Documentation" src="https://img.shields.io/badge/Documentation-Readthedocs-green"> </a> <a href="CODE_OF_CONDUCT.md"> <img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-v2.0-green"> </a> <a href="https://pypi.org/project/text-machina/"> <img alt="Pypi version" src="https://img.shields.io/pypi/v/text-machina"> </a> <a href="https://pypi.org/project/text-machina/"> <img alt="Downloads" src="https://img.shields.io/pypi/dm/text-machina"> </a> </p> <h3 align="center"> <p><b>Unifying strategies to build MGT datasets in a single framework</b></p> </h3>

TextMachina is a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as:

  • 🔎 Detection: detect whether a text has been generated by an LLM.
  • 🕵️‍♂️ Attribution: identify what LLM has generated a text.
  • 🚧 Boundary detection: find the boundary between human and generated text.
  • 🎨 Mixcase: ascertain whether specific text spans are human-written or generated by LLMs.
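
To make the difference between these tasks concrete, the sketch below builds toy records for detection and boundary detection. The field names (`text`, `label`), the helper functions, and the use of a character offset as the boundary label are all hypothetical, chosen for illustration only; the columns TextMachina actually emits may differ.

```python
def detection_record(text: str, is_generated: bool) -> dict:
    """Detection: one binary label for the whole text."""
    return {"text": text, "label": "generated" if is_generated else "human"}


def boundary_record(human_prefix: str, generated_suffix: str) -> dict:
    """Boundary detection: the label marks where generated text begins.

    Here the label is a character offset (an assumption for this sketch).
    """
    text = human_prefix + generated_suffix
    return {"text": text, "label": len(human_prefix)}


det = detection_record("Members then voted to approve the budget.", True)
rec = boundary_record(
    "The council met on Monday. ",
    "Members then voted to approve the budget.",
)

# Slicing at the boundary recovers the human-written part.
human_part = rec["text"][: rec["label"]]
generated_part = rec["text"][rec["label"]:]
```

Attribution would replace the binary label with the name of the generating model, and mixcase would carry one label per text span instead of a single offset.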

TextMachina provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets:

  • 🦜 LLM integrations: easily integrate any LLM provider. Currently, TextMachina supports LLMs from Anthropic, Cohere, OpenAI, Google Vertex AI, Amazon Bedrock, AI21, Azure OpenAI, models deployed on vLLM and TRT inference servers, and any model from HuggingFace deployed either locally or remotely through Inference API or Inference Endpoints. See models to implement your own LLM provider.

  • ✍️ Prompt templating: just write your prompt template with placeholders and let TextMachina's extractors fill the template and prepare a prompt for an LLM. See extractors to implement your own extractors and learn more about the placeholders for each extractor.

  • 🔒 Constrained decoding: automatically infer LLM decoding hyper-parameters from the human texts to improve the quality and reduce the biases of your MGT datasets. See constrainers to implement your own constrainers.

  • 🛠️ Post-processing: post-processing functions that improve the quality of any MGT dataset and prevent common biases and artifacts. See postprocessing to add new post-processing functions.

  • 🌈 Bias mitigation: TextMachina is built with bias prevention in mind and helps you throughout the pipeline to avoid introducing spurious correlations in your datasets.

  • 📊 Dataset exploration: explore the generated datasets and quantify their quality with a set of metrics. See metrics and interactive to implement your own metrics and visualizations.
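
As an illustration of the constrained decoding idea, the sketch below infers a generation length range from a sample of human texts, so that generated texts are not trivially separable from human ones by length. This is one plausible approach and not TextMachina's actual constrainer: the function name, the whitespace tokenization, and the `min_tokens`/`max_tokens` keys are assumptions made for this example.

```python
import statistics


def infer_length_bounds(human_texts, num_std=1.0):
    """Estimate min/max generation lengths from human texts.

    Lengths are counted in whitespace-separated tokens; a real
    constrainer would more likely use the target model's tokenizer.
    """
    lengths = [len(text.split()) for text in human_texts]
    mean = statistics.mean(lengths)
    std = statistics.pstdev(lengths)
    min_tokens = max(1, round(mean - num_std * std))
    max_tokens = max(min_tokens, round(mean + num_std * std))
    return {"min_tokens": min_tokens, "max_tokens": max_tokens}


human_texts = [
    "one two three four",
    "one two three four five six",
    "one two three four five six seven eight",
]
bounds = infer_length_bounds(human_texts)
print(bounds)  # {'min_tokens': 4, 'max_tokens': 8}
```

The resulting bounds would then be merged into the decoding parameters sent to the LLM, using whatever parameter names the chosen provider expects.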

The following diagram depicts TextMachina's pipeline.

<p align="center"> <picture> <img alt="TextMachina Pipeline" src="https://github.com/Genaios/TextMachina/blob/main/assets/diagram.png?raw=true"> </picture> <br/> <br/> </p>

🔧 Installation


You can install all the dependencies with pip:

pip install text-machina[all]

or only with the dependencies for a specific LLM provider or for development (see setup.py):

pip install text-machina[anthropic,dev]

You can also install directly from source:

pip install .[all]

If you're planning to modify the code for specific use cases, you can install TextMachina in development mode:

pip install -e .[dev]

👀 Quick Tour


Once installed, you are ready to use TextMachina for building MGT datasets either using the CLI or programmatically.

📟 Using the CLI

The first step is to define a YAML configuration file or a directory tree containing YAML files. Read the examples/learning files to learn how to define configurations using different providers and extractors for different tasks. Take a look at examples/use_cases to see configurations for specific use cases.

Then, we can call the explore and generate endpoints of TextMachina's CLI. The explore endpoint lets you inspect a small dataset generated with a specific configuration through an interactive interface. For instance, suppose we want to check what an MGT detection dataset generated using XSum news articles and gpt-3.5-turbo-instruct looks like, and compute some metrics:

text-machina explore --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection \
--metrics-path etc/metrics.yaml \
--max-generations 10
<p align="center"> <picture> <img alt="CLI interface showing generated and human text for detection" src="https://github.com/Genaios/TextMachina/blob/main/assets/explore.png?raw=true"> </picture> <br/> <br/> </p>

Great! With this configuration, our dataset looks good: no artifacts, no biases, and high-quality text. Let's now generate a whole dataset for MGT detection using that config file. The generate endpoint allows you to do that:

text-machina generate --config-path etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml \
--task-type detection

A run name will be assigned to your execution and TextMachina will cache results behind the scenes. If your run is interrupted at any point, you can use --run-name <run-name> to recover the progress and continue generating your dataset.
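
If you launch generation from a script, a small helper can assemble the command and add --run-name only when resuming. The sketch below is a convenience wrapper written for this example, not part of TextMachina; it uses only the flags shown above, and the run name `my-run` is a placeholder for whatever name your interrupted run was assigned.

```python
import shlex


def build_generate_command(config_path, task_type, run_name=None):
    """Build the argument list for `text-machina generate`.

    Pass run_name to resume an interrupted run from its cache.
    """
    cmd = [
        "text-machina", "generate",
        "--config-path", config_path,
        "--task-type", task_type,
    ]
    if run_name is not None:
        cmd += ["--run-name", run_name]
    return cmd


cmd = build_generate_command(
    "etc/examples/xsum_gpt-3-5-turbo-instruct_openai.yaml",
    "detection",
    run_name="my-run",  # placeholder run name
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to execute the CLI.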

👩‍💻 Programmatically

You can also use TextMachina programmatically. To do that, instantiate a dataset generator by calling get_generator with a Config object, and run its generate method. The Config object must contain the input, model, and generation configs, together with the task type for which the MGT dataset will be generated. Let's replicate the previous experiment programmatically:

from text_machina import get_generator
from text_machina import Config, InputConfig, ModelConfig

input_config = InputConfig(
    domain="news",
    language="en",
    quantity=10,
    random_sample_human=True,
    dataset="xsum",
    dataset_text_column="document",
    dataset_params={"split": "test"},
    template=(
        "Write a news article whose summary is '{summary}' "
        "using the entities: {entities}\n\nArticle:"
    ),
    ),
    extractor="combined",
    extractors_list=["auxiliary.Auxiliary", "entity_list.EntityList"],
    max_input_tokens=256,
)

model_config = ModelConfig(
    provider="openai",
    model_name="gpt-3.5-turbo-instruct",
    api_type="COMPLETION",
    threads=8,
    max_retries=5,
    timeout=20,
)

generation_config = {"temperature": 0.7, "presence_penalty": 1.0}

config = Config(
    input=input_config,
    model=model_config,
    generation=generation_config,
    task_type="detection",
)
generator = get_generator(config)
dataset = generator.generate()
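
To see what the template placeholders turn into, the sketch below fills a template of the same shape with plain `str.format`. The extracted values are made up for this example; in TextMachina the extractors produce them from each human text, and the exact format of their output (for instance, how entities are joined) is an assumption here.

```python
template = (
    "Write a news article whose summary is '{summary}' "
    "using the entities: {entities}\n\nArticle:"
)

# Hypothetical values standing in for what the extractors would
# derive from a single XSum sample.
extracted = {
    "summary": "A city council approved a new budget.",
    "entities": "city council, budget",
}

# Each placeholder name must match a key provided by an extractor.
prompt = template.format(**extracted)
print(prompt)
```

A placeholder with no matching key raises `KeyError`, which is a quick way to catch a template that does not match the configured extractors.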

🛠️ Supported tasks


TextMachina can generate datasets for MGT detection, attribution, boundary detection, and mixcase detection:

<p align="center"> <picture> <img alt="CLI interface showing generated and human text for detection" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/detection.png?raw=true"> <figcaption>Example from a detection task.</figcaption> </picture> <br/> <br/> </p> <p align="center"> <picture> <img alt="CLI interface showing generated and human text for attribution" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/attribution.png?raw=true"> <figcaption>Example from an attribution task.</figcaption> </picture> <br/> <br/> </p> <p align="center"> <picture> <img alt="CLI interface showing generated and human text for boundary" src="https://github.com/Genaios/TextMachina/blob/main/assets/tasks/bounda
