transCSSR: Causal State Splitting Reconstruction for epsilon-transducers

A Python implementation of the Causal State Splitting Reconstruction (CSSR) algorithm for inferring epsilon-transducers from data generated by discrete-valued, discrete-time input/output systems. This library allows you to model complex systems by learning their underlying causal structure from observed sequences.

This implementation has been tested on macOS 15.5 and should work on any Unix-like OS.

Key Features

Causal State Inference: Learns the minimal set of causal states that explain the observed input-output sequences, including support for memoryful transducers where state depends on past history.
Predictive Modeling: Uses the inferred causal states to predict future outputs given past observations.
Statistical Hypothesis Testing: Implements chi-squared and G-tests to statistically compare predictive distributions and determine state equivalence.
Visualization: Generates .dot files representing the inferred epsilon-transducers, which can be visualized using Graphviz.
Data Processing & Evaluation: Includes utilities for data filtering, metric computation (accuracy, precision, recall, TV distance), and model evaluation.
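
To illustrate the hypothesis-testing feature, the sketch below compares the next-output counts observed after two histories using SciPy's chi2_contingency, once as a Pearson chi-squared test and once as a G-test. The counts, the significance level alpha, and the variable names are invented for this example; this is not the library's internal test routine.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of the next output symbol ('0', '1') observed after
# two different histories; rows are histories, columns are output symbols.
counts = np.array([
    [48, 52],  # next-output counts following history A
    [45, 55],  # next-output counts following history B
])

# Pearson chi-squared test of homogeneity (no continuity correction).
chi2_stat, chi2_p, dof, _ = chi2_contingency(counts, correction=False)

# G-test (log-likelihood ratio test) on the same table.
g_stat, g_p, _, _ = chi2_contingency(
    counts, correction=False, lambda_="log-likelihood"
)

alpha = 0.001  # hypothetical significance level
# Treat the two histories as equivalent if neither test rejects.
same_state = bool(chi2_p > alpha and g_p > alpha)
print(f"chi2 p={chi2_p:.3f}, G p={g_p:.3f}, treat as same state: {same_state}")
```

When neither test rejects, the two histories have statistically indistinguishable predictive distributions and are candidates for the same causal state.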

Installation

To install the library, use pip:

pip install git+https://github.com/ddarmon/transCSSR

Dependencies: The library requires the following packages: numpy, scipy, pandas, python-igraph, matplotlib, joblib, tqdm. These will be installed automatically via pip.

Alternative Installation with Conda: For easier dependency management, especially with compiled packages:

conda install numpy scipy pandas python-igraph matplotlib joblib tqdm
pip install git+https://github.com/ddarmon/transCSSR

Graphviz

Visualizing the inferred epsilon-transducers requires Graphviz or other software that can read .dot files.

To use Graphviz within a Jupyter notebook (e.g., with Anaconda), install the graphviz Python package (distributed as python-graphviz on conda):

pip install graphviz
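
As a rough illustration of the .dot workflow, the sketch below writes a small hand-made DOT file and renders it with the Graphviz dot executable when one is available. The file name, the two-state machine, and the "input|output:probability" edge-label convention are assumptions made for this example; the files written by transCSSR may be labeled differently.

```python
import shutil
import subprocess
from pathlib import Path

# Hand-written DOT source for a hypothetical two-state transducer.
dot_source = """digraph transducer {
    rankdir=LR;
    A -> A [label="a|0:0.9"];
    A -> B [label="a|1:0.1"];
    B -> A [label="b|0:1.0"];
}
"""

dot_path = Path("example_transducer.dot")
dot_path.write_text(dot_source)

# Render to SVG only if the Graphviz `dot` executable is on the PATH.
if shutil.which("dot") is not None:
    subprocess.run(
        ["dot", "-Tsvg", str(dot_path), "-o", "example_transducer.svg"],
        check=True,
    )
else:
    print("Graphviz 'dot' not found; wrote", dot_path, "without rendering.")
```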

Usage

Programmatic Usage

You can use the transCSSR library directly in your Python scripts to infer causal states and make predictions.

# Example usage in a Python script:

from transCSSR import estimate_predictive_distributions, run_transCSSR, filter_and_predict
import numpy as np
import itertools

# --- 1. Prepare Data ---
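# NOTE: in the default single-sequence mode, transCSSR expects each sequence
# as a single string, not a list of symbols. Lists of strings are used only
# for ensemble data with is_multiline=True (see Data Format).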
# For demonstration, let's use a small sample similar to what might be in demo scripts.
# In a real scenario, you would load your data from files or other sources.

# Example data (replace with your actual data)
stringX = 'abacabacabac'  # Input sequence as string
stringY = '010001000100'  # Output sequence as string

# Alternative: Load from files (as used in demo scripts)
# stringY = open('data/even.dat').readline().strip()
# stringX = '0' * len(stringY)  # For output-only processes

# Define alphabet symbols
axs = ['a', 'b', 'c'] # Input alphabet
ays = ['0', '1']      # Output alphabet

# Define maximum history length (how far back to look for patterns)
L_max = 3

# Create emission symbols (required for run_transCSSR)
# These represent all possible (input, output) pairs
e_symbols = list(itertools.product(axs, ays))

# Define names for the input/output processes (required for run_transCSSR)
Xt_name = 'input_process'
Yt_name = 'output_process'

# --- 2. Infer Predictive Distributions ---
# This step estimates the probability of future outputs given past inputs and outputs.
word_lookup_marg, word_lookup_fut = estimate_predictive_distributions(
    stringX, stringY, L_max, axs=axs, ays=ays
)

# --- 3. Run CSSR Algorithm ---
# This infers the causal states and transitions (epsilon-transducer).
epsilon, invepsilon, morph_by_state = run_transCSSR(
    word_lookup_marg, word_lookup_fut, L_max, axs, ays,
    e_symbols, Xt_name, Yt_name
)

print(f"Inferred {len(invepsilon)} causal states.")

# Display information about the inferred states
for state_id, histories in invepsilon.items():
    print(f"  State {state_id}: {len(histories)} histories")

# --- 4. Predict Future Outputs ---
# For detailed prediction and evaluation, refer to the demo scripts.
# The filter_and_predict function can be used for prediction:
filtered_states, filtered_probs, stringY_pred = filter_and_predict(
    stringX, stringY, epsilon, invepsilon, morph_by_state,
    axs, ays, e_symbols, L_max
)

# Results explanation:
# - epsilon: mapping from histories to states
# - invepsilon: mapping from states to histories
# - morph_by_state: predictive distributions for each state
# - filtered_states: inferred state sequence
# - stringY_pred: predicted output sequence

print("Inference complete. For detailed prediction and visualization, please refer to the demo scripts.")
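
To make step 2 above more concrete, here is a simplified sketch of estimating predictive distributions by counting. It conditions the next output on the previous history_length inputs and outputs plus the current input. The function name and the exact conditioning convention are assumptions for illustration only; estimate_predictive_distributions may organize its counts differently.

```python
from collections import Counter, defaultdict

def empirical_predictive_distributions(string_x, string_y, history_length):
    """Estimate P(y_t | recent inputs incl. x_t, recent outputs) by counting."""
    if len(string_x) != len(string_y):
        raise ValueError("input and output sequences must have the same length")
    counts = defaultdict(Counter)
    for t in range(history_length, len(string_y)):
        # Context: inputs x_{t-L..t} (including the current input)
        # and outputs y_{t-L..t-1}.
        context = (
            string_x[t - history_length:t + 1],
            string_y[t - history_length:t],
        )
        counts[context][string_y[t]] += 1
    # Normalize each counter into a probability distribution.
    return {
        context: {sym: n / sum(counter.values()) for sym, n in counter.items()}
        for context, counter in counts.items()
    }

dists = empirical_predictive_distributions('abacabacabac', '010001000100', 1)
for context, dist in sorted(dists.items()):
    print(context, dist)
```

Contexts that yield statistically indistinguishable distributions are the ones a CSSR-style procedure groups into a single causal state.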

Demo Scripts

The repository includes several demo scripts to illustrate the functionality:

demo_transCSSR.py: Demonstrates the end-to-end process of inferring an epsilon-transducer and using it for prediction, including plotting results.
demo_computational_mechanics_bootstrap.py: Focuses on using bootstrapping to assess the statistical reliability of the inferred models.
demo_CSSR.py: Provides another example of CSSR inference and prediction, potentially with different datasets or parameters.
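
For readers unfamiliar with bootstrapping, the sketch below shows the general idea on a toy statistic: resample an ensemble of sequences with replacement and read off a percentile interval. The data, the statistic, and the resampling scheme are all assumptions made for illustration; the procedure in demo_computational_mechanics_bootstrap.py may differ (for instance, it may resample from the inferred model rather than from the data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of output sequences (one string per realization).
sequences = ['01000100', '10100101', '00100010', '01010100', '00010001']

def fraction_of_ones(seqs):
    """Toy statistic: fraction of '1' symbols across all sequences."""
    joined = ''.join(seqs)
    return joined.count('1') / len(joined)

n_boot = 1000
estimates = np.empty(n_boot)
for b in range(n_boot):
    # Resample whole sequences with replacement.
    idx = rng.integers(0, len(sequences), size=len(sequences))
    estimates[b] = fraction_of_ones([sequences[i] for i in idx])

# 95% percentile bootstrap interval for the statistic.
lower, upper = np.percentile(estimates, [2.5, 97.5])
print(f"point estimate: {fraction_of_ones(sequences):.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```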

Project Structure

transCSSR.py: Contains the core implementation of the Causal State Splitting Reconstruction algorithm, including state inference, statistical tests, and visualization utilities.
filter_data_methods.py: Provides helper functions for data preprocessing, filtering, and computing performance metrics (accuracy, precision, recall, etc.).
demo_*.py files: Example scripts demonstrating how to use the library for various tasks, from basic inference to advanced analysis like bootstrapping and prediction.
simulation-codes/: Contains scripts for generating synthetic data, useful for testing and benchmarking the algorithm.
data/: Directory for storing datasets used in demonstrations or analyses.
README.md: This file, providing an overview, installation, and usage instructions.
setup.py: Standard Python setup script for packaging and distributing the library.
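
The performance metrics mentioned for filter_data_methods.py can be sketched independently of the library. The helper names below are hypothetical and are not the functions defined in that module; they show the standard definitions of accuracy, precision, recall, and total variation (TV) distance for binary output strings.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred, positive='1'):
    """Accuracy, precision, and recall for two equal-length symbol strings."""
    true = np.array(list(y_true)) == positive
    pred = np.array(list(y_pred)) == positive
    tp = int(np.sum(true & pred))    # predicted positive, actually positive
    fp = int(np.sum(~true & pred))   # predicted positive, actually negative
    fn = int(np.sum(true & ~pred))   # predicted negative, actually positive
    accuracy = float(np.mean(true == pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    return accuracy, precision, recall

def total_variation_distance(p, q):
    """TV distance between two distributions over the same alphabet."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

# Hypothetical true and predicted output sequences.
accuracy, precision, recall = binary_classification_metrics(
    '010001000100', '010000000110'
)
tv = total_variation_distance([0.5, 0.5], [0.9, 0.1])
print(accuracy, precision, recall, tv)
```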

Data Format

The transCSSR library accepts input data in two formats:

Single Sequence Format

For single time series:

stringX: A single string where each character represents an input symbol at a given time step.
stringY: A single string where each character represents the corresponding output symbol at that time step.

Ensemble Format (Multiple Sequences)

For multiple independent realizations or ensemble data, set is_multiline=True:

stringX: A list of strings, where each string is an independent input sequence.
stringY: A list of strings, where each string is the corresponding output sequence.

# Example ensemble data
stringX_list = ['abacabac', 'babacaba', 'cabacabc']
stringY_list = ['01000100', '10100101', '00100010']

# Use with is_multiline=True
word_lookup_marg, word_lookup_fut = estimate_predictive_distributions(
    stringX_list, stringY_list, L_max,
    axs=axs, ays=ays, is_multiline=True
)

Requirements:

All sequences in an ensemble must use the same alphabets
Corresponding input and output sequences must have the same length
The alphabet for inputs (axs) and outputs (ays) should be defined as lists of strings containing all possible symbols
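
These requirements can be checked before calling the library. The helper below is a hypothetical pre-flight check written for this README, not part of transCSSR; it raises ValueError when a sequence pair has mismatched lengths or contains symbols outside the declared alphabets.

```python
def validate_sequences(string_x, string_y, axs, ays):
    """Raise ValueError if an input/output pair violates the format rules."""
    if len(string_x) != len(string_y):
        raise ValueError(
            f"length mismatch: {len(string_x)} inputs vs {len(string_y)} outputs"
        )
    unknown_x = set(string_x) - set(axs)
    unknown_y = set(string_y) - set(ays)
    if unknown_x:
        raise ValueError(f"input symbols not in axs: {sorted(unknown_x)}")
    if unknown_y:
        raise ValueError(f"output symbols not in ays: {sorted(unknown_y)}")

axs = ['a', 'b', 'c']
ays = ['0', '1']
stringX_list = ['abacabac', 'babacaba', 'cabacabc']
stringY_list = ['01000100', '10100101', '00100010']

# Validate every pair in the ensemble.
for x, y in zip(stringX_list, stringY_list):
    validate_sequences(x, y, axs, ays)
print("All sequences satisfy the format requirements.")
```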

Contributing

Contributions, bug reports, and feature requests are welcome! Please feel free to open an issue or submit a pull request on the GitHub repository.

License

This project is licensed under the GNU General Public License (GPL).

Legacy Python 2 Version of `transCSSR`

This version of transCSSR is for use with Python 3.7+. A legacy version for use with Python 2.7 is hosted here.
