GraphFLA
A graph-based Python framework for fitness landscape analysis

GraphFLA (Graph-based Fitness Landscape Analysis) is a Python framework for constructing, analyzing, manipulating and visualizing fitness landscapes as graphs. It provides a broad collection of features rooted in evolutionary biology to decipher the topography of complex fitness landscapes of diverse modalities.
This is also the official code & data repository for the NeurIPS 2025 (Spotlight) paper "Augmenting Biological Fitness Prediction Benchmarks with Landscapes Features from GraphFLA".
Feel free to explore examples in Google Colab!
Key Features
- Versatility: applicable to arbitrary discrete, combinatorial sequence-fitness data, ranging from biomolecules like DNA, RNA, and protein, to functional units like genes, to complex ecological communities.
- Comprehensiveness: offers a holistic collection of 20+ features for characterizing 4 fundamental topographical aspects of fitness landscapes: ruggedness, navigability, epistasis and neutrality.
- Interoperability: works with the same data format (i.e., X and f) as in training machine learning (ML) models, thus being interoperable with established ML ecosystems in different disciplines.
- Scalability: heavily optimized to be capable of handling landscapes with even millions of variants.
- Extensibility: new landscape features can be easily added via a unified API.
Quick Start
Our documentation website is currently under development, but GraphFLA is quite easy to get started with!
1. Installation
Official installation (pip)
pip install graphfla
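To confirm that the installation succeeded, the installed version can be queried from Python. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of GraphFLA, and the distribution name is assumed to match the `pip install` name above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "graphfla") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"graphfla {version}" if version else "graphfla is not installed")
```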
2. Prepare your data
GraphFLA is designed to interoperate with established ML frameworks and benchmarks by using the same data format as in ML model training: an X and an f.
Specifically, X can be a list of strings (one sequence per genotype), a pd.DataFrame, or a numpy.ndarray, wherein each column represents a locus; f can be a list, a pd.Series, or a numpy.ndarray.
To make landscape construction faster, we recommend removing redundant loci in X (i.e., those that are never mutated across the whole library).
import pandas as pd
# Load data:
data = pd.DataFrame({
    "sequences": ["AAA", "AAG", "AGA", "AGG", "GAA", "GAG", "GGA", "GGG"],
    "fitness": [0.10, 0.25, 0.25, 0.40, 0.25, 0.40, 0.40, 0.91]
})
# 3 positions (A/G), 8 variants; all connected via single mutations; unimodal (GGG optimum)
X = data["sequences"]
f = data["fitness"]
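The recommendation to remove redundant loci can be sketched in plain Python as follows. This is an illustrative preprocessing helper, not a GraphFLA function: the name `drop_invariant_loci` and the example library are hypothetical, and the sketch assumes X is given as equal-length strings.

```python
from typing import List, Sequence, Tuple


def drop_invariant_loci(sequences: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Remove positions whose value is identical across all sequences.

    Returns the trimmed sequences and the indices of the positions kept.
    Assumes all sequences have the same length.
    """
    if len(sequences) == 0:
        return [], []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    # A position is variable if more than one distinct symbol occurs there.
    kept = [i for i in range(length) if len({s[i] for s in sequences}) > 1]
    trimmed = ["".join(s[i] for i in kept) for s in sequences]
    return trimmed, kept


# Hypothetical library: position 1 is always "C" and carries no information.
library = ["ACA", "ACG", "GCA", "GCG"]
trimmed, kept = drop_invariant_loci(library)
print(trimmed)  # ['AA', 'AG', 'GA', 'GG']
print(kept)     # [0, 2]
```

When X is a pandas Series such as `data["sequences"]`, converting it first with `X.tolist()` keeps the helper independent of pandas. Keeping the returned `kept` indices makes it possible to map trimmed positions back to the original sequence coordinates.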
3. Create the landscape object
Creating a landscape object in GraphFLA is much like training an ML model: we first initialize a Landscape class, and then build it with our data.
Here, assume we are working with DNA sequences. GraphFLA registers performance optimizations for this data type, which are enabled by specifying type="dna". Alternatively, you can use the DNALandscape class directly; it is natively built for DNA data and has the same effect.
The maximize parameter specifies the direction of optimization, i.e., whether f is to be maximized or minimized.
from graphfla.landscape import DNALandscape
# initialize the landscape
landscape = DNALandscape(maximize=True)
# build the landscape with our data
landscape.build_from_data(X, f, verbose=True)
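To give some intuition for what "a landscape as a graph" means, the following sketch builds a directed edge list for the toy data by hand: variants are nodes, and each pair of single-mutation neighbors is joined by an edge pointing toward the better variant according to `maximize`. This is a conceptual illustration only and makes no claim about how GraphFLA constructs or stores its graph; the function names are hypothetical, and skipping equal-fitness neighbors is an assumption of this sketch.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def build_edges(fitness: Dict[str, float], maximize: bool = True) -> List[Tuple[str, str]]:
    """Directed edges (worse -> better) between single-mutation neighbors."""
    edges = []
    for a, b in combinations(fitness, 2):
        if hamming(a, b) != 1:
            continue  # not single-mutation neighbors
        fa, fb = fitness[a], fitness[b]
        if fa == fb:
            continue  # assumption of this sketch: neutral pairs get no edge
        a_is_better = fa > fb if maximize else fa < fb
        edges.append((b, a) if a_is_better else (a, b))
    return edges


# Same toy data as in the data preparation step.
toy = {
    "AAA": 0.10, "AAG": 0.25, "AGA": 0.25, "AGG": 0.40,
    "GAA": 0.25, "GAG": 0.40, "GGA": 0.40, "GGG": 0.91,
}
edges = build_edges(toy)
print(len(edges))  # 12 edges: the 3-dimensional hypercube over {A, G}
```

With `maximize=True`, every edge into GGG points uphill and GGG has no outgoing edge, which matches the "unimodal (GGG optimum)" note in the toy data.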
4. Landscape analysis
Once the landscape is constructed, we can then analyze its features using the available functions (see later).
from graphfla.analysis import (
    lo_ratio,
    classify_epistasis,
    r_s_ratio,
    neutrality,
    global_optima_accessibility,
)
local_optima_ratio = lo_ratio(landscape)
epistasis = classify_epistasis(landscape)
r_s_score = r_s_ratio(landscape)
neutrality_index = neutrality(landscape)
go_access = global_optima_accessibility(landscape)
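As a sanity check on what a feature such as the fraction of local optima measures, it can be computed by brute force on the toy data. The sketch below is independent of GraphFLA and is not its implementation: the helper names are hypothetical, and a local optimum is defined here as a variant with no strictly better single-mutation neighbor, which may differ from how GraphFLA treats ties.

```python
from typing import Dict, Iterator, List


def neighbors(seq: str, alphabet: str) -> Iterator[str]:
    """Yield every sequence that differs from `seq` at exactly one position."""
    for i, current in enumerate(seq):
        for symbol in alphabet:
            if symbol != current:
                yield seq[:i] + symbol + seq[i + 1:]


def local_optima(fitness: Dict[str, float], alphabet: str, maximize: bool = True) -> List[str]:
    """Variants with no strictly better neighbor among the measured variants."""
    sign = 1.0 if maximize else -1.0
    optima = []
    for seq, fit in fitness.items():
        has_better = any(
            sign * fitness[n] > sign * fit
            for n in neighbors(seq, alphabet)
            if n in fitness  # unmeasured neighbors are ignored
        )
        if not has_better:
            optima.append(seq)
    return optima


toy = {
    "AAA": 0.10, "AAG": 0.25, "AGA": 0.25, "AGG": 0.40,
    "GAA": 0.25, "GAG": 0.40, "GGA": 0.40, "GGG": 0.91,
}
optima = local_optima(toy, "AG")
ratio = len(optima) / len(toy)
print(optima)  # ['GGG']
print(ratio)   # 0.125
```

A single optimum out of eight variants is consistent with the toy landscape being unimodal.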
5. Playing with arbitrary combinatorial data
The type parameter of the Landscape class currently supports "dna", "rna", "protein", and "boolean". However, this does not mean that GraphFLA can only work with these types of data; these registered values exist only for convenience and performance optimization.
In fact, GraphFLA can handle an arbitrary combinatorial search space as long as the values of each variable are discrete. To work with such data, we can initialize a general landscape, and then pass in a dictionary to specify the data type of each variable (options: {"ordinal", "categorical", "boolean"}).
import pandas as pd
from graphfla.landscape import Landscape
complex_data = pd.read_csv("path_to_complex_data.csv")
f = complex_data["fitness"]
# data serving as "X"
complex_search_space = complex_data.drop(columns=["fitness"])
# initialize a general fitness landscape without specifying `type`
landscape = Landscape(maximize=True)
# create a data type dictionary
data_types = {
    "x1": "ordinal",
    "x2": "categorical",
    "x3": "boolean",
    "x4": "categorical"
}
# build the landscape with our data and specified data types
landscape.build_from_data(complex_search_space, f, data_types=data_types, verbose=True)
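Because the data type dictionary is written by hand, a misspelled type name or a missing column is easy to introduce. The sketch below checks a dictionary against the column names before building the landscape. It is a hypothetical helper, not part of GraphFLA; it assumes that every column should be given one of the three documented options, and whether GraphFLA performs similar validation itself is not stated here.

```python
from typing import Dict, Iterable, List

# The three options documented above.
ALLOWED_TYPES = {"ordinal", "categorical", "boolean"}


def check_data_types(columns: Iterable[str], data_types: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the dictionary is consistent."""
    columns = list(columns)
    problems = []
    for col in columns:
        if col not in data_types:
            problems.append(f"column {col!r} has no declared type")
        elif data_types[col] not in ALLOWED_TYPES:
            problems.append(f"column {col!r} has unknown type {data_types[col]!r}")
    for key in data_types:
        if key not in columns:
            problems.append(f"type declared for missing column {key!r}")
    return problems


# Hypothetical example with a misspelled type and an undeclared column.
example_columns = ["x1", "x2", "x3", "x4"]
example_types = {"x1": "ordinal", "x2": "cateogrical", "x3": "boolean"}
for problem in check_data_types(example_columns, example_types):
    print(problem)
```

With a DataFrame, the column names can be passed as `complex_search_space.columns`.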
Landscape Analysis Features
GraphFLA currently supports the following features for landscape analysis. Only landscape-level analysis tools are listed; mutation-specific (e.g., distribution_fit_effects, idiosyncratic_index) and position-specific (e.g., single_mutation_effects) tools are excluded.
| Class | Function | Feature | Range | Higher value indicates |
|--------------------------|----------------------------------|----------------------------------------|---------------|----------------------------------------|
| Ruggedness | lo_ratio | Fraction of local optima | [0, 1] | ↑ more peaks |
| | r_s_ratio | Roughness-slope ratio | [0, ∞) | ↑ ruggedness |
| | autocorrelation | Autocorrelation | [-1, 1] | ↓ ruggedness |
| | gradient_intensity | Gradient intensity | [0, ∞) | ↑ average fitness change per edge |
| | neighbor_fit_corr | Neighbor-fitness correlation | [-1, 1] | ↓ ruggedness |
| Epistasis | classify_epistasis | Magnitude epistasis | [0, 1) | ↓ evolutionary constraints |
| | classify_epistasis | Sign epistasis | [0, 1] | ↑ evolutionary constraints |
| | classify_epistasis | Reciprocal sign epistasis | [0, 1] | ↑ evolutionary constraints |
| | classify_epistasis | Positive epistasis | [0, 1] | ↑ synergistic effects |
| | classify_epistasis | Negative epistasis | [0, 1] | ↑ antagonistic effects |
| | global_idiosyncratic_index | Global idiosyncratic index | [0, 1] | ↑ specific interactions |
| | diminishing_returns_index | Diminishing return epistasis | [-1, 1] | ↓ flat peaks (higher = less diminishing returns) |
| | increasing_costs_index | Increasing cost epistasis | [-1, 1] | ↑ steep descents |
| | higher_order_epistasis | Higher-order epistasis (R²) | [0, 1] | ↑ higher-order interactions |
| | gamma_statistic | Gamma statistic | [-1, 1] | ↑ epistasis (magnitude) |
| | gamma_star | Gamma star statistic | [-1, 1] | ↑ sign epistasis consistency |
| | walsh_hadamard_coefficient | Pairwise and higher-order epistasis | - | - |
| | extradimensional_bypass_analysis | Extradimensional bypass proportion | [0, 1] | ↑ navigability |
| Navigability | fitness_distance_corr | Fitness-distance correlation | [-1, 1] | ↑ navigation |
| | fitness_flattening_index | Fitness flattening index | [-1, 1] | ↑ flatter around global optimum |
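To make the epistasis categories in the table more concrete, the sketch below classifies a single double-mutant "square" (wild type, two single mutants, and the double mutant) using the common textbook definitions of magnitude, sign, and reciprocal sign epistasis. It is not the implementation of `classify_epistasis`, which may use different conventions or thresholds; the function name `classify_pair` and the tolerance `tol` are assumptions of this example.

```python
from typing import Tuple


def classify_pair(f00: float, f10: float, f01: float, f11: float,
                  tol: float = 1e-9) -> Tuple[str, str]:
    """Classify the interaction between mutations A and B.

    f00: wild type, f10: A only, f01: B only, f11: A and B together.
    Returns (kind, direction).
    """
    # Deviation of the double mutant from the additive expectation.
    eps = f11 + f00 - f10 - f01
    if abs(eps) <= tol:
        return "none", "additive"
    direction = "positive" if eps > 0 else "negative"

    # Effect of each mutation on both genetic backgrounds.
    a_on_wt, a_on_b = f10 - f00, f11 - f01
    b_on_wt, b_on_a = f01 - f00, f11 - f10
    # A mutation "flips" if it is beneficial on one background and
    # deleterious on the other (a zero effect does not count as a flip).
    a_flips = a_on_wt * a_on_b < 0
    b_flips = b_on_wt * b_on_a < 0

    if a_flips and b_flips:
        kind = "reciprocal sign"
    elif a_flips or b_flips:
        kind = "sign"
    else:
        kind = "magnitude"
    return kind, direction


# Squares taken from the toy landscape in the Quick Start.
print(classify_pair(0.10, 0.25, 0.25, 0.40))  # AAA, AAG, AGA, AGG
print(classify_pair(0.25, 0.40, 0.40, 0.91))  # GAA, GAG, GGA, GGG
```

Under these definitions the first square is additive, while the second shows positive magnitude epistasis: both mutations stay beneficial on either background, but their combined gain exceeds the sum of the individual gains.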
