pyfastg: a minimal Python library for parsing FASTG files

<div align="center"> <a href="https://github.com/fedarko/pyfastg/actions/workflows/main.yml"><img src="https://github.com/fedarko/pyfastg/actions/workflows/main.yml/badge.svg" alt="pyfastg CI" /></a> <a href="https://codecov.io/gh/fedarko/pyfastg"><img src="https://codecov.io/gh/fedarko/pyfastg/branch/master/graph/badge.svg" alt="Code Coverage" /></a> <a href="https://pypi.org/project/pyfastg"><img src="https://img.shields.io/pypi/v/pyfastg?color=0073b7&labelColor=003d63" alt="PyPI" /></a> <a href="https://anaconda.org/bioconda/pyfastg"><img src="https://img.shields.io/conda/vn/bioconda/pyfastg.svg?color=3eb049&labelColor=005500" alt="bioconda" /></a> </div>

The FASTG file format

FASTG is a format for describing sequencing assembly graphs. It attempts to accurately represent the ambiguity resulting from sequencing limitations, ploidy, or other factors that complicate representation of a sequence as a simple string.

As of writing, the latest version of the FASTG specification is 1.00. The original FASTG website is down, but an archived copy of the v1.00 specification remains accessible. Whenever the rest of this documentation mentions "the FASTG spec," it refers to this version of the specification.

pyfastg is a Python library designed to parse graphs that follow a subset of the FASTG spec. In particular, pyfastg is designed to work with files output by the SPAdes family of assemblers. It also now supports files output by MEGAHIT!

The pyfastg library

The pyfastg library contains parse_fastg(), a function that takes as input a path to a FASTG file. parse_fastg() reads this FASTG file and returns a NetworkX DiGraph object representing the structure of the assembly graph.

pyfastg is useful as a starting point for other applications. Using this NetworkX DiGraph object, we can do whatever we want with the assembly graph: analyze it, convert it to other formats, visualize it, etc.

Note about the graph topology

The FASTG spec contains the following sentence (in section 6, page 7):

Note also that strictly speaking, [the structure described in a FASTG file] is not a graph at all, as we have not specified a notion of vertex. However in many cases one can without ambiguity define vertices and thereby associate a bona fide digraph, and we do so frequently in this document to illustrate concepts.

We use the following approach to get around this problem: "edges" in the FASTG file will be represented as nodes in the NetworkX graph produced by pyfastg, and "adjacencies" between edges in the FASTG file will be represented as edges in the NetworkX graph produced by pyfastg.

As far as we're aware, this "conversion" from edges to nodes matches how FASTG files have often been visualized in the past.
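To make this mapping concrete, here is a minimal sketch that builds by hand the kind of DiGraph described above. It does not call pyfastg; the three edges and their adjacencies are made-up data chosen only for illustration.

```python
import networkx as nx

# Hypothetical FASTG content: edge 1 is followed by edge 2 and by the
# reverse complement of edge 3. (Made-up data, not a pyfastg test file.)
g = nx.DiGraph()

# Each FASTG "edge" becomes a node, named by its ID plus its orientation.
g.add_nodes_from(["1+", "2+", "3-"])

# Each FASTG "adjacency" becomes a directed edge between two nodes.
g.add_edge("1+", "2+")
g.add_edge("1+", "3-")

# Nodes reachable in one step from 1+
print(sorted(g.successors("1+")))
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```

Once the graph is in this form, ordinary NetworkX calls such as successors() and predecessors() answer questions about which sequences follow or precede a given sequence.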

Installation

pyfastg can be installed using pip or conda:

Installation using pip

pip install pyfastg

Installation using conda

conda install -c bioconda pyfastg

Dependencies

As of writing, pyfastg's only direct dependency (which should be installed automatically when running either of the above installation commands) is NetworkX. pyfastg requires a minimum NetworkX version of 2.

As of writing, pyfastg supports Python 3.6 and up. pyfastg might be able to work with earlier versions of Python, but we do not explicitly test against these.

Quick example: using pyfastg to load and analyze an assembly graph

The second line (which points to one of pyfastg's test assembly graphs) assumes that you're located in the root directory of the pyfastg repo.

>>> import pyfastg
>>> g = pyfastg.parse_fastg("pyfastg/tests/input/assembly_graph.fastg")
>>> # g is now a NetworkX DiGraph! We can do whatever we want with this object.
>>>
>>> # Example: List the sequences in this graph (these are "edges" in the FASTG
>>> # file, but are represented as nodes in g)
>>> g.nodes()
NodeView(('1+', '29-', '1-', '6-', '2+', '26+', '27+', '2-', '3+', '4+', '6+', '7+', '3-', '33-', '9-', '4-', '5+', '5-', '28+', '7-', '8+', '28-', '9+', '8-', '12-', '10+', '12+', '10-', '24-', '32-', '11+', '30-', '11-', '27-', '19-', '13+', '25+', '31-', '13-', '14+', '14-', '26-', '15+', '15-', '23-', '16+', '16-', '17+', '17-', '19+', '18+', '33+', '18-', '20+', '20-', '22+', '21+', '21-', '22-', '23+', '24+', '25-', '29+', '30+', '31+', '32+'))
>>>
>>> # Example: Get details for a single sequence (length, coverage, GC-content)
>>> g.nodes["15+"]
{'length': 193, 'cov': 6.93966, 'gc': 0.5492227979274611}
>>>
>>> # Example: Get information about the graph's connectivity
>>> import networkx as nx
>>> components = list(nx.weakly_connected_components(g))
>>> for c in components:
...     print(len(c), "nodes")
...     print(c)
...
33 nodes
{'8-', '17-', '15+', '30+', '16+', '26-', '25+', '19+', '7+', '23+', '14-', '18-', '10-', '29-', '20-', '27-', '11-', '5-', '3+', '2-', '12-', '13+', '31-', '6+', '1+', '21-', '24-', '32-', '22+', '28+', '4+', '33-', '9-'}
33 nodes
{'26+', '29+', '18+', '3-', '2+', '8+', '15-', '24+', '9+', '17+', '27+', '28-', '11+', '6-', '20+', '14+', '19-', '13-', '4-', '21+', '5+', '31+', '22-', '12+', '25-', '30-', '10+', '1-', '7-', '32+', '23-', '33+', '16-'}

Details about the required input file format (tl;dr: SPAdes- or MEGAHIT-dialect FASTG files only)

Currently, pyfastg is only designed to parse FASTG files created by the SPAdes or MEGAHIT assemblers. Other valid FASTG files that don't follow the formats used by SPAdes or MEGAHIT are not supported. (If you would like us to add support for a new assembler's output, please open an issue!)

Edge names

Each sequence in the file should have a name formatted like:

<table align="center"> <thead> <tr> <th>SPAdes</th> <th>MEGAHIT</th> </tr> </thead> <tbody> <tr> <td><code>EDGE_1_length_9909_cov_6.94721</code></td> <td><code>NODE_1_length_9909_cov_6.9472_ID_1</code></td> </tr> </tbody> </table>

In MEGAHIT FASTG files, these sequences are referred to as NODEs instead of EDGEs. We will keep saying "edge" throughout this documentation for the sake of simplicity.

The edge ID (here, 1) can contain the characters a-z, A-Z, and 0-9. In MEGAHIT files, there are two IDs included in each name -- one after the NODE_ and one at the very end of the name (ID_). We will only use the first one.

The edge length (here, 9909) can contain the characters 0-9.

The edge coverage (here, 6.94721) can contain the characters 0-9 and . (a period).

An edge name can optionally end with a ' character, indicating that this edge is a reverse complement. We will refer to whether or not an edge name ends with ' as its orientation: an edge name that does not end with a ' has a + orientation, and an edge name that ends with a ' has a - orientation.

All edge names in a FASTG file should be consistent with respect to a given ID and orientation. If, in a single FASTG file, pyfastg sees a reference to an edge named EDGE_1_length_9909_cov_6.94721 and also a reference to an edge named EDGE_1_length_9908_cov_6.95 (with the same ID [1] and orientation [+], but a different length and/or coverage) then it will throw an error.
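The naming rules above can be sketched with a regular expression. This is an illustration only: the pattern and the helper names parse_edge_name() and check_consistent() are hypothetical and are not pyfastg's actual implementation, which may compare lengths and coverages differently.

```python
import re

# Hypothetical pattern covering both naming styles described above; this is
# an illustration, not the regular expression pyfastg uses internally.
EDGE_NAME = re.compile(
    r"^(?:EDGE|NODE)_(?P<id>[a-zA-Z0-9]+)"
    r"_length_(?P<length>[0-9]+)"
    r"_cov_(?P<cov>[0-9.]+)"
    r"(?:_ID_[a-zA-Z0-9]+)?"  # MEGAHIT's second ID, which is ignored
    r"(?P<rc>')?$"            # optional reverse-complement marker
)


def parse_edge_name(name):
    """Split an edge name into (node name, length, coverage)."""
    match = EDGE_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognized edge name: {name!r}")
    orientation = "-" if match.group("rc") else "+"
    node_name = match.group("id") + orientation
    return node_name, int(match.group("length")), float(match.group("cov"))


def check_consistent(names):
    """Raise ValueError if one ID and orientation has conflicting details."""
    seen = {}
    for name in names:
        node_name, length, cov = parse_edge_name(name)
        # setdefault() stores the first (length, cov) seen for this node
        # and returns it, so later mismatches are detected here.
        if seen.setdefault(node_name, (length, cov)) != (length, cov):
            raise ValueError(f"Inconsistent details for edge {node_name}")
    return seen


print(parse_edge_name("EDGE_1_length_9909_cov_6.94721"))
print(parse_edge_name("NODE_1_length_9909_cov_6.9472_ID_1"))
```

Under these assumptions, passing both EDGE_1_length_9909_cov_6.94721 and EDGE_1_length_9908_cov_6.95 to check_consistent() raises a ValueError, mirroring the error case described above.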

Edge declaration lines

Here, we refer to each line starting with > as an edge declaration. An edge's sequence is given in the line(s) following its edge declaration, up to the next edge declaration. The declaration line itself may also list the outgoing adjacencies from this edge to other edges. For example, the line

>EDGE_1_length_5_cov_10:EDGE_2_length_3_cov_1,EDGE_3_length_6_cov_2.5',EDGE_4_length_8_cov_5.1;

indicates that the edge EDGE_1_length_5_cov_10 has three outgoing adjacencies: to the edges EDGE_2_length_3_cov_1, EDGE_3_length_6_cov_2.5', and EDGE_4_length_8_cov_5.1. This line would thus result in three "edges" being created in the NetworkX graph produced by pyfastg: (1+ → 2+), (1+ → 3-), and (1+ → 4+).

Each edge declaration must end with a ; character (after removing trailing whitespace). Section 15 of the FASTG spec mentions that having a newline after the semicolon isn't required, but we require it here for the sake of simplicity.
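As a rough sketch of how a declaration line maps to graph edges, the snippet below splits one line into its source edge and outgoing adjacencies. The helpers parse_declaration() and to_node_name() are hypothetical names written for this example; they are not part of pyfastg's API, and they skip the name validation a real parser would need.

```python
def to_node_name(edge_name):
    """Turn an edge name into ID plus orientation, e.g. "3-"."""
    edge_id = edge_name.split("_")[1]  # the field right after EDGE/NODE
    return edge_id + ("-" if edge_name.endswith("'") else "+")


def parse_declaration(line):
    """Return (source node, list of target nodes) for one ">" line."""
    line = line.rstrip()  # trailing whitespace is ignored
    if not line.startswith(">"):
        raise ValueError("An edge declaration must start with '>'")
    if not line.endswith(";"):
        raise ValueError("An edge declaration must end with ';'")
    body = line[1:-1]  # drop the leading ">" and the trailing ";"
    source, _, targets = body.partition(":")
    # No ":" means this edge has no outgoing adjacencies.
    outgoing = targets.split(",") if targets else []
    return to_node_name(source), [to_node_name(t) for t in outgoing]


line = (
    ">EDGE_1_length_5_cov_10:EDGE_2_length_3_cov_1,"
    "EDGE_3_length_6_cov_2.5',EDGE_4_length_8_cov_5.1;"
)
source, targets = parse_declaration(line)
for target in targets:
    print(source, "->", target)
```

For the example line above, this produces the same three pairs listed earlier: (1+, 2+), (1+, 3-), and (1+, 4+).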

Edge sequences

We assume that each sequence (the line(s) between edge declarations) consists only of the characters A, C, G, T, or U. So, more complex types of strings (e.g. the "stuffed gaps" described in the FASTG spec) are not allowed in an edge's sequence.

Additionally, lowercase characters or degenerate nucleotides are not allowed; this matches section 15 of the FASTG spec. The FASTG spec doesn't explicitly allow for uracil (U), but we allow it anyway in order to support RNA sequences. (U and T are allowed to be contained in the same sequence, in the unlikely case that this is needed.)

Leading and trailing whitespace in sequence lines will be ignored, as will blank lines within a sequence. So, something like

>EDGE_1_length_4_cov_100;
    ATC

 G

is technically valid: this sequence is read as ATCG. However, the following example:

>EDGE_1_length_4_cov_100;
ATC G

is not valid and will cause pyfastg to throw an error. This is because the inner space between the C and the G would be read as part of the sequence.
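The whitespace and alphabet rules above can be sketched as follows. The function join_sequence_lines() is a hypothetical helper written for this example, not pyfastg's own code; it only illustrates why the first example is accepted and the second is rejected.

```python
ALLOWED = set("ACGTU")  # uppercase DNA/RNA characters only


def join_sequence_lines(lines):
    """Join the lines between two edge declarations into one sequence."""
    parts = []
    for line in lines:
        stripped = line.strip()  # leading/trailing whitespace is ignored
        if not stripped:
            continue  # blank lines within a sequence are skipped
        bad = set(stripped) - ALLOWED
        if bad:
            # An inner space lands here, as do lowercase or degenerate
            # nucleotides such as "a" or "N".
            raise ValueError(f"Invalid sequence characters: {sorted(bad)}")
        parts.append(stripped)
    return "".join(parts)


print(join_sequence_lines(["    ATC", "", " G", ""]))

try:
    join_sequence_lines(["ATC G"])
except ValueError as err:
    print("Rejected:", err)
```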

Details about the output NetworkX graph

Node names and attributes

Nodes in the returned DiGraph (corresponding to edges in the FASTG file) will contain three attribute fields:

length: the length of the
