pyProCT
pyProCT is an open source cluster analysis software package especially adapted to jobs related to structural proteomics. Its approach allows users to define a clustering goal (a "clustering hypothesis") based on their domain knowledge. This hypothesis guides the software in finding the algorithm and parameters (including the number of clusters) that best fulfill their expectations, so users do not need to treat cluster analysis algorithms as a black box, which will (hopefully) improve their results. pyProCT not only generates a clustering; it also implements some common use cases such as the extraction of representatives or trajectory redundancy elimination.
<img src="img/cite.png"></img> If you plan to use pyProCT or any of its parts, including its documentation, in a scientific article,
please consider adding the following citation:
J. Chem. Theory Comput., 2014, 10 (8), pp 3236–3243
<img src="img/warning.png"></img> The pyProCT README and docs are currently a bit outdated (some new functionalities and changes are missing). If you find something is not working as expected, just send a mail to victor.gil.sepulveda@gmail.com and I will try to answer (and update the part you need) as fast as I can.
- pyProCT
- Documentation
- TODO
Installation
pyProCT is quite easy to install using pip. Just write:
> sudo pip install pyProCT
And pip will take care of all the dependencies (shown below).
<img src="img/dependencies.png"> </img>
<img src="img/warning.png"></img> It is recommended to install NumPy and SciPy using your OS software manager before starting the installation. You can try to download and install them manually if you dare.
<img src="img/warning.png"></img> mpi4py is pyProCT's last dependency. It can be troublesome to install on some operating systems such as SUSE. If the installation of this last package is not successful, pyProCT can still work in Serial and Parallel (multiprocessing-based) modes.
Using pyProCT as a standalone program
The preferred way to use pyProCT is through a JSON "script" that describes the clustering task. It can be executed using the following line in your shell:
> python -m pyproct.main script.json
The JSON script has 4 main parts, each one dealing with a different aspect of the clustering pipeline. These sections are:
- "global": Handles workspace and scheduler parameterization.
- "data": Handles distance matrix parameterization.
- "clustering": Handles algorithms and evaluation parameterization.
- "postprocessing": Handles what to do with the clustering we have calculated.
{
"global":{},
"data":{},
"clustering":{},
"postprocessing":{}
}
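Since the script is plain JSON, it can also be generated programmatically. A minimal sketch in Python (the four empty sections mirror the skeleton above; the file name `script.json` matches the invocation shown earlier):

```python
import json

# Build the four-section pyProCT script skeleton and save it as script.json,
# ready to be filled in and passed to "python -m pyproct.main script.json".
script = {
    "global": {},
    "data": {},
    "clustering": {},
    "postprocessing": {},
}

with open("script.json", "w") as handle:
    json.dump(script, handle, indent=4)
```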
Global
{
"control": {
"scheduler_type": "Process/Parallel",
"number_of_processes": 4
},
"workspace": {
"tmp": "tmp",
"matrix": "matrix",
"clusterings": "clusterings",
"results": "results",
"base": "/home/john/ClusteringProject"
}
}
This is an example of a "global" section. It describes the work environment (workspace) and the type of scheduler that will be built. Defining the subfolders of the workspace is not mandatory, but it may be convenient in some scenarios (for instance, in serial multiple-clustering projects, sharing the tmp folder would lower disk usage, as it is overwritten at each step).
This is a valid global section using a serial scheduler and default names for workspace inner folders:
{
"control": {
"scheduler_type": "Serial"
},
"workspace": {
"base": "/home/john/ClusteringProject"
}
}
pyProCT allows the use of 3 different schedulers that help to improve the overall performance of the software by parallelizing some parts of the code. The available schedulers are "Serial", "Process/Parallel" (uses Python's multiprocessing) and "MPI/Parallel" (uses MPI through the module mpi4py).
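As a rough illustration of how the "control" section drives scheduling, here is a small validation sketch (the helper `workers_for` is hypothetical, not part of pyProCT's API), assuming "Serial" always means a single worker:

```python
# The three scheduler names accepted by pyProCT's "control" section.
VALID_SCHEDULERS = {"Serial", "Process/Parallel", "MPI/Parallel"}

def workers_for(control):
    """Return the number of workers implied by a "control" section.

    Hypothetical helper: rejects unknown scheduler names and assumes
    "Serial" uses one worker while the parallel schedulers honor
    "number_of_processes".
    """
    scheduler = control.get("scheduler_type", "Serial")
    if scheduler not in VALID_SCHEDULERS:
        raise ValueError("Unknown scheduler_type: %s" % scheduler)
    if scheduler == "Serial":
        return 1
    return control.get("number_of_processes", 1)
```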
#### Workspace parameters
The workspace structure accepts two parameters that modify the way the workspace is created (and cleared):
- "overwrite": The contents of existing folders will be removed before executing.
- "clear_after_exec": An array containing the folders that must be removed after execution.
Example:
"workspace": {
"base": "/home/john/ClusteringProject",
"parameters":{
"overwrite": true,
"clear_after_exec":["tmp","clusterings"]
}
}
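Combining the "base" folder with the default subfolder names gives the on-disk layout; a sketch of that resolution (the helper `resolve_workspace` is an assumption for illustration, not pyProCT's actual code):

```python
import os

# Default subfolder names, as shown in the "global" example above.
DEFAULT_SUBFOLDERS = {
    "tmp": "tmp",
    "matrix": "matrix",
    "clusterings": "clusterings",
    "results": "results",
}

def resolve_workspace(workspace):
    """Join the workspace "base" with each (possibly overridden) subfolder."""
    base = workspace["base"]
    return {
        key: os.path.join(base, workspace.get(key, default))
        for key, default in DEFAULT_SUBFOLDERS.items()
    }

paths = resolve_workspace({"base": "/home/john/ClusteringProject"})
# e.g. paths["tmp"] is "/home/john/ClusteringProject/tmp"
```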
Data
The "data" section defines how pyProCT must build the distance matrix that will be used by the clustering algorithms. Currently pyProCT offers three options to build that matrix: "load", "rmsd" and "distance".
- "rmsd": Calculates an all-vs-all RMSD matrix using any of the available pyRMSD calculators. It can calculate the RMSD of the fitted region (defined by a ProDy-compatible selection string in fit_selection), or one selection can be used to superimpose and another to calculate the RMSD (calc_selection).
There are two extra parameters that must be considered when building an RMSD matrix.
- "type": This property can have two values: "COORDINATES" or "DIHEDRALS". If DIHEDRALS is chosen, each element (i,j) of the distance matrix will be the RMSD of the arrays containing the phi-psi dihedral angle series of conformation i and j.
- "chain_map": If set to true, pyProCT will try to reorder the chains of the biomolecule in order to minimize the global RMSD value. This means it will correctly calculate the RMSD even if chain coordinates were permuted in some way. The price to pay is an increase in calculation time (directly proportional to the number of chains, or the number of chains having the same length).
- "distance": After superimposing the selected region, it calculates the all-vs-all distances of the geometrical center of the region of interest (body_selection).
- "load": Loads a precalculated matrix.
JSON chunk needed to generate an RMSD matrix from two trajectories:
{
"type": "pdb_ensemble",
"files": [
"A.pdb",
"B.pdb"
],
"matrix": {
"method": "rmsd",
"parameters": {
"calculator_type": "QCP_OMP_CALCULATOR",
"fit_selection": "backbone"
},
"image": {
"filename": "matrix_plot"
},
"filename":"matrix"
}
}
JSON chunk to generate a dihedral-angles RMSD matrix from one trajectory:
{
"type": "pdb_ensemble",
"files": [
"A.pdb"
],
"matrix": {
"method": "rmsd",
"parameters": {
"type":"DIHEDRAL"
},
"image": {
"filename": "matrix_plot"
},
"filename":"matrix"
}
}
The matrix can be stored if the filename property is defined. The matrix can also be stored as an image if the image property is defined.
pyProCT can currently load pdb and dcd files. The details of the files to load must be written into the array under the "files" keyword. There are several ways of telling pyProCT which files have to be loaded, and they can be combined in any way you like:
1 - Using a list of file paths. If a file's extension is ".txt" or ".list", it will be treated as a pdb list file. Each line of such a file is a pdb path, or a pdb path and a selection string separated by a comma:
A.pdb, name CA
B.pdb
C.pdb, name CA
...
2 - Using a list of file objects:
{
"file": ... ,
"base_selection": ...
}
Where base_selection is a ProDy-compatible selection string. Loading files this way can help in cases where not all files contain structures with the same number of atoms: base_selection should define the common region between them (if a 1-to-1 mapping does not exist, the RMSD calculation will be wrong).
3 - Only for dcd files:
{
"file": ...,
"atoms_file": ...,
"base_selection": ...
}
Where atoms_file is a pdb file with at least one frame that holds the atomic information needed by the dcd file.
Note: data.type is currently unused.
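The pdb list format from option 1 above is easy to parse; a sketch (the helper `parse_pdb_list` is hypothetical, not part of pyProCT's public API), assuming one path plus an optional comma-separated selection per line:

```python
def parse_pdb_list(lines):
    """Parse pdb list file lines into (path, selection) pairs.

    Each non-empty line holds a pdb path, optionally followed by a comma
    and a ProDy-style selection string (e.g. "A.pdb, name CA").
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if "," in line:
            path, selection = line.split(",", 1)
            entries.append((path.strip(), selection.strip()))
        else:
            entries.append((line, None))
    return entries

# The example lines from the README:
entries = parse_pdb_list(["A.pdb, name CA", "B.pdb", "C.pdb, name CA"])
```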
Clustering
The clustering section specifies how the clustering exploration will be done. It is divided into 3 subsections:
{
"generation": {
"method": "generate"
},
"algorithms": {
...
},
"evaluation": {
...
}
}
Generation
Defines how to generate the clustering ("load" or "generate"). If "load" is chosen, this section will also contain the clusters to be used, in the "clusters" property. Ex.:
{
"clustering": {
"generation": {
"method" : "load",
"clusters": [
{
"prototype" : 16,
"id": "cluster_00",
"elements" : "9, 14:20"
},
{
"prototype": 7,
"id": "cluster_01",
"elements": "0:8, 10:14, 21"
}
]
}
}
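The "elements" strings above use a compact range notation. A sketch of how such a string could be expanded into element indices (the helper and the inclusive-range reading are my assumptions, not pyProCT's documented behavior):

```python
def expand_elements(spec):
    """Expand an elements string such as "9, 14:20" into sorted indices.

    Assumption: an "a:b" range includes both endpoints; this inclusiveness
    is not documented and may differ from pyProCT's actual parsing.
    """
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if ":" in token:
            start, end = (int(part) for part in token.split(":"))
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(token))
    return sorted(indices)

# expand_elements("9, 14:20") -> [9, 14, 15, 16, 17, 18, 19, 20]
```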
Algorithms
If clustering.generation.method equals "generate", this section defines the algorithms to be used in the clustering exploration (and, optionally, their parameters).