AnADAMA2 User Manual

AnADAMA2 is the next generation of AnADAMA (Another Automated Data Analysis Management Application). AnADAMA is a tool to create reproducible workflows and execute them efficiently. AnADAMA operates in a make-like manner using targets and dependencies of each task to allow for parallelization. In cases where a workflow is modified or input files change, only those tasks impacted by the changes will be rerun.

Tasks can be run locally or in a grid computing environment to increase efficiency. AnADAMA2 includes meta-schedulers for SLURM and SGE grids. Tasks that are specified to run on the grid will be submitted and monitored. If a task exceeds its time or memory allotment it will be resubmitted with double the time or memory (based on which resource needs to be increased) at most three times. Benchmarking information of time, memory, and cores will be recorded for each task run on the grid.
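The resubmission rule described above can be sketched in plain Python. This is only an illustration, not AnADAMA2's scheduler code: the ``GridRequest`` class, the ``submit`` callable, and the status strings are hypothetical names invented for this example.

```python
from dataclasses import dataclass

MAX_RESUBMISSIONS = 3  # "at most three times", as stated in the text above


@dataclass
class GridRequest:
    """Hypothetical container for one grid submission's resource request."""
    time_minutes: int
    memory_mb: int


def run_with_resubmission(request, submit):
    """Submit a task, doubling whichever resource was exceeded after a failure.

    `submit` is a hypothetical callable that takes a GridRequest and returns
    "ok", "timeout", or "out_of_memory".
    Returns (final_status, list_of_requests_tried).
    """
    attempts = [request]
    status = submit(request)
    resubmissions = 0
    while status != "ok" and resubmissions < MAX_RESUBMISSIONS:
        if status == "timeout":
            request = GridRequest(request.time_minutes * 2, request.memory_mb)
        elif status == "out_of_memory":
            request = GridRequest(request.time_minutes, request.memory_mb * 2)
        else:
            break  # unknown failure: asking for more resources would not help
        attempts.append(request)
        status = submit(request)
        resubmissions += 1
    return status, attempts


def fake_submit(request):
    """Stand-in for a scheduler: this job needs 50 minutes and 3000 MB."""
    if request.memory_mb < 3000:
        return "out_of_memory"
    if request.time_minutes < 50:
        return "timeout"
    return "ok"


status, attempts = run_with_resubmission(GridRequest(30, 1000), fake_submit)
print(status, attempts[-1])
```

In this run the memory request is doubled twice and the time request once, so the task succeeds on its third and final resubmission.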

Essential information from all tasks is recorded, using the default logger and command line reporters, to ensure reproducibility. The information logged includes the command line options provided to the workflow, the function or command executed for each task, versions of tracked executables, and any output (stdout or stderr) from a task. Information reported on the command line includes the status of each task (i.e., which tasks are ready, started, or running) along with the overall workflow status, given as the percentage and total number of tasks complete. An auto-doc feature lets workflows generate documentation automatically, further ensuring reproducibility by capturing the latest essential workflow information.

AnADAMA2 was architected to be modular, allowing users to customize the application by subclassing the base grid meta-schedulers, reporters, and tracked objects (e.g., files and executables).


.. contents:: Table of Contents


Features
............

  • Captures your workflow steps along with the specific inputs, outputs, and environment used for each of your workflow runs
  • Parallel workflow execution on your local machine or in a grid compute environment
  • Ability to rerun a workflow, executing only sub-steps, based on changes in dependencies
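The idea of rerunning only the steps affected by a change can be sketched as follows. This is a simplified illustration that assumes file-based dependencies compared by content hash. The in-memory ``database`` dict and the function names are hypothetical, and AnADAMA2's own dependency tracking may work differently.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return a SHA-256 digest of the file's contents."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def needs_rerun(dependencies, targets, database):
    """Decide whether a task must run again.

    `database` is a hypothetical dict mapping each dependency path to the
    fingerprint recorded the last time the task completed.
    """
    if any(not os.path.exists(target) for target in targets):
        return True  # a target is missing, so the task has to run
    return any(database.get(dep) != fingerprint(dep) for dep in dependencies)


def record_run(dependencies, database):
    """Store the current fingerprints after a successful run."""
    for dep in dependencies:
        database[dep] = fingerprint(dep)


with tempfile.TemporaryDirectory() as folder:
    dep = os.path.join(folder, "input.txt")
    target = os.path.join(folder, "output.txt")
    database = {}

    with open(dep, "w") as handle:
        handle.write("first version\n")
    first = needs_rerun([dep], [target], database)   # target does not exist yet

    with open(target, "w") as handle:
        handle.write("result\n")
    record_run([dep], database)
    second = needs_rerun([dep], [target], database)  # nothing has changed

    with open(dep, "w") as handle:
        handle.write("second version\n")
    third = needs_rerun([dep], [target], database)   # the input was modified

print(first, second, third)
```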

Installation
................

AnADAMA2 is easy to install.

Requirements

Python 2.7+ is required. All other basic dependencies will be installed when installing AnADAMA2.

The workflow documentation feature uses `Pweave <http://mpastell.com/pweave>`_, which is installed automatically with AnADAMA2, and `Pandoc <http://pandoc.org/installing.html>`_. For workflows that use the documentation feature, `matplotlib <http://matplotlib.org/users/installing.html>`_ (version 2 or later), Pandoc (a version earlier than 2), and `LaTeX <https://www.latex-project.org/get/>`_ must be installed manually. If your document includes hclust2 heatmaps, `hclust2 <https://bitbucket.org/nsegata/hclust2/overview>`_ must also be installed.

Install

Run the following command to install AnADAMA2 and dependencies: ::

    $ pip install anadama2

Add the option --user to the install command if you do not have root permissions.
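After installing, one quick way to confirm that the module is visible to your interpreter is a check like the one below. It is a small sketch using only the standard library; the helper name ``is_installed`` is made up for this example, and ``importlib.util`` requires Python 3.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


print("anadama2 importable:", is_installed("anadama2"))
```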

Test

Once you have AnADAMA2 installed, you can optionally run the unit tests. To run the unit tests, change directories into the AnADAMA2 install folder and run the following command:

::

    $ python setup.py test

Basic Usage
...............

AnADAMA2 does not install an executable that you would run from the command line. Instead, you define your workflows as a Python script; anadama2 is a module you import and use to describe your workflow.

Definitions

Before we get started with a basic workflow, there are a few important definitions to review.

  • Workflow

    • A collection of tasks.
  • Task

    • A unit of work in the workflow.
    • A task has at least one action, zero or more targets, and zero or more dependencies.
  • Target

    • An item that is created or modified by the task (e.g., a file the task writes to).
    • All targets must exist after a task is run (they might not exist before the task is run).
  • Dependency

    • An item that is required to run the task (e.g., an input file or a string variable).
    • All dependencies of a task must exist before the task can be run.

Targets and dependencies can be of different formats. See the section on "Types of Tracked Items" for all of the different types.

A task is run by executing all of its actions once all of its dependencies exist. After a task runs, it is marked as successful if no Python exceptions were raised, no shell command returned a non-zero exit status, and all of its targets were created.
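The success rule can be expressed as a short sketch. This is not AnADAMA2's implementation: ``run_task`` is a hypothetical function, actions are assumed to be either Python callables or shell command strings, and dependencies and targets are assumed to be file paths.

```python
import os
import subprocess
import tempfile


def run_task(actions, depends, targets):
    """Run a task's actions and report whether the task succeeded.

    The task succeeds only if no action raises a Python exception, no shell
    command exits with a non-zero status, and every target exists afterwards.
    """
    missing = [dep for dep in depends if not os.path.exists(dep)]
    if missing:
        raise RuntimeError("dependencies do not exist: %s" % missing)
    for action in actions:
        try:
            if callable(action):
                action()
            elif subprocess.call(action, shell=True) != 0:
                return False  # a shell command returned a non-zero exit status
        except Exception:
            return False  # a Python exception was raised
    return all(os.path.exists(target) for target in targets)


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "out.txt")

    def write_target():
        with open(target, "w") as handle:
            handle.write("done\n")

    def fail():
        raise ValueError("simulated failure")

    succeeded = run_task([write_target], depends=[], targets=[target])
    no_target = run_task([lambda: None], depends=[], targets=[target + ".missing"])
    raised = run_task([fail], depends=[], targets=[])

print(succeeded, no_target, raised)
```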

Run a Basic Workflow

A basic workflow script named exe_check.py can be found in the examples folder of the source repository. The script gets a list of the global executables and a list of the local executables (those for the user running the script), then checks whether any global executables are also installed locally. It shows how to specify dependencies and targets directly in the commands: lines 4-6 of the example script mark targets with the format [t:file] and dependencies with the format [d:file].

To run this example simply execute the script directly:

::

    $ python exe_check.py

The contents of this script are as follows (line numbers are shown for clarity):

::

    1 from anadama2 import Workflow
    2
    3 workflow = Workflow(remove_options=["input","output"])
    4 workflow.do("ls /usr/bin/ | sort > [t:global_exe.txt]")
    5 workflow.do("ls $HOME/.local/bin/ | sort > [t:local_exe.txt]")
    6 workflow.do("join [d:global_exe.txt] [d:local_exe.txt] > [t:match_exe.txt]")
    7 workflow.go()

The first line imports AnADAMA2 and the third line creates an instance of the Workflow class removing the command line options input and output as they are not used for this workflow. These two lines are required for every AnADAMA2 workflow. Lines 4-6 add tasks to the workflow and line 7 tells AnADAMA2 to execute the tasks.
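To make the ``[t:file]`` and ``[d:file]`` markers concrete, the sketch below shows one way such a command string could be split into a plain shell command, a list of dependencies, and a list of targets. The regular expression and the ``parse_command`` helper are assumptions made for illustration; they are not AnADAMA2's own parser.

```python
import re

# Matches "[t:name]" or "[d:name]" and captures the tag letter and the name.
TAG = re.compile(r"\[(t|d):([^\]]+)\]")


def parse_command(command):
    """Split a tagged command into (shell_command, dependencies, targets)."""
    depends = [m.group(2) for m in TAG.finditer(command) if m.group(1) == "d"]
    targets = [m.group(2) for m in TAG.finditer(command) if m.group(1) == "t"]
    # Replace each tag with the bare file name to get the runnable command.
    shell_command = TAG.sub(lambda m: m.group(2), command)
    return shell_command, depends, targets


shell_command, depends, targets = parse_command(
    "join [d:global_exe.txt] [d:local_exe.txt] > [t:match_exe.txt]"
)
print(shell_command)
print(depends, targets)
```

Applied to line 6 of the example script, this yields two dependencies and one target, which is why that task can only run after the tasks on lines 4 and 5 have created their targets.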

Command Line Interface

All AnADAMA2 workflows have a command line interface that includes a few default arguments. The default arguments include an input folder, an output folder, and the number of tasks to run in parallel. See the section "Run an Intermediate Workflow" for information on how to add custom options.

For a full list of options, run your workflow script with the "--help" option.

::

    $ python exe_check.py --help
    usage: exe_check.py [options]

    AnADAMA2 Workflow
    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -j JOBS, --local-jobs=JOBS
                            The number of tasks to execute in parallel locally.
      -t TARGET, --target=TARGET
                            Only execute tasks that make these targets. Use this
                            flag multiple times to build many targets. If the
                            provided value includes ? or * or [, treat it as a
                            pattern and build all targets that match.
      -d, --dry-run         Print tasks to be run but don't execute their actions.
      -l, --deploy          Create directories used by other options
      -T EXCLUDE_TARGET, --exclude-target=EXCLUDE_TARGET
                            Don't execute tasks that make these targets. Use this
                            flag multiple times to exclude many targets. If the
                            provided value includes ? or * or [, treat it as a
                            pattern and exclude all targets that match.
      -u UNTIL_TASK, --until-task=UNTIL_TASK
                            Stop after running the named task. Can refer to the
                            end task by task number or task name.
      -e, --quit-early      If any tasks fail, stop all execution immediately. If
                            not set, children of failed tasks are not executed but
                            children of successful or skipped tasks are executed.
                            The default is to keep running until all tasks that
                            are available to execute have completed or failed.
      -g GRID, --grid=GRID  Run gridable tasks on this grid type.
      -U EXCLUDE_TASK, --exclude-task=EXCLUDE_TASK
                            Don't execute these tasks. Use this flag multiple
                            times to not execute multiple tasks.
      -i INPUT, --input=INPUT
                            Collect inputs from this directory.
      -o OUTPUT, --output=OUTPUT
                            Write output to this directory. By default the
                            dependency database and log are written to this
                            directory
      -n, --skip-nothing    Skip no tasks, even if you could; run it all.
      -J GRID_JOBS, --grid-jobs=GRID_JOBS
                            The number of tasks to submit to the grid in parallel.
                            The default setting is zero jobs will be run on the
                            grid. By default, all jobs, including gridable jobs,
                            will run locally.
      --grid-tasks GRID_TASKS
                            Settings for specific tasks on the grid (task name, time, mem, cores, partition, docker_image)
      -p GRID_PARTITION, --grid-partition=GRID_PARTITION
                            Run gridable tasks on this partition.
      --config              Find workflow configuration in this folder
                            [default: only use command line options]

Options can be provided on the command line or included in a config file with the option "--config=FILE
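The help text for ``--target`` and ``--exclude-target`` says that a value containing ``?``, ``*``, or ``[`` is treated as a pattern. The sketch below illustrates that behavior with the standard library's ``fnmatch`` module. Whether AnADAMA2 itself uses ``fnmatch`` is an assumption, and ``select_targets`` is a hypothetical helper written for this example.

```python
import fnmatch

PATTERN_CHARS = ("?", "*", "[")


def select_targets(requested, all_targets):
    """Return the targets selected by a single --target value."""
    if any(char in requested for char in PATTERN_CHARS):
        # The value looks like a pattern: keep every target that matches it.
        return [t for t in all_targets if fnmatch.fnmatchcase(t, requested)]
    # Otherwise the value names exactly one target.
    return [t for t in all_targets if t == requested]


all_targets = ["global_exe.txt", "local_exe.txt", "match_exe.txt"]
by_name = select_targets("match_exe.txt", all_targets)
by_pattern = select_targets("*l_exe.txt", all_targets)
print(by_name, by_pattern)
```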
