Noworkflow
Supporting infrastructure to run scientific experiments without a scientific workflow management system.
noWorkflow
Copyright (c) 2016 Universidade Federal Fluminense (UFF). Copyright (c) 2016 Polytechnic Institute of New York University. All rights reserved.
noWorkflow is a tool that automatically traces the provenance of a Python script without requiring changes to the original code. It lets users build and analyze a detailed history of how data was produced and transformed, which supports transparency and reliability in scientific experiments and data processes. Developed in Python, noWorkflow collects provenance with software engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, without requiring a version control system or any other external environment.
Installing and using noWorkflow is simple. Please check our installation and basic usage guidelines below, and the tutorial videos on our Wiki page.
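To give a feel for what AST analysis of a script involves, here is a minimal sketch built only on Python's standard `ast` module. It is not noWorkflow's implementation: the `SOURCE` snippet and the `summarize_definitions_and_calls` helper are hypothetical names chosen for this illustration, and a real provenance collector would record far more than function names.

```python
import ast

# Hypothetical script source, used only for this illustration.
SOURCE = '''
def scale(values, factor):
    return [v * factor for v in values]

result = scale([1, 2, 3], 2)
print(result)
'''


def summarize_definitions_and_calls(source):
    """Return (defined function names, called function names) found in source."""
    tree = ast.parse(source)
    defined, called = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defined.append(node.name)
        # Only plain-name calls such as scale(...) are recorded here;
        # method calls like obj.run(...) are ignored to keep the sketch short.
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            called.append(node.func.id)
    return sorted(defined), sorted(called)


defined, called = summarize_definitions_and_calls(SOURCE)
print("defined:", defined)
print("called:", called)
```

The script is parsed but never modified, which is the property that lets this kind of analysis work without changes to the original code.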
Team
The main noWorkflow team is composed of researchers from Universidade Federal Fluminense (UFF), in Brazil, and New York University (NYU), in the USA.
- João Felipe Pimentel (UFF) (main developer)
- Juliana Freire (NYU)
- Leonardo Murta (UFF)
- Vanessa Braganholo (UFF)
- Arthur Paiva (UFF)
Collaborators
- David Koop (University of Massachusetts Dartmouth)
- Fernando Chirigati (NYU)
- Paolo Missier (Newcastle University)
- Vynicius Pontes (UFF)
- Henrique Linhares (UFF)
- Eduardo Jandre (UFF)
- Jessé Lima (Summer of Reproducibility)
- Joshua Daniel Talahatu (Google Summer of Code)
History
The project started in 2013, when Leonardo Murta and Vanessa Braganholo were visiting professors at New York University (NYU) with Juliana Freire. At that time, David Koop and Fernando Chirigati also joined the project. They published the initial paper about noWorkflow at IPAW 2014. After going back to their home university, Universidade Federal Fluminense (UFF), Leonardo and Vanessa invited João Felipe Pimentel to join the project in 2014 for his PhD. João, Juliana, Leonardo and Vanessa integrated noWorkflow with IPython and published a paper about it at TaPP 2015. They also worked on provenance versioning and fine-grained provenance collection and published papers at IPAW 2016. During the same period, David, João, Leonardo and Vanessa worked with the YesWorkflow team on an integration between noWorkflow and YesWorkflow and published a demo at IPAW 2016. Research and development on noWorkflow continues and is currently under the responsibility of João Felipe, in the context of his PhD thesis.
Publications
- MURTA, L. G. P.; BRAGANHOLO, V.; CHIRIGATI, F. S.; KOOP, D.; FREIRE, J.; [noWorkflow: Capturing and Analyzing Provenance of Scripts.](https://github.com/gems-uff/noworkflow/raw/master/docs/ipaw2014.pdf) In: International Provenance and Annotation Workshop (IPAW), 2014, Cologne, Germany.
- PIMENTEL, J. F. N.; FREIRE, J.; MURTA, L. G. P.; BRAGANHOLO, V.; Collecting and Analyzing Provenance on Interactive Notebooks: when IPython meets noWorkflow. In: Theory and Practice of Provenance (TaPP), 2015, Edinburgh, Scotland.
- PIMENTEL, J. F.; FREIRE, J.; BRAGANHOLO, V.; MURTA, L. G. P.; Tracking and Analyzing the Evolution of Provenance from Scripts. In: International Provenance and Annotation Workshop (IPAW), 2016, McLean, Virginia.
- PIMENTEL, J. F.; FREIRE, J.; MURTA, L. G. P.; BRAGANHOLO, V.; Fine-grained Provenance Collection over Scripts Through Program Slicing. In: International Provenance and Annotation Workshop (IPAW), 2016, McLean, Virginia.
- PIMENTEL, J. F.; DEY, S.; MCPHILLIPS, T.; BELHAJJAME, K.; KOOP, D.; MURTA, L. G. P.; BRAGANHOLO, V.; LUDÄSCHER, B.; Yin & Yang: Demonstrating Complementary Provenance from noWorkflow & YesWorkflow. In: International Provenance and Annotation Workshop (IPAW), 2016, McLean, Virginia.
- PIMENTEL, J. F.; MURTA, L. G. P.; BRAGANHOLO, V.; FREIRE, J.; noWorkflow: a Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts. In: International Conference on Very Large Data Bases (VLDB), 2017, Munich, Germany.
- PONTES, V.; Reducing the Storage Overhead of the noWorkflow Content Database by using Git. Final Undergraduate Project, Sistemas de Informação, Universidade Federal Fluminense, 2018.
- OLIVEIRA, E.; Enabling Collaboration in Scientific Experiments. Master's Dissertation, Universidade Federal Fluminense, 2022.
Quick Installation
To install noWorkflow, you should follow these basic instructions. Note that these steps may install an older version of noWorkflow. To make sure you are using the newest stable version, please follow the "Alternative" installation procedure mentioned below.
With Python 3.8, use pip to install noWorkflow:
$ pip install noworkflow[all]
This installs noWorkflow, PyPosAST, SQLAlchemy, python-future, flask, IPython, Jupyter and PySWIP. The only requirements for running noWorkflow are PyPosAST, SQLAlchemy and python-future. The other libraries are only used for provenance analysis.
If you only want to install noWorkflow, PyPosAST, SQLAlchemy and python-future, run:
$ pip install noworkflow
Alternative: install the most up-to-date version of noWorkflow
If you wish to install the most up-to-date stable version of noWorkflow, you can clone our repository using Git.
$ git clone git@github.com:gems-uff/noworkflow.git
If you don't have git, just download the ZIP source code from our repository and decompress the zip file into a folder.
Then, use Python to install it. The most up-to-date version of noWorkflow works with newer versions of Python; the current version was tested with Python 3.12.4.
Go to the folder where you decompressed the files (or where you cloned the project) and then execute the following:
$ cd noworkflow-master
$ python setup.py install
$ pip install -e ".[all]"
This installs noWorkflow and its dependencies on your system.
Upgrade
To upgrade the version of a previously installed noWorkflow using pip, you should run the following command:
$ pip install --upgrade noworkflow[all]
Basic Usage
noWorkflow is transparent in the sense that it requires neither changes to the script, nor any laborious configuration. Run
$ now --help
to learn the usage options.
noWorkflow comes with a demonstration project. Follow the Wiki page to see how to extract it.
To run a script under noWorkflow, execute:
$ now run script.py
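Any ordinary Python script can play the role of `script.py` in the command above. The following is a hypothetical example written for this guide (it is not the demonstration project shipped with noWorkflow): a small simulation with two functions, no noWorkflow imports, and no annotations.

```python
# script.py: a hypothetical experiment script used only for illustration.
import random
import statistics


def simulate(samples, seed):
    """Draw `samples` pseudo-random measurements from a fixed seed."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 2.0) for _ in range(samples)]


def summarize(values):
    """Return the mean and the population standard deviation."""
    return statistics.mean(values), statistics.pstdev(values)


def main():
    values = simulate(samples=100, seed=42)
    mean, spread = summarize(values)
    print(f"mean={mean:.3f} stdev={spread:.3f}")


if __name__ == "__main__":
    main()
```

Saved as `script.py`, it runs unchanged both with `python script.py` and with `now run script.py`.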
The -v option turns the verbose mode on, so that noWorkflow gives you feedback on the steps taken by the tool. The output, in this case, is similar to what follows.
$ now run -v script.py
[now] removing noWorkflow boilerplate
[now] setting up local provenance store
[now] using content engine noworkflow.now.persistence.content.plain_engine.PlainEngine
[now] collecting deployment provenance
[now] registering environment attributes
[now] collection definition and execution provenance
[now] executing the script
[now] the execution of trial 91f4fdc7-6c36-4c9d-a43a-341eaee9b7fb finished successfully
Each new run produces a different trial that will be stored with a universally unique identifier in the relational database.
Verifying the module dependencies is a time-consuming step, and scientists can bypass it with the -b flag if they know that no library or source code has changed. The current trial then inherits the module dependencies of the previous one. To see more usage options, run "now run -h".
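To illustrate why checking module dependencies can take time, the sketch below walks every module loaded in the current interpreter and records its file path and declared version. It relies only on the standard library; the `snapshot_modules` helper is a hypothetical name, and the way noWorkflow actually collects and stores module dependencies may differ.

```python
import sys


def snapshot_modules():
    """Map each loaded module name to a (file path, version) pair.

    Either element is None when the module does not declare it,
    which is common for built-in modules.
    """
    snapshot = {}
    # Copy the items first: sys.modules can change while we iterate.
    for name, module in list(sys.modules.items()):
        namespace = getattr(module, "__dict__", {})
        snapshot[name] = (namespace.get("__file__"), namespace.get("__version__"))
    return snapshot


modules = snapshot_modules()
print(f"{len(modules)} modules loaded")
```

Even a trivial script loads many modules, and a thorough check would also need to inspect the files behind them, so skipping the step when nothing has changed can save noticeable time.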
To restore files, run:
$ now restore [trial]
By default, the restore command will restore the trial script, imported local modules and the first access to files. Use the option -s to leave out the script; the option -l to leave out modules; and the option -a to leave out file accesses. The restore command tracks the evolution history. By default, each new trial is based on the previous trial. When you restore a trial, the next trial will be based on the restored trial.
The restore command also provides a -f path option, which restores a single file. This option accepts two extra options: -t path2 specifies the target of the restored file, and -i id identifies the file. There are three ways to identify a file: by access time, by code hash, or by access number. The option -f does not affect the evolution history. To see more usage options, run "now restore -h".
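Code hashes appear in trial listings as 40 hexadecimal characters, which is consistent with a SHA-1 digest. Assuming SHA-1 over the raw file content (an assumption, since the exact hashing scheme is not stated here), a hash of that shape can be computed as in the sketch below. The `sha1_of_file` helper is a hypothetical name, not a noWorkflow function.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA-1 digest of a file's content."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file; any script or data file works the same way.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name

demo_hash = sha1_of_file(demo_path)
os.remove(demo_path)
print(demo_hash, len(demo_hash))
```

Because the digest depends only on content, two accesses to byte-identical files produce the same hash, which is what makes a hash usable as a file identifier.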
To run Git garbage collection on the content database, run:
$ now gc
Analysis
To list all trials, just run:
$ now list
Assuming we run the experiment again and then run now list, the output would be as follows.
$ now list
[now] trials available in the provenance store:
[f]Trial 7fb4ca3d-8046-46cf-9c54-54923d2076ba: run -v .\simulation.py .\data1.dat .\data2.dat
with code hash 6a28e58e34bbff0facaf55f80313ab2fd2505a58
ran from 2023-04-12 19:38:50.234485 to 2023-04-12 19:38:51.672300
duration: 0:00:01.437815
[*]Trial 01482b72-2005-4319-bd57-773291f9f7b1: run -v .\simulation.py .\data1.dat .\data2.dat

