# Refine.bio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
This README covers building and running the refine.bio project from source. If you simply want to use the service, visit the website or read the documentation instead.
Refine.bio currently has four sub-projects contained within this repo:

- `common`: contains code needed by both `foreman` and `workers`.
- `foreman`: discovers data to download/process and manages jobs.
- `workers`: runs Downloader and Processor jobs.
- `infrastructure`: manages infrastructure for Refine.bio.
## Development

### Git Workflow
refinebio uses a feature-branch-based workflow. New features should be developed on new feature branches, and pull requests should be sent to the `dev` branch for code review. Merges into `master` happen at the end of sprints, and tags in `master` correspond to production releases.
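The workflow above can be sketched with plain git commands in a throwaway repository (the branch and commit names below are illustrative examples, not project conventions beyond `dev` and `master`):

```shell
# Sketch of the feature-branch workflow in a temporary repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "initial commit"
git branch -m dev                        # day-to-day development happens on dev
git checkout -q -b feature/add-widget    # new work starts on a feature branch
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "add widget"
git checkout -q dev                      # a pull request merges the feature back into dev
git merge -q --no-ff -m "Merge feature/add-widget" feature/add-widget
git rev-list --count dev                 # prints 3: initial, feature, and merge commits
```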
## Installation
To run Refine.bio locally, you will need to have the prerequisites installed on your local machine. These vary depending on whether you are developing on a Mac or a Linux machine. The Linux instructions have been tested on Ubuntu 16.04 and later; other Linux distributions should also be able to run the necessary services. Microsoft Windows is currently unsupported by this project.
Note: The `install_all.sh` script will configure a git pre-commit hook to auto-format your Python code. This will format your code in the same way as the rest of the project, allowing it to pass our linting check.
### Automatic
The easiest way to run Refine.bio locally is to run `./scripts/install_all.sh` to install all of the necessary dependencies. As long as you are using a recent version of Ubuntu or macOS it should work. If you are using another version of Linux, it should still install most of the dependencies as long as you supply an appropriate `INSTALL_CMD` environment variable, but some dependencies may be named differently in your package manager than in Ubuntu's.
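For example, on a Fedora-based system you might point the installer at `dnf`. (How `install_all.sh` consumes the variable is an assumption here; package names may still differ from Ubuntu's.)

```shell
# INSTALL_CMD is an ordinary environment variable read by the install script.
# Hypothetical Fedora example: the script would use it as the package-install command.
INSTALL_CMD='sudo dnf install -y'
export INSTALL_CMD
echo "$INSTALL_CMD jq"    # the kind of command the script would end up running
# To actually run the installer with it:
#   INSTALL_CMD='sudo dnf install -y' ./scripts/install_all.sh
```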
### Linux (Manual)
The following services will need to be installed:

- Python3 and Pip: `sudo apt-get -y install python3-pip`
- Docker: be sure to follow the post-installation steps so Docker does not need `sudo` permissions.
- Terraform
- black: `pip3 install black`
- jq
- iproute2
- shellcheck

Instructions for installing Docker and Terraform can be found on each service's homepage. `jq`, `iproute2`, and `shellcheck` can be installed via `sudo apt-get install jq iproute2 shellcheck`.
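After installing, a quick loop can confirm that each required tool is visible on your `PATH` (a generic check, not part of the project's scripts):

```shell
# Report which of the required tools are installed and visible on PATH.
for tool in python3 pip3 docker terraform jq black shellcheck; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```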
### Mac (Manual)
The following services will need to be installed:

- Docker
- Homebrew

Instructions for installing Docker and Homebrew can be found on their respective homepages. Once Homebrew is installed, the other required applications can be installed by running: `brew install iproute2mac terraform jq black shellcheck`.

Many of the computational processes are very memory intensive. You will need to raise the amount of virtual memory available to Docker from the default of 2GB to 12GB, or 24GB if possible.
### Virtual Environment
Run `./scripts/create_virtualenv.sh` to set up the virtualenv. It will activate the `dr_env` environment for you the first time. This virtualenv is valid for the entire refinebio repo. Sub-projects each have their own environments managed by their containers. When returning to this project, run `source dr_env/bin/activate` to reactivate the virtualenv.
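Under the hood this is a standard Python virtualenv. A generic sketch of the lifecycle, using a temporary path rather than the project's `dr_env` (the assumption that `create_virtualenv.sh` wraps the `venv` module is ours):

```shell
# Generic virtualenv lifecycle; the project script creates dr_env in the repo root.
python3 -m venv --without-pip /tmp/demo_env   # create the environment
. /tmp/demo_env/bin/activate                  # activate: python now resolves inside it
command -v python                             # prints /tmp/demo_env/bin/python
deactivate                                    # leave the virtualenv
```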
### Services
refinebio also depends on Postgres, which can be run in a local Docker container.

#### Postgres
To start a local Postgres server in a Docker container, use:

```shell
./scripts/run_postgres.sh
```

Then, to initialize the database, run:

```shell
./scripts/install_db_docker.sh
```

If you need to access a psql shell for inspecting the database, you can use:

```shell
./scripts/run_psql_shell.sh
```

or, if you have psql installed, this command will give you a better shell experience:

```shell
source scripts/common.sh && PGPASSWORD=mysecretpassword psql -h $(get_docker_db_ip_address) -U postgres -d data_refinery
```
#### Common Dependencies

The `common` sub-project contains common code which is depended upon by the other sub-projects. So before anything else, you should prepare the distribution directory `common/dist` with this script:

```shell
./scripts/update_models.sh
```
(Note: This step requires the Postgres container to be running and initialized.)

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`, then re-run the migrations.
#### ElasticSearch

One of the API endpoints is powered by ElasticSearch. ElasticSearch must be running for this functionality to work. A local ElasticSearch instance can be run in a Docker container with:

```shell
./scripts/run_es.sh
```

And then the ES indexes (akin to Postgres databases) can be created with:

```shell
./scripts/rebuild_es_index.sh
```
## Testing

To run the entire test suite:

```shell
./scripts/run_all_tests.sh
```

(Note: Running all the tests can take some time, especially the first time, because it downloads a lot of files.)
For more granular testing, you can just run the tests for specific parts of the system.
### API

To just run the API tests:

```shell
./api/run_tests.sh
```

### Common

To just run the common tests:

```shell
./common/run_tests.sh
```

### Foreman

To just run the foreman tests:

```shell
./foreman/run_tests.sh
```

### Workers

To just run the workers tests:

```shell
./workers/run_tests.sh
```

If you only want to run tests with a specific tag, you can do that too. For example, to run just the salmon tests:

```shell
./workers/run_tests.sh -t salmon
```
All of our worker tests are tagged, generally based on the Docker image required to run them. Possible values for worker test tags are:

- `affymetrix`
- `agilent`
- `downloaders`
- `illumina`
- `no_op`
- `qn` (short for quantile normalization)
- `salmon`
- `smasher`
- `transcriptome`
## Style
R files in this repo follow Google's R Style Guide. Python files in this repo follow PEP 8. All files (including Python and R) have a line length limit of 100 characters.

In addition to following PEP 8, Python files must also conform to the formatting style enforced by black.
black is a highly opinionated auto-formatter. (black's highly opinionated style is a strict subset of PEP 8, so code formatted by black stays PEP 8 compliant.)
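To check a file by hand against black's style (the `--check` and `--line-length` flags are part of black's CLI; the sample file below is a throwaway example):

```shell
# Write a file that violates black's style, then ask black whether it would reformat it.
printf 'x=1\n' > /tmp/example_module.py
black --check --line-length 100 /tmp/example_module.py 2>/dev/null \
  || echo "black would reformat this file"   # --check exits nonzero when changes are needed
```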
