# Refine.bio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
This README covers building and running the refine.bio project from source. If you simply want to use the service, visit the website or read the documentation instead.
Refine.bio currently has four sub-projects contained within this repo:

- `common`: contains code needed by both `foreman` and `workers`.
- `foreman`: discovers data to download/process and manages jobs.
- `workers`: runs Downloader and Processor jobs.
- `infrastructure`: manages infrastructure for Refine.bio.
## Development

### Git Workflow
refinebio uses a feature-branch-based workflow. New features should be developed on new feature branches, and pull requests should be sent to the `dev` branch for code review. Merges into `master` happen at the end of sprints, and tags in `master` correspond to production releases.
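The workflow above can be sketched with plain git commands in a throwaway repository (the branch and commit names below are illustrative examples, not project conventions beyond `dev` and `master`):

```shell
# Sketch of the feature-branch workflow in a temporary repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "initial commit"
git branch -m dev                        # day-to-day development happens on dev
git checkout -q -b feature/add-widget    # new work starts on a feature branch
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "add widget"
git checkout -q dev                      # a pull request merges the feature back into dev
git merge -q --no-ff -m "Merge feature/add-widget" feature/add-widget
git rev-list --count dev                 # prints 3: initial, feature, and merge commits
```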
## Installation
To run Refine.bio locally, you will need to have the prerequisites installed on your local machine. These vary depending on whether you are developing on a Mac or a Linux machine. The Linux instructions have been tested on Ubuntu 16.04 and later; other Linux distributions should also be able to run the necessary services. Microsoft Windows is currently unsupported by this project.
Note: The `install_all.sh` script will configure a git pre-commit hook to auto-format your Python code. This will format your code in the same way as the rest of the project, allowing it to pass our linting check.
### Automatic
The easiest way to run Refine.bio locally is to run `./scripts/install_all.sh` to install all of the necessary dependencies. As long as you are using a recent version of Ubuntu or macOS it should work. If you are using another version of Linux, it should still install most of the dependencies as long as you supply an appropriate `INSTALL_CMD` environment variable, but some dependencies may be named differently in your package manager than in Ubuntu's.
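For example, on a Fedora-based system you might point the installer at `dnf`. (How `install_all.sh` consumes the variable is an assumption here; package names may still differ from Ubuntu's.)

```shell
# INSTALL_CMD is an ordinary environment variable read by the install script.
# Hypothetical Fedora example: the script would use it as the package-install command.
INSTALL_CMD='sudo dnf install -y'
export INSTALL_CMD
echo "$INSTALL_CMD jq"    # the kind of command the script would end up running
# To actually run the installer with it:
#   INSTALL_CMD='sudo dnf install -y' ./scripts/install_all.sh
```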
### Linux (Manual)
The following services will need to be installed:

- Python3 and Pip: `sudo apt-get -y install python3-pip`
- Docker: be sure to follow the post-installation steps so Docker does not need `sudo` permissions.
- Terraform
- black: `pip3 install black`
- jq
- iproute2
- shellcheck

Instructions for installing Docker and Terraform can be found on each service's homepage. `jq`, `iproute2`, and `shellcheck` can be installed via `sudo apt-get install jq iproute2 shellcheck`.
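After installing, a quick loop can confirm that each required tool is visible on your `PATH` (a generic check, not part of the project's scripts):

```shell
# Report which of the required tools are installed and visible on PATH.
for tool in python3 pip3 docker terraform jq black shellcheck; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```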
### Mac (Manual)
The following services will need to be installed:

- Docker
- Homebrew

Instructions for installing Docker and Homebrew can be found on their respective homepages. Once Homebrew is installed, the other required applications can be installed by running: `brew install iproute2mac terraform jq black shellcheck`.

Many of the computational processes are very memory intensive. You will need to raise the amount of virtual memory available to Docker from the default of 2GB to 12GB, or 24GB if possible.
### Virtual Environment
Run `./scripts/create_virtualenv.sh` to set up the virtualenv. It will activate the `dr_env` environment for you the first time. This virtualenv is valid for the entire refinebio repo. Sub-projects each have their own environments managed by their containers. When returning to this project, run `source dr_env/bin/activate` to reactivate the virtualenv.
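Under the hood this is a standard Python virtualenv. A generic sketch of the lifecycle, using a temporary path rather than the project's `dr_env` (the assumption that `create_virtualenv.sh` wraps the `venv` module is ours):

```shell
# Generic virtualenv lifecycle; the project script creates dr_env in the repo root.
python3 -m venv --without-pip /tmp/demo_env   # create the environment
. /tmp/demo_env/bin/activate                  # activate: python now resolves inside it
command -v python                             # prints /tmp/demo_env/bin/python
deactivate                                    # leave the virtualenv
```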
### Services
refinebio also depends on Postgres, which can be run in a local Docker container.

#### Postgres
To start a local Postgres server in a Docker container, use:

```shell
./scripts/run_postgres.sh
```

Then, to initialize the database, run:

```shell
./scripts/install_db_docker.sh
```

If you need to access a psql shell for inspecting the database, you can use:

```shell
./scripts/run_psql_shell.sh
```

or, if you have psql installed, this command will give you a better shell experience:

```shell
source scripts/common.sh && PGPASSWORD=mysecretpassword psql -h $(get_docker_db_ip_address) -U postgres -d data_refinery
```
#### Common Dependencies

The `common` sub-project contains common code which is depended upon by the other sub-projects. So before anything else, you should prepare the distribution directory `common/dist` with this script:

```shell
./scripts/update_models.sh
```
(Note: This step requires the Postgres container to be running and initialized.)

Note: there is a small chance this might fail with a `can't stat` error. If this happens, you have to manually change permissions on the volumes directory with `sudo chmod -R 740 volumes_postgres`, then re-run the migrations.
#### ElasticSearch

One of the API endpoints is powered by ElasticSearch. ElasticSearch must be running for this functionality to work. A local ElasticSearch instance can be run in a Docker container with:

```shell
./scripts/run_es.sh
```

And then the ES indexes (akin to Postgres databases) can be created with:

```shell
./scripts/rebuild_es_index.sh
```
## Testing

To run the entire test suite:

```shell
./scripts/run_all_tests.sh
```

(Note: Running all the tests can take some time, especially the first time, because it downloads a lot of files.)
For more granular testing, you can just run the tests for specific parts of the system.
### API

To just run the API tests:

```shell
./api/run_tests.sh
```

### Common

To just run the common tests:

```shell
./common/run_tests.sh
```

### Foreman

To just run the foreman tests:

```shell
./foreman/run_tests.sh
```

### Workers

To just run the workers tests:

```shell
./workers/run_tests.sh
```

If you only want to run tests with a specific tag, you can do that too. For example, to run just the salmon tests:

```shell
./workers/run_tests.sh -t salmon
```
All of our worker tests are tagged, generally based on the Docker image required to run them. Possible values for worker test tags are:

- `affymetrix`
- `agilent`
- `downloaders`
- `illumina`
- `no_op`
- `qn` (short for quantile normalization)
- `salmon`
- `smasher`
- `transcriptome`
## Style
R files in this repo follow Google's R Style Guide. Python files in this repo follow PEP 8. All files (including Python and R) have a line length limit of 100 characters.

In addition to following PEP 8, Python files must also conform to the formatting style enforced by black.
black is a highly opinionated auto-formatter. (black's highly opinionated style is a strict subset of PEP 8, so code formatted by black stays PEP 8 compliant.)
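To check a file by hand against black's style (the `--check` and `--line-length` flags are part of black's CLI; the sample file below is a throwaway example):

```shell
# Write a file that violates black's style, then ask black whether it would reformat it.
printf 'x=1\n' > /tmp/example_module.py
black --check --line-length 100 /tmp/example_module.py 2>/dev/null \
  || echo "black would reformat this file"   # --check exits nonzero when changes are needed
```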
