Finder
A fully automated gene annotator from RNA-Seq expression data
Welcome to finder2
MORE DETAILS COMING
finder is a gene annotation pipeline that automates downloading short reads, aligning them, and using the assembled transcripts to generate gene annotations. It also uses protein sequences and reports gene predictions from BRAKER2. finder is fast, scalable and platform independent, accepts its inputs through a command line interface, and writes gene annotations in GTF format. It finds several novel genes and transcripts and reports the tissues or conditions in which they were found.

finder is released as a docker image. Users need python3 installed on their system to run it. The header script creates either a docker container or a singularity container, depending on what is installed on the system, with preference given to docker.
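The container preference described above can be pictured with a minimal sketch. This is not the header script's actual code; the function name `choose_framework` is hypothetical, and the only behaviour taken from the text is "use docker if it is installed, otherwise singularity".

```python
import shutil


def choose_framework(which=shutil.which):
    """Return 'docker' when it is on PATH, otherwise 'singularity'.

    Hypothetical helper: it only illustrates the stated preference order.
    `which` is injectable so the logic can be exercised without either
    tool being installed.
    """
    for name in ("docker", "singularity"):
        if which(name) is not None:
            return name
    raise RuntimeError("Neither docker nor singularity was found on PATH")
```

Calling `choose_framework()` with no arguments checks the real PATH of the current machine.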
If you use finder for your research please cite:
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z Sen, Roger P Wise, and Carson M Andorf. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics
Installation
finder requires a number of software packages to be installed, which might cause version conflicts with software already installed on your system. Hence, the developers have decided to enforce the use of finder within a conda environment.
Installing finder from GitHub
git clone https://github.com/sagnikbanerjee15/Finder.git
Downloading finder from release (Latest stable version)
wget https://github.com/sagnikbanerjee15/Finder/archive/refs/tags/finder_v1.1.0.tar.gz
tar -xvzf finder_v1.1.0.tar.gz
cd finder_v1.1.0
echo "export PATH=\$PATH:$(pwd)" >> ~/.bashrc
source ~/.bashrc
You can run finder using the command outlined in [this section](#running-finder). When the run_finder command is executed, it pulls the latest docker image from Docker Hub. Depending on what is installed, the program creates either a docker or a singularity container and executes the main program inside it. If you wish to build the docker image locally, execute the following command:
docker build -t sagnikbanerjee15/finder:1.1.0 .
Please remember to add proxies if you are on a VPN.
finder runs BRAKER2, which depends on GeneMark-ET. GeneMark-ET is hosted on the Georgia Institute of Technology website. Its license prohibits redistribution, which is why it could not be included in this package. Users therefore have to download the software manually and provide its path as input to finder. Please follow the instructions below to download the software and the key:
- Open a browser of your choice
- Go to this website
- Select the option GeneMark-ES/ET/EP ver 4.62_lic (2<sup>nd</sup> from top) and LINUX 64
- Enter your name, institution, country and email-id and click on the button that says I agree to the terms of this license agreement
- Right click on the link that says Please download program here and select Copy Link Address
- Then type in wget and paste the path you just copied
- This command will download the file gmes_linux_64.tar.gz in the current directory
- Now, right click on the link that says 64_bit and select Copy Link Address
- Then type in wget and paste the path you just copied
- This command will download the file gm_key_64.tar.gz in the current directory. Please note that this key will expire one year after the date of download.
- Execute the following commands:
tar -xvzf gm_key_64.tar.gz
tar -xvzf gmes_linux_64.tar.gz
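Because the key expires one year after download, it can help to check its age before starting a long run. The sketch below is an illustration only: it uses the file's modification time as a stand-in for the download date, which is an assumption (extraction with tar may preserve an older timestamp), and the function names are hypothetical rather than part of finder.

```python
import os
import time

SECONDS_PER_DAY = 86400


def key_age_days(key_path, now=None):
    """Age of the key file in days, based on its modification time.

    Assumption: the modification time approximates the download date.
    """
    now = time.time() if now is None else now
    return (now - os.path.getmtime(key_path)) / SECONDS_PER_DAY


def key_is_probably_valid(key_path, max_age_days=365, now=None):
    """True when the key file looks younger than max_age_days."""
    return key_age_days(key_path, now) < max_age_days
```

Pass the path of the extracted key file to `key_is_probably_valid`; a `False` result suggests downloading a fresh key before running finder.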
Executing FINDER with Sample data
Please follow the instructions below to generate gene annotations for Arabidopsis thaliana. A csv file template is provided with the release in example/Arabidopsis_thaliana_metadata.csv. Keep all the headers intact and replace the data with your samples of choice. Note that FINDER can work both with data downloaded from NCBI and with data in local directories. Below is a detailed description of each column of the metadata file. All the fields must be present in the metadata file. Mandatory fields must contain valid data, whereas other fields such as Description, Date and Read Length can be left vacant.
| Column Name | Column Description | Mandatory |
| :--------------- | :----------------------------------------------------------- | :-------- |
| BioProject | Name of the bioproject that the data belongs to. If you are using locally saved data then please enter a dummy project name. Please note that FINDER will NOT be able to process empty fields of Bioproject. | YES |
| SRA Accession | Enter the SRA Accession number of the sample that you expect finder to use for generating the gene annotations. FINDER will use this ID to download the read samples from NCBI-SRA. If you wish to use data that is not currently uploaded to NCBI, enter the name of the local file instead. Do not enter any file extension in this field. For example, if your filename is sample1.fastq, please enter sample1 in this field. finder assumes all files have the extension fastq. If there are files on your system that end with .fq, please rename them to *.fastq. For paired-end samples do not include the pair information in this field. For example, if you have 2 files sample2_1.fastq and sample2_2.fastq, please enter sample2 in this field. | YES |
| Tissues | Mention the tissue type or condition from which the sample has been collected. finder will report the tissues that are associated with a particular transcript. This can be used to find gene models that are expressed in a specific tissue and/or condition | YES |
| Description | A brief description of the data. This field is not mandatory and is not used by finder. It is up to the user to enter whatever metadata is deemed important. | NO |
| Date | Enter the date of producing the RNA-Seq sample. This field is not mandatory and is not used by finder. | NO |
| Read Length (bp) | Enter the length of the reads. This field is not mandatory and is not used by finder. | NO |
| Ended | Enter either PE for paired-end reads or SE for single-end reads. No other value should be entered. | YES |
| RNA-Seq | Enter 1 for all the rows. This field is included for future extensions. | YES |
| process | Enter 1 if you wish to process the sample. If a value of 0 is present, then finder will ignore the sample | YES |
| Location | Enter the location of the directory. For samples to be downloaded from NCBI, this field should be left empty. If the location of a directory is provided here then finder will assume that the sample is present in it. finder will generate an error if the sample is not found in this directory. It is not necessary to have all the samples in the same directory. | YES |
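The rules in the table can be checked before a run with a small script. The following is a minimal sketch, not finder's own validation: it assumes the csv header names match the column names in the table exactly, and it does not require a value in Location because that field is left empty for samples downloaded from NCBI. The function name `validate_metadata` is hypothetical.

```python
import csv

# Column names as listed in the table above (assumed to match the csv header).
COLUMNS = ["BioProject", "SRA Accession", "Tissues", "Description", "Date",
           "Read Length (bp)", "Ended", "RNA-Seq", "process", "Location"]
# Fields that must carry a value in every row.
MUST_HAVE_VALUE = ["BioProject", "SRA Accession", "Tissues",
                   "Ended", "RNA-Seq", "process"]


def validate_metadata(handle):
    """Return a list of human-readable problems found in a metadata csv."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        for col in MUST_HAVE_VALUE:
            if not (row[col] or "").strip():
                problems.append(f"row {n}: {col} is empty")
        ended = (row["Ended"] or "").strip()
        if ended and ended not in ("PE", "SE"):
            problems.append(f"row {n}: Ended must be PE or SE, got {ended!r}")
        process = (row["process"] or "").strip()
        if process and process not in ("0", "1"):
            problems.append(f"row {n}: process must be 0 or 1, got {process!r}")
    return problems
```

A typical use is `with open(path, newline="") as fh: print(validate_metadata(fh))`, where an empty list means no problems were detected by these checks.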
To optimize disk space usage, finder processes read samples one bioproject at a time. Once the data is downloaded and the reads are mapped, FINDER removes the downloaded data to save disk space, unless --no_cleanup is specified. Samples that were already present locally are not removed.
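The per-bioproject processing and the cleanup rule can be sketched as a small planning step. This illustrates the behaviour described above under stated assumptions and is not finder's implementation: `plan_by_bioproject` is a hypothetical name, rows are dictionaries keyed by the metadata column names, and local file names follow the sample.fastq, sample_1.fastq and sample_2.fastq convention from the table.

```python
import os


def plan_by_bioproject(rows):
    """Group samples to process by BioProject and mark which are removable.

    A sample counts as removable only if it would be downloaded, that is,
    its Location field is empty. Local samples are never marked removable.
    """
    plan = {}
    for row in rows:
        if row["process"].strip() != "1":
            continue  # process = 0 means the sample is ignored
        accession = row["SRA Accession"].strip()
        location = row["Location"].strip()
        if location:
            if row["Ended"].strip() == "PE":
                names = [accession + "_1.fastq", accession + "_2.fastq"]
            else:
                names = [accession + ".fastq"]
            files = [os.path.join(location, name) for name in names]
        else:
            files = []  # to be downloaded; file names are not assumed here
        plan.setdefault(row["BioProject"].strip(), []).append(
            {"accession": accession, "files": files,
             "removable": not location})
    return plan
```

Iterating over the returned dictionary gives one batch of samples per bioproject, in the order in which the bioprojects first appear in the metadata file.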
Running FINDER
Help menu for FINDER can be launched by the following command:
run_finder -h
usage: run_finder [-h] [--version] --metadatafile METADATAFILE --output_directory OUTPUT_DIRECTORY --genome GENOME --organism_model {VERT,INV,PLANTS,FUNGI} --genemark_path GENEMARK_PATH --genemark_license GENEMARK_LICENSE [--cpu CPU] [--genome_dir_star GENOME_DIR_STAR]
[--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE] [--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data] [--checkpoint CHECKPOINT] [--perform_post_completion_data_cleanup] [--run_tests] [--addUTR] [--skip_cpd] [--exonerate_gff3 EXONERATE_GFF3]
[--star_shared_mem] [--framework {docker,singularity}]
Generates gene annotation from RNA-Seq data
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Required arguments:
--metadatafile METADATAFILE, -mf METADATAFILE
Please enter the name of the metadata file. Enter 0 in the last column of those samples which you wish to skip processing. The columns should represent the following in order --> BioProject, SRA Accession, Tissues, Description, Date, Read Length, Ended (PE or SE), RNA-Seq, process, Location. If the sample is skipped it will not be downloaded. Leave the directory path blank if you are downloading the samples. In the end of the run the program will output a csv file with the directory path filled out. Please check the provided csv file for more information on how to configure the metadata file.
--output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
Enter the name of the directory where all other operations will be performed
--genome GENOME, -g GENOME
Enter the SOFT-MASKED genome file of the organism
--organism_model {VERT,INV,PLANTS,FUNGI}, -om {VERT,INV,PLANTS,FUNGI}
Enter the type of organism
--genemark_path GENEMARK_PATH, -gm GENEMARK_PATH
Enter the path to genemark
--genemark_license GENEMARK_LICENSE, -gml GENEMARK_LICENSE
Enter the licence file. Please make sure your license file is less than 365 days old
Optional arguments:
--cpu CPU, -n CPU Enter the number of CPUs to be used.
--genome_dir_star GENOME_DIR_STAR, -g
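As an illustration of how the required arguments fit together, the sketch below assembles a run_finder command line using only flags listed in the help text. The helper `build_run_finder_command` is hypothetical, and every path in the example apart from the metadata template is a placeholder to be replaced with your own files.

```python
import shlex

ORGANISM_MODELS = ("VERT", "INV", "PLANTS", "FUNGI")


def build_run_finder_command(metadatafile, output_directory, genome,
                             organism_model, genemark_path,
                             genemark_license, cpu=None):
    """Build the argument list for run_finder from the required options."""
    if organism_model not in ORGANISM_MODELS:
        raise ValueError("organism_model must be one of " + ", ".join(ORGANISM_MODELS))
    command = ["run_finder",
               "--metadatafile", metadatafile,
               "--output_directory", output_directory,
               "--genome", genome,
               "--organism_model", organism_model,
               "--genemark_path", genemark_path,
               "--genemark_license", genemark_license]
    if cpu is not None:
        command += ["--cpu", str(cpu)]
    return command


# Placeholder paths, for illustration only.
example = build_run_finder_command(
    "example/Arabidopsis_thaliana_metadata.csv", "finder_out",
    "genome_softmasked.fa", "PLANTS",
    "path/to/genemark", "path/to/genemark_key", cpu=8)
print(shlex.join(example))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` directly.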
