Bioinformatics one-liners
Useful bash one-liners for bioinformatics (and some that are more generally useful).
Contents
- Sources
- Basic awk & sed
- awk & sed for bioinformatics
- sort, uniq, cut, etc.
- find, xargs, and GNU parallel
- seqtk
- GFF3 Annotations
- Other generally useful aliases for your .bashrc
- Etc.
Sources
- http://gettinggeneticsdone.blogspot.com/2013/10/useful-linux-oneliners-for-bioinformatics.html#comments
- http://sed.sourceforge.net/sed1line.txt
- https://github.com/lh3/seqtk
- http://lh3lh3.users.sourceforge.net/biounix.shtml
- http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html
- http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/
- http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html
- http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/
- http://www.commandlinefu.com/
Basic awk & sed
Extract fields 2, 4, and 5 from file.txt:
awk '{print $2,$4,$5}' file.txt
Print each line where the 5th field is equal to ‘abc123’:
awk '$5 == "abc123"' file.txt
Print each line where the 5th field is not equal to ‘abc123’:
awk '$5 != "abc123"' file.txt
Print each line whose 7th field matches the regular expression:
awk '$7 ~ /^[a-f]/' file.txt
Print each line whose 7th field does not match the regular expression:
awk '$7 !~ /^[a-f]/' file.txt
Get unique entries in file.txt based on column 2 (takes only the first instance):
awk '!arr[$2]++' file.txt
Print rows where column 3 is larger than column 5 in file.txt:
awk '$3>$5' file.txt
Sum column 1 of file.txt:
awk '{sum+=$1} END {print sum}' file.txt
Compute the mean of column 2:
awk '{x+=$2}END{print x/NR}' file.txt
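To see the sum and mean patterns in action, here is a quick check on a tiny made-up dataset (the numbers are invented for illustration):

```shell
# Tiny two-column sample file (invented numbers)
printf '1 10\n2 20\n3 30\n' > nums.txt

# Sum of column 2
awk '{sum+=$2} END {print sum}' nums.txt
# 60

# Mean of column 2: running total divided by the record count NR
awk '{x+=$2} END {print x/NR}' nums.txt
# 20
```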
Replace all occurrences of foo with bar in file.txt:
sed 's/foo/bar/g' file.txt
Trim leading whitespace (spaces and tabs) in file.txt:
sed 's/^[ \t]*//' file.txt
Trim trailing whitespace (spaces and tabs) in file.txt:
sed 's/[ \t]*$//' file.txt
Trim leading and trailing whitespace (spaces and tabs) in file.txt:
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
Delete blank lines in file.txt:
sed '/^$/d' file.txt
Delete everything after and including a line containing EndOfUsefulData:
sed -n '/EndOfUsefulData/,$!p' file.txt
Remove duplicates while preserving order:
awk '!visited[$0]++' file.txt
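The idiom works because `visited[$0]++` is 0 (false) the first time a line is seen and nonzero afterwards, so the pattern is true exactly once per distinct line, in order of first appearance. A quick demonstration on invented input:

```shell
# !visited[$0]++ keeps only the first occurrence of each line
printf 'b\na\nb\nc\na\n' > dup.txt
awk '!visited[$0]++' dup.txt
# b
# a
# c
```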
awk & sed for bioinformatics
Return all lines in file.txt on chromosome 1 with position between 1 Mb and 2 Mb (assumes the chromosome is in column 1 and the position in column 3; the same idea can be used to keep only variants above a specific allele frequency):
cat file.txt | awk '$1=="1"' | awk '$3>=1000000' | awk '$3<=2000000'
Basic sequence statistics. Print total number of reads, total number unique reads, percentage of unique reads, most abundant sequence, its frequency, and percentage of total in file.fq:
cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
Convert .bam back to .fastq:
samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq
Keep only top bit scores in blast hits (best bit score only):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt
Keep only top bit scores in blast hits (5 less than the top):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt
Split a multi-FASTA file into individual FASTA files:
awk '/^>/{s=++d".fa"} {print > s}' multi.fa
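A small sketch of the split on a two-record toy FASTA (sequences invented): each header line (`/^>/`) opens a new numbered output file, and every line, header included, is printed to the current file.

```shell
# Two-record toy multi-FASTA (invented sequences)
printf '>s1\nAAAA\n>s2\nCCCC\n' > multi.fa

# s = ++d".fa" concatenates the incremented counter with ".fa" -> 1.fa, 2.fa, ...
awk '/^>/{s=++d".fa"} {print > s}' multi.fa

head 1.fa 2.fa
```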
Output sequence name and its length for every sequence within a fasta file:
cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
Convert a FASTQ file to FASTA:
sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
Extract every 4th line starting at the second line (extract the sequence from FASTQ file):
sed -n '2~4p' file.fq
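A quick check of the FASTQ-to-FASTA conversion on a minimal two-read FASTQ (read names and sequences invented). Note that `1~4` / `2~4` are GNU sed "first~step" addresses: header lines get `@` rewritten to `>`, and sequence lines are printed as-is.

```shell
# Minimal two-read FASTQ (invented reads)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' > mini.fq

sed -n '1~4s/^@/>/p;2~4p' mini.fq
# >r1
# ACGT
# >r2
# TTGG
```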
Print everything except the first line:
awk 'NR>1' input.txt
Print rows 20-80:
awk 'NR>=20&&NR<=80' input.txt
Calculate the sum of column 2 and 3 and put it at the end of a row:
awk '{print $0,$2+$3}' input.txt
Calculate the mean length of reads in a fastq file:
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq
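As a sanity check of the mean-read-length formula (sequence lines are every record where NR%4==2, and NR/4 is the number of reads), here it is on an invented two-read FASTQ:

```shell
# Tiny two-read FASTQ (invented); reads are 4 bp and 6 bp long
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGGAA\n+\nIIIIII\n' > mini.fq
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' mini.fq
# 5
```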
Convert a VCF file to a BED file:
sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'
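A quick check of the conversion on a minimal invented VCF: header lines (`/^#/`) are skipped, and POS is shifted down by one to give the 0-based BED start.

```shell
# Minimal invented VCF: two header lines plus one variant at chr1:100 A>G
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\n' > toy.vcf

sed -e 's/chr//' toy.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'
# 1  99  100  A/G  +   (tab-separated)
```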
Create a tab-delimited transcript-to-gene mapping table from a GENCODE GTF. substr(x,s,n) returns n characters from string x starting at position s; here it strips the surrounding quotes and trailing semicolon:
bioawk -c gff '$feature=="transcript" {print $group}' gencode.v28.annotation.gtf | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' > txp2gene.tsv
Extract the sequence of a specific read from a gzipped fastq file by read name:
zcat a.fastq.gz | awk 'BEGIN{RS="@";FS="\n"}; $1~/readsName/{print $2; exit}'
Count the number of missing genotypes per line in a VCF/BCF file:
bcftools query -f '[%GT\t]\n' a.bcf | awk '{miss=0};{for (x=1; x<=NF; x++) if ($x=="./.") {miss+=1}};{print miss}' > nmiss.count
Interleave paired-end fastq files:
paste <(paste - - - - < reads-1.fastq) \
<(paste - - - - < reads-2.fastq) \
| tr '\t' '\n' \
> reads-int.fastq
Decouple an interleaved fastq file:
paste - - - - - - - - < reads-int.fastq \
| tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
| cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
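The interleave and deinterleave recipes above invert each other, which is easy to round-trip on invented data. This sketch uses temporary files instead of process substitution (so it also runs under plain sh) and two separate cut passes instead of tee, which avoids depending on the asynchronous process substitution having flushed its output:

```shell
# Two single-read FASTQ files (invented read names and sequences)
printf '@r1/1\nAAAA\n+\nIIII\n' > reads-1.fastq
printf '@r1/2\nCCCC\n+\nIIII\n' > reads-2.fastq

# Interleave: collapse each 4-line record to one tab-separated row, zip the rows
paste - - - - < reads-1.fastq > r1.rows
paste - - - - < reads-2.fastq > r2.rows
paste r1.rows r2.rows | tr '\t' '\n' > reads-int.fastq

# Deinterleave in two passes (fields 1-4 = mate 1, fields 5-8 = mate 2)
paste - - - - - - - - < reads-int.fastq | cut -f 1-4 | tr '\t' '\n' > out-1.fastq
paste - - - - - - - - < reads-int.fastq | cut -f 5-8 | tr '\t' '\n' > out-2.fastq

# Round trip should reproduce the originals
cmp reads-1.fastq out-1.fastq && cmp reads-2.fastq out-2.fastq && echo OK
```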
sort, uniq, cut, etc.
Number each line in file.txt:
cat -n file.txt
Count the number of unique lines in file.txt:
cat file.txt | sort -u | wc -l
Find lines shared by 2 files (assumes lines within file1 and file2 are unique; pipe to wc -l to count the number of lines shared):
sort file1 file2 | uniq -d
# Safer
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d
# Use comm (both inputs must be sorted)
comm -12 file1 file2
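A small demonstration of the `uniq -d` and `comm` variants on tiny pre-sorted inputs (file contents invented):

```shell
# Two small, pre-sorted files with unique lines (invented contents)
printf 'a\nb\nc\n' > f1
printf 'b\nc\nd\n' > f2

# Lines appearing in both files
sort f1 f2 | uniq -d
# b
# c

comm -12 f1 f2   # comm requires both inputs to be sorted
# b
# c
```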
Sort numerically in general-number mode (-g, which also handles scientific notation) by column 9 (-k9):
sort -gk9 file.txt
Find the most common strings in column 2:
cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
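The same pipeline run on a tiny tab-separated sample (values invented): count each distinct value in column 2, then sort the counts in reverse numeric order.

```shell
# Five rows; column 2 holds 'a' three times and 'b' twice (invented data)
printf '1\ta\n2\tb\n3\ta\n4\ta\n5\tb\n' > col.txt

cut -f2 col.txt | sort | uniq -c | sort -k1nr | head
# prints the counts, most frequent first ("3 a", then "2 b")
```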
Pick 10 random lines from a file:
shuf file.txt | head -n 10
Print all possible 3mer DNA sequence combinations:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
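Brace expansion (a bash feature) enumerates every combination, so three positions with four bases each yield 4^3 = 64 words, which is easy to verify:

```shell
# Count the expanded 3-mers: 4 * 4 * 4 = 64
echo {A,C,T,G}{A,C,T,G}{A,C,T,G} | wc -w
# 64
```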
Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled, and you want to separate them into separate /1 and /2 files, and assuming the /1 reads precede the /2 reads:
cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq
Take a fasta file with a bunch of short scaffolds, e.g., labeled >Scaffold12345, remove them, and write a new fasta without them:
samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa
Display hidden control characters:
python3 -c "print(open('file.txt').readlines())"
find, xargs, and GNU parallel
Download GNU parallel at https://www.gnu.org/software/parallel/.
Search for .bam files anywhere in the current directory recursively:
find . -name "*.bam"
Delete all .bam files (Irreversible: use with caution! Confirm list BEFORE deleting):
find . -name "*.bam" | xargs rm
Find the largest file in a directory:
find . -type f -printf '%s %p\n' | sort -nr | head -1 | cut -d' ' -f2-
Count the number of lines in all files with a specific extension in a directory:
find . -type f -name '*.txt' -exec wc -l {} \; | awk '{total += $1} END {print total}'
Find all directories in the current directory that contain files with the extension ".log" and compress them into a tar archive:
find . -type f -name "*.log" -printf "%h\0" | sort -uz | xargs -0 tar -czvf logs.tar.gz
Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):
find . -name "*.txt" | sed "s/\.txt$//" | xargs -I{} echo mv {}.txt {}.bak | sh
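The same rename pattern exercised in a scratch directory. Note that `-I{}` is the current GNU xargs spelling of the deprecated `-i`, and that the echo stage means you can inspect the generated commands before piping them to sh:

```shell
# Scratch directory with two empty .txt files (names invented)
mkdir -p renamedemo && cd renamedemo
touch one.txt two.txt

# Preview the generated mv commands first
find . -name "*.txt" | sed "s/\.txt$//" | xargs -I{} echo mv {}.txt {}.bak

# Then actually run them by piping to sh
find . -name "*.txt" | sed "s/\.txt$//" | xargs -I{} echo mv {}.txt {}.bak | sh

ls
cd ..
```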
Chastity filter raw Illumina data (grep reads containing :N:, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):
find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"
Run FASTQC in parallel 12 jobs at a time:
find *.fq | parallel -j 12 "fastqc {} --outdir ."
Index your bam files in parallel, but only echo the commands (--dry-run) rather than actually running them:
find *.bam | parallel --dry-run 'samtools index {}'
seqtk
Download seqtk at https://github.com/lh3/seqtk. Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.
Convert FASTQ to FASTA:
seqtk seq -a in.fq.gz > out.fa
Convert ILLUMINA 1.3+ FASTQ to FASTA and mask bases with quality lower than 20 to lowercases (the 1st command line) or to N (the 2nd):
seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa
Fold long FASTA/Q lines and remove FASTA/Q comments:
seqtk seq -Cl60 in.fa > out.fa