Earl Grey: A fully automated TE curation and annotation pipeline
Earl Grey is a fully automated transposable element (TE) annotation pipeline, leveraging the most widely used tools and combining these with a consensus elongation process to better define de novo consensus sequences when annotating new genome assemblies.
Important Considerations
Earl Grey version 6 uses Dfam 3.9. After installation, you MUST configure Dfam partitions as needed. Earl Grey will generate the script to do this and provide guidance when you run it for the first time. You need to specify which partitions of Dfam and/or RepBase to configure Earl Grey with. Choose partitions carefully as the combination will highly influence your results, especially if you want to pre-mask your input genome. Please make use of issues and discussions tabs if you have questions about this, we are always happy to help!
Notes / Updates
We often get questions about runtime. TE curation and annotation remains resource- and time-intensive. Faster is not necessarily better, and runtime depends heavily on genome size, complexity, and repeat content. Runs will likely take longer than you expect, and can be very RAM-hungry. As rough benchmarks: a 40Mb genome can take anywhere from a few hours to a day, 400Mb up to around 4-5 days, a 3Gb genome about a week, and a 25Gb genome several weeks! Things will be running even if it doesn't look like they are. Each step checkpoints, so if you hit server time limits, you can resubmit the same script with the same parameters and Earl Grey will skip completed steps. TEstrainer and the final divergence calculator use a lot of memory, so check the logs carefully for OOM errors! As a rule of thumb, you need at least 3GB of RAM per thread, with more being better; for example, 16 threads requires at least 48GB of RAM, and more depending on the repeat complexity of the input genome.
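The rule of thumb above can be expressed as a quick back-of-the-envelope calculation (a hypothetical helper for illustration only, not part of Earl Grey):

```python
# Hypothetical helper illustrating the rule of thumb above:
# at least 3 GB of RAM per thread, so available RAM caps the
# number of threads you should request.
def max_threads(total_ram_gb: int, gb_per_thread: int = 3) -> int:
    """Largest thread count supported by total_ram_gb of RAM."""
    return max(1, total_ram_gb // gb_per_thread)

# 48 GB supports at most 16 threads at 3 GB per thread.
print(max_threads(48))
```

If in doubt, request fewer threads than this estimate allows, since repeat-dense genomes push per-thread memory well above the 3GB floor.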
We have been made aware of some instability in repeat annotation percentages when high numbers of CPUs are employed in certain server environments. Please be sure to check logs carefully for instances of interruption. Known cases so far will show the following message:
OpenBLAS blas_thread_init: pthread_create failed for thread X of X: Resource temporarily unavailable
OpenBLAS blas_thread_init: ensure that your address space and process count limits are big enough (ulimit -a)
OpenBLAS blas_thread_init: or set a smaller OPENBLAS_NUM_THREADS to fit into what you have available
If you see this message, re-run the analysis with fewer threads. Alternatively, you can modify your instance of the TEstrainer script initial_mafft_setup.py to add the following after import os:
os.environ['OPENBLAS_NUM_THREADS'] = '1'
Changes in Latest Release
Earl Grey v7.2.1 patches a bug where the curated library directory was not created when an existing library was supplied via -l (without -r). In this case, earlGreyAnnotationOnly could fail when attempting to change into the directory during the final masking step. The directory is now created with mkdir -p before use, matching the fix already applied to the main earlGrey script.
Previous Changes
Earl Grey v7.2.0 significantly reduces peak RAM usage in two RAM-intensive components: TEstrainer and divergence_calc.py. These changes prevent OOM kills when running with large thread counts on memory-constrained compute nodes, with no change to output.
TEstrainer / TEstrainer_for_earlGrey.sh:
- The default MEM_FREE threshold is raised from 200M to 1G. The previous value was lower than the startup cost of a single Python interpreter with heavy scientific libraries, making the guard ineffective.
- All GNU parallel calls in the BEAT curation loop (trf, initial_mafft_setup, mafft, TEtrim) now carry --memfree ${MEM_FREE}, throttling job dispatch when free RAM drops below the threshold.
- A RAM-cap guard is applied at startup: the requested thread count is capped based on available RAM (free -m) at an estimate of 800 MB per concurrent job, with a warning printed if a cap is applied.
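The startup cap works roughly like the following sketch (a simplified Python illustration assuming ~800 MB per concurrent job; the real guard is shell code in TEstrainer_for_earlGrey.sh that reads free -m):

```python
# Simplified sketch of the startup RAM-cap guard described above.
# Assumes ~800 MB per concurrent job; the actual check is shell code.
def cap_threads(requested: int, available_mb: int, mb_per_job: int = 800) -> int:
    """Cap the requested thread count by available RAM."""
    cap = max(1, available_mb // mb_per_job)
    if requested > cap:
        print(f"WARNING: capping threads from {requested} to {cap} "
              f"({available_mb} MB free, ~{mb_per_job} MB per job)")
        return cap
    return requested
```

So on a node with 4 GB free, a request for 32 threads would be capped to 5 concurrent jobs rather than risking an OOM kill.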
divergence_calc.py:
- Switched from the default fork multiprocessing start method to forkserver. On Linux, fork duplicates the full parent address space (including the GFF DataFrame) into every worker; forkserver workers start clean and receive only a file path, eliminating N-fold GFF copies in RAM.
- GFF chunks are now serialised to temp TSV files on disk before the pool is created. The parent DataFrame is freed before any workers are launched, reducing parent RSS during the pool run.
- pool.imap_unordered replaces pool.map, allowing workers to be retired as they finish rather than all buffering results simultaneously.
- maxtasksperchild=1 forces a worker process restart after each chunk, releasing accumulated pybedtools handles and BioPython caches between chunks.
- Periodic pybedtools.cleanup() every 500 rows prevents temp-file accumulation in long-running workers.
- A pre-existing bug was fixed: os.remove(query_path) was called unconditionally in both cleanup branches even when the file was never created (e.g. when pybedtools.sequence() raised a samtools exception). Both removal calls are now guarded with if exists(query_path).
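The combination of on-disk chunks, forkserver, imap_unordered, and maxtasksperchild can be sketched in miniature as follows. This is an illustrative pattern only, not the actual divergence_calc.py code; process_chunk and run are hypothetical names, and counting rows stands in for the real per-chunk work:

```python
import csv
import multiprocessing as mp
import os
import tempfile

def process_chunk(path: str) -> int:
    """Worker: receives only a file path, never the parent's in-memory data."""
    with open(path, newline="") as fh:
        n_rows = sum(1 for _ in csv.reader(fh, delimiter="\t"))
    os.remove(path)              # each worker cleans up its own temp file
    return n_rows                # stand-in for the real per-chunk work

def run(rows, n_workers: int = 4, chunk_size: int = 1000):
    # Serialise chunks to temp TSV files *before* the pool exists,
    # so workers never inherit the in-memory table.
    paths = []
    for start in range(0, len(rows), chunk_size):
        fd, path = tempfile.mkstemp(suffix=".tsv")
        with os.fdopen(fd, "w", newline="") as fh:
            csv.writer(fh, delimiter="\t").writerows(rows[start:start + chunk_size])
        paths.append(path)
    del rows                     # drop the parent copy before any worker starts
    # forkserver workers start clean instead of inheriting the parent's
    # address space; maxtasksperchild=1 restarts a worker after each chunk.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(n_workers, maxtasksperchild=1) as pool:
        # imap_unordered retires workers as they finish, rather than
        # buffering all results at once.
        return sorted(pool.imap_unordered(process_chunk, paths))
```

The key point is that worker memory is bounded by one chunk's size, regardless of how large the parent table was.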
Benchmarked results confirm peak RSS for divergence_calc.py is flat at ~76 MB across 1–16 threads (previously scaling linearly), and TEstrainer peak RSS is ~877 MB at both 4 and 32 threads (previously unbounded with thread count).
Earl Grey v7.1.1 patches a small bug in TEstrainer where consensus sequences composed of tandem repeats triggered a warning following the move to pandas >2.0. The output results were not affected. The codebase now handles these cases without throwing warnings, with the same expected behaviour and handling of tandem repeats as before.
Earl Grey v7.1.0 removes the dependency on Python 3.9, which is no longer supported. Earl Grey can now be built and run with Python 3.10 and above. This change was made to ensure that Earl Grey remains compatible with newer versions of Python and to remove the reliance on an unsupported version.
Earl Grey v7.0.3 fixes an issue with final calculation tables which did not count other repeats towards total repeat content.
Earl Grey v7.0.2 adds RepeatLandscapes for Penelope-like elements and SINEs. Importantly, the -norna option in RepeatMasker is no longer invoked as default behaviour, which will improve the detection and masking of small tRNA-derived SINEs.
Earl Grey v7.0.1 patches the summary table generation, where LINEs and Penelopes were being counted in both categories for nested repeats only.
☕ Earl Grey v7.0.0 is here!
Some long-awaited changes in this release—thank you for your patience while I found the time to properly test and implement them.
🐞 RepeatCraft fixes
First, I’ve fixed an issue in RepeatCraft where distal TEs could be grouped erroneously. This occurred in rare edge cases where the internal counter failed to iterate when new distal copies of an existing TE were detected on the same contig. This should now be fully resolved.
🧬 Fully nested TE handling (major update)
I’ve also completely revamped the Earl Grey post-processing and summary steps to properly deal with fully nested TEs. Partially overlapping TEs are handled as before, following the approach described in the original Earl Grey paper. Fully nested TEs are now identified using a new iterative process:
1. The GFF is scanned for nested TEs
2. These are labelled and stored in a separate file
3. The nested TE is removed and the GFF is re-scanned to detect deeper (multi-level) nesting
4. This continues until no new nested TEs are found
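One way to read the iterative process above, as a toy sketch on plain (start, end) intervals (the real pipeline operates on GFF records, and the actual peeling logic may differ):

```python
def peel_nested(intervals):
    """Toy sketch of an iterative nesting scan: each pass labels and
    removes one layer of fully nested intervals, then re-scans, so
    deeper (multi-level) nesting surfaces on later passes."""
    layers = []
    current = sorted(set(intervals))
    while True:
        # An interval is nested if it lies fully inside another.
        contained = {a for a in current
                     if any(b[0] <= a[0] and a[1] <= b[1] and a != b
                            for b in current)}
        if not contained:
            break
        # Peel only the outermost nested layer this pass; anything nested
        # inside another *nested* interval waits for a later re-scan.
        layer = sorted(a for a in contained
                       if not any(b in contained and b != a
                                  and b[0] <= a[0] and a[1] <= b[1]
                                  for b in current))
        layers.append(layer)
        current = [a for a in current if a not in layer]
    return current, layers
```

For a triple-nested case like (12,15) inside (10,20) inside (0,100), the first pass peels (10,20) and the re-scan then detects (12,15), giving one labelled layer per nesting level.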
📊 Changes to coverage and summaries (important)
Nested TEs are no longer included in the TE coverage calculation used to generate the pie chart in the summaryFiles directory.
Instead:
- Summary tables now include new categories showing how many base pairs are comprised of nested TEs
- These base pairs are not counted toward Total Interspersed Repeats, as doing so would double-count genomic space.
⚠️ This represents a substantial change from previous versions, so please be aware of this difference when upgrading to v7.
The output table summary now has the following format:
|TE Classification | Coverage (bp)|Copy Number | % Genome Coverage| Genome Size| TE Family Count|
|:-------------------------------------------------|-------------:|:-------------|-----------------:|-----------:|---------------:|
|DNA | 80886|326 | 0.7607795| 10631990| 326|
|DNA-nested | 1449|29 | 0.0136287| 10631990| 29|
|Rolling Circle | 8022|31 | 0.0754515| 10631990| 3|
