Optimize resource usage #36

samuell · 2024-12-12T13:06:19Z

The EMU pipeline requires substantial computational resources, coming primarily from the initial mapping step based on minimap2.

Based on some rough tests on 10 cores on my laptop (12th Gen Intel(R) Core(TM) i7-1255U), I got:

Ca 375 CPU core ms per read (median read length of ca 1400 bp)

Ca 2.5 CPU core minutes per "chunked" fastq file with 4000 reads from the instrument

A typical sample with hundreds of such files seems to easily take 10-20 CPU core hours (note that spreading the work across multiple cores will cut the wall-clock time significantly).
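As a rough sanity check of the per-sample estimate, the arithmetic can be sketched as below. It uses only the per-file figure (2.5 CPU core minutes per file). The file counts 240 and 480 are hypothetical values picked to stand in for "hundreds of files", and the wall-clock number assumes perfect scaling across cores, which real runs will not reach.

```python
def sample_cost(n_files, core_min_per_file=2.5, cores=10):
    """Return (CPU core hours, ideal wall-clock hours) for one sample.

    core_min_per_file is the rough per-file measurement quoted above;
    n_files and cores are illustrative inputs, not measured values.
    """
    core_hours = n_files * core_min_per_file / 60.0
    wall_hours = core_hours / cores  # assumes perfect scaling across cores
    return core_hours, wall_hours


# Hypothetical file counts standing in for "hundreds of such files".
for n_files in (240, 480):
    core_h, wall_h = sample_cost(n_files)
    print(f"{n_files} files: {core_h:.1f} core h, "
          f"~{wall_h:.1f} h wall-clock on 10 cores")
```

With these inputs the estimate lands at the two ends of the 10-20 CPU core hour range mentioned above.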

I'm creating this issue to collect ideas, and things we've already tried, for optimizing the resource usage of the pipeline.

Ideas

Run only on the forward strand in the database
- Using this flag to emu abundance:
  
  --mm2-forward-only force minimap2 to consider the forward transcript strand only
- Thanks to @jodjo86 for sharing the tip here!
- @samuell has done naive tests that indicate running times can be cut by nearly 50%.
Filter down the database to top hits from running kraken2 on the reads.
- @samuell has done naive tests on a tiny dataset indicating that keeping the top 12 species cut the running time by 50% (more info below).
Use an optimized minimap2 version, such as:
- mm2plus
- os-minimap2
- ....
More ideas?
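For the first idea, a minimal sketch of how the flag could be toggled from a wrapper script is shown below. The helper name `emu_abundance_cmd`, the input file name and the database directory are hypothetical. The `--db` and `--threads` options are assumed from the EMU command line and should be checked against `emu abundance --help`; only `--mm2-forward-only` is taken from this issue. The sketch just builds and prints the command, it does not run it.

```python
import shlex


def emu_abundance_cmd(fastq, db_dir, threads=10, forward_only=True):
    """Build an `emu abundance` argument list (hypothetical helper).

    --db and --threads are assumed option names; verify them with
    `emu abundance --help` for the installed EMU version.
    """
    cmd = ["emu", "abundance", fastq, "--db", db_dir, "--threads", str(threads)]
    if forward_only:
        # Flag quoted in this issue: minimap2 considers the forward strand only.
        cmd.append("--mm2-forward-only")
    return cmd


# Placeholder paths, for illustration only.
print(shlex.join(emu_abundance_cmd("reads.fastq.gz", "emu_database")))
```

Keeping the flag behind a boolean makes it easy to time the same input with and without it.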

@ryanjameskennedy and @LordRust feel free to fill in here, as I understand you've been looking at this too!


samuell · 2024-12-12T13:16:46Z

My small test of filtering down the EMU database using Kraken2 was done as follows:

I used the first 1000 sequences from the assets/test_assets/Mock_dil_1_2_BC1.fastq.gz file in the repo.
I kept the 12 most abundant species according to Kraken2 (v 2.1.2) with the ~8 GB database.
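The species selection step could look roughly like the sketch below. It is not the script used for the test above. It assumes the standard six-column, tab-separated Kraken2 report (percentage, clade reads, direct reads, rank code, taxid, name), and the function name `top_species` and the tiny inline report with placeholder taxa and counts are made up for illustration.

```python
def top_species(report_lines, n=12):
    """Return [(taxid, name, clade_reads)] for the n most abundant species.

    Assumes the standard 6-column Kraken2 report layout; rows with rank
    code "S" are treated as species-level entries.
    """
    species = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        _pct, clade_reads, _direct, rank, taxid, name = fields[:6]
        if rank.strip() == "S":
            species.append((taxid.strip(), name.strip(), int(clade_reads)))
    # Most abundant first, ranked by reads assigned to the clade.
    species.sort(key=lambda s: s[2], reverse=True)
    return species[:n]


# Made-up example report: placeholder taxids, names and counts.
example_report = [
    "60.00\t600\t0\tG\t100\t  Genus X\n",
    "40.00\t400\t400\tS\t101\t    Species A\n",
    "20.00\t200\t200\tS\t102\t    Species B\n",
    "30.00\t300\t300\tS\t201\t    Species C\n",
]

for taxid, name, reads in top_species(example_report, n=2):
    print(taxid, name, reads)
```

The resulting taxids would then be used to subset the EMU database files; that step is left out here because it depends on how the database is laid out.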

The results are as follows:

The top ~4 hits keep the same ranking, and the top ~8 are roughly the same.
See image below, where species that kept their rank are marked with green lines, and those whose rank changed are marked with orange lines:

samuell added the enhancement (New feature or request) label Dec 12, 2024
