Optimize resource usage #36

Open
samuell opened this issue Dec 12, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@samuell
Contributor

samuell commented Dec 12, 2024

The EMU pipeline requires substantial computational resources, primarily for the initial mapping step based on minimap2.

Based on some rough tests using 10 cores on my laptop (12th Gen Intel(R) Core(TM) i7-1255U), I got:

  • About 375 CPU core ms per read (median read length of about 1400 bp)
  • About 2.5 CPU core minutes per "chunked" FASTQ file with 4000 reads from the instrument
  • A typical sample, with hundreds of such files, seems to easily take 10-20 CPU core hours (note that spreading the work across multiple cores will cut the wall-clock time significantly).
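As a rough sketch of the arithmetic behind the last bullet, the per-file figure can be scaled up to a whole sample. The file count of 300 and the assumption of perfectly linear scaling across cores are illustrative only, not measurements:

```python
def estimate_cpu_hours(n_files: int, cpu_minutes_per_file: float = 2.5) -> float:
    """Total CPU core hours for a sample made up of n_files chunked FASTQ files."""
    return n_files * cpu_minutes_per_file / 60.0


def estimate_wall_hours(cpu_hours: float, n_cores: int) -> float:
    """Best-case wall-clock hours, assuming perfectly linear scaling over cores."""
    return cpu_hours / n_cores


if __name__ == "__main__":
    n_files = 300  # hypothetical number of chunked files in one sample
    cpu_h = estimate_cpu_hours(n_files)
    wall_h = estimate_wall_hours(cpu_h, n_cores=10)
    print(f"{n_files} files: {cpu_h:.1f} CPU core hours, "
          f"about {wall_h:.2f} h wall-clock on 10 cores")
```

In practice the wall-clock time will be somewhat higher, since scaling over cores is rarely perfectly linear.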

I'm creating this issue to summarize ideas and things we've tried for optimizing the resource usage of the pipeline.

Ideas

  1. Run only on the forward strand in the database
    • Use this flag with emu abundance:

      --mm2-forward-only force minimap2 to consider the forward transcript strand only

    • Thanks to @jodjo86 for sharing the tip here!
    • @samuell has done naive tests that indicate running times can be cut by nearly 50%.
  2. Filter down the database to top hits from running kraken2 on the reads.
    • @samuell has done naive tests on a tiny dataset indicating that keeping the top 12 species cuts the running time by 50% (more info below).
  3. Use an optimized minimap2 version, such as:
  4. More ideas?
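For idea 1, a minimal sketch of how the flag could be toggled when assembling the emu abundance call. The helper name, the file paths and the thread count are hypothetical, and the `--db` and `--threads` options are assumed to behave as described in the Emu README:

```python
import shlex


def build_emu_abundance_cmd(fastq, db_dir, threads=10, forward_only=True):
    """Assemble the argument list for `emu abundance` (hypothetical helper)."""
    cmd = ["emu", "abundance", str(fastq),
           "--db", str(db_dir),
           "--threads", str(threads)]
    if forward_only:
        # Make minimap2 consider the forward transcript strand only.
        cmd.append("--mm2-forward-only")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace with a real FASTQ file and EMU database dir.
    cmd = build_emu_abundance_cmd("reads.fastq.gz", "emu_db")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the flag behind a boolean makes it easy to time the same input with and without it.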

@ryanjameskennedy and @LordRust feel free to fill in here, as I understand you've been looking at this too!

@samuell samuell added the enhancement New feature or request label Dec 12, 2024
@samuell
Contributor Author

samuell commented Dec 12, 2024

My small test of filtering down the EMU database using Kraken2 was done as follows:

  • I used the first 1000 sequences from the assets/test_assets/Mock_dil_1_2_BC1.fastq.gz file in the repo.
  • I kept the 12 most abundant species according to Kraken2 (v 2.1.2) with the ~8GB database.
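The filtering step above could be sketched roughly as below. This is not the exact procedure used in the test. It assumes the standard six-column Kraken2 report format, and that the EMU database FASTA headers start with the taxid followed by a colon (e.g. `>1280:emu_db:1`), which should be checked against the actual database. It also assumes both tools use the same taxids. The example data is made up.

```python
def top_species_from_kraken2_report(report_lines, n=12):
    """Return the n species-level entries with the most clade reads.

    Expects standard Kraken2 report lines: percent, clade reads,
    direct reads, rank code, taxid, indented name (tab-separated).
    """
    species = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads, rank, taxid, name = fields[1], fields[3], fields[4], fields[5]
        if rank.strip() == "S":  # species level only
            species.append((int(clade_reads), taxid.strip(), name.strip()))
    species.sort(key=lambda s: s[0], reverse=True)
    return [(taxid, name, reads) for reads, taxid, name in species[:n]]


def filter_fasta_by_taxid(fasta_lines, keep_taxids):
    """Yield only FASTA records whose header starts with a kept taxid.

    Assumes headers look like '>TAXID:...' (an assumption about the EMU db).
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            taxid = line[1:].split(":", 1)[0].strip()
            keep = taxid in keep_taxids
        if keep:
            yield line


if __name__ == "__main__":
    report = [  # made-up example report
        " 60.00\t600\t0\tG\t1279\t  Staphylococcus\n",
        " 40.00\t400\t400\tS\t1280\t    Staphylococcus aureus\n",
        " 30.00\t300\t300\tS\t1423\t    Bacillus subtilis\n",
        " 10.00\t100\t100\tS\t562\t    Escherichia coli\n",
    ]
    top = top_species_from_kraken2_report(report, n=2)
    keep = {taxid for taxid, _name, _reads in top}
    fasta = [">1280:emu_db:1\n", "ACGT\n", ">562:emu_db:2\n", "GGCC\n"]
    print(top)
    print("".join(filter_fasta_by_taxid(fasta, keep)), end="")
```

A real run would also need the database's taxonomy table restricted to the same taxids, which is not covered here.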

The results are as follows:

  • The top ~4 hits keep the same ranking, and the top ~8 are roughly the same.
  • See the image below, where species that kept the same rank are marked with green lines, and those whose rank changed are marked with orange lines:

image
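The same kind of rank comparison can also be done without a plot. Below is a minimal sketch, with placeholder species names rather than the actual results, that lines up the abundance ranking from the full database against the ranking from the filtered one:

```python
def compare_rankings(full_db_ranking, filtered_db_ranking):
    """Compare two species rankings (most abundant first).

    Returns one tuple per species in the full-database ranking:
    (species, rank_full, rank_filtered, same_rank). rank_filtered is
    None if the species is absent from the filtered-database ranking.
    """
    filtered_rank = {sp: i for i, sp in enumerate(filtered_db_ranking, start=1)}
    rows = []
    for rank, sp in enumerate(full_db_ranking, start=1):
        new_rank = filtered_rank.get(sp)
        rows.append((sp, rank, new_rank, new_rank == rank))
    return rows


if __name__ == "__main__":
    # Placeholder names, not the species from the test above.
    full = ["species_A", "species_B", "species_C", "species_D"]
    filtered = ["species_A", "species_B", "species_D", "species_C"]
    for sp, r_full, r_filt, same in compare_rankings(full, filtered):
        status = "same" if same else "moved"
        print(f"{sp}: {r_full} -> {r_filt} ({status})")
```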
