
Different ValidPairs rate between chromap and bowtie2 in HiC data #147

Open
WangZhSi opened this issue Jan 9, 2024 · 9 comments

@WangZhSi

WangZhSi commented Jan 9, 2024

Problem

I get different ValidPairs rates between chromap and bowtie2 on HiC data with the HiC-Pro pipeline, which converts the BAM into the *.validPairs format and computes statistics.
When I use chromap, the result is: Dumped_pairs_rate%: 85.6677191749
When I use bowtie2 (same data), the result is: Dumped_pairs_rate%: 0.269802518711

CMD

chromap(0.2.5-r473)
chromap --preset hic -x genome.index -r genome.fa -1 R1 -2 R2 --SAM -o out.sam --remove-pcr-duplicates -t 8 --summary out.summary
samtools view -bh out.sam | samtools sort -@ 8 > out.bam

bowtie2(2.2.3)
bowtie2 --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder --rg-id --phred33-quals -p 5 --un out.unmap.fq --rg -x genome -U R1 | samtools view -F 4 -bS - > R1.bam
The same command is used for R2; the two BAMs are then merged and sorted with samtools to produce out.bam.

map2frag(HiC-Pro tools)
python mapped_2hic_fragments.py -v -a -s 0 -l 700 -f DpnII_resfrag_genome.bed -r out.bam -o outdir
This is where I get the different ValidPairs rates.

Data Info

HiC data: PE150, 200X depth, clean reads

I've used chromap on many genomes; this is the first time this has happened. Chromap has been very helpful.

Thanks!

@mourisl
Collaborator

mourisl commented Jan 14, 2024

Is this the human genome? Do you also see very few alignments in the out.sam file?

@WangZhSi
Author

WangZhSi commented Jan 15, 2024

It's a plant genome. Here are the flagstat results of the BAMs:

chromap.out.bam:
197585161 + 0 in total (QC-passed reads + QC-failed reads)
197585161 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
197585161 + 0 mapped (100.00% : N/A)
197585161 + 0 primary mapped (100.00% : N/A)
197585161 + 0 paired in sequencing
98802418 + 0 read1
98782743 + 0 read2
197585161 + 0 properly paired (100.00% : N/A)
197585161 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
67103149 + 0 with mate mapped to a different chr
62048492 + 0 with mate mapped to a different chr (mapQ>=5)

bowtie2.out.bam (after merge and rmdup):
225507040 + 0 in total (QC-passed reads + QC-failed reads)
225507040 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
225507040 + 0 mapped (100.00% : N/A)
225507040 + 0 primary mapped (100.00% : N/A)
225507040 + 0 paired in sequencing
112753520 + 0 read1
112753520 + 0 read2
225507040 + 0 properly paired (100.00% : N/A)
225507040 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
76941620 + 0 with mate mapped to a different chr
74674808 + 0 with mate mapped to a different chr (mapQ>=5)

The alignments look OK; I'm not terribly bothered by the difference in the number of mapped reads.

@mourisl
Collaborator

mourisl commented Jan 15, 2024

Then the discrepancy between chromap and Bowtie2 probably comes from the map2frag stage. If that tool uses the TLEN field in the BAM file to determine the insert size, there could be an issue. We recently fixed a TLEN bug for Hi-C data: #139. If TLEN is the issue in your data, could you please pull the version on GitHub and give it a try? If it fixes your bug, we will release a new version.
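To illustrate why a wrong TLEN would matter here: a fragment-assignment step that infers insert size from TLEN might filter pairs as sketched below. This is a hypothetical minimal example of such a check (the function name and the 0-700 window mirroring the -s 0 -l 700 options are illustrative), not HiC-Pro's actual code.

```python
# Hypothetical sketch: filter a read pair by the SAM TLEN field
# (9th mandatory column), the way an insert-size cutoff like
# "-s 0 -l 700" could behave. Not HiC-Pro's real implementation.

def passes_insert_size(sam_line: str, min_size: int = 0, max_size: int = 700) -> bool:
    """Return True if the pair's |TLEN| falls inside [min_size, max_size]."""
    fields = sam_line.rstrip("\n").split("\t")
    tlen = abs(int(fields[8]))  # TLEN is the 9th mandatory SAM column
    return min_size <= tlen <= max_size

# A cis pair whose mates lie 500 bp apart is kept...
near = "r1\t99\tchr1\t1000\t60\t150M\t=\t1350\t500\t*\t*"
# ...while a long-range Hi-C contact 2 Mb away is dumped.
far = "r2\t99\tchr1\t1000\t60\t150M\t=\t2001000\t2000000\t*\t*"
print(passes_insert_size(near), passes_insert_size(far))  # True False
```

With a cutoff like this, any mapper that emits large (or incorrectly signed) TLEN values for long-range contacts would see those pairs dumped.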

@WangZhSi
Author

I pulled the li_dev5 branch and got this error:

In file included from src/chromap_driver.cc:10:0:
src/chromap.h: In instantiation of ‘void chromap::Chromap::MapSingleEndReads() [with MappingRecord = chromap::PAFMapping]’:
src/chromap_driver.cc:669:70:   required from here
src/chromap.h:302:9: error: ‘chromap::Chromap::num_corrected_barcode_’ is not a variable in clause ‘reduction’
 #pragma omp parallel shared(num_reads_, mm_history, reference, index, read_batch, barcode_batch, read_batch_for_loading, barcode_batch_for_loading, std::cerr, num_loaded_reads_for_loading, num_loaded_reads, num_reference_sequences, mappings_on_diff_ref_seqs_for_diff_threads, mappings_on_diff_ref_seqs_for_diff_threads_for_saving, mappings_on_diff_ref_seqs, temp_mapping_file_handles, mm_to_candidates_cache, mapping_writer, minimizer_generator, candidate_processor, mapping_processor, draft_mapping_generator, mapping_generator, num_mappings_in_mem, max_num_mappings_in_mem) num_threads(mapping_parameters_.num_threads) reduction(+:num_candidates_, num_mappings_, num_mapped_reads_, num_uniquely_mapped_reads_, num_barcode_in_whitelist_, num_corrected_barcode_)
src/chromap.h:302:9: error: ‘chromap::Chromap::num_barcode_in_whitelist_’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_uniquely_mapped_reads_’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_mapped_reads_’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_mappings_’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_candidates_’ is not a variable in clause ‘reduction’
src/chromap.h:302:9: error: ‘chromap::Chromap::num_reads_’ is not a variable in clause ‘shared’
src/chromap.h: In instantiation of ‘void chromap::Chromap::MapSingleEndReads() [with MappingRecord = chromap::SAMMapping]’:
src/chromap_driver.cc:673:70:   required from here

My command:
cd chromap-li_dev5 && make

@mourisl
Collaborator

mourisl commented Jan 15, 2024

You can use the main branch; it should contain the fix for the TLEN. I will look into the li_dev5 issue you found.

@WangZhSi
Author

I pulled the main branch and re-ran my data (mapping with the new chromap, sorting, map2frag), and the Dumped_pairs_rate did not change at all.

But when I check the TLEN in the BAM (or SAM) from chromap, it looks like that is where the error comes from.

[screenshot: TLEN values in the chromap SAM output]

By the way, the error I got when building the li_dev5 branch was caused by my OS version; you can ignore it.

@mourisl
Collaborator

mourisl commented Jan 16, 2024

You can see that the alignments are indeed very far from each other, so the absolute value of TLEN is large. The sign is based on whether the mates map to the forward or reverse strand. Since Hi-C fragment sizes can be large, the command "python mapped_2hic_fragments.py -v -a -s 0 -l 700", which restricts the insert size to between 0 and 700, might not be desirable?

@WangZhSi
Author

Thank you for your advice! I changed my command to python mapped_2hic_fragments.py -v -a -s 0 -l 100000000, which restricts the insert size to between 0 and 100 Mb, and the Dumped_pairs_rate improved slightly, from 85.66% to 84.56%. Maybe the error comes from the BAM-to-validPairs step of the HiC-Pro pipeline, but I don't know why; I hope you can solve this problem. If you need any files or intermediate results, please let me know.

Anyway, I'll try another pipeline from chromap to a .hic file. I think there's no substitute for chromap's speed advantage!

@mourisl
Collaborator

mourisl commented Jan 16, 2024

Thank you! For Hi-C data I think there is no need to restrict the insert size, because many alignments are expected to be far from each other or on different chromosomes. I'm curious how Bowtie2 sets the TLEN values in your data.
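For reference, the SAM specification's TLEN convention can be sketched as follows: the leftmost mate of a pair gets a positive TLEN, the rightmost a negative one, and the magnitude spans from the leftmost to the rightmost mapped base of the pair. The positions below are made up for illustration; they are not from the issue's data.

```python
# Sketch of the SAM-spec TLEN convention for a properly computed pair:
# leftmost mate positive, rightmost mate negative, magnitude = outer span.

def tlen_pair(pos1: int, end1: int, pos2: int, end2: int) -> tuple:
    """TLEN values (read1, read2) for mates covering 1-based [pos, end]."""
    span = max(end1, end2) - min(pos1, pos2) + 1
    if pos1 <= pos2:
        return span, -span   # read1 is the leftmost mate
    return -span, span       # read2 is the leftmost mate

# Two 150 bp mates at positions 1000 and 1350 span 500 bp in total.
print(tlen_pair(1000, 1149, 1350, 1499))  # (500, -500)
```

Under this convention a long-range cis Hi-C contact legitimately produces a huge |TLEN|, which is why an insert-size cutoff on TLEN dumps so many valid Hi-C pairs.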
