GtfToBed errors out because "gene" is null #9103

Open
mmahmoudian opened this issue Feb 27, 2025 · 2 comments

mmahmoudian commented Feb 27, 2025

This bug report concerns a new tool, GtfToBed, which was introduced in PR #8942. The following code is a reproducible example of the error:

Get the necessary files

Reference genome

if [ ! -f 'hg38.fa.gz' ]; then
    echo 'Downloading the reference genome'
    wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/latest/hg38.fa.gz
fi

sha256sum 'hg38.fa.gz'
c1dd87068c254eb53d944f71e51d1311964fce8de24d6fc0effc9c61c01527d4  hg38.fa.gz

GTF file

if [ ! -f 'hg38.ncbiRefSeq.gtf.gz' ]; then
    echo 'Downloading the GTF file'
    wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz
fi

sha256sum 'hg38.ncbiRefSeq.gtf.gz'
856919cfc5854079e70dd016048045092fd79b782aa8da9dbbd1c51a9046d8a4  hg38.ncbiRefSeq.gtf.gz

Prepare files

Unpack the compressed files

gunzip --keep 'hg38.ncbiRefSeq.gtf.gz' 'hg38.fa.gz'

Create the dict file

./gatk-4.6.1.0/gatk CreateSequenceDictionary \
                    --REFERENCE 'hg38.fa' \
                    --VERBOSITY WARNING
[Thu Feb 27 12:20:49 EET 2025] CreateSequenceDictionary --VERBOSITY WARNING --REFERENCE hg38.fa --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Thu Feb 27 12:20:49 EET 2025] Executing as mehrad@pamp-precision-tower on Linux 6.12.16-1-lts amd64; OpenJDK 64-Bit Server VM 23.0.2; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.6.1.0
[Thu Feb 27 12:21:00 EET 2025] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=3816816640

Convert GTF to BED

./gatk-4.6.1.0/gatk GtfToBed \
                    --gtf-path 'hg38.ncbiRefSeq.gtf' \
                    --sequence-dictionary 'hg38.dict' \
                    --output 'blah.bed' \
                    --verbosity WARNING
Using GATK jar /home/mehrad/tmp/gatk-4.6.1.0/gatk-package-4.6.1.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/mehrad/tmp/gatk-4.6.1.0/gatk-package-4.6.1.0-local.jar GtfToBed --gtf-path hg38.ncbiRefSeq.gtf --sequence-dictionary hg38.dict --output blah.bed --verbosity WARNING
SLF4J(W): Class path contains multiple SLF4J providers.
SLF4J(W): Found provider [org.apache.logging.slf4j.SLF4JServiceProvider@4ee8051c]
SLF4J(W): Found provider [ch.qos.logback.classic.spi.LogbackServiceProvider@53125718]
SLF4J(W): See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J(I): Actual provider is of type [org.apache.logging.slf4j.SLF4JServiceProvider@4ee8051c]
[February 27, 2025, 12:26:04 PM EET] org.broadinstitute.hellbender.tools.walkers.conversion.GtfToBed done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=134217728
java.lang.NullPointerException: Cannot invoke "org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfGeneFeature.addTranscript(org.broadinstitute.hellbender.utils.codecs.gtf.GencodeGtfTranscriptFeature)" because "gene" is null
	at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.aggregateRecordsIntoGeneFeature(AbstractGtfCodec.java:339)
	at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:170)
	at org.broadinstitute.hellbender.utils.codecs.gtf.AbstractGtfCodec.decode(AbstractGtfCodec.java:23)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.readNextRecord(TribbleIndexedFeatureReader.java:377)
	at htsjdk.tribble.TribbleIndexedFeatureReader$WFIterator.<init>(TribbleIndexedFeatureReader.java:344)
	at htsjdk.tribble.TribbleIndexedFeatureReader.iterator(TribbleIndexedFeatureReader.java:311)
	at org.broadinstitute.hellbender.engine.FeatureDataSource.iterator(FeatureDataSource.java:531)
	at java.base/java.lang.Iterable.spliterator(Unknown Source)
	at org.broadinstitute.hellbender.utils.Utils.stream(Utils.java:1182)
	at org.broadinstitute.hellbender.engine.FeatureWalker.traverse(FeatureWalker.java:97)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1119)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:150)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:203)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:222)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
	at org.broadinstitute.hellbender.Main.main(Main.java:306)

@gokalpcelik (Contributor)

Hi @mmahmoudian
This tool is written to work with GENCODE-style GTF files. The UCSC GTF file that you provided lacks the gene-level entries the tool needs to build its gene map, which it then uses to sort and prioritize records based on the tags provided in the GTF. Ignoring the missing gene-level entries and simply creating a BED file from the GTF coordinates is not how this tool is implemented, so you may need to convert that GTF to BED yourself using Python or another scripting language, or use the options provided by the UCSC Table Browser to export the RefSeq table in BED format.
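
As a rough sketch of the scripting route, the awk snippet below turns the "transcript" records of a GTF into BED6 lines. The function name `gtf_transcripts_to_bed` and the sample record are made up for illustration, and the sketch assumes the file has tab-separated columns with a `transcript_id "..."` attribute in column 9. It does not reproduce the sorting or tag-based prioritization that GtfToBed performs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): write one BED6 line per
# "transcript" record of a GTF file.
gtf_transcripts_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip comment/header lines
         $3 == "transcript" {
             name = "."
             # Pull the value out of: transcript_id "VALUE"
             if (match($9, /transcript_id "[^"]+"/)) {
                 name = substr($9, RSTART + 15, RLENGTH - 16)
             }
             # GTF is 1-based inclusive; BED is 0-based half-open,
             # so only the start coordinate shifts by one.
             print $1, $4 - 1, $5, name, 0, $7
         }' "$1"
}

# Demonstration on a tiny made-up record (not taken from the real file).
sample_gtf="$(mktemp)"
printf 'chr1\tdemo\ttranscript\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$sample_gtf"
printf 'chr1\tdemo\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' >> "$sample_gtf"
bed_output="$(gtf_transcripts_to_bed "$sample_gtf")"
printf '%s\n' "$bed_output"
rm -f "$sample_gtf"
```

On the real file this would be called as `gtf_transcripts_to_bed hg38.ncbiRefSeq.gtf > transcripts.bed`, where the output name is arbitrary.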

I hope this helps.

Regards

@mmahmoudian (Author)

@gokalpcelik thanks for the explanation. Considering that this requirement is not mentioned in the documentation (or at least my colleague and I missed it if it is there), and that it is generally not good practice to show ambiguous errors to the user, may I suggest:

  1. Update the documentation (website and --help) to clarify which kind of GTF file is suitable for this tool.
  2. Add a step in GtfToBed that first checks and validates the input, and produces a clear, user-friendly error when the input does not meet the tool's expectations.
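
Until something like suggestion 2 exists in the tool itself, a user-side pre-flight check along these lines might catch the problem before GtfToBed is run. This is only a sketch: the function name `check_gtf_has_genes` and the error wording are invented here, and the check looks solely for "gene" records in column 3, which is the requirement described in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of GATK): fail with a readable
# message when a GTF file has no gene-level records.
check_gtf_has_genes() {
    local gtf="$1"
    local gene_count
    # Count non-comment lines whose feature type (column 3) is "gene".
    gene_count="$(awk -F '\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$gtf")"
    if [ "$gene_count" -eq 0 ]; then
        echo "ERROR: '$gtf' contains no 'gene' records; GtfToBed expects a GENCODE-style GTF." >&2
        return 1
    fi
    echo "OK: found $gene_count gene record(s) in '$gtf'."
}

# Demonstration on a made-up file that only has a transcript record.
no_gene_gtf="$(mktemp)"
printf 'chr1\tdemo\ttranscript\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n' > "$no_gene_gtf"
status="passed"
check_gtf_has_genes "$no_gene_gtf" || status="rejected"
echo "$status"
rm -f "$no_gene_gtf"
```

In a pipeline this could guard the conversion step, for example `check_gtf_has_genes hg38.ncbiRefSeq.gtf && ./gatk-4.6.1.0/gatk GtfToBed ...` with the same arguments as in the report above.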
