Mike Weigand edited this page Feb 28, 2024 · 27 revisions

Overview

T O S T A D A S
Toolkit for Open Sequence Triage, Annotation, and DAtabase Submission

A portable, open-source pipeline designed to streamline submission of pathogen genomic data to public repositories. Reducing barriers to timely data submission increases the value of public repositories for both public health decision-making and scientific research. TOSTADAS facilitates routine sequence submission by standardizing:

[1] Metadata Validation
[2] Genome Annotation
[3] File Submission

The current distribution has been tested with sequence data from poxviruses and select bacteria, but TOSTADAS is designed to be flexible, modular, and pathogen-agnostic, allowing users to customize their submission of raw read data, assembled genomes, or both.

Pipeline Summary

[1] Metadata Validation

Verifies that user-provided metadata conforms to NCBI standards and matches the sequence data file(s). Metadata and file references are organized in an easy-to-edit Excel spreadsheet. By default, TOSTADAS uses a set of metadata fields appropriate for most pathogen genomic data submissions, but it can be configured to accommodate custom metadata fields specific to any use case. A full guide to using custom metadata fields can be found here: Custom Metadata Guide
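The kind of check described above can be illustrated with a minimal Python sketch. It is not the TOSTADAS implementation: the column names (`sample_name`, `organism`, `collection_date`, `fasta_path`) and the function `validate_rows` are hypothetical, and the rows are assumed to have already been read from the spreadsheet into dictionaries (for example with `pandas.read_excel`).

```python
from pathlib import Path

# Hypothetical required columns; the real TOSTADAS field set may differ.
REQUIRED_FIELDS = ("sample_name", "organism", "collection_date")


def validate_rows(rows, base_dir="."):
    """Return a list of human-readable problems found in metadata rows."""
    problems = []
    for index, row in enumerate(rows, start=1):
        # Flag required fields that are missing or blank.
        for field in REQUIRED_FIELDS:
            if not str(row.get(field) or "").strip():
                problems.append(f"row {index}: missing {field}")
        # Flag sequence files that are listed but not present on disk.
        fasta = str(row.get("fasta_path") or "").strip()
        if fasta and not (Path(base_dir) / fasta).is_file():
            problems.append(f"row {index}: file not found: {fasta}")
    return problems
```

An empty list means every row passed both checks. Collecting all problems before reporting, rather than stopping at the first one, lets a user fix the spreadsheet in a single pass.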

[2] Genome Annotation

Optional gene calling and feature annotation of assembled genomes (FASTA) using one of the following:

  1. RepeatMasker and Liftoff (viral)

    • Optimized for variola genomes, this workflow combines RepeatMasker for annotating repeat motifs and Liftoff for annotating functional regions. Execution requires a reference genome (FASTA) and a feature list (GFF3). Modifications are likely necessary for use with other pathogens.
  2. VADR (viral)

  3. Bakta (bacterial)

    • Annotates bacterial genomes and plasmids using Bakta. Execution requires a reference database (found here).

All annotation options produce a general feature format file (GFF) and NCBI feature table (TBL) compatible with downstream NCBI submission requirements.
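As a quick sanity check on an annotation result, the GFF output can be summarized by feature type. The sketch below is an illustration only and is not part of TOSTADAS. It relies on the standard GFF3 layout of nine tab-separated columns with the feature type in the third column; the function name `count_feature_types` is hypothetical.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count GFF3 features by type (column 3)."""
    counts = Counter()
    for line in gff_text.splitlines():
        # Skip blank lines, comments, and directives such as ##gff-version.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # ignore rows that are not feature lines
        counts[columns[2]] += 1
    return counts
```

A genome with zero `gene` or `CDS` features after annotation would be worth inspecting before moving on to submission.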

[3] Submission

Prepares the necessary submission files for BioSample, SRA, and/or GenBank, depending on the provided inputs, and optionally uploads them to NCBI via FTP. This workflow was adapted from the SeqSender public database submission pipeline.
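One way to picture "depending on the provided inputs" is a small selection function. This is a sketch under stated assumptions, not the pipeline's actual logic: it assumes sample metadata goes to BioSample, raw reads go to SRA, and assembled genomes go to GenBank, and the function name `select_targets` and its parameters are hypothetical.

```python
def select_targets(has_metadata, has_reads, has_assembly):
    """Pick NCBI databases to prepare files for, based on available inputs."""
    targets = []
    if has_metadata:
        targets.append("BioSample")  # sample-level metadata
    if has_reads:
        targets.append("SRA")  # raw read data
    if has_assembly:
        targets.append("GenBank")  # assembled, annotated genomes
    return targets
```

For example, `select_targets(True, True, False)` returns `["BioSample", "SRA"]`, which corresponds to a submission of raw reads without an assembled genome.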