HighSPA (High-Performance Selection Pressure Analysis) is a scalable framework built on Parsl, a parallel scripting library for orchestrating workflows in heterogeneous HPC environments. HighSPA runs selection pressure analyses on multiple datasets in parallel within a single run, distributing tasks across multiple computing nodes.
The HighSPA framework orchestrates a comprehensive sequence of bioinformatics tasks, automating data processing steps such as sequence alignment, format conversion, phylogenetic inference, and evolutionary model testing. The framework contains two workflows: ParslCodeML, which uses PAML's CodeML, and ParslHyPhy, which uses HyPhy.
Both workflows integrate multiple bioinformatics tools, efficiently orchestrated using the Parsl parallel scripting framework. This modular architecture provides portability, ease of use, and ease of maintenance, enabling component replacement and extensions to the framework. In the current implementation, all tasks, including those capable of multithreading, are executed in a single-threaded manner. Parsl’s task-based parallelism is utilized to ensure scalable and efficient execution across multiple computational nodes.
The framework takes as input a folder containing a set of files, each comprising multiple genetic sequences in multi-FASTA format, along with the specified workflow configuration. It is then executed for each file individually, generating the corresponding outputs while preserving the original folder hierarchy.
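The input handling described above can be sketched in a few lines. This is an illustration only, not HighSPA's code: the set of recognised file extensions, the one-folder-per-input layout, and both function names are assumptions.

```python
from pathlib import Path

# Assumed extensions; the real framework may recognise a different set.
FASTA_SUFFIXES = {".fasta", ".fa", ".faa", ".fna"}


def find_fasta_files(input_dir):
    """Recursively collect FASTA files under input_dir, sorted for a stable order."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )


def mirrored_output_dir(fasta_path, input_dir, output_dir):
    """Map an input file to an output folder that keeps its relative location."""
    relative = Path(fasta_path).relative_to(input_dir)
    # Hypothetical layout: one folder per input file, named after the file stem.
    return Path(output_dir) / relative.parent / relative.stem
```

Under these assumptions, an input at input_folder/clade1/geneA.fasta would map to output_folder/clade1/geneA.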
The ParslCodeML workflow starts by receiving a multi-FASTA file. The file is aligned using MAFFT, which outputs a multiple sequence alignment. The alignment is then used as input to two activities that run in parallel: RAxML infers a maximum likelihood phylogenetic tree and its replicates for branch support calculation, and the format phylip alignment activity converts the alignment to the CodeML input format. Both the phylogenetic tree and the formatted alignment are then used as input to six CodeML processes, each applying a distinct codon substitution model (M0, M1, M2, M3, M7 and M8). Outputs are organized into model-specific directories for systematic analysis.
The execution of the ParslHyPhy workflow is similar to its counterpart; however, in this workflow, the output generated by MAFFT does not require reformatting before being used as input for HyPhy. The first activity, MAFFT, aligns the input multi-FASTA file. Subsequently, RAxML is executed to infer the phylogenetic tree. The outputs from both MAFFT and RAxML are then used as inputs for six distinct HyPhy analyses, each employing a different codon-based model designed to detect specific signatures of natural selection: SLAC, FEL, MEME, NY, FUBAR, and aBSREL.
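The task dependencies of the two workflows can be written down as a small graph. The sketch below is plain Python rather than Parsl, the task names are made up for illustration, and the model and method labels are copied from the descriptions above; it only shows which tasks could run side by side for a single input file.

```python
def build_workflow(use_hyphy=False):
    """Return {task: [tasks it depends on]} for one input file."""
    deps = {"mafft": [], "raxml": ["mafft"]}
    if use_hyphy:
        # HyPhy consumes the MAFFT alignment directly, with no reformatting step.
        for method in ("SLAC", "FEL", "MEME", "NY", "FUBAR", "aBSREL"):
            deps[f"hyphy_{method}"] = ["mafft", "raxml"]
    else:
        deps["format_phylip"] = ["mafft"]
        for model in ("M0", "M1", "M2", "M3", "M7", "M8"):
            deps[f"codeml_{model}"] = ["raxml", "format_phylip"]
    return deps


def parallel_waves(deps):
    """Group tasks into waves; each task's dependencies lie in earlier waves."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(
            task for task, needs in deps.items()
            if task not in done and all(n in done for n in needs)
        )
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
    return waves
```

For ParslCodeML this gives three waves: MAFFT alone, then RAxML together with the format conversion, then the six CodeML runs. Parsl does not schedule in waves; it launches each task as soon as that task's own inputs are ready, so the waves are only a way to read the graph.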
- Python 3.8 or later
- Parsl
- MAFFT
- PAML (codeml)
- RAxML
- HyPhy
- Additional Python dependencies (see requirements.txt)
- Clone this repository:
git clone https://github.com/karyocana/HighSPA.git
cd HighSPA
- Create and activate a virtual environment:
python3 -m venv parsl_env
source parsl_env/bin/activate
- Install Python dependencies:
pip install -r requirements.txt
- Configure external tools:
- Ensure MAFFT, RAxML, codeml and HyPhy are correctly installed.
- Update the executables.json file with the path and name of each software binary. If the binaries are in the system's PATH, the path can be left empty, as shown in the following example:
{
  "mafft": { "path": "", "executable": "mafft" },
  "raxml": { "path": "", "executable": "raxmlHPC" },
  "codeml": { "path": "", "executable": "codeml" },
  "hyphy": { "path": "", "executable": "hyphy" }
}
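Before a long run it can help to confirm that every configured tool resolves to a real binary. The following is a minimal sketch, assuming the command is formed by joining each entry's path and executable; the helper names are hypothetical and are not part of HighSPA.

```python
import json
import os
import shutil


def load_executables(json_path):
    """Read the tool configuration from a JSON file such as executables.json."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def resolve_command(entry):
    """Join 'path' and 'executable'; an empty or missing path means 'use PATH'."""
    path = entry.get("path", "")
    executable = entry["executable"]
    return os.path.join(path, executable) if path else executable


def missing_tools(config):
    """Return the names of tools whose command cannot be found or executed."""
    return [
        name for name, entry in config.items()
        if shutil.which(resolve_command(entry)) is None
    ]
```

Calling missing_tools(load_executables("executables.json")) returns an empty list when all configured tools are reachable.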
The framework is currently configured to run on local machines or on clusters that use the SLURM resource and job manager.
To run the framework on a local machine, first activate the conda environment (if used) and then execute the HighSPA.py script with the following arguments:
- -t/--threads: the maximum number of threads used by the framework;
- -i/--input: the folder containing the FASTA files used by the framework. The framework scans the folder recursively looking for FASTA files;
- -o/--output: the folder used to store the outputs.
Example:
python HighSPA.py -t 12 -i input_folder -o output_folder
The first step to run the framework in a SLURM cluster is to prepare the SBATCH script, as presented below.
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=48
#SBATCH -p desired_partition
#SBATCH --exclusive
#SBATCH -J HighSPA
#SBATCH --time=00:20:00
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out
module load anaconda3/2024.02_sequana
eval "$(conda shell.bash hook)"
CONDA_ENV="/path/to/conda/env"
conda activate ${CONDA_ENV}
CDIR="/path/to/HighSPA"
INPUT_FOLDER="${CDIR}/examples/inputs"
OUTPUT_FOLDER="${CDIR}/output"
EXECUTABLES="${CDIR}/executables.json"
ENV_FILE="${CDIR}/path/to/env_file"
mkdir -p $OUTPUT_FOLDER
export CONDA_ENV
python HighSPA.py -i ${INPUT_FOLDER} -o ${OUTPUT_FOLDER} -e ${EXECUTABLES} --env ${ENV_FILE} --onslurm
Notice that some extra arguments are necessary:
- -e: path to the executables.json file (EXECUTABLES in the script above);
- --onslurm: flag that tells Parsl to execute using the HighThroughputExecutor on one or more nodes. Important: the number of nodes and threads is obtained from the SLURM environment variables;
- --env: plain text file containing the environment variables and everything else that should be loaded on the worker nodes. This file must contain the conda activation steps (if used) and all the modules that need to be loaded. For example:
module load mafft
module load raxml
module load anaconda3/2024.02_sequana
eval "$(conda shell.bash hook)"
conda activate $CONDA_ENV
export PYTHONPATH=$PYTHONPATH:$PWD
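As noted for --onslurm, node and thread counts come from the SLURM environment rather than from -t. Which variables HighSPA actually reads is not stated here; the sketch below assumes SLURM_JOB_NUM_NODES and SLURM_NTASKS_PER_NODE, which sbatch exports for a job like the one above, and uses a hypothetical function name.

```python
import os


def slurm_resources(env=None):
    """Read node and per-node task counts from SLURM variables, defaulting to 1."""
    env = os.environ if env is None else env
    nodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    # SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node was requested.
    tasks_per_node = int(env.get("SLURM_NTASKS_PER_NODE", "1"))
    return nodes, tasks_per_node
```

With the SBATCH script shown earlier (--nodes 1, --ntasks-per-node=48), this would return (1, 48) inside the job.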
The default workflow is ParslCodeML. To run HighSPA with HyPhy instead, pass the --hyphy argument.
Parsl's monitoring module can be enabled with the -m/--monitoring argument.
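For reference, the command-line options described in this section could be declared with argparse roughly as follows. This is a reconstruction from the text, not the parser in HighSPA.py: the long name --executables, the defaults, and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(prog="HighSPA.py")
    parser.add_argument("-t", "--threads", type=int, default=1,
                        help="maximum number of threads (local runs)")
    parser.add_argument("-i", "--input", required=True,
                        help="folder scanned recursively for FASTA files")
    parser.add_argument("-o", "--output", required=True,
                        help="folder used to store the outputs")
    # Only the short form -e appears in the README; the long name is assumed.
    parser.add_argument("-e", "--executables", default="executables.json",
                        help="JSON file describing the tool binaries")
    parser.add_argument("--env", default=None,
                        help="file with the setup commands for worker nodes")
    parser.add_argument("--onslurm", action="store_true",
                        help="run on SLURM nodes instead of locally")
    parser.add_argument("--hyphy", action="store_true",
                        help="run ParslHyPhy instead of ParslCodeML")
    parser.add_argument("-m", "--monitoring", action="store_true",
                        help="enable Parsl monitoring")
    return parser
```

For example, build_parser().parse_args(["-t", "12", "-i", "input_folder", "-o", "output_folder"]) matches the local invocation shown earlier.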
- Ensure all external tools are installed and accessible;
- Verify file permissions for input and output directories;
- Check error logs in the stderr directory under the output path for details on failed tasks;
- If you find a bug in the framework, please open an issue here on this repository.
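When many inputs are processed at once, a quick way to triage is to list the non-empty files in the stderr directory. The sketch below assumes only that the directory is named stderr and sits directly under the output path, as stated above; the function name is hypothetical.

```python
from pathlib import Path


def nonempty_stderr_logs(output_dir):
    """List non-empty files under <output_dir>/stderr, largest first."""
    stderr_dir = Path(output_dir) / "stderr"
    if not stderr_dir.is_dir():
        return []
    logs = [p for p in stderr_dir.rglob("*") if p.is_file() and p.stat().st_size > 0]
    return sorted(logs, key=lambda p: p.stat().st_size, reverse=True)
```

A non-empty log does not prove that a task failed, since some tools write progress messages to stderr, but it narrows down where to look first.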
This project is licensed under the MIT License.
- Developed using Parsl for parallel task execution.
- Utilizes tools from the PAML suite, HyPhy suite, MAFFT, and RAxML for bioinformatics analyses.