HighSPA - High Performance Selection Pressure Analysis Framework

HighSPA (High-Performance Selection Pressure Analysis) is a scalable framework built on Parsl, a parallel scripting library for orchestrating workflows in heterogeneous HPC environments. HighSPA runs selection pressure analyses on multiple datasets in parallel within a single run, distributing tasks across multiple computing nodes.

The HighSPA framework orchestrates a comprehensive sequence of bioinformatics tasks, automating data processing steps such as sequence alignment, format conversion, phylogenetic inference, and evolutionary model testing. The framework contains two workflows: ParslCodeML, which uses PAML's CodeML, and ParslHyPhy, which uses HyPhy.

Both workflows integrate multiple bioinformatics tools, efficiently orchestrated using the Parsl parallel scripting framework. This modular architecture provides portability, ease of use, and ease of maintenance, enabling component replacement and extensions to the framework. In the current implementation, all tasks, including those capable of multithreading, are executed in a single-threaded manner. Parsl’s task-based parallelism is utilized to ensure scalable and efficient execution across multiple computational nodes.
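The task-based parallelism described above can be illustrated with a small dependency-driven sketch. It uses Python's standard concurrent.futures rather than Parsl, and every function in it (align, infer_tree, convert_format, fit_model, run_pipeline, run_all) is a hypothetical stub that returns a label instead of running a real tool. It only shows the shape of the idea: steps that depend on the same alignment can overlap, and each model test is an independent task.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs: each returns a label instead of running a real tool.
def align(fasta):
    return f"aligned({fasta})"

def infer_tree(alignment):
    return f"tree({alignment})"

def convert_format(alignment):
    return f"phylip({alignment})"

def fit_model(model, alignment, tree):
    return f"{model}:{alignment}+{tree}"

def run_pipeline(fasta, models, tasks):
    """Run one dataset through the stub pipeline, submitting steps to `tasks`."""
    alignment = tasks.submit(align, fasta).result()
    # Both steps depend only on the alignment, so they may run concurrently.
    tree_future = tasks.submit(infer_tree, alignment)
    converted_future = tasks.submit(convert_format, alignment)
    tree, converted = tree_future.result(), converted_future.result()
    # One independent task per model.
    futures = [tasks.submit(fit_model, m, converted, tree) for m in models]
    return [f.result() for f in futures]

def run_all(fastas, models, max_workers=4):
    """Process every dataset; driver threads wait, the task pool does the work."""
    with ThreadPoolExecutor(max_workers=max_workers) as tasks, \
         ThreadPoolExecutor(max_workers=max(1, len(fastas))) as drivers:
        jobs = {f: drivers.submit(run_pipeline, f, models, tasks) for f in fastas}
        return {f: job.result() for f, job in jobs.items()}
```

The driver threads are kept in a separate pool so that a pipeline waiting on its own tasks never occupies a worker slot those tasks need.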

The framework takes as input a folder containing a set of files, each comprising multiple genetic sequences in multi-FASTA format, along with the specified workflow configuration. It is then executed for each file individually, generating the corresponding outputs while preserving the original folder hierarchy.
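As a rough illustration of how an input folder could be mapped to outputs that preserve the same hierarchy, the sketch below walks a directory tree and pairs each FASTA file with a mirrored output directory. The function name plan_outputs, the list of accepted extensions, and the one-directory-per-input-file layout are assumptions made for this example, not the framework's documented behavior.

```python
from pathlib import Path

# Assumed FASTA extensions; the framework's actual filter may differ.
FASTA_SUFFIXES = {".fasta", ".fa", ".fas", ".fna"}

def plan_outputs(input_root, output_root):
    """Map each FASTA file under input_root to a mirrored output directory."""
    input_root, output_root = Path(input_root), Path(output_root)
    plan = {}
    for path in sorted(input_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            relative = path.relative_to(input_root)
            # Same relative location, with one directory per input file.
            plan[path] = output_root / relative.parent / relative.stem
    return plan
```

With this layout, an input at input_folder/cladeA/gene1.fasta would map to output_folder/cladeA/gene1.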

HighSPA-CodeML Workflow

The workflow starts by receiving a multi-FASTA file. MAFFT aligns the file and outputs a multiple sequence alignment. The alignment then feeds two activities that run in parallel: RAxML infers a maximum likelihood phylogenetic tree, along with the replicates used to calculate branch support, and the format phylip alignment activity converts the alignment to the CodeML input format. After that, both the phylogenetic tree and the formatted alignment are used as input to six CodeML processes, each applying a different codon substitution model (M0, M1, M2, M3, M7, and M8). Outputs are organized into model-specific directories for systematic analysis.
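To give a sense of what the format conversion step involves, here is a minimal sketch that turns an aligned FASTA text into sequential PHYLIP, with each name separated from its sequence by two spaces. It is not the framework's own converter: the function names read_fasta and fasta_to_phylip are hypothetical, and the choice to keep only the first word of each header as the sequence name is an assumption.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Keep only the first word of the header; fall back to a counter.
            name = header[0] if header else f"seq{len(records) + 1}"
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

def fasta_to_phylip(text):
    """Convert aligned FASTA text to sequential PHYLIP text."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # Header line: number of sequences and alignment length.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name}  {seq}")
    return "\n".join(lines) + "\n"
```

The length check matters because an unaligned input would otherwise produce a header that does not match the sequences that follow it.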

HighSPA-HyPhy Workflow

The execution of the ParslHyPhy workflow is similar to its counterpart; however, in this workflow, the output generated by MAFFT does not require reformatting before being used as input for HyPhy. The first activity, MAFFT, aligns the input multi-FASTA file. Subsequently, RAxML is executed to infer the phylogenetic tree. The outputs from both MAFFT and RAxML are then used as inputs for six distinct HyPhy analyses, each employing a different codon-based model designed to detect specific signatures of natural selection: SLAC, FEL, MEME, NY, FUBAR, and aBSREL.

Installation

Requirements

Python 3.8 or later
Parsl
MAFFT
PAML (codeml)
RAxML
HyPhy
Additional Python dependencies (see requirements.txt)

Setup

Clone this repository:

git clone https://github.com/karyocana/HighSPA.git
cd HighSPA

Create and activate a virtual environment:

python3 -m venv parsl_env
source parsl_env/bin/activate

Install Python dependencies:

pip install -r requirements.txt

Configure external tools:

Ensure MAFFT, RAxML, codeml and HyPhy are correctly installed.

Update the executables.json file with the path and name of each software binary. If a binary is in the system's PATH, its path value can be left empty (or omitted), as in the following example:

  {
      "mafft": {
          "path": "",
          "executable": "mafft"
      },
      "raxml": {
          "path": "",
          "executable": "raxmlHPC"
      },
      "codeml": {
          "path": "",
          "executable": "codeml"
      },
      "hyphy": {
          "path":"",
          "executable": "hyphy"
      }
  }
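One way to read a configuration like the one above is sketched below. The function name load_commands is hypothetical, and the rule it applies (join path and executable when a path is given, otherwise rely on the system's PATH) is an interpretation of the description above rather than the framework's actual code.

```python
import json
import os

def load_commands(config_text):
    """Return {tool: command} from an executables.json-style document."""
    config = json.loads(config_text)
    commands = {}
    for tool, entry in config.items():
        path = entry.get("path", "")
        executable = entry["executable"]
        # An empty or missing path means the binary is looked up on PATH.
        commands[tool] = os.path.join(path, executable) if path else executable
    return commands
```

For example, an entry with "path": "/opt/raxml" and "executable": "raxmlHPC" would resolve to /opt/raxml/raxmlHPC on a POSIX system, while an entry with an empty path resolves to the bare executable name.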

Usage

The framework is currently configured to run on local machines or on clusters that use the SLURM workload manager.

Local machine

To run the framework on a local machine, first activate the virtual environment (or the conda environment, if used) and then execute the HighSPA.py script with the following arguments:

-t/--threads: the maximum number of threads used by the framework;
-i/--input: the folder containing the FASTA files used by the framework. The framework scans the folder recursively for FASTA files;
-o/--output: the folder used to store the outputs.

Example:

python HighSPA.py -t 12 -i input_folder -o output_folder
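For readers who want to see how the three local-mode arguments fit together, the following argparse sketch mirrors them. Only the flag names come from the list above; the function name build_parser, the default of one thread, and the choice to make the input and output folders required are assumptions, and HighSPA.py itself may define them differently.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented local-mode arguments."""
    parser = argparse.ArgumentParser(prog="HighSPA.py")
    # Default of 1 is an assumption for this sketch.
    parser.add_argument("-t", "--threads", type=int, default=1,
                        help="maximum number of threads used by the framework")
    parser.add_argument("-i", "--input", required=True,
                        help="folder scanned recursively for FASTA files")
    parser.add_argument("-o", "--output", required=True,
                        help="folder used to store the outputs")
    return parser
```

Parsing the example command line above with this parser yields threads=12, input="input_folder", and output="output_folder".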

SLURM Cluster

The first step to run the framework in a SLURM cluster is to prepare the SBATCH script, as presented below.

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=48
#SBATCH -p desired_partition
#SBATCH --exclusive
#SBATCH -J HighSPA
#SBATCH --time=00:20:00
#SBATCH -e slurm-%j.err
#SBATCH -o slurm-%j.out

module load anaconda3/2024.02_sequana
eval "$(conda shell.bash hook)"
CONDA_ENV="/path/to/conda/env"
conda activate ${CONDA_ENV}
CDIR="/path/to/HighSPA"
INPUT_FOLDER="${CDIR}/examples/inputs"
OUTPUT_FOLDER="${CDIR}/output"
EXECUTABLES="${CDIR}/executables.json"
ENV_FILE="${CDIR}/path/to/env_file"
mkdir -p $OUTPUT_FOLDER

export CONDA_ENV
python HighSPA.py -i ${INPUT_FOLDER} -o ${OUTPUT_FOLDER} -e ${EXECUTABLES} --env ${ENV_FILE} --onslurm

Notice that some extra arguments are necessary. They are:

--onslurm: flag that tells Parsl to execute using the HighThroughputExecutor on one or more nodes. Important: the number of nodes and threads is obtained from the SLURM environment variables;
--env: plain text file containing the environment variables and everything else that should be loaded on the worker nodes. This file needs to contain the conda activation steps (if used) and all the modules that need to be loaded. For example:

module load mafft
module load raxml
module load anaconda3/2024.02_sequana
eval "$(conda shell.bash hook)"
conda activate $CONDA_ENV
export PYTHONPATH=$PYTHONPATH:$PWD

How to choose between CodeML and HyPhy

The default workflow is ParslCodeML. To run HighSPA with HyPhy instead, pass the --hyphy argument.

Monitoring

Parsl's monitoring module can be enabled with the -m/--monitoring argument.

Troubleshooting

Ensure all external tools are installed and accessible;
Verify file permissions for input and output directories;
Check error logs in the stderr directory under the output path for details on failed tasks;
If you find a bug in the framework, please open an issue here on this repository.
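When checking the error logs, a quick way to narrow the search is to list only the non-empty files in the stderr directory. The sketch below assumes the logs sit in a directory named stderr directly under the output path, as described above; the function name nonempty_stderr_logs is hypothetical. Some tools write progress messages to stderr, so a non-empty file is a hint rather than proof of failure.

```python
from pathlib import Path

def nonempty_stderr_logs(output_root):
    """Return non-empty files under <output_root>/stderr, sorted by path."""
    stderr_dir = Path(output_root) / "stderr"
    if not stderr_dir.is_dir():
        return []
    return sorted(
        p for p in stderr_dir.rglob("*")
        if p.is_file() and p.stat().st_size > 0
    )
```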

License

This project is licensed under the MIT License.

Acknowledgments

Developed using Parsl for parallel task execution.
Utilizes tools from the PAML suite, the HyPhy suite, MAFFT, and RAxML for bioinformatics analyses.
