Yersinia pestis dashboard

Authors

Benjamin TOURNAN

Juliette GUTIERREZ

Nicolas REQUIN

Published

November 30, 2023

Check out the final project

To access the dashboard, simply click on this link.

Detailed descriptions of the bioinformatic analyses can be found in the following tutorial.

Bioinformatics tutorial

This tutorial outlines the steps for SRA data retrieval, FastQC analysis, and PALEOMIX analysis, including reference genome preparation and execution scripts. Adjust the paths and parameters according to your specific setup.

SRA Data Retrieval

In this section, we’ll walk through the steps to retrieve sequence data from the Sequence Read Archive (SRA) using a Bash script (scr_SRA.sh). The script uses the SRA Toolkit to download and convert SRA files to FASTQ format.

Navigate to your project directory and create or open the script in the Nano text editor.

nano scr_SRA.sh

In this script, change the job name, the email address and the two variables : Ref and Num_Accession.

#!/bin/bash
#SBATCH -J Spyrou_19_SRA
#SBATCH --mem=99G
#SBATCH --cpus-per-task=32
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr

Ref="Spyrou_19"
Num_Accession="PRJEB29990"

# Load modules
module load bioinfo/SRA-Toolkit/3.0.2

# Create a new directory that will contain the sequences
mkdir -p $Ref
cd $Ref || exit # Checks if the directory already exists

# Download SRA data
prefetch $Num_Accession || { echo "prefetch failed"; exit 1; }

# Bring .sra out of the intermediate files
find . -mindepth 2 -type f -exec mv -t . {} +

# Convert SRA source files to fastQ sequence files
fasterq-dump *.sra || { echo "fasterq-dump failed"; exit 1; }

# Remove .sra files and empty sub-directories
rm *.sra
find . -mindepth 1 -maxdepth 1 -type d -exec rm -r {} \;

Once you’ve made the script executable using the chmod +x command, run the script.

chmod +x scr_SRA.sh
sbatch scr_SRA.sh

You need to separate single-end and paired-end files in different sub-directories. Use the following commands to do so.

cd Spyrou_19
mkdir ../Spyrou_19_paired
mv *_1.fastq ../Spyrou_19_paired
mv *_2.fastq ../Spyrou_19_paired

# Rename the current directory Spyrou_19_single.
cd ..
mv Spyrou_19 Spyrou_19_single

FastQC Analysis

Navigate to your project directory and create or open the script in the Nano text editor.

nano scr_fastqc.sh

In this script, change the job name, the email address and the directory.

#!/bin/bash
#SBATCH -J Spyrou_19_fastqc
#SBATCH --mem=64G
#SBATCH --cpus-per-task=80
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr

# Load module
module load bioinfo/FastQC/0.12.1
module load bioinfo/MultiQC/1.14

# Change directory
cd "$(pwd)/Spyrou_19_paired"

# Launch analysis
for fichier in *.fastq
do
    fastqc "$fichier"
done

multiqc . -o ..

rm *fastqc.zip
rm *fastqc.html

Once you’ve made the script executable using the chmod +x command, run the script.

chmod +x scr_fastqc.sh
sbatch scr_fastqc.sh

Open a new terminal window and execute the following command to retrieve the report from Genotoul.

scp -r -p geranium@genobioinfo.toulouse.inrae.fr:/home/geranium/work/PJ_M2_ASO/Spyrou_19_paired/multiqc_report.html .

Seqtk treatment

To effectively handle paired-end sequences fused into a singular sequence, it is imperative to employ the Seqtk to create new paired-end reads.

Navigate to your project directory and create or open the script in the Nano text editor.

nano scr_seqtk.sh

In this script, change the job name, the email address and the two variables : Ref and sequence_length, you can find the sequence length in the MultiQC report.

#!/bin/bash
#SBATCH -J Bos_11_seqtk
#SBATCH --mem=64G
#SBATCH --cpus-per-task=80
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr

# Change this
Ref="Bos_11"
sequence_length="150"

# Move to the sequence directory
cd "$Ref"

# Trim length
trimlength=$((sequence_length / 2))

# Create the folder for the output files
output_folder="${Ref}_seqtk"
mkdir -p "$output_folder"

# Load the module
module load bioinfo/Seqtk/1.3

# Iterate through all FASTQ files in the directory
for file in *.fastq; do
    # Extract the file name without the directory path
    file_name=$(basename "$file")

    # Remove the file name extension
    base_name=$(echo "$file_name" | sed 's/\.[^.]*$//')

    # Apply seqtk and sed commands to each file
    seqtk trimfq -b "$trimlength" "$file" | sed "s/length=$sequence_length/length=$trimlength/g" > "${output_folder}/${base_name}_trimmed_1.fastq"
    seqtk trimfq -e "$trimlength" "$file" | sed "s/length=$sequence_length/length=$trimlength/g" > "${output_folder}/${base_name}_trimmed_2.fastq"
done

Once you’ve made the script executable using the chmod +x command, run the script.

chmod +x scr_seqtk.sh
sbatch scr_seqtk.sh

PALEOMIX Analysis

Reference Genome Preparation

This step is only necessary the first time you set up your analysis pipeline. It involves preparing the reference genome for subsequent analysis.

Navigate to your project directory and execute the following commands.

# Create a directory for Yersinia pestis reference files
mkdir -p Ref_Ypestis
cd Ref_Ypestis

# Download the genomic sequence file from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz

# Unzip the downloaded file
gunzip GCF_000009065.1_ASM906v1_genomic.fna.gz

# Split the genomic sequence file into individual FASTA files
csplit -z -f GCF_000009065.1_ -b %02d.fna GCF_000009065.1_ASM906v1_genomic.fna "/>/" '{*}'

# Rename the split files for better identification
mv GCF_000009065.1_00.fna GCF_000009065.1_chr.fasta
mv GCF_000009065.1_01.fna GCF_000009065.1_pCD1.fasta
mv GCF_000009065.1_02.fna GCF_000009065.1_pMT1.fasta
mv GCF_000009065.1_03.fna GCF_000009065.1_pPCP1.fasta

# Load required modules
module load devel/Miniconda/Miniconda3
module load bioinfo/PALEOMIX/1.3.8

# Navigate to the home directory
cd ~

# Create a directory for JAR files and create a symbolic link to Picard JAR file
mkdir -p ~/install/jar_root
ln -s /usr/local/bioinfo/src/Miniconda/Miniconda3/envs/paleomix-v1.3.7_env/share/picard-2.27.4-0/picard.jar ~/install/jar_root/

PALEOMIX Analysis Script

Before generating the makefile for the analysis pipeline, it’s crucial to create a list of libraries. This step is essential for providing the necessary information about the libraries that will be processed in the subsequent analysis.

Please navigate to the directory where the sequences are located and execute the corresponding command.

For paired-end data, each library consists of two files: one for the forward reads (usually denoted as *_1.fastq) and another for the reverse reads (denoted as *_2.fastq). To facilitate the makefile creation, generate a list of libraries by specifying the corresponding file paths for each library.

for file in *_1.fastq; do
    base=$(basename "$file" "_1.fastq")
    echo "    $base:"
    echo "      Lane_$base: $(readlink -f ${base}_{Pair}.fastq)"
done > libraries.txt

For single-end data, each library consists of one file. To facilitate the makefile creation, generate a list of libraries by specifying the corresponding file paths for each library.

for file in *.fastq; do
    base=$(basename "$file" ".fastq")
    echo "    $base:"
    echo "      Lane_$base: $(readlink -f ${base}.fastq)"
done > libraries.txt

Copy the content of libraries.txt onto your computer and save it for the next step, ensuring to maintain proper indentation.

Next, we generate the makefile, a configuration file for the PALEOMIX pipeline. The makefile contains instructions on how to process the data, including details about the reference genome, trimming parameters, and aligner settings. Generate the makefile within the directory where your earlier scripts are located.

# Load modules
module load devel/Miniconda/Miniconda3
module load bioinfo/PALEOMIX/1.3.8

# Retrieve the makefile template and modify it
paleomix bam_pipeline makefile > makefile.yaml
nano makefile.yaml

Once you’ve generated the makefile.yaml, utilize the Nano text editor to access it for required modifications. Adjust the mapping quality (Phred) to 25 within the BWA options. Subsequently, insert the specified Prefixes (ensure to update the path accordingly) and the contents of libraries.txt in the designated section.

# Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 25
# Map of prefixes by name, each having a Path, which specifies the location of the
# BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
Prefixes:
  CO92_chromosome:
    Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_chr.fasta
  pCD1:
    Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pCD1.fasta
  pMT1:
    Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pMT1.fasta
  pPCP1:
    Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pPCP1.fasta
# Mapping targets are specified using the following structure. Replace 'NAME_OF_TARGET'
# with the desired prefix for filenames.
Spyrou19_single:  # NAME_OF_TARGET
  Spyrou19_single:  # NAME_OF_SAMPLE (same as NAME_OF_TARGET)
  
# Paste the content of libraries.txt here
    ERR3457581:
      Lane_ERR3457581: /work/user/geranium/PJ_M2_ASO/Spyrou_19_single/ERR3457581.fastq
    ERR3457582:
      Lane_ERR3457582: /work/user/geranium/PJ_M2_ASO/Spyrou_19_single/ERR3457582.fastq
# ...

Prior to proceeding to the subsequent stage, conduct a dry run with the following Bash command.

paleomix bam_pipeline dryrun makefile.yaml

PALEOMIX Execution Script

In this final step, we set up and execute the paleogenomic analysis script to process the prepared data using the configured analysis pipeline. The script, named scr_paleomix.sh, is responsible for defining job parameters and initiating the analysis on a high-performance computing (HPC) environment.

Navigate to your project directory and create or open the script in the Nano text editor.

nano scr_paleomix.sh

In this script, change the job name, the email address, the output and the jar-root directories. Make sure that this script is in the same location as the makefile.yaml.

#!/bin/bash
#SBATCH -J Spyrou_19_paleomix
#SBATCH --mem=99G
#SBATCH --cpus-per-task=99
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr

paleomix bam_pipeline run --destination=Spyrou_19_paired_output --max-threads 99 --jre-option="-Xmx4g" --jar-root="/home/geranium/install/jar_root/" makefile.yaml
chmod +x scr_paleomix.sh
sbatch scr_paleomix.sh

As the final step, it is essential to review the summary file generated during the analysis. The summary file provides a comprehensive overview of key metrics, including read mapping statistics, coverage information, and potential post-mortem DNA damage assessments.

Reuse

CC BY-NC-ND