nano scr_SRA.sh
Yersinia pestis dashboard
Check out the final project
To access the dashboard, simply click on this link.
Detailed descriptions of the bioinformatic analyses can be found in the following tutorial.
Bioinformatics tutorial
This tutorial outlines the steps for SRA data retrieval, FastQC analysis, and PALEOMIX analysis, including reference genome preparation and execution scripts. Adjust the paths and parameters according to your specific setup.
SRA Data Retrieval
In this section, we’ll walk through the steps to retrieve sequence data from the Sequence Read Archive (SRA) using a Bash script (scr_SRA.sh
). The script uses the SRA Toolkit to download and convert SRA files to FASTQ format.
Navigate to your project directory and create or open the script in the Nano text editor.
In this script, change the job name, the email address and the two variables : Ref
and Num_Accession
.
#!/bin/bash
#SBATCH -J Spyrou_19_SRA
#SBATCH --mem=99G
#SBATCH --cpus-per-task=32
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr
Ref="Spyrou_19"
Num_Accession="PRJEB29990"
# Load modules
module load bioinfo/SRA-Toolkit/3.0.2
# Create a new directory that will contain the sequences
mkdir -p $Ref
cd $Ref || exit # Checks if the directory already exists
# Download SRA data
prefetch $Num_Accession || { echo "prefetch failed"; exit 1; }
# Bring .sra out of the intermediate files
find . -mindepth 2 -type f -exec mv -t . {} +
# Convert SRA source files to fastQ sequence files
fasterq-dump *.sra || { echo "fasterq-dump failed"; exit 1; }
# Remove .sra files and empty sub-directories
rm *.sra
find . -mindepth 1 -maxdepth 1 -type d -exec rm -r {} \;
Once you’ve made the script executable using the chmod +x
command, run the script.
chmod +x scr_SRA.sh
sbatch scr_SRA.sh
You need to separate single-end and paired-end files in different sub-directories. Use the following commands to do so.
cd Spyrou_19
mkdir ../Spyrou_19_paired
mv *_1.fastq ../Spyrou_19_paired
mv *_2.fastq ../Spyrou_19_paired
# Rename the current directory Spyrou_19_single.
cd ..
mv Spyrou_19 Spyrou_19_single
FastQC Analysis
Navigate to your project directory and create or open the script in the Nano text editor.
nano scr_fastqc.sh
In this script, change the job name, the email address and the directory.
#!/bin/bash
#SBATCH -J Spyrou_19_fastqc
#SBATCH --mem=64G
#SBATCH --cpus-per-task=80
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr
# Load module
module load bioinfo/FastQC/0.12.1
module load bioinfo/MultiQC/1.14
# Change directory
cd "$(pwd)/Spyrou_19_paired"
# Launch analysis
for fichier in *.fastq
do
fastqc "$fichier"
done
multiqc . -o ..
rm *fastqc.zip
rm *fastqc.html
Once you’ve made the script executable using the chmod +x
command, run the script.
chmod +x scr_fastqc.sh
sbatch scr_fastqc.sh
Open a new terminal window and execute the following command to retrieve the report from Genotoul.
scp -r -p geranium@genobioinfo.toulouse.inrae.fr:/home/geranium/work/PJ_M2_ASO/Spyrou_19_paired/multiqc_report.html .
Seqtk treatment
To effectively handle paired-end sequences fused into a singular sequence, it is imperative to employ the Seqtk
to create new paired-end reads.
Navigate to your project directory and create or open the script in the Nano text editor.
nano scr_seqtk.sh
In this script, change the job name, the email address and the two variables : Ref
and sequence_length
, you can find the sequence length in the MultiQC report.
#!/bin/bash
#SBATCH -J Bos_11_seqtk
#SBATCH --mem=64G
#SBATCH --cpus-per-task=80
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr
# Change this
Ref="Bos_11"
sequence_length="150"
# Move to the sequence directory
cd "$Ref"
# Trim length
trimlength=$((sequence_length / 2))
# Create the folder for the output files
output_folder="${Ref}_seqtk"
mkdir -p "$output_folder"
# Load the module
module load bioinfo/Seqtk/1.3
# Iterate through all FASTQ files in the directory
for file in *.fastq; do
# Extract the file name without the directory path
file_name=$(basename "$file")
# Remove the file name extension
base_name=$(echo "$file_name" | sed 's/\.[^.]*$//')
# Apply seqtk and sed commands to each file
seqtk trimfq -b "$trimlength" "$file" | sed "s/length=$sequence_length/length=$trimlength/g" > "${output_folder}/${base_name}_trimmed_1.fastq"
seqtk trimfq -e "$trimlength" "$file" | sed "s/length=$sequence_length/length=$trimlength/g" > "${output_folder}/${base_name}_trimmed_2.fastq"
done
Once you’ve made the script executable using the chmod +x
command, run the script.
chmod +x scr_seqtk.sh
sbatch scr_seqtk.sh
PALEOMIX Analysis
Reference Genome Preparation
This step is only necessary the first time you set up your analysis pipeline. It involves preparing the reference genome for subsequent analysis.
Navigate to your project directory and execute the following commands.
# Create a directory for Yersinia pestis reference files
mkdir -p Ref_Ypestis
cd Ref_Ypestis
# Download the genomic sequence file from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/009/065/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.fna.gz
# Unzip the downloaded file
gunzip GCF_000009065.1_ASM906v1_genomic.fna.gz
# Split the genomic sequence file into individual FASTA files
csplit -z -f GCF_000009065.1_ -b %02d.fna GCF_000009065.1_ASM906v1_genomic.fna "/>/" '{*}'
# Rename the split files for better identification
mv GCF_000009065.1_00.fna GCF_000009065.1_chr.fasta
mv GCF_000009065.1_01.fna GCF_000009065.1_pCD1.fasta
mv GCF_000009065.1_02.fna GCF_000009065.1_pMT1.fasta
mv GCF_000009065.1_03.fna GCF_000009065.1_pPCP1.fasta
# Load required modules
module load devel/Miniconda/Miniconda3
module load bioinfo/PALEOMIX/1.3.8
# Navigate to the home directory
cd ~
# Create a directory for JAR files and create a symbolic link to Picard JAR file
mkdir -p ~/install/jar_root
ln -s /usr/local/bioinfo/src/Miniconda/Miniconda3/envs/paleomix-v1.3.7_env/share/picard-2.27.4-0/picard.jar ~/install/jar_root/
PALEOMIX Analysis Script
Before generating the makefile for the analysis pipeline, it’s crucial to create a list of libraries. This step is essential for providing the necessary information about the libraries that will be processed in the subsequent analysis.
Please navigate to the directory where the sequences are located and execute the corresponding command.
For paired-end data, each library consists of two files: one for the forward reads (usually denoted as *_1.fastq
) and another for the reverse reads (denoted as *_2.fastq
). To facilitate the makefile creation, generate a list of libraries by specifying the corresponding file paths for each library.
for file in *_1.fastq; do
base=$(basename "$file" "_1.fastq")
echo " $base:"
echo " Lane_$base: $(readlink -f ${base}_{Pair}.fastq)"
done > libraries.txt
For single-end data, each library consists of one file. To facilitate the makefile creation, generate a list of libraries by specifying the corresponding file paths for each library.
for file in *.fastq; do
base=$(basename "$file" ".fastq")
echo " $base:"
echo " Lane_$base: $(readlink -f ${base}.fastq)"
done > libraries.txt
Copy the content of libraries.txt
onto your computer and save it for the next step, ensuring to maintain proper indentation.
Next, we generate the makefile, a configuration file for the PALEOMIX pipeline. The makefile contains instructions on how to process the data, including details about the reference genome, trimming parameters, and aligner settings. Generate the makefile within the directory where your earlier scripts are located.
# Load modules
module load devel/Miniconda/Miniconda3
module load bioinfo/PALEOMIX/1.3.8
# Retrieve the makefile template and modify it
paleomix bam_pipeline makefile > makefile.yaml
nano makefile.yaml
Once you’ve generated the makefile.yaml
, utilize the Nano text editor to access it for required modifications. Adjust the mapping quality (Phred) to 25 within the BWA options. Subsequently, insert the specified Prefixes
(ensure to update the path accordingly) and the contents of libraries.txt
in the designated section.
# Settings for mappings performed using BWA
BWA:
# One of "backtrack", "bwasw", or "mem"; see the BWA documentation
# for a description of each algorithm (defaults to 'backtrack')
Algorithm: backtrack
# Filter aligned reads with a mapping quality (Phred) below this value
MinQuality: 25
# Map of prefixes by name, each having a Path, which specifies the location of the
# BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
Prefixes:
CO92_chromosome:
Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_chr.fasta
pCD1:
Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pCD1.fasta
pMT1:
Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pMT1.fasta
pPCP1:
Path: /home/geranium/work/PJ_M2_ASO/Ref_Ypestis/GCF_000009065.1_pPCP1.fasta
# Mapping targets are specified using the following structure. Replace 'NAME_OF_TARGET'
# with the desired prefix for filenames.
Spyrou19_single: # NAME_OF_TARGET
Spyrou19_single: # NAME_OF_SAMPLE (same as NAME_OF_TARGET)
# Paste the content of libraries.txt here
ERR3457581:
Lane_ERR3457581: /work/user/geranium/PJ_M2_ASO/Spyrou_19_single/ERR3457581.fastq
ERR3457582:
Lane_ERR3457582: /work/user/geranium/PJ_M2_ASO/Spyrou_19_single/ERR3457582.fastq
# ...
Prior to proceeding to the subsequent stage, conduct a dry run with the following Bash command.
paleomix bam_pipeline dryrun makefile.yaml
PALEOMIX Execution Script
In this final step, we set up and execute the paleogenomic analysis script to process the prepared data using the configured analysis pipeline. The script, named scr_paleomix.sh
, is responsible for defining job parameters and initiating the analysis on a high-performance computing (HPC) environment.
Navigate to your project directory and create or open the script in the Nano text editor.
nano scr_paleomix.sh
In this script, change the job name, the email address, the output and the jar-root
directories. Make sure that this script is in the same location as the makefile.yaml
.
#!/bin/bash
#SBATCH -J Spyrou_19_paleomix
#SBATCH --mem=99G
#SBATCH --cpus-per-task=99
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=benjamin.tournan@univ-tlse3.fr
paleomix bam_pipeline run --destination=Spyrou_19_paired_output --max-threads 99 --jre-option="-Xmx4g" --jar-root="/home/geranium/install/jar_root/" makefile.yaml
chmod +x scr_paleomix.sh
sbatch scr_paleomix.sh
As the final step, it is essential to review the summary file generated during the analysis. The summary file provides a comprehensive overview of key metrics, including read mapping statistics, coverage information, and potential post-mortem DNA damage assessments.