Introduction to Bioinformatics on Tufts HPC#
Author: Shirley Li, xue.li37@tufts.edu
Date: 2025-02
Prerequisites#
Basic understanding of biology and bioinformatics
Access to an HPC cluster (e.g., login credentials, necessary software installations)
Bioinformatics modules#
On the cluster#
Use module avail to check the full list of tools available on the cluster. Below are some commonly used tools:
abcreg/0.1.0 kallisto/0.48.0 (D) orthofinder/2.5.5
abyss/2.3.7 kneaddata/0.12.0 pandaseq/2.11
alphafold/2.3.0 kraken2/2.1.3 parabricks/4.0.0-1
alphafold/2.3.1 krakentools/1.2 parabricks/4.2.1-1
alphafold/2.3.2 macs2/2.2.7.1
amplify/2.0.0 macs3/3.0.0a6 pepper_deepvariant/r0.8
angsd/0.939 masurca/4.0.9 petitefinder/cpu
angsd/0.940 (D) masurca/4.1.0 (D) picard/2.25.1
bakta/1.9.3 medaka/1.11.1 picard/2.26.10
bbmap/38.93 megahit/1.2.9 plink/1.90b6.21
bbmap/38.96 (D) meme/5.5.5 plink2/2.00a2.3
bbtools/39.00 metaphlan/4.0.2 polypolish/0.5.0
bcftools/1.13 metaphlan/4.0.6 (D) preseq/3.2.0
bcftools/1.14 miniasm/0.3_r179 prokka/1.14.6
bcftools/1.17 minimap2/2.26 (D) qiime2/2023.2
bcftools/1.20 (D) minipolish/0.1.3 qiime2/2023.5
beast2/2.6.3 mirdeep2/2.0.1.3 qiime2/2023.7
beast2/2.6.4 mirge3/0.1.4 qiime2/2023.9
beast2/2.6.6 (D) mothur/1.46.0 qiime2/2024.2
... ...
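To check whether a specific tool is installed and which version loads by default, the listing can be filtered from the shell. The sketch below assumes Lmod-style output, where the default version is followed by (D) and the listing is written to standard error. default_version is a hypothetical helper name, not a cluster command.

```shell
# Hypothetical helper: read module-avail-style text on stdin and print
# the version of <tool> that is marked as default with "(D)".
default_version() {
    awk -v tool="$1" '{
        for (i = 1; i <= NF; i++)
            if (index($i, tool "/") == 1 && $(i + 1) == "(D)")
                print $i
    }'
}

# On the cluster (assumes the listing goes to stderr, hence 2>&1):
#   module avail bcftools 2>&1 | default_version bcftools
#   module load bcftools/1.20    # load an explicit version
#   module list                  # confirm what is loaded
```

Loading an explicit version rather than relying on the default keeps a script reproducible if the default changes later.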
A few tips#
Before installing your own tools, check whether they are already available on the cluster using the module avail command.
Always be aware of the software versions, especially when using scripts from colleagues.
For less common tools, consider installing them yourself so that you have full control over the version, availability, and maintenance. Follow this tutorial to install your own tool
Using the Open OnDemand App#
You can access Open OnDemand through this link
Bioinformatics Apps#
We offer a wide range of bioinformatics tools as apps, including AlphaFold and CellProfiler. Additionally, 31 nf-core pipelines are available as apps for ease of use, with the most popular being nf-core/rnaseq, which we will demonstrate in our final workshop.
RStudio and Apps#
In RStudio Pax, use R/4.4.2, which has the most comprehensive set of packages installed (1300+).
How to initiate an R job
Log in to Open OnDemand.
Navigate to interactive apps and select RStudio Pax.
Specify the required resources:
Number of hours
Number of CPU cores
Amount of memory
CPU partition (set to batch)
R version (latest available: 4.4.2)
Click Launch to submit your job to the queue.
Wait a few minutes until your job starts running.
Click Connect to Rstudio server.
In RStudio, go to the Packages tab on the right to check the installed packages.
Installing R Packages
Refer to our previous workshop materials for detailed instructions on installing R packages.
Other Apps#
We also provide other applications like Jupyter Bioinfo, JupyterLab, Jupyter Notebook, IGV, and Galaxy to support your daily research activities.
nf-core pipelines#
On the cluster, use module avail nf-core to get the list of nf-core pipelines deployed on the cluster.
On Open OnDemand, you can go to bioinformatics apps to find out what has been installed.
Writing a Bioinformatics Job Script#
Let’s explore how to write a SLURM script, using STAR alignment as an example.
1. Prepare the SLURM Script#
Create a file named run_star.sh and add the following content:
#!/bin/bash
#SBATCH -J STAR_JOB # Job name
#SBATCH --time=12:00:00 # Maximum runtime (D-HH:MM:SS format)
#SBATCH -p batch # Partition (queue) to submit the job to
#SBATCH -n 1 # Number of tasks (1 task in this case)
#SBATCH --mem=32g # Memory allocation (32 GB)
#SBATCH --cpus-per-task=8 # Number of CPU cores allocated for the task
#SBATCH --output=STAR.%j.out # Standard output file (%j = Job ID)
#SBATCH --error=STAR.%j.err # Standard error file (%j = Job ID)
#SBATCH --mail-type=ALL # Notifications for job status (start, end, fail)
#SBATCH --mail-user=utln@tufts.edu # Your email address for notifications
# Load necessary module
module load star/2.7.11b
# Create output directory
mkdir -p star_output
# Run STAR alignment for single-end reads
STAR --genomeDir ./reference_data/reference_index/ \
--readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \
--outFileNamePrefix ./star_output/ \
--runThreadN 8
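The script above hard-codes 8 in both --cpus-per-task and --runThreadN, so the two values can drift apart when one is edited. One way to keep them in sync, sketched here, is to read the SLURM_CPUS_PER_TASK environment variable that SLURM sets when --cpus-per-task is requested. star_threads is a hypothetical helper name, and the fallback of 1 thread outside a job is an assumption.

```shell
# Hypothetical helper: number of threads to give STAR.
# Uses SLURM_CPUS_PER_TASK inside a job and falls back to 1 elsewhere.
star_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Inside run_star.sh the last STAR option would then become:
#   --runThreadN "$(star_threads)"
```

With this in place, changing --cpus-per-task in the header is enough to change the thread count STAR uses.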
2. Submit the Job Script#
If you’re running the script directly in the terminal, you need to make it executable first:
chmod +x run_star.sh
However, SLURM does not require execution permissions, so you can submit the job as-is using:
sbatch run_star.sh
3. Monitor Job Status#
Use the following command to check the job status:
squeue -u yourusername
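When several jobs are queued, it can help to track one job by its ID rather than listing everything for a user. The sketch below extracts the ID from the "Submitted batch job" message that sbatch prints. job_id_from_sbatch is a hypothetical helper name.

```shell
# Hypothetical helper: pull the job ID out of sbatch's
# "Submitted batch job <id>" message read from stdin.
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On the cluster:
#   jobid=$(sbatch run_star.sh | job_id_from_sbatch)
#   squeue -j "$jobid"     # status of this job only
#   scancel "$jobid"       # cancel it if something is wrong
```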
STAR Workflow Details#
Loading Modules#
Before using STAR, load the appropriate module:
module load star/2.7.11b
Generating the Genome Index#
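Before aligning reads, generate the genome index using a reference genome (.fa) and an annotation file (.gtf). The --genomeDir passed to the alignment step must point to the directory where this index was written: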
STAR --runMode genomeGenerate \
--genomeDir ./reference_data/ \
--genomeFastaFiles ./reference_data/chr1.fa \
--sjdbGTFfile ./reference_data/chr1-hg19_genes.gtf \
--runThreadN 8
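Alignment fails if --genomeDir does not contain a complete index, for example when index generation ran out of time or memory. A quick pre-flight check is sketched below. star_index_ready is a hypothetical helper, and the file names it looks for (Genome, SA, SAindex) are the core files STAR's genomeGenerate mode normally writes, so treat that list as an assumption.

```shell
# Hypothetical helper: succeed only if <dir> contains non-empty copies
# of the core STAR index files.
star_index_ready() {
    local dir="$1" f
    for f in Genome SA SAindex; do
        [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f" >&2; return 1; }
    done
}

# Usage before aligning:
#   star_index_ready ./reference_data/ || exit 1
```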
Aligning Reads#
For Single-End Reads:#
STAR --genomeDir ./reference_data/ \
--readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \
--outFileNamePrefix ./star_output/ \
--runThreadN 8
For Paired-End Reads:#
STAR --genomeDir ./reference_data/ \
--readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
--outFileNamePrefix ./star_output/ \
--runThreadN 8
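Each of the commands above writes to the same prefix, ./star_output/, so aligning a second sample would overwrite the Aligned.out.sam of the first. A per-sample prefix avoids this. The sketch below derives one from the FASTQ file name. sample_prefix is a hypothetical helper, and the naming scheme is only one possible choice.

```shell
# Hypothetical helper: derive a per-sample output prefix from a FASTQ path,
# stripping an optional .gz and then a .fastq or .fq extension.
sample_prefix() {
    local name
    name=$(basename "$1")
    name="${name%.gz}"
    name="${name%.fastq}"
    name="${name%.fq}"
    echo "./star_output/${name}_"
}

# Usage in a loop over single-end files:
#   for fq in ./raw_fastq/*.fq; do
#       STAR --genomeDir ./reference_data/ \
#            --readFilesIn "$fq" \
#            --outFileNamePrefix "$(sample_prefix "$fq")" \
#            --runThreadN 8
#   done
```

For ./raw_fastq/Irrel_kd_1.subset.fq this yields the prefix ./star_output/Irrel_kd_1.subset_, so the alignment lands in Irrel_kd_1.subset_Aligned.out.sam. Gzip-compressed inputs would additionally need STAR's --readFilesCommand option (for example, zcat).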
Output Files#
After running STAR, the output directory will contain:
Aligned.out.sam Log.final.out Log.out Log.progress.out SJ.out.tab
Aligned.out.sam: Contains alignment data.
Log.final.out: Summarizes alignment metrics.
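The uniquely mapped read percentage in Log.final.out is usually the first number to check after a run. The sketch below extracts it, assuming the usual "label | value" line layout of that file. unique_map_rate is a hypothetical helper name.

```shell
# Hypothetical helper: print the "Uniquely mapped reads %" value from a
# STAR Log.final.out file. Assumes the "label | value" line layout.
unique_map_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
        gsub(/[ \t]/, "", $2)   # drop the padding around the value
        print $2
    }' "$1"
}

# Usage:
#   unique_map_rate ./star_output/Log.final.out
```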
Additional Tips#
Always test commands interactively before incorporating them into job scripts.
Use the SLURM --time, --mem, and --cpus-per-task options to optimize resource allocation.
Check the SLURM output and error files for troubleshooting.
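To choose sensible --mem and --time values, it helps to compare what a finished job actually used against what was requested. SLURM's sacct command reports this. The sketch below also converts the MaxRSS value to GiB for easier comparison with --mem. rss_to_gib is a hypothetical helper, it assumes MaxRSS carries a K, M, or G suffix, and 123456 is a placeholder job ID.

```shell
# Hypothetical helper: convert a sacct MaxRSS value such as "33554432K"
# to GiB with one decimal. Values with a G suffix are printed unchanged.
rss_to_gib() {
    awk '{
        n = $1 + 0                       # numeric part of the value
        u = substr($1, length($1), 1)    # unit suffix (last character)
        if (u == "K") n /= 1024 * 1024
        else if (u == "M") n /= 1024
        printf "%.1f\n", n
    }'
}

# On the cluster, after the job has finished:
#   sacct -j 123456 --format=JobID,State,Elapsed,ReqMem,MaxRSS,AllocCPUS
#   echo 33554432K | rss_to_gib
```

If the reported peak memory is far below the request, lowering --mem for the next run usually shortens the time the job waits in the queue.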
Running Jobs on a GPU Node#
Interactive session#
srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:1 --pty /bin/bash
You can also specify which type of GPU you would like to run jobs on, for example an A100:
srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:a100:1 --pty /bin/bash
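Once the interactive session starts, it is worth confirming that a GPU was actually allocated before launching a long command. The sketch below counts the entries in CUDA_VISIBLE_DEVICES. visible_gpus is a hypothetical helper, and it assumes SLURM exports that variable for --gres=gpu jobs on this cluster.

```shell
# Hypothetical helper: count the GPUs visible to the current shell by
# counting comma-separated entries in CUDA_VISIBLE_DEVICES.
visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{ print NF }'
    fi
}

# Inside the interactive session:
#   visible_gpus     # number of GPUs allocated to this session
#   nvidia-smi       # model, memory, and utilization of each GPU
```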
Submit jobs to queue#
Example script: align.sh, using Parabricks to do the alignment.
#!/bin/bash
#SBATCH -J fq2bam_alignment # Job name
#SBATCH -p preempt # Submit to the 'preempt' partition (modify based on your cluster setup)
#SBATCH --gres=gpu:1 # Request 1 GPU for accelerated processing
#SBATCH -n 1 # Number of tasks (pbrun runs as a single task)
#SBATCH --mem=60g # Memory allocation (60GB)
#SBATCH --time=02:00:00 # Maximum job run time (2 hours)
#SBATCH --cpus-per-task=20 # Number of CPU cores allocated per task
#SBATCH --output=alignment.%j.out # Standard output file (with job ID %j)
#SBATCH --error=alignment.%j.err # Standard error file (with job ID %j)
#SBATCH --mail-type=ALL # Email notifications for all job states (begin, end, fail)
#SBATCH --mail-user=utln@tufts.edu # Email address for notifications
# Show GPU information (optional, useful for logging)
nvidia-smi
# Load the Parabricks module for GPU-accelerated alignment
module load parabricks/4.0.0-1
# Define variables
genome_reference="/path/to/reference_genome" # Path to the reference genome (.fasta)
input_fastq1="/path/to/input_read1.fastq" # Path to the first paired-end FASTQ file
input_fastq2="/path/to/input_read2.fastq" # Path to the second paired-end FASTQ file
sample_name="sample_identifier" # Sample identifier
known_sites_vcf="/path/to/known_sites.vcf" # Known sites VCF file for BQSR (optional, if available)
output_directory="/path/to/output_directory" # Directory for the output BAM file and reports
output_bam="${output_directory}/${sample_name}.bam" # Output BAM file path
output_bqsr_report="${output_directory}/${sample_name}.BQSR-report.txt" # Output BQSR report path
# Run the Parabricks fq2bam alignment pipeline.
# A comment cannot follow a line-continuation backslash in bash, so the options are described here:
#   --ref             reference genome (.fasta)
#   --in-fq           input paired-end FASTQ files
#   --read-group-sm   sample name for the read group
#   --knownSites      known sites for BQSR
#   --out-bam         output BAM file
#   --out-recal-file  output Base Quality Score Recalibration (BQSR) report
mkdir -p "${output_directory}"
pbrun fq2bam \
    --ref "${genome_reference}" \
    --in-fq "${input_fastq1}" "${input_fastq2}" \
    --read-group-sm "${sample_name}" \
    --knownSites "${known_sites_vcf}" \
    --out-bam "${output_bam}" \
    --out-recal-file "${output_bqsr_report}"
Here are the commands to submit the job:
chmod +x align.sh # Optional: sbatch does not require the script to be executable
sbatch align.sh # Submits the script to the SLURM queue
Use squeue -u yourusername to check job status.
Additional Resources#
Datasets#
In bioinformatics, it’s common to download databases or reference genomes from public websites. For example, performing sequence alignment requires downloading the appropriate reference genome. To simplify this process, we have pre-downloaded and managed several databases/datasets for users. These include:
New user guide#
In early 2025, we launched a new RT Guides website, offering comprehensive resources on a wide range of topics, including but not limited to HPC, data science, and, most importantly, bioinformatics. We keep up with the latest trends and regularly update our materials to reflect new developments. We highly recommend bookmarking the website and referring to it whenever you encounter challenges. Your feedback is invaluable, so let us know if you spot any errors or have suggestions.
For updates on bioinformatics education, software, and tools, consider subscribing to our e-list: best@elist.tufts.edu.