Introduction to Bioinformatics on Tufts HPC#

Author: Shirley Li, xue.li37@tufts.edu

Date: 2025-02

Prerequisites#

Bioinformatics modules#

On the cluster#

Use module avail to check the full list of tools available on the cluster. Below are some commonly used tools:

   abcreg/0.1.0                         kallisto/0.48.0                     (D)    orthofinder/2.5.5
   abyss/2.3.7                          kneaddata/0.12.0                           pandaseq/2.11
   alphafold/2.3.0                      kraken2/2.1.3                              parabricks/4.0.0-1
   alphafold/2.3.1                      krakentools/1.2                            parabricks/4.2.1-1
   alphafold/2.3.2                      macs2/2.2.7.1
   amplify/2.0.0                        macs3/3.0.0a6                              pepper_deepvariant/r0.8
   angsd/0.939                          masurca/4.0.9                              petitefinder/cpu
   angsd/0.940                   (D)    masurca/4.1.0                       (D)    picard/2.25.1
   bakta/1.9.3                          medaka/1.11.1                              picard/2.26.10
   bbmap/38.93                          megahit/1.2.9                              plink/1.90b6.21
   bbmap/38.96                   (D)    meme/5.5.5                                 plink2/2.00a2.3
   bbtools/39.00                        metaphlan/4.0.2                            polypolish/0.5.0
   bcftools/1.13                        metaphlan/4.0.6                     (D)    preseq/3.2.0
   bcftools/1.14                        miniasm/0.3_r179                           prokka/1.14.6
   bcftools/1.17                        minimap2/2.26                       (D)    qiime2/2023.2
   bcftools/1.20                 (D)    minipolish/0.1.3                           qiime2/2023.5
   beast2/2.6.3                         mirdeep2/2.0.1.3                           qiime2/2023.7
   beast2/2.6.4                         mirge3/0.1.4                               qiime2/2023.9
   beast2/2.6.6                  (D)    mothur/1.46.0                              qiime2/2024.2
   ... ...
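To work with a specific tool, you can narrow the listing to its name and then load the version you need; the (D) marker above denotes the default version that is loaded when no version is given. For example:

# Search for a specific tool by name
module avail bcftools

# Load a specific version (recommended for reproducibility)
module load bcftools/1.20

# Confirm which modules are currently loaded in your session
module list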

A few tips#

  1. Before installing your own tools, check if they are already available on the cluster using the module avail command.

  2. Always be aware of the software versions, especially when using scripts from colleagues.

  3. For less common tools, consider installing them yourself to ensure you have full control over the version and availability.

    If you need to install a less commonly used tool, it’s best to handle the installation yourself to ensure proper maintenance. Follow this tutorial to install your own tools; one common approach is sketched below.
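A personal conda environment is a typical way to manage self-installed tools. This is a minimal sketch, assuming a conda distribution is available as a module (the module name here is an assumption; check module avail for the one on the cluster):

# Load a conda distribution (module name is an assumption; verify with module avail)
module load anaconda

# Create an isolated environment and install a tool from the bioconda channel
conda create -n mytools -c conda-forge -c bioconda seqkit

# Activate the environment before use (some setups require 'source activate' instead)
conda activate mytools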

Using the Open OnDemand App#

You can access Open OnDemand through this link.

Bioinformatics Apps#

We offer a wide range of bioinformatics tools as apps, including AlphaFold and CellProfiler. Additionally, 31 nf-core pipelines are available as apps for ease of use, with the most popular being nf-core/rnaseq, which we will demonstrate in our final workshop.

RStudio and Apps#

For RStudio Pax, use R/4.4.2, which has the most comprehensive set of packages installed (1,300+).

How to initiate an R job

  1. Log in to Open OnDemand.

  2. Navigate to Interactive Apps and select RStudio Pax.

  3. Specify the required resources:

    • Number of hours

    • Number of CPU cores

    • Amount of memory

    • CPU partition (set to batch)

    • R version (latest available: 4.4.2)

  4. Click Launch to submit your job to the queue.

  5. Wait a few minutes until your job starts running.

  6. Click Connect to RStudio Server.

  7. In RStudio, go to the Packages tab on the right to check the installed packages.

Installing R Packages

Refer to our previous workshop materials for detailed instructions on installing R packages.
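As a quick reference, packages missing from the site library can usually be installed into your personal user library. This is a minimal sketch from the shell, using the R/4.4.2 module mentioned above (inside RStudio, you would run the install.packages() call directly in the console; ggrepel is just an example package):

module load R/4.4.2

# Create your personal library if it does not exist, then install into it
Rscript -e 'dir.create(Sys.getenv("R_LIBS_USER"), recursive = TRUE, showWarnings = FALSE)'
Rscript -e 'install.packages("ggrepel", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cloud.r-project.org")'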

Other Apps#

We also provide other applications like Jupyter Bioinfo, JupyterLab, Jupyter Notebook, IGV, and Galaxy to support your daily research activities.

nf-core pipelines#
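Beyond the Open OnDemand apps, nf-core pipelines can also be launched directly with Nextflow from the command line. This is a minimal sketch, assuming a nextflow module is available and Singularity is used for containers (both assumptions; check module avail):

# Load Nextflow (module name is an assumption; verify with module avail)
module load nextflow

# Run the nf-core/rnaseq built-in test profile using Singularity containers
nextflow run nf-core/rnaseq -profile test,singularity --outdir results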

Writing a bioinformatics job script#

Let’s explore how to write a SLURM script, using STAR alignment as an example.

1. Prepare the SLURM Script#

Create a file named run_star.sh and add the following content:

#!/bin/bash
#SBATCH -J STAR_JOB             # Job name
#SBATCH --time=12:00:00         # Maximum runtime (HH:MM:SS; D-HH:MM:SS is also accepted)
#SBATCH -p batch                # Partition (queue) to submit the job to
#SBATCH -n 1                    # Number of tasks (1 task in this case)
#SBATCH --mem=32g               # Memory allocation (32 GB)
#SBATCH --cpus-per-task=8       # Number of CPU cores allocated for the task
#SBATCH --output=STAR.%j.out    # Standard output file (%j = Job ID)
#SBATCH --error=STAR.%j.err     # Standard error file (%j = Job ID)
#SBATCH --mail-type=ALL         # Notifications for job status (start, end, fail)
#SBATCH --mail-user=utln@tufts.edu  # Your email address for notifications

# Load necessary module
module load star/2.7.11b

# Create output directory
mkdir -p star_output

# Run STAR alignment for single-end reads
STAR --genomeDir ./reference_data/reference_index/ \
     --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8

2. Submit the Job Script#

If you’re running the script directly in the terminal, you need to make it executable first:

chmod +x run_star.sh

However, SLURM does not require execution permissions, so you can submit the job as-is using:

sbatch run_star.sh

3. Monitor Job Status#

Use the following command to check the job status:

squeue -u yourusername
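Once a job finishes, it no longer appears in squeue; sacct reports its final state and resource usage instead, and scancel stops a job you no longer need (the job ID below is a placeholder):

# Check the final state and peak memory usage of a job (replace 12345 with your job ID)
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS

# Cancel a pending or running job
scancel 12345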

STAR Workflow Details#

Loading Modules#

Before using STAR, load the appropriate module:

module load star/2.7.11b

Generating the Genome Index#

Before aligning reads, generate the genome index using a reference genome (.fa) and an annotation file (.gtf):

STAR --runMode genomeGenerate \
     --genomeDir ./reference_data/ \
     --genomeFastaFiles ./reference_data/chr1.fa \
     --sjdbGTFfile ./reference_data/chr1-hg19_genes.gtf \
     --runThreadN 8
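The STAR manual also suggests setting --sjdbOverhang to your read length minus 1 when supplying an annotation (the default of 100 works well for most datasets). For 100 bp reads, the same command gains one flag:

STAR --runMode genomeGenerate \
     --genomeDir ./reference_data/ \
     --genomeFastaFiles ./reference_data/chr1.fa \
     --sjdbGTFfile ./reference_data/chr1-hg19_genes.gtf \
     --sjdbOverhang 99 \
     --runThreadN 8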

Aligning Reads#

For Single-End Reads:#
STAR --genomeDir ./reference_data/ \
     --readFilesIn ./raw_fastq/Irrel_kd_1.subset.fq \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8
For Paired-End Reads:#
STAR --genomeDir ./reference_data/ \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8
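For Compressed (gzip) Reads:#

If your FASTQ files are gzip-compressed, tell STAR how to decompress them with --readFilesCommand; --outSAMtype can also request coordinate-sorted BAM output directly:

STAR --genomeDir ./reference_data/ \
     --readFilesIn /path/to/read1.fastq.gz /path/to/read2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix ./star_output/ \
     --runThreadN 8

With --outSAMtype BAM SortedByCoordinate, the alignment is written as Aligned.sortedByCoord.out.bam instead of Aligned.out.sam.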

Output Files#

After running STAR, the output directory will contain:

Aligned.out.sam  Log.final.out  Log.out  Log.progress.out  SJ.out.tab
  • Aligned.out.sam: Contains alignment data.

  • Log.final.out: Summarizes alignment metrics.

  • Log.out: Main log of run parameters and messages, useful for troubleshooting.

  • Log.progress.out: Reports alignment progress statistics while the job runs.

  • SJ.out.tab: Lists the splice junctions detected in the data.

Additional Tips#

  • Always test commands interactively before incorporating them into job scripts (an example interactive session is shown below).

  • Use the SLURM --time, --mem, and --cpus-per-task options to optimize resource allocation.

  • Check the SLURM output and error files for troubleshooting.
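For the interactive testing recommended above, you can request a short interactive shell on a compute node. A minimal example using the batch partition:

srun -p batch -n 1 --time=01:00:00 --mem=8G --cpus-per-task=4 --pty /bin/bash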

Run jobs on a GPU node#

Interactive session#

srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:1 --pty /bin/bash

You can also specify which GPU type you would like to use, for example an A100:

srun -p preempt -n 1 --time=04:00:00 --mem=20G --gres=gpu:a100:1 --pty /bin/bash

Submit jobs to the queue#

Example script: align.sh, which uses Parabricks to run GPU-accelerated alignment.

#!/bin/bash
#SBATCH -J fq2bam_alignment          # Job name
#SBATCH -p preempt                   # Submit to the 'preempt' partition (modify based on your cluster setup)
#SBATCH --gres=gpu:1                 # Request 1 GPU for accelerated processing
#SBATCH -n 2                         # Number of tasks (2 in this case)
#SBATCH --mem=60g                    # Memory allocation (60GB)
#SBATCH --time=02:00:00              # Maximum job run time (2 hours)
#SBATCH --cpus-per-task=20           # Number of CPU cores allocated per task
#SBATCH --output=alignment.%j.out    # Standard output file (with job ID %j)
#SBATCH --error=alignment.%j.err     # Standard error file (with job ID %j)
#SBATCH --mail-type=ALL              # Email notifications for all job states (begin, end, fail)
#SBATCH --mail-user=utln@tufts.edu   # Email address for notifications

# Show GPU information (optional, useful for logging)
nvidia-smi

# Load the Parabricks module for GPU-accelerated alignment
module load parabricks/4.0.0-1

# Define variables
genome_reference="/path/to/reference_genome"      # Path to the reference genome (.fasta)
input_fastq1="/path/to/input_read1.fastq"         # Path to the first paired-end FASTQ file
input_fastq2="/path/to/input_read2.fastq"         # Path to the second paired-end FASTQ file
sample_name="sample_identifier"                  # Sample identifier
known_sites_vcf="/path/to/known_sites.vcf"        # Known sites VCF file for BQSR (optional, if available)
output_directory="/path/to/output_directory"      # Directory for the output BAM file and reports
output_bam="${output_directory}/${sample_name}.bam"            # Output BAM file path
output_bqsr_report="${output_directory}/${sample_name}.BQSR-report.txt"  # Output BQSR report path

# Run the Parabricks fq2bam alignment pipeline
# Arguments: reference genome (.fasta), paired-end FASTQ files, read-group
# sample name, known sites VCF for BQSR, output BAM, and output BQSR report.
# Note: comments cannot follow the line-continuation backslashes, so they
# are placed here instead.
pbrun fq2bam \
    --ref ${genome_reference} \
    --in-fq ${input_fastq1} ${input_fastq2} \
    --read-group-sm ${sample_name} \
    --knownSites ${known_sites_vcf} \
    --out-bam ${output_bam} \
    --out-recal-file ${output_bqsr_report}


Here is the command to submit the job (as noted earlier, the execute permission is only needed if you run the script directly; sbatch works without it):

chmod +x align.sh    # Optional: make the script executable for direct runs
sbatch align.sh      # Submit the script to the SLURM queue

Use squeue -u yourusername to check job status.

Additional Resources#

Datasets#

In bioinformatics, it’s common to download databases or reference genomes from public websites. For example, performing sequence alignment requires downloading the appropriate reference genome. To simplify this process, we have pre-downloaded and managed several databases/datasets for users.

New user guide#

In early 2025, we launched a new RT Guides website, offering comprehensive resources on a wide range of topics, including but not limited to HPC, data science, and, most importantly, bioinformatics. We keep up with the latest trends and regularly update our materials to reflect new developments. We highly recommend bookmarking the website and referring to it whenever you encounter challenges. Your feedback is invaluable; let us know if you spot any errors or have suggestions.

For updates on bioinformatics education, software, and tools, consider subscribing to our e-list: best@elist.tufts.edu.