Running Bioinformatics Jobs on Tufts HPC#
2026-02-24
Shirley Li: xue.li37@tufts.edu
Workshop Overview#
This hands-on workshop introduces how to run bioinformatics analyses on the Tufts HPC using SLURM. Using an RNA-seq example, participants will learn how to structure projects, submit jobs, and scale workflows efficiently.
You will learn how to:
Organize a reproducible HPC project directory
Write and submit SLURM batch scripts
Allocate CPU, memory, and runtime appropriately
Monitor logs and troubleshoot jobs
Use job dependencies and SLURM job arrays for scalable analysis
1. Set Up Your Working Directory#
All work for this project will be done under:
/cluster/tufts/workshop/utln/
Replace utln with your Tufts username. If you have a dedicated lab storage space, you may use that instead.
Create a directory for your project
mkdir /cluster/tufts/workshop/utln/myproject/
# creating a folder to hold all files related to this analysis
Always create a dedicated project directory. Never mix analyses together.
2. Create a Basic Project Structure#
Navigate into your project directory and create the following folders:
cd /cluster/tufts/workshop/utln/myproject/
# You are now inside your project folder
mkdir raw_data results scripts
touch README.md
# Document what this project does
mkdir scripts/logs
# Create logs folder under scripts
# SLURM will generate: .out files (standard output) and .err files (error messages)
# Keeping logs in one place makes debugging much easier.
Your project structure should look like this:
myproject/
├── raw_data/ # input files (never modify these)
├── results/ # output files from analysis
├── scripts/ # your SLURM scripts
│ └── logs/
└── README.md
This structure keeps:
Raw input data separate
Results organized
Scripts and logs clearly managed
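A short README helps you (and labmates) understand the project later. As a minimal sketch, you could record the goal and layout directly from the command line; the exact contents are up to you:
cat >> README.md << 'EOF'
# myproject
RNA-seq workshop analysis on Tufts HPC.
raw_data/ : symbolic links to the workshop FASTQ files
results/  : FastQC, STAR, and samtools output
scripts/  : SLURM batch scripts (logs in scripts/logs/)
EOF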
3. Prepare Input FASTQ Files#
The example FASTQ files are located at:
/cluster/tufts/workshop/public/2026spring/nfcore/fastq/*
Instead of copying large files, create symbolic links in your raw_data directory:
cd /cluster/tufts/workshop/utln/myproject/raw_data/
# move into raw_data folder
ln -s /cluster/tufts/workshop/public/2026spring/nfcore/fastq/* ./
# Create symbolic links
# This creates shortcuts in your raw_data/ folder that point to the original files.
Why We Don’t Copy the Files
FASTQ files are large. Copying them would:
Waste storage
Slow down the system
Create unnecessary duplicates
Instead, we use symbolic links.
This allows you to:
Access the files locally in your project
Avoid duplicating large datasets
Keep the original data unchanged
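To confirm the links were created correctly, list the directory; each symbolic link shows an arrow pointing back to the original file:
ls -l /cluster/tufts/workshop/utln/myproject/raw_data/
# Each entry should look like: <name>.fastq.gz -> /cluster/tufts/workshop/public/2026spring/nfcore/fastq/<name>.fastq.gz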
4. Running FastQC#
We will now write our first batch job to run FastQC.
fastqc.sh
#!/bin/bash
#SBATCH --job-name=fastqc_rnaseq
#SBATCH --partition=preempt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=logs/fastqc_%j.out
#SBATCH --error=logs/fastqc_%j.err
echo "Job started on $(date)"
echo "Running on node: $(hostname)"
# Load FastQC module (adjust if needed)
module load fastqc/0.11.9
DIR=/cluster/tufts/workshop/utln/myproject/
# change utln to your username
mkdir -p $DIR/results/fastqc
# Run FastQC on all FASTQ files
fastqc $DIR/raw_data/*.fastq.gz \
--outdir $DIR/results/fastqc \
--threads 8
echo "Job finished on $(date)"
What This Job Does
Reads FASTQ files from raw_data/
Writes reports to results/fastqc/
Saves logs in scripts/logs/
Uses 8 CPU threads
Runs all samples in one submission
Submit:
sbatch fastqc.sh
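After submitting, you can monitor the job with standard SLURM commands (the job ID below is whatever sbatch printed):
squeue -u $USER
# Lists your pending and running jobs
sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS
# Reports state and resource usage once the job has started or finished
tail -f logs/fastqc_<jobid>.out
# Follows the job's standard output; press Ctrl+C to stop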
5. Alignment with STAR#
star.sh
#!/bin/bash
#SBATCH --job-name=star_align
#SBATCH --cpus-per-task=8
#SBATCH --partition=preempt
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=logs/star_%j.out
#SBATCH --error=logs/star_%j.err
module load star/2.7.11b
STARINDEX=/cluster/tufts/workshop/public/2026spring/star_index/
DIR=/cluster/tufts/workshop/utln/myproject/
mkdir -p $DIR/results/star/
for fq in $DIR/raw_data/*_1_*.fastq.gz
do
sample=$(basename $fq _1_sub.fastq.gz)
# --outSAMtype BAM Unsorted writes Aligned.out.bam, which the sorting step below expects
STAR \
--runThreadN 8 \
--genomeDir $STARINDEX \
--readFilesIn $fq \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outFileNamePrefix $DIR/results/star/${sample}_
echo $fq DONE
done
What This Job Does
Aligns reads to the reference genome
Processes all samples in a loop
Writes BAM output to results/star/
Uses 8 CPU threads
Submit:
sbatch star.sh
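When the alignment finishes, STAR leaves a Log.final.out file per sample (named from the output prefix in the script). A quick check of mapping rates:
grep "Uniquely mapped reads %" /cluster/tufts/workshop/utln/myproject/results/star/*_Log.final.out
# Each sample's log reports its percentage of uniquely mapped reads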
6. Post-Processing the BAM File (Sorting)#
Many downstream tools require a sorted BAM file. We will sort the BAM file using samtools.
sort.sh
#!/bin/bash
#SBATCH --job-name=sort_bam
#SBATCH --cpus-per-task=4
#SBATCH --partition=preempt
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=logs/sort_%j.out
#SBATCH --error=logs/sort_%j.err
module load samtools/1.21
DIR=/cluster/tufts/workshop/utln/myproject/
# change utln to your username
mkdir -p $DIR/results/sorted_bam/
for bam in $DIR/results/star/*_Aligned.out.bam
do
sample=$(basename $bam _Aligned.out.bam)
samtools sort \
-@ 4 \
-o $DIR/results/sorted_bam/${sample}_sorted.bam \
$bam
done
What This Job Does
Takes STAR output BAM files
Sorts each BAM file
Writes sorted files to results/sorted_bam/
Uses 4 CPU threads
Submit:
sbatch sort.sh
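Once the job completes, you can verify that the sorted BAM files are intact with samtools (run interactively or in another small job):
module load samtools/1.21
samtools quickcheck /cluster/tufts/workshop/utln/myproject/results/sorted_bam/*_sorted.bam && echo "All sorted BAM files OK"
# quickcheck exits with an error if any file is truncated or malformed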
7. Wrapper Script to Chain Jobs#
This script chains STAR and sorting using SLURM dependency. FastQC runs independently.
run_pipeline.sh
#!/bin/bash
# Submit FastQC independently
sbatch fastqc.sh
# Submit STAR alignment
jid1=$(sbatch star.sh | awk '{print $4}')
echo "STAR job submitted with Job ID: $jid1"
# Submit sorting job after STAR completes successfully
jid2=$(sbatch --dependency=afterok:$jid1 sort.sh | awk '{print $4}')
echo "Sorting job submitted with Job ID: $jid2"
Make the script executable and run it:
chmod +x run_pipeline.sh
./run_pipeline.sh
What This Does
FastQC runs independently
STAR runs
Sorting runs only if STAR finishes successfully
If STAR fails, sorting will not start
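You can watch the dependency in action while the pipeline runs:
squeue -u $USER
# The sort_bam job sits in state PD with reason (Dependency) until star_align finishes
# If STAR fails, the held job never starts; remove it with: scancel <jobid>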
Advanced usage: SLURM Job Array#
In the previous star.sh script, we used a for loop to process all samples inside a single SLURM job.
That approach works, but the samples run sequentially, one after another, within a single job allocation.
If you have multiple independent samples, a more scalable approach is to use a SLURM job array.
Why Use a Job Array?#
A job array allows SLURM to:
Run multiple jobs of the same script
Process different input files independently
Execute jobs in parallel across different compute nodes
Instead of one job looping through 6 samples, SLURM launches 6 separate jobs automatically.
Each job:
Processes one sample
Has its own log file
Uses its own allocated resources
This improves efficiency and scalability.
star_array.sh
#!/bin/bash
#SBATCH --job-name=star_align
#SBATCH --partition=preempt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --array=1-6
#SBATCH --output=logs/star_%A_%a.out
#SBATCH --error=logs/star_%A_%a.err
module load star/2.7.11b
STARINDEX=/cluster/tufts/workshop/public/2026spring/star_index/
DIR=/cluster/tufts/workshop/utln/myproject/
mkdir -p $DIR/results/star_arrayjob/
#============================
# --array=1-6: where 6 = number of samples.
# If you don’t know how many samples there are, you can count them first:
# ls $DIR/raw_data/*_1_*.fastq.gz | wc -l
#============================
files=($DIR/raw_data/*_1_*.fastq.gz)
# This collects all matching FASTQ files into a bash array.
fq=${files[$SLURM_ARRAY_TASK_ID-1]}
# SLURM_ARRAY_TASK_ID starts at 1.
# Bash arrays start at 0.
sample=$(basename $fq _1_sub.fastq.gz)
# Write unsorted BAM output (Aligned.out.bam) into the array-job results directory created above
STAR \
--runThreadN 8 \
--genomeDir $STARINDEX \
--readFilesIn $fq \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outFileNamePrefix $DIR/results/star_arrayjob/${sample}_
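Submit the array job the same way. SLURM expands it into one task per sample, and each task writes its own log named from %A (the array job ID) and %a (the task index):
sbatch star_array.sh
squeue -u $USER
# Array tasks appear as <jobid>_1 through <jobid>_6, each running independently and possibly in parallel
# Per-task logs: logs/star_<jobid>_1.out, logs/star_<jobid>_1.err, and so on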