# Running Bioinformatics Jobs on Tufts HPC

2026-02-24

Shirley Li: [xue.li37@tufts.edu](mailto:xue.li37@tufts.edu)

## Workshop Overview

This hands-on workshop introduces how to run bioinformatics analyses on the Tufts HPC using SLURM. Using an RNA-seq example, participants will learn how to structure projects, submit jobs, and scale workflows efficiently.

**You will learn how to:**

- Organize a reproducible HPC project directory
- Write and submit SLURM batch scripts
- Allocate CPU, memory, and runtime appropriately
- Monitor logs and troubleshoot jobs
- Use job dependencies and SLURM job arrays for scalable analysis

## 1. Set Up Your Working Directory

All work for this project will be done under:

```
/cluster/tufts/workshop/utln/
```

If you have a dedicated lab storage space, you may use that instead.

Create a directory for your project:

```
mkdir /cluster/tufts/workshop/utln/myproject/   # create a folder to hold all files related to this analysis
```

**Always create a dedicated project directory. Never mix analyses together.**

## 2. Create a Basic Project Structure

Navigate into your project directory and create the following folders:

```
cd /cluster/tufts/workshop/utln/myproject/   # you are now inside your project folder

mkdir raw_data results scripts
touch README.md      # document what this project does

mkdir scripts/logs   # create a logs folder under scripts
# SLURM will generate .out files (standard output) and .err files (error messages) here.
# Keeping logs in one place makes debugging much easier.
```

Your project structure should look like this:

```
myproject/
├── raw_data/   # input files (never modify these)
├── results/    # output files from analysis
├── scripts/    # your SLURM scripts
│   └── logs/
└── README.md
```

This structure keeps:

- Raw input data separate
- Results organized
- Scripts and logs clearly managed

## 3. Prepare Input FASTQ Files

The example FASTQ files are located at:

```
/cluster/tufts/workshop/public/2026spring/nfcore/fastq/*
```

Instead of copying large files, create symbolic links in your `raw_data` directory:

```
cd /cluster/tufts/workshop/utln/myproject/raw_data/   # move into the raw_data folder

ln -s /cluster/tufts/workshop/public/2026spring/nfcore/fastq/* ./   # create symbolic links
# This creates shortcuts in your raw_data/ folder that point to the original files.
```

**Why We Don't Copy the Files**

FASTQ files are large. Copying them would:

- Waste storage
- Slow down the system
- Create unnecessary duplicates

Instead, we use symbolic links, which allow you to:

- Access the files locally in your project
- Avoid duplicating large datasets
- Keep the original data unchanged
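You can verify that the links were created correctly: in `ls -l` output, symbolic links start with `l` and show an arrow to the file they point to. The sample name in the example output below is illustrative.

```
ls -l /cluster/tufts/workshop/utln/myproject/raw_data/
# Example output (file name illustrative):
# lrwxrwxrwx ... sample1_1_sub.fastq.gz -> /cluster/tufts/workshop/public/2026spring/nfcore/fastq/sample1_1_sub.fastq.gz
```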
## 4. Running FastQC

We will now write our first batch job to run FastQC.

`fastqc.sh`

```
#!/bin/bash
#SBATCH --job-name=fastqc_rnaseq
#SBATCH --partition=preempt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=logs/fastqc_%j.out
#SBATCH --error=logs/fastqc_%j.err

echo "Job started on $(date)"
echo "Running on node: $(hostname)"

# Load FastQC module (adjust if needed)
module load fastqc/0.11.9

DIR=/cluster/tufts/workshop/utln/myproject/   # change this line
mkdir -p $DIR/results/fastqc

# Run FastQC on all FASTQ files
fastqc $DIR/raw_data/*.fastq.gz \
    --outdir $DIR/results/fastqc \
    --threads 8

echo "Job finished on $(date)"
```

**What This Job Does**

- Reads FASTQ files from `raw_data/`
- Writes reports to `results/fastqc/`
- Saves logs in `scripts/logs/`
- Uses 8 CPU threads
- Runs all samples in one submission

Submit from inside `scripts/`, so the relative `logs/` paths in the `#SBATCH` lines resolve to `scripts/logs/`:

```
sbatch fastqc.sh
```

## 5. Alignment with STAR

`star.sh`

```
#!/bin/bash
#SBATCH --job-name=star_align
#SBATCH --partition=preempt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=logs/star_%j.out
#SBATCH --error=logs/star_%j.err

module load star/2.7.11b

STARINDEX=/cluster/tufts/workshop/public/2026spring/star_index/
DIR=/cluster/tufts/workshop/utln/myproject/   # change this line

mkdir -p $DIR/results/star/

for fq in $DIR/raw_data/*_1_*.fastq.gz
do
    sample=$(basename $fq _1_sub.fastq.gz)
    # --outSAMtype BAM Unsorted makes STAR write Aligned.out.bam
    # (the default is SAM), which the sorting step below expects.
    STAR \
        --runThreadN 8 \
        --genomeDir $STARINDEX \
        --readFilesIn $fq \
        --readFilesCommand zcat \
        --outSAMtype BAM Unsorted \
        --outFileNamePrefix $DIR/results/star/${sample}_
    echo "Finished $fq"
done
```

**What This Job Does**

- Aligns reads to the reference genome
- Processes all samples in a loop
- Writes unsorted BAM output to `results/star/`
- Uses 8 CPU threads

Submit:

```
sbatch star.sh
```

## 6. Post-Processing the BAM Files (Sorting)

Many downstream tools require a **sorted BAM file**. We will sort each BAM file using `samtools`.

`sort.sh`

```
#!/bin/bash
#SBATCH --job-name=sort_bam
#SBATCH --partition=preempt
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --output=logs/sort_%j.out
#SBATCH --error=logs/sort_%j.err

module load samtools/1.21

DIR=/cluster/tufts/workshop/utln/myproject/   # change this line

mkdir -p $DIR/results/sorted_bam/

for bam in $DIR/results/star/*_Aligned.out.bam
do
    sample=$(basename $bam _Aligned.out.bam)
    samtools sort \
        -@ 4 \
        -o $DIR/results/sorted_bam/${sample}_sorted.bam \
        $bam
done
```

**What This Job Does**

- Takes STAR output BAM files
- Sorts each BAM file
- Writes sorted files to `results/sorted_bam/`
- Uses 4 CPU threads

Submit:

```
sbatch sort.sh
```

## 7. Wrapper Script to Chain Jobs

This script chains STAR and sorting using a SLURM dependency. FastQC runs independently.

`run_pipeline.sh`

```
#!/bin/bash

# Submit FastQC independently
sbatch fastqc.sh

# Submit STAR alignment and capture its job ID
jid1=$(sbatch star.sh | awk '{print $4}')
echo "STAR job submitted with Job ID: $jid1"

# Submit the sorting job to run only after STAR completes successfully
jid2=$(sbatch --dependency=afterok:$jid1 sort.sh | awk '{print $4}')
echo "Sorting job submitted with Job ID: $jid2"
```

Make the script executable and run it:

```
chmod +x run_pipeline.sh
./run_pipeline.sh
```

**What This Does**

- FastQC runs independently
- STAR runs
- Sorting runs only if STAR finishes successfully
- If STAR fails, sorting will not start
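While the pipeline runs, you can watch the chain from the login node. Replace `<jobid>` below with the IDs echoed by `run_pipeline.sh`.

```
squeue -u $USER                 # list your pending and running jobs
# A job waiting on a dependency sits in state PD with reason (Dependency)
# until the job it depends on completes successfully.

sacct -j <jobid>                # after a job ends: its final state (COMPLETED, FAILED, ...)
tail -f logs/star_<jobid>.out   # follow a log file while the job is still writing it
```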
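Once a job has finished, check how much of the requested CPU, memory, and time it actually used so you can right-size future `#SBATCH` requests. `sacct` reports standard accounting fields per job; `seff`, where installed on your cluster, prints a one-screen efficiency summary.

```
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,AllocCPUS,State
# MaxRSS is the peak memory the job actually used. If it is far below the
# --mem you requested, ask for less next time; smaller requests generally
# schedule faster.

seff <jobid>   # if installed: CPU and memory efficiency summary for one job
```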
---

## Advanced Usage: SLURM Job Arrays

In the previous `star.sh` script, we used a `for` loop to process all samples inside a single SLURM job. That approach works, but it runs samples **sequentially**, one after another, within the same job allocation. If you have multiple independent samples, a more scalable approach is to use a **SLURM job array**.

### Why Use a Job Array?

A job array allows SLURM to:

- Run multiple copies of the same script
- Process different input files independently
- Execute tasks in parallel across different compute nodes

Instead of one job looping through 6 samples, SLURM launches 6 separate tasks automatically. Each task:

- Processes one sample
- Has its own log file
- Uses its own allocated resources

This improves efficiency and scalability.

`star_array.sh`

```
#!/bin/bash
#SBATCH --job-name=star_align
#SBATCH --partition=preempt
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --array=1-6
#SBATCH --output=logs/star_%A_%a.out
#SBATCH --error=logs/star_%A_%a.err

module load star/2.7.11b

STARINDEX=/cluster/tufts/workshop/public/2026spring/star_index/
DIR=/cluster/tufts/workshop/utln/myproject/   # change this line

mkdir -p $DIR/results/star_arrayjob/

#============================
# --array=1-6: 6 = number of samples.
# If you don't know how many samples there are, count them first:
#   ls $DIR/raw_data/*_1_*.fastq.gz | wc -l
#============================

files=($DIR/raw_data/*_1_*.fastq.gz)   # collect all matching FASTQ files into a bash array
fq=${files[$SLURM_ARRAY_TASK_ID-1]}    # SLURM_ARRAY_TASK_ID starts at 1; bash arrays start at 0

sample=$(basename $fq _1_sub.fastq.gz)

STAR \
    --runThreadN 8 \
    --genomeDir $STARINDEX \
    --readFilesIn $fq \
    --readFilesCommand zcat \
    --outSAMtype BAM Unsorted \
    --outFileNamePrefix $DIR/results/star_arrayjob/${sample}_
```
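Submit the array job like any other batch script. Two SLURM features are handy here: a `%` throttle caps how many array tasks run at once, and individual tasks can be addressed as `JOBID_TASKID`. The job ID `12345` below is a placeholder.

```
sbatch star_array.sh                 # launches tasks 1-6; squeue lists them as JOBID_TASKID

sbatch --array=1-6%2 star_array.sh   # same tasks, but at most 2 run at a time
                                     # (--array on the command line overrides the script's value)

scancel 12345_3                      # cancel only task 3 of array job 12345
```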
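The glob-into-a-bash-array pattern above ties each task index to whatever files match the glob at run time; if files are added or removed between submissions, the indices shift. A more explicit alternative, sketched here (the `samples.txt` file is an assumption, not part of the workshop materials), is to list the inputs in a file once and have each task read its own line:

```
# Build the sample list once, from the scripts/ directory:
#   ls $DIR/raw_data/*_1_*.fastq.gz > samples.txt

# Inside the array script, task N reads line N of the list:
fq=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
sample=$(basename $fq _1_sub.fastq.gz)
echo "Task $SLURM_ARRAY_TASK_ID will process $sample"
```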