How to Organize a Bioinformatics Project on the Tufts HPC#

A Practical Guide for Reproducible and Scalable Research

Author: Shirley Li, xue.li37@tufts.edu
Date: 2026-02-26

If you are running RNA-seq, single-cell, spatial transcriptomics, or other omics analyses on the Tufts HPC cluster, organizing your project correctly from the beginning will save time, reduce errors, and make your work reproducible.

This guide explains:

  • Where to work on the Tufts HPC

  • A recommended project folder structure

  • What belongs in each directory

  • How to use Git/GitHub correctly

  • Best practices for long-term project management

  • This structure works for SLURM-based workflows, R/Python pipelines, and multi-analysis projects.

1. Where to Work on Tufts HPC#

Each user works under:

/cluster/tufts/<labname>/<utln>/

Your lab should maintain shared resources under:

/cluster/tufts/<labname>/shared/

Use the shared folder for reference genomes, indexes, annotation files, or datasets reused across the lab.

3. What Goes in Each Folder#

data_raw/#

Untouched input files (FASTQ, BAM, MERFISH CSV, Visium outputs). Should never be modified.

data_processed/#

Filtered matrices, Seurat objects (.rds), processed tables. Time-consuming results that you want to reuse.

scripts_global/#

Shared utilities, helper functions, custom plotting, reusable QC code.

analysis*/scripts/#

Analysis-specific R/Python scripts, e.g.:

01_qc.R
02_clustering.R
03_spatial.R

analysis*/notebooks/#

Exploratory notebooks (Rmd/Jupyter). Not part of the formal pipeline.

analysis*/slurm/#

Job submission files (.sbatch). Match names to scripts, e.g.:

scripts/01_qc.R
slurm/01_qc.sbatch

analysis*/results/#

Figures, tables, intermediate outputs.

analysis*/logs/#

SLURM .out/.err logs and analysis logs.

envs/#

Conda environments, renv folder, Apptainer definitions.

docs/#

Notes, meeting summaries, methods drafts, descriptions of workflows.

4. Version Control: Use Git + GitHub#

Track only code and documents, not large data.

Put these under Git:

scripts_global/
analysis*/scripts/
analysis*/slurm/
analysis*/notebooks/
docs/
README.md

Add to .gitignore:

data_raw/
data_processed/
results/
*.rds
*.h5
*.fastq*
*.bam

Create a GitHub repo for tracking scripts + documentation.

5. Essential Best Practices for HPC Projects#

5.1. Separate Raw and Processed Data#

Raw data:

  • Never modified

  • Archived after publication

  • Moved to cold storage if appropriate Public raw datasets:

  • Document download commands

  • Delete after processing (can be re-downloaded)

5.2. Document Software Versions#

Record:

  • Tool versions

  • Parameters

  • Package versions Write analysis notes in a methods-style format while working.

5.3. Keep Folder Sizes Reasonable#

A typical manuscript project (excluding raw/intermediate data) should only be a few GB. Regular cleanup is much easier than emergency cleanup.

5.4. Use Consistent Naming Conventions#

Match: 01_qc.R    01_qc.sbatch

Use logical numbering. Avoid ambiguous names like: final_new_v2_fixed.R

5.5. Save Intermediate Objects#

For large omics analyses:

  • Save Seurat .rds

  • Save processed matrices Preprocessing can take hours. Avoid repeating it.

5.6. Track Everything with Git#

Track:

  • Code

  • SLURM scripts

  • Notebooks

  • Documentation

  • Exclude all data.

  • Commit frequently.

Frequently Asked Questions#

Q: Should each paper have its own project folder?#

Yes. Each manuscript should have its own top-level project directory. This prevents cross-contamination of scripts, results, and documentation between studies.

Q: Should I organize by data type or biological question?#

Organize by biological question or manuscript goal.

For example:

analysis1/  → differential expression
analysis2/  → spatial analysis
analysis3/  → integration

Avoid mixing unrelated analyses in the same folder.

Q: Where should I run interactive work on Tufts HPC?#

Use Open OnDemand for:

  • RStudio

  • Jupyter

  • Small-scale exploratory work

Use SLURM batch jobs for:

  • Large RNA-seq runs

  • STAR alignment

  • nf-core workflows

  • High-memory jobs

Q: Should I run everything through SLURM?#

If a job:

  • Runs > 5–10 minutes

  • Uses multiple cores

  • Requires significant memory It should be submitted via SLURM.

Interactive sessions are for exploration only.

Q: Where should I store large reference genomes?#

Store them in: /cluster/tufts/<labname>/shared/

Never duplicate reference genomes across personal project folders.

Q: How should I name my project folders?#

Use descriptive and date-aware names:

2026_skin_merfish_project/
2026_rnaseq_inflammation/

Avoid vague names like:

new_project/
final_analysis/
test2/

Q: Should notebooks replace scripts?#

No. Notebooks are for:

  • Exploration

  • Visualization

  • Demonstration

Scripts are for:

  • Reproducible pipelines

  • SLURM jobs

  • Formal analysis

A clean workflow uses both.

Q: When should I clean up intermediate files?#

After:

  • You save key processed objects

  • The pipeline has completed successfully

  • The manuscript is accepted

  • Large intermediate files (temporary BAMs, intermediate matrices) should not live forever.

Q: How large should a project folder be?#

As a rule of thumb: Code + figures + docs → usually only a few GB

Large raw/intermediate data should not dominate your active project space

If a folder grows unexpectedly large, check: du -sh *

Q: Should I store environments inside the project?#

Yes. Keep environment definitions in: envs/

This ensures:

  • Reproducibility

  • Easier collaboration

  • Future reruns of analysis

Q: How do I make my project reproducible for future lab members?#

Keep clear folder structure

  • Maintain README.md

  • Track code with Git

  • Record software versions

  • Avoid undocumented manual steps

A future lab member should be able to understand your project in one hour.

Q: What is the most common mistake in HPC project organization?#

Mixing everything in one folder. Typical bad pattern:

scripts/
data/
results/
more_scripts/
new_results/
final/

Without structure, scaling becomes impossible.

Q: Should I compress FASTQ or BAM files?#

FASTQ should remain .fastq.gz. Do not unzip them.

BAM files can be archived if no longer needed, but do not delete primary alignment results before publication unless archived properly.