Set Up Conda Environment and Create Jupyter Kernel for scRNA-seq Analysis#

Shirley Li, Bioinformatician, TTS Research Technology xue.li37@tufts.edu

Date: 2024-11-01

Overview#

In this tutorial, you will learn how to:

Create a Conda environment for single-cell RNA-seq analysis using Python-only packages.
Install popular Python packages for scRNA-seq, such as Scanpy and Scrublet.
Set up a Jupyter kernel that uses the Conda environment for easy access to the tools in a notebook interface.

Create a Conda Environment for scRNA-seq#

Load the miniforge and conda-env-mod modules

```
module load miniforge/24.7.1-py312
module load conda-env-mod/default
```

Configure your conda

Note: the steps in this section only need to be executed ONCE.

Since your home directory has limited storage, it’s recommended to install conda packages in your group research storage space. Follow these steps:

Create two directories in your group research storage space: one for storing the environments and one for storing the downloaded packages (for example, condaenv and condapkg). Replace XXXXlab with the name of your lab's storage directory.
```
mkdir -p /cluster/tufts/XXXXlab/$USER/condaenv/
mkdir -p /cluster/tufts/XXXXlab/$USER/condapkg/
```
If you haven’t used conda before on the cluster, create a file named “.condarc” in your home directory.

Now add the following 4 lines to the .condarc file in your home directory (adjust the paths to match the directories you created):
```
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
```
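After this, your .condarc file should look similar to this. The channels section is not part of the 4 lines above; it will be present only if channels have been configured in your file: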
```
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
channels:
  - bioconda
  - conda-forge
  - defaults
```
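As a quick sanity check on the configuration above, the following is a minimal sketch that reads `~/.condarc` and reports whether each `envs_dirs` and `pkgs_dirs` entry exists and is writable. The helper names `parse_condarc_lists` and `report_dirs` are hypothetical, and the parser only handles the simple `key:` / `  - value` layout shown above, not general YAML.

```python
import os
from pathlib import Path


def parse_condarc_lists(text):
    """Parse top-level 'key:' lines followed by '- value' items.

    Assumption: the file uses only the simple list layout shown in this
    tutorial; this is not a general YAML parser.
    """
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith((" ", "-")) and line.endswith(":"):
            current = line[:-1].strip()
            result[current] = []
        elif line.lstrip().startswith("- ") and current is not None:
            result[current].append(line.lstrip()[2:].strip())
    return result


def report_dirs(text):
    """Print the status of every envs_dirs / pkgs_dirs entry."""
    config = parse_condarc_lists(text)
    for key in ("envs_dirs", "pkgs_dirs"):
        for entry in config.get(key, []):
            # Expand $USER and similar variables before checking the path.
            path = Path(os.path.expandvars(entry))
            ok = path.is_dir() and os.access(path, os.W_OK)
            print(f"{key}: {path} -> {'ok' if ok else 'missing or not writable'}")


condarc = Path.home() / ".condarc"
if condarc.exists():
    report_dirs(condarc.read_text())
else:
    print("No ~/.condarc found")
```

If an entry is reported as missing, re-check the `mkdir` step and the paths in the file before creating the environment.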
Create your conda environment with conda-env-mod

```
cd /cluster/tufts/XXXXlab/$USER/condaenv/
conda-env-mod create -p scrna_seq_py_env python=3.8 --jupyter
```

You will see output like the following. Enter y to continue:

```
  ...

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  asttokens          conda-forge/noarch::asttokens-2.4.1-pyhd8ed1ab_0
  bzip2              conda-forge/linux-64::bzip2-1.0.8-hd590300_5
  ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0
  ...

Proceed ([y]/n)? y
```

When it’s complete, you will see something like this:

```
...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+---------------------------------------------------------------+
| To use this environment, load the following modules:          |
|     module load use.own                                       |
|     module load conda-env/scrna_seq_py_env-py3.12.5           |
| (then standard 'conda install' / 'pip install' / run scripts) |
+---------------------------------------------------------------+
```

Install Selected Python Packages#

Activate the conda environment and install new packages

```
module load use.own
module load conda-env/scrna_seq_py_env-py3.12.5

conda list # check packages installed in this environment

pip install jupyter
pip install numpy
pip install pandas
pip install anndata
conda install -c conda-forge scanpy
conda install -c bioconda scrublet
pip install harmony-pytorch
pip install gseapy
pip install scanorama
pip install pyscenic
pip install scvi-tools
pip install -i https://test.pypi.org/simple/ memento
pip install pooch
conda install -c conda-forge python-igraph

conda list # check again
```

Create a Jupyter kernel

```
conda-env-mod kernel -n scrna_seq_py_env
```

You will see something like this:

```
requested kernel with arguments:  -n 'scrna_seq_py_env' --

Jupyter kernel created: "Python (My scrna_seq_py_env Kernel)"
+---------------------------------------------------------------+
| We recommend installing packages into your kernel environment |
| via the command line (with 'conda install' or 'pip install'). |
+---------------------------------------------------------------+
```
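To confirm that the kernel was registered, the sketch below lists the kernel specifications found in your per-user Jupyter data directory. It assumes the kernel was written to the default Linux location (`~/.local/share/jupyter/kernels`, or under `JUPYTER_DATA_DIR` if that variable is set); other settings such as `XDG_DATA_HOME` are not handled. The function name `list_user_kernels` is hypothetical.

```python
import json
import os
from pathlib import Path


def list_user_kernels(kernels_dir):
    """Return {kernel directory name: display name} for each kernel.json found."""
    kernels = {}
    kernels_dir = Path(kernels_dir)
    if not kernels_dir.is_dir():
        return kernels
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with spec_file.open() as handle:
            spec = json.load(handle)
        kernels[spec_file.parent.name] = spec.get("display_name", "(no display name)")
    return kernels


# Assumption: default per-user location on Linux unless JUPYTER_DATA_DIR is set.
data_dir = os.environ.get(
    "JUPYTER_DATA_DIR", str(Path.home() / ".local" / "share" / "jupyter")
)
for name, display in list_user_kernels(Path(data_dir) / "kernels").items():
    print(f"{name}: {display}")
```

The display name printed here is the label to look for when choosing a kernel in Jupyter Lab.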

Using Open OnDemand Jupyter Lab#

Navigate to Open OnDemand.

In the Open OnDemand dashboard, go to Interactive Apps => Jupyter Lab, select the number of hours, number of cores, and amount of memory that you would like to request, and launch the job.

Under Notebook, select the kernel you just created, for example: scrna_seq_py_env

Start writing your Python code from there.

Example code to check the installation:

```
# Import installed packages
import os
import seaborn as sns
import scanpy as sc
import scrublet as scr
import anndata
import harmony
import memento
import numpy as np
import pandas as pd
import scvi
import matplotlib.pyplot as plt
```
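A plain list of imports stops at the first failure. As an alternative, here is a minimal sketch that tries each module in turn and prints either its version or the error it raised. The helper name `check_imports` is hypothetical, and the module list simply mirrors the imports above.

```python
import importlib

# Module names as imported in a notebook (not necessarily the pip package names).
MODULES = [
    "seaborn", "scanpy", "scrublet", "anndata", "harmony",
    "memento", "numpy", "pandas", "scvi", "matplotlib",
]


def check_imports(names):
    """Try importing each module; return {name: version string or error message}."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            results[name] = str(getattr(module, "__version__", "installed (no __version__)"))
        except Exception as exc:  # report any import-time failure, not just ImportError
            results[name] = f"FAILED: {type(exc).__name__}: {exc}"
    return results


for name, status in check_imports(MODULES).items():
    print(f"{name:12s} {status}")
```

Any line starting with `FAILED` points to a package that should be reinstalled from the command line in the same environment.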

Single-Cell RNA-seq Analysis Packages#

Scanpy#

Summary: Scanpy is a widely used Python package for analyzing large-scale single-cell RNA-seq datasets. It is optimized for scalability and supports workflows for preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data.
Paper: Wolf, F. A., Angerer, P., & Theis, F. J. (2018). “Scanpy: large-scale single-cell gene expression data analysis.” Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0
Website: https://scanpy.readthedocs.io

Scrublet#

Summary: Scrublet is a Python tool designed to detect doublets in single-cell RNA-seq data. Doublets are instances where two cells are captured in a single droplet, which can distort downstream analysis. Scrublet uses a k-nearest neighbors approach to identify and score potential doublets.
Paper: Wolock, S. L., Lopez, R., & Klein, A. M. (2019). “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems, 8(4), 281–291.e9. https://doi.org/10.1016/j.cels.2018.11.005
Website: https://github.com/AllonKleinLab/scrublet

AnnData#

Summary: AnnData is a Python package that provides a framework for managing annotated data matrices, tailored for large-scale single-cell RNA-seq data. AnnData is widely used as the primary data structure in Scanpy, enabling efficient storage and handling of both raw and processed single-cell data.
Paper: Virshup, I., Rybakov, S., Theis, F. J., Angerer, P., & Wolf, F. A. (2024). “anndata: Access and store annotated data matrices.” The Journal of Open Source Software. https://doi.org/10.21105/joss.04371
Website: https://anndata.readthedocs.io

Harmony#

Summary: Harmony is a tool designed for batch effect correction in single-cell RNA-seq datasets. It integrates datasets from different batches or conditions by aligning data in a shared embedding space, allowing biological variation to be preserved while minimizing technical differences.
Paper: Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature Methods, 16(12), 1289-1296. https://doi.org/10.1038/s41592-019-0619-0
Website: https://portals.broadinstitute.org/harmony

Memento#

Summary: Memento is a statistical tool tailored for single-cell RNA sequencing (scRNA-seq) analysis, with a focus on decoupling measurement noise from biological expression variability, thereby improving accuracy in differential expression studies.
Paper: Kim, M. C., Gate, R., Lee, D. S., Marson, A., Ntranos, V., Ye, C. J. (2024). “Method of moments framework for differential expression analysis of single-cell RNA sequencing data.” Cell, 187(22), 6393–6410.e16. https://doi.org/10.1016/j.cell.2024.08.022
Website: https://github.com/yelabucsf/scrna-parameter-estimation

scVI-tools#

Summary: scVI-tools is a framework built on top of PyTorch for scalable probabilistic modeling of single-cell data. It includes models such as scVI (single-cell variational inference), totalVI, and PeakVI, which are used for data integration, dimensionality reduction, differential expression, and multi-omics data analysis.
Paper: Gayoso, A., Lopez, R., Xing, G., Boyeau, P., Wu, K., Jayasuriya, M., et al. (2022). “A Python library for probabilistic analysis of single-cell omics data.” Nature Biotechnology, 40, 163–166. https://www.nature.com/articles/s41587-021-01206-w
Website: https://scvi-tools.org