Set Up Conda Environment and Create Jupyter Kernel for scRNA-seq Analysis#

Shirley Li, Bioinformatician, TTS Research Technology xue.li37@tufts.edu

Date: 2024-11-01

Overview#

In this tutorial, you will learn how to:

  • Create a Conda environment for single-cell RNA-seq analysis using Python-only packages.

  • Install popular Python packages for scRNA-seq, such as Scanpy and Scrublet.

  • Set up a Jupyter kernel that uses the Conda environment for easy access to the tools in a notebook interface.

Create a Conda Environment for scRNA-seq#

  1. Load the miniforge and conda-env-mod modules

module load miniforge/24.7.1-py312
module load conda-env-mod/default
  2. Configure your conda

    Note: the steps in this section only need to be run ONCE.

    Since your home directory has limited storage, it’s recommended to install conda packages in your group research storage space. Follow these steps:

    Create two directories in your group research storage space, one for storing environments and one for storing downloaded packages (for example, condaenv and condapkg):

    mkdir -p /cluster/tufts/XXXXlab/$USER/condaenv/
    mkdir -p /cluster/tufts/XXXXlab/$USER/condapkg/
    

    If you haven’t used conda on the cluster before, create a file named “.condarc” in your home directory.

    Now add the following 4 lines to the .condarc file in your home directory (replace the paths with the real paths to your directories):

    envs_dirs:
      - /cluster/tufts/XXXXlab/$USER/condaenv/
    pkgs_dirs:
      - /cluster/tufts/XXXXlab/$USER/condapkg/
    

    After this, your .condarc file should look similar to this (the channels section is shown as an example and may differ in your file):

    envs_dirs:
      - /cluster/tufts/XXXXlab/$USER/condaenv/
    pkgs_dirs:
      - /cluster/tufts/XXXXlab/$USER/condapkg/
    channels:
      - bioconda
      - conda-forge
      - defaults
    
  3. Create your conda environment with conda-env-mod

cd /cluster/tufts/XXXXlab/$USER/condaenv/
conda-env-mod create -p scrna_seq_py_env python=3.8 --jupyter

You will see something like this. Enter y to continue:

  ...

The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  asttokens          conda-forge/noarch::asttokens-2.4.1-pyhd8ed1ab_0
  bzip2              conda-forge/linux-64::bzip2-1.0.8-hd590300_5
  ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0
  ...

Proceed ([y]/n)? y

When it’s complete, you will see something like this:

...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+---------------------------------------------------------------+
| To use this environment, load the following modules:          |
|     module load use.own                                       |
|     module load conda-env/scrna_seq_py_env-py3.12.5           |
| (then standard 'conda install' / 'pip install' / run scripts) |
+---------------------------------------------------------------+
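
The directory and .condarc setup described in the configuration step above can also be sketched in Python. This is a minimal illustration, not part of the cluster tooling: the function names `make_conda_dirs` and `render_condarc` are hypothetical, and the demo runs in a temporary directory rather than in `/cluster/tufts/XXXXlab/$USER`. It only prints the text you would place in `~/.condarc`; it does not write that file.

```python
import tempfile
from pathlib import Path


def make_conda_dirs(base_dir):
    """Create the environment and package-cache directories under base_dir."""
    base = Path(base_dir)
    envs_dir = base / "condaenv"
    pkgs_dir = base / "condapkg"
    for directory in (envs_dir, pkgs_dir):
        # parents=True also creates missing parent directories;
        # exist_ok=True makes the call safe to repeat.
        directory.mkdir(parents=True, exist_ok=True)
    return envs_dir, pkgs_dir


def render_condarc(envs_dir, pkgs_dir):
    """Return the four .condarc lines that point conda at the two directories."""
    return (
        "envs_dirs:\n"
        f"  - {envs_dir}/\n"
        "pkgs_dirs:\n"
        f"  - {pkgs_dir}/\n"
    )


# Demo in a throwaway directory (a stand-in for your group storage path).
with tempfile.TemporaryDirectory() as tmp:
    envs_dir, pkgs_dir = make_conda_dirs(Path(tmp) / "XXXXlab" / "user")
    dirs_exist = envs_dir.is_dir() and pkgs_dir.is_dir()
    condarc_text = render_condarc(envs_dir, pkgs_dir)

print(condarc_text)
```

To use it for real, pass your own group storage path to `make_conda_dirs` and copy the printed lines into `~/.condarc`.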

Install Selected Python Packages#

  1. Activate conda environment and install new packages

    module load use.own
    module load conda-env/scrna_seq_py_env-py3.12.5
    
    conda list # check packages installed in this environment
    
    pip install jupyter
    pip install numpy
    pip install pandas
    pip install anndata
    conda install -c conda-forge scanpy
    conda install -c bioconda scrublet
    pip install harmony-pytorch
    pip install gseapy
    pip install scanorama
    pip install pyscenic
    pip install scvi-tools
    pip install -i https://test.pypi.org/simple/ memento
    pip install pooch
    conda install -c conda-forge python-igraph
    
    conda list # check again
    
  2. Create a jupyter kernel

    conda-env-mod kernel -n scrna_seq_py_env
    

    You will see something like this:

    requested kernel with arguments:  -n 'scrna_seq_py_env' --
    
    Jupyter kernel created: "Python (My scrna_seq_py_env Kernel)"
    +---------------------------------------------------------------+
    | We recommend installing packages into your kernel environment |
    | via the command line (with 'conda install' or 'pip install'). |
    +---------------------------------------------------------------+
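
To confirm that the kernel was registered, you can look for its `kernel.json` file. The sketch below assumes the kernel was installed into Jupyter's default per-user location on Linux, `~/.local/share/jupyter/kernels`; if your site places kernels elsewhere, pass that directory instead. The function name `list_user_kernels` is hypothetical.

```python
import json
from pathlib import Path


def list_user_kernels(kernels_dir=None):
    """Return {kernel directory name: display name} for kernelspecs in kernels_dir."""
    if kernels_dir is None:
        # Assumed default per-user kernelspec location on Linux.
        kernels_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    kernels_dir = Path(kernels_dir)
    found = {}
    if not kernels_dir.is_dir():
        return found
    # Each kernel lives in its own subdirectory containing a kernel.json file.
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with spec_file.open() as fh:
            spec = json.load(fh)
        found[spec_file.parent.name] = spec.get("display_name", "")
    return found


if __name__ == "__main__":
    for name, display_name in list_user_kernels().items():
        print(f"{name}: {display_name}")
```

If the kernel created above is present, its display name should match the one printed by conda-env-mod.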
    

Using Open OnDemand Jupyter Lab#

Navigate to Open OnDemand.

In the Open OnDemand dashboard, go to Interactive Apps => Jupyter Lab, select the number of hours, number of cores, and amount of memory that you would like to request, and launch the job.

Under Notebook, select the kernel you just created. It is listed under its display name, for example: Python (My scrna_seq_py_env Kernel).

Start writing your Python code from there.

Example code to check the installation:

# Import installed packages
import os
import seaborn as sns
import scanpy as sc
import scrublet as scr
import anndata
import harmony
import memento
import numpy as np
import pandas as pd
import scvi
import matplotlib.pyplot as plt
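
If one of the imports above fails, the whole cell stops at the first error. A slightly more forgiving check, sketched here with a hypothetical helper `check_imports`, tries each module in turn and reports either its version or the error. The module names are the ones imported above; reading `__version__` is a common convention, not something every package guarantees.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {name: version string or failure message}."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # ImportError, or any error raised during import
            report[name] = f"FAILED ({type(exc).__name__}: {exc})"
        else:
            # Many, but not all, packages expose a __version__ attribute.
            report[name] = getattr(module, "__version__", "version unknown")
    return report


modules = [
    "seaborn", "scanpy", "scrublet", "anndata", "harmony",
    "memento", "numpy", "pandas", "scvi", "matplotlib",
]
for name, status in check_imports(modules).items():
    print(f"{name:12s} {status}")
```

Any line starting with FAILED points to a package that needs to be reinstalled from the command line in the same environment.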

Single-Cell RNA-seq Analysis Packages#

Scanpy#

  • Summary: Scanpy is a widely used Python package for analyzing large-scale single-cell RNA-seq datasets. It is optimized for scalability and supports workflows for preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data.

  • Paper: Wolf, F. A., Angerer, P., & Theis, F. J. (2018). “Scanpy: large-scale single-cell gene expression data analysis.” Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0

  • Website: https://scanpy.readthedocs.io

Scrublet#

  • Summary: Scrublet is a Python tool designed to detect doublets in single-cell RNA-seq data. Doublets are instances where two cells are captured in a single droplet, which can distort downstream analysis. Scrublet uses a k-nearest neighbors approach to identify and score potential doublets.

  • Paper: Wolock, S. L., Lopez, R., & Klein, A. M. (2019). “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems, 8(4), 281–291.e9. https://doi.org/10.1016/j.cels.2018.11.005

  • Website: https://github.com/AllonKleinLab/scrublet
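
The k-nearest-neighbors idea in the summary above can be illustrated with a toy example in plain Python. This is a simplified sketch of the general approach, not Scrublet's actual implementation or API: it simulates doublets by summing random pairs of cells, normalizes by total counts, and scores each observed cell by the fraction of simulated doublets among its nearest neighbors. The function names, the four-gene toy data, and the choice of k are all assumptions made for illustration.

```python
import math
import random


def simulate_doublets(cells, n_doublets, rng):
    """Create synthetic doublets by summing the counts of two random cells."""
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(cells, 2)
        doublets.append([x + y for x, y in zip(a, b)])
    return doublets


def normalize(cell):
    """Scale a count vector so that its entries sum to 1."""
    total = sum(cell)
    return [x / total for x in cell] if total else list(cell)


def knn_doublet_scores(observed, simulated, k=5):
    """Score each observed cell by the fraction of simulated doublets among its k nearest neighbors."""
    obs = [normalize(c) for c in observed]
    sim = [normalize(c) for c in simulated]
    # Tag every point: False = observed cell, True = simulated doublet.
    points = [(p, False) for p in obs] + [(p, True) for p in sim]
    scores = []
    for i, cell in enumerate(obs):
        dists = sorted(
            (math.dist(cell, p), is_sim)
            for j, (p, is_sim) in enumerate(points)
            if j != i  # skip the cell itself
        )
        scores.append(sum(is_sim for _, is_sim in dists[:k]) / k)
    return scores


# Toy data: type A expresses genes 0-1, type B expresses genes 2-3.
rng = random.Random(0)
type_a = [[rng.randint(40, 60), rng.randint(40, 60), rng.randint(0, 3), rng.randint(0, 3)]
          for _ in range(20)]
type_b = [[rng.randint(0, 3), rng.randint(0, 3), rng.randint(40, 60), rng.randint(40, 60)]
          for _ in range(20)]
# One "real" doublet made of an A cell and a B cell, appended last.
true_doublet = [a + b for a, b in zip(type_a[0], type_b[0])]
observed = type_a + type_b + [true_doublet]

simulated = simulate_doublets(observed, 40, rng)
scores = knn_doublet_scores(observed, simulated, k=5)
print("doublet score of the mixed cell:", scores[-1])
print("mean score of the other cells:  ", sum(scores[:-1]) / len(scores[:-1]))
```

In this toy setup the mixed cell sits among the simulated A+B doublets, so it receives a higher score than typical single cells. Doublets formed from two cells of the same type look like ordinary cells after normalization and are not separated by this kind of scoring.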

AnnData#

  • Summary: AnnData is a Python package that provides a framework for managing annotated data matrices, tailored for large-scale single-cell RNA-seq data. AnnData is widely used as the primary data structure in Scanpy, enabling efficient storage and handling of both raw and processed single-cell data.

  • Paper: Virshup, I., Rybakov, S., Theis, F. J., Angerer, P., & Wolf, F. A. (2024). “anndata: Access and store annotated data matrices.” The Journal of Open Source Software. https://doi.org/10.21105/joss.04371

  • Website: https://anndata.readthedocs.io

Harmony#

  • Summary: Harmony is a tool designed for batch effect correction in single-cell RNA-seq datasets. It integrates datasets from different batches or conditions by aligning data in a shared embedding space, allowing biological variation to be preserved while minimizing technical differences.

  • Paper: Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature Methods, 16(12), 1289-1296. https://doi.org/10.1038/s41592-019-0619-0

  • Website: https://portals.broadinstitute.org/harmony

Memento#

  • Summary: Memento is a statistical tool tailored for single-cell RNA sequencing (scRNA-seq) analysis, with a focus on decoupling measurement noise from biological expression variability, thereby improving accuracy in differential expression studies.

  • Paper: Kim, M. C., Gate, R., Lee, D. S., Marson, A., Ntranos, V., & Ye, C. J. (2024). “Method of moments framework for differential expression analysis of single-cell RNA sequencing data.” Cell, 187(22), 6393–6410.e16. https://doi.org/10.1016/j.cell.2024.08.022

  • Website: https://github.com/yelabucsf/scrna-parameter-estimation

scVI-tools#

  • Summary: scVI-tools is a framework built on top of PyTorch for scalable probabilistic modeling of single-cell data. It includes models such as scVI (single-cell variational inference), totalVI, and PeakVI, which are used for data integration, dimensionality reduction, differential expression, and multi-omics data analysis.

  • Paper: Gayoso, A., Lopez, R., Xing, G., Boyeau, P., Wu, K., Jayasuriya, M., et al. (2022). “A Python library for probabilistic analysis of single-cell omics data.” Nature Biotechnology, 40, 163–166. https://www.nature.com/articles/s41587-021-01206-w

  • Website: https://scvi-tools.org