Set Up Conda Environment and Create Jupyter Kernel for scRNA-seq Analysis#
Shirley Li, Bioinformatician, TTS Research Technology xue.li37@tufts.edu
Date: 2024-11-01
Overview#
In this tutorial, you will learn how to:
Create a Conda environment for single-cell RNA-seq analysis using Python-only packages.
Install popular Python packages for scRNA-seq, such as Scanpy and Scrublet.
Set up a Jupyter kernel that uses the Conda environment for easy access to the tools in a notebook interface.
Create a Conda Environment for scRNA-seq#
Load the miniforge and conda-env-mod modules
module load miniforge/24.7.1-py312
module load conda-env-mod/default
Configure your conda
Note: the steps in this section only need to be executed once.
Since your home directory has limited storage, it’s recommended to install conda packages in your group research storage space. Follow these steps:
Create two directories in your group research storage space (one for storing the envs, one for storing the pkgs, for example: condaenv, condapkg)
mkdir /cluster/tufts/XXXXlab/$USER/condaenv/
mkdir /cluster/tufts/XXXXlab/$USER/condapkg/
If you haven’t used conda before on the cluster, create a file named “.condarc” in your home directory.
Now add the following 4 lines to the .condarc file in your home directory (modify the paths to match your real directories):
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
After this, your .condarc file should look like this:
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
channels:
  - bioconda
  - conda-forge
  - defaults
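The .condarc layout above can also be produced programmatically. The following is a minimal sketch, assuming a hypothetical helper named render_condarc and placeholder values for the lab directory and username (neither is part of conda). It only builds and prints the text so that you can review it before saving it as .condarc in your home directory.

```python
from pathlib import PurePosixPath


def render_condarc(lab_dir, user, channels=("bioconda", "conda-forge", "defaults")):
    """Return .condarc text that points conda envs and pkgs at group storage.

    render_condarc is a hypothetical helper for illustration, not a conda API.
    """
    base = PurePosixPath(lab_dir) / user
    lines = [
        "envs_dirs:",
        f"  - {base / 'condaenv'}/",
        "pkgs_dirs:",
        f"  - {base / 'condapkg'}/",
        "channels:",
    ]
    lines += [f"  - {name}" for name in channels]
    return "\n".join(lines) + "\n"


# XXXXlab and your_username are placeholders; replace them with your real values.
condarc_text = render_condarc("/cluster/tufts/XXXXlab", "your_username")
print(condarc_text)
```

Printing instead of writing the file directly avoids overwriting an existing .condarc that may already contain other settings.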
Create your conda environment with conda-env-mod
cd /cluster/tufts/XXXXlab/$USER/condaenv/
conda-env-mod create -p scrna_seq_py_env python=3.8 --jupyter
You will see output like the following. Enter y to continue.
...
The following NEW packages will be INSTALLED:
_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
asttokens conda-forge/noarch::asttokens-2.4.1-pyhd8ed1ab_0
bzip2 conda-forge/linux-64::bzip2-1.0.8-hd590300_5
ca-certificates conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0
...
Proceed ([y]/n)? y
When it’s complete, you will see something like this:
...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+---------------------------------------------------------------+
| To use this environment, load the following modules: |
| module load use.own |
| module load conda-env/scrna_seq_py_env-py3.12.5 |
| (then standard 'conda install' / 'pip install' / run scripts) |
+---------------------------------------------------------------+
Install Selected Python Packages#
Activate the conda environment and install new packages
module load use.own
module load conda-env/scrna_seq_py_env-py3.12.5
conda list # check packages installed in this environment
pip install jupyter
pip install numpy
pip install pandas
pip install anndata
conda install -c conda-forge scanpy
conda install -c bioconda scrublet
pip install harmony-pytorch
pip install gseapy
pip install scanorama
pip install pyscenic
pip install scvi-tools
pip install -i https://test.pypi.org/simple/ memento
pip install pooch
conda install -c conda-forge python-igraph
conda list # check again
Create a jupyter kernel
conda-env-mod kernel -n scrna_seq_py_env
You will see something like this:
requested kernel with arguments: -n 'scrna_seq_py_env' --
Jupyter kernel created: "Python (My scrna_seq_py_env Kernel)"
+---------------------------------------------------------------+
| We recommend installing packages into your kernel environment |
| via the command line (with 'conda install' or 'pip install'). |
+---------------------------------------------------------------+
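To confirm that the kernel was registered, you can list the kernel specifications visible to your user account. The sketch below is an illustration only: list_kernels is a hypothetical helper, and the directory it scans (.local/share/jupyter/kernels under your home directory) is the usual per-user Jupyter location on Linux, which is an assumption about where conda-env-mod writes the kernel on your cluster.

```python
import json
from pathlib import Path


def list_kernels(kernels_dir):
    """Return {kernel directory name: display_name} for each kernel.json found."""
    found = {}
    root = Path(kernels_dir)
    if not root.is_dir():
        return found
    for spec_file in sorted(root.glob("*/kernel.json")):
        try:
            with spec_file.open() as handle:
                spec = json.load(handle)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed kernel specs
        found[spec_file.parent.name] = spec.get("display_name", "")
    return found


# Assumed default per-user kernel location on Linux; adjust if your site differs.
default_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
kernels = list_kernels(default_dir)
for name, display in kernels.items():
    print(f"{name}: {display}")
```

If the kernel does not appear, running jupyter kernelspec list from the activated environment shows the locations Jupyter actually searches.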
Using Open OnDemand Jupyter Lab#
Navigate to Open OnDemand.
In the Open OnDemand dashboard, go to Interactive Apps => Jupyter Lab, select the number of hours, number of cores, and amount of memory that you would like to request, and launch the job.
Under Notebook, select the kernel you just created, e.g. scrna_seq_py_env.
Start your Python code from there.
Example code to check the installation:
# Import installed packages
import os
import seaborn as sns
import scanpy as sc
import scrublet as scr
import anndata
import harmony
import memento
import numpy as np
import pandas as pd
import scvi
import matplotlib.pyplot as plt
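A plain list of imports stops at the first missing package. As a complementary check, the sketch below reports every module from the list above in one pass. It assumes a hypothetical helper named check_imports and uses only the Python standard library. It locates modules without importing them, so a module reported as found can still fail at import time if one of its own dependencies is broken.

```python
import importlib.util


def check_imports(module_names):
    """Map each module name to True if Python can locate it, without importing it."""
    status = {}
    for name in module_names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            status[name] = False
    return status


# Import names from the check above (harmony is provided by harmony-pytorch,
# scvi by scvi-tools).
modules = ["os", "seaborn", "scanpy", "scrublet", "anndata", "harmony",
           "memento", "numpy", "pandas", "scvi", "matplotlib"]
report = check_imports(modules)
for name, found in report.items():
    print(f"{name:12s} {'found' if found else 'MISSING'}")
```

Any module reported as MISSING can be installed from the command line with the environment modules loaded, then checked again.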
Single-Cell RNA-seq Analysis Packages#
Scanpy#
Summary: Scanpy is a widely used Python package for analyzing large-scale single-cell RNA-seq datasets. It is optimized for scalability and supports workflows for preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data.
Paper: Wolf, F. A., Angerer, P., & Theis, F. J. (2018). “Scanpy: large-scale single-cell gene expression data analysis.” Genome Biology, 19(1), 15. https://doi.org/10.1186/s13059-017-1382-0
Website: https://scanpy.readthedocs.io
Scrublet#
Summary: Scrublet is a Python tool designed to detect doublets in single-cell RNA-seq data. Doublets are instances where two cells are captured in a single droplet, which can distort downstream analysis. Scrublet uses a k-nearest neighbors approach to identify and score potential doublets.
Paper: Wolock, S. L., Lopez, R., & Klein, A. M. (2019). “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems, 8(4), 281–291.e9. https://doi.org/10.1016/j.cels.2018.11.005
Website: AllonKleinLab/scrublet
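The k-nearest neighbors idea can be illustrated with a toy example. This is a simplified sketch in pure Python, not Scrublet's implementation or API: the function names, the two-gene count data, and the scoring rule (fraction of nearest neighbors that are simulated doublets) are assumptions chosen for illustration. The real tool adds normalization, dimensionality reduction, and a more careful score.

```python
import math
import random


def simulate_doublets(cells, n_doublets, rng):
    """Create synthetic doublets by summing the counts of random cell pairs."""
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(cells, 2)
        doublets.append([x + y for x, y in zip(a, b)])
    return doublets


def doublet_scores(cells, doublets, k=5):
    """Score each observed cell by the fraction of its k nearest neighbors
    (among observed cells and simulated doublets) that are simulated doublets."""
    labeled = [(c, False) for c in cells] + [(d, True) for d in doublets]
    scores = []
    for i, cell in enumerate(cells):
        distances = []
        for j, (other, is_doublet) in enumerate(labeled):
            if j == i:
                continue  # skip the cell itself
            distances.append((math.dist(cell, other), is_doublet))
        distances.sort(key=lambda pair: pair[0])
        nearest = distances[:k]
        scores.append(sum(flag for _, flag in nearest) / len(nearest))
    return scores


# Toy data: two cell types, each expressing one of two genes, plus one cell
# with a mixed profile that mimics a captured doublet.
rng = random.Random(0)
type_a = [[10 + rng.randint(-1, 1), rng.randint(0, 1)] for _ in range(20)]
type_b = [[rng.randint(0, 1), 10 + rng.randint(-1, 1)] for _ in range(20)]
observed = type_a + type_b + [[10, 10]]
simulated = simulate_doublets(observed, n_doublets=40, rng=rng)
scores = doublet_scores(observed, simulated, k=5)
print(f"score of mixed-profile cell: {scores[-1]:.2f}")
print(f"mean score of other cells:   {sum(scores[:-1]) / len(scores[:-1]):.2f}")
```

Cells whose neighborhood is dominated by simulated doublets receive a high score, which is the intuition behind flagging them as likely doublets.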
AnnData#
Summary: AnnData is a Python package that provides a framework for managing annotated data matrices, tailored for large-scale single-cell RNA-seq data. AnnData is widely used as the primary data structure in Scanpy, enabling efficient storage and handling of both raw and processed single-cell data.
Paper: Virshup, I., Rybakov, S., Theis, F. J., Angerer, P., & Wolf, F. A. (2024). “anndata: Access and store annotated data matrices.” The Journal of Open Source Software. https://doi.org/10.21105/joss.04371
Website: https://anndata.readthedocs.io
Harmony#
Summary: Harmony is a tool designed for batch effect correction in single-cell RNA-seq datasets. It integrates datasets from different batches or conditions by aligning data in a shared embedding space, allowing biological variation to be preserved while minimizing technical differences.
Paper: Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., … & Raychaudhuri, S. (2019). “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature Methods, 16(12), 1289-1296. https://doi.org/10.1038/s41592-019-0619-0
Memento#
Summary: Memento is a statistical tool tailored for single-cell RNA sequencing (scRNA-seq) analysis, with a focus on decoupling measurement noise from biological expression variability, thereby improving accuracy in differential expression studies.
Paper: Kim, M. C., Gate, R., Lee, D. S., Marson, A., Ntranos, V., Ye, C. J. (2024). “Method of moments framework for differential expression analysis of single-cell RNA sequencing data.” Cell, 187(22), P6393-6410.E16. https://doi.org/10.1016/j.cell.2024.08.022
Website: yelabucsf/scrna-parameter-estimation
scVI-tools#
Summary: scVI-tools is a framework built on top of PyTorch for scalable probabilistic modeling of single-cell data. It includes various models, such as scVI (single-cell variational inference), totalVI, and PeakVI, used for data integration, dimensionality reduction, differential expression, and multi-omics data analysis.
Paper: Gayoso, A., Lopez, R., Xing, G., Boyeau, P., Wu, K., Jayasuriya, M., et al. (2022). “A Python library for probabilistic analysis of single-cell omics data.” Nature Biotechnology, 40, 163–166. https://www.nature.com/articles/s41587-021-01206-w
Website: https://scvi-tools.org