---
tags: bioinformatics
---
# Set Up Conda Environment and Create Jupyter Kernel for scRNA-seq Analysis

Shirley Li, Bioinformatician, TTS Research Technology
xue.li37@tufts.edu

Date: 2024-11-01

## Overview

In this tutorial, you will learn how to:

- Create a Conda environment for single-cell RNA-seq analysis using Python-only packages.
- Install popular Python packages for scRNA-seq, such as Scanpy and Scrublet.
- Set up a Jupyter kernel that uses the Conda environment, so the tools are available in a notebook interface.

### Create a Conda Environment for scRNA-seq

1. Load the `miniforge` and `conda-env-mod` modules

```
module load miniforge/24.7.1-py312
module load conda-env-mod/default
```

2. Configure your conda

**Note: the steps in this section only need to be executed ONCE.**

Since your home directory has limited storage, it’s recommended to install conda packages in your group research storage space. Follow these steps:

Create two directories in your group research storage space (one for storing the envs, one for storing the pkgs, for example: condaenv, condapkg)

```
mkdir -p /cluster/tufts/XXXXlab/$USER/condaenv/
mkdir -p /cluster/tufts/XXXXlab/$USER/condapkg/
```

If you haven’t used conda before on the cluster, create a file named `.condarc` in your home directory. Then add the following 4 lines to the `.condarc` file (modify the paths to match your real directories):

```
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
```

After this, your `.condarc` file should look like this:

```
envs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condaenv/
pkgs_dirs:
  - /cluster/tufts/XXXXlab/$USER/condapkg/
channels:
  - bioconda
  - conda-forge
  - defaults
```

3. Create your conda environment with `conda-env-mod`

```
cd /cluster/tufts/XXXXlab/$USER/condaenv/
conda-env-mod create -p scrna_seq_py_env python=3.8 --jupyter
```

You will see something like the following. Enter `y` to continue.

```
...
The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  asttokens          conda-forge/noarch::asttokens-2.4.1-pyhd8ed1ab_0
  bzip2              conda-forge/linux-64::bzip2-1.0.8-hd590300_5
  ca-certificates    conda-forge/linux-64::ca-certificates-2024.7.4-hbcca054_0
...

Proceed ([y]/n)? y
```

When it’s complete, you will see something like this:

```
...
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
+---------------------------------------------------------------+
| To use this environment, load the following modules:          |
|     module load use.own                                       |
|     module load conda-env/scrna_seq_py_env-py3.12.5           |
| (then standard 'conda install' / 'pip install' / run scripts) |
+---------------------------------------------------------------+
```

### Install Selected Python Packages

1. Activate the conda environment and install new packages

```
module load use.own
module load conda-env/scrna_seq_py_env-py3.12.5
conda list # check packages installed in this environment
pip install jupyter
pip install numpy
pip install pandas
pip install anndata
conda install -c conda-forge scanpy
conda install -c bioconda scrublet
pip install harmony-pytorch
pip install gseapy
pip install scanorama
pip install pyscenic
pip install scvi-tools
pip install -i https://test.pypi.org/simple/ memento
pip install pooch
conda install -c conda-forge python-igraph
conda list # check again
```

2. Create a Jupyter kernel

```
conda-env-mod kernel -n scrna_seq_py_env
```

You will see something like this:

```
requested kernel with arguments: -n 'scrna_seq_py_env' --
Jupyter kernel created: "Python (My scrna_seq_py_env Kernel)"
+---------------------------------------------------------------+
| We recommend installing packages into your kernel environment |
| via the command line (with 'conda install' or 'pip install'). |
+---------------------------------------------------------------+
```

## Using Open OnDemand Jupyter Lab

Navigate to [Open OnDemand](https://ondemand.pax.tufts.edu/).

In the Open OnDemand dashboard, go to `Interactive APPs` => `Jupyter Lab`, select the `number of hours`, `number of cores`, and `Amount of memory` that you would like to request, and launch the job.

Under `Notebook`, select the kernel you just created. Ex: `scrna_seq_py_env`

Start your Python code from there.

Example code to check the installation:

```
# Import installed packages
import os
import seaborn as sns
import scanpy as sc
import scrublet as scr
import anndata
import harmony
import memento
import numpy as np
import pandas as pd
import scvi
import matplotlib.pyplot as plt
```

## Single-Cell RNA-seq Analysis Packages

### Scanpy

- **Summary**: `Scanpy` is a widely used Python package for analyzing large-scale single-cell RNA-seq datasets. It is optimized for scalability and supports workflows for preprocessing, clustering, dimensionality reduction, differential expression, and visualization of single-cell data.
- **Paper**: Wolf, F. A., Angerer, P., & Theis, F. J. (2018). "Scanpy: large-scale single-cell gene expression data analysis." _Genome Biology_, 19(1), 15. [https://doi.org/10.1186/s13059-017-1382-0](https://doi.org/10.1186/s13059-017-1382-0)
- **Website**: [https://scanpy.readthedocs.io](https://scanpy.readthedocs.io)

### Scrublet

- **Summary**: `Scrublet` is a Python tool designed to detect doublets in single-cell RNA-seq data. Doublets are instances where two cells are captured in a single droplet, which can distort downstream analysis. Scrublet simulates doublets from the observed data and uses a k-nearest neighbors classifier to score each cell as a potential doublet.
- **Paper**: Wolock, S. L., Lopez, R., & Klein, A. M. (2019). "Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data." _Cell Systems_, 8(4), 281–291.e9.
[https://doi.org/10.1016/j.cels.2018.11.005](https://doi.org/10.1016/j.cels.2018.11.005)
- **Website**: [https://github.com/AllonKleinLab/scrublet](https://github.com/AllonKleinLab/scrublet)

### AnnData

- **Summary**: `AnnData` is a Python package that provides a framework for managing annotated data matrices, tailored for large-scale single-cell RNA-seq data. AnnData is the primary data structure in `Scanpy`, enabling efficient storage and handling of both raw and processed single-cell data.
- **Paper**: Virshup, I., Rybakov, S., Theis, F. J., Angerer, P., & Wolf, F. A. (2024). "anndata: Access and store annotated data matrices." _The Journal of Open Source Software_. [https://doi.org/10.21105/joss.04371](https://doi.org/10.21105/joss.04371)
- **Website**: [https://anndata.readthedocs.io](https://anndata.readthedocs.io)

### Harmony

- **Summary**: `Harmony` is a tool designed for batch effect correction in single-cell RNA-seq datasets. It integrates datasets from different batches or conditions by aligning data in a shared embedding space, preserving biological variation while minimizing technical differences.
- **Paper**: Korsunsky, I., Millard, N., Fan, J., Slowikowski, K., Zhang, F., Wei, K., ... & Raychaudhuri, S. (2019). "Fast, sensitive and accurate integration of single-cell data with Harmony." _Nature Methods_, 16(12), 1289–1296. [https://doi.org/10.1038/s41592-019-0619-0](https://doi.org/10.1038/s41592-019-0619-0)
- **Website**: [https://portals.broadinstitute.org/harmony](https://portals.broadinstitute.org/harmony)

### Memento

- **Summary**: `Memento` is a statistical tool for single-cell RNA sequencing (scRNA-seq) analysis. It focuses on decoupling measurement noise from biological expression variability, thereby improving accuracy in differential expression studies.
- **Paper**: Kim, M. C., Gate, R., Lee, D. S., Marson, A., Ntranos, V., Ye, C. J. (2024). "Method of moments framework for differential expression analysis of single-cell RNA sequencing data." _Cell_, 187(22), 6393–6410.e16. [https://doi.org/10.1016/j.cell.2024.08.022](https://doi.org/10.1016/j.cell.2024.08.022)
- **Website**: [https://github.com/yelabucsf/scrna-parameter-estimation](https://github.com/yelabucsf/scrna-parameter-estimation)

### scVI-tools

- **Summary**: `scVI-tools` is a framework built on top of PyTorch for scalable probabilistic modeling of single-cell data. It includes various models, such as scVI (single-cell variational inference), totalVI, and PeakVI, used for data integration, dimensionality reduction, differential expression, and multi-omics data analysis.
- **Paper**: Gayoso, A., Lopez, R., Xing, G., Boyeau, P., Wu, K., Jayasuriya, M., et al. (2022). "A Python library for probabilistic analysis of single-cell omics data." _Nature Biotechnology_, 40, 163–166. [https://www.nature.com/articles/s41587-021-01206-w](https://www.nature.com/articles/s41587-021-01206-w)
- **Website**: [https://scvi-tools.org](https://scvi-tools.org)