Single cell RNA sequencing (scRNAseq) is a next generation sequencing technology that produces gene expression data on thousands of single cells. This rich dataset can be mined to answer questions such as what types of cells are present in a sample and what is the frequency of the different cell types. Our group is interested in understanding the basis of chemotherapy resistance in ovarian cancer, which is the main cause of mortality in women with this deadly cancer[
There are multiple inherent problems that make it difficult to perform robust unbiased cell clustering using scRNAseq data. First, gene expression measurements produced by scRNAseq technologies are not exact and spike-in experiments suggest that the technique that we use (10× Genomics 3’ Chromium) only measures ~5%-10% of the polyA transcriptome[
To generate scRNAseq datasets, we use the 10× Genomics 3’ Chromium platform followed by sequencing on Illumina machines. The data are initially processed using Cell Ranger software to produce a gene expression matrix based on unique molecular identifier (UMI) counts. A typical dataset will be a matrix of several thousand cells by ~18,000 genes, with each cell having UMI counts for 1000-3000 genes. One of the initial steps is to perform unbiased clustering of the cells based on each cell’s gene expression. In this paper, we describe our use of the Seurat R package[
Instead of providing “benchmarking” between different methods, the purpose of this paper is to provide guidance on setting parameters in the Seurat R package when analyzing scRNAseq data generated from fresh tumor samples. We recommend doing the same type of comprehensive analysis for any method before relying on “default” parameters.
The basic clustering algorithm used in Seurat is a shared nearest neighbor (SNN) graph-based clustering method[
A fresh tumor sample was collected from a patient enrolled in our ovarian cancer precision medicine initiative (OCPMI) following our IRB-approved protocol (#2018NTLS170). Fat, fibrous, and necrotic areas were removed from the tumor sample and a 0.9-g sample was used for scRNAseq. A single cell suspension was created using the Miltenyi Biotec gentleMACS Tissue Dissociation Kit following protocol 2.2.1. Briefly, the tumor sample was minced and placed in a specialized conical tube containing a mixture of dissociation enzymes in media. The tube was placed on a mechanical rotator for 30 s followed by incubation on a rotator at 37 °C for 30 min. The process was repeated, and the cell solution was poured through a 70-µm filter to remove cell clumps and debris. Cells were treated with red cell lysis buffer for 5-10 min at room temperature, centrifuged, and resuspended in HypoThermosol. An aliquot was removed for measuring cell viability using the Cell Countess (Life Sciences). The single cell solution was diluted to a concentration of 1000 viable cells/µL and transported to the sequencing facility on ice.
Samples were sequenced at the University of Minnesota’s Genomics Center using the 10× Genomics Single Cell 3’ Protocol utilizing the Chromium™ Single Cell 3’ Library & Gel Bead Kit and Chromium™ Single Cell A Chip Kit following the manufacturer’s protocol (Protocol document CG000183 Rev C). Approximately 20,000 cells were partitioned into nanoliter-scale Gel Bead-In-EMulsions (GEMs) with one cell per GEM. Within each GEM, cells were lysed, and then primers were released and mixed with cell lysate. Incubation of the GEMs produced barcoded, full-length cDNA from mRNA. The full-length, barcoded cDNA was then amplified by PCR prior to library construction. Sequencing was performed using an Illumina HiSeq 2500 or NovaSeq to a depth of at least 100,000 paired-end reads per cell.
Illumina raw sequencing output files were processed using Cell Ranger™ software (v. 3.02) to produce a filtered gene × cell matrix of UMI counts. The matrix consists of three output files that define a sparse matrix (
We established a baseline analysis of our dataset based on our analysis of over 75 scRNAseq datasets from ovarian cancer patients. The baseline was established through preliminary tests of many of the Seurat parameters, followed by iterative analyses of multiple parameter variations. For this study, we demonstrate the effect of changing these parameters by comparing clustering results to our baseline analysis.
Initial cell filtering parameters for selecting cells based on number of genes/cell, UMI counts/cell, and percent mitochondrial genes were established based on manual visualization of graphic outputs for these metrics (
Normalization is important for scRNAseq datasets because of the sparsity, bias, and noise inherent in this technique, and multiple methods have been proposed for scRNAseq[
Variable gene parameters were tested by comparing the three selection methods (vst, mean.var.plot, and dispersion). Comparisons were also conducted for different loess spans (0.3, 0.1, and 0.5), number of bins (1, 10, 20, and 100), binning method (equal_width or equal_frequency), and number of features based on percentage of total genes (20%, 1%, and 6.7%). Our baseline analysis used the vst method, a loess span of 0.3, 20 bins using equal width, and a target of 6.7% of features.
When scaling the data, we compared the following four combinations of variables to regress: UMI count, percent mitochondrial genes, both UMI count and percent mitochondrial genes, or no variables regressed out. We also tested three values for the scale.max parameter (10, 50, and 100). For the baseline, we regressed out UMI count and percent mitochondrial genes with a maximum scale of 50. Note that we did not interrogate the different models or block sizes and minimum cells to block.
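To make the role of the maximum scale value concrete, the following minimal Python sketch centers and scales one gene across cells and then caps the result. It is an illustration only, not the Seurat (R) implementation: the function name `scale_and_clip` and the example values are hypothetical, and we assume that only the upper tail is capped and that the sample standard deviation is used. Regression of covariates is not shown.

```python
from statistics import mean, stdev


def scale_and_clip(values, scale_max=50.0):
    """Center and scale one gene across cells, then cap values at scale_max.

    Assumptions (not confirmed by the text): only values above scale_max
    are capped, and the sample standard deviation is used for scaling.
    """
    if len(values) < 2:
        return [0.0 for _ in values]
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A gene with no variation carries no information after scaling.
        return [0.0 for _ in values]
    return [min((v - mu) / sd, scale_max) for v in values]


# Hypothetical normalized expression of one gene in five cells.
gene = [0.0, 0.0, 0.0, 0.0, 100.0]
scaled = scale_and_clip(gene, scale_max=1.5)
```

In this toy example the single high-expressing cell has a z-score of about 1.79, so it is capped at 1.5, while the other cells are unaffected. With a cap as large as 10, 50, or 100, only extreme outliers would be altered.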
The Seurat implementation of SNN requires the following parameter inputs when running the FindNeighbors function: reduction type, number of dimensions, a k parameter, a prune parameter, a nearest neighbor method (rann or annoy), an annoy distance metric, and a nearest neighbor error boundary. After identifying cell neighbors, the FindClusters function identifies clusters and also requires the following input parameters: algorithm choice (Louvain, refined Louvain, SLM, or Leiden), a resolution parameter, and other parameters that we left at default. We limited our analysis to a principal component reduction.
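The interplay of the k and prune parameters can be sketched as follows. This is a simplified Python illustration of the general SNN idea (edge weight as the Jaccard overlap of two cells' nearest-neighbor sets, with weak edges pruned), not the Seurat code itself. The function name `snn_edges`, the toy neighbor sets, and the choice to drop edges at or below the prune cut-off are assumptions made for this example.

```python
def snn_edges(knn, prune):
    """Build weighted SNN edges from precomputed nearest-neighbor sets.

    knn   : dict mapping each cell to the set of its k nearest neighbors
            (here each set includes the cell itself; an assumption).
    prune : edges with Jaccard overlap at or below this value are dropped.
    """
    cells = sorted(knn)
    edges = {}
    for i, a in enumerate(cells):
        for b in cells[i + 1:]:
            shared = len(knn[a] & knn[b])
            union = len(knn[a] | knn[b])
            weight = shared / union if union else 0.0
            if weight > prune:
                edges[(a, b)] = weight
    return edges


# Toy neighbor sets for five hypothetical cells with k = 3.
knn = {
    "a": {"a", "b", "c"},
    "b": {"a", "b", "c"},
    "c": {"b", "c", "d"},
    "d": {"c", "d", "e"},
    "e": {"c", "d", "e"},
}
edges = snn_edges(knn, prune=0.3)
```

A larger k enlarges each neighbor set and tends to connect more cells, while a larger prune value removes more weak edges. Both therefore change the graph on which the community detection step operates.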
A major input parameter required by the Seurat pipeline is the number of reduction dimensions to use. The Seurat pipeline recommends using an Elbow plot of PC standard deviations
To obtain a robust clustering solution, we performed iterative analyses using a range of values for several of these parameters and compared the clustering solutions to find the number of clusters that is most frequently obtained. To automate this iterative analysis, we wrote an R script (seurat_pkpr_loop.R) that tests ranges of the four most important clustering parameters: number of reduction dimensions (
A final custom script was used to assign cell types to clusters based on percentage overlap of upregulated genes in that cluster compared to gene lists generated for known cell types based on literature. We named this annotation method cluster annotation by differential gene expression (CADGE). The custom R script (seurat_final.R) and the gene lists (annotated_gene_lists) used to perform CADGE are provided in the Supplementary Materials.
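The idea behind this annotation step can be sketched in a few lines of Python. This is not the seurat_final.R script: the function name `annotate_cluster`, the short marker lists, and the choice of denominator (the size of each reference gene list) are assumptions made for illustration only.

```python
def annotate_cluster(upregulated_genes, gene_lists):
    """Assign the cell type whose reference list overlaps most, in percent.

    upregulated_genes : genes upregulated in one cluster
    gene_lists        : dict mapping cell type -> list of reference genes
    Returns (best_cell_type, scores), where scores are percent overlaps.
    """
    upregulated = set(upregulated_genes)
    scores = {}
    for cell_type, genes in gene_lists.items():
        reference = set(genes)
        if reference:
            scores[cell_type] = 100.0 * len(upregulated & reference) / len(reference)
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative marker lists only; these are not the lists used in the study.
example_lists = {
    "epithelial": ["EPCAM", "KRT8", "KRT18"],
    "fibroblast": ["COL1A1", "DCN", "LUM"],
}
best, scores = annotate_cluster(["EPCAM", "KRT8", "COL1A1"], example_lists)
```

Here the cluster shares two of three epithelial genes and one of three fibroblast genes, so it is labeled epithelial. A different denominator, such as the number of upregulated genes in the cluster, could change the ranking when reference lists differ greatly in length.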
The percent cell type similarity was calculated by dividing the number of cells with equivalent cell type annotation by the total number of cells. The adjusted Rand index (ARI) value was calculated by applying the adjustedRandIndex function from mclust[
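Both comparison metrics are simple to state in code. The Python sketch below computes the percent cell type similarity and the standard pair-counting form of the ARI from two label vectors. It is a stand-alone illustration rather than the mclust function, and the function names and example labels are hypothetical.

```python
from collections import Counter
from math import comb


def percent_cell_type_similarity(types_a, types_b):
    """Percent of cells given the same cell type annotation in both analyses."""
    same = sum(1 for a, b in zip(types_a, types_b) if a == b)
    return 100.0 * same / len(types_a)


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings of the same cells (pair-counting form)."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0
    return (together_both - expected) / (maximum - expected)


# The ARI ignores cluster numbering: a relabeled but identical partition scores 1.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
similarity = percent_cell_type_similarity(
    ["EPI", "FIB", "EPI", "MAC"], ["EPI", "FIB", "MAC", "MAC"]
)
```

The two metrics answer different questions. The ARI compares cluster membership regardless of cluster numbers or annotations, whereas percent similarity compares only the final cell type labels, so two solutions with different cluster numbers can still have high percent similarity.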
The Seurat R package is a popular scRNAseq analysis pipeline. The R package has incorporated functions to normalize and scale data, identify variable genes, perform dimensional reduction, and identify cell clusters based on gene expression similarities. The pipeline consists of loading UMI count data, performing normalization, identifying variable genes, scaling the data with an option to regress out variables, finding cell neighbors, and finding cell clusters
Seurat clustering pipeline with parameters that must be set at each step
To analyze the effects of the different methods and parameters on clustering results, we established a “baseline” analysis of an scRNAseq dataset we generated from an ovarian cancer tumor specimen. The dataset was produced using the 10× Genomics 3’ Chromium platform followed by sequencing on an Illumina NovaSeq. To establish the baseline analysis, we filtered out cells and genes based on an analysis of more than 50 similar datasets (
Next, we iteratively ran the Seurat pipeline 5,120 times testing all possible combinations of 8 PC dimensions, 8
UMAP projection of 4,344 cells from an ovarian cancer tumor specimen colored by cluster number (A); number of cells per cluster (B); and cells colored by cell type determined by analysis of upregulated genes in each cluster compared to gene lists of known cell types (C) (see Methods). Blue: cancer epithelial cells; red: fibroblasts; yellow: immune cells; green: endothelial cells. UMAP: uniform manifold approximation and projection
The Seurat package gives three options for normalizing data: natural log transformation using log1p (LogNormalize), relative counts (RC), and a centered log ratio transformation (CLR). Our baseline analysis used the natural log transformation. Each of these methods produces slightly different clustering solutions. Both the RC and CLR methods produced fewer clusters compared to the LogNormalize method (11 and 10
Cluster number (A, D); similarity by cell type (B, E); and ARI (C, F) when comparing baseline method (LogNormalize, Scale Factor 10,000) to the RC or CLR methods (A-C) or to scale factors of 1000 or 100,000 (D-F). ARI: adjusted Rand index
In addition to a normalization method, Seurat requires users to select a scale factor. Compared to our baseline scale factor value of 10,000, increasing the scale factor by a factor of 10 when normalizing the data resulted in loss of a cluster, while decreasing the scale factor by a factor of 10 generated an additional cluster when compared to the baseline
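To show where the scale factor enters, the following Python sketch normalizes the UMI counts of one cell using the relative counts and natural log (log1p) approaches described above. It is a simplified illustration and not the Seurat code: the function name `normalize_cell` and the toy counts are hypothetical, and the CLR method is omitted.

```python
import math


def normalize_cell(counts, method="LogNormalize", scale_factor=10_000):
    """Normalize the UMI counts of a single cell.

    RC           : count / total counts in the cell * scale_factor
    LogNormalize : log1p of the RC value
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    relative = [c / total * scale_factor for c in counts]
    if method == "RC":
        return relative
    if method == "LogNormalize":
        return [math.log1p(v) for v in relative]
    raise ValueError(f"unsupported method: {method}")


# Toy cell with UMI counts for three genes.
cell = [1, 0, 3]
rc = normalize_cell(cell, method="RC", scale_factor=100)
low = normalize_cell(cell, scale_factor=1_000)
high = normalize_cell(cell, scale_factor=100_000)
```

Because the logarithm is applied after adding 1, the scale factor is not a simple constant shift. With a small scale factor, the gap between zero and non-zero values shrinks relative to the gaps among non-zero values, which is one plausible reason why the choice of scale factor can change downstream clustering.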
Visualizing cell clusters using UMAP or TSNE allows clusters to be manually grouped with one another. The validity of this manual grouping is supported by the cell type annotations: clusters within a group are generally all assigned the same cell type annotation
UMAP plots of clustering solutions using different methods of normalization. LogNormalize (A, D baseline) was compared to RC (B, E) and CLR (C, F) normalization methods. Colored circles are manually placed around clusters that were annotated similarly. BCL_bc: B cells; END_en: endothelial cells; EPI_ep: epithelial cancer cells; FIB_fi: fibroblasts; MAC_ma: macrophages. UMAP: uniform manifold approximation and projection
An analysis of the clustering solutions produced using different scale factors also reveals subtle differences between them. For example, the additional cluster identified when the scale factor was reduced (Cluster 14,
UMAP plots of clustering solutions using scale factors of: 10,000 (baseline) (A, D); 1000 (B, E); and 100,000 (C, F). The cells are colored by: cluster (A-C); and cell type (D-F). Red circles and arrows indicate cluster gained when scale factor was reduced to 1000 (B, E) and cluster lost when increasing scale factor to 100,000 (C, F). Boxes below (D-F) show enlarged versions of the clusters within the red circles. UMAP: uniform manifold approximation and projection
Three methods are available for selecting variable genes in the Seurat pipeline (vst, mean.var.plot, and dispersion). Running our baseline analysis using the three different methods resulted in the following number of variable genes per method: vst (baseline) = 1,125, mean.var.plot = 818, and dispersion = 1,125. Of the total number of unique genes selected by all three selection methods (
Results of using different methods of determining variable genes. Venn diagram illustrating overlap of variable genes detected by three different methods (A) (vst, mean.var.plot, and dispersion). Bar graphs showing differences in: cluster number (B); percent of cells called concordantly (C); and the ARI measurement (D). Mean.var.plot and dispersion methods are compared to the baseline method, vst, in (C, D). ARI: adjusted Rand index
Clusters generated using three different lists of variable genes selected by the three methods (vst, mean.var.plot, and dispersion). Clusters were manually assigned to “superclusters” based on proximity in the TSNE plots. TSNE: t-distributed stochastic neighbor embedding
Because the vst method uses a local polynomial regression (loess), an input parameter required is the loess span. We compared our baseline analysis (loess span = 0.3) to loess span values of 0.1 and 0.5. There were almost no differences in cluster numbers, cell annotations, and ARI values when varying the loess span
The number of variable genes used will affect clustering. For our baseline analysis, we set the number of variable genes to be 6.7% of all genes detected. This number was chosen because it produced a list of ~1000 variable genes if 15,000 total genes are detected. There is no strong biological rationale, however, to choose this cut-off. We compared our baseline analysis (6.7%) to analyses using 1% and 20% of total genes. Our baseline analysis resulted in 1,125 variable genes, while the 1% and 20% cut-offs resulted in 168 and 3,358 variable genes, respectively. Reducing the variable gene list to 1% of total genes had a strong effect on clustering, with a large number of cells being placed in different clusters based on the ARI value
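The relationship between the percentage cut-off and the resulting gene count is simple arithmetic, sketched below in Python. The function name `n_variable_features` and the use of rounding are assumptions. The total of 16,790 detected genes is not stated in the text; it is only a back-calculated value that happens to reproduce all three reported gene counts.

```python
def n_variable_features(total_genes, fraction):
    """Number of variable genes for a given fraction of detected genes."""
    return round(total_genes * fraction)


# From the text: 6.7% of 15,000 detected genes gives roughly 1000 genes.
approx_target = n_variable_features(15_000, 0.067)

# Back-calculated, hypothetical total that matches the reported counts.
inferred_total = 16_790
reported = {
    fraction: n_variable_features(inferred_total, fraction)
    for fraction in (0.01, 0.067, 0.2)
}
```

Under this assumed total, the 1%, 6.7%, and 20% cut-offs give 168, 1,125, and 3,358 genes, matching the values reported above.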
Bar graphs showing differences when using variable gene lists of 1,125 (0.067 baseline) compared to 168 genes (0.01) and 3,358 genes (0.2) in: (A) cluster number; (B) percent of cells called concordantly; and (C) the ARI measurement. ARI: adjusted Rand index
Before scaling and centering data, it is often recommended to “regress out” variables that could affect data analysis. The Seurat pipeline allows users to regress out any variables. Our baseline analysis regressed out both the total UMI count and the percentage of total UMI counts attributed to the 13 mitochondrial genes. When we compared our baseline to analyses that regressed out these variables individually, or did not regress out any variables, we found that cluster solutions were very similar and 97% of cells were assigned the same cell type
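For readers unfamiliar with the term, “regressing out” a variable can be pictured as fitting a regression of each gene's expression on that variable and keeping the residuals. The Python sketch below does this for one gene and one covariate using ordinary least squares. It is a conceptual illustration that assumes a simple linear model, not the Seurat implementation, and the function name `regress_out` and the example values are hypothetical.

```python
def regress_out(expression, covariate):
    """Return residuals of a least-squares fit of expression on one covariate."""
    n = len(expression)
    mean_x = sum(covariate) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        # A constant covariate explains nothing; only the mean is removed.
        return [y - mean_y for y in expression]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, expression)]


# Hypothetical gene whose expression is fully explained by percent mito.
percent_mito = [1.0, 2.0, 3.0, 4.0]
explained = regress_out([3.0, 5.0, 7.0, 9.0], percent_mito)

# Hypothetical gene with variation unrelated to the covariate.
unrelated = regress_out([1.0, 4.0, 1.0, 4.0], percent_mito)
```

A gene that tracks the covariate exactly is reduced to residuals of zero, whereas variation unrelated to the covariate is retained. If the covariates explain little of the variance in the variable genes, regressing them out would be expected to change the clustering only slightly.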
Bar graphs showing differences when regressing out total UMI count (RNA) and percent of UMI count attributed to mitochondrial genes (% mito) in: (A) cluster number; (B) percent of cells called concordantly; and (C) the ARI measurement. Baseline analysis regressed both RNA and % mito (both). UMI: unique molecular identifier; ARI: adjusted Rand index
The ScaleData function also requires a maximum scale value, which defaults to 10 or 50 depending on the method used. For the baseline analysis, we used a maximum scale value of 50 and then compared it to maximum values of 10 or 100. We found almost equivalent clustering solutions using all three values
Identifying cell clusters is the main output of the Seurat pipeline. A large portion of subsequent downstream analysis will use these clusters to categorize cells and use their aggregated gene expression to make phenotype hypotheses and to make comparisons to other samples. Papers presenting analyses of scRNAseq datasets frequently report that “X” number of cell types were identified based on this clustering. We found, however, that the number of clusters identified can easily be changed by altering the parameters used for clustering. In this study, we interrogated the effects of altering many of these parameters and report which changes had a strong effect on the clustering solutions.
The Seurat R package, at its core, uses an SNN graph-based algorithm to identify cell clusters based on UMI count data. The most important parameters affecting the clustering solutions include the number of reduction dimensions to use, the k parameter, a prune parameter, and a resolution parameter. Altering these values generated clustering solutions ranging from 6 to 27 clusters
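The iterative strategy of running every combination of these four parameters and taking the most frequently obtained cluster number can be outlined as below. This Python sketch is not the seurat_pkpr_loop.R script: the function names, the small parameter grid, and the `toy_cluster_fn` stand-in (which replaces an actual clustering run) are all hypothetical.

```python
from collections import Counter
from itertools import product


def sweep(grid, cluster_fn):
    """Run cluster_fn on every parameter combination in grid.

    grid       : dict mapping parameter name -> list of values to test
    cluster_fn : callable returning the number of clusters for one setting
    """
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, cluster_fn(**params)))
    return results


def most_frequent_cluster_count(results):
    """Return (cluster_count, number_of_runs) for the most common solution."""
    return Counter(n for _, n in results).most_common(1)[0]


def toy_cluster_fn(dims, k, prune, resolution):
    """Stand-in for a real clustering run; returns an arbitrary cluster count."""
    return 10 + int(resolution > 1.0) - int(k > 30)


# Hypothetical grid; the values are placeholders, not those used in the study.
grid = {
    "dims": [10, 20],
    "k": [20, 40],
    "prune": [0.05],
    "resolution": [0.5, 1.5],
}
results = sweep(grid, toy_cluster_fn)
consensus = most_frequent_cluster_count(results)
```

The number of runs is the product of the number of values tested for each parameter, so the grid grows quickly. Recording the full parameter set alongside each result makes it possible to retrieve a representative setting that yields the consensus cluster number.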
Altering other parameters will also have subtle effects on the clustering solutions produced. We found that the normalization method
Increasing the number of variable genes to use, from ~1000 to ~3000, did not affect the clustering solution as much as lowering the number of variable genes from ~1000 to ~170
In conclusion, biologists will always be dependent on statisticians and bioinformaticians to analyze the large datasets being generated by rapidly advancing technologies. Relying on default parameters built into the analysis packages, however, could result in analyses that do not reveal the true biological attributes of the sample being studied. An analogy could be made to performing a cell culture experiment and altering variables such as the timing of measurements, temperature, or the media being used before making conclusions about the cell properties. We recommend that computational “replicates” be conducted when using R packages such as Seurat to analyze biological datasets by varying the input parameters with each replicate. This is especially important when there is not a strong biological rationale for setting a given parameter.
The authors would like to acknowledge the assistance received from Joshua Baller, Ying Zhang, Christine Henzler and Marissa Macchiatto at the Minnesota Supercomputing Institute at the University of Minnesota. The authors also are grateful to John Garbe, Emma Stanley and Jerry Daniels at the University of Minnesota Genomics Center for their assistance in generating the scRNAseq data.
Made substantial contributions to conception and design of the study: Wang J, Nelson AC, Winterhoff B
Made substantial contributions to design of the study, performed data acquisition and data analysis: Cepela J, Shetty M
Made substantial contributions to all aspects of the study, including writing the manuscript: Schneider I, Starr TK
All data, including raw UMI counts, R scripts, and R script output files are found in the supplemental files.
This work was supported by a grant from the Ovarian Cancer Research Alliance (Liz Tiberius grant to Winterhoff B), a grant from the University of Minnesota Grand Challenges project (to Winterhoff B, Nelson AC and Starr TK), and a grant from the University of Minnesota Masonic Cancer Center (to Starr TK).
All authors declared that there are no conflicts of interest.
The study was approved by the University of Minnesota’s Institutional Review Board (#2018NTLS170). Informed consent was obtained from the patient.
© The Author(s) 2021.
Supplementary Materials