Creates an object ready for use in the scX app

createSCEobject creates the input object (a List) for the function launch_scX. This list includes a SingleCellExperiment object with normalized expression, reduced dimensions for visualization, and any additional data provided. Also, this object includes identified gene markers and differential expression analysis for each user-specified partition (if no partition was selected, a quick clusterization is computed and considered)

Usage

createSCEobject(
  xx,
  assay.name.raw = "counts",
  assay.name.normalization = "logcounts",
  metadata = NULL,
  partitionVars = NULL,
  metadataVars = NULL,
  chosen.hvg = NULL,
  nHVGs = 3000,
  nPCs = 50,
  calcRedDim = TRUE,
  markerList = NULL,
  paramFindMarkers = list(test.type = "wilcox", pval.type = "all", direction = "up"),
  BPPARAM = BiocParallel::SerialParam(),
  scx.clust = FALSE,
  scx.graph.k = 20,
  scx.graph.cluster.method = "louvain",
  BLUSPARAM = NULL,
  minSize = 30,
  calcAllPartitions = FALSE,
  cells2keep = NULL,
  nSubCells = 50000,
  descriptionText = NULL,
  verbose = TRUE
)

Arguments

xx: Either a numeric count matrix object (with genes as rows and cells as columns), a SingleCellExperiment object, a Seurat object or the directory containing the matrix.mtx, features.tsv, barcodes.tsv provided by CellRanger.
assay.name.raw: Assay name for the raw counts matrix if the input object is a SingleCellExperiment. Defaults to counts.
assay.name.normalization: Assay name for the normalized matrix if present in the SingleCellExperiment. If not present, it computes logcounts by default.
metadata: (Optional) A DataFrame containing cell metadata. The row names of the metadata data frame must include all cell names that appear as columns in xx.
partitionVars: (Optional) Names of metadata or colData columns to be used for gene markers and differential expression analysis. If set to NULL (default), a quick clustering step will be performed using quickCluster from the scran package.
metadataVars: (Optional) Names of additional metadata or colData columns to be used for coloring in plots. If set to NULL (default), only partitionVars columns will be available for coloring plots.
chosen.hvg: (Optional) A list of Highly Variable Genes. NOTE: If chosen.hvg=NULL and xx is a Seurat object with computed VariableFeatures, then this parameter will be set to that list of genes.
nHVGs: Number of Highly Variable Genes to use if chosen.hvg=NULL. Defaults to 3000.
nPCs: Number of Principal Components to use in PCA. Defaults to 50.
calcRedDim: Logical. Indicates whether to compute reduced dimensions (PCA, UMAP, t-SNE, UMAP2D, t-SNE2D). Defaults to TRUE. (Note: If set to FALSE, but there are no 2D and >3D reduced dimensions provided in the input, PCA embedding will be estimated anyway).
markerList: (Optional) A DataFrame with three columns named: 'Partition', 'Cluster', 'Gene'. It will be used to subset the cluster marker list shown in cluster markers in Markers tab in the scX app.
paramFindMarkers: A named list of parameters to pass to findMarkers to compute cluster gene markers. Defaults to list(test.type="wilcox", pval.type="all", direction="up").
BPPARAM: A BiocParallelParam object indicating whether and how parallelization should be performed across genes in the findMarkers function.
scx.clust: Logical. Indicates whether to perform a community detection algorithm on the data. Defaults to FALSE. (Note: If set to FALSE, but there is no partitionVars to compute marker analysis, it will set to TRUE).
scx.graph.k: An integer scalar specifying the number of nearest neighbors to consider during graph construction. Defaults to 20
scx.graph.cluster.method: Function specifying the method to use to detect communities in the shared NN graph. Alternatively, this may be a string containing the suffix of any igraph community detection algorithm (for example: "walktrap" will use cluster_walktrap). Defaults to "louvain".
BLUSPARAM: A BlusterParam object specifying the clustering algorithm to use, defaults to a graph-based method. If not NULL this parameter will be used instead scx.graph.k and scx.graph.cluster.method.
minSize: Numeric. The minimum cluster size for calculating gene marker statistics.
calcAllPartitions: Logical. Defaults to FALSE, which means that only partitions from partitionVars with 30 or fewer levels will be considered for marker and DEG calculations. If set to TRUE, it forces the computation of markers and DEGs for the entire list of partitionVars.
cells2keep: (Optional) A list of cell names to keep in case of subsampling. NOTE: Subsampling is only activated for visualization purposes in the case of large datasets; it is not used for computations. Only nSubCells cells will be used for visualization in the app, and their indexes are stored in the CELLS2KEEP element of the CSEO object.
nSubCells: Numeric. The maximum number of cells for randomly subsampling the data set (just for visualization purposes).
descriptionText: (Optional) A short description of the object being analyzed. This text will be displayed in the Summary module of the scX app.
verbose: Logical. Indicates whether to show step-by-step status while the function is running. Defaults to TRUE.

Value

A named List that serves as input for launch_scX, which contains the following fields:

SCE:: A SingleCellExperiment object with computed normalized expression and dimensional reduction embeddings (PCA, UMAP, t-SNE, in 2D & 3D). These are calculated using the list of chosen.hvg if not NULL, or the top nHVGs. colData contains the partitionVArs and metadataVars if they were specified; otherwise, only a quick clusterization will be available for preliminary analysis of the data.
sce.degs:: A named List of DataFrames where each DataFrame contains the consolidated marker statistics for each gene (row) for the cluster of the same name. Statistics are computed using findMarkers, and the user can choose the test.type parameters to pass to that function. See combineMarkers for the details of how these dataframes are created.
sce.markers:: A List of named Lists of DataFrames. Each one corresponds to the marker genes of every cluster in a partition (names of the nested lists). summary.stats: AUC if test.type=="wilcox" and -log.FC for test.type=="t" or test.type=="binom". log.FDR: -Log.FDR of the most appropriate inter-cluster comparison according to the selected p-value type. See findMarkers for the details of how these metrics are computed. boxcor: Correlation scores between the normalized gene expression profiles and a binary vector of cells, in which cells of the selected cluster have a value of 1.
(optional) text:: String if descriptionText not NULL.
CELLS2KEEP:: Numeric, the indices of the selected cells chosen for visualization in the scX application. (Alternatively) Character, if 'all' no subsampling will be performed for visualization purposes.
call:: A named list containing all parameters and their values passed to the createSCEobject call.
usage:: A named list containing all parameters and their corresponding values that were finally utilized by createSCEobject during preprocessing.

Details

This function handles the basic preprocessing steps for sc/sn-RNAseq data. It leverages functionality implemented in the scran, scater, and SingleCellExperiment packages. The steps include: converting the input to a SingleCellExperiment object, estimating QC metrics, normalizing the expression matrix, identifying highly variable genes, and computing embeddings for various dimensionality reduction techniques. If no cell partition is provided, a default clustering step is conducted to investigate marker genes and differential expression patterns. For datasets with more than 50k cells, subsampling of the SingleCellExperiment object is suggested to improve smooth interactive visualizations in the scX app. A detailed discussion of the employed preprocessing pipeline can be found in the OSCA book.

Partitions

It is recommended to include a curated partition of the data in the input object or in the metadata. Marker genes and differential expression analysis will be automatically computed for partitions with fewer than 31 levels specified as partitionVars. If the user wishes to run the calculations for partitions with more than 30 levels, calcAllPartitions must be set to TRUE (Please note that this step could be very time-consuming). You may want to color some partitions in plots without having to compute marker genes and DEGs analyses for them. To do this, pass those partitions as metadataVars.

Metadata

Metadata is passed to createSCEobject as a DataFrame where the rows must include all the cells present in the XX input. Metadata columns represent cell covariates. All character or numeric covariates passed in metadataVars with fewer than 31 unique values are set to factors and are referred to as "Categories" in the scX app. Numeric covariates with more than 30 unique values are called "Fields."

Normalization

If there is no normalized assay in the SingleCellExperiment object or Seurat input object a Normalization by deconvolution is performed as proposed in the OSCA book. First, we calculate clusters using the Walktrap community detection algorithm for graph-based clustering (default parameters from quickCluster). Then we compute scale factors for the cells using those clusters. Finally, we calculate the lognormalized expression matrix by applying a log2 transformation to the product of the raw matrix and scale factors considering a pseudocounts of 1.

Highly Variable Genes

If chosen.hvg is not specified, we will use modelGeneVar to calculate the variance and mean of the lognormalized expression values. By fitting a trend of the variance against the mean, a biological component of variation for each gene can be assigned as the residual from the trend (see scran documentation for more details). We consider the top nHVGs most variable genes with a biological component greater than 0.

Reduced Dimensional Analysis Techniques

scX is a tool that helps visualize the data properly by running several dimensionality reduction analysis techniques (PCA, UMAP2D, TSNE2D, UMAP3D, TSNE3D). To use the app, xx must have at least a 2D dimensional reduction embedding. If xx already has dimensionality reduction embeddings calculated, you can set calcRedDim to FALSE. NOTE: If set to FALSE, but there are no dimension reductions calculated in the SingleCellExperiment object, or if they all have less than 4 dimensions with none of them having 2D, a PCA will be calculated. Principal Component Analysis is calculated with runPCA using the chosen.hvg and retaining the first nPCs components for the normalized expression matrix. TSNE and UMAP are calculated with runTSNE and runUMAP functions using the PCA matrix.

Clustering

If partitionVars is user-specified, those categories are used for clustering the data. If partitionVars is NULL, none of the user-specified categories pass the filter (see "Partitions" for details) or scx.clust is set to TRUE a clustering performed with a graph-based community detection algorithm will be included in partitionVars to identify markers and DEGs. By default it computes cluster_louvain on a shared nearest neighbors graph with k=20. Both the clustering algorithm and the number of first neighbors can be changed using the scx.graph.cluster.method and scx.graph.k parameters. Also, the user can specify any BlusterParam object through BBLUSPARAM that will be passed to clusterCells to compute clusters.

Marker Genes Analysis

createSCEobject computes statistics to identify marker genes for every cluster for all partitions in partitionVars using findMarkers functions (with parameters specified by the user in paramFindMarkers). Only genes with FDR < 0.05 are selected for each cluster. The boxcor score is calculated for those genes as follows:

boxcor:: The boxcor is the correlation between a gene's expression vector (logcounts) and a binary vector, where only the cells from the selected cluster mark 1 while the rest of the cells mark 0.

The user can provide a DataFrame containing marker genes for each cluster in any partition through markerList. For each partition in markerList specified in partitionVars, the calculated marker statistics for each cluster will be subset to the genes in markerList. Note that the statistics were calculated using all genes.

Differential Expression Analysis

We use the findMarkers function to identify DEGs between clusters specified in partitionVars (direction="any",pval.type="all"). The test.type can be specified by the user in paramFindMarkers$test.type.

Subsampling cells

If the SingleCellExperiment object contains over nSubCells cells (50k by default), a random sample of that size will be chosen for visualization purposes in the application. Cell names that the user wants to keep in the visualizations can be specified in the cells2keep parameter. Please note that it is solely for producing efficient visualizations.

Author

Tomás Vega Waichman, Maria Luz Vercesi, Ariel A. Berardino, Maximiliano S. Beckel, Chernomoretz Lab and Collaborators

Examples

if (FALSE) {
# Quick start guide example:
library(scX)
cseo <- createSCEobject(xx = scXample, 
                        partitionVars = "inferred_cell_type", 
                        metadataVars = c("source_name", "age", "sex", "strain", "treatment", "pseudotime"),
                        descriptionText = "Quick Start Guide")
launch_scX(cseo)
}