Creates an object ready for use in the scX app
createSCEobject.Rd
createSCEobject
creates the input object (a List) for the function launch_scX
. This list includes a SingleCellExperiment object with normalized expression,
reduced dimensions for visualization, and any additional data provided. Also, this object includes identified gene markers and differential expression analysis for each user-specified partition (if no partition was selected, a quick clusterization is computed and considered)
Usage
createSCEobject(
xx,
assay.name.raw = "counts",
assay.name.normalization = "logcounts",
metadata = NULL,
partitionVars = NULL,
metadataVars = NULL,
chosen.hvg = NULL,
nHVGs = 3000,
nPCs = 50,
calcRedDim = TRUE,
markerList = NULL,
paramFindMarkers = list(test.type = "wilcox", pval.type = "all", direction = "up"),
BPPARAM = BiocParallel::SerialParam(),
scx.clust = FALSE,
scx.graph.k = 20,
scx.graph.cluster.method = "louvain",
BLUSPARAM = NULL,
minSize = 30,
calcAllPartitions = FALSE,
cells2keep = NULL,
nSubCells = 50000,
descriptionText = NULL,
verbose = TRUE
)
Arguments
- xx
Either a numeric count matrix object (with genes as rows and cells as columns), a SingleCellExperiment object, a Seurat object or the directory containing the matrix.mtx, features.tsv, barcodes.tsv provided by CellRanger.
- assay.name.raw
Assay name for the raw counts matrix if the input object is a SingleCellExperiment. Defaults to
counts
.- assay.name.normalization
Assay name for the normalized matrix if present in the SingleCellExperiment. If not present, it computes
logcounts
by default.- metadata
(Optional) A DataFrame containing cell metadata. The row names of the metadata data frame must include all cell names that appear as columns in
xx
.- partitionVars
(Optional) Names of metadata or
colData
columns to be used for gene markers and differential expression analysis. If set toNULL
(default), a quick clustering step will be performed usingquickCluster
from the scran package.- metadataVars
(Optional) Names of additional metadata or
colData
columns to be used for coloring in plots. If set toNULL
(default), onlypartitionVars
columns will be available for coloring plots.- chosen.hvg
(Optional) A list of Highly Variable Genes. NOTE: If
chosen.hvg=NULL
andxx
is a Seurat object with computedVariableFeatures
, then this parameter will be set to that list of genes.- nHVGs
Number of Highly Variable Genes to use if
chosen.hvg=NULL
. Defaults to 3000.- nPCs
Number of Principal Components to use in PCA. Defaults to 50.
- calcRedDim
Logical. Indicates whether to compute reduced dimensions (PCA, UMAP, t-SNE, UMAP2D, t-SNE2D). Defaults to
TRUE
. (Note: If set toFALSE
, but there are no 2D and >3D reduced dimensions provided in the input, PCA embedding will be estimated anyway).- markerList
(Optional) A DataFrame with three columns named: 'Partition', 'Cluster', 'Gene'. It will be used to subset the cluster marker list shown in cluster markers in Markers tab in the scX app.
- paramFindMarkers
A named list of parameters to pass to
findMarkers
to compute cluster gene markers. Defaults tolist(test.type="wilcox", pval.type="all", direction="up")
.- BPPARAM
A BiocParallelParam object indicating whether and how parallelization should be performed across genes in the
findMarkers
function.- scx.clust
Logical. Indicates whether to perform a community detection algorithm on the data. Defaults to
FALSE
. (Note: If set toFALSE
, but there is nopartitionVars
to compute marker analysis, it will set toTRUE
).- scx.graph.k
An integer scalar specifying the number of nearest neighbors to consider during graph construction. Defaults to 20
- scx.graph.cluster.method
Function specifying the method to use to detect communities in the shared NN graph. Alternatively, this may be a string containing the suffix of any igraph community detection algorithm (for example: "walktrap" will use
cluster_walktrap
). Defaults to "louvain".- BLUSPARAM
A BlusterParam object specifying the clustering algorithm to use, defaults to a graph-based method. If not
NULL
this parameter will be used insteadscx.graph.k
andscx.graph.cluster.method
.- minSize
Numeric. The minimum cluster size for calculating gene marker statistics.
- calcAllPartitions
Logical. Defaults to
FALSE
, which means that only partitions frompartitionVars
with 30 or fewer levels will be considered for marker and DEG calculations. If set to TRUE, it forces the computation of markers and DEGs for the entire list ofpartitionVars
.- cells2keep
(Optional) A list of cell names to keep in case of subsampling. NOTE: Subsampling is only activated for visualization purposes in the case of large datasets; it is not used for computations. Only
nSubCells
cells will be used for visualization in the app, and their indexes are stored in theCELLS2KEEP
element of the CSEO object.- nSubCells
Numeric. The maximum number of cells for randomly subsampling the data set (just for visualization purposes).
- descriptionText
(Optional) A short description of the object being analyzed. This text will be displayed in the Summary module of the
scX
app.- verbose
Logical. Indicates whether to show step-by-step status while the function is running. Defaults to
TRUE
.
Value
A named List that serves as input for launch_scX
, which contains the following fields:
SCE
:A SingleCellExperiment object with computed normalized expression and dimensional reduction embeddings (PCA, UMAP, t-SNE, in 2D & 3D). These are calculated using the list of
chosen.hvg
if notNULL
, or the topnHVGs
.colData
contains thepartitionVArs
andmetadataVars
if they were specified; otherwise, only a quick clusterization will be available for preliminary analysis of the data.sce.degs
:A named List of DataFrames where each DataFrame contains the consolidated marker statistics for each gene (row) for the cluster of the same name. Statistics are computed using
findMarkers
, and the user can choose thetest.type
parameters to pass to that function. SeecombineMarkers
for the details of how these dataframes are created.sce.markers
:A List of named Lists of DataFrames. Each one corresponds to the marker genes of every cluster in a partition (names of the nested lists).
summary.stats:
AUC iftest.type=="wilcox"
and -log.FC fortest.type=="t"
ortest.type=="binom"
.log.FDR:
-Log.FDR of the most appropriate inter-cluster comparison according to the selected p-value type. SeefindMarkers
for the details of how these metrics are computed.boxcor:
Correlation scores between the normalized gene expression profiles and a binary vector of cells, in which cells of the selected cluster have a value of 1.- (optional)
text
: String if
descriptionText
notNULL
.CELLS2KEEP
:Numeric, the indices of the selected cells chosen for visualization in the
scX
application. (Alternatively) Character, if 'all' no subsampling will be performed for visualization purposes.call
:A named list containing all parameters and their values passed to the
createSCEobject
call.usage
:A named list containing all parameters and their corresponding values that were finally utilized by
createSCEobject
during preprocessing.
Details
This function handles the basic preprocessing steps for sc/sn-RNAseq data. It leverages functionality implemented in the scran, scater, and SingleCellExperiment packages. The steps include: converting the input to a SingleCellExperiment object, estimating QC metrics, normalizing the expression matrix, identifying highly variable genes, and computing embeddings for various dimensionality reduction techniques. If no cell partition is provided, a default clustering step is conducted to investigate marker genes and differential expression patterns. For datasets with more than 50k cells, subsampling of the SingleCellExperiment object is suggested to improve smooth interactive visualizations in the scX app. A detailed discussion of the employed preprocessing pipeline can be found in the OSCA book.
Partitions
It is recommended to include a curated partition of the data in the input object or in the metadata
.
Marker genes and differential expression analysis will be automatically computed for partitions with fewer than 31 levels specified as partitionVars
.
If the user wishes to run the calculations for partitions with more than 30 levels, calcAllPartitions
must be set to TRUE
(Please note that this step could be very time-consuming).
You may want to color some partitions in plots without having to compute marker genes and DEGs analyses for them. To do this, pass those partitions as metadataVars
.
Metadata
Metadata is passed to createSCEobject as a DataFrame where the rows must include all the cells present in the XX
input. Metadata columns represent cell covariates. All character or numeric covariates passed in metadataVars
with fewer than 31 unique values are set to factors and are referred to as "Categories" in the scX app. Numeric covariates with more than 30 unique values are called "Fields."
Normalization
If there is no normalized assay in the SingleCellExperiment object or Seurat input object
a Normalization by deconvolution is performed as proposed in the OSCA book.
First, we calculate clusters using the Walktrap community detection algorithm for graph-based clustering (default parameters from quickCluster
).
Then we compute scale factors for the cells using those clusters.
Finally, we calculate the lognormalized expression matrix by applying a log2 transformation to the product of the raw matrix and scale factors considering a pseudocounts of 1.
Highly Variable Genes
If chosen.hvg
is not specified, we will use modelGeneVar
to calculate the variance and mean of the lognormalized expression values.
By fitting a trend of the variance against the mean, a biological component of variation for each gene can be assigned as the residual from the trend (see scran documentation for more details).
We consider the top nHVGs
most variable genes with a biological component greater than 0.
Reduced Dimensional Analysis Techniques
scX is a tool that helps visualize the data properly by running several dimensionality reduction analysis techniques (PCA, UMAP2D, TSNE2D, UMAP3D, TSNE3D). To use the app, xx
must have at least a 2D dimensional reduction embedding.
If xx
already has dimensionality reduction embeddings calculated, you can set calcRedDim
to FALSE
. NOTE: If set to FALSE
, but there are no dimension reductions calculated in the SingleCellExperiment object, or if they all have less than 4 dimensions with none of them having 2D, a PCA will be calculated.
Principal Component Analysis is calculated with runPCA
using the chosen.hvg
and retaining the first nPCs
components for the normalized expression matrix.
TSNE and UMAP are calculated with runTSNE
and runUMAP
functions using the PCA matrix.
Clustering
If partitionVars
is user-specified, those categories are used for clustering the data.
If partitionVars
is NULL
, none of the user-specified categories pass the filter (see "Partitions" for details) or scx.clust
is set to TRUE
a clustering performed with a graph-based community detection algorithm will be included in partitionVars to identify markers and DEGs. By default it computes cluster_louvain
on a shared nearest neighbors graph with k=20
. Both the clustering algorithm and the number of first neighbors can be changed using the scx.graph.cluster.method
and scx.graph.k
parameters. Also, the user can specify any BlusterParam object through BBLUSPARAM
that will be passed to clusterCells
to compute clusters.
Marker Genes Analysis
createSCEobject computes statistics to identify marker genes for every cluster
for all partitions in partitionVars
using findMarkers
functions (with parameters specified by the user in paramFindMarkers
).
Only genes with FDR < 0.05 are selected for each cluster. The boxcor score is calculated for those genes as follows:
boxcor
:The boxcor is the correlation between a gene's expression vector (logcounts) and a binary vector, where only the cells from the selected cluster mark 1 while the rest of the cells mark 0.
The user can provide a DataFrame containing marker genes for each cluster in any partition through markerList
. For each partition in markerList
specified in partitionVars
, the calculated marker statistics for each cluster will be subset to the genes in markerList
. Note that the statistics were calculated using all genes.
Differential Expression Analysis
We use the findMarkers
function to identify DEGs between clusters specified in partitionVars
(direction="any"
,pval.type="all"
).
The test.type can be specified by the user in paramFindMarkers$test.type
.
Subsampling cells
If the SingleCellExperiment object contains over nSubCells
cells (50k by default), a random sample of that size will be chosen for visualization purposes in the application.
Cell names that the user wants to keep in the visualizations can be specified in the cells2keep
parameter.
Please note that it is solely for producing efficient visualizations.
See also
Related functions from scran and scater packages as suggested in the OSCA book for:
Preprocessing steps:
quickCluster
,computeSumFactors
andlogNormCounts
.HVGs:
modelGeneVar
.Dimensionality Reduction Techniques:
runPCA
,runTSNE
andrunUMAP
.Clustering:
clusterCells
, BlusterParamMarker genes and DEGs analyses:
findMarkers
,combineMarkers
.
Author
Tomás Vega Waichman, Maria Luz Vercesi, Ariel A. Berardino, Maximiliano S. Beckel, Chernomoretz Lab and Collaborators
Examples
if (FALSE) {
# Quick start guide example:
library(scX)
cseo <- createSCEobject(xx = scXample,
partitionVars = "inferred_cell_type",
metadataVars = c("source_name", "age", "sex", "strain", "treatment", "pseudotime"),
descriptionText = "Quick Start Guide")
launch_scX(cseo)
}