This vignette demonstrates the single-cell RNA-seq annotation workflow provided by the easybio package. The workflow combines the speed of automated database matching with the reliability of interactive verification and manual curation.
The core workflow follows three logical steps:
1. Use matchCellMarker2() to quickly get a list of potential cell types for each cluster based on its marker genes.
2. Use check_marker() and plotSeuratDot() to build confidence in the annotations. This step helps answer two questions: which of the cluster's own marker genes support a suggested cell type, and are that cell type's canonical markers expressed in the cluster?
3. Use finsert() to assign the final, high-confidence cell type labels.
You can also view the R script for this workflow by running:
fs::file_show(system.file(package = 'easybio', 'example-single-cell.R'))
First, let’s load the necessary libraries and the example marker data included with easybio. This data is derived from the 10x Genomics 3k PBMC dataset.
litedown::reactor(warning = FALSE) # vignette setting
library(easybio)
library(Seurat)
library(data.table)
# The pbmc.markers dataset is included in easybio
head(pbmc.markers)
| | p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
|---|---|---|---|---|---|---|---|
| RPS12 | 0 | 0.739 | 1.000 | 0.991 | 0 | 0 | RPS12 |
| RPS6 | 0 | 0.693 | 1.000 | 0.995 | 0 | 0 | RPS6 |
| RPS27 | 0 | 0.737 | 0.999 | 0.992 | 0 | 0 | RPS27 |
| RPL32 | 0 | 0.627 | 0.999 | 0.995 | 0 | 0 | RPL32 |
| RPS14 | 0 | 0.634 | 1.000 | 0.994 | 0 | 0 | RPS14 |
| RPS25 | 0 | 0.769 | 0.997 | 0.975 | 0 | 0 | RPS25 |
matchCellMarker2
We begin by feeding the cluster markers (from Seurat::FindAllMarkers) into matchCellMarker2(). This function compares our markers against the CellMarker2.0 database and returns a ranked list of potential cell types for each cluster.
marker_matched <- matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human")
# Let's look at the top 2 potential cell types for each cluster
marker_matched[, head(.SD, 2), by = cluster]
cluster | cell_name | uniqueN | N | ordered_symbol | orderN |
---|---|---|---|---|---|
0 | Naive CD8+ T cell | 6 | 34 | CCR7,LEF1,CD8B,MAL,NELL2,TSHZ2 | 14,12, 2, 2, 2, 2 |
0 | Naive T(Th0) cell | 3 | 32 | CCR7,LEF1,LRRN3 | 23, 8, 1 |
1 | Monocyte | 9 | 133 | CD14,S100A8,S100A9,S100A12,FCGR1A,MS4A6A,… | 82,22,15, 5, 4, 2,… |
1 | Macrophage | 8 | 63 | CD14,FCGR1A,CCL2,PLA2G7,RNASE1,S100A8,… | 46, 6, 2, 2, 2, 2,… |
2 | Regulatory T(Treg) cell | 11 | 148 | FOXP3,IL2RA,CTLA4,TNFRSF4,TNFRSF18,ICOS,… | 55,45,22, 7, 6, 4,… |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
6 | Cytotoxic T cell | 4 | 24 | PRF1,GZMB,GNLY,FGFBP2 | 9,8,6,1 |
7 | Plasmacytoid dendritic cell(pDC) | 8 | 42 | CLEC4C,LILRA4,SCT,LAMP5,LRRC26,SERPINF1,… | 19,16, 2, 1, 1, 1,… |
7 | Dendritic cell | 6 | 38 | FCER1A,CLEC10A,LILRA4,FLT3,CD1E,CLEC4C | 16,11, 4, 3, 2, 2 |
8 | Megakaryocyte | 9 | 52 | PPBP,PF4,ITGA2B,GP9,MYL9,TUBB1,… | 15,12, 9, 4, 4, 3,… |
8 | Endothelial cell | 6 | 41 | CLDN5,ESAM,GNG11,LCN2,SERPINE1,SPARC | 36, 1, 1, 1, 1, 1 |
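The output table reports uniqueN (the number of unique matching markers) and N (the total number of matches), which help rank the potential annotations. In the rows shown in full above, N equals the sum of the per-marker counts in orderN; for example, 14 + 12 + 2 + 2 + 2 + 2 = 34 for the first row.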
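To make the two counts concrete, here is a minimal base-R sketch of how uniqueN and N could be derived from a reference table that holds one row per database record. The toy reference, the variable names, and the ordering by uniqueN and then N are all assumptions made for illustration; the actual matchCellMarker2() implementation may differ.

```r
# Toy reference (hypothetical): one row per database record,
# so the same marker can appear several times for one cell type.
ref <- data.frame(
  cell_name = c("Monocyte", "Monocyte", "Monocyte", "Macrophage", "Macrophage", "B cell"),
  marker    = c("CD14", "CD14", "S100A8", "CD14", "C1QA", "MS4A1"),
  stringsAsFactors = FALSE
)

# Marker genes of one cluster (hypothetical)
cluster_markers <- c("CD14", "S100A8", "LYZ")

# Keep only the reference records whose marker is among the cluster markers
hits <- ref[ref$marker %in% cluster_markers, ]

# N: total number of matching records per cell type
N <- tapply(hits$marker, hits$cell_name, length)
# uniqueN: number of distinct matching markers per cell type
uniqueN <- tapply(hits$marker, hits$cell_name, function(x) length(unique(x)))

summary_tbl <- data.frame(
  cell_name = names(N),
  uniqueN   = as.integer(uniqueN),
  N         = as.integer(N),
  stringsAsFactors = FALSE
)

# Assumed ranking: more distinct markers first, ties broken by total matches
summary_tbl <- summary_tbl[order(-summary_tbl$uniqueN, -summary_tbl$N), ]
print(summary_tbl)
```

A cell type supported by many distinct markers is usually a stronger candidate than one supported by a single marker that happens to be recorded many times, which is why both counts are worth reading together.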
We can create a quick preliminary annotation by taking the top hit for each cluster.
cl2cell_auto <- marker_matched[, head(.SD, 1), by = .(cluster)]
cl2cell_auto <- setNames(cl2cell_auto[["cell_name"]], cl2cell_auto[["cluster"]])
print("Initial automated annotation:")
#> [1] "Initial automated annotation:"
cl2cell_auto
#> 0 1
#> "Naive CD8+ T cell" "Monocyte"
#> 2 3
#> "Regulatory T(Treg) cell" "B cell"
#> 4 5
#> "T cell" "Macrophage"
#> 6 7
#> "Natural killer cell" "Plasmacytoid dendritic cell(pDC)"
#> 8
#> "Megakaryocyte"
We can also get a global view of all possible annotations using plotPossibleCell.
plotPossibleCell(marker_matched[, head(.SD), by = .(cluster)], min.uniqueN = 2)
This is the most critical step. Instead of blindly trusting the automated result, we use easybio’s tools to verify it.
To see the evidence behind an annotation, we use check_marker() with cis = TRUE. This shows which marker genes from our own data matched the database for a given annotation.
# Let's investigate clusters 1, 5, and 7
local_evidence <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = TRUE)
print(local_evidence)
#> $Monocyte
#> [1] "CD14" "S100A8" "S100A9" "S100A12" "FCGR1A" "MS4A6A" "CCL2"
#> [8] "CD93" "MPO"
#>
#> $Macrophage
#> [1] "CD14" "FCGR1A" "CCL2" "PLA2G7" "RNASE1" "S100A8" "S100A9" "MS4A6A"
#>
#> $Macrophage
#> [1] "C1QA" "C1QB" "MS4A7" "MS4A4A"
#>
#> $Monocyte
#> [1] "MS4A7" "C1QB" "C1QA"
#>
#> $`Plasmacytoid dendritic cell(pDC)`
#> [1] "CLEC4C" "LILRA4" "SCT" "LAMP5" "LRRC26" "SERPINF1" "SMPD3"
#> [8] "TNFRSF21"
#>
#> $`Dendritic cell`
#> [1] "FCER1A" "CLEC10A" "LILRA4" "FLT3" "CD1E" "CLEC4C"
#>
To validate an annotation, we use check_marker() with cis = FALSE (the default). This fetches the canonical markers for the suggested cell type from the database. We can then check whether these well-known markers are expressed in our cluster.
canonical_markers <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = FALSE)
print(canonical_markers)
#> $Monocyte
#> [1] "CD14" "FCGR3A" "LYZ" "S100A8" "FCN1" "S100A9" "CD68" "VCAN"
#> [9] "IL1B" "MS4A7"
#>
#> $Macrophage
#> [1] "CD68" "CD14" "CD163" "CSF1R" "AIF1" "APOE" "MRC1" "FCGR3A"
#> [9] "C1QA" "SPP1"
#>
#> $`Plasmacytoid dendritic cell(pDC)`
#> [1] "IL3RA" "CLEC4C" "LILRA4" "JCHAIN" "TCF4" "GZMB" "IRF8" "IRF7"
#> [9] "ITM2C" "BCL11A"
#>
#> $`Dendritic cell`
#> [1] "CD1C" "FCER1A" "CLEC10A" "CD11C" "IL3RA" "HLA-DRA" "CLEC9A"
#> [8] "LAMP3" "CD1A" "ITGAX"
#>
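As a quick numeric complement to the two outputs above, the sketch below compares the Monocyte markers matched for cluster 1 (from the cis = TRUE output) with the canonical Monocyte markers (from the cis = FALSE output). The gene vectors are copied by hand from those outputs and the variable names are hypothetical. A canonical marker that is absent from the matched list is not necessarily unexpressed, since only the top markers of each cluster were submitted for matching.

```r
# Markers matched for cluster 1 / Monocyte (copied from the cis = TRUE output)
matched <- c("CD14", "S100A8", "S100A9", "S100A12", "FCGR1A",
             "MS4A6A", "CCL2", "CD93", "MPO")

# Canonical Monocyte markers (copied from the cis = FALSE output)
canonical <- c("CD14", "FCGR3A", "LYZ", "S100A8", "FCN1",
               "S100A9", "CD68", "VCAN", "IL1B", "MS4A7")

# Canonical markers that also appear among the cluster's matched markers
shared <- intersect(canonical, matched)

# Canonical markers still worth checking by expression (e.g. in a dot plot)
not_matched <- setdiff(canonical, matched)

# Fraction of canonical markers recovered by the cluster's marker list
frac_recovered <- length(shared) / length(canonical)

print(shared)
print(not_matched)
print(frac_recovered)
```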
plotSeuratDot
The best way to check marker expression is visually. plotSeuratDot is designed to work directly with the output of check_marker.
The entire pipeline, from annotation to visualization, can be written as a single pipe:
# For this example to be runnable, we need a Seurat object.
# We'll create a minimal one. In your real workflow, you would use your own srt object.
set.seed(1) # make the simulated counts and cluster labels reproducible
marker_genes <- unique(pbmc.markers$gene)
counts <- matrix(
  abs(rnorm(length(marker_genes) * 50, mean = 1, sd = 2)),
  nrow = length(marker_genes),
  ncol = 50
)
rownames(counts) <- marker_genes
colnames(counts) <- paste0("cell_", 1:50)
srt <- CreateSeuratObject(counts = counts)
# Assign clusters that match the pbmc.markers data
srt$seurat_clusters <- sample(0:8, 50, replace = TRUE)
Idents(srt) <- "seurat_clusters"
# Now, let's plot the evidence for clusters 1, 5, and 7
matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human") |>
  check_marker(cl = c(1, 5, 7), topcellN = 2, cis = TRUE) |>
  plotSeuratDot(srt = srt)
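Because the counts in this example are simulated, the resulting plot only demonstrates the mechanics. With a real Seurat object, the dot plot shows the expression of the genes that led to the annotations for clusters 1, 5, and 7, so the results can be assessed directly.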
After reviewing the evidence from the dot plots, we can make our final, informed decision. The finsert function provides a convenient way to create the final annotation vector.
# Based on our exploration, we finalize the annotations
cl2cell_final <- finsert(
  list(
    c(3) ~ "B cell",
    c(8) ~ "Megakaryocyte",
    c(7) ~ "DC",
    c(1, 5) ~ "Monocyte",
    c(0, 2, 4) ~ "Naive CD8+ T cell",
    c(6) ~ "Natural killer cell"
  ),
  len = 9 # Ensure vector length covers all clusters (0-8)
)
print("Final curated annotation:")
#> [1] "Final curated annotation:"
cl2cell_final
#> 0 1 2
#> "Naive CD8+ T cell" "Monocyte" "Naive CD8+ T cell"
#> 3 4 5
#> "B cell" "Naive CD8+ T cell" "Monocyte"
#> 6 7 8
#> "Natural killer cell" "DC" "Megakaryocyte"
This cl2cell_final vector can now be added to your Seurat object’s metadata for downstream analysis and plotting.
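A minimal sketch of that mapping step follows, using a plain data.frame as a stand-in for the Seurat metadata so that it runs without Seurat. The vector is retyped from the printed output above, and the names meta and cell_type are hypothetical. The key detail is to index the named vector by the cluster label as a character string: cluster IDs start at 0, so numeric indexing would silently drop cluster 0 and shift every other label by one.

```r
# Final annotation, retyped from the printed output above
cl2cell_final <- c(
  "0" = "Naive CD8+ T cell", "1" = "Monocyte", "2" = "Naive CD8+ T cell",
  "3" = "B cell", "4" = "Naive CD8+ T cell", "5" = "Monocyte",
  "6" = "Natural killer cell", "7" = "DC", "8" = "Megakaryocyte"
)

# Stand-in for per-cell metadata (hypothetical): one cluster ID per cell
meta <- data.frame(seurat_clusters = c(0, 3, 7, 1, 5, 8))

# Look up by name, not by position
meta$cell_type <- unname(cl2cell_final[as.character(meta$seurat_clusters)])
print(meta)

# With a real Seurat object, the same lookup would typically be written as
# (assuming the cluster IDs are stored in srt$seurat_clusters):
# srt$cell_type <- unname(cl2cell_final[as.character(srt$seurat_clusters)])
```

The as.character() call also handles the case where the cluster column is a factor, because it returns the factor labels and not the underlying integer codes.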
For specialized analyses, such as focusing on a specific tissue, working with a non-model organism, or using a proprietary list of markers, you can provide your own custom reference to matchCellMarker2.
The reference must be a data.frame (or data.table) with at least two columns: cell_name and marker. The easiest way to create this is from a named list.
Step 1: Create a named list of your custom markers.
custom_ref_list <- list(
  "T-cell" = c("CD3D", "CD3E", "CD3G"),
  "B-cell" = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)
print(custom_ref_list)
#> $`T-cell`
#> [1] "CD3D" "CD3E" "CD3G"
#>
#> $`B-cell`
#> [1] "CD79A" "MS4A1"
#>
#> $Myeloid
#> [1] "LYZ" "CST3" "AIF1"
#>
Step 2: Convert the list to the required data.frame format.
easybio provides the list2dt helper function for this.
custom_ref_df <- list2dt(custom_ref_list, col_names = c("cell_name", "marker"))
head(custom_ref_df)
cell_name | marker |
---|---|
T-cell | CD3D |
T-cell | CD3E |
T-cell | CD3G |
B-cell | CD79A |
B-cell | MS4A1 |
Myeloid | LYZ |
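For readers who want to see the shape of this conversion without the helper, the following base-R sketch builds an equivalent long-format table and checks the two required columns. It is an illustration only, not the implementation of list2dt; the name custom_ref_base and the sanity checks are assumptions.

```r
custom_ref_list <- list(
  "T-cell"  = c("CD3D", "CD3E", "CD3G"),
  "B-cell"  = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)

# Repeat each list name once per marker, then flatten the markers
custom_ref_base <- data.frame(
  cell_name = rep(names(custom_ref_list), lengths(custom_ref_list)),
  marker    = unlist(custom_ref_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Sanity checks before passing a reference on (hypothetical, but cheap):
stopifnot(
  all(c("cell_name", "marker") %in% names(custom_ref_base)), # required columns
  !anyNA(custom_ref_base$marker),                            # no missing genes
  !anyDuplicated(custom_ref_base)                            # no repeated pairs
)

print(custom_ref_base)
```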
Step 3: Run matchCellMarker2 with the ref parameter.
When ref is provided, the function ignores the spc, tissueClass, and tissueType parameters for matching.
marker_custom <- matchCellMarker2(
  marker = pbmc.markers,
  n = 50,
  ref = custom_ref_df
)
# Note that the cell_name column now contains our custom cell types
marker_custom[, head(.SD, 2), by = cluster]
cluster | cell_name | uniqueN | N | ordered_symbol | orderN |
---|---|---|---|---|---|
3 | B-cell | 2 | 2 | CD79A,MS4A1 | 1,1 |
easybio also provides functions for direct queries.
get_marker()
Directly retrieve markers for any cell type of interest.
get_marker(spc = "Human", cell = c("Monocyte", "Neutrophil"), number = 5, min.count = 1)
#> $Monocyte
#> [1] "CD14" "FCGR3A" "LYZ" "S100A8" "FCN1"
#>
#> $Neutrophil
#> [1] "FCGR3B" "S100A9" "CSF3R" "S100A8" "FCGR3A"
#>
plotMarkerDistribution()
Check the distribution of a specific marker across all cell types and tissues in the database.
plotMarkerDistribution(mkr = "CD68")