Example Workflow for Single-Cell Annotation with easybio

cw

2025-08-30

Introduction

This vignette demonstrates the workflow for single-cell RNA-seq annotation provided by the easybio package. The process combines the speed of automated database matching with the reliability of interactive verification and manual curation.

The core workflow follows three logical steps:

  1. Automated Annotation: Use matchCellMarker2() to quickly get a list of potential cell types for each cluster based on its marker genes.
  2. Verification & Exploration: Interactively investigate the automated results using check_marker() and plotSeuratDot() to build confidence in the annotations. This step helps answer two critical questions:
    • “Why was this annotation made?” (Which of my genes matched the database?)
    • “Is this annotation correct?” (Are the canonical markers for this cell type expressed in my cluster?)
  3. Final Curation: Based on the evidence gathered, use finsert() to assign the final, high-confidence cell type labels.

You can also view the R script for this workflow by running: fs::file_show(system.file(package = 'easybio', 'example-single-cell.R'))

Setup

First, let’s load the necessary libraries and the example marker data included with easybio. This data is derived from the 10x Genomics 3k PBMC dataset.

litedown::reactor(warning = FALSE) # vignette setting

library(easybio)
#> easybio has been updated with significant breaking changes in single-cell annotation workflow.
#> To learn the new workflow, please run:
#>   vignette("example-single-cell-annotation", package = "easybio")

library(Seurat)
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 
#> Attaching package: 'SeuratObject'
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, t
#> 
library(data.table)
# The pbmc.markers dataset is included in easybio
head(pbmc.markers)
      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster  gene
RPS12     0      0.739 1.000 0.991         0       0 RPS12
RPS6      0      0.693 1.000 0.995         0       0  RPS6
RPS27     0      0.737 0.999 0.992         0       0 RPS27
RPL32     0      0.627 0.999 0.995         0       0 RPL32
RPS14     0      0.634 1.000 0.994         0       0 RPS14
RPS25     0      0.769 0.997 0.975         0       0 RPS25

Step 1: Automated Annotation with matchCellMarker2

We begin by feeding the cluster markers (from Seurat::FindAllMarkers) into matchCellMarker2(). This function compares our markers against the CellMarker2.0 database and returns a ranked list of potential cell types for each cluster.

marker_matched <- matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human")

# Let's look at the top 2 potential cell types for each cluster
marker_matched[, head(.SD, 2), by = cluster]
cluster  cell_name                         uniqueN  N    ordered_symbol                              orderN
0        Naive CD8+ T cell                 6        34   CCR7,LEF1,CD8B,MAL,NELL2,TSHZ2              14,12, 2, 2, 2, 2
0        Naive T(Th0) cell                 3        32   CCR7,LEF1,LRRN3                             23, 8, 1
1        Monocyte                          9        133  CD14,S100A8,S100A9,S100A12,FCGR1A,MS4A6A,…  82,22,15, 5, 4, 2,…
1        Macrophage                        8        63   CD14,FCGR1A,CCL2,PLA2G7,RNASE1,S100A8,…     46, 6, 2, 2, 2, 2,…
2        Regulatory T(Treg) cell           11       148  FOXP3,IL2RA,CTLA4,TNFRSF4,TNFRSF18,ICOS,…   55,45,22, 7, 6, 4,…
6        Cytotoxic T cell                  4        24   PRF1,GZMB,GNLY,FGFBP2                       9,8,6,1
7        Plasmacytoid dendritic cell(pDC)  8        42   CLEC4C,LILRA4,SCT,LAMP5,LRRC26,SERPINF1,…   19,16, 2, 1, 1, 1,…
7        Dendritic cell                    6        38   FCER1A,CLEC10A,LILRA4,FLT3,CD1E,CLEC4C      16,11, 4, 3, 2, 2
8        Megakaryocyte                     9        52   PPBP,PF4,ITGA2B,GP9,MYL9,TUBB1,…            15,12, 9, 4, 4, 3,…
8        Endothelial cell                  6        41   CLDN5,ESAM,GNG11,LCN2,SERPINE1,SPARC        36, 1, 1, 1, 1, 1

The output table reports uniqueN (the number of unique matching markers) and N (the total number of matches); both help rank the candidate annotations.
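To make these two columns concrete, here is a minimal base-R sketch on a made-up reference table. It assumes, consistent with the output above where the orderN values sum to N, that N counts matching database records and uniqueN counts distinct matching genes. The objects `toy_ref` and `cluster_markers` are hypothetical, and the code is not easybio's implementation.

```r
# Hypothetical reference: one row per database record (cell type, marker gene).
toy_ref <- data.frame(
  cell_name = c("Monocyte", "Monocyte", "Monocyte", "Monocyte", "Macrophage", "Macrophage"),
  marker    = c("CD14", "CD14", "S100A8", "LYZ", "CD14", "CD68"),
  stringsAsFactors = FALSE
)

# Hypothetical top markers of a single cluster.
cluster_markers <- c("CD14", "S100A8", "FCGR1A")

# Keep only the reference records whose gene is among the cluster markers.
hits <- toy_ref[toy_ref$marker %in% cluster_markers, ]

# Per cell type: N = number of matching records, uniqueN = number of distinct genes.
summary_tbl <- do.call(rbind, lapply(split(hits, hits$cell_name), function(d) {
  data.frame(
    cell_name = d$cell_name[1],
    uniqueN   = length(unique(d$marker)),
    N         = nrow(d)
  )
}))

# Rank candidates (assumption: larger N first, ties broken by uniqueN).
summary_tbl <- summary_tbl[order(-summary_tbl$N, -summary_tbl$uniqueN), ]
print(summary_tbl)
```

A cell type supported by many records of one gene gets a high N but a low uniqueN, so it is worth reading the two columns together.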

We can create a quick preliminary annotation by taking the top hit for each cluster.

cl2cell_auto <- marker_matched[, head(.SD, 1), by = .(cluster)]
cl2cell_auto <- setNames(cl2cell_auto[["cell_name"]], cl2cell_auto[["cluster"]])
print("Initial automated annotation:")
#> [1] "Initial automated annotation:"
cl2cell_auto
#>                                  0                                  1 
#>                "Naive CD8+ T cell"                         "Monocyte" 
#>                                  2                                  3 
#>          "Regulatory T(Treg) cell"                           "B cell" 
#>                                  4                                  5 
#>                           "T cell"                       "Macrophage" 
#>                                  6                                  7 
#>              "Natural killer cell" "Plasmacytoid dendritic cell(pDC)" 
#>                                  8 
#>                    "Megakaryocyte" 

We can also get a global view of all possible annotations using plotPossibleCell.

plotPossibleCell(marker_matched[, head(.SD), by = .(cluster)], min.uniqueN = 2)

Step 2: Verification and Exploration

This is the most critical step. Instead of blindly trusting the automated result, we use easybio’s tools to verify it.

Answering “Why was this annotation made?”

To see the evidence behind an annotation, we use check_marker() with cis = TRUE. This shows which marker genes from our own data matched the database for a given annotation.

# Let's investigate clusters 1, 5, and 7
local_evidence <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = TRUE)
print(local_evidence)
#> $Monocyte
#> [1] "CD14"    "S100A8"  "S100A9"  "S100A12" "FCGR1A"  "MS4A6A"  "CCL2"   
#> [8] "CD93"    "MPO"    
#> 
#> $Macrophage
#> [1] "CD14"   "FCGR1A" "CCL2"   "PLA2G7" "RNASE1" "S100A8" "S100A9" "MS4A6A"
#> 
#> $Macrophage
#> [1] "C1QA"   "C1QB"   "MS4A7"  "MS4A4A"
#> 
#> $Monocyte
#> [1] "MS4A7" "C1QB"  "C1QA" 
#> 
#> $`Plasmacytoid dendritic cell(pDC)`
#> [1] "CLEC4C"   "LILRA4"   "SCT"      "LAMP5"    "LRRC26"   "SERPINF1" "SMPD3"   
#> [8] "TNFRSF21"
#> 
#> $`Dendritic cell`
#> [1] "FCER1A"  "CLEC10A" "LILRA4"  "FLT3"    "CD1E"    "CLEC4C" 
#> 

Answering “Is this annotation correct?”

To validate an annotation, we use check_marker() with cis = FALSE (the default). This fetches the canonical markers for the suggested cell type from the database. We can then check if these well-known markers are expressed in our cluster.

canonical_markers <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = FALSE)
print(canonical_markers)
#> $Monocyte
#>  [1] "CD14"   "FCGR3A" "LYZ"    "S100A8" "FCN1"   "S100A9" "CD68"   "VCAN"  
#>  [9] "IL1B"   "MS4A7" 
#> 
#> $Macrophage
#>  [1] "CD68"   "CD14"   "CD163"  "CSF1R"  "AIF1"   "APOE"   "MRC1"   "FCGR3A"
#>  [9] "C1QA"   "SPP1"  
#> 
#> $`Plasmacytoid dendritic cell(pDC)`
#>  [1] "IL3RA"  "CLEC4C" "LILRA4" "JCHAIN" "TCF4"   "GZMB"   "IRF8"   "IRF7"  
#>  [9] "ITM2C"  "BCL11A"
#> 
#> $`Dendritic cell`
#>  [1] "CD1C"    "FCER1A"  "CLEC10A" "CD11C"   "IL3RA"   "HLA-DRA" "CLEC9A" 
#>  [8] "LAMP3"   "CD1A"    "ITGAX"  
#> 
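One simple way to summarize this comparison is to count how many canonical markers also appear among the genes that matched in our cluster. The sketch below uses base R only, with both vectors copied from the Monocyte entries printed above (the cis = TRUE result for cluster 1 and the cis = FALSE result). The variable names are illustrative, not part of easybio.

```r
# Genes from cluster 1 that matched "Monocyte" (copied from the cis = TRUE output).
matched <- c("CD14", "S100A8", "S100A9", "S100A12", "FCGR1A",
             "MS4A6A", "CCL2", "CD93", "MPO")

# Canonical "Monocyte" markers (copied from the cis = FALSE output).
canonical <- c("CD14", "FCGR3A", "LYZ", "S100A8", "FCN1",
               "S100A9", "CD68", "VCAN", "IL1B", "MS4A7")

shared   <- intersect(canonical, matched) # canonical markers found among the matched genes
missing  <- setdiff(canonical, matched)   # canonical markers still to be inspected
coverage <- length(shared) / length(canonical)

print(shared)
print(missing)
print(coverage)
```

A canonical marker that is absent from the matched list may still be expressed in the cluster, since only the top n markers per cluster are matched. The genes in `missing` are therefore candidates for visual inspection rather than evidence against the annotation.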

Visual Confirmation with plotSeuratDot

The best way to check marker expression is visually. plotSeuratDot takes the output of check_marker as its input, so the two functions chain naturally.

The entire pipeline, from annotation to visualization, can be written as a single pipe:

# For this example to be runnable, we need a Seurat object.
# We'll create a minimal one. In your real workflow, you would use your own srt object.
set.seed(42) # make the simulated counts and cluster labels reproducible
marker_genes <- unique(pbmc.markers$gene)
counts <- matrix(
  abs(rnorm(length(marker_genes) * 50, mean = 1, sd = 2)),
  nrow = length(marker_genes),
  ncol = 50
)
rownames(counts) <- marker_genes
colnames(counts) <- paste0("cell_", 1:50)
srt <- CreateSeuratObject(counts = counts)
# Assign clusters that match the pbmc.markers data
srt$seurat_clusters <- sample(0:8, 50, replace = TRUE)
Idents(srt) <- "seurat_clusters"


# Now, let's plot the evidence for clusters 1, 5, and 7
matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human") |>
  check_marker(cl = c(1, 5, 7), topcellN = 2, cis = TRUE) |>
  plotSeuratDot(srt = srt)

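This dot plot shows the expression of the genes that led to the annotations for clusters 1, 5, and 7. Because the Seurat object in this vignette is built from random counts, the plot here only illustrates the mechanics; with your own data, it lets you judge whether the matched markers are truly specific to each cluster.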

Step 3: Final Manual Curation

After reviewing the evidence from the dot plots, we can make our final, informed decision. The finsert function provides a convenient way to create the final annotation vector.

# Based on our exploration, we finalize the annotations
cl2cell_final <- finsert(
  list(
    c(3) ~ "B cell",
    c(8) ~ "Megakaryocyte",
    c(7) ~ "DC",
    c(1, 5) ~ "Monocyte",
    c(0, 2, 4) ~ "Naive CD8+ T cell",
    c(6) ~ "Natural killer cell"
  ),
  len = 9 # Ensure vector length covers all clusters (0-8)
)
print("Final curated annotation:")
#> [1] "Final curated annotation:"
cl2cell_final
#>                     0                     1                     2 
#>   "Naive CD8+ T cell"            "Monocyte"   "Naive CD8+ T cell" 
#>                     3                     4                     5 
#>              "B cell"   "Naive CD8+ T cell"            "Monocyte" 
#>                     6                     7                     8 
#> "Natural killer cell"                  "DC"       "Megakaryocyte" 
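The formula-based call above can be read as "assign this label to these cluster ids". The base-R sketch below reproduces that idea with a hypothetical helper, `label_clusters()`. It is a stand-in for illustration, not easybio's source code, and the `na = "Unknown"` placeholder for unassigned clusters is an assumption of the sketch.

```r
# Hypothetical stand-in for the behaviour shown above; not easybio's source code.
label_clusters <- function(rules, len, na = "Unknown") {
  # Cluster ids start at 0, so position i of the vector holds cluster i - 1.
  out <- rep(na, len)
  names(out) <- as.character(seq_len(len) - 1)
  for (rule in rules) {
    ids   <- eval(rule[[2]]) # left-hand side of the formula: cluster ids
    label <- rule[[3]]       # right-hand side of the formula: cell type label
    out[as.character(ids)] <- label
  }
  out
}

labels <- label_clusters(
  list(
    c(1, 5) ~ "Monocyte",
    c(3) ~ "B cell"
  ),
  len = 6
)
print(labels)
```

Indexing by the character form of the cluster id, rather than by position, avoids off-by-one mistakes with 0-based cluster numbering.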

This cl2cell_final vector can now be added to your Seurat object’s metadata for downstream analysis and plotting.
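A minimal sketch of that lookup, using a plain data.frame as a stand-in for the Seurat metadata so that it runs without any packages. The `meta` table and the shortened `cl2cell` vector are hypothetical.

```r
# Named vector of labels keyed by cluster id (a subset of the result above).
cl2cell <- c("0" = "Naive CD8+ T cell", "1" = "Monocyte", "3" = "B cell")

# Hypothetical per-cell metadata; in Seurat this would be srt$seurat_clusters.
meta <- data.frame(
  cell = paste0("cell_", 1:5),
  seurat_clusters = factor(c(0, 1, 1, 3, 0))
)

# Convert cluster ids to character so the lookup is by name, not by position
# (indexing with a factor would silently use its integer codes instead).
meta$cell_type <- unname(cl2cell[as.character(meta$seurat_clusters)])
print(meta)

# With a Seurat object, the same lookup can be written as:
# srt$cell_type <- unname(cl2cell_final[as.character(srt$seurat_clusters)])
```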

Using a Custom Marker Database

For specialized analyses, such as focusing on a specific tissue, working with a non-model organism, or using a proprietary list of markers, you can provide your own custom reference to matchCellMarker2.

The reference must be a data.frame (or data.table) with at least two columns: cell_name and marker. The easiest way to create this is from a named list.

Step 1: Create a named list of your custom markers.

custom_ref_list <- list(
  "T-cell" = c("CD3D", "CD3E", "CD3G"),
  "B-cell" = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)
print(custom_ref_list)
#> $`T-cell`
#> [1] "CD3D" "CD3E" "CD3G"
#> 
#> $`B-cell`
#> [1] "CD79A" "MS4A1"
#> 
#> $Myeloid
#> [1] "LYZ"  "CST3" "AIF1"
#> 

Step 2: Convert the list to the required data.frame format. easybio provides the list2dt helper function for this.

custom_ref_df <- list2dt(custom_ref_list, col_names = c("cell_name", "marker"))
head(custom_ref_df)
cell_name  marker
T-cell     CD3D
T-cell     CD3E
T-cell     CD3G
B-cell     CD79A
B-cell     MS4A1
Myeloid    LYZ
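If you prefer to build the reference without the helper, the same two-column layout can be produced in base R. This is a sketch of an equivalent construction, not a description of how list2dt works internally; `custom_ref_base` is an illustrative name.

```r
custom_ref_list <- list(
  "T-cell" = c("CD3D", "CD3E", "CD3G"),
  "B-cell" = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)

# Repeat each list name once per marker, then flatten the markers.
custom_ref_base <- data.frame(
  cell_name = rep(names(custom_ref_list), lengths(custom_ref_list)),
  marker    = unlist(custom_ref_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Basic sanity checks before passing the table as `ref`.
stopifnot(
  all(c("cell_name", "marker") %in% names(custom_ref_base)),
  !anyDuplicated(custom_ref_base)
)
print(custom_ref_base)
```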

Step 3: Run matchCellMarker2 with the ref parameter. When ref is provided, the function ignores the spc, tissueClass, and tissueType parameters for matching.

marker_custom <- matchCellMarker2(
  marker = pbmc.markers,
  n = 50,
  ref = custom_ref_df
)
# Note that the cell_name column now contains our custom cell types
marker_custom[, head(.SD, 2), by = cluster]
cluster  cell_name  uniqueN  N  ordered_symbol  orderN
3        B-cell     2        2  CD79A,MS4A1     1,1

Additional Utilities

easybio also provides functions for direct queries.

get_marker()

Directly retrieve markers for any cell type of interest.

get_marker(spc = "Human", cell = c("Monocyte", "Neutrophil"), number = 5, min.count = 1)
#> $Monocyte
#> [1] "CD14"   "FCGR3A" "LYZ"    "S100A8" "FCN1"  
#> 
#> $Neutrophil
#> [1] "FCGR3B" "S100A9" "CSF3R"  "S100A8" "FCGR3A"
#> 

plotMarkerDistribution()

Check the distribution of a specific marker across all cell types and tissues in the database.

plotMarkerDistribution(mkr = "CD68")