Example Workflow for Single-Cell Annotation with easybio

Introduction

This vignette demonstrates the workflow for single-cell RNA-seq annotation provided by the easybio package. The workflow combines the speed of automated database matching with the reliability of interactive verification and manual curation.

The core workflow follows three logical steps:

1. Automated Annotation: Use matchCellMarker2() to quickly get a list of potential cell types for each cluster based on its marker genes.
2. Verification & Exploration: Interactively investigate the automated results using check_marker() and plotSeuratDot() to build confidence in the annotations. This step helps answer two critical questions:
   - “Why was this annotation made?” (Which of my genes matched the database?)
   - “Is this annotation correct?” (Are the canonical markers for this cell type expressed in my cluster?)
3. Final Curation: Based on the evidence gathered, use finsert() to assign the final, high-confidence cell type labels.

You can also view the R script for this workflow by running: fs::file_show(system.file(package = 'easybio', 'example-single-cell.R'))

Setup

First, let’s load the necessary libraries and the example marker data included with easybio. This data is derived from the 10x Genomics 3k PBMC dataset.

litedown::reactor(warning = FALSE) # vignette setting

library(easybio)

#> easybio has been updated with significant breaking changes in single-cell annotation workflow.
#> To learn the new workflow, please run:
#>   vignette("example-sc-seq-workflow", package = "easybio")

library(Seurat)

#> Loading required package: SeuratObject

#> Loading required package: sp

#> 
#> Attaching package: 'SeuratObject'
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, t
#>

library(data.table)

# The pbmc.markers dataset is included in easybio
head(pbmc.markers)

	avg_log2FC	pct.1	pct.2	gene
RPS12	0.739	1.000	0.991	RPS12
RPS6	0.693	1.000	0.995	RPS6
RPS27	0.737	0.999	0.992	RPS27
RPL32	0.627	0.999	0.995	RPL32
RPS14	0.634	1.000	0.994	RPS14
RPS25	0.769	0.997	0.975	RPS25

Step 1: Automated Annotation with `matchCellMarker2`

We begin by feeding the cluster markers (from Seurat::FindAllMarkers) into matchCellMarker2(). This function compares our markers against the CellMarker2.0 database and returns a ranked list of potential cell types for each cluster.

marker_matched <- matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human")

# Let's look at the top 2 potential cell types for each cluster
marker_matched[, head(.SD, 2), by = cluster]

cluster	cell_name	uniqueN	N	ordered_symbol	orderN
0	Naive CD8+ T cell	6	34	CCR7,LEF1,CD8B,MAL,NELL2,TSHZ2	14,12, 2, 2, 2, 2
0	Naive T(Th0) cell	3	32	CCR7,LEF1,LRRN3	23, 8, 1
1	Monocyte	9	133	CD14,S100A8,S100A9,S100A12,FCGR1A,MS4A6A,…	82,22,15, 5, 4, 2,…
1	Macrophage	8	63	CD14,FCGR1A,CCL2,PLA2G7,RNASE1,S100A8,…	46, 6, 2, 2, 2, 2,…
2	Regulatory T(Treg) cell	11	148	FOXP3,IL2RA,CTLA4,TNFRSF4,TNFRSF18,ICOS,…	55,45,22, 7, 6, 4,…
⋮	⋮	⋮	⋮	⋮	⋮
6	Cytotoxic T cell	4	24	PRF1,GZMB,GNLY,FGFBP2	9,8,6,1
7	Plasmacytoid dendritic cell(pDC)	8	42	CLEC4C,LILRA4,SCT,LAMP5,LRRC26,SERPINF1,…	19,16, 2, 1, 1, 1,…
7	Dendritic cell	6	38	FCER1A,CLEC10A,LILRA4,FLT3,CD1E,CLEC4C	16,11, 4, 3, 2, 2
8	Megakaryocyte	9	52	PPBP,PF4,ITGA2B,GP9,MYL9,TUBB1,…	15,12, 9, 4, 4, 3,…
8	Endothelial cell	6	41	CLDN5,ESAM,GNG11,LCN2,SERPINE1,SPARC	36, 1, 1, 1, 1, 1

The output table gives us uniqueN (the number of unique matching markers) and N (the total number of matches), which helps rank the potential annotations.
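To make the two counts concrete, here is a minimal base R sketch on a toy reference table. It assumes each reference row is one database record, so a gene can appear several times for the same cell type. The table, the gene lists, and all variable names are hypothetical, and this is not matchCellMarker2()'s actual implementation.

```r
# Toy reference: one row per database record (hypothetical data).
ref <- data.frame(
  cell_name = c("Monocyte", "Monocyte", "Monocyte", "Macrophage", "Macrophage", "B cell"),
  marker    = c("CD14", "CD14", "S100A8", "CD14", "C1QA", "MS4A1"),
  stringsAsFactors = FALSE
)

# Marker genes of one cluster (hypothetical).
cluster_markers <- c("CD14", "S100A8", "LYZ")

# Keep only the reference records whose marker is in the cluster's list.
hits <- ref[ref$marker %in% cluster_markers, ]

# N: total matching records; uniqueN: distinct matching genes.
N       <- tapply(hits$marker, hits$cell_name, length)
uniqueN <- tapply(hits$marker, hits$cell_name, function(x) length(unique(x)))

summary_tbl <- data.frame(
  cell_name = names(N),
  uniqueN   = as.vector(uniqueN),
  N         = as.vector(N),
  stringsAsFactors = FALSE
)

# Rank candidate cell types by total matches, highest first.
summary_tbl <- summary_tbl[order(-summary_tbl$N), ]
print(summary_tbl)
```

In this toy case Monocyte has N = 3 (CD14 twice, S100A8 once) but uniqueN = 2, so a single heavily cited gene raises N without raising uniqueN.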

We can create a quick preliminary annotation by taking the top hit for each cluster.

cl2cell_auto <- marker_matched[, head(.SD, 1), by = .(cluster)]
cl2cell_auto <- setNames(cl2cell_auto[["cell_name"]], cl2cell_auto[["cluster"]])
print("Initial automated annotation:")

#> [1] "Initial automated annotation:"

cl2cell_auto

#>                                  0                                  1 
#>                "Naive CD8+ T cell"                         "Monocyte" 
#>                                  2                                  3 
#>          "Regulatory T(Treg) cell"                           "B cell" 
#>                                  4                                  5 
#>                           "T cell"                       "Macrophage" 
#>                                  6                                  7 
#>              "Natural killer cell" "Plasmacytoid dendritic cell(pDC)" 
#>                                  8 
#>                    "Megakaryocyte"

We can also get a global view of all possible annotations using plotPossibleCell.

plotPossibleCell(marker_matched[, head(.SD), by = .(cluster)], min.uniqueN = 2)

Step 2: Verification and Exploration

This is the most critical step. Instead of blindly trusting the automated result, we use easybio’s tools to verify it.

Answering “Why was this annotation made?”

To see the evidence behind an annotation, we use check_marker() with cis = TRUE. This shows which marker genes from our data matched the database for a given annotation.

# Let's investigate clusters 1, 5, and 7
local_evidence <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = TRUE)
print(local_evidence)

#> $Monocyte
#> [1] "CD14"    "S100A8"  "S100A9"  "S100A12" "FCGR1A"  "MS4A6A"  "CCL2"   
#> [8] "CD93"    "MPO"    
#> 
#> $Macrophage
#> [1] "CD14"   "FCGR1A" "CCL2"   "PLA2G7" "RNASE1" "S100A8" "S100A9" "MS4A6A"
#> 
#> $Macrophage
#> [1] "C1QA"   "C1QB"   "MS4A7"  "MS4A4A"
#> 
#> $Monocyte
#> [1] "MS4A7" "C1QB"  "C1QA" 
#> 
#> $`Plasmacytoid dendritic cell(pDC)`
#> [1] "CLEC4C"   "LILRA4"   "SCT"      "LAMP5"    "LRRC26"   "SERPINF1" "SMPD3"   
#> [8] "TNFRSF21"
#> 
#> $`Dendritic cell`
#> [1] "FCER1A"  "CLEC10A" "LILRA4"  "FLT3"    "CD1E"    "CLEC4C" 
#>
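When two candidate labels compete for the same cluster, it can help to compare their evidence sets directly. The sketch below uses the third and fourth list entries printed above, which appear to belong to cluster 5 given the order of the output. The gene vectors are copied by hand and the variable names are illustrative.

```r
# Evidence printed above for the two cluster 5 candidates (copied by hand).
macrophage_evidence <- c("C1QA", "C1QB", "MS4A7", "MS4A4A")
monocyte_evidence   <- c("MS4A7", "C1QB", "C1QA")

# Genes supporting both candidate labels.
shared <- intersect(macrophage_evidence, monocyte_evidence)

# Genes supporting only one of the two labels.
only_macrophage <- setdiff(macrophage_evidence, monocyte_evidence)
only_monocyte   <- setdiff(monocyte_evidence, macrophage_evidence)

print(shared)
print(only_macrophage)
print(only_monocyte)
```

Here every gene supporting Monocyte also supports Macrophage, and only MS4A4A is specific to Macrophage. The matched genes alone do little to separate the two labels for this cluster, which is why the canonical markers are checked next.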

Answering “Is this annotation correct?”

To validate an annotation, we use check_marker() with cis = FALSE (the default). This fetches the canonical markers for the suggested cell type from the database. We can then check if these well-known markers are expressed in our cluster.

canonical_markers <- check_marker(marker_matched, cl = c(1, 5, 7), topcellN = 2, cis = FALSE)
print(canonical_markers)

#> $Monocyte
#>  [1] "CD14"   "FCGR3A" "LYZ"    "S100A8" "FCN1"   "S100A9" "CD68"   "VCAN"  
#>  [9] "IL1B"   "MS4A7" 
#> 
#> $Macrophage
#>  [1] "CD68"   "CD14"   "CD163"  "CSF1R"  "AIF1"   "APOE"   "MRC1"   "FCGR3A"
#>  [9] "C1QA"   "SPP1"  
#> 
#> $`Plasmacytoid dendritic cell(pDC)`
#>  [1] "IL3RA"  "CLEC4C" "LILRA4" "JCHAIN" "TCF4"   "GZMB"   "IRF8"   "IRF7"  
#>  [9] "ITM2C"  "BCL11A"
#> 
#> $`Dendritic cell`
#>  [1] "CD1C"    "FCER1A"  "CLEC10A" "CD11C"   "IL3RA"   "HLA-DRA" "CLEC9A" 
#>  [8] "LAMP3"   "CD1A"    "ITGAX"  
#>
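A quick, non-visual first check is to see how many canonical markers appear in a cluster's own marker list. The sketch below uses the Monocyte markers printed above. As a stand-in for the cluster's marker list it reuses the cluster 1 genes reported by the cis = TRUE call, so treat `cluster_genes` as an assumption; in practice you would use the cluster's full marker list.

```r
# Canonical Monocyte markers, copied from the output above.
canonical <- c("CD14", "FCGR3A", "LYZ", "S100A8", "FCN1",
               "S100A9", "CD68", "VCAN", "IL1B", "MS4A7")

# Stand-in for the marker genes of cluster 1 (assumption: copied from the
# cis = TRUE output; replace with your cluster's full marker list).
cluster_genes <- c("CD14", "S100A8", "S100A9", "S100A12", "FCGR1A",
                   "MS4A6A", "CCL2", "CD93", "MPO")

found    <- intersect(canonical, cluster_genes)  # canonical markers present
missing  <- setdiff(canonical, cluster_genes)    # canonical markers absent
coverage <- length(found) / length(canonical)

print(found)
print(coverage)
```

A canonical marker that is missing from a differential marker list may still be expressed in the cluster, so a low coverage value is a prompt to look at the expression data, not a verdict.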

Visual Confirmation with `plotSeuratDot`

The most direct way to check marker expression is visually. plotSeuratDot() accepts the output of check_marker(), so annotation, evidence lookup, and plotting can be chained in a single pipe:

# For this example to be runnable, we need a Seurat object.
# We'll create a minimal one from random counts. In your real workflow, you would use your own srt object.
set.seed(1) # make the simulated counts and cluster labels reproducible
marker_genes <- unique(pbmc.markers$gene)
counts <- matrix(
  abs(rnorm(length(marker_genes) * 50, mean = 1, sd = 2)),
  nrow = length(marker_genes),
  ncol = 50
)
rownames(counts) <- marker_genes
colnames(counts) <- paste0("cell_", 1:50)
srt <- CreateSeuratObject(counts = counts)
# Assign clusters that match the pbmc.markers data
srt$seurat_clusters <- sample(0:8, 50, replace = TRUE)
Idents(srt) <- "seurat_clusters"


# Now, let's plot the evidence for clusters 1, 5, and 7
matchCellMarker2(marker = pbmc.markers, n = 50, spc = "Human") |>
  check_marker(cl = c(1, 5, 7), topcellN = 2, cis = TRUE) |>
  plotSeuratDot(srt = srt)

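With a real Seurat object, this dot plot shows the expression of the genes that led to the annotations for clusters 1, 5, and 7, so each annotation can be assessed against the data. The object built above contains random counts and only demonstrates the call.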

Step 3: Final Manual Curation

After reviewing the evidence from the dot plots, we can make our final, informed decision. The finsert function provides a convenient way to create the final annotation vector.

# Based on our exploration, we finalize the annotations
cl2cell_final <- finsert(
  list(
    c(3) ~ "B cell",
    c(8) ~ "Megakaryocyte",
    c(7) ~ "DC",
    c(1, 5) ~ "Monocyte",
    c(0, 2, 4) ~ "Naive CD8+ T cell",
    c(6) ~ "Natural killer cell"
  ),
  len = 9 # Ensure vector length covers all clusters (0-8)
)
print("Final curated annotation:")

#> [1] "Final curated annotation:"

cl2cell_final

#>                     0                     1                     2 
#>   "Naive CD8+ T cell"            "Monocyte"   "Naive CD8+ T cell" 
#>                     3                     4                     5 
#>              "B cell"   "Naive CD8+ T cell"            "Monocyte" 
#>                     6                     7                     8 
#> "Natural killer cell"                  "DC"       "Megakaryocyte"
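To illustrate the idea behind the formula-based specification, here is a minimal base R sketch that expands a list of `clusters ~ "label"` formulas into a named vector. The function `expand_labels()` is hypothetical and is not finsert()'s actual implementation. It assumes cluster numbering starts at 0, as in the output above.

```r
# Hypothetical helper: expand `clusters ~ "label"` formulas into a named
# character vector covering clusters 0 .. len - 1.
expand_labels <- function(specs, len) {
  out <- setNames(rep(NA_character_, len), seq_len(len) - 1)
  for (f in specs) {
    clusters <- eval(f[[2]], environment(f))  # left-hand side: cluster ids
    label    <- eval(f[[3]], environment(f))  # right-hand side: cell type
    out[as.character(clusters)] <- label
  }
  out
}

labels <- expand_labels(
  list(
    c(3) ~ "B cell",
    c(8) ~ "Megakaryocyte",
    c(7) ~ "DC",
    c(1, 5) ~ "Monocyte",
    c(0, 2, 4) ~ "Naive CD8+ T cell",
    c(6) ~ "Natural killer cell"
  ),
  len = 9
)
print(labels)

# In this sketch, a cluster that no formula mentions stays NA.
partial <- expand_labels(list(c(0) ~ "A"), len = 2)
print(partial)
```

Leaving unassigned clusters as NA in the sketch makes a forgotten cluster easy to spot with anyNA() before the labels are used downstream.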

This cl2cell_final vector can now be added to your Seurat object’s metadata for downstream analysis and plotting.
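The lookup from cluster to label can be sketched with a plain data frame standing in for the Seurat metadata. The `meta` table and the `cell_type` column name are hypothetical. The main point is to index the named vector by the cluster id as a character string, because numeric indexing would treat cluster 0 as position 0 and shift every label.

```r
# Final labels, named by cluster id (copied from the output above).
cl2cell_final <- c(
  "0" = "Naive CD8+ T cell", "1" = "Monocyte", "2" = "Naive CD8+ T cell",
  "3" = "B cell", "4" = "Naive CD8+ T cell", "5" = "Monocyte",
  "6" = "Natural killer cell", "7" = "DC", "8" = "Megakaryocyte"
)

# Stand-in for per-cell metadata (hypothetical cells and clusters).
meta <- data.frame(
  cell = paste0("cell_", 1:6),
  seurat_clusters = c(0, 1, 3, 5, 7, 8)
)

# Look up labels by name, not by position.
meta$cell_type <- unname(cl2cell_final[as.character(meta$seurat_clusters)])
print(meta)

# With a Seurat object, the same lookup could be stored as a metadata
# column (assumption: `srt` holds clusters in `seurat_clusters`):
# srt$cell_type <- unname(cl2cell_final[as.character(srt$seurat_clusters)])
```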

Using a Custom Marker Database

For specialized analyses, such as focusing on a specific tissue, working with a non-model organism, or using a proprietary list of markers, you can provide your own custom reference to matchCellMarker2.

The reference must be a data.frame (or data.table) with at least two columns: cell_name and marker. The easiest way to create this is from a named list.

Step 1: Create a named list of your custom markers.

custom_ref_list <- list(
  "T-cell" = c("CD3D", "CD3E", "CD3G"),
  "B-cell" = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)
print(custom_ref_list)

#> $`T-cell`
#> [1] "CD3D" "CD3E" "CD3G"
#> 
#> $`B-cell`
#> [1] "CD79A" "MS4A1"
#> 
#> $Myeloid
#> [1] "LYZ"  "CST3" "AIF1"
#>

Step 2: Convert the list to the required data.frame format. easybio provides the list2dt helper function for this.

custom_ref_df <- list2dt(custom_ref_list, col_names = c("cell_name", "marker"))
head(custom_ref_df)

cell_name	marker
T-cell	CD3D
T-cell	CD3E
T-cell	CD3G
B-cell	CD79A
B-cell	MS4A1
Myeloid	LYZ
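If you prefer to build or check the reference without the helper, the same two-column shape can be produced in base R. This is a sketch of the required format only, not list2dt()'s implementation, and the sanity checks are suggestions, not requirements documented by the package.

```r
custom_ref_list <- list(
  "T-cell"  = c("CD3D", "CD3E", "CD3G"),
  "B-cell"  = c("CD79A", "MS4A1"),
  "Myeloid" = c("LYZ", "CST3", "AIF1")
)

# Repeat each cell type name once per marker, then flatten the markers.
custom_ref <- data.frame(
  cell_name = rep(names(custom_ref_list), lengths(custom_ref_list)),
  marker    = unlist(custom_ref_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Suggested sanity checks before passing the table as `ref`.
stopifnot(
  all(c("cell_name", "marker") %in% names(custom_ref)),
  !anyNA(custom_ref),
  !anyDuplicated(custom_ref)
)
print(custom_ref)
```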

Step 3: Run matchCellMarker2 with the ref parameter. When ref is provided, the function ignores the spc, tissueClass, and tissueType parameters for matching.

marker_custom <- matchCellMarker2(
  marker = pbmc.markers,
  n = 50,
  ref = custom_ref_df
)
# Note that the cell_name column now contains our custom cell types
marker_custom[, head(.SD, 2), by = cluster]

cluster	cell_name	uniqueN	N	ordered_symbol	orderN
3	B-cell	2	2	CD79A,MS4A1	1,1

Additional Utilities

easybio also provides functions for direct queries.

`get_marker()`

Directly retrieve markers for any cell type of interest.

get_marker(spc = "Human", cell = c("Monocyte", "Neutrophil"), number = 5, min.count = 1)

#> $Monocyte
#> [1] "CD14"   "FCGR3A" "LYZ"    "S100A8" "FCN1"  
#> 
#> $Neutrophil
#> [1] "FCGR3B" "S100A9" "CSF3R"  "S100A8" "FCGR3A"
#>

`plotMarkerDistribution()`

Check the distribution of a specific marker across all cell types and tissues in the database.

plotMarkerDistribution(mkr = "CD68")


cw

2025-10-02
