% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/scTrimDist.R
\name{scTrimDist}
\alias{scTrimDist}
\title{scTrimDist: Trim extreme cells based on kNN distance within cell types}
\usage{
scTrimDist(
  seurat_obj,
  celltype_col,
  knn_k = 30,
  keep_frac = 0.05,
  normalization_method = "LogNormalize",
  nfeatures = 2000,
  assay = "RNA",
  npcs = 20,
  resolution = 0.5,
  log2FC_filter = 1,
  pred,
  verbose = TRUE
)
}
\arguments{
\item{seurat_obj}{A \code{Seurat} object containing single-cell expression data.}

\item{celltype_col}{Character scalar specifying the column in
\code{seurat_obj@meta.data} defining cell types or clusters.}

\item{knn_k}{Integer specifying the number of nearest neighbours \eqn{K} used in
the kNN search within each cell type.}

\item{keep_frac}{Numeric in (0, 1) specifying the fraction \eqn{\alpha} of most
extreme cells to remove per cell type. Despite the argument name, this is the
fraction removed, not the fraction retained.}

\item{normalization_method}{Normalization method passed to
\code{Seurat::NormalizeData}.}

\item{nfeatures}{Number of variable features selected.}

\item{assay}{Assay used for expression data extraction.}

\item{npcs}{Number of principal components used downstream.}

\item{resolution}{Clustering resolution for \code{FindClusters}.}

\item{log2FC_filter}{Minimum log2 fold-change threshold for marker filtering.
If \code{NULL}, no filtering is applied.}

\item{pred}{A \code{SingleR} result object. Row names must correspond to cell
barcodes; \code{pred$labels} is used for annotation.}

\item{verbose}{Logical indicating whether progress messages are printed.}
}
\value{
A named list containing:
\itemize{
  \item \code{plot_outliers}: A \code{ggplot} object showing a t-SNE embedding with outlier cells highlighted.
  \item \code{trimmed_object}: Seurat object after trimming and reprocessing.
  \item \code{all_markers}: Data frame of marker genes.
  \item \code{knn_res}: List of kNN results per cell type.
}
}
\description{
Identifies and removes extreme (outlier) cells within each cell type or cluster
based on k-nearest neighbour (kNN) distances computed in the normalized
high-dimensional gene expression space. Cells located in sparsely populated
regions at the periphery of clusters are excluded prior to downstream analyses.
}
\details{
For each cell type (or cluster), a kNN search is performed using the normalized
gene expression matrix obtained from a standard Seurat preprocessing workflow.
For a given cell \eqn{i} in cluster \eqn{k}, the Euclidean distances
\eqn{D_{(j,i)}^k} to its \eqn{j = 1, \ldots, K} nearest neighbours are computed,
where \eqn{K} is set by \code{knn_k}.

The minimum distance
\deqn{
\min D_i^k = \min_{j = 1, \ldots, K} D_{(j,i)}^k
}
is used as a measure of local neighbourhood density. Cells with large minimum
distances are interpreted as extreme or non-representative cells.

A fraction \eqn{\alpha} (specified via \code{keep_frac}) of the most extreme cells
is removed per cluster, defined as cells with
\deqn{
\min D_i^k > Q_{1 - \alpha}
}
where \eqn{Q_{1 - \alpha}} is the \eqn{(1 - \alpha)} quantile of the minimum
kNN distance distribution within the cluster.
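
The trimming rule above can be sketched in base R for a single cell type. This
is an illustrative sketch only and not the package's implementation: the toy
matrix \code{expr}, the use of \code{stats::dist} for the neighbour search, and
the default quantile type are all assumptions made for the example.
\preformatted{
set.seed(1)

# Toy data (assumption): 50 cells (rows) x 5 features for one cell type
expr  <- matrix(rnorm(50 * 5), nrow = 50)
knn_k <- 10     # number of nearest neighbours, K
alpha <- 0.05   # fraction of cells to remove (keep_frac)

# Pairwise Euclidean distances; mask self-distances on the diagonal
d <- as.matrix(dist(expr))
diag(d) <- Inf

# K x n matrix: sorted distances from each cell to its K nearest neighbours
knn_d <- apply(d, 1, function(x) sort(x)[seq_len(knn_k)])

# Minimum kNN distance per cell; because the neighbours are sorted by
# distance, this equals the distance to the first nearest neighbour
min_d <- apply(knn_d, 2, min)

# Flag cells above the (1 - alpha) quantile as extreme and drop them
cutoff <- quantile(min_d, probs = 1 - alpha)
keep   <- min_d <= cutoff
table(keep)
}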

After trimming, the remaining cells are re-normalized and reprocessed using
standard Seurat workflows. Cell type annotations are assigned using a
\strong{precomputed SingleR result} supplied by the user, and cluster-specific
marker genes are identified.
}
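\examples{
\dontrun{
# Minimal usage sketch, not a tested workflow. `seu` (a Seurat object) and
# `pred` (a SingleR result whose row names are the cell barcodes of `seu`)
# are hypothetical objects assumed to exist already; "cell_type" is an
# assumed metadata column name.
res <- scTrimDist(
  seurat_obj   = seu,
  celltype_col = "cell_type",
  knn_k        = 30,
  keep_frac    = 0.05,   # fraction of extreme cells removed per cell type
  pred         = pred
)

# Elements of the returned list, as described in the Value section
res$plot_outliers
res$trimmed_object
head(res$all_markers)
}
}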
