Type: Package
Title: Detecting Extremal Values in a Normal Linear Model
Version: 1.0.3
Date: 2025-09-12
Description: Provides a method to detect values poorly explained by a Gaussian linear model. The procedure is based on the maximum of the absolute value of the studentized residuals, which is a parameter-free statistic. This approach generalizes several procedures used to detect abnormal values during longitudinal monitoring of biological markers. For methodological details, see: Berthelot G., Saulière G., Dedecker J. (2025). "DEViaN-LM An R Package for Detecting Abnormal Values in the Gaussian Linear Model". HAL Id: hal-05230549. https://hal.science/hal-05230549.
License: GPL-3
Encoding: UTF-8
Imports: Rcpp
LinkingTo: Rcpp, RcppArmadillo
Suggests: testthat (≥ 3.0.0)
Config/testthat/edition: 3
RoxygenNote: 7.3.1
Depends: R (≥ 2.10)
LazyData: true
SystemRequirements: OpenMP
NeedsCompilation: yes
Packaged: 2025-09-19 08:07:27 UTC; gsauliere
Author: Guillaume Sauliere [aut, cre], Geoffroy Berthelot [aut], Jérôme Dedecker [aut]
Maintainer: Guillaume Sauliere <guillaumesauliere@hotmail.com>
Repository: CRAN
Date/Publication: 2025-09-24 08:10:02 UTC

Detection of Poorly Explained Values in Gaussian Linear Models

Description

The devianLM package provides tools to detect values that are poorly explained by a Gaussian linear model. The method is based on the maximum absolute value of the studentized residuals, a statistic that is independent of the model parameters. This approach generalizes several procedures used to detect abnormal values, for example during the longitudinal monitoring of certain biological markers.
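
The parameter-free property can be checked numerically with base R alone. The sketch below is illustrative and does not call devianLM. It assumes internally studentized residuals as returned by stats::rstandard() (the package's exact studentization may differ), and the names max_abs_student, beta and sigma are hypothetical.

```r
# Illustrative check, base R only (not the package's implementation).
# Assumption: "studentized residuals" are taken as stats::rstandard().
set.seed(1)
n   <- 40
x   <- rnorm(n)
eps <- rnorm(n)                      # one fixed standard Gaussian noise draw

# Hypothetical helper: max |studentized residual| for given parameters
max_abs_student <- function(beta, sigma) {
  y <- 1 + beta * x + sigma * eps    # same design, same noise, new parameters
  max(abs(rstandard(lm(y ~ x))))
}

s1 <- max_abs_student(beta = 2,   sigma = 1)
s2 <- max_abs_student(beta = -50, sigma = 10)

# The two values agree up to floating-point error
all.equal(s1, s2)
```

Because the residuals are the noise projected off the design and then rescaled by their own estimated standard deviation, changing the slope or the noise scale leaves the statistic unchanged. Only the design matrix matters.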

Details

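The package offers two main functions:

devianlm_stats(): fits the linear model, computes the studentized residuals, and flags the observations whose absolute studentized residual exceeds a threshold.

get_devianlm_threshold(): computes that threshold by Monte Carlo simulation.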

These methods are particularly useful for regression diagnostics, quality control, and longitudinal monitoring in applied statistics.

Author(s)

Guillaume Saulière <guillaumesauliere@hotmail.com>
Geoffroy Berthelot <geoffroy.berthelot@insep.fr>
Jérôme Dedecker <jerome.dedecker@u-paris.fr>

Examples

set.seed(123)
x <- as.matrix(rnorm(50))
y <- 2 * x + rnorm(50)

# Small n_sims for quick example
result <- devianlm_stats(y, x, n_sims = 100)

Identify outliers using devianLM method

Description

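Fits a Gaussian linear model of y on x, computes the studentized residuals, and flags as outliers the observations whose absolute studentized residual exceeds the threshold.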

Usage

devianlm_stats(
  y,
  x,
  threshold = NULL,
  n_sims = 50000,
  nthreads = detectCores() - 1,
  alpha = 0.95,
  ...
)

Arguments

y

a numeric variable

x

either a numeric variable or several numeric variables (explanatory variables) concatenated in a data frame.

threshold

numeric or NULL; the cutoff applied to the absolute studentized residuals. If NULL, it is estimated by simulation using get_devianlm_threshold().

n_sims

optional; the number of Monte Carlo simulations. Defaults to 50000.

nthreads

optional; the number of CPU cores to use. Defaults to the number of detected CPU cores minus one (detectCores() - 1).

alpha

the quantile order of interest. Defaults to 0.95.

...

additional arguments for get_devianlm_threshold()
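
The default nthreads = detectCores() - 1 relies on parallel::detectCores(), which is documented to return NA when the number of cores cannot be determined, and which gives a default of 0 on a single-core machine. The sketch below, using the hypothetical helper name safe_nthreads, shows one cautious way to choose a value to pass explicitly. How devianlm_stats() itself handles 0 or NA is not stated in this documentation.

```r
# Hypothetical helper (not part of devianLM): pick a usable thread count.
safe_nthreads <- function(n_cores = parallel::detectCores()) {
  if (is.na(n_cores)) return(1L)            # core count unknown: fall back to 1
  max(1L, as.integer(n_cores) - 1L)         # leave one core free, never below 1
}

nthreads <- safe_nthreads()
# Intended use (requires the package and data):
# devianlm_stats(y, x, n_sims = 100, nthreads = nthreads)
```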

Value

devianlm_stats returns a list with the following components:

reg_residuals

Numeric vector. The studentized residuals from the linear model.

outliers

Integer vector. The indices (positions in the original data) of observations identified as outliers based on the threshold.

threshold

Numeric value. The cutoff applied to the absolute value of the studentized residuals to flag outliers. If not provided, it is estimated using get_devianlm_threshold().

is_outliers

Integer vector. A binary vector (0 or 1) of the same length as reg_residuals, indicating whether each observation is considered an outlier (1) or not (0).
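
The relationship between these components can be illustrated with a few lines of base R. This is a sketch on made-up numbers, not the package's code, and it assumes a strict comparison of the absolute residuals against the threshold.

```r
# Illustrative only: rebuild the flagging components from residuals and a
# threshold. Assumption: a strict ">" comparison on absolute values.
reg_residuals <- c(0.3, -4.1, 1.2, 3.9, -0.7)   # made-up studentized residuals
threshold     <- 3.5                            # made-up cutoff

is_outliers <- as.integer(abs(reg_residuals) > threshold)  # 0/1 per observation
outliers    <- which(is_outliers == 1L)                    # positions of the 1s

is_outliers  # 0 1 0 1 0
outliers     # 2 4
```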

Examples

set.seed(123)
y <- salary$hourly_earnings_log
x <- cbind(salary$age, salary$educational_attainment, salary$children_number)

test_salary <- devianlm_stats(y, x, n_sims = 100, alpha = 0.95)

plot(test_salary$reg_residuals,
  pch = 16, cex = .8,
  ylim = c(-1 * max(abs(test_salary$reg_residuals)), max(abs(test_salary$reg_residuals))),
  xlab = "", ylab = "Studentized residuals",
  col = ifelse(test_salary$is_outliers, "red", "black"))

# Add the threshold lines
abline(h = c(-test_salary$threshold, test_salary$threshold), col = "chartreuse2", lwd = 2)
 

get_devianlm_threshold : Compute threshold using Monte Carlo simulations

Description

This function supports testing whether the maximum of the absolute values of the studentized residuals of a Gaussian regression is abnormally high. The distribution of this maximum, which depends on the design matrix, is estimated by Monte Carlo simulation (with n_sims simulations).
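
The simulation described above can be sketched in base R. This is a minimal illustration, not the package's compiled implementation: the function name mc_threshold_sketch is hypothetical, and the sketch assumes that an intercept is added to x, that x has full column rank, that residuals are internally studentized, and that the threshold is the quantile of order alpha of the simulated maxima.

```r
# Minimal Monte Carlo sketch, base R only (not the package's C++ code).
# Assumptions: intercept added to x, full-rank design, internal
# studentization, threshold taken as the quantile of order alpha.
mc_threshold_sketch <- function(x, n_sims = 2000, alpha = 0.95) {
  X <- cbind(1, as.matrix(x))                   # design matrix with intercept
  n <- nrow(X)
  Q <- qr.Q(qr(X))                              # orthonormal basis of col(X)
  p <- ncol(Q)
  h <- rowSums(Q^2)                             # leverages h_ii
  eps <- matrix(rnorm(n * n_sims), n, n_sims)   # one column per simulation
  res <- eps - Q %*% crossprod(Q, eps)          # residuals (I - H) eps
  s   <- sqrt(colSums(res^2) / (n - p))         # residual std. error per sim
  stud <- sweep(res, 2, s, "/") / sqrt(1 - h)   # studentized residuals
  max_abs <- apply(abs(stud), 2, max)           # maximum per simulation
  unname(quantile(max_abs, probs = alpha))
}

set.seed(123)
x <- rnorm(50)
thr <- mc_threshold_sketch(x, n_sims = 2000, alpha = 0.95)
thr
```

Only Gaussian noise is simulated, with no regression coefficients or variance, because the studentized residuals do not depend on them. A larger n_sims reduces the Monte Carlo error of the estimated quantile at the cost of computing time.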

Usage

get_devianlm_threshold(
  x,
  n_sims = 50000,
  nthreads = detectCores() - 1,
  alpha = 0.95
)

Arguments

x

either a numeric variable or several numeric variables (explanatory variables) concatenated in a data frame.

n_sims

optional; the number of Monte Carlo simulations. Defaults to 50000.

nthreads

optional; the number of CPU cores to use. Defaults to the number of detected CPU cores minus one (detectCores() - 1).

alpha

the quantile order of interest. Defaults to 0.95.

Value

Numeric value.

threshold

The quantile of order alpha (0.95 by default) of the distribution of the maximum of the absolute values of the studentized residuals. This distribution depends on the design matrix and is estimated by Monte Carlo simulation (with n_sims simulations).


Salary dataset

Description

A random sample from the 2012 Current Population Survey (CPS). The CPS is the primary source of labor force statistics for the US population.

Usage

salary

Format

A data frame with 599 rows and 10 variables

See Also

Original data are available from <https://www.ilo.org/surveyLib/index.php/catalog/7379>.

The data dictionary is available from <https://www2.census.gov/programs-surveys/cps/datasets/2022/march/asec2022_ddl_pub_full.pdf>.