| Title: | Leveraging Experiment Lines to Data Analytics |
| Version: | 1.2.747 |
| Description: | The natural increase in the complexity of current research experiments and data demands better tools to enhance productivity in Data Analytics. The package is a framework designed to address the modern challenges in data analytics workflows. The package is inspired by Experiment Line concepts. It aims to provide seamless support for users in developing their data mining workflows by offering a uniform data model and method API. It enables the integration of various data mining activities, including data preprocessing, classification, regression, clustering, and time series prediction. It also offers options for hyper-parameter tuning and supports integration with existing libraries and languages. Overall, the package provides researchers with a comprehensive set of functionalities for data science, promoting ease of use, extensibility, and integration with various tools and libraries. Information on Experiment Line is based on Ogasawara et al. (2009) <doi:10.1007/978-3-642-02279-1_20>. |
| License: | MIT + file LICENSE |
| URL: | https://cefet-rj-dal.github.io/daltoolbox/, https://github.com/cefet-rj-dal/daltoolbox |
| BugReports: | https://github.com/cefet-rj-dal/daltoolbox/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.1.0) |
| RoxygenNote: | 7.3.3 |
| Imports: | FNN, caret, class, cluster, dbscan, dplyr, e1071, ggplot2, nnet, randomForest, reshape, tree |
| NeedsCompilation: | no |
| Packaged: | 2025-10-26 05:19:39 UTC; gpca |
| Author: | Eduardo Ogasawara |
| Maintainer: | Eduardo Ogasawara <eogasawara@ieee.org> |
| Repository: | CRAN |
| Date/Publication: | 2025-10-27 06:10:50 UTC |
Boston Housing Data (Regression)
Description
Housing values in suburbs of Boston.
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitric oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of the Black population by town.
lstat: percentage of lower status of the population.
medv: median value of owner-occupied homes in $1000's.
Usage
data(Boston)
Format
Regression Dataset.
Source
This dataset was obtained from the MASS library.
References
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102.
Examples
data(Boston)
head(Boston)
Action
Description
Generic to apply the object to data (e.g., predict, transform).
Usage
action(obj, ...)
Arguments
obj |
object: a dal_base object to apply the transformation on the input dataset. |
... |
optional arguments. |
Value
returns the result of the action applied to the provided data
Examples
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
Action implementation for transform
Description
Default action() implementation that proxies to transform() for transforms.
Usage
## S3 method for class 'dal_transform'
action(obj, ...)
Arguments
obj |
object |
... |
optional arguments |
Value
returns the transformed data
Examples
#See ?minmax for an example of transformation
Adjust categorical mapping
Description
One‑hot encode a factor vector into a matrix of indicator columns.
Usage
adjust_class_label(x, valTrue = 1, valFalse = 0)
Arguments
x |
vector to be categorized |
valTrue |
value to represent true |
valFalse |
value to represent false |
Details
Values are mapped to valTrue/valFalse (default 1/0). The resulting matrix has column names equal to levels(x).
Value
returns a matrix of indicator (one-hot) columns, one per level of x
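Examples
# Illustrative sketch (assuming the default valTrue = 1 / valFalse = 0 mapping)
data(iris)
onehot <- adjust_class_label(iris$Species)
head(onehot)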
Adjust to data frame
Description
Coerce an object to data.frame if needed (useful for S3 methods in this package).
Usage
adjust_data.frame(data)
Arguments
data |
dataset |
Value
returns a data.frame
Examples
data(iris)
df <- adjust_data.frame(iris)
Adjust factors
Description
Convert a vector to a factor with specified internal levels (ilevels) and labels (slevels).
Usage
adjust_factor(value, ilevels, slevels)
Arguments
value |
vector to be converted into factor |
ilevels |
order for categorical values |
slevels |
labels for categorical values |
Details
Numeric vectors are first converted to factors with ilevels as the level order, then relabeled to slevels.
Value
returns an adjusted factor
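Examples
# Illustrative sketch: map a 0/1 numeric vector to labeled factor levels
value <- c(0, 1, 1, 0)
adjusted <- adjust_factor(value, ilevels = c(0, 1), slevels = c("no", "yes"))
table(adjusted)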
Adjust to matrix
Description
Coerce an object to matrix if needed (useful before algorithms that expect matrices).
Usage
adjust_matrix(data)
Arguments
data |
dataset |
Value
returns an adjusted matrix
Examples
data(iris)
mat <- adjust_matrix(iris)
Autoencoder base (encoder)
Description
Base class for encoder‑only autoencoders. Intended to be subclassed by concrete implementations that learn a lower‑dimensional latent representation.
Usage
autoenc_base_e(input_size, encoding_size)
Arguments
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
Details
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to output the encoded representation.
Value
returns an autoenc_base_e object
References
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
Examples
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_e.md
Autoencoder base (encoder + decoder)
Description
Base class for autoencoders that both encode and decode. Intended to be subclassed by concrete implementations that learn to compress and reconstruct inputs.
Usage
autoenc_base_ed(input_size, encoding_size)
Arguments
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
Details
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to perform encode+decode.
Value
returns an autoenc_base_ed object
References
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
Examples
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_ed.md
Categorical mapping (one‑hot encoding)
Description
Convert a factor column into dummy variables (one‑hot encoding) using model.matrix without intercept.
Each level becomes a separate binary column.
Usage
categ_mapping(attribute)
Arguments
attribute |
attribute to be categorized. |
Details
This is a light wrapper around stats::model.matrix(~ attr - 1, data) that drops the original column
and returns only the dummy variables.
Value
returns a data frame with binary attributes, one for each possible category.
Examples
cm <- categ_mapping("Species")
iris_cm <- transform(cm, iris)
# the mapping can also be applied to a single column
species <- iris[,"Species", drop=FALSE]
iris_cm <- transform(cm, species)
Decision Tree for classification
Description
Univariate decision tree for classification using recursive partitioning.
This wrapper uses the tree package.
Usage
cla_dtree(attribute, slevels)
Arguments
attribute |
attribute target to model building |
slevels |
the possible values for the target classification |
Details
Decision trees split the feature space by maximizing node purity (e.g., Gini/entropy), yielding a human‑readable set of rules. They are fast and interpretable, and often used as base learners in ensembles.
Value
returns a classification object
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
K-Nearest Neighbors (KNN) Classification
Description
Classification by majority vote among the k nearest neighbors. Uses class::knn.
Usage
cla_knn(attribute, slevels, k = 1)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
k |
a vector of integers indicating the number of neighbors to be considered. |
Details
KNN is a simple, non‑parametric method. Choice of k trades bias/variance; distance metric is Euclidean by default.
Value
returns a knn object.
References
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Info. Theory.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_knn("Species", slevels, k=3)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Majority baseline classifier
Description
Trivial classifier that always predicts the most frequent class observed in the training data. Useful as a baseline.
Usage
cla_majority(attribute, slevels)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
Value
returns a classification object.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_majority("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
MLP for classification
Description
Multi-Layer Perceptron classifier using nnet::nnet (single hidden layer).
Usage
cla_mlp(attribute, slevels, size = NULL, decay = 0.1, maxit = 1000)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
size |
number of nodes that will be used in the hidden layer |
decay |
weight decay (L2 regularization) applied during training |
maxit |
maximum iterations |
Details
Uses softmax output with one‑hot targets from adjust_class_label. size controls hidden units and
decay applies L2 regularization. Features should be scaled.
Value
returns a classification object
References
Rumelhart, D., Hinton, G., Williams, R. (1986). Learning representations by back‑propagating errors. Bishop, C. M. (1995). Neural Networks for Pattern Recognition.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_mlp("Species", slevels, size=3, decay=0.03)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Naive Bayes Classifier
Description
Naive Bayes classification using e1071::naiveBayes.
Usage
cla_nb(attribute, slevels)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
Details
Assumes conditional independence of features given the class label, enabling fast probabilistic classification.
Value
returns a classification object.
References
Mitchell, T. (1997). Machine Learning. McGraw‑Hill. (Naive Bayes)
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_nb("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Random Forest for classification
Description
Ensemble classifier of decision trees using randomForest::randomForest.
Usage
cla_rf(attribute, slevels, nodesize = 5, ntree = 10, mtry = NULL)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
nodesize |
minimum size of terminal nodes |
ntree |
number of trees |
mtry |
number of attributes randomly sampled at each split |
Details
Combines many decorrelated trees to reduce variance. Key hyperparameters: ntree, mtry, nodesize.
Value
returns a classification object
References
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32. Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_rf("Species", slevels, ntree=5)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
SVM for classification
Description
Support Vector Machines (SVM) for classification using e1071::svm.
Usage
cla_svm(attribute, slevels, epsilon = 0.1, cost = 10, kernel = "radial")
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
epsilon |
parameter that controls the width of the margin around the separating hyperplane |
cost |
parameter that controls the trade-off between having a wide margin and correctly classifying training data points |
kernel |
the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |
Details
SVMs find a maximum‑margin hyperplane in a transformed feature space defined
by a kernel (linear, radial, polynomial, sigmoid). The cost controls the trade‑off
between margin width and training error; epsilon affects stopping; kernel sets the feature map.
Value
returns a SVM classification object
References
Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3):273–297. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_svm("Species", slevels, epsilon=0.0, cost=20.000)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Classification tuning (k-fold CV)
Description
Tune hyperparameters of a base classifier via k‑fold cross‑validation using a chosen metric.
Usage
cla_tune(base_model, folds = 10, ranges = NULL, metric = "accuracy")
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
metric |
metric used to optimize |
Value
returns a cla_tune object
References
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
Examples
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
# hyper parameter setup
tune <- cla_tune(cla_mlp("Species", levels(iris$Species)),
ranges=list(size=c(3:5), decay=c(0.1)))
# hyper parameter optimization
model <- fit(tune, train)
# testing optimization
test_prediction <- predict(model, test)
test_predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Classification base class
Description
Ancestor class for classification models providing common fields (target attribute and levels) and evaluation helpers.
Usage
classification(attribute, slevels)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
Value
returns a classification object
Examples
#See ?cla_dtree for a classification example using a decision tree
Clustering tuning (intrinsic metric)
Description
Tune clustering hyperparameters by evaluating an intrinsic metric over a parameter grid and selecting the elbow (max curvature).
Usage
clu_tune(base_model, folds = 10, ranges = NULL)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Value
returns a clu_tune object.
References
Satopaa, V. et al. (2011). Finding a “Kneedle” in a Haystack.
Examples
data(iris)
# fit model
model <- clu_tune(cluster_kmeans(k = 0), ranges = list(k = 1:10))
model <- fit(model, iris[,1:4])
model$k
Cluster
Description
Generic for clustering methods
Usage
cluster(obj, ...)
Arguments
obj |
a clusterer object |
... |
optional arguments |
Value
clustered data
Examples
#See ?cluster_kmeans for an example of transformation
DBSCAN
Description
Density-Based Spatial Clustering of Applications with Noise using dbscan::dbscan.
Usage
cluster_dbscan(minPts = 3, eps = NULL)
Arguments
minPts |
minimum number of points |
eps |
distance value |
Details
Discovers clusters as dense regions separated by sparse areas. Hyperparameters are eps (neighborhood radius)
and minPts (minimum points). If eps is missing, it is estimated from the kNN distance curve elbow.
Value
returns a dbscan object
References
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
Examples
# setup clustering
model <- cluster_dbscan(minPts = 3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
k-means
Description
k-means clustering using stats::kmeans.
Usage
cluster_kmeans(k = 1)
Arguments
k |
the number of clusters to form. |
Details
Partitions data into k clusters minimizing within‑cluster sum of squares. The intrinsic quality metric returned is the total within‑cluster SSE (lower is better).
Value
returns a k-means object.
References
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations. Lloyd, S. (1982). Least squares quantization in PCM.
Examples
# setup clustering
model <- cluster_kmeans(k=3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
PAM (Partitioning Around Medoids)
Description
Clustering around representative data points (medoids) using cluster::pam.
Usage
cluster_pam(k = 1)
Arguments
k |
the number of clusters to generate. |
Details
More robust to outliers than k‑means. The intrinsic metric reported is the within‑cluster SSE to medoids.
Value
returns PAM object.
References
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis.
Examples
# setup clustering
model <- cluster_pam(k = 3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
Clusterer
Description
Base class for clustering algorithms and related evaluation utilities.
Usage
clusterer()
Value
returns a clusterer object
Examples
#See ?cluster_kmeans for an example of transformation
Class dal_base
Description
Minimal abstract base class for all DAL objects. Defines the common generics fit() and action()
used by transforms and learners.
Usage
dal_base()
Value
returns a dal_base object
Examples
trans <- dal_base()
Graphics utilities
Description
A collection of small plotting helpers built on ggplot2 used across the package
to quickly visualize vectors, grouped summaries and time series. All functions return a
ggplot2::ggplot object so you can further customize the theme, scales, and annotations.
Details
Conventions adopted:
Input data generally follows the pattern: the first column is an index or category (x) and the remaining columns are numeric series; some functions expect a long format with columns named x, value, and variable.
The colors parameter accepts either a single color or a vector mapped to groups/variables.
Transparency is controlled by alpha where provided.
All helpers set a light theme_bw() baseline and place legends at the bottom by default.
See Also
ggplot2
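Examples
# Illustrative sketch: the helpers return ggplot objects that can be further customized
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
grf <- plot_bar(data, colors="blue") + ggplot2::ggtitle("Mean sepal length by species")
plot(grf)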
DAL Learner (base class)
Description
Base ancestor for learning tasks (classification, regression, clustering, time series).
Provides common behavior such as proxying action() to the model‑specific operation
(e.g., predict() for predictors, cluster() for clusterers) and an evaluate() generic.
An example of a learner is a decision tree (see cla_dtree).
Usage
dal_learner()
Value
returns a learner object
Examples
#See ?cla_dtree for a classification example using a decision tree
DAL Transform
Description
Base class for data transformations with optional fit()/inverse_transform() support.
Usage
dal_transform()
Details
The default transform() calls the underlying action.default(); subclasses should implement
transform.className and optionally inverse_transform.className.
Value
returns a dal_transform object
Examples
# See ?minmax or ?zscore for examples
DAL Tune (base for hyperparameter search)
Description
Base class for hyperparameter optimization that stores a base model, a fold count, and a parameter grid. Specializations (classification/regression/clustering) implement the evaluation logic.
Usage
dal_tune(base_model, folds = 10, ranges)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Details
Ranges are expanded via expand.grid, and selection is delegated to select_hyper() which can be
overridden by subclasses to implement custom criteria.
Value
returns a dal_tune object
Examples
#See ?cla_tune for classification tuning
#See ?reg_tune for regression tuning
#See ?ts_tune for time series tuning
Data sampling abstractions
Description
Base class for sampling strategies that provide train/test splitting and k‑fold partitioning.
Two standard implementations are sample_random() and sample_stratified().
Usage
data_sample()
Value
returns an object of class data_sample
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
PCA
Description
Principal Component Analysis (PCA) for unsupervised dimensionality reduction. Transforms correlated variables into orthogonal principal components ordered by explained variance.
Usage
dt_pca(attribute = NULL, components = NULL)
Arguments
attribute |
target attribute to model building |
components |
number of components for PCA |
Details
Fits PCA on (optionally) the numeric predictors only (excluding attribute when provided),
removes constant columns, and selects the number of components by an elbow rule (minimum curvature)
unless components is set explicitly.
Value
returns an object of class dt_pca
References
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.
Examples
mypca <- dt_pca("Species")
# Automatically fitting number of components
mypca <- fit(mypca, iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
# Manual establishment of number of components
mypca <- dt_pca("Species", 3)
mypca <- fit(mypca, datasets::iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
Evaluate
Description
Evaluate learner performance. The actual evaluation varies according to the type of learner (clustering, classification, regression, time series regression).
Usage
evaluate(obj, ...)
Arguments
obj |
object |
... |
optional arguments |
Value
returns the evaluation
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)
model <- fit(model, iris)
prediction <- predict(model, iris)
predictand <- adjust_class_label(iris[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Fit
Description
Generic to train/adjust an object using provided data and optional parameters.
Usage
fit(obj, ...)
Arguments
obj |
object |
... |
optional arguments. |
Value
returns the object after fitting
Examples
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
tune hyperparameters of ml model
Description
Tunes the hyperparameters of a machine learning model for classification
Usage
## S3 method for class 'cla_tune'
fit(obj, data, ...)
Arguments
obj |
an object containing the model and tuning configuration |
data |
the dataset used for training and evaluation |
... |
optional arguments |
Value
returns a fitted object
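Examples
# Illustrative sketch mirroring ?cla_tune: tune an MLP classifier on iris
data(iris)
tune <- cla_tune(cla_mlp("Species", levels(iris$Species)),
ranges=list(size=c(3,5), decay=c(0.1)))
model <- fit(tune, iris)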
fit dbscan model
Description
Fits a DBSCAN clustering model by setting the eps parameter.
If eps is not provided, it is estimated based on the k-nearest neighbor distances.
It wraps dbscan library
Usage
## S3 method for class 'cluster_dbscan'
fit(obj, data, ...)
Arguments
obj |
an object containing the DBSCAN model configuration, including minPts and (optionally) eps |
data |
the dataset to use for fitting the model |
... |
optional arguments |
Value
returns a fitted object with the eps parameter set
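Examples
# Illustrative sketch: when eps is omitted, fit() estimates it from the kNN distance curve
data(iris)
model <- cluster_dbscan(minPts = 3)
model <- fit(model, iris[,1:4])
model$eps # eps stored under the constructor argument name (assumed field name)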
Maximum curvature analysis (elbow detection)
Description
Computes a smoothing spline over a sequence and returns the location/value of maximum curvature, often used as an "elbow" detector.
Usage
fit_curvature_max()
Value
returns an object of class fit_curvature_max, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the maximum curvature is reached.
y: The value where the maximum curvature occurs.
yfit: The value of the maximum curvature.
Examples
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = -log(x), variable = "log")
myfit <- fit_curvature_max()
res <- transform(myfit, dat$value)
head(res)
Minimum curvature analysis (elbow detection)
Description
Computes a smoothing spline over a sequence and returns the location/value of minimum curvature, complementary to maximum curvature and useful in elbow detection.
Usage
fit_curvature_min()
Value
Returns an object of class fit_curvature_min, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the minimum curvature is reached.
y: The value where the minimum curvature occurs.
yfit: The value of the minimum curvature.
Examples
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = log(x), variable = "log")
myfit <- fit_curvature_min()
res <- transform(myfit, dat$value)
head(res)
Inverse Transform
Description
Optional inverse operation for a transformation; defaults to identity.
Usage
inverse_transform(obj, ...)
Arguments
obj |
a dal_transform object. |
... |
optional arguments. |
Value
returns the inverse-transformed dataset.
Examples
#See ?minmax for an example of transformation
K-fold sampling
Description
Split a dataset into k folds using a sampling strategy.
Usage
k_fold(obj, data, k)
Arguments
obj |
an object representing the sampling method |
data |
dataset to be partitioned |
k |
number of folds |
Value
returns a list of k data frames
Examples
#using random sampling
sample <- sample_random()
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Min-max normalization
Description
Linearly scales numeric columns to the [0,1] range per column.
Usage
minmax()
Details
For each numeric column j, computes (x - min_j) / (max_j - min_j). Constant columns map to 0.
minmax = (x-min(x))/(max(x)-min(x))
Value
returns an object of class minmax
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Normalization section)
Examples
data(iris)
head(iris)
trans <- minmax()
trans <- fit(trans, iris)
tiris <- transform(trans, iris)
head(tiris)
itiris <- inverse_transform(trans, tiris)
head(itiris)
Outlier removal by boxplot (IQR rule)
Description
Removes outliers from numeric columns using Tukey's boxplot rule: values below Q1 - alpha·IQR or above Q3 + alpha·IQR are flagged as outliers.
Usage
outliers_boxplot(alpha = 1.5)
Arguments
alpha |
boxplot outlier threshold (default 1.5, but can be 3.0 to remove extreme values) |
Details
The default alpha=1.5 corresponds to the standard boxplot whiskers; alpha=3 is used for extreme outliers.
Value
returns an outlier object
References
Tukey, J. W. (1977). Exploratory Data Analysis. Addison‑Wesley.
Examples
# code for outlier removal
out_obj <- outliers_boxplot() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
#inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_boxplot <- iris[idx,]
iris.outliers_boxplot
Outlier removal by Gaussian 3-sigma rule
Description
Removes outliers from numeric columns using the 3‑sigma rule under a Gaussian assumption: values outside mean ± alpha·sd are flagged as outliers.
Usage
outliers_gaussian(alpha = 3)
Arguments
alpha |
gaussian threshold (default 3) |
Value
returns an outlier object
References
Pukelsheim, F. (1994). The Three Sigma Rule. The American Statistician 48(2):88–91.
Examples
# code for outlier removal
out_obj <- outliers_gaussian() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
#inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_gaussian <- iris[idx,]
iris.outliers_gaussian
Plot bar graph
Description
Draw a simple bar chart from a two‑column data.frame: first column as categories (x), second as values.
Usage
plot_bar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
two‑column data.frame: category in the first column, numeric values in the second |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single value) |
alpha |
bar transparency (0–1) |
Details
If colors is provided, a constant fill is used; otherwise ggplot2's default palette applies.
alpha controls bar transparency. The first column is coerced to factor when needed.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_bar(data, colors="blue")
plot(grf)
Plot boxplot
Description
Boxplots for each numeric column of a data.frame.
Usage
plot_boxplot(data, label_x = "", label_y = "", colors = NULL, barwidth = 0.25)
Arguments
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for boxes |
barwidth |
width of the box (numeric) |
Details
The data is melted to long format and a box is drawn per original column. If colors is provided,
a constant fill is applied to all boxes. Use barwidth to control box width.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_boxplot(iris, colors="white")
plot(grf)
Boxplot per class
Description
Boxplots of a numeric column grouped by a class label.
Usage
plot_boxplot_class(
data,
class_label,
label_x = "",
label_y = "",
colors = NULL
)
Arguments
data |
data.frame with a grouping column and one numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for the boxes |
Details
Expects a data.frame with the grouping column named in class_label and one numeric column.
The function melts to long format and draws per‑group distributions.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_boxplot_class(iris |> dplyr::select(Sepal.Width, Species),
class_label = "Species", colors=c("red", "green", "blue"))
plot(grf)
Plot density
Description
Kernel density plot for one or multiple numeric columns.
Usage
plot_density(
data,
label_x = "",
label_y = "",
colors = NULL,
bin = NULL,
alpha = 0.25
)
Arguments
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single column) or vector for groups |
bin |
optional bin width passed to the density layer |
alpha |
fill transparency (0–1) |
Details
If data has multiple numeric columns, densities are overlaid and filled by column (group).
When a single column is provided, colors (if set) is used as a constant fill.
The bin argument is passed to geom_density(binwidth=...).
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_density(iris |> dplyr::select(Sepal.Width), colors="blue")
plot(grf)
Plot density per class
Description
Kernel density plot grouped by a class label.
Usage
plot_density_class(
data,
class_label,
label_x = "",
label_y = "",
colors = NULL,
bin = NULL,
alpha = 0.5
)
Arguments
data |
data.frame with class label and a numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fills per class |
bin |
optional bin width passed to the density layer |
alpha |
fill transparency (0–1) |
Details
Expects data with a grouping column named in class_label and one numeric column. Each group is
filled with a distinct color (if provided).
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_density_class(iris |> dplyr::select(Sepal.Width, Species),
class_label = "Species", colors=c("red", "green", "blue"))
plot(grf)
Plot grouped bar
Description
Grouped (side‑by‑side) bar chart for multiple series per category.
Usage
plot_groupedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Details
Expects a data.frame where the first column is the category (x) and the remaining columns are
numeric series. Bars are grouped by series. Provide colors with length equal to the number of series to set fills.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length), Sepal.Width=mean(Sepal.Width))
head(data)
# plotting data
grf <- plot_groupedbar(data, colors=c("blue", "red"))
plot(grf)
Plot histogram
Description
Histogram for a numeric column using ggplot2.
Usage
plot_hist(data, label_x = "", label_y = "", color = "white", alpha = 0.25)
Arguments
data |
data.frame with one numeric column (first column is used if multiple) |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
fill color |
alpha |
transparency level (0–1) |
Details
If multiple columns are provided, only the first is used. Breaks are computed via graphics::hist to
mirror base R binning. color controls the fill; alpha the transparency.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_hist(iris |> dplyr::select(Sepal.Width), color=c("blue"))
plot(grf)
Plot lollipop
Description
Lollipop chart (stick + circle + value label) per category.
Usage
plot_lollipop(
data,
label_x = "",
label_y = "",
colors = NULL,
color_text = "black",
size_text = 3,
size_ball = 8,
alpha_ball = 0.2,
min_value = 0,
max_value_gap = 1
)
Arguments
data |
data.frame with category and numeric values |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
stick/circle color |
color_text |
color of the text inside the circle |
size_text |
text size |
size_ball |
circle size |
alpha_ball |
circle transparency (0–1) |
min_value |
minimum baseline for the stick |
max_value_gap |
gap from value to stick end |
Details
Expects a data.frame with category in the first column and numeric values in subsequent columns.
Circles are drawn at values, with vertical segments extending from min_value to value - max_value_gap.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_lollipop(data, colors="blue", max_value_gap=0.2)
plot(grf)
Plot pie
Description
Pie chart from a two‑column data.frame (category, value) using polar coordinates.
Usage
plot_pieplot(
data,
label_x = "",
label_y = "",
colors = NULL,
textcolor = "white",
bordercolor = "black"
)
Arguments
data |
two‑column data.frame with category and value |
label_x |
x‑axis label (unused in pie, kept for symmetry) |
label_y |
y‑axis label (unused in pie) |
colors |
vector of slice fills |
textcolor |
label text color |
bordercolor |
slice border color |
Details
Slices are sized by the second (numeric) column. Text and border colors can be customized.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_pieplot(data, colors=c("red", "green", "blue"))
plot(grf)
Plot points
Description
Dot chart for multiple series across categories (points only).
Usage
plot_points(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
data.frame with category + one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color vector for series |
Details
Expects a data.frame with category in the first column and one or more numeric series.
Points are colored by series (legend shows original column names). Supply colors to override the palette.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin=sin(x), cosine=cos(x)+5)
head(data)
grf <- plot_points(data, colors=c("red", "green"))
plot(grf)
Plot radar
Description
Radar (spider) chart for a single profile of variables using polar coordinates.
Usage
plot_radar(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
two‑column data.frame: variable name and value |
label_x |
x‑axis label (unused; variable names are shown around the circle) |
label_y |
y‑axis label |
colors |
line/fill color for the polygon |
Details
Expects a two‑column data.frame with variable names in the first column and numeric values in the second.
Value
returns a ggplot2::ggplot graphic
Examples
data <- data.frame(name = "Petal.Length", value = mean(iris$Petal.Length))
data <- rbind(data, data.frame(name = "Petal.Width", value = mean(iris$Petal.Width)))
data <- rbind(data, data.frame(name = "Sepal.Length", value = mean(iris$Sepal.Length)))
data <- rbind(data, data.frame(name = "Sepal.Width", value = mean(iris$Sepal.Width)))
grf <- plot_radar(data, colors="red") + ggplot2::ylim(0, NA)
plot(grf)
Scatter graph
Description
Scatter plot from a long data.frame with columns named x, value, and variable.
Usage
plot_scatter(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
long data.frame with columns x, value, and variable |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color(s); when variable is numeric, a gradient color scale is used |
Details
Colors are mapped to variable. If variable is numeric, a gradient color scale is used when colors is provided.
Value
return a ggplot2::ggplot graphic
Examples
grf <- plot_scatter(iris |> dplyr::select(x = Sepal.Length,
value = Sepal.Width, variable = Species),
label_x = "Sepal.Length", label_y = "Sepal.Width",
colors=c("red", "green", "blue"))
plot(grf)
Plot series
Description
Line plot for one or more series over a common x index.
Usage
plot_series(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
data.frame with x in the first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of colors for series |
Details
Expects a data.frame where the first column is the x index and remaining columns are numeric series.
Points and lines are drawn per series; supply colors to override the palette.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin=sin(x))
head(data)
grf <- plot_series(data, colors=c("red"))
plot(grf)
Plot stacked bar
Description
Stacked bar chart for multiple series per category.
Usage
plot_stackedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Details
Expects a data.frame with category in the first column and series in remaining columns.
Bars are stacked within each category. Provide colors (one per series) to control fills.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length), Sepal.Width=mean(Sepal.Width))
#plotting data
grf <- plot_stackedbar(data, colors=c("blue", "red"))
plot(grf)
Plot time series chart
Description
Simple time series plot with points and a line.
Usage
plot_ts(x = NULL, y, label_x = "", label_y = "", color = "black")
Arguments
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric series |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
color for the series |
Details
If x is NULL, an integer index 1:n is used. The color applies to both points and line.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
y <- sin(x)
grf <- plot_ts(x = x, y = y, color=c("red"))
plot(grf)
Plot time series with predictions
Description
Plot original series plus dashed lines for in‑sample adjustment and optional out‑of‑sample predictions.
Usage
plot_ts_pred(
x = NULL,
y,
yadj,
ypred = NULL,
label_x = "",
label_y = "",
color = "black",
color_adjust = "blue",
color_prediction = "green"
)
Arguments
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric time series |
yadj |
fitted/adjusted values for the training window |
ypred |
optional predicted values after the training window |
label_x |
x‑axis title |
label_y |
y‑axis title |
color |
color for the original series |
color_adjust |
color for the adjusted values (dashed) |
color_prediction |
color for the predictions (dashed) |
Details
yadj length defines the training segment; ypred (if provided) is appended after yadj.
Value
returns a ggplot2::ggplot graphic
Examples
x <- base::seq(0, 10, 0.25)
yvalues <- sin(x) + rnorm(41,0,0.1)
adjust <- sin(x[1:35])
prediction <- sin(x[36:41])
grf <- plot_ts_pred(y=yvalues, yadj=adjust, ypred=prediction)
plot(grf)
Predictor (base for classification/regression)
Description
Ancestor class for supervised predictors (classification and regression).
Provides a default fit() to record feature names and proxies action() to predict().
An example predictor is a decision tree classifier (cla_dtree).
Usage
predictor()
Value
returns a predictor object
Examples
#See ?cla_dtree for a classification example using a decision tree
Decision Tree for regression
Description
Regression tree using recursive partitioning via the tree package.
Usage
reg_dtree(attribute)
Arguments
attribute |
attribute target to model building. |
Details
Splits are chosen to reduce squared error within nodes; result is an interpretable set of piecewise constants.
Value
returns a decision tree regression object
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Examples
data(Boston)
model <- reg_dtree("medv")
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
K-Nearest Neighbors (KNN) Regression
Description
KNN regression using FNN::knn.reg, predicting by averaging the targets of the k nearest neighbors.
Usage
reg_knn(attribute, k)
Arguments
attribute |
attribute target to model building |
k |
number of k neighbors |
Details
Non‑parametric approach suitable for local smoothing. Sensitive to feature scaling; consider normalization beforehand.
Value
returns a knn regression object
References
Altman, N. (1992). An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression.
Examples
data(Boston)
model <- reg_knn("medv", k=3)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
MLP for regression
Description
Multi-Layer Perceptron regression using nnet::nnet (single hidden layer).
Usage
reg_mlp(attribute, size = NULL, decay = 0.05, maxit = 1000)
Arguments
attribute |
attribute target to model building |
size |
number of neurons in the hidden layer |
decay |
weight decay (L2 regularization) parameter |
maxit |
number of maximum iterations for training |
Details
Feedforward neural network with size hidden units and L2 regularization controlled by decay.
Data should be scaled for stable training.
Value
returns an object of class reg_mlp
References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Examples
data(Boston)
model <- reg_mlp("medv", size=5, decay=0.54)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Random Forest for regression
Description
Regression via Random Forests, an ensemble of decision trees trained
on bootstrap samples with random feature subsetting at each split. This wrapper
uses the randomForest package API.
Usage
reg_rf(attribute, nodesize = 1, ntree = 10, mtry = NULL)
Arguments
attribute |
attribute target to model building |
nodesize |
minimum size of terminal nodes |
ntree |
number of trees |
mtry |
number of attributes randomly sampled at each split |
Details
Random Forests reduce variance and are robust to overfitting on tabular data.
Key hyperparameters are the number of trees (ntree), the number of variables tried at
each split (mtry), and the minimum node size (nodesize).
Value
returns an object of class reg_rf
References
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32. Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
Examples
data(Boston)
model <- reg_rf("medv", ntree=10)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
SVM for regression
Description
Support Vector Regression (SVR) using e1071::svm.
Usage
reg_svm(attribute, epsilon = 0.1, cost = 10, kernel = "radial")
Arguments
attribute |
attribute target to model building |
epsilon |
width of the epsilon-insensitive tube around the regression function |
cost |
parameter that controls the trade-off between having a wide margin and correctly classifying training data points |
kernel |
the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |
Details
SVR optimizes a margin with an epsilon‑insensitive loss around the regression function.
The cost controls regularization strength; epsilon sets the width of the insensitive tube; and
kernel defines the feature map (linear, radial, polynomial, sigmoid).
Value
returns a SVM regression object
References
Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V. (1997). Support Vector Regression Machines. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.
Examples
data(Boston)
model <- reg_svm("medv", epsilon=0.2, cost=40.000)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Regression tuning (k-fold CV)
Description
Tune hyperparameters of a base regressor via k‑fold cross‑validation minimizing an error metric (MSE).
Usage
reg_tune(base_model, folds = 10, ranges = NULL)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Value
returns a reg_tune object.
References
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
Examples
# preparing dataset for random sampling
data(Boston)
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
# hyper parameter setup
tune <- reg_tune(reg_mlp("medv"), ranges = list(size=c(3), decay=c(0.1,0.5)))
# hyper parameter optimization
model <- fit(tune, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Regression base class
Description
Ancestor class for regression models. Stores the target attribute and provides common evaluation metrics.
Usage
regression(attribute)
Arguments
attribute |
attribute target to model building |
Value
returns a regression object
Examples
#See ?reg_dtree for a regression example using a decision tree
Random sampling
Description
Train/test split and k‑fold partitioning by simple random sampling.
Usage
sample_random()
Value
returns an object of class sample_random
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Stratified sampling
Description
Train/test split and k‑fold partitioning that preserve the target class proportions (strata).
Usage
sample_stratified(attribute)
Arguments
attribute |
attribute target to model building |
Value
returns an object of class sample_stratified
Examples
#using stratified sampling
sample <- sample_stratified("Species")
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Selection of hyperparameters
Description
Generic to select the best hyperparameters from cross‑validation results; subclasses can override.
Usage
select_hyper(obj, hyperparameters)
Arguments
obj |
the object or model used for hyperparameter selection. |
hyperparameters |
data set with hyperparameters and the quality measure from execution |
Value
returns the index of the selected hyperparameter
selection of hyperparameters
Description
Selects the optimal hyperparameter by maximizing the average classification metric. It wraps dplyr library.
Usage
## S3 method for class 'cla_tune'
select_hyper(obj, hyperparameters)
Arguments
obj |
an object representing the model or tuning process |
hyperparameters |
a data frame with the hyperparameter combinations and their evaluated metric |
Value
returns the key (index) of the optimized hyperparameter setting
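Examples
# Illustrative sketch of the selection criterion (not a call to the method):
# choose the hyperparameter key with the highest average accuracy
results <- data.frame(key = c(1, 2, 3), accuracy = c(0.91, 0.95, 0.93))
results$key[which.max(results$accuracy)]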
Assign parameters
Description
Assign a named list of parameters to matching fields in the object (best‑effort).
Usage
set_params(obj, params)
Arguments
obj |
object of class dal_base |
params |
named list of parameters to set on obj |
Value
returns an object with parameters set
Examples
obj <- set_params(dal_base(), list(x = 0))
Default Assign parameters
Description
Default method for set_params (returns object unchanged).
Usage
## Default S3 method:
set_params(obj, params)
Arguments
obj |
object |
params |
parameters |
Value
returns the object unchanged
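Examples
# The default method returns the object unchanged
obj <- set_params(list(a = 1), list(a = 2))
obj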
Smoothing (binning/quantization)
Description
Family of smoothing methods that reduce noise by replacing values with the mean of a bin/cluster. Supported strategies: equal‑interval bins, equal‑frequency (quantile) bins, and clustering‑based bins (k‑means).
Usage
smoothing(n)
Arguments
n |
number of bins |
Details
The smoothing level is controlled by n (number of bins/levels). The helper tune() can choose
an n by locating the elbow (maximum curvature) of the MSE curve across candidates. After fit(),
values are mapped to bin means via transform().
Value
returns an object of class smoothing
Examples
data(iris)
obj <- smoothing_inter(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by clustering (k-means)
Description
Quantize a numeric vector into n levels using k‑means on the values and
replace each value by its cluster mean (vector quantization).
Usage
smoothing_cluster(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_cluster
References
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations.
Examples
data(iris)
obj <- smoothing_cluster(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by equal frequency
Description
Discretize a numeric vector into n bins with approximately equal frequency (quantile cuts),
and replace each value by the mean of its bin.
Usage
smoothing_freq(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_freq
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)
Examples
data(iris)
obj <- smoothing_freq(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by equal interval
Description
Discretize a numeric vector into n equal‑width intervals (robust bounds via boxplot whiskers)
and replace each value by the bin mean.
Usage
smoothing_inter(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_inter
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)
Examples
data(iris)
obj <- smoothing_inter(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Train-Test Partition
Description
Partition a dataset into training and test sets using a sampling strategy.
Usage
train_test(obj, data, perc = 0.8, ...)
Arguments
obj |
an object of a class that supports the train_test method (e.g., sample_random or sample_stratified) |
data |
dataset to be partitioned |
perc |
a numeric value between 0 and 1 specifying the proportion of data to be used for training |
... |
additional optional arguments passed to specific methods. |
Value
returns a list with two elements:
train: A data frame containing the training set
test: A data frame containing the test set
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
k-fold training and test partition object
Description
Splits a dataset into training and test sets based on k-fold cross-validation. The function takes a list of data partitions (folds) and a specified fold index k. It returns the data corresponding to the k-th fold as the test set, and combines all other folds to form the training set.
Usage
train_test_from_folds(folds, k)
Arguments
folds |
data partitioned into folds |
k |
index of the fold to use as the test set; the remaining folds form the training set |
Value
returns a list with two elements:
train: A data frame containing the combined data from all folds except the k-th fold, used as the training set.
test: A data frame corresponding to the k-th fold, used as the test set.
Examples
# Create k-fold partitions of a dataset (e.g., iris)
folds <- k_fold(sample_random(), iris, k = 5)
# Use the first fold as the test set and combine the remaining folds for the training set
train_test_split <- train_test_from_folds(folds, k = 1)
# Display the training set
head(train_test_split$train)
# Display the test set
head(train_test_split$test)
Transform
Description
Generic to apply a transformation to data.
Usage
transform(obj, ...)
Arguments
obj |
a |
... |
optional arguments. |
Value
returns a transformed data.
Examples
#See ?minmax for an example of transformation
Z-score normalization
Description
Standardize numeric columns to zero mean and unit variance, optionally rescaled to a target mean (nmean) and sd (nsd).
Usage
zscore(nmean = 0, nsd = 1)
Arguments
nmean |
new mean for normalized data |
nsd |
new standard deviation for normalized data |
Details
For each numeric column j, computes ((x - mean_j)/sd_j) * nsd + nmean. Constant columns become nmean.
zscore = (x - mean(x))/sd(x)
Value
returns the z-score transformation object
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Standardization)
Examples
data(iris)
head(iris)
trans <- zscore()
trans <- fit(trans, iris)
tiris <- transform(trans, iris)
head(tiris)
itiris <- inverse_transform(trans, tiris)
head(itiris)