| Title: | Leveraging Experiment Lines to Data Analytics |
| Version: | 1.2.747 |
| Description: | The natural increase in the complexity of current research experiments and data demands better tools to enhance productivity in Data Analytics. The package is a framework designed to address the modern challenges in data analytics workflows. The package is inspired by Experiment Line concepts. It aims to provide seamless support for users in developing their data mining workflows by offering a uniform data model and method API. It enables the integration of various data mining activities, including data preprocessing, classification, regression, clustering, and time series prediction. It also offers options for hyper-parameter tuning and supports integration with existing libraries and languages. Overall, the package provides researchers with a comprehensive set of functionalities for data science, promoting ease of use, extensibility, and integration with various tools and libraries. Information on Experiment Line is based on Ogasawara et al. (2009) <doi:10.1007/978-3-642-02279-1_20>. |
| License: | MIT + file LICENSE |
| URL: | https://cefet-rj-dal.github.io/daltoolbox/, https://github.com/cefet-rj-dal/daltoolbox |
| BugReports: | https://github.com/cefet-rj-dal/daltoolbox/issues |
| Encoding: | UTF-8 |
| Depends: | R (≥ 4.1.0) |
| RoxygenNote: | 7.3.3 |
| Imports: | FNN, caret, class, cluster, dbscan, dplyr, e1071, ggplot2, nnet, randomForest, reshape, tree |
| NeedsCompilation: | no |
| Packaged: | 2025-10-26 05:19:39 UTC; gpca |
| Author: | Eduardo Ogasawara |
| Maintainer: | Eduardo Ogasawara <eogasawara@ieee.org> |
| Repository: | CRAN |
| Date/Publication: | 2025-10-27 06:10:50 UTC |
Boston Housing Data (Regression)
Description
Housing values in suburbs of Boston.
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitric oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2, where Bk is the proportion of the Black population by town.
lstat: percentage of lower status of the population.
medv: median value of owner-occupied homes in $1000's.
Usage
data(Boston)
Format
Regression Dataset.
Source
This dataset was obtained from the MASS library.
References
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102.
Examples
data(Boston)
head(Boston)
Action
Description
Generic to apply the object to data (e.g., predict, transform).
Usage
action(obj, ...)
Arguments
obj |
object: a dal_base object to apply the transformation on the input dataset. |
... |
optional arguments. |
Value
returns the result of the action applied to the provided data
Examples
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
Action implementation for transform
Description
Default action() implementation that proxies to transform() for transforms.
Usage
## S3 method for class 'dal_transform'
action(obj, ...)
Arguments
obj |
object |
... |
optional arguments |
Value
returns the transformed data
Examples
#See ?minmax for an example of transformation
Adjust categorical mapping
Description
One‑hot encode a factor vector into a matrix of indicator columns.
Usage
adjust_class_label(x, valTrue = 1, valFalse = 0)
Arguments
x |
vector to be categorized |
valTrue |
value to represent true |
valFalse |
value to represent false |
Details
Values are mapped to valTrue/valFalse (default 1/0). The resulting matrix has column names equal to levels(x).
Value
returns a matrix of indicator (one-hot) columns, one per level of x
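Examples
# Illustrative sketch (assuming the default valTrue = 1 / valFalse = 0 mapping)
data(iris)
onehot <- adjust_class_label(iris$Species)
head(onehot)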
Adjust to data frame
Description
Coerce an object to data.frame if needed (useful for S3 methods in this package).
Usage
adjust_data.frame(data)
Arguments
data |
dataset |
Value
returns a data.frame
Examples
data(iris)
df <- adjust_data.frame(iris)
Adjust factors
Description
Convert a vector to a factor with specified internal levels (ilevels) and labels (slevels).
Usage
adjust_factor(value, ilevels, slevels)
Arguments
value |
vector to be converted into factor |
ilevels |
order for categorical values |
slevels |
labels for categorical values |
Details
Numeric vectors are first converted to factors with ilevels as the level order, then relabeled to slevels.
Value
returns an adjusted factor
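Examples
# Illustrative sketch: map a 0/1 numeric vector to labeled factor levels
value <- c(0, 1, 1, 0)
adjusted <- adjust_factor(value, ilevels = c(0, 1), slevels = c("no", "yes"))
table(adjusted)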
Adjust to matrix
Description
Coerce an object to matrix if needed (useful before algorithms that expect matrices).
Usage
adjust_matrix(data)
Arguments
data |
dataset |
Value
returns an adjusted matrix
Examples
data(iris)
mat <- adjust_matrix(iris)
Autoencoder base (encoder)
Description
Base class for encoder‑only autoencoders. Intended to be subclassed by concrete implementations that learn a lower‑dimensional latent representation.
Usage
autoenc_base_e(input_size, encoding_size)
Arguments
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
Details
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to output the encoded representation.
Value
returns an autoenc_base_e object
References
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
Examples
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_e.md
Autoencoder base (encoder + decoder)
Description
Base class for autoencoders that both encode and decode. Intended to be subclassed by concrete implementations that learn to compress and reconstruct inputs.
Usage
autoenc_base_ed(input_size, encoding_size)
Arguments
input_size |
dimensionality of the input vector |
encoding_size |
dimensionality of the latent (encoded) vector |
Details
This base does not train or transform by itself (identity). Implementations should
override fit() to learn parameters and transform() to perform encode+decode.
Value
returns an autoenc_base_ed object
References
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science.
Examples
# See an end‑to‑end example at:
# https://github.com/cefet-rj-dal/daltoolbox/blob/main/autoencoder/autoenc_base_ed.md
Categorical mapping (one‑hot encoding)
Description
Convert a factor column into dummy variables (one‑hot encoding) using model.matrix without intercept.
Each level becomes a separate binary column.
Usage
categ_mapping(attribute)
Arguments
attribute |
attribute to be categorized. |
Details
This is a light wrapper around stats::model.matrix(~ attr - 1, data) that drops the original column
and returns only the dummy variables.
Value
returns a data frame with binary attributes, one for each possible category.
Examples
cm <- categ_mapping("Species")
iris_cm <- transform(cm, iris)
# the mapping can also be applied to a single column
species <- iris[,"Species", drop=FALSE]
iris_cm <- transform(cm, species)
Decision Tree for classification
Description
Univariate decision tree for classification using recursive partitioning.
This wrapper uses the tree package.
Usage
cla_dtree(attribute, slevels)
Arguments
attribute |
attribute target to model building |
slevels |
the possible values for the target classification |
Details
Decision trees split the feature space by maximizing node purity (e.g., Gini/entropy), yielding a human‑readable set of rules. They are fast and interpretable, and often used as base learners in ensembles.
Value
returns a classification object
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
K-Nearest Neighbors (KNN) Classification
Description
Classification by majority vote among the k nearest neighbors. Uses class::knn.
Usage
cla_knn(attribute, slevels, k = 1)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
k |
a vector of integers indicating the number of neighbors to be considered. |
Details
KNN is a simple, non‑parametric method. Choice of k trades bias/variance; distance metric is Euclidean by default.
Value
returns a knn object.
References
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Info. Theory.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_knn("Species", slevels, k=3)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Majority baseline classifier
Description
Trivial classifier that always predicts the most frequent class observed in the training data. Useful as a baseline.
Usage
cla_majority(attribute, slevels)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
Value
returns a classification object.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_majority("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
MLP for classification
Description
Multi-Layer Perceptron classifier using nnet::nnet (single hidden layer).
Usage
cla_mlp(attribute, slevels, size = NULL, decay = 0.1, maxit = 1000)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
size |
number of nodes that will be used in the hidden layer |
decay |
weight decay (L2 regularization) applied during training |
maxit |
maximum iterations |
Details
Uses softmax output with one‑hot targets from adjust_class_label. size controls hidden units and
decay applies L2 regularization. Features should be scaled.
Value
returns a classification object
References
Rumelhart, D., Hinton, G., Williams, R. (1986). Learning representations by back‑propagating errors. Bishop, C. M. (1995). Neural Networks for Pattern Recognition.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_mlp("Species", slevels, size=3, decay=0.03)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Naive Bayes Classifier
Description
Naive Bayes classification using e1071::naiveBayes.
Usage
cla_nb(attribute, slevels)
Arguments
attribute |
attribute target to model building. |
slevels |
possible values for the target classification. |
Details
Assumes conditional independence of features given the class label, enabling fast probabilistic classification.
Value
returns a classification object.
References
Mitchell, T. (1997). Machine Learning. McGraw‑Hill. (Naive Bayes)
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_nb("Species", slevels)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Random Forest for classification
Description
Ensemble classifier of decision trees using randomForest::randomForest.
Usage
cla_rf(attribute, slevels, nodesize = 5, ntree = 10, mtry = NULL)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
nodesize |
minimum size of terminal nodes |
ntree |
number of trees |
mtry |
number of attributes randomly sampled at each split |
Details
Combines many decorrelated trees to reduce variance. Key hyperparameters: ntree, mtry, nodesize.
Value
returns a classification object
References
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32. Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_rf("Species", slevels, ntree=5)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
SVM for classification
Description
Support Vector Machines (SVM) for classification using e1071::svm.
Usage
cla_svm(attribute, slevels, epsilon = 0.1, cost = 10, kernel = "radial")
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
epsilon |
parameter that controls the width of the margin around the separating hyperplane |
cost |
parameter that controls the trade-off between having a wide margin and correctly classifying training data points |
kernel |
the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |
Details
SVMs find a maximum‑margin hyperplane in a transformed feature space defined
by a kernel (linear, radial, polynomial, sigmoid). The cost controls the trade‑off
between margin width and training error; epsilon affects stopping; kernel sets the feature map.
Value
returns a SVM classification object
References
Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3):273–297. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_svm("Species", slevels, epsilon=0.0, cost=20.000)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
model <- fit(model, train)
prediction <- predict(model, test)
predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Classification tuning (k-fold CV)
Description
Tune hyperparameters of a base classifier via k‑fold cross‑validation using a chosen metric.
Usage
cla_tune(base_model, folds = 10, ranges = NULL, metric = "accuracy")
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
metric |
metric used to optimize |
Value
returns a cla_tune object
References
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
Examples
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, iris)
train <- sr$train
test <- sr$test
# hyper parameter setup
tune <- cla_tune(cla_mlp("Species", levels(iris$Species)),
ranges=list(size=c(3:5), decay=c(0.1)))
# hyper parameter optimization
model <- fit(tune, train)
# testing optimization
test_prediction <- predict(model, test)
test_predictand <- adjust_class_label(test[,"Species"])
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Classification base class
Description
Ancestor class for classification models providing common fields (target attribute and levels) and evaluation helpers.
Usage
classification(attribute, slevels)
Arguments
attribute |
attribute target to model building |
slevels |
possible values for the target classification |
Value
returns a classification object
Examples
#See ?cla_dtree for a classification example using a decision tree
Clustering tuning (intrinsic metric)
Description
Tune clustering hyperparameters by evaluating an intrinsic metric over a parameter grid and selecting the elbow (max curvature).
Usage
clu_tune(base_model, folds = 10, ranges = NULL)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Value
returns a clu_tune object.
References
Satopaa, V. et al. (2011). Finding a “Kneedle” in a Haystack.
Examples
data(iris)
# fit model
model <- clu_tune(cluster_kmeans(k = 0), ranges = list(k = 1:10))
model <- fit(model, iris[,1:4])
model$k
Cluster
Description
Generic for clustering methods
Usage
cluster(obj, ...)
Arguments
obj |
a clusterer object |
... |
optional arguments |
Value
clustered data
Examples
#See ?cluster_kmeans for an example of transformation
DBSCAN
Description
Density-Based Spatial Clustering of Applications with Noise using dbscan::dbscan.
Usage
cluster_dbscan(minPts = 3, eps = NULL)
Arguments
minPts |
minimum number of points |
eps |
distance value |
Details
Discovers clusters as dense regions separated by sparse areas. Hyperparameters are eps (neighborhood radius)
and minPts (minimum points). If eps is missing, it is estimated from the kNN distance curve elbow.
Value
returns a dbscan object
References
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
Examples
# setup clustering
model <- cluster_dbscan(minPts = 3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
k-means
Description
k-means clustering using stats::kmeans.
Usage
cluster_kmeans(k = 1)
Arguments
k |
the number of clusters to form. |
Details
Partitions data into k clusters minimizing within‑cluster sum of squares. The intrinsic quality metric returned is the total within‑cluster SSE (lower is better).
Value
returns a k-means object.
References
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations. Lloyd, S. (1982). Least squares quantization in PCM.
Examples
# setup clustering
model <- cluster_kmeans(k=3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
PAM (Partitioning Around Medoids)
Description
Clustering around representative data points (medoids) using cluster::pam.
Usage
cluster_pam(k = 1)
Arguments
k |
the number of clusters to generate. |
Details
More robust to outliers than k‑means. The intrinsic metric reported is the within‑cluster SSE to medoids.
Value
returns PAM object.
References
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis.
Examples
# setup clustering
model <- cluster_pam(k = 3)
#load dataset
data(iris)
# build model
model <- fit(model, iris[,1:4])
clu <- cluster(model, iris[,1:4])
table(clu)
# evaluate model using external metric
eval <- evaluate(model, clu, iris$Species)
eval
Clusterer
Description
Base class for clustering algorithms and related evaluation utilities.
Usage
clusterer()
Value
returns a clusterer object
Examples
#See ?cluster_kmeans for an example of transformation
Class dal_base
Description
Minimal abstract base class for all DAL objects. Defines the common generics fit() and action()
used by transforms and learners.
Usage
dal_base()
Value
returns a dal_base object
Examples
trans <- dal_base()
Graphics utilities
Description
A collection of small plotting helpers built on ggplot2 used across the package
to quickly visualize vectors, grouped summaries and time series. All functions return a
ggplot2::ggplot object so you can further customize the theme, scales, and annotations.
Details
Conventions adopted:
Input data generally follows the pattern: the first column is an index or category (x) and the remaining columns are numeric series; some functions expect a long format with columns named x, value, and variable.
The colors parameter accepts either a single color or a vector mapped to groups/variables.
Transparency is controlled by alpha where provided.
All helpers set a light theme_bw() baseline and place legends at the bottom by default.
See Also
ggplot2
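Examples
# Illustrative sketch: the helpers return ggplot objects that can be further customized
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
grf <- plot_bar(data, colors="blue") + ggplot2::ggtitle("Mean sepal length by species")
plot(grf)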
DAL Learner (base class)
Description
Base ancestor for learning tasks (classification, regression, clustering, time series).
Provides common behavior such as proxying action() to the model‑specific operation
(e.g., predict() for predictors, cluster() for clusterers) and an evaluate() generic.
An example of a learner is a decision tree (see cla_dtree).
Usage
dal_learner()
Value
returns a learner object
Examples
#See ?cla_dtree for a classification example using a decision tree
DAL Transform
Description
Base class for data transformations with optional fit()/inverse_transform() support.
Usage
dal_transform()
Details
The default transform() calls the underlying action.default(); subclasses should implement
transform.className and optionally inverse_transform.className.
Value
returns a dal_transform object
Examples
# See ?minmax or ?zscore for examples
DAL Tune (base for hyperparameter search)
Description
Base class for hyperparameter optimization that stores a base model, a fold count, and a parameter grid. Specializations (classification/regression/clustering) implement the evaluation logic.
Usage
dal_tune(base_model, folds = 10, ranges)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Details
Ranges are expanded via expand.grid, and selection is delegated to select_hyper() which can be
overridden by subclasses to implement custom criteria.
Value
returns a dal_tune object
Examples
#See ?cla_tune for classification tuning
#See ?reg_tune for regression tuning
#See ?ts_tune for time series tuning
Data sampling abstractions
Description
Base class for sampling strategies that provide train/test splitting and k‑fold partitioning.
Two standard implementations are sample_random() and sample_stratified().
Usage
data_sample()
Value
returns an object of class data_sample
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
PCA
Description
Principal Component Analysis (PCA) for unsupervised dimensionality reduction. Transforms correlated variables into orthogonal principal components ordered by explained variance.
Usage
dt_pca(attribute = NULL, components = NULL)
Arguments
attribute |
target attribute to model building |
components |
number of components for PCA |
Details
Fits PCA on (optionally) the numeric predictors only (excluding attribute when provided),
removes constant columns, and selects the number of components by an elbow rule (minimum curvature)
unless components is set explicitly.
Value
returns an object of class dt_pca
References
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components.
Examples
mypca <- dt_pca("Species")
# Automatically fitting number of components
mypca <- fit(mypca, iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
# Manual establishment of number of components
mypca <- dt_pca("Species", 3)
mypca <- fit(mypca, datasets::iris)
iris.pca <- transform(mypca, iris)
head(iris.pca)
head(mypca$pca.transf)
Evaluate
Description
Evaluate learner performance. The actual evaluation varies according to the type of learner (clustering, classification, regression, time series regression).
Usage
evaluate(obj, ...)
Arguments
obj |
object |
... |
optional arguments |
Value
returns the evaluation
Examples
data(iris)
slevels <- levels(iris$Species)
model <- cla_dtree("Species", slevels)
model <- fit(model, iris)
prediction <- predict(model, iris)
predictand <- adjust_class_label(iris[,"Species"])
test_eval <- evaluate(model, predictand, prediction)
test_eval$metrics
Fit
Description
Generic to train/adjust an object using provided data and optional parameters.
Usage
fit(obj, ...)
Arguments
obj |
object |
... |
optional arguments. |
Value
returns the object after fitting
Examples
data(iris)
# an example is minmax normalization
trans <- minmax()
trans <- fit(trans, iris)
tiris <- action(trans, iris)
tune hyperparameters of ml model
Description
Tunes the hyperparameters of a machine learning model for classification
Usage
## S3 method for class 'cla_tune'
fit(obj, data, ...)
Arguments
obj |
an object containing the model and tuning configuration |
data |
the dataset used for training and evaluation |
... |
optional arguments |
Value
returns a fitted object
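Examples
# Illustrative sketch mirroring ?cla_tune: tune an MLP classifier on iris
data(iris)
tune <- cla_tune(cla_mlp("Species", levels(iris$Species)),
ranges=list(size=c(3,5), decay=c(0.1)))
model <- fit(tune, iris)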
fit dbscan model
Description
Fits a DBSCAN clustering model by setting the eps parameter.
If eps is not provided, it is estimated based on the k-nearest neighbor distances.
It wraps dbscan library
Usage
## S3 method for class 'cluster_dbscan'
fit(obj, data, ...)
Arguments
obj |
an object containing the DBSCAN model configuration, including minPts and (optionally) eps |
data |
the dataset to use for fitting the model |
... |
optional arguments |
Value
returns a fitted object with the eps parameter set
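Examples
# Illustrative sketch: when eps is omitted, fit() estimates it from the kNN distance curve
data(iris)
model <- cluster_dbscan(minPts = 3)
model <- fit(model, iris[,1:4])
model$eps # eps stored under the constructor argument name (assumed field name)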
Maximum curvature analysis (elbow detection)
Description
Computes a smoothing spline over a sequence and returns the location/value of maximum curvature, often used as an "elbow" detector.
Usage
fit_curvature_max()
Value
returns an object of class fit_curvature_max, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the maximum curvature is reached.
y: The value where the maximum curvature occurs.
yfit: The value of the maximum curvature.
Examples
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = -log(x), variable = "log")
myfit <- fit_curvature_max()
res <- transform(myfit, dat$value)
head(res)
Minimum curvature analysis (elbow detection)
Description
Computes a smoothing spline over a sequence and returns the location/value of minimum curvature, complementary to maximum curvature and useful in elbow detection.
Usage
fit_curvature_min()
Value
Returns an object of class fit_curvature_min, which inherits from the fit_curvature and dal_transform classes. The object contains a list with the following elements:
x: The position in which the minimum curvature is reached.
y: The value where the minimum curvature occurs.
yfit: The value of the minimum curvature.
Examples
x <- seq(from=1,to=10,by=0.5)
dat <- data.frame(x = x, value = log(x), variable = "log")
myfit <- fit_curvature_min()
res <- transform(myfit, dat$value)
head(res)
Inverse Transform
Description
Optional inverse operation for a transformation; defaults to identity.
Usage
inverse_transform(obj, ...)
Arguments
obj |
a dal_transform object. |
... |
optional arguments. |
Value
returns the inverse-transformed dataset.
Examples
#See ?minmax for an example of transformation
K-fold sampling
Description
Split a dataset into k folds using a sampling strategy.
Usage
k_fold(obj, data, k)
Arguments
obj |
an object representing the sampling method |
data |
dataset to be partitioned |
k |
number of folds |
Value
returns a list of k data frames
Examples
#using random sampling
sample <- sample_random()
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Min-max normalization
Description
Linearly scales numeric columns to the [0,1] range per column.
Usage
minmax()
Details
For each numeric column j, computes (x - min_j) / (max_j - min_j). Constant columns map to 0.
minmax = (x-min(x))/(max(x)-min(x))
Value
returns an object of class minmax
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Normalization section)
Examples
data(iris)
head(iris)
trans <- minmax()
trans <- fit(trans, iris)
tiris <- transform(trans, iris)
head(tiris)
itiris <- inverse_transform(trans, tiris)
head(itiris)
Outlier removal by boxplot (IQR rule)
Description
Removes outliers from numeric columns using Tukey's boxplot rule: values below Q1 - alpha·IQR or above Q3 + alpha·IQR are flagged as outliers.
Usage
outliers_boxplot(alpha = 1.5)
Arguments
alpha |
boxplot outlier threshold (default 1.5, but can be 3.0 to remove extreme values) |
Details
The default alpha=1.5 corresponds to the standard boxplot whiskers; alpha=3 is used for extreme outliers.
Value
returns an outlier object
References
Tukey, J. W. (1977). Exploratory Data Analysis. Addison‑Wesley.
Examples
# code for outlier removal
out_obj <- outliers_boxplot() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
#inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_boxplot <- iris[idx,]
iris.outliers_boxplot
Outlier removal by Gaussian 3-sigma rule
Description
Removes outliers from numeric columns using the 3‑sigma rule under a Gaussian assumption: values outside mean ± alpha·sd are flagged as outliers.
Usage
outliers_gaussian(alpha = 3)
Arguments
alpha |
gaussian threshold (default 3) |
Value
returns an outlier object
References
Pukelsheim, F. (1994). The Three Sigma Rule. The American Statistician 48(2):88–91.
Examples
# code for outlier removal
out_obj <- outliers_gaussian() # class for outlier analysis
out_obj <- fit(out_obj, iris) # computing boundaries
iris.clean <- transform(out_obj, iris) # returning cleaned dataset
#inspection of cleaned dataset
nrow(iris.clean)
idx <- attr(iris.clean, "idx")
table(idx)
iris.outliers_gaussian <- iris[idx,]
iris.outliers_gaussian
Plot bar graph
Description
Draw a simple bar chart from a two‑column data.frame: first column as categories (x), second as values.
Usage
plot_bar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
two‑column data.frame: category in the first column, numeric values in the second |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single value) |
alpha |
bar transparency (0–1) |
Details
If colors is provided, a constant fill is used; otherwise ggplot2's default palette applies.
alpha controls bar transparency. The first column is coerced to factor when needed.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_bar(data, colors="blue")
plot(grf)
Plot boxplot
Description
Boxplots for each numeric column of a data.frame.
Usage
plot_boxplot(data, label_x = "", label_y = "", colors = NULL, barwidth = 0.25)
Arguments
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for boxes |
barwidth |
width of the box (numeric) |
Details
The data is melted to long format and a box is drawn per original column. If colors is provided,
a constant fill is applied to all boxes. Use barwidth to control box width.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_boxplot(iris, colors="white")
plot(grf)
Boxplot per class
Description
Boxplots of a numeric column grouped by a class label.
Usage
plot_boxplot_class(
data,
class_label,
label_x = "",
label_y = "",
colors = NULL
)
Arguments
data |
data.frame with a grouping column and one numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color for the boxes |
Details
Expects a data.frame with the grouping column named in class_label and one numeric column.
The function melts to long format and draws per‑group distributions.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_boxplot_class(iris |> dplyr::select(Sepal.Width, Species),
class_label = "Species", colors=c("red", "green", "blue"))
plot(grf)
Plot density
Description
Kernel density plot for one or multiple numeric columns.
Usage
plot_density(
data,
label_x = "",
label_y = "",
colors = NULL,
bin = NULL,
alpha = 0.25
)
Arguments
data |
data.frame with one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional fill color (single column) or vector for groups |
bin |
optional bin width passed to the density layer |
alpha |
fill transparency (0–1) |
Details
If data has multiple numeric columns, densities are overlaid and filled by column (group).
When a single column is provided, colors (if set) is used as a constant fill.
The bin argument is passed to geom_density(binwidth=...).
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_density(iris |> dplyr::select(Sepal.Width), colors="blue")
plot(grf)
Plot density per class
Description
Kernel density plot grouped by a class label.
Usage
plot_density_class(
data,
class_label,
label_x = "",
label_y = "",
colors = NULL,
bin = NULL,
alpha = 0.5
)
Arguments
data |
data.frame with class label and a numeric column |
class_label |
name of the grouping (class) column |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fills per class |
bin |
optional bin width passed to the density layer |
alpha |
fill transparency (0–1) |
Details
Expects data with a grouping column named in class_label and one numeric column. Each group is
filled with a distinct color (if provided).
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_density_class(iris |> dplyr::select(Sepal.Width, Species),
class_label = "Species", colors=c("red", "green", "blue"))
plot(grf)
Plot grouped bar
Description
Grouped (side‑by‑side) bar chart for multiple series per category.
Usage
plot_groupedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Details
Expects a data.frame where the first column is the category (x) and the remaining columns are
numeric series. Bars are grouped by series. Provide colors with length equal to the number of series to set fills.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length), Sepal.Width=mean(Sepal.Width))
head(data)
# plotting data
grf <- plot_groupedbar(data, colors=c("blue", "red"))
plot(grf)
Plot histogram
Description
Histogram for a numeric column using ggplot2.
Usage
plot_hist(data, label_x = "", label_y = "", color = "white", alpha = 0.25)
Arguments
data |
data.frame with one numeric column (first column is used if multiple) |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
fill color |
alpha |
transparency level (0–1) |
Details
If multiple columns are provided, only the first is used. Breaks are computed via graphics::hist to
mirror base R binning. color controls the fill; alpha the transparency.
Value
returns a ggplot2::ggplot graphic
Examples
grf <- plot_hist(iris |> dplyr::select(Sepal.Width), color=c("blue"))
plot(grf)
Plot lollipop
Description
Lollipop chart (stick + circle + value label) per category.
Usage
plot_lollipop(
data,
label_x = "",
label_y = "",
colors = NULL,
color_text = "black",
size_text = 3,
size_ball = 8,
alpha_ball = 0.2,
min_value = 0,
max_value_gap = 1
)
Arguments
data |
data.frame with category and numeric values |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
stick/circle color |
color_text |
color of the text inside the circle |
size_text |
text size |
size_ball |
circle size |
alpha_ball |
circle transparency (0–1) |
min_value |
minimum baseline for the stick |
max_value_gap |
gap from value to stick end |
Details
Expects a data.frame with category in the first column and numeric values in subsequent columns.
Circles are drawn at values, with vertical segments extending from min_value to value - max_value_gap.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_lollipop(data, colors="blue", max_value_gap=0.2)
plot(grf)
Plot pie
Description
Pie chart from a two‑column data.frame (category, value) using polar coordinates.
Usage
plot_pieplot(
data,
label_x = "",
label_y = "",
colors = NULL,
textcolor = "white",
bordercolor = "black"
)
Arguments
data |
two‑column data.frame with category and value |
label_x |
x‑axis label (unused in pie, kept for symmetry) |
label_y |
y‑axis label (unused in pie) |
colors |
vector of slice fills |
textcolor |
label text color |
bordercolor |
slice border color |
Details
Slices are sized by the second (numeric) column. Text and border colors can be customized.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length))
head(data)
# plotting data
grf <- plot_pieplot(data, colors=c("red", "green", "blue"))
plot(grf)
Plot points
Description
Dot chart for multiple series across categories (points only).
Usage
plot_points(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
data.frame with category + one or more numeric columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color vector for series |
Details
Expects a data.frame with category in the first column and one or more numeric series.
Points are colored by series (legend shows original column names). Supply colors to override the palette.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin=sin(x), cosine=cos(x)+5)
head(data)
grf <- plot_points(data, colors=c("red", "green"))
plot(grf)
Plot radar
Description
Radar (spider) chart for a single profile of variables using polar coordinates.
Usage
plot_radar(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
two‑column data.frame: variable name and value |
label_x |
x‑axis label (unused; variable names are shown around the circle) |
label_y |
y‑axis label |
colors |
line/fill color for the polygon |
Details
Expects a two‑column data.frame with variable names in the first column and numeric values in the second.
Value
returns a ggplot2::ggplot graphic
Examples
data <- data.frame(name = "Petal.Length", value = mean(iris$Petal.Length))
data <- rbind(data, data.frame(name = "Petal.Width", value = mean(iris$Petal.Width)))
data <- rbind(data, data.frame(name = "Sepal.Length", value = mean(iris$Sepal.Length)))
data <- rbind(data, data.frame(name = "Sepal.Width", value = mean(iris$Sepal.Width)))
grf <- plot_radar(data, colors="red") + ggplot2::ylim(0, NA)
plot(grf)
Scatter graph
Description
Scatter plot from a long data.frame with columns named x, value, and variable.
Usage
plot_scatter(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
long data.frame with columns x, value, and variable |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional color(s); when variable is numeric, a gradient color scale is used |
Details
Colors are mapped to variable. If variable is numeric, a gradient color scale is used when colors is provided.
Value
return a ggplot2::ggplot graphic
Examples
grf <- plot_scatter(iris |> dplyr::select(x = Sepal.Length,
value = Sepal.Width, variable = Species),
label_x = "Sepal.Length", label_y = "Sepal.Width",
colors=c("red", "green", "blue"))
plot(grf)
Plot series
Description
Line plot for one or more series over a common x index.
Usage
plot_series(data, label_x = "", label_y = "", colors = NULL)
Arguments
data |
data.frame with x in the first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of colors for series |
Details
Expects a data.frame where the first column is the x index and remaining columns are numeric series.
Points and lines are drawn per series; supply colors to override the palette.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
data <- data.frame(x, sin=sin(x))
head(data)
grf <- plot_series(data, colors=c("red"))
plot(grf)
Plot stacked bar
Description
Stacked bar chart for multiple series per category.
Usage
plot_stackedbar(data, label_x = "", label_y = "", colors = NULL, alpha = 1)
Arguments
data |
data.frame with category in first column and series in remaining columns |
label_x |
x‑axis label |
label_y |
y‑axis label |
colors |
optional vector of fill colors, one per series |
alpha |
bar transparency (0–1) |
Details
Expects a data.frame with category in the first column and series in remaining columns.
Bars are stacked within each category. Provide colors (one per series) to control fills.
Value
returns a ggplot2::ggplot graphic
Examples
#summarizing iris dataset
data <- iris |> dplyr::group_by(Species) |>
dplyr::summarize(Sepal.Length=mean(Sepal.Length), Sepal.Width=mean(Sepal.Width))
#plotting data
grf <- plot_stackedbar(data, colors=c("blue", "red"))
plot(grf)
Plot time series chart
Description
Simple time series plot with points and a line.
Usage
plot_ts(x = NULL, y, label_x = "", label_y = "", color = "black")
Arguments
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric series |
label_x |
x‑axis label |
label_y |
y‑axis label |
color |
color for the series |
Details
If x is NULL, an integer index 1:n is used. The color applies to both points and line.
Value
returns a ggplot2::ggplot graphic
Examples
x <- seq(0, 10, 0.25)
y <- sin(x)
grf <- plot_ts(x = x, y = y, color=c("red"))
plot(grf)
Plot time series with predictions
Description
Plot original series plus dashed lines for in‑sample adjustment and optional out‑of‑sample predictions.
Usage
plot_ts_pred(
x = NULL,
y,
yadj,
ypred = NULL,
label_x = "",
label_y = "",
color = "black",
color_adjust = "blue",
color_prediction = "green"
)
Arguments
x |
time index (numeric vector) or NULL to use 1:length(y) |
y |
numeric time series |
yadj |
fitted/adjusted values for the training window |
ypred |
optional predicted values after the training window |
label_x |
x‑axis title |
label_y |
y‑axis title |
color |
color for the original series |
color_adjust |
color for the adjusted values (dashed) |
color_prediction |
color for the predictions (dashed) |
Details
yadj length defines the training segment; ypred (if provided) is appended after yadj.
Value
returns a ggplot2::ggplot graphic
Examples
x <- base::seq(0, 10, 0.25)
yvalues <- sin(x) + rnorm(41,0,0.1)
adjust <- sin(x[1:35])
prediction <- sin(x[36:41])
grf <- plot_ts_pred(y=yvalues, yadj=adjust, ypred=prediction)
plot(grf)
Predictor (base for classification/regression)
Description
Ancestor class for supervised predictors (classification and regression).
Provides a default fit() to record feature names and proxies action() to predict().
An example predictor is a decision tree classifier (cla_dtree).
Usage
predictor()
Value
returns a predictor object
Examples
#See ?cla_dtree for a classification example using a decision tree
Decision Tree for regression
Description
Regression tree using recursive partitioning via the tree package.
Usage
reg_dtree(attribute)
Arguments
attribute |
attribute target to model building. |
Details
Splits are chosen to reduce squared error within nodes; result is an interpretable set of piecewise constants.
Value
returns a decision tree regression object
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Examples
data(Boston)
model <- reg_dtree("medv")
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
K-Nearest Neighbors (KNN) Regression
Description
KNN regression using FNN::knn.reg, predicting by averaging the targets of the k nearest neighbors.
Usage
reg_knn(attribute, k)
Arguments
attribute |
attribute target to model building |
k |
number of k neighbors |
Details
Non‑parametric approach suitable for local smoothing. Sensitive to feature scaling; consider normalization beforehand.
Value
returns a knn regression object
References
Altman, N. (1992). An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression.
Examples
data(Boston)
model <- reg_knn("medv", k=3)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
MLP for regression
Description
Multi-Layer Perceptron regression using nnet::nnet (single hidden layer).
Usage
reg_mlp(attribute, size = NULL, decay = 0.05, maxit = 1000)
Arguments
attribute |
attribute target to model building |
size |
number of neurons in the hidden layer |
decay |
weight decay (L2 regularization) parameter |
maxit |
number of maximum iterations for training |
Details
Feedforward neural network with size hidden units and L2 regularization controlled by decay.
Data should be scaled for stable training.
Value
returns an object of class reg_mlp
References
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Examples
data(Boston)
model <- reg_mlp("medv", size=5, decay=0.54)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Random Forest for regression
Description
Regression via Random Forests, an ensemble of decision trees trained
on bootstrap samples with random feature subsetting at each split. This wrapper
uses the randomForest package API.
Usage
reg_rf(attribute, nodesize = 1, ntree = 10, mtry = NULL)
Arguments
attribute |
attribute target to model building |
nodesize |
minimum size of terminal nodes |
ntree |
number of trees |
mtry |
number of attributes randomly sampled at each split |
Details
Random Forests reduce variance and are robust to overfitting on tabular data.
Key hyperparameters are the number of trees (ntree), the number of variables tried at
each split (mtry), and the minimum node size (nodesize).
Value
returns an object of class reg_rf
References
Breiman, L. (2001). Random Forests. Machine Learning 45(1):5–32. Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News.
Examples
data(Boston)
model <- reg_rf("medv", ntree=10)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
SVM for regression
Description
Support Vector Regression (SVR) using e1071::svm.
Usage
reg_svm(attribute, epsilon = 0.1, cost = 10, kernel = "radial")
Arguments
attribute |
attribute target to model building |
epsilon |
width of the epsilon-insensitive tube around the regression function |
cost |
parameter that controls the trade-off between having a wide margin and correctly classifying training data points |
kernel |
the type of kernel function to be used in the SVM algorithm (linear, radial, polynomial, sigmoid) |
Details
SVR optimizes a margin with an epsilon‑insensitive loss around the regression function.
The cost controls regularization strength; epsilon sets the width of the insensitive tube; and
kernel defines the feature map (linear, radial, polynomial, sigmoid).
Value
returns a SVM regression object
References
Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V. (1997). Support Vector Regression Machines. Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines.
Examples
data(Boston)
model <- reg_svm("medv", epsilon=0.2, cost=40.000)
# preparing dataset for random sampling
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
model <- fit(model, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Regression tuning (k-fold CV)
Description
Tune hyperparameters of a base regressor via k‑fold cross‑validation minimizing an error metric (MSE).
Usage
reg_tune(base_model, folds = 10, ranges = NULL)
Arguments
base_model |
base model for tuning |
folds |
number of folds for cross-validation |
ranges |
a list of hyperparameter ranges to explore |
Value
returns a reg_tune object.
References
Kohavi, R. (1995). A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
Examples
# preparing dataset for random sampling
data(Boston)
sr <- sample_random()
sr <- train_test(sr, Boston)
train <- sr$train
test <- sr$test
# hyper parameter setup
tune <- reg_tune(reg_mlp("medv"), ranges = list(size=c(3), decay=c(0.1,0.5)))
# hyper parameter optimization
model <- fit(tune, train)
test_prediction <- predict(model, test)
test_predictand <- test[,"medv"]
test_eval <- evaluate(model, test_predictand, test_prediction)
test_eval$metrics
Regression base class
Description
Ancestor class for regression models. Stores the target attribute and provides common evaluation metrics.
Usage
regression(attribute)
Arguments
attribute |
attribute target to model building |
Value
returns a regression object
Examples
#See ?reg_dtree for a regression example using a decision tree
Random sampling
Description
Train/test split and k‑fold partitioning by simple random sampling.
Usage
sample_random()
Value
returns an object of class sample_random
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Stratified sampling
Description
Train/test split and k‑fold partitioning that preserve the target class proportions (strata).
Usage
sample_stratified(attribute)
Arguments
attribute |
attribute target to model building |
Value
returns an object of class sample_stratified
Examples
#using stratified sampling
sample <- sample_stratified("Species")
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
# preparing dataset into four folds
folds <- k_fold(sample, iris, 4)
# distribution of folds
tbl <- NULL
for (f in folds) {
tbl <- rbind(tbl, table(f$Species))
}
head(tbl)
Selection of hyperparameters
Description
Generic to select the best hyperparameters from cross‑validation results; subclasses can override.
Usage
select_hyper(obj, hyperparameters)
Arguments
obj |
the object or model used for hyperparameter selection. |
hyperparameters |
data set with hyperparameters and the quality measure from execution |
Value
returns the index of the selected hyperparameter
selection of hyperparameters
Description
Selects the optimal hyperparameter by maximizing the average classification metric. It wraps dplyr library.
Usage
## S3 method for class 'cla_tune'
select_hyper(obj, hyperparameters)
Arguments
obj |
an object representing the model or tuning process |
hyperparameters |
a data frame with the hyperparameter combinations and their evaluated metric |
Value
returns the key (index) of the optimized hyperparameter setting
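Examples
# Illustrative sketch of the selection criterion (not a call to the method):
# choose the hyperparameter key with the highest average accuracy
results <- data.frame(key = c(1, 2, 3), accuracy = c(0.91, 0.95, 0.93))
results$key[which.max(results$accuracy)]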
Assign parameters
Description
Assign a named list of parameters to matching fields in the object (best‑effort).
Usage
set_params(obj, params)
Arguments
obj |
object of class dal_base |
params |
named list of parameters to set on obj |
Value
returns an object with parameters set
Examples
obj <- set_params(dal_base(), list(x = 0))
Default Assign parameters
Description
Default method for set_params (returns object unchanged).
Usage
## Default S3 method:
set_params(obj, params)
Arguments
obj |
object |
params |
parameters |
Value
returns the object unchanged
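Examples
# The default method returns the object unchanged
obj <- set_params(list(a = 1), list(a = 2))
obj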
Smoothing (binning/quantization)
Description
Family of smoothing methods that reduce noise by replacing values with the mean of a bin/cluster. Supported strategies: equal‑interval bins, equal‑frequency (quantile) bins, and clustering‑based bins (k‑means).
Usage
smoothing(n)
Arguments
n |
number of bins |
Details
The smoothing level is controlled by n (number of bins/levels). The helper tune() can choose
an n by locating the elbow (maximum curvature) of the MSE curve across candidates. After fit(),
values are mapped to bin means via transform().
Value
returns an object of class smoothing
Examples
data(iris)
obj <- smoothing_inter(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by clustering (k-means)
Description
Quantize a numeric vector into n levels using k‑means on the values and
replace each value by its cluster mean (vector quantization).
Usage
smoothing_cluster(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_cluster
References
MacQueen, J. (1967). Some Methods for classification and Analysis of Multivariate Observations.
Examples
data(iris)
obj <- smoothing_cluster(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by equal frequency
Description
Discretize a numeric vector into n bins with approximately equal frequency (quantile cuts),
and replace each value by the mean of its bin.
Usage
smoothing_freq(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_freq
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)
Examples
data(iris)
obj <- smoothing_freq(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Smoothing by equal interval
Description
Discretize a numeric vector into n equal‑width intervals (robust bounds via boxplot whiskers)
and replace each value by the bin mean.
Usage
smoothing_inter(n)
Arguments
n |
number of bins |
Value
returns an object of class smoothing_inter
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Discretization)
Examples
data(iris)
obj <- smoothing_inter(n = 2)
obj <- fit(obj, iris$Sepal.Length)
sl.bi <- transform(obj, iris$Sepal.Length)
table(sl.bi)
obj$interval
entro <- evaluate(obj, as.factor(names(sl.bi)), iris$Species)
entro$entropy
Train-Test Partition
Description
Partition a dataset into training and test sets using a sampling strategy.
Usage
train_test(obj, data, perc = 0.8, ...)
Arguments
obj |
an object of a class that supports the train_test method (e.g., sample_random or sample_stratified) |
data |
dataset to be partitioned |
perc |
a numeric value between 0 and 1 specifying the proportion of data to be used for training |
... |
additional optional arguments passed to specific methods. |
Value
returns a list with two elements:
train: A data frame containing the training set
test: A data frame containing the test set
Examples
#using random sampling
sample <- sample_random()
tt <- train_test(sample, iris)
# distribution of train
table(tt$train$Species)
k-fold training and test partition object
Description
Splits a dataset into training and test sets based on k-fold cross-validation. The function takes a list of data partitions (folds) and a specified fold index k. It returns the data corresponding to the k-th fold as the test set, and combines all other folds to form the training set.
Usage
train_test_from_folds(folds, k)
Arguments
folds |
data partitioned into folds |
k |
index of the fold to use as the test set; the remaining folds form the training set |
Value
returns a list with two elements:
train: A data frame containing the combined data from all folds except the k-th fold, used as the training set.
test: A data frame corresponding to the k-th fold, used as the test set.
Examples
# Create k-fold partitions of a dataset (e.g., iris)
folds <- k_fold(sample_random(), iris, k = 5)
# Use the first fold as the test set and combine the remaining folds for the training set
train_test_split <- train_test_from_folds(folds, k = 1)
# Display the training set
head(train_test_split$train)
# Display the test set
head(train_test_split$test)
Transform
Description
Generic to apply a transformation to data.
Usage
transform(obj, ...)
Arguments
obj |
a |
... |
optional arguments. |
Value
returns a transformed data.
Examples
#See ?minmax for an example of transformation
Z-score normalization
Description
Standardize numeric columns to zero mean and unit variance, optionally rescaled to a target mean (nmean) and sd (nsd).
Usage
zscore(nmean = 0, nsd = 1)
Arguments
nmean |
new mean for normalized data |
nsd |
new standard deviation for normalized data |
Details
For each numeric column j, computes ((x - mean_j)/sd_j) * nsd + nmean. Constant columns become nmean.
zscore = (x - mean(x))/sd(x)
Value
returns the z-score transformation object
References
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. (Standardization)
Examples
data(iris)
head(iris)
trans <- zscore()
trans <- fit(trans, iris)
tiris <- transform(trans, iris)
head(tiris)
itiris <- inverse_transform(trans, tiris)
head(itiris)