The exponential growth of scientific literature in sport science domains presents both opportunities and challenges for researchers. While vast amounts of knowledge are being generated, systematically synthesizing and identifying research trends has become increasingly difficult. SportMiner addresses this challenge by providing a comprehensive, integrated toolkit for mining, analyzing, and visualizing sport science literature.
Traditional literature review methods are time-consuming and potentially biased. Researchers need automated tools to search the literature systematically, identify research trends, and discover thematic structures.

SportMiner makes the following contributions: an integrated interface to the Scopus database, a text preprocessing pipeline, topic modeling with automated model selection, and publication-ready visualizations.
Install the released version from CRAN:
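For example (assuming the package is published on CRAN under the name `SportMiner`):

```r
install.packages("SportMiner")
```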
Or install the development version from GitHub:
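A typical sketch using the `remotes` package; the GitHub account shown (`owner`) is a placeholder, since the repository location is not given here:

```r
# install.packages("remotes")
remotes::install_github("owner/SportMiner")  # "owner" is a placeholder
```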
SportMiner uses the Scopus API for literature retrieval. Obtain a free API key from the Elsevier Developer Portal.
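Once you have a key, register it for the session, e.g. (assuming the key is passed as a string; the value below is a placeholder):

```r
# Replace with your own key from the Elsevier Developer Portal
sm_set_api_key("YOUR_SCOPUS_API_KEY")
```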
This section demonstrates a complete analysis workflow from literature search through topic modeling and visualization.
Scopus queries follow a structured syntax with field codes and Boolean operators.
```r
# Complex query with multiple conditions
query <- paste0(
  'TITLE-ABS-KEY(',
  '("machine learning" OR "deep learning" OR "artificial intelligence") ',
  'AND ("sports" OR "athlete*" OR "performance") ',
  'AND NOT "e-sports"',
  ') ',
  'AND DOCTYPE(ar) ',        # Articles only
  'AND PUBYEAR > 2018 ',     # Published after 2018
  'AND LANGUAGE(english) ',  # English only
  'AND SUBJAREA(MEDI OR HEAL OR COMP)'  # Relevant subject areas
)
```

**Document Type Filters:**

- `DOCTYPE(ar)`: Journal articles
- `DOCTYPE(re)`: Review articles
- `DOCTYPE(cp)`: Conference papers

**Date Filters:**

- `PUBYEAR = 2024`: Exact year
- `PUBYEAR > 2019`: After 2019
- `PUBYEAR > 2019 AND PUBYEAR < 2025`: Between years

**Subject Area Filters:**

- `SUBJAREA(MEDI)`: Medicine
- `SUBJAREA(HEAL)`: Health Professions
- `SUBJAREA(COMP)`: Computer Science
- `SUBJAREA(PSYC)`: Psychology
```r
papers <- sm_search_scopus(
  query = query,
  max_count = 200,
  batch_size = 100,
  view = "COMPLETE",
  verbose = TRUE
)

# Inspect results
dim(papers)
head(papers[, c("title", "year", "author_keywords")])
```

The function returns a data frame with columns including `title`, `abstract`, `author_keywords`, `year`, `doi`, and `eid`.
Raw abstracts require preprocessing before topic modeling.
```r
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  doc_id_col = "doc_id",
  min_word_length = 3
)

head(processed_data)
```

The preprocessing pipeline tokenizes the abstracts, cleans the text, and removes words shorter than `min_word_length` characters.
Create a sparse matrix representation of term frequencies.
```r
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Matrix dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))

# Sparsity: share of zero entries in the matrix
sparsity <- 100 * (1 - length(dtm$v) / (dtm$nrow * dtm$ncol))
print(paste("Sparsity:", round(sparsity, 2), "%"))
```

The parameters `min_term_freq` and `max_term_freq` control vocabulary size:

- `min_term_freq`: minimum document frequency (removes rare terms)
- `max_term_freq`: maximum document proportion (removes very common terms)
Determine the appropriate number of topics using model evaluation metrics.
```r
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 20, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$metrics)
print(paste("Optimal k:", k_selection$optimal_k))
```

The function compares models across different values of \(k\) using perplexity, a measure of model fit (lower is better).
Fit a Latent Dirichlet Allocation (LDA) model with the optimal number of topics.
```r
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 2000,
  alpha = 50 / k_selection$optimal_k,  # Symmetric Dirichlet prior
  seed = 1729
)

# Examine top terms per topic
terms_matrix <- topicmodels::terms(lda_model, 10)
print(terms_matrix)
```

LDA (Blei, Ng, and Jordan 2003) models each document as a mixture of topics, where each topic is a distribution over words. The Gibbs sampling method (Griffiths and Steyvers 2004) estimates model parameters through Markov Chain Monte Carlo.
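To make the mixture structure concrete, here is the standard LDA formulation (notation is ours, not from the package): each topic \(k\) has a word distribution \(\phi_k\) and each document \(d\) a topic distribution \(\theta_d\), so the probability of observing word \(w\) in document \(d\) is

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kw},
\qquad \theta_d \sim \mathrm{Dirichlet}(\alpha),
\qquad \phi_k \sim \mathrm{Dirichlet}(\eta)
```

The `alpha` argument sets the Dirichlet prior \(\alpha\); larger values spread each document's mass over more topics, smaller values concentrate it on fewer.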
Compare multiple topic modeling approaches.
```r
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)
print(paste("Recommended model:", comparison$recommendation))

# Extract best model
best_model <- comparison$models[[tolower(comparison$recommendation)]]
```

The function is designed to compare three approaches:

- LDA: standard Latent Dirichlet Allocation
- CTM: Correlated Topic Model (Blei and Lafferty 2007), which allows topics to be correlated
- STM: Structural Topic Model (Roberts et al. 2014), not yet implemented
Display the most important terms for each topic.
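A minimal sketch of this step, assuming `sm_plot_topic_terms()` accepts the fitted model and an `n_terms` argument (as used in the case study later in this document):

```r
plot_terms <- sm_plot_topic_terms(lda_model, n_terms = 10)
print(plot_terms)
```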
The visualization shows term importance (beta values) within each topic. Higher beta indicates greater relevance to the topic.
Show how topics are distributed across the document collection.
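A sketch, assuming `sm_plot_topic_frequency()` takes the fitted model (its exact signature is not shown in this document):

```r
plot_freq <- sm_plot_topic_frequency(lda_model)
print(plot_freq)
```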
Examine how topic prevalence changes over publication years.
```r
# Ensure papers have doc_id matching DTM rownames
papers$doc_id <- rownames(dtm)

plot_trends <- sm_plot_topic_trends(
  model = lda_model,
  dtm = dtm,
  metadata = papers,
  year_col = "year",
  doc_id_col = "doc_id"
)
print(plot_trends)
```

This visualization reveals emerging and declining research themes over time.
Analyze relationships between author keywords.
```r
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 3,
  top_n = 30
)
print(network_plot)
```

Network analysis reveals:

- Node size: keyword frequency
- Edge width: co-occurrence strength
- Communities: clusters of related keywords
Override default preprocessing parameters.
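A sketch of overriding defaults, using only the parameters shown earlier (`text_col`, `min_word_length`); other preprocessing options may exist but are not documented here:

```r
processed_strict <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 4  # stricter than the value of 3 used above
)
```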
LDA performance depends on hyperparameters.
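For example, a smaller `alpha` encourages each document to concentrate on fewer topics, and more Gibbs iterations give the sampler longer to converge. A sketch using only arguments shown earlier:

```r
lda_sparse <- sm_train_lda(
  dtm = dtm,
  k = 10,
  method = "gibbs",
  iter = 4000,   # longer chain for better convergence
  alpha = 0.1,   # sparser document-topic mixtures
  seed = 1729
)
```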
Save models and visualizations for publication.
```r
# Save model
saveRDS(lda_model, "lda_model.rds")

# Save plots
ggplot2::ggsave("topic_terms.png", plot_terms,
                width = 12, height = 8, dpi = 300)
ggplot2::ggsave("topic_trends.png", plot_trends,
                width = 12, height = 6, dpi = 300)

# Export document-topic assignments
topics <- topicmodels::topics(lda_model, 1)
papers$dominant_topic <- paste0("Topic_", topics)
write.csv(papers, "papers_with_topics.csv", row.names = FALSE)

# Export topic-term matrix
beta <- topicmodels::posterior(lda_model)$terms
write.csv(beta, "topic_term_matrix.csv")
```

This case study demonstrates SportMiner on a systematic review of sports analytics literature.
What are the main research themes in sports analytics over the past decade, and how have they evolved?
```r
# Comprehensive search query
query_case <- paste0(
  'TITLE-ABS-KEY(',
  '("sports analytics" OR "sports data science" OR "sports informatics" OR ',
  '"performance analysis" OR "match analysis") ',
  'AND ("data" OR "analytics" OR "statistics" OR "modeling")',
  ') ',
  'AND DOCTYPE(ar OR re) ',
  'AND PUBYEAR > 2013 ',
  'AND LANGUAGE(english)'
)

# Retrieve papers
papers_case <- sm_search_scopus(query_case, max_count = 500, verbose = TRUE)

# Full preprocessing pipeline
processed_case <- sm_preprocess_text(papers_case, text_col = "abstract")
dtm_case <- sm_create_dtm(processed_case, min_term_freq = 5, max_term_freq = 0.4)

# Model selection
k_case <- sm_select_optimal_k(dtm_case, k_range = seq(6, 18, by = 2), plot = TRUE)

# Train final model
model_case <- sm_train_lda(dtm_case, k = k_case$optimal_k,
                           iter = 2000, seed = 1729)

# Visualizations
terms_plot <- sm_plot_topic_terms(model_case, n_terms = 12)
trends_plot <- sm_plot_topic_trends(model_case, dtm_case, papers_case)
```

The topic model with \(k = 12\) topics identified distinct research themes.

Temporal trends reveal:

- Increasing focus on deep learning and AI (2018-2024)
- Declining emphasis on traditional statistical methods
- Emerging interest in explainable AI and interpretability
SportMiner is designed for efficiency with large document collections.
The most expensive step is `sm_select_optimal_k()`, since it fits a separate model for each candidate value of \(k\). Always set random seeds for reproducible results:
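For example, using the `seed` argument shown earlier:

```r
# Gibbs sampling is stochastic; a fixed seed makes results repeatable
lda_model <- sm_train_lda(dtm, k = 10, seed = 1729)
```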
SportMiner provides an integrated, efficient workflow for analyzing sport science literature. The package combines database querying, text preprocessing, topic modeling, and visualization in a unified framework. Researchers can rapidly identify research trends, discover thematic structures, and track field evolution over time.
Planned enhancements include:
We thank the reviewers for valuable feedback that improved this package.
- `sm_set_api_key()`: Configure Scopus API credentials
- `sm_search_scopus()`: Search Scopus database
- `sm_get_indexed_keywords()`: Retrieve indexed keywords for papers
- `sm_preprocess_text()`: Tokenize and clean text data
- `sm_create_dtm()`: Create document-term matrix
- `sm_train_lda()`: Fit LDA model
- `sm_select_optimal_k()`: Select optimal number of topics
- `sm_compare_models()`: Compare LDA, CTM, and STM
- `sm_plot_topic_terms()`: Visualize top terms per topic
- `sm_plot_topic_frequency()`: Show topic distribution
- `sm_plot_topic_trends()`: Plot topic trends over time
- `sm_keyword_network()`: Create keyword co-occurrence network
- `theme_sportminer()`: Custom ggplot2 theme