1. Introduction

The PTSDdiag package provides tools for analyzing and optimizing PTSD diagnostic criteria using PCL-5 (PTSD Checklist for DSM-5) data. Post-Traumatic Stress Disorder (PTSD) diagnosis traditionally requires a complex evaluation of multiple symptom criteria across different clusters. This package aims to simplify this process while maintaining diagnostic accuracy.

This vignette demonstrates how to use the package to:

  • Prepare PCL-5 data and compute total scores and DSM-5 diagnoses
  • Calculate descriptive statistics and reliability estimates
  • Identify optimal reduced six-symptom combinations for PTSD diagnosis
  • Validate these combinations with holdout and cross-validation

2. Installation

2.1. Installing the Package

This package is currently hosted only on GitHub, so installation requires the devtools package.

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")

# Install PTSDdiag from GitHub
devtools::install_github("WeidmannL/PTSDdiag")

2.2. Loading Required Packages

Once PTSDdiag is installed, it can be loaded in the usual way.

library("PTSDdiag")

Load the additional package used for the reliability analysis:

library(psych)     # For reliability analysis

3. Basic Usage

3.1. Loading Sample Data

This package includes a simulated dataset that mirrors hypothetical PCL-5 assessments. It contains 5,000 simulated patient responses, each rating 20 PTSD symptoms according to DSM-5 criteria.

Rating Scale

Each symptom is rated on a 5-point scale (a short example follows the list):

  • 0 = Not at all
  • 1 = A little bit
  • 2 = Moderately
  • 3 = Quite a bit
  • 4 = Extremely
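
These labels can be attached in R when preparing tables or plots. A short, purely illustrative example (the label vector is ours, not part of PTSDdiag):

# Illustrative only: map the numeric codes 0-4 to their scale labels
pcl5_labels <- c("Not at all", "A little bit", "Moderately",
                 "Quite a bit", "Extremely")
factor(c(0, 2, 4), levels = 0:4, labels = pcl5_labels)
#> [1] Not at all Moderately Extremely
#> Levels: Not at all A little bit Moderately Quite a bit Extremely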

Symptom Clusters

The PCL-5 organizes symptoms into four distinct clusters according to DSM-5 criteria (see the sketch after this list):

  • Symptoms 1-5: Criterion B (Intrusion)
  • Symptoms 6-7: Criterion C (Avoidance)
  • Symptoms 8-14: Criterion D (Negative alterations in cognitions and mood)
  • Symptoms 15-20: Criterion E (Alterations in arousal and reactivity)
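
For reference, this item-to-cluster mapping can be written down as a plain R list. It is only a convenience for the reader, not part of the PTSDdiag API:

# Reference mapping of PCL-5 item numbers to DSM-5 clusters (our own helper)
cluster_map <- list(
  B_intrusion          = 1:5,
  C_avoidance          = 6:7,
  D_cognitions_mood    = 8:14,
  E_arousal_reactivity = 15:20
)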

Data Format Requirements

Input data must meet the following requirements (a minimal check is sketched after the list):

  • Numeric values 0-4 only
  • No missing values
  • 20 columns (one per symptom)
  • One row per observation
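
A minimal base R check of these requirements might look like this (the checks are our own sketch, not a PTSDdiag function):

# Sketch: verify the input requirements, assuming `df` is your PCL-5 data frame
stopifnot(
  ncol(df) == 20,                            # 20 columns, one per symptom
  all(vapply(df, is.numeric, logical(1))),   # numeric values only
  !anyNA(df),                                # no missing values
  all(as.matrix(df) %in% 0:4)                # ratings restricted to 0-4
)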

Let’s load the included sample data:

# Load the sample data
data("simulated_ptsd")

and take a look at the first few rows of the sample data:

# Display first few rows
head(simulated_ptsd)
#>   S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20
#> 1  2  1  4  1  4  3  3  3  4   4   4   4   4   4   4   4   3   3   4   3
#> 2  3  3  4  3  3  2  3  3  3   3   4   2   4   4   3   4   3   3   3   4
#> 3  0  1  1  2  2  2  3  0  3   2   3   3   2   1   3   1   2   2   1   3
#> 4  3  3  3  3  3  3  4  1  4   4   2   3   4   2   2   2   4   3   3   3
#> 5  2  3  3  2  3  1  2  1  4   3   2   3   3   4   3   3   4   4   3   3
#> 6  1  1  2  1  2  2  1  0  2   1   2   3   3   3   1   2   1   0   3   3

3.2. Data Preparation

The first step is to standardize column names for consistent analysis. Before standardization, columns might have various names:

#  Example of potential input formats
names(simulated_ptsd)
#>  [1] "S1"  "S2"  "S3"  "S4"  "S5"  "S6"  "S7"  "S8"  "S9"  "S10" "S11" "S12"
#> [13] "S13" "S14" "S15" "S16" "S17" "S18" "S19" "S20"

After standardization, columns will be named systematically:

# Rename columns to standard format (symptom_1 through symptom_20)
simulated_ptsd_renamed <- rename_ptsd_columns(simulated_ptsd)

# Show new names
names(simulated_ptsd_renamed)
#>  [1] "symptom_1"  "symptom_2"  "symptom_3"  "symptom_4"  "symptom_5" 
#>  [6] "symptom_6"  "symptom_7"  "symptom_8"  "symptom_9"  "symptom_10"
#> [11] "symptom_11" "symptom_12" "symptom_13" "symptom_14" "symptom_15"
#> [16] "symptom_16" "symptom_17" "symptom_18" "symptom_19" "symptom_20"

3.3. Basic Descriptive Statistics

We’ll now process the data through several steps to calculate scores and determine diagnoses:

# Step 1: Calculate total scores (range 0-80)
simulated_ptsd_total <- calculate_ptsd_total(simulated_ptsd_renamed)

# Step 2: Apply DSM-5 diagnostic criteria and determine PTSD diagnoses
simulated_ptsd_total_diagnosed <- create_ptsd_diagnosis_nonbinarized(simulated_ptsd_total)

# Step 3: Generate summary statistics
summary_stats <- summarize_ptsd(simulated_ptsd_total_diagnosed)
print(summary_stats)
#>   mean_total sd_total n_diagnosed
#> 1     57.772 12.36218        4710

The summary statistics provide the following; a quick hand computation is sketched after the list:

  • Mean total score: Indicates average symptom severity in the sample
  • Standard deviation: Shows the spread of severity scores
  • Number of diagnosed cases: Based on full DSM-5 criteria
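
For intuition, the same quantities can be computed by hand. A sketch, assuming the processed data frame stores the total score in a total column and the diagnosis in a logical diagnosis column (column names assumed here for illustration):

# Hand computation of the summary statistics (sketch)
mean(simulated_ptsd_total_diagnosed$total)      # mean total score
sd(simulated_ptsd_total_diagnosed$total)        # spread of severity scores
sum(simulated_ptsd_total_diagnosed$diagnosis)   # number of diagnosed cases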

3.4. Reliability Analysis

Cronbach’s alpha is calculated to assess the internal consistency of the PCL-5:

cronbach <- psych::alpha(subset(simulated_ptsd_total_diagnosed, select = (-total)))
print(cronbach$total)
#>  raw_alpha std.alpha   G6(smc) average_r      S/N         ase     mean
#>   0.912677 0.9156217 0.9212097  0.340688 10.85138 0.001758793 2.795905
#>         sd  median_r
#>  0.5937591 0.3318068

4. Finding Optimal Symptom Combinations

Now we come to the actual analysis. Current PTSD diagnosis requires evaluating 20 symptoms across four clusters with complex rules. Our goal is to identify simplified diagnostic criteria that use fewer symptoms while preserving diagnostic accuracy.

Specifically, we want to identify the three six-symptom combinations that best reproduce the group of PTSD patients identified by the original DSM-5 criteria. We determine these optimal six-symptom combinations under two structural approaches: a hierarchical approach, requiring at least one symptom from each cluster, and a non-hierarchical approach, ignoring cluster membership.

4.1. Hierarchical Analysis

First, let’s identify the three optimal six-symptom combinations under the constraint that at least four of the six symptoms must be present for a diagnosis, with at least one symptom drawn from each DSM-5 criterion cluster. This approach maintains the hierarchical structure of PTSD diagnosis. As a reminder, these are the symptom clusters in the PCL-5:

  • Items 1-5: Intrusion symptoms (Criterion B)
  • Items 6-7: Avoidance symptoms (Criterion C)
  • Items 8-14: Negative alterations in cognitions and mood (Criterion D)
  • Items 15-20: Alterations in arousal and reactivity (Criterion E)

What counts as the “optimal combination” is controlled by the score_by argument. Optimization can be based on either:

  • Minimizing false cases (both false positives and false negatives)
  • Minimizing only false negatives (newly non-diagnosed cases)

In our example, we want to miss as few diagnoses as possible compared to the original DSM-5 criteria, so we want to minimize the false negative cases (newly_nondiagnosed).

# Find best combinations with hierarchical approach, minimizing false negatives
best_combinations_hierarchical <- analyze_best_six_symptoms_four_required_clusters(
  simulated_ptsd_renamed,
  score_by = "newly_nondiagnosed"
)

Understanding the Output

The function returns three key elements. Let’s take a look at each:

  1. Selected Symptom Combinations:
best_combinations_hierarchical$best_symptoms

  2. Data comparing the original DSM-5 diagnosis with diagnoses based on the three best combinations:
# Shows true/false values for original vs. new criteria
head(best_combinations_hierarchical$diagnosis_comparison, 10)

  3. Summary table showing diagnostic accuracy metrics for each combination:
best_combinations_hierarchical$summary

The summary table includes the following metrics; they are illustrated with made-up counts after the list:

  • Total Diagnosed: Number and percentage of cases diagnosed
  • Total Non-Diagnosed: Number and percentage of non-diagnosed cases
  • True Positive/Negative: Agreement with original criteria
  • False Positive/Negative: Disagreement with original criteria
  • Sensitivity/Specificity: Diagnostic accuracy measures
  • PPV/NPV: Predictive values
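
To make these metrics concrete, here is how they relate to the four agreement counts, using made-up numbers purely for illustration (the package computes all of this for you):

# Illustrative counts: agreement of a reduced criterion with the DSM-5 diagnosis
TP <- 450   # diagnosed by both (true positive)
FN <- 20    # diagnosed by DSM-5 only (false negative, a missed case)
FP <- 30    # diagnosed by the reduced criterion only (false positive)
TN <- 200   # diagnosed by neither (true negative)

sensitivity <- TP / (TP + FN)   # true PTSD cases correctly identified
specificity <- TN / (TN + FP)   # non-PTSD cases correctly identified
ppv <- TP / (TP + FP)           # positive predictive value
npv <- TN / (TN + FN)           # negative predictive value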

4.2. Non-Hierarchical Analysis

Now we do the same for the non-hierarchical approach. We want to find the three optimal six-symptom combinations, at least four of which must be present for a diagnosis, regardless of cluster membership.

Here too, the “optimal combination” is controlled by the score_by argument. Optimization can again be based on either:

  • Minimizing false cases (both false positives and false negatives)
  • Minimizing only false negatives (newly non-diagnosed cases)

In our example, we want to miss as few diagnoses as possible compared to the original DSM-5 criteria, so we want to minimize the false negative cases (newly_nondiagnosed) in the non-hierarchical approach as well.

# Find best combinations with non-hierarchical approach, minimizing false negatives
best_combinations_nonhierarchical <- analyze_best_six_symptoms_four_required(
  simulated_ptsd_renamed,
  score_by = "newly_nondiagnosed"
)

Understanding the Output

Again, let’s take a look at the output:

  1. Selected Symptom Combinations:
best_combinations_nonhierarchical$best_symptoms

  2. Data comparing the original DSM-5 diagnosis with diagnoses based on the three best combinations:
# Shows true/false values for original vs. new criteria
head(best_combinations_nonhierarchical$diagnosis_comparison, 10)

  3. Summary table showing diagnostic accuracy metrics for each combination:
best_combinations_nonhierarchical$summary

5. Model Validation

After identifying optimal symptom combinations, it’s crucial to validate their performance on independent data. The PTSDdiag package provides two validation approaches: Holdout Validation and Cross-Validation.

5.1. Holdout Validation

Holdout Validation splits the data into training and test sets, allowing us to evaluate how well the identified symptom combinations generalize to new data.

# Perform holdout validation with 70/30 split
validation_results <- holdout_validation(
  simulated_ptsd_renamed,
  train_ratio = 0.7,
  score_by = "newly_nondiagnosed",
  seed = 123
)

Results

The Holdout Validation results show how the symptom combinations perform on data they weren’t trained on, providing a more realistic estimate of their diagnostic accuracy.

Examining Results for Non-Hierarchical Model:

# Best combinations identified on training data
validation_results$without_clusters$best_combinations

# Performance summary on test data
validation_results$without_clusters$summary

Examining Results for Hierarchical Model:

# Best combinations identified on training data (with cluster representation)
validation_results$with_clusters$best_combinations

# Performance summary on test data
validation_results$with_clusters$summary

5.2. Cross-Validation

For a more robust assessment, k-fold Cross-Validation tests the stability of symptom combinations across multiple data splits.

# Perform 5-fold cross-validation
cv_results <- cross_validation(
  simulated_ptsd_renamed,
  k = 5,
  score_by = "newly_nondiagnosed",
  seed = 123
)

Results by Fold

The function provides detailed results for each fold.

Examining Results for Non-Hierarchical Model:

# Summary statistics for each fold (non-hierarchical model)
cv_results$without_clusters$summary_by_fold

Examining Results for Hierarchical Model:

# Summary statistics for each fold (hierarchical model)
cv_results$with_clusters$summary_by_fold

Stable Combinations Across Folds

If certain symptom combinations appear in multiple folds, the function calculates their average performance.

Examining Results for Non-Hierarchical Model:

# Check for combinations that appeared in multiple folds (non-hierarchical)
if (!is.null(cv_results$without_clusters$combinations_summary)) {
  print("Stable combinations in non-hierarchical model:")
  cv_results$without_clusters$combinations_summary
} else {
  print("No combinations appeared in multiple folds for the non-hierarchical model")
}

Examining Results for Hierarchical Model:

# Check for combinations that appeared in multiple folds (hierarchical)
if (!is.null(cv_results$with_clusters$combinations_summary)) {
  print("Stable combinations in hierarchical model:")
  cv_results$with_clusters$combinations_summary
} else {
  print("No combinations appeared in multiple folds for the hierarchical model")
}

Interpreting Cross-Validation Results

Cross-Validation helps identify:

  1. Consistency: Whether the same symptom combinations are selected across different data subsets
  2. Stability: How diagnostic accuracy varies across folds
  3. Generalization: Average performance metrics provide robust estimates of real-world performance

Key metrics to examine:

  • Splits_Appeared: How many folds selected this combination (higher = more stable)
  • Sensitivity: Proportion of true PTSD cases correctly identified
  • Specificity: Proportion of non-PTSD cases correctly identified
  • PPV/NPV: Predictive values indicating diagnostic confidence

5.3. Comparing Validation Approaches

Both validation methods serve complementary purposes.

Holdout Validation:

  • Simpler and faster
  • Single train-test split
  • Good for initial assessment
  • May be less stable with small datasets

Cross-Validation:

  • More computationally intensive
  • Multiple train-test splits
  • Better estimate of generalization
  • Identifies stable symptom patterns

# Example: Compare sensitivity from both methods
# Note: results depend on the random split (fixed here via seed = 123)

# Holdout Validation sensitivity (first combination, non-hierarchical)
holdout_summary <- validation_results$without_clusters$summary$x$data
if (nrow(holdout_summary) > 1) {
  holdout_sensitivity <- holdout_summary[2, "Sensitivity"]
  print(paste("Holdout sensitivity for first combination:", 
              round(holdout_sensitivity, 3)))
}

# Cross-Validation average sensitivity (if stable combinations exist)
if (!is.null(cv_results$without_clusters$combinations_summary)) {
  cv_summary <- cv_results$without_clusters$combinations_summary$x$data
  if (nrow(cv_summary) > 0) {
    cv_sensitivity <- cv_summary[1, "Sensitivity"]
    print(paste("Cross-validation average sensitivity:", 
                round(cv_sensitivity, 3)))
  }
}

5.4. Validation Best Practices

When validating PTSD diagnostic models:

  1. Set a seed for reproducibility: seed = 123
  2. Choose appropriate validation:
    • Use holdout for quick assessment
    • Use cross-validation for publication-quality results
  3. Consider your optimization goal (see the sketch after this list):
    • score_by = "newly_nondiagnosed" to minimize missed diagnoses
    • score_by = "false_cases" to minimize total misclassifications
  4. Examine multiple metrics:
    • Sensitivity is crucial for screening
    • Specificity matters for differential diagnosis
    • PPV/NPV inform clinical decision-making
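
For example, switching the optimization goal of the cross-validation from Section 5.2 only requires changing the score_by argument (a sketch reusing the same arguments):

# Sketch: same cross-validation, now minimizing total misclassifications
cv_results_fc <- cross_validation(
  simulated_ptsd_renamed,
  k = 5,
  score_by = "false_cases",
  seed = 123
)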

6. Conclusion

With the PTSDdiag package, PCL-5 data can be processed and analyzed efficiently. It allows users to identify reduced, optimal symptom combinations for PTSD diagnosis and to compare different diagnostic approaches through detailed diagnostic accuracy metrics.