# RealSurvSim

**RealSurvSim** is an R package for simulating survival (time-to-event) datasets, intended for research and simulation studies in survival analysis. It offers three families of simulation approaches for generating realistic time-to-event data: parametric, non-parametric (kernel density estimation), and bootstrap-based.


## Features

- **Parametric Simulation**: Fit a distribution (e.g., exponential, Weibull, log-logistic, mixture distributions) to existing data and generate new samples from the fitted distribution.
- **Kernel Density Simulation**: Non-parametric simulation via kernel density estimation, using an accept-reject approach.
- **Bootstrap Methods**:  
  - **Conditional Bootstrap (`cond`)**: Splits event and censoring times, then resamples to preserve the observed event/censoring ratio.  
  - **Case Resampling (`case`)**: Simple random resampling of entire observations with replacement.
- **Flexible Group/Strata Handling**: Simulate data separately by group while preserving group sizes or allowing user-specified sample sizes.
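
The difference between the two bootstrap variants can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: `case_boot` and `strat_boot` are hypothetical helper names, the toy data are invented, and the stratified version is only one simple reading of "resample events and censored observations separately"; the package's `cond` method may differ in detail.

```r
# Simplified sketch only; not the RealSurvSim implementation.
set.seed(1)
toy <- data.frame(
  V1 = rexp(20, rate = 0.1),             # observed times
  V2 = rep(c(1, 0), times = c(14, 6))    # 1 = event, 0 = censored
)

# Case resampling: draw whole rows with replacement.
case_boot <- function(dat, n = nrow(dat)) {
  dat[sample.int(nrow(dat), n, replace = TRUE), , drop = FALSE]
}

# Stratified resampling: resample event rows and censored rows separately,
# keeping the observed number of each.
strat_boot <- function(dat) {
  ev  <- which(dat$V2 == 1)
  ce  <- which(dat$V2 == 0)
  idx <- c(ev[sample.int(length(ev), replace = TRUE)],
           ce[sample.int(length(ce), replace = TRUE)])
  dat[idx, , drop = FALSE]
}

b_case  <- case_boot(toy)
b_strat <- strat_boot(toy)
table(b_strat$V2)  # same event/censored counts as in `toy`
```

With plain case resampling the event/censoring split varies from draw to draw, whereas the stratified version fixes it at the observed counts.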

---

## Installation

### 1. From Source

If you have downloaded or cloned this repository:

```r
# Install devtools if you don't already have it
install.packages("devtools")

# Then, from the root of the package directory:
devtools::install()
```
## Dependencies

This package uses several R libraries for density estimation, distribution fitting, and survival analysis. They will be automatically installed (if not already present) when installing **RealSurvSim**. Key dependencies include:

- **kdensity** (for kernel density estimation)  
- **fitdistrplus** (for fitting various distributions to data)  
- **flexsurv** (for Gompertz and other survival distributions)  
- **univariateML** (for maximum-likelihood estimation of some distributions, e.g., inverse gamma)  
- **actuar** (for distributions like log-logistic and inverse gamma)  
- **survival** (core survival analysis functionality)
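
To give a feel for the fit-then-sample pattern that these dependencies support, here is a minimal sketch using `fitdistrplus::fitdist()` directly. It is an illustration only: the data are invented, censoring is ignored, and it does not show how **RealSurvSim** calls these libraries internally.

```r
library(fitdistrplus)

set.seed(42)
orig_times <- rexp(200, rate = 0.1)   # invented stand-in for observed times

# Fit an exponential distribution by maximum likelihood
fit <- fitdist(orig_times, "exp")

# Draw new samples from the fitted distribution
sim_times <- rexp(length(orig_times), rate = fit$estimate[["rate"]])

summary(sim_times)
```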

---

## Usage

Below is an overview of the core functions and some example usages. For detailed information on parameters and return values, refer to the function documentation.

### Core Functions

1. **`data_simul_KDE(orig_vals, n = NULL, kernel = "gaussian")`**  
   Simulates data via **kernel density estimation** from a numeric vector of original values.  
   - **Parameters**:  
     - `orig_vals`: Numeric vector of original data values.  
     - `n`: Number of observations to simulate (defaults to the length of `orig_vals`).  
     - `kernel`: The kernel to use for KDE (currently supports `"gaussian"`).  
   - **Returns**: A numeric vector of simulated values.

2. **`data_simul_Estim(orig_vals, n = NULL, distrib = "exp")`**  
   Fits a specified **parametric distribution** to `orig_vals` and draws new samples from the fitted distribution.  
   - Supported distributions include: `"inverse_gamma"`, `"gompertz"`, `"llogis"`, `"gumbel"`, `"myMix"`, `"exp"`.

3. **`data_simul_bootstr(dat, n = NULL, type = "cond")`**  
   Bootstrap-based simulation of event and censoring times.  
   - **Parameters**:  
     - `dat`: Dataframe containing at least `V1` (time) and `V2` (censor indicator, 0/1).  
     - `n`: Number of observations to sample. Defaults to the same size as `dat`.  
     - `type`: `"cond"` for conditional bootstrap or `"case"` for case-resampling.  
   - **Returns**: A resampled or reconstructed dataframe containing simulated times and censor indicators.

4. **`RealSurvSim(dat, col_time, col_status, col_group, reps = 10000, random_seed = 123, n = NULL, simul_type, distribs = c("exp", "exp", "exp", "exp"))`**  
   The main wrapper function for simulating **multiple** survival datasets using one of four approaches:  
   - `"cond"`: Conditional bootstrap  
   - `"case"`: Case resampling  
   - `"distr"`: Parametric distribution-based simulation  
   - `"KDE"`: Kernel density estimation-based simulation  

   - **Parameters**:  
     - `dat`: Original (or reconstructed) dataset with time, status, and group columns.  
     - `col_time`: Column name/index for time.  
     - `col_status`: Column name/index for censoring indicator (1=event, 0=censored).  
     - `col_group`: Column name/index for treatment/group identifier.  
     - `reps`: Number of datasets to simulate (default 10,000).  
     - `random_seed`: Random seed (default 123) for reproducibility.  
     - `n`: Vector specifying sample sizes per group (optional).  
     - `simul_type`: Single string specifying the simulation method (`"cond"`, `"case"`, `"distr"`, `"KDE"`).  
     - `distribs`: Which distributions to use if `simul_type = "distr"`.  

   - **Returns**:  
     A **list** containing multiple simulated datasets (one for each repetition). Each dataset is a **data.frame** with columns `V1` (time), `V2` (status), and `V3` (group).
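
The accept-reject idea behind KDE-based simulation can be sketched as follows. This is a hedged illustration, not the code in `data_simul_KDE`: it uses base R's `density()` as a stand-in for the **kdensity** package, a uniform proposal over the estimated range, and a hypothetical helper name `kde_accept_reject`.

```r
set.seed(7)
orig_vals <- rexp(150, rate = 0.1)    # invented stand-in for observed times

# Kernel density estimate on a grid, turned into a function by interpolation
kde  <- density(orig_vals, kernel = "gaussian", from = 0)
f    <- approxfun(kde$x, kde$y, yleft = 0, yright = 0)
lo   <- min(kde$x)
hi   <- max(kde$x)
fmax <- max(kde$y)

# Accept-reject with a uniform proposal on [lo, hi]
kde_accept_reject <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x   <- runif(n, lo, hi)         # candidate values
    u   <- runif(n, 0, fmax)        # uniform heights under the envelope
    out <- c(out, x[u <= f(x)])     # keep candidates under the density curve
  }
  out[seq_len(n)]
}

sim_vals <- kde_accept_reject(length(orig_vals))
summary(sim_vals)
```

A flat envelope at the density's maximum is the simplest choice; it wastes many proposals for heavily skewed data, so a tighter envelope would be more efficient.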

---

### Examples

Below are brief examples demonstrating how to simulate data. In practice, replace the placeholders (`example_data`, `"time"`, etc.) with your actual dataset and column names.

```r
library(RealSurvSim)

# Example dataset construction (for demonstration):
set.seed(123)
example_data <- data.frame(
  time = rexp(100, rate = 0.1),            # Times
  status = sample(0:1, 100, replace = TRUE), # 0=censored, 1=event
  group = sample(0:1, 100, replace = TRUE)   # Two groups, 0 or 1
)

# 1. Kernel Density Estimation Simulation
sim_kde <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,            # Simulate 5 datasets
  simul_type = "KDE"         # Use KDE-based simulation
)
str(sim_kde$datasets)  # Check the structure of generated datasets

# 2. Parametric Distribution Simulation
sim_distr <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "distr",
  distribs   = c("exp", "exp", "exp", "exp")
)
str(sim_distr$datasets)

# 3. Conditional Bootstrap
sim_cond <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "cond"
)
str(sim_cond$datasets)

# 4. Case Resampling
sim_case <- RealSurvSim(
  dat = example_data,
  col_time   = "time",
  col_status = "status",
  col_group  = "group",
  reps       = 5,
  simul_type = "case"
)
str(sim_case$datasets)

data(liang)
data(wu)
# 5. KDE simulation on the bundled `liang` dataset
liang_kde <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 3, simul_type = "KDE")

# 6. Parametric simulation with arbitrary per-group sample sizes
arbliang_distr <- RealSurvSim(liang, liang$V1, liang$V2, liang$V3, reps = 10, n = c(40, 50),
                              simul_type = "distr", distribs = c("exp", "llogis", "llogis", "exp"))

# 7. Case resampling with arbitrary per-group sample sizes
arbwu_case <- RealSurvSim(wu, wu$V1, wu$V2, wu$V3, reps = 100, n = c(40, 50), simul_type = "case")
```
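
Each simulated dataset is documented as a data frame with columns `V1` (time), `V2` (status), and `V3` (group), so it can be passed to standard **survival** functions. The sketch below shows one possible downstream step; `one_sim` is a hand-built, hypothetical stand-in in that format rather than actual package output.

```r
library(survival)

set.seed(99)
# Hypothetical stand-in for one simulated dataset in the documented format
one_sim <- data.frame(
  V1 = rexp(80, rate = 0.1),     # time
  V2 = rbinom(80, 1, 0.7),       # status: 1 = event, 0 = censored
  V3 = rep(0:1, each = 40)       # group
)

# Kaplan-Meier curves per group and a log-rank test between groups
km <- survfit(Surv(V1, V2) ~ V3, data = one_sim)
lr <- survdiff(Surv(V1, V2) ~ V3, data = one_sim)

print(km)
lr$chisq   # log-rank chi-squared statistic
```

In a simulation study, the same two calls could be applied to every element of the simulated list, for example with `lapply()`, to collect the distribution of a test statistic.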
## References and Further Reading

**Underlying Paper for the Package**  
[*Analysis and Methods for Survival Data (arXiv:2308.07842)*](https://ar5iv.labs.arxiv.org/html/2308.07842)

**Data Reconstruction Algorithm**  
Guyot et al. (2012), describing the algorithm for reconstructing survival data from published Kaplan-Meier curves.

**WebPlotDigitizer**  
[WebPlotDigitizer](https://automeris.io/WebPlotDigitizer/) for extracting data points from Kaplan-Meier curves.




