% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/Simulations.R
\name{SimulateMultiCondition}
\alias{SimulateMultiCondition}
\title{Simulate NR-seq data for multiple replicates of multiple biological conditions}
\usage{
SimulateMultiCondition(
  nfeatures,
  metadf,
  mean_formula,
  param_details = NULL,
  seqdepth = nfeatures * 2500,
  label_time = 2,
  pnew = 0.05,
  pold = 0.001,
  readlength = 200,
  Ucont_alpha = 25,
  Ucont_beta = 75,
  feature_prefix = "Gene",
  dispslope = 5,
  dispint = 0.01,
  logkdegsdtrend_slope = -0.3,
  logkdegsdtrend_intercept = -2.25,
  logksynsdtrend_slope = -0.3,
  logksynsdtrend_intercept = -2.25,
  logkdeg_mean = -1.9,
  logkdeg_sd = 0.7,
  logksyn_mean = 2.3,
  logksyn_sd = 0.7,
  logkdeg_diff_avg = 0,
  logksyn_diff_avg = 0,
  logkdeg_diff_sd = 0.5,
  logksyn_diff_sd = 0.5,
  pdiff_kd = 0.1,
  pdiff_ks = 0,
  pdiff_both = 0,
  pdo = 0
)
}
\arguments{
\item{nfeatures}{Number of "features" (e.g., genes) to simulate data for}

\item{metadf}{A data frame with the following columns:
\itemize{
\item sample: Names given to samples to simulate.
\item \code{<details>}: Any number of columns with any names (not taken by other metadf columns)
storing factors by which the samples can be stratified. These can be referenced
in \code{mean_formula}, described below.
}
The following parameters (described in more detail below) can also be included as columns of \code{metadf}
to specify sample-specific simulation parameters:
\itemize{
\item seqdepth
\item label_time
\item pnew
\item pold
\item readlength
\item Ucont
}}
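
As a minimal sketch of the \code{metadf} layout described above (the \code{treatment} column name and all
values are illustrative assumptions, not defaults), a data frame with a sample-specific \code{label_time}
column might look like:
\preformatted{
# Hypothetical metadf: two replicates of each of two treatments
metadf <- data.frame(
  sample = c("sampleA", "sampleB", "sampleC", "sampleD"),
  treatment = c("treatment1", "treatment1", "treatment2", "treatment2"),
  label_time = c(2, 2, 4, 4)  # optional per-sample override of label_time
)
}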

\item{mean_formula}{A formula object that specifies the linear model used to
relate the factors in the \code{<details>} columns of \code{metadf} to average log(kdegs) and
log(ksyns) in each sample.}
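
The linear model parameters implied by \code{mean_formula} can be inspected with base R before simulating.
A small sketch, assuming a hypothetical two-sample \code{metadf} with a single \code{treatment} factor:
\preformatted{
# Hypothetical metadf with one stratifying factor
metadf <- data.frame(
  sample = c("sampleA", "sampleB"),
  treatment = c("treatment1", "treatment2")
)
# Each column of the design matrix is one linear model parameter
design <- model.matrix(~ treatment - 1, metadf)
colnames(design)
# "treatmenttreatment1" "treatmenttreatment2"
}
Because the formula contains \code{- 1}, there is no intercept and each column corresponds to one treatment level.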

\item{param_details}{A data frame with one row for each column of the design matrix
obtained from \code{model.matrix(mean_formula, metadf)} that describes how to simulate
the linear model parameters. The columns of this data frame are:
\itemize{
\item param: Name of linear model parameter as it appears in the column names of the
design matrix from \code{model.matrix(mean_formula, metadf)}.
\item reference: Boolean; TRUE if you want to treat that parameter as a "reference". This
means that all other parameter values that aren't global parameters are set equal to this
unless otherwise determined (see \verb{pdiff_*} parameters for how it is determined if a parameter
will differ from the reference).
\item global: Boolean; TRUE if you want to treat that parameter as a global parameter. This means
that a single value is used for all features.
\item logkdeg_mean: If parameter is the reference, then its value for the log(kdeg) linear model
will be drawn from a normal distribution with this mean. If it is a global parameter, then this
value will be used. If it is neither of these, then its value in the log(kdeg) linear model will
either be the reference (if there is no difference between this condition's value and the reference)
or the reference's value + a normally distributed random variable centered on this value.
\item logkdeg_sd: sd used for draws from normal distribution as described for \code{logkdeg_mean}.
\item logksyn_mean: Same as \code{logkdeg_mean} but for log(ksyn) linear model.
\item logksyn_sd: Same as \code{logkdeg_sd} but for log(ksyn) linear model.
\item pdiff_ks: Proportion of features whose value of this parameter in the log(ksyn) linear model
will differ from the reference's. Should be a number between 0 and 1, inclusive. For example, if
\code{pdiff_ks} is 0.1, then for 10\% of features, this parameter will equal the reference parameter +
a normally distributed random variable with mean \code{logksyn_mean} and sd \code{logksyn_sd}. For the other
90\% of features, this parameter will equal the reference.
\item pdiff_kd: Same as \code{pdiff_ks} but for log(kdeg) linear model.
\item pdiff_both: Proportion of features whose value for this parameter in BOTH the
log(kdeg) and log(ksyn) linear models will differ from the reference. Value must be
between 0 and min(c(pdiff_kd, pdiff_ks)) in that row.
}
If \code{param_details} is not specified by the user, the following defaults are used:
\itemize{
\item The first column of the design matrix is assumed to represent the reference parameter.
\item All parameters are assumed to be non-global.
\item \code{logkdeg_mean} and \code{logksyn_mean} are set to the equivalently named parameter values
described below for the reference, and to \code{logkdeg_diff_avg} and \code{logksyn_diff_avg} for all other parameters.
\item \code{logkdeg_sd} and \code{logksyn_sd} are set to the equivalently named parameter values
described below for the reference, and to \code{logkdeg_diff_sd} and \code{logksyn_diff_sd} for all other parameters.
\item \code{pdiff_kd}, \code{pdiff_ks}, and \code{pdiff_both} are all set to the equivalently named parameter values.
}}
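
As a hedged sketch of a \code{param_details} table with the columns listed above, the following builds one
row per design matrix column for a hypothetical two-treatment \code{metadf}. The numeric values are taken from
this function's default arguments; the \verb{pdiff_*} entries in the reference row are placeholders, and
whether they have any effect for a reference parameter is an assumption.
\preformatted{
# Hypothetical metadf; parameter names come from the design matrix
metadf <- data.frame(
  sample = c("sampleA", "sampleB"),
  treatment = c("treatment1", "treatment2")
)
design_cols <- colnames(model.matrix(~ treatment - 1, metadf))

param_details <- data.frame(
  param = design_cols,
  reference = c(TRUE, FALSE),    # first column acts as the reference
  global = c(FALSE, FALSE),      # no parameter shared across all features
  logkdeg_mean = c(-1.9, 0),     # reference mean, then mean difference
  logkdeg_sd = c(0.7, 0.5),
  logksyn_mean = c(2.3, 0),
  logksyn_sd = c(0.7, 0.5),
  pdiff_ks = c(0, 0),
  pdiff_kd = c(0.1, 0.1),
  pdiff_both = c(0, 0)           # must not exceed min(pdiff_kd, pdiff_ks)
)
}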

\item{seqdepth}{Only relevant if \code{read_vect} is not provided; in that case, this is
the total number of reads to simulate.}

\item{label_time}{Length of s4U feed to simulate.}

\item{pnew}{Probability that a T is mutated to a C if a read is new.}

\item{pold}{Probability that a T is mutated to a C if a read is old.}

\item{readlength}{Length of simulated reads. In this simple simulation, all reads
are simulated as being exactly this length.}

\item{Ucont_alpha}{shape1 parameter of the beta distribution from which each feature's U content is drawn,
where U content is the probability that a nucleotide in a simulated read from that feature is a U.}

\item{Ucont_beta}{shape2 parameter of the beta distribution from which each feature's U content is drawn,
where U content is the probability that a nucleotide in a simulated read from that feature is a U.}
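
To see what the default \code{Ucont_alpha} and \code{Ucont_beta} imply, the sketch below computes the mean of
the corresponding beta distribution and draws a few example U contents with base R. The variable names
\code{expected_ucont} and \code{ucont} are illustrative only and are not part of the simulator.
\preformatted{
Ucont_alpha <- 25
Ucont_beta <- 75

# Mean of a beta distribution is shape1 / (shape1 + shape2)
expected_ucont <- Ucont_alpha / (Ucont_alpha + Ucont_beta)  # 0.25

# Example draws of per-feature U content
set.seed(1)
ucont <- rbeta(5, shape1 = Ucont_alpha, shape2 = Ucont_beta)
}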

\item{feature_prefix}{Name given to the i-th feature is \code{paste0(feature_prefix, i)}. Shows up in the
\code{feature} column of the output simulated data table.}

\item{dispslope}{Negative binomial dispersion parameter "slope" with respect to read counts. See
DESeq2 paper for dispersion model used.}

\item{dispint}{Negative binomial dispersion parameter "intercept" with respect to read counts. See
DESeq2 paper for dispersion model used.}
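
For intuition about \code{dispslope} and \code{dispint}, the sketch below evaluates the DESeq2-style
dispersion trend, dispersion = slope / mean + intercept, at a few mean read counts. That the simulator
applies exactly this form internally is an assumption here; \code{mean_counts} and the other variable
names are illustrative.
\preformatted{
dispslope <- 5
dispint <- 0.01
mean_counts <- c(10, 100, 1000)

# Dispersion shrinks toward dispint as the mean read count grows
dispersion <- dispslope / mean_counts + dispint

# rnbinom() takes size = 1 / dispersion in its mean parameterization
set.seed(1)
counts <- rnbinom(3, mu = mean_counts, size = 1 / dispersion)
}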

\item{logkdegsdtrend_slope}{Slope for log10(read count) vs. log(kdeg) replicate variability trend}

\item{logkdegsdtrend_intercept}{Intercept for log10(read count) vs. log(kdeg) replicate variability trend}

\item{logksynsdtrend_slope}{Slope for log10(read count) vs. log(ksyn) replicate variability trend}

\item{logksynsdtrend_intercept}{Intercept for log10(read count) vs. log(ksyn) replicate variability trend}

\item{logkdeg_mean}{Mean of the normal distribution from which the reference log(kdeg)
linear model parameter is drawn for each feature if \code{param_details} is not provided.}

\item{logkdeg_sd}{Standard deviation of the normal distribution from which the reference log(kdeg)
linear model parameter is drawn for each feature if \code{param_details} is not provided.}

\item{logksyn_mean}{Mean of the normal distribution from which the reference log(ksyn)
linear model parameter is drawn for each feature if \code{param_details} is not provided.}

\item{logksyn_sd}{Standard deviation of the normal distribution from which the reference log(ksyn)
linear model parameter is drawn for each feature if \code{param_details} is not provided.}

\item{logkdeg_diff_avg}{Mean of the normal distribution from which non-reference log(kdeg)
linear model parameters are drawn for each feature if \code{param_details} is not provided.}

\item{logksyn_diff_avg}{Mean of the normal distribution from which non-reference log(ksyn)
linear model parameters are drawn for each feature if \code{param_details} is not provided.}

\item{logkdeg_diff_sd}{Standard deviation of the normal distribution from which non-reference log(kdeg)
linear model parameters are drawn for each feature if \code{param_details} is not provided.}

\item{logksyn_diff_sd}{Standard deviation of the normal distribution from which non-reference log(ksyn)
linear model parameters are drawn for each feature if \code{param_details} is not provided.}

\item{pdiff_kd}{Proportion of features for which non-reference log(kdeg) linear model parameters
differ from the reference.}

\item{pdiff_ks}{Proportion of features for which non-reference log(ksyn) linear model parameters
differ from the reference.}

\item{pdiff_both}{Proportion of features for which BOTH non-reference log(kdeg) and log(ksyn) linear model parameters
differ from the reference.}

\item{pdo}{Dropout rate; think of this as the probability that an s4U-containing
molecule is lost during library preparation and sequencing. If \code{pdo} is 0 (default),
then there is no dropout.}
}
\value{
A list containing 6 elements:
\itemize{
\item cB: Tibble that can be provided as the \code{cB} arg to \code{EZbakRData()}.
\item metadf: Tibble that can be provided as the \code{metadf} arg to \code{EZbakRData()}.
\item PerRepTruth: Tibble containing replicate-by-replicate simulated ground truth.
\item AvgTruth: Tibble containing average simulated ground truth.
\item param_details: Tibble containing information about simulated linear model parameters.
\item UnbiasedFractions: Tibble containing the simulated ground truth without dropout.
}
}
\description{
\code{SimulateMultiCondition} is a highly flexible simulator that combines linear modeling
of log(kdeg)'s and log(ksyn)'s with \code{SimulateOneRep} to simulate an NR-seq dataset. The linear model
allows you to simulate multiple distinct treatments, batch effects, interaction effects,
etc. The current downside of this flexibility is that it is relatively complex to use.
Easier-to-use simulators are on the way to EZbakR.
}
\examples{
simdata <- SimulateMultiCondition(
  nfeatures = 30,
  metadf = data.frame(
    sample = c("sampleA", "sampleB"),
    treatment = c("treatment1", "treatment2")
  ),
  mean_formula = ~ treatment - 1
)
}
