MDgof-Examples

Note that the examples are not actually run; their results are stored in the included data set examples.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.

ReRunExamples=FALSE

This package brings together a number of routines for the goodness-of-fit problem for multivariate data. There is a data set \(\mathbf{x}\) and a cumulative distribution function \(F\) (possibly depending on parameters that have to be estimated from \(\mathbf{x}\)), and we want to test \(H_0:\mathbf{X}\sim F\).

The highlights of this package are:

For descriptions of the included methods see the vignette MDgof-Methods.

Example 1: Continuous data without parameter estimation

We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis \(H_0:X\sim N(\mathbf{0}, \mathbf{I_2})\):

set.seed(123)
# cumulative distribution function under the null hypothesis
pnull=function(x) { 
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
# function that generates data under the null hypothesis
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
x=rnull()
examples[["ex1a"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex1a"]]
#> $statistics
#>      qKS       qK     qCvM      qAD       BR      MMD       ES       EP 
#>  0.09020  0.14800  0.09610  0.56700  3.54100  0.00457 10.30000 12.17000 
#> 
#> $p.values
#>    qKS     qK   qCvM    qAD     BR    MMD     ES     EP 
#> 0.5650 0.4400 0.6150 0.9050 0.4100 0.6950 0.4148 0.4319

\(B\) is the number of simulation runs used to estimate the p-value. \(B=5000\) is the default, but if the data set is large a smaller value should suffice.
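To get a sense of how large \(B\) needs to be, note that the estimated p-value is a binomial proportion, so its standard error is about \(\sqrt{p(1-p)/B}\). A quick check (this computation is ours, not part of the package):

# approximate standard error of a simulated p-value near p=0.05, B=5000
sqrt(0.05*(1-0.05)/5000)
#> [1] 0.003082207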

If the density of the distribution under the null hypothesis is also known, it can be included, and two more tests (BB and KSD in the output below) can be run:

# density under the null hypothesis
dnull=function(x) {
  f=function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
examples[["ex1b"]]=gof_test(x, pnull, rnull, dnull=dnull, B=B)
examples[["ex1b"]]
#> $statistics
#>      qKS       qK     qCvM      qAD       BB       BR      MMD      KSD 
#>  0.09020  0.14800  0.09610  0.56700  0.00134  3.64500  0.00790  0.97800 
#>       ES       EP 
#> 10.30000 12.17000 
#> 
#> $p.values
#>    qKS     qK   qCvM    qAD     BB     BR    MMD    KSD     ES     EP 
#> 0.5500 0.5000 0.5900 0.9250 1.0000 0.4700 0.4500 0.8100 0.4148 0.4319

Arguments of gof_test for continuous data

Example 2: Continuous data with parameter estimation

In this example we test whether the data comes from a bivariate normal distribution, with the means and the variance-covariance matrix estimated from the data.

pnull=function(x, p) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), 
      x, mean=p[1:2],  
      sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function(p) mvtnorm::rmvnorm(200, 
      mean=p[1:2],
      sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
dnull=function(x, p) {
  f=function(x) mvtnorm::dmvnorm(x, 
      mean=p[1:2],
      sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
phat=function(x) 
   c(apply(x, 2, mean), c(apply(x, 2, var), cov(x[,1],x[,2])))
x=rnull(c(1, 2, 4, 9, 0.6))
phat(x)
#> [1] 1.2812099 1.8038581 3.9441709 8.0894505 0.1389893
examples[["ex2a"]]=gof_test(x, pnull, rnull, phat, dnull=dnull, B=B)
examples[["ex2a"]]
#> $statistics
#>     qKS      qK    qCvM     qAD      BB      BR     MMD     KSD      ES      EP 
#> 0.08060 0.14100 0.05590 0.48000 0.00607 4.56200 0.00121 0.15500 3.96100 8.39000 
#> 
#> $p.values
#>    qKS     qK   qCvM    qAD     BB     BR    MMD    KSD     ES     EP 
#> 0.3100 0.2850 0.5900 0.8700 0.5450 0.7950 0.9700 0.4700 0.6819 0.2995

In contrast, in the next case we generate data from a multivariate t distribution with 5 degrees of freedom:

y=mvtnorm::rmvt(200, sigma=diag(2), df=5)
examples[["ex2b"]]=gof_test(y, pnull, rnull, 
    phat, dnull=dnull,  B=B)
examples[["ex2b"]]
#> $statistics
#>      qKS       qK     qCvM      qAD       BB       BR      MMD      KSD 
#>  0.11200  0.18200  0.12400  3.42500  0.00214  3.50500  0.01240  0.68100 
#>       ES       EP 
#>  7.84200 15.01000 
#> 
#> $p.values
#>    qKS     qK   qCvM    qAD     BB     BR    MMD    KSD     ES     EP 
#> 0.0100 0.0050 0.0200 0.0250 1.0000 0.5350 0.0300 0.6900 0.0198 0.0359

New Tests

gof_test allows the user to supply their own test method, either one that finds its own p-value or one that requires simulation.

As an example, the routine newTS (included in the package) finds either the test statistic or the p-value of a chi-square test in two dimensions, for either continuous or discrete data.

Example 3

In this example we use the same setup as in example 2. newTS requires the following object TSextra (a list):

TSextra=list(Continuous=TRUE, 
             WithEstimation=TRUE, 
             Withpvalue=TRUE)
examples[["ex3a"]]=gof_test(x, pnull, rnull, phat, 
         TS=newTS, TSextra=TSextra,  
         B=B)
examples[["ex3a"]]
#> $statistics
#>  ES33  EP33 
#> 0.100 0.306 
#> 
#> $p.values
#> ES33 EP33 
#> 0.49 0.33
examples[["ex3b"]]=gof_test(y, pnull, rnull, phat, 
         TS=newTS, TSextra=TSextra,  
         B=B)
examples[["ex3b"]]
#> $statistics
#>     ES33     EP33 
#> 0.000000 0.000118 
#> 
#> $p.values
#>  ES33  EP33 
#> 0.975 1.000

Arguments and output of new test routine for continuous data

The arguments have to be:

The output of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
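As a purely illustrative sketch (the function name myTS and its argument list below are hypothetical, not the package's actual interface; consult the argument list above), a user-supplied routine could look like this:

# Hypothetical sketch of a user-supplied test routine. The argument
# names are illustrative only; use the argument list documented above.
# The essential requirement: return a *named* numeric vector, with
# one entry per test statistic.
myTS=function(x, TSextra) {
  stat=max(abs(apply(x, 2, mean))) # largest absolute componentwise mean
  names(stat)="maxMean"
  stat
}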

Adjusted p values

If several tests are run, one can use the routine gof_test_adjusted_pvalue to find a p-value that is adjusted for simultaneous testing:

examples[["ex3c"]]=gof_test_adjusted_pvalue(x, pnull, 
                rnull, dnull=dnull, phat=phat, 
                B=c(B,B))
a=examples[["ex3c"]]
for(i in seq_along(a)) 
    print(paste(names(a)[i],": ", round(a[i],4)))
#> [1] "qKS :  0.8502"
#> [1] "qK :  0.7295"
#> [1] "qCvM :  0.8744"
#> [1] "qAD :  0.8261"
#> [1] "BB :  0.7101"
#> [1] "BR :  0.0193"
#> [1] "MMD :  0.628"
#> [1] "KSD :  0.5411"
#> [1] "ES :  0.7158"
#> [1] "EP :  0.7212"
#> [1] "Min p :  0.1871"

The BR test had the smallest p-value (\(0.0193\)), which is adjusted to \(0.1871\) to account for the fact that \(10\) tests were run simultaneously.
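As a rough sanity check of this adjustment (our own back-of-the-envelope calculation, which pretends the tests are independent): under the null hypothesis a valid p-value is uniform on \((0,1)\), so the chance that the smallest of \(10\) p-values falls at or below \(0.0193\) is

# chance that the smallest of 10 independent Uniform(0,1) p-values
# is 0.0193 or less; this comes out to roughly 0.18
1-(1-0.0193)^10

which is in the same ballpark as the adjusted value above. The simulation-based adjustment accounts for the actual dependence between the tests.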

Power Estimation

The routine gof_power allows the user to estimate the power of the tests.

Example 4

Let’s say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector \((0,0)\), variances equal to 1 and a covariance \(\rho\). We first need a function that generates these data sets:

pnull=function(x) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
dnull=function(x) {
  f=function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0),
                sigma=matrix(c(1, s, s, 1), 2, 2))

Now we can run

examples[["ex4a"]]=gof_power(pnull, rnull, ralt, c(0, 0.8),
          dnull=dnull, B=B, maxProcessor=1)
examples[["ex4a"]]
#>       qKS    qK  qCvM   qAD    BB    BR   MMD   KSD   ES    EP
#> 0   0.080 0.045 0.035 0.065 0.075 0.035 0.045 0.105 0.06 0.075
#> 0.8 0.775 0.460 0.900 0.950 0.975 0.585 0.925 1.000 1.00 1.000

Arguments of gof_power

Again the user can provide their own routine:

TSextra=list(Continuous=TRUE, 
             WithEstimation=FALSE, 
             Withpvalue=TRUE)
examples[["ex4b"]]=gof_power(pnull, rnull, ralt, c(0, 0.8), TS=newTS, 
          TSextra=TSextra, B=500, 
          With.p.value=TRUE)
examples[["ex4b"]]
#>      ES33  EP33
#> 0   0.044 0.074
#> 0.8 0.988 1.000

Discrete Data

First note that tests for discrete data are implemented only in two dimensions. The data set must be a matrix with three columns: each row contains a value of the x variable, a value of the y variable, and the corresponding count.
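As a tiny illustration of the required format (the values are made up for this example), a data set where both variables take the values 0 and 1 would look like:

# each row: x value, y value, count of observations with that (x, y)
m=cbind(x=c(0, 1, 0, 1), y=c(0, 0, 1, 1), counts=c(10, 20, 30, 40))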

Example 5

We consider the case of data from binomial distributions:

pnull=function(x) {
  f=function(x) 
    sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=rnull()
x
#>       [,1] [,2] [,3]
#>  [1,]    0    0    0
#>  [2,]    1    0   15
#>  [3,]    2    0   23
#>  [4,]    3    0   11
#>  [5,]    4    0    3
#>  [6,]    5    0    0
#>  [7,]    0    1   10
#>  [8,]    1    1   52
#>  [9,]    2    1   95
#> [10,]    3    1   76
#> [11,]    4    1   30
#> [12,]    5    1    4
#> [13,]    0    2   16
#> [14,]    1    2   63
#> [15,]    2    2  122
#> [16,]    3    2  134
#> [17,]    4    2   69
#> [18,]    5    2    7
#> [19,]    0    3    0
#> [20,]    1    3   28
#> [21,]    2    3   69
#> [22,]    3    3   90
#> [23,]    4    3   68
#> [24,]    5    3   15
examples[["ex5"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex5"]]
#> $statistics
#>       qKS        qK      qCvM       qAD ChiSquare 
#>   0.02820   0.03980   0.00237   0.01290  13.06000 
#> 
#> $p.values
#>       qKS        qK      qCvM       qAD ChiSquare 
#>    0.2050    0.1750    0.4000    0.7350    0.8355

The MDgof routine sq2rec above changes the data from the format output by the table command to the format required by gof_test.
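For readers who want to see what this reshaping amounts to, here is a rough base-R equivalent (our own sketch, not the package's implementation):

# base-R sketch of the same reshaping: turn a two-way table into a
# three-column matrix of (x value, y value, count)
tbl=as.data.frame(table(x=rbinom(1000, 5, 0.5), y=rbinom(1000, 3, 0.7)),
                  stringsAsFactors=FALSE)
cbind(as.numeric(tbl$x), as.numeric(tbl$y), tbl$Freq)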

The arguments of gof_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if x is a matrix with three columns and the values in the third column are integers.
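In other words, the check is along these lines (an assumption about the internals, based on the description above):

# x is treated as a discrete data set if
is.matrix(x) && ncol(x)==3 && all(x[,3]==round(x[,3]))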

Example 6

One use of the discrete-data methods is when the data is actually continuous but the data set is very large, so that the continuous methods take too long to run. In that case one can discretize the data and run a discrete method. In this example we have one million observations from a bivariate random variable and we want to test whether they come from dependent Beta distributions, with a parameter to be estimated from the data. The null hypothesis specifies the model \(X\sim Beta(a, a)\), \(Y|X=x\sim Beta(x, x)\), with the parameter \(a\). Here is an example of data drawn from this model, using \(a=2\):

rnull_cont=function(p) {
  x=rbeta(250, p, p)
  y=rbeta(250, x, x)
  cbind(x=x, y=y)
}  
dta_cont=rnull_cont(2)
ggplot2::ggplot(data=as.data.frame(dta_cont), ggplot2::aes(x,y)) + 
  ggplot2::geom_point()

Here we have the joint density

\[ \begin{aligned} &f(x, y) = f_X(x)f_{Y|X=x}(y|x) = \\ &\frac{\Gamma(2a)}{\Gamma(a)^2}[x(1-x)]^{a-1} \frac{\Gamma(2x)}{\Gamma(x)^2}[y(1-y)]^{x-1} \end{aligned} \]

First we need an estimator of \(a\). We can find the maximum likelihood estimator using the Newton-Raphson algorithm:

\[ \begin{aligned} &l(a) = n\left[\log \Gamma(2a)-2\log \Gamma(a)+\frac{a-1}{n}\sum\log [x_i(1-x_i)]\right]\\ &\frac{dl}{da}=2n\left[di(2a)-di(a)+\frac{1}{2n}\sum\log [x_i(1-x_i)]\right]\\ &\frac{d^2l}{da^2}=2n\left[2\,tri(2a)-tri(a)\right]\\ \end{aligned} \] Here \(l(a)\) is the log-likelihood after dropping the terms that do not involve \(a\), and \(di\) and \(tri\) are the digamma and trigamma functions, that is, the first and second derivatives of the log-gamma function. Here is an implementation:

phat_cont=function(x) {
  n=nrow(x)
  sx=sum( log(x[,1]*(1-x[,1])) )/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx    # dl/da, up to the factor 2n
  denom=function(a) 2*trigamma(2*a)-trigamma(a) # d2l/da2, up to the factor 2n
  anew=2 # starting value
  repeat { # Newton-Raphson iteration
    aold=anew
    anew=aold-num(aold)/denom(aold)
    if(abs(aold-anew)<0.001) break
  }
  anew
}
phat_cont(dta_cont)
#> [1] 2.063892

For the function \(pnull\) we make use of the following general calculation:

\[ \begin{aligned} &F(x,y) = P(X\le x;Y\le y) =\\ &\int_{-\infty}^x \int_{-\infty}^y f(u,v)\, dv\, du = \\ &\int_{-\infty}^x \int_{-\infty}^y f_X(u)f_{Y|X=u}(v|u)\, dv\, du = \\ &\int_{-\infty}^x f_X(u) \left\{\int_{-\infty}^y f_{Y|X=u}(v|u)\, dv\right\}du = \\ &\int_{-\infty}^x f_X(u)F_{Y|X=u}(y|u)\, du \end{aligned} \]

which is implemented in

pnull=function(x, p) {
   f=function(x) {
     g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)  
     integrate(g, 0, x[1])$value
   }
   if(!is.matrix(x)) return(f(x))
   apply(x, 1, f)  
}

and with this we can run the test

examples[["ex6a"]]=gof_test(dta_cont, pnull, rnull_cont, 
         phat=phat_cont, 
         Ranges=matrix(c(0, 1,0, 1), 2, 2),
         maxProcessor = 1, B=B)
examples[["ex6a"]]
#> $statistics
#>      qKS       qK     qCvM      qAD       BR      MMD       ES       EP 
#>  0.06720  0.11100  0.20100  1.57400  1.82900  0.00286 28.19000 30.08000 
#> 
#> $p.values
#>    qKS     qK   qCvM    qAD     BR    MMD     ES     EP 
#> 0.3000 0.1200 0.1600 0.1400 0.6600 0.4000 0.1349 0.0904

Now, though, let's assume that the data set has one million observations. In that case the routine above will be much too slow. Instead we can discretize the problem. To make this work we will have to discretize all the routines except for pnull, as the distribution function is the same for discrete and continuous random variables.

Let's say we decide to discretize the data into a 50x30 grid of equal-size bins. As in example 5 we can generate the data using the binomial distribution. However, we will also have to find the bin probabilities, essentially the density of the discretized random vector. This is done in

rnull_disc=function(p) {
  # distribution function under the null hypothesis, same as pnull above
  pnull=function(x, p) {
     f=function(x) {
       g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)  
       integrate(g, 0, x[1])$value
     }
     if(!is.matrix(x)) return(f(x))
     apply(x, 1, f)  
  }
  nb=c(50, 30) # number of bins in x and y
  xvals=1:nb[1]/nb[1] # right endpoints of the bins
  yvals=1:nb[2]/nb[2]
  z=cbind(rep(xvals, nb[2]), rep(yvals, each=nb[1]), 0) # the full grid
  prob=c(t(MDgof::p2dC(z, pnull, p))) # bin probabilities
  z[, 3]=rbinom(prod(nb), 1e6, prob) # one binomial count per bin
  z
}
dta_disc=rnull_disc(2)
dta_disc[1:10, ]
#>       [,1]       [,2]  [,3]
#>  [1,] 0.02 0.03333333 20196
#>  [2,] 0.04 0.03333333 33428
#>  [3,] 0.06 0.03333333     0
#>  [4,] 0.08 0.03333333   527
#>  [5,] 0.10 0.03333333   592
#>  [6,] 0.12 0.03333333 20160
#>  [7,] 0.14 0.03333333 66612
#>  [8,] 0.16 0.03333333     0
#>  [9,] 0.18 0.03333333   566
#> [10,] 0.20 0.03333333     6

The MDgof routine p2dC finds the probabilities of the rectangles defined by the grid from the distribution function.
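The underlying calculation is the standard inclusion-exclusion identity for rectangle probabilities (a general fact, not specific to the package):

\[ P(x_1<X\le x_2; y_1<Y\le y_2) = F(x_2,y_2)-F(x_1,y_2)-F(x_2,y_1)+F(x_1,y_1) \]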

Next we need to rewrite the routine that finds the estimator, now based on the discrete data:

phat_disc=function(x) {
  nb=c(50, 30)
  n=sum(x[,3]) #sample size
  valsx=unique(x[,1])-1/nb[1]/2 # x values, centers of the bins
  x=tapply(x[,3], x[,1], sum) # counts for x alone
  sx=sum(x*log(valsx*(1-valsx)))/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx
  denom=function(a) 2*trigamma(2*a)-trigamma(a)
  anew=2
  repeat {
    aold=anew
    anew=aold-num(aold)/denom(aold)
    if(abs(aold-anew)<0.00001) break
  }
  anew
}
p=phat_disc(dta_disc)
p
#> [1] 1.083052

We want to apply this test to the original continuous data set, so we need to discretize it as well. This can be done with the MDgof routine discretize. Then we can run the test

set.seed(111)
x=rbeta(1e6, 2, 2)
y=rbeta(1e6, x, x)
dta_cont=cbind(x=x, y=y)
dta_disc=MDgof::discretize(dta_cont, 
                  Range=matrix(c(0,1,0,1),2,2),
                  nbins=c(50, 30))
head(dta_disc)
#>      [,1]       [,2] [,3]
#> [1,] 0.02 0.03333333  613
#> [2,] 0.04 0.03333333 1555
#> [3,] 0.06 0.03333333 2396
#> [4,] 0.08 0.03333333 3144
#> [5,] 0.10 0.03333333 3690
#> [6,] 0.12 0.03333333 4156
examples[["ex6b"]]=gof_test(dta_disc, pnull, rnull_disc, phat=phat_disc, B=B)
examples[["ex6b"]]
#> $statistics
#>       qKS        qK      qCvM       qAD ChiSquare 
#>  7.89e-04  1.34e-03  1.26e-04  8.96e-04  1.44e+03 
#> 
#> $p.values
#>       qKS        qK      qCvM       qAD ChiSquare 
#>    0.8850    0.8900    0.7350    0.7800    0.7364

User supplied tests

The routine newTS (included in the package) does a chi-square test for discrete data:

TSextra=list(Continuous=FALSE, 
             WithEstimation=TRUE, 
             Withpvalue=FALSE)
examples[["ex6c"]]=gof_test(dta_disc, pnull, rnull_disc, phat_disc, TS=newTS, TSextra=TSextra,  B=B)
examples[["ex6c"]]
#> $statistics
#> ES33 
#> 1440 
#> 
#> $p.values
#>   ES33 
#> 0.7488

Power Estimation

Power estimation for discrete data is done the same way as for continuous data:

Example 7

We consider again the case of data from binomial distributions as in example 5. As an alternative we change the distribution of X.

pnull=function(x) {
  f=function(x) 
    sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
   x=rbinom(1000, 5, 0.5)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
   x=rbinom(1000, 5, p)
   y=rbinom(1000, 3, 0.5+x/20)
   MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)
examples[["ex7a"]]=gof_test(x, pnull, rnull, B=B)
examples[["ex7a"]]
#> $statistics
#>       qKS        qK      qCvM       qAD ChiSquare 
#>    0.0480    0.0532    0.0107    0.0843   29.4600 
#> 
#> $p.values
#>       qKS        qK      qCvM       qAD ChiSquare 
#>    0.0000    0.0200    0.0050    0.0000    0.0591
examples[["ex7b"]]= gof_power(pnull, rnull, ralt, c(0.5, 0.475),B=200)
examples[["ex7b"]]
#>         qKS    qK  qCvM   qAD Chisquare
#> 0.5   0.065 0.075 0.060 0.050      0.06
#> 0.475 0.845 0.620 0.795 0.715      0.48

or using a user-supplied test:

TSextra=list(Continuous=FALSE, 
             WithEstimation=FALSE, 
             Withpvalue=TRUE)
examples[["ex7c"]]=gof_test(x, pnull, rnull, 
         TS=newTS, TSextra=TSextra,  
         B=B)
examples[["ex7c"]]
#> $statistics
#>  ES33 
#> 0.059 
#> 
#> $p.values
#>  ES33 
#> 0.932
examples[["ex7d"]]=gof_power(pnull, rnull, ralt, c(0.5, 0.475),
         TS=newTS, TSextra=TSextra,  
         With.p.value = TRUE, B=B)
examples[["ex7d"]]
#> NULL

Benchmarking

The routine run.studies allows the user to quickly compare the power of a new test with the powers of the included tests for 30 different combinations of null hypothesis and alternative, for continuous or discrete data, with or without parameter estimation. It also allows the user to find the power for these case studies for different sample sizes and type I error probabilities.

Let's say that we want to determine whether method A or method B has better power for a specific combination of null hypothesis and alternative. If we find the powers of the tests for (say) \(n=100, \alpha=0.05\) and method A turns out to be better, A will also be better for any other combination of \(n\) and \(\alpha\). For this reason checking just one combination suffices to determine the ranking of the methods (for this one case).

Example 8

Say we want to compare the power of newTS for continuous data with parameter estimation with the powers of the included tests:

TSextra=list(Continuous=TRUE, 
             WithEstimation=TRUE, 
             Withpvalue=TRUE)
a=run.studies(study=1:2, # do first two 
            Continuous=TRUE, 
            WithEstimation = TRUE,
            TS=newTS, 
            TSextra=TSextra,
            With.p.value=TRUE, 
            B=B)

Arguments of run.studies

MDgof includes five data sets with the results for the included tests using a sample size of 250, a true type I error probability of 0.05 and two values of the parameter under the alternative, one which makes the null hypothesis true and one which makes it false. They are

If the user wants different settings, they can run:

examples[["ex8"]]= run.studies(1:2, 
            Continuous=TRUE, WithEstimation = FALSE,
            param_alt=cbind(c(0, 0.4), c(0, 0.4)), 
            alpha=0.1, nsample=500, B=B)
examples[["ex8"]][["Null"]][1:2, ]
#>                      qKS    qK  qCvM   qAD    BB    BR   MMD   KSD    ES   EP
#> uniform.diagonal.n 0.082 0.114 0.133 0.116 0.097 0.114 0.121 0.099 0.125 0.13
#> uniform.diagonal.b 0.949 0.978 0.990 1.000 0.942 0.998 0.986 0.961 1.000 1.00
examples[["ex8"]][["Pow"]][1:2, ]
#>                      qKS    qK  qCvM   qAD    BB    BR   MMD   KSD   ES    EP
#> uniform.diagonal.n 0.082 0.114 0.133 0.116 0.097 0.114 0.121 0.099 0.07 0.085
#> uniform.diagonal.b 0.949 0.978 0.990 1.000 0.942 0.998 0.986 0.961 1.00 1.000