Introduction

The openVA package provides a single interface to several algorithms that assign a cause of death to verbal autopsy (VA) data. In this vignette, we walk through an example of using this interface – namely, the openVA::codeVA() function – to run the expert algorithms (EAVA) for neonatal and child causes of death applied to simulated data sets included in the package. Example code running the InSilicoVA algorithm is also included to illustrate some openVA tools for comparing results across different algorithms.

Getting Started

We begin by attaching the openVA package, which attaches and prints the versions of the core packages and, if installed, the optional packages. EAVA is among the optional packages and must be installed alongside openVA, e.g., install.packages(c("openVA", "EAVA"))

library(openVA)
## ────────────────────── Attaching packages for openVA 1.2.0 ─────────────────────
## ✔ InSilicoVA 1.4.2
## ✔ InterVA4   1.7.6
## ✔ InterVA5   1.1.3
## ✔ Tariff     1.0.5
## ── Optional packages (require manual installation if not attached) ─────────────
## ✔ nbc4va        1.2  
## ✔ vacalibration 2.1  
## ✔ EAVA          1.0.0

Data

Preparation of the raw VA input data is handled outside of openVA. The EAVA package provides the EAVA::odk2EAVA() function for transforming VA data collected using the 2016 WHO VA questionnaire into the expected format. Note that this differs from the format expected by the InSilicoVA and InterVA5 algorithms, for which the Python packages pyCrossVA (https://github.com/verbal-autopsy-software/pyCrossVA) can be used to prepare input data collected using the WHO questionnaires.

The examples presented below make use of simulated data sets constructed with the sole purpose of providing illustrative code. These different data sets (for EAVA and InSilicoVA) do overlap, but are not complete or realistic in the sense that they resemble actual records from a VA interview. Thus, the following “results” should not be interpreted as a serious comparison of the algorithms themselves.

Assigning Causes of Death

EAVA

The EAVA algorithm is selected by passing “EAVA” as an argument to both the data.type and model parameters of openVA::codeVA(). In the following code chunk, we use the “DataEAVA” example data set that has already been transformed into the appropriate format – the EAVA algorithm requires particular column names for the input data frame. Finally, the algorithm logic is specific to the age group, either “neonate” or “child”, as indicated by the argument passed to the age_group parameter. If “neonate” is used, then only live births that resulted in a death before 28 days will be assigned a cause (as determined by particular variables in the input data). Similarly, the “child” argument will restrict the results to deaths occurring after 27 days but before 5 years of age.

data("DataEAVA")
EAVA_neonate <- codeVA(data = DataEAVA,
                       data.type = "EAVA", model = "EAVA",
                       age_group = "neonate")

EAVA_child <- codeVA(data = DataEAVA,
                     data.type = "EAVA", model = "EAVA",
                     age_group = "child")

InSilicoVA

The variables used by the InSilicoVA algorithm differ from those informing the EAVA logic, so a companion data set, “RandomVA6”, has been included in openVA. (It was created from the same source files as the “DataEAVA” example data set.) One potential implication is a difference in the deaths that are processed by each algorithm. InSilicoVA only assigns causes to VA records that have valid responses for age and sex; whereas, if sex is missing, the EAVA algorithm will still assign a cause. In the example data set we see that there are 54 records with no information on the sex of the decedent. To make the results comparable across algorithms we impute the value “y” for variable i019a so that these records will be assigned a cause by InSilicoVA.

data(RandomVA6)
table(RandomVA6$i019a,    # indicator for 'male'
      RandomVA6$i019b)    # indicator for 'female'
##    
##       n   y
##   n  54 825
##   y 857   0
index_missing <- RandomVA6$i019a == "n" & RandomVA6$i019b == "n"
RandomVA6[index_missing, "i019a"] <- "y"

Another difference between these two algorithms should also be noted. InSilicoVA assigns a cause to all records (with valid information on age and sex), whereas EAVA will assign “Unspecified” where reported symptoms do not meet the diagnostic criteria for any cause in the hierarchy. When comparing results, the user will need to decide if they want to restrict the analytic data set to VA records with a specified EAVA cause (and pass only these records to InSilicoVA).

Next, we fit InSilicoVA separately for child and neonate deaths (facilitating the comparison with EAVA results). The output from InSilicoVA, which tracks the posterior sampling, has been omitted to keep the vignette concise.

is_neonate <- RandomVA6[, "i022g"] == "y"
subpop <- rep("child", nrow(RandomVA6))
subpop[is_neonate] <- "neonate"
table(subpop)

fit_insilico_neonate <- codeVA(data = RandomVA6[is_neonate,],
                               data.type = "WHO2016",
                               model = "InSilicoVA")

fit_insilico_child <- codeVA(data = RandomVA6[!is_neonate,],
                             data.type = "WHO2016",
                             model = "InSilicoVA")
# (output omitted)

Summarizing EAVA Results

Output from the codeVA() function is summarized by presenting the cause-specific mortality fractions (CSMFs) as either a table or a plot. The former is achieved with the summary() function, and we can specify the number of causes to include in the CSMF with the top parameter. The number of deaths (in the age group) contributing to the distribution is also included in the print out.

summary(EAVA_neonate, top = 10)
## EAVA fitted on  792 neonate deaths.
## Top 10 CSMFs:
## 
##  cause        proportion
##  Preterm      0.2008    
##  Intrapartum  0.1843    
##  Pneumonia    0.1465    
##  Diarrhea     0.1326    
##  Other        0.1111    
##  Sepsis       0.0808    
##  NNT          0.0467    
##  Meningitis   0.0354    
##  Unspecified  0.0328    
##  Malformation 0.0290

We can create bar plot of the ordered CSMFs from the object returned by codeVA() using the plotVA() function. Again, we can specify the number of causes to include in the summary, along with the title used for the plot.

plotVA(EAVA_child, title = "EAVA results for children", top = 7)

If you need the CSMFs for further computation or to create your own summary, then the getCSMF() function will return the CSMFs as a vector.

EAVA_child_csmf <- getCSMF(EAVA_child)
EAVA_child_csmf
##                    AIDS      Diarrhea/Dysentery                  Injury 
##             0.083686441             0.208686441             0.055084746 
##             Intrapartum            Malformation                 Measles 
##             0.003177966             0.079449153             0.092161017 
## Meningitis/Encephalitis        Other infections               Pneumonia 
##             0.116525424             0.037076271             0.096398305 
##                 Preterm             Unspecified 
##             0.093220339             0.134533898

Comparing CoD Assignments: EAVA & InSilicoVA

CSMFs

InSilicoVA uses a different list of causes of death (CoD) than the EAVA algorithm when using a data type in the WHO group. (The list of causes can be retrieved by loading data(causetextV5), which creates a matrix named causetextV5 with the CoDs.) Mappings from the WHO VA cause list to the list used by the EAVA algorithm are included as data sets in the openVA package. There are separate mappings for neonatal and child CoDs, and the (peculiar) format facilitates the creation of a plot depicting a comparison of the CSMFs. In the following code chunk, we load the mapping for neonatal CoDs (grouping_eava_neonate), create a list of the codeVA() outputs for InSilicoVA and EAVA, and pass these objects to the stackplotVA() function. The result is a stacked bar plot with the CSMF from InSilicoVA coded in the causes used by EAVA (specified by passing the grouping_eava_neonate data frame to the grouping parameter). We also restrict the legend to only include those causes appearing in the data by setting the filter_legend parameter to TRUE. In the resulting plot, the fraction of deaths due to CoDs that are not included in the EAVA cause list, are included in the category labeled “not in neonate CoD hierarchy.”

data("grouping_eava_neonate")

compare_neonate <- list("InSilicoVA" = fit_insilico_neonate,
                        "EAVA" = EAVA_neonate)
stackplotVA(compare_neonate, xlab = "", angle = 0,
            grouping = grouping_eava_neonate,
            filter_legend = TRUE)

In the next example, we compare the CSMFs for child deaths, but with a different arrangement of the bar plots by specifying the “dodge” option as the type of plot (along with setting the horiz parameter to TRUE).

data("grouping_eava_child")

compare_child <- list("InSilicoVA" = fit_insilico_child,
                      "EAVA" = EAVA_child)
stackplotVA(compare_child, xlab = "", angle = 0, grouping = grouping_eava_child,
            horiz = TRUE,
            type = "dodge",
            filter_legend = TRUE)

Individual Cause Assignment

Thus far our comparison of the two algorithms has focused on the aggregate level, as captured by the CSMFs. It is also possible to compare how the algorithms assign causes to each individual death. As a first step, we extract the most likely cause that InSilicoVA assigns to each individual neonatal death using the getTopCOD() function. This function returns a data frame with two columns – “ID”, and “cause1” – and we rename the latter so that we know the CoD is assigned by InSilicoVA.

insilico_indiv_neonate <- getTopCOD(fit_insilico_neonate)
names(insilico_indiv_neonate) <- c("ID", "InSilicoVA")

The EAVA algorithm already provides the CoD assignments for each individual death, so we transform the output into a data frame and merge it with our result from the previous code chunk (i.e., insilico_indiv_neonates).

df_eava_neonate <- data.frame("ID" = EAVA_neonate$ID, "EAVA" = EAVA_neonate$cause)
combined_neonate <- merge(insilico_indiv_neonate, df_eava_neonate)

Finally, we cross tabulate the cause assignments for each neonatal death and create a heat map to visualize the results. Note that we have have not mapped to WHO VA cause list (used by InSilicoVA) to the causes assigned by EAVA, but this is also possible using the data set “grouping_eava_neonate” included in the openVA package.

xtab_neonate <- table(combined_neonate$InSilicoVA,
                      combined_neonate$EAVA)
xtab_neonate |>
  as.data.frame() |>
  ggplot(aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() + 
  labs(title = "Individual Cause Assignments for Neonatal Deaths") +
  xlab("InSilicoVA") + ylab("EAVA") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.95, hjust = 1))

Conclusion

We conclude by summarizing the main functions and points:

  • codeVA() – fit an algorithm to VA data

    • other software is needed to prepare VA data (e.g., pyCrossVA or EAVA::odk2EAVA())
  • summary() and plotVA() – print out and plot the CSMF

    • getCSMF() – provides the CSMF as a vector

    • stackplotVA() – compares CSMFs from different algorithms

  • getTopCOD() – returns the most likely cause assigned by InSilicoVA

The openVA package facilitates the analysis of VA data by providing general functions that fit and summarize results from different algorithms. There are also tools to compare results at either the aggregate or individual level. To suggest additional functionality, please submit issues to the GitHub repository: https://github.com/verbal-autopsy-software/openVA/issues.