Handle Missing Data

On this page, we describe how to use function impute_var() in R package spsRdata to impute missing values in a data set.

Note: We offer function impute_var, which is simply a wrapper for three popular missing-data-imputation methods: amelia, mice, and miceRanger. Our goal is to facilitate data collection and data cleaning so that users can easily use functions in R package spsR. Please read the papers and user guides for the missing-data-imputation method you choose (Amelia, MICE, or miceRanger).

Example: Impute Missing Values with `impute_var`

Please first install R package spsRdata.

library(devtools)
if(!require(spsRdata)) install_github("naoki-egami/spsRdata", dependencies = TRUE)

Then, call R package spsRdata.

library(spsRdata)

impute_var in R package spsRdata imputes missing values in both continuous and categorical variables.

Categorical variables (character or factor) are converted into dummies for each unique value prior to imputation. The data set can be either cross-sectional or time-series cross-sectional. Here, we demonstrate the utility of impute_var using the country-level data set sps_country_data, but impute_var can handle any type of data set.
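
To make the dummy conversion concrete, here is a minimal base R sketch of turning one categorical column into one indicator column per unique value. The data frame toy and its column region are hypothetical, and this illustrates the general idea only; it is not taken from the internals of impute_var.

```r
# Hypothetical toy data: one categorical column with a missing value
toy <- data.frame(region = c("Asia", "Europe", NA, "Africa", "Asia"),
                  stringsAsFactors = TRUE)

# One 0/1 indicator column per unique value of the factor.
# Rows where region is NA stay NA in every indicator column,
# so they can still be treated as missing during imputation.
levels_region <- levels(toy$region)
dummies <- sapply(levels_region, function(lv) as.integer(toy$region == lv))
colnames(dummies) <- paste0("region_", levels_region)

dummies
```

Each row with an observed category has exactly one indicator equal to 1, while the row with a missing category is NA across all indicators.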

data(sps_country_data)

As an example, we will impute two variables, Polity score (e_p_polity) and GDP (NY.GDP.MKTP.CD), which have some missing values.

apply(sps_country_data[, c('e_p_polity', 'NY.GDP.MKTP.CD')], 2, function(x) sum(is.na(x)))

##     e_p_polity NY.GDP.MKTP.CD 
##            822            107

We impute the two variables using amelia. By specifying id_site and id_time, function impute_var can respect the time-series cross-sectional data structure. When we only specify id_site, the underlying imputation method assumes that the data are cross-sectional.

imputed_data <- impute_var(data = sps_country_data, 
                           id_unit = 'country', 
                           id_time = 'year', 
                           var_impute = c('e_p_polity', 'NY.GDP.MKTP.CD'),
                           var_ord = 'e_p_polity',
                           method = "amelia")

apply(imputed_data[, c('e_p_polity', 'NY.GDP.MKTP.CD')], 2, function(x) sum(is.na(x)))

##     e_p_polity NY.GDP.MKTP.CD 
##              0              0

Arguments

data: A data.frame containing variables to impute.
id_site: A unique identifier for sites. A column name in data.
id_time: A unique identifier for time index. A column name in data. If unspecified (NULL; default), it assumes data is cross-sectional.
var_impute: A vector with one or more variable names for which imputation is performed. If unspecified (NULL), it imputes all variables in data except for the id_site and id_time variables.
var_ord: (Optional) A vector of names of ordinal variables in var_impute. Binary variables can be included in either var_ord or var_nom.
var_nom: (Optional) A vector of names of nominal variables (non-ordinal categorical variables) in var_impute.
var_lgstc: (Optional) A vector of names of proportional variables (ranges between 0 and 1) in var_impute.
var_predictor: A vector with one or more variable names that we use as predictors to impute variables in var_impute. If unspecified (NULL), the function uses all variables in data except for variables in var_impute.
method: Imputation method. Choose amelia, mice or miceranger. Default is amelia.
n_impute: The number of imputed data sets to produce (equivalent to argument m in amelia(), mice(), and miceRanger()). Default is 5.
...: Arguments passed on to Amelia::amelia(), mice::mice(), or miceRanger::miceRanger().

Note that, because it wraps a collection of imputation methods, our function may be less transparent about the imputation process than a direct call to a particular imputation package. For more advanced imputation, we encourage users to call the function of their preferred imputation package directly.

library(Amelia)
imputed_amelia <- amelia(x = sps_country_data, 
                         cs = 'country', 
                         ts = 'year', 
                         idvars = c('iso3', 'region', 'subregion', 'lang'))
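
For users who prefer MICE, a comparable direct call might look like the sketch below. It assumes the mice package is installed and uses a small simulated data frame (toy, with hypothetical columns x and y) rather than sps_country_data, so the settings shown are illustrative rather than a recommendation.

```r
library(mice)

# Simulated toy data (hypothetical): y depends on x and has 10 missing values
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- toy$x + rnorm(50)
toy$y[sample(50, 10)] <- NA

# Produce 5 imputed data sets with predictive mean matching
imp <- mice(toy, m = 5, method = "pmm", printFlag = FALSE, seed = 1)

# Extract the first completed data set and count remaining missing values
completed <- mice::complete(imp, 1)
sum(is.na(completed$y))
```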

Imputation Methods

Amelia. Please read Honaker, King, and Blackwell (2011) for more details.
MICE. Please read Van Buuren and Groothuis-Oudshoorn (2011) for more details.
miceRanger. Please read Wilson (2020) for more details.

Practical Suggestions

Missing data imputation works better when users supply more variables and observations because the underlying method can better learn the relationships between variables. Therefore, in practice, we recommend imputing missing data before subsetting the data to the target population of sites and the site-level variables users want to diversify. For example, users who want to use sps_country_data, which we provide in spsRdata, can first impute missing data using the full data and then subset the data to their target population of sites and site-level variables of interest.
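
The order of operations can be sketched in base R as follows. Everything here is hypothetical: the data are simulated, and impute_lm is a deliberately simple stand-in (single regression imputation with lm) used only to show the workflow. It is not impute_var and not one of the methods it wraps.

```r
# Simulated full data (hypothetical): 200 sites in two regions
set.seed(42)
n <- 200
full <- data.frame(site   = seq_len(n),
                   region = rep(c("A", "B"), each = n / 2),
                   x      = rnorm(n))
full$y <- 2 * full$x + rnorm(n, sd = 0.5)
full$y[sample(n, 40)] <- NA   # introduce 40 missing values in y

# Stand-in imputer (assumption): fill missing y with lm predictions from x
impute_lm <- function(d) {
  fit  <- lm(y ~ x, data = d)   # lm drops rows with missing y by default
  miss <- is.na(d$y)
  d$y[miss] <- predict(fit, newdata = d[miss, , drop = FALSE])
  d
}

# Recommended order: impute using the full data, then subset to the target sites
imputed_full <- impute_lm(full)
target <- subset(imputed_full, region == "A")

sum(is.na(target$y))
```

Because the model is fit before subsetting, the imputed values for region A also draw on the observations from region B.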

References

Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45, 1-47.

King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95(1), 49-69.

Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1-67.

Wilson, S. (2020). miceRanger: Multiple imputation by chained equations with random forests. R package version, 1(5).
