Handle Missing Data
On this page, we describe how to use the function impute_var() in the R package spsRdata to impute missing values in a data set.
Note: We offer the function impute_var, which is simply a wrapper for three popular missing-data-imputation methods: amelia, mice, and miceRanger. Our goal is to facilitate data collection and data cleaning so that users can easily use the functions in the R package spsR. Please read the papers and user guides for the missing-data-imputation method you choose (Amelia, MICE, or MICERanger).
Example: Impute Missing Values with impute_var
Please first install the R package spsRdata.
library(devtools)
if(!require(spsRdata)) install_github("naoki-egami/spsRdata", dependencies = TRUE)
Then, load the R package spsRdata.
library(spsRdata)
impute_var in the R package spsRdata imputes missing values in both continuous and categorical variables. Categorical variables (character or factor) are converted into dummies for each unique value prior to imputation. The data set can be either cross-sectional or time-series cross-sectional. Here, we demonstrate the utility of impute_var using the country-level data set sps_country_data, but impute_var can handle any type of data set.
data(sps_country_data)
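The dummy conversion described above can be pictured with base R. This is only a sketch on made-up data: the toy data frame and its values are assumptions for illustration, and the way impute_var builds its dummies internally may differ.

```r
# Toy data for illustration only (hypothetical values, not from sps_country_data)
toy <- data.frame(
  region = c("Asia", "Europe", "Asia", "Africa"),
  stringsAsFactors = TRUE
)

# Build one indicator column per unique value of the categorical variable.
# Dropping the intercept ("- 1") keeps a column for every level.
dummies <- model.matrix(~ region - 1, data = toy)

# Each row has exactly one indicator equal to 1
print(dummies)
```

Each unique value of region becomes its own 0/1 column, which is the form an imputation model for numeric data can work with.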
As an example, we will impute two variables, Polity score (e_p_polity) and GDP (NY.GDP.MKTP.CD), which have some missing values.
## e_p_polity NY.GDP.MKTP.CD
## 822 107
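Counts of missing values like the ones above can be obtained with base R. The helper count_missing and the toy data frame below are hypothetical names and made-up values used only for illustration; they are not part of spsRdata.

```r
# Hypothetical helper: number of NA values in each selected column
count_missing <- function(data, vars) {
  colSums(is.na(data[, vars, drop = FALSE]))
}

# Toy data for illustration only (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  polity  = c(10, NA, NA, 7),
  gdp     = c(1.2, 3.4, NA, 5.6)
)

missing_counts <- count_missing(toy, c("polity", "gdp"))
print(missing_counts)
```

With spsRdata loaded, the same helper could be applied as count_missing(sps_country_data, c('e_p_polity', 'NY.GDP.MKTP.CD')).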
We impute the two variables using amelia. By specifying id_site and id_time, the function impute_var can respect the time-series cross-sectional data structure. When we only specify id_site, the underlying imputation method assumes cross-sectional data.
imputed_data <- impute_var(data = sps_country_data,
                           id_unit = 'country',
                           id_time = 'year',
                           var_impute = c('e_p_polity', 'NY.GDP.MKTP.CD'),
                           var_ord = 'e_p_polity',
                           method = "amelia")
## e_p_polity NY.GDP.MKTP.CD
## 0 0
Arguments
- data: A data.frame containing variables to impute.
- id_site: A unique identifier for sites. A column name in data.
- id_time: A unique identifier for time index. A column name in data. If unspecified (NULL; default), it assumes data is cross-sectional.
- var_impute: A vector with one or more variable names for which imputation is performed. If unspecified (NULL), it imputes all variables in data except for the id_site and id_time variables.
- var_ord: (Optional) A vector of names of ordinal variables in var_impute. Binary variables can be included in either var_ord or var_nom.
- var_nom: (Optional) A vector of names of nominal variables (non-ordinal categorical variables) in var_impute.
- var_lgstc: (Optional) A vector of names of proportional variables (ranges between 0 and 1) in var_impute.
- var_predictor: A vector with one or more variable names that we use as predictors to impute variables in var_impute. If unspecified (NULL), the function uses all variables in data except for variables in var_impute.
- method: Imputation method. Choose amelia, mice or miceranger. Default is amelia.
- n_impute: The number of imputed data sets to produce (equivalent of argument m in amelia() and miceRanger()). Default is 5.
- ...: Arguments passed on to Amelia::amelia() or mice::mice() or miceRanger::miceRanger().
Note that, because it provides a collection of imputation methods, our function may be less transparent in its imputation process than directly using a particular imputation package. For more advanced imputation, we encourage users to call the function of their preferred imputation package directly.
library(Amelia)
imputed_amelia <- amelia(x = sps_country_data,
                         cs = 'country',
                         ts = 'year',
                         idvars = c('iso3', 'region', 'subregion', 'lang'))
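When calling Amelia directly, the completed data sets are stored in the imputations element of the returned object. The sketch below uses the freetrade example data shipped with Amelia rather than sps_country_data so that it runs on its own; the seed and the choice of m = 5 are assumptions made for illustration.

```r
library(Amelia)

# Example time-series cross-sectional data bundled with Amelia
data(freetrade)

set.seed(123)  # arbitrary seed so the sketch is reproducible

# Impute with country as the cross-section and year as the time index;
# p2s = 0 suppresses progress output
a_out <- amelia(freetrade, m = 5, ts = "year", cs = "country", p2s = 0)

# One completed data set per imputation
n_sets <- length(a_out$imputations)
first_completed <- a_out$imputations[[1]]

# Remaining missing values in the first completed data set
n_missing_after <- sum(is.na(first_completed))
print(c(n_sets = n_sets, n_missing_after = n_missing_after))
```

Analyses are usually run on each completed data set and then combined, rather than on a single imputation.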
Imputation Methods
- Amelia. Please read Honaker, King, and Blackwell (2011) for more details.
- MICE. Please read Van Buuren and Groothuis-Oudshoorn (2011) for more details.
- MICERanger. Please read Wilson (2020) for more details.
Practical Suggestions
Missing data imputation generally works better when users supply more variables and observations because the underlying method can better learn the relationships between variables. Therefore, in practice, we recommend imputing missing data before subsetting the data to the target population of sites and the site-level variables to diversify. For example, if users want to use sps_country_data, which we provide in spsRdata, they can first impute missing data using the full data and then subset the data to their target population of sites and site-level variables of interest.
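The impute-first, subset-second order can be sketched as follows. To keep the example runnable on its own, it calls Amelia directly on its bundled freetrade data instead of sps_country_data; the chosen countries and variables are arbitrary stand-ins for a target population of sites and site-level variables.

```r
library(Amelia)

# Example time-series cross-sectional data bundled with Amelia
data(freetrade)

set.seed(123)  # arbitrary seed so the sketch is reproducible

# Step 1: impute using the full data (all sites and all variables)
full_imp <- amelia(freetrade, m = 1, ts = "year", cs = "country", p2s = 0)
completed <- full_imp$imputations[[1]]

# Step 2: only then subset to the target sites and variables of interest
target <- subset(completed,
                 country %in% c("India", "Korea"),
                 select = c(country, year, tariff, polity))

print(head(target))
```

Subsetting before imputation would instead discard the other sites and variables that the imputation model could have learned from.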
References
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45, 1-47.
King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review, 95(1), 49-69.
Van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1-67.
Wilson, S. (2020). miceRanger: Multiple imputation by chained equations with random forests. R package version, 1(5).