Prepare Country-Level Variables
On this page, we explain how to prepare country-level variables with the spsRdata package (10-minute read).
- Overview: We provide an overview of the spsRdata package.
- Default Country-Level Data set: We introduce a default country-level data set to ease data collection.
- Prepare Your Own Country-Level Data: We explain how to prepare your own country-level data using several functions in spsRdata.
1. Overview
The purpose of the R package spsRdata is to facilitate data
collection for multi-country studies. Users can supply the
data they prepare on this page to the function sps() for site
selection.
Users can start with the default data set
sps_country_data
that contains a wide range of
commonly-used country-level variables in the social sciences. R package
spsRdata
also offers several functions to merge additional
country-level variables from well-known data sets (e.g., V-Dem and World
Bank) as well as from any external data users want to add.
Note: We emphasize that the R package spsR and the
corresponding SPS algorithm can be applied to any type of study site
(e.g., states, districts, villages, or schools). Because
multi-country studies are especially common in political science, the R
package spsRdata focuses on country-level data collection
for now, but we plan to add functionality that facilitates
data collection at other levels (e.g., states and cities) in the US
as well.
2. Default Country-Level Data set
Please first install R package spsRdata
.
library(devtools)
if(!require(spsRdata)) install_github("naoki-egami/spsRdata", dependencies = TRUE)
Then, load both R packages, spsR and spsRdata.
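A minimal sketch of this loading step, assuming both packages were installed as described above. The helper objects `pkgs` and `loaded` are only illustrative; the check reports which packages could be attached instead of stopping with an error.

```r
# Try to attach spsR and spsRdata; require() returns FALSE (with a warning)
# when a package is not installed, so we collect the results instead of failing.
pkgs <- c("spsR", "spsRdata")
loaded <- vapply(pkgs, function(p) {
  suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
}, logical(1))

# Named logical vector: TRUE for each package that is now attached
loaded
```

If any entry is FALSE, the installation step should be repeated before continuing.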
In the R package spsRdata, we provide a country-level data
set (sps_country_data) comprising 22 variables for 179
countries between 2010 and 2022. It contains a wide range of variables
commonly used in the social sciences, representing political,
economic, and social characteristics of countries. These variables are
collected from the following sources.
- Varieties of Democracy (V-Dem) project
  - The V-Dem data set collects variables from a variety of sources, including the United Nations, Quality of Government, Clio Infra, and the Polity5 Project, and currently includes the most comprehensive political indicators.
- World Bank Indicators
- UN Statistics
- University of Groningen’s World Languages
sps_country_data
includes the following variables.
## country iso3 year region subregion AG.LND.TOTL.K2 e_p_polity v2x_libdem v2x_polyarchy v2x_elecreg v2x_corr v2x_divparctrl v2x_gender v2x_rule v2x_ex_military SL.UEM.TOTL.ZS NY.GDP.MKTP.CD lang SP.POP.TOTL SE.PRM.CUAT.ZS v2x_civlib v2xpe_exlsocgr
## 1 Mexico MEX 2010 Americas Central America 1943950 8 0.459 0.649 1 0.637 -1.567 0.745 0.581 0.071 5.30 1.057801e+12 Spanish 112532401 74.74190 0.688 0.538
## 2 Mexico MEX 2011 Americas Central America 1943950 8 0.457 0.647 1 0.637 -1.567 0.746 0.581 0.071 5.17 1.180487e+12 Spanish 114150481 NA 0.687 0.538
## 3 Mexico MEX 2012 Americas Central America 1943950 8 0.449 0.645 1 0.662 -1.454 0.761 0.562 0.071 4.89 1.201094e+12 Spanish 115755909 77.18234 0.694 0.538
Descriptions of Variables:
Variable | Description | Scale | Coverage | Source |
---|---|---|---|---|
Key Identifiers | ||||
country | Country name used by V-Dem (original variable: country_name) | Text | 2010 - 2022 | V-Dem |
iso3 | ISO3 country code (original variable: country_text_id) | 3-letter code | 2010 - 2022 | World Bank |
year | Year coded annually from 1789-2022 | Integer (1789 to 2022) | 2010 - 2022 | V-Dem |
Geographic Indicators | ||||
region | UN region category | Text (7 unique values) | 2010 - 2022 | UN Statistics |
subregion | UN subregion category | Text (20 unique values) | 2010 - 2022 | UN Statistics |
AG.LND.TOTL.K2 | A country’s total land area (excl. area under inland water bodies, national claims to continental shelf, and exclusive economic zones) | Continuous (in sq. km) | 2010 - 2022 | World Bank |
Democracy Indicators | ||||
e_p_polity | Polity score computed by subtracting the autocracy score from the democracy score; Data for 2019 and 2020 exists for US only. | Interval, from -10 (strongly autocratic) to +10 (strongly democratic) | 2010 - 2018 | V-Dem |
v2x_libdem | Liberal democracy index: measures to what extent the individual and minority rights are protected against the tyranny of the state and the tyranny of the majority | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
v2x_polyarchy | Electoral democracy index: measures to what extent the ideal of electoral democracy is achieved | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
Political Indicators | ||||
v2x_elecreg | Electoral regime index: measures whether regularly scheduled national elections are on course | Binary (0 = (a) the election was “aborted” (those elected did not resume power) or (b) an “electoral interruption”; 1 = an executive or legislative election is held) | 2010 - 2022 | V-Dem |
v2x_corr | Political corruption index: measures pervasiveness of political corruption | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
v2x_divparctrl | Divided party control index: measures whether executive and legislature are controlled by different political parties. | Interval, the positive extreme signifies Divided party control, the intermediate values signify Unified coalition control, and the negative extreme signifies Unified party control | 2010 - 2022 | V-Dem |
v2x_gender | Women political empowerment index: measures level of increasing capacity for women, leading to greater choice, agency, and participation in societal decision-making | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
v2x_rule | Rule of law index: measures to what extent laws are enforced transparently, independently, predictably, and impartially | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
v2x_ex_military | Military dimension index: measures to what extent the power base of the chief executive is determined by the military | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
Economic Indicators | ||||
SL.UEM.TOTL.ZS | Unemployment: share of the labor force that is without work but available for and seeking employment | Interval, from 0 to 100 (% of total labor force) | 2010 - 2022 | World Bank |
NY.GDP.MKTP.CD | GDP | Continuous (current US dollar) | 2010 - 2022 | World Bank |
Sociodemographic Indicators | ||||
lang | Official language used in the country: separated by “|” if there is more than one official language | Text | 2010 - 2022 | Groningen |
SP.POP.TOTL | Total population | Continuous | 2010 - 2022 | World Bank |
SE.PRM.CUAT.ZS | Share of population ages over 25 that attained or completed primary education | Interval, from 0 to 100 (% of population 25+) | 2010 - 2020 | World Bank |
Social Indicators | ||||
v2x_civlib | Civil liberties index: measures the extent to which civil liberty is respected. | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem |
v2xpe_exlsocgr | Exclusion by social group index: measures to what extent individuals are denied access to services or participation in governed spaces based on their identity or belonging to a particular group. | Interval, from low to high (0-1); Lower scores indicate a normatively better situation (e.g. more democratic) and higher scores a normatively worse situation (e.g. less democratic) | 2010 - 2021 | V-Dem |
Users can check the codebook using data(sps_country_codebook).
data(sps_country_codebook)
sps_country_codebook[sps_country_codebook$Variable %in%
c("country", "year", "NY.GDP.MKTP.CD", "SP.POP.TOTL"), ]
## Variable Description Scale Coverage Source
## 1 country Country name used by V-Dem (original variable: country_name) Text 2010 - 2022 V-Dem
## 3 year Year coded annually from 1789-2022 Integer (1789 to 2022) 2010 - 2022 V-Dem
## 17 NY.GDP.MKTP.CD GDP Continuous (current US dollar) 2010 - 2022 World Bank
## 19 SP.POP.TOTL Total population Continuous 2010 - 2022 World Bank
3. Prepare Your Own Country-Level Data
We follow the example on the Get Started page and use a multi-country
survey experiment by Naumann et al. (2018) to illustrate how to use
sps_country_data and various functions in spsRdata.
3.1 Start with sps_country_data
In our example, we are interested in 8 country-level variables discussed in the original paper: GDP, the unemployment rate, the size of the migrant population, general support for immigration, the proportion of women, the mean age, the mean education, and sub-regions in Europe. We illustrate how to collect these variables step by step.
We are interested in 15 European countries and in variables measured in 2015 (the year in which the experiments of Naumann et al. (2018) were conducted). If users are running experiments now, they can, of course, choose site-level variables measured in more recent years.
country_use <- c("Austria", "Belgium", "Czechia", "Denmark", "Finland",
"France", "Germany", "Ireland","Netherlands", "Norway",
"Slovenia", "Spain", "Sweden", "Switzerland", "United Kingdom")
sps_country_data1 <-
sps_country_data[sps_country_data$year == 2015 &
sps_country_data$country %in% country_use,]
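Because %in% silently drops names that do not match, it can help to confirm that every requested country was found. The sketch below illustrates the check on a toy stand-in for sps_country_data; the object names toy_data, toy_sub, and unmatched are hypothetical, and the rows are placeholders rather than real data.

```r
# Toy stand-in for sps_country_data (placeholder rows, not real data)
toy_data <- data.frame(
  country = c("Austria", "Belgium", "Czechia", "Austria"),
  year    = c(2015, 2015, 2015, 2014),
  stringsAsFactors = FALSE
)

# "Czech Republic" is deliberately spelled differently from the data
country_use <- c("Austria", "Belgium", "Czech Republic")

# Same subsetting pattern as above: one year, selected countries
toy_sub <- toy_data[toy_data$year == 2015 &
                      toy_data$country %in% country_use, ]

# Names that were requested but not found in the subset
unmatched <- setdiff(country_use, toy_sub$country)
unmatched
```

An empty result means every requested country was matched. A non-empty result usually points to a spelling difference in the country name.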
Three of the 8 variables are already in our default data set:
sub-regions (subregion), GDP (NY.GDP.MKTP.CD),
and the unemployment rate (SL.UEM.TOTL.ZS). So we subset the
data set to include only the variables we are interested in.
sps_country_data2 <- sps_country_data1[, c("country", "iso3",
"subregion", "year",
"NY.GDP.MKTP.CD", "SL.UEM.TOTL.ZS")]
3.2 Merge Country-Level Variables from V-Dem or World Bank Data Sets
Users can also search for other country-level variables in V-Dem and
World Bank data using keywords via the function search_sps_data.
Here, we look for data on the size of the migrant population:
search_result <- search_sps_data(keyword = "international migrant")
head(search_result, n = 3)
## # A tibble: 3 × 5
## Variable Name Description Coverage Source
## <chr> <chr> <chr> <chr> <chr>
## 1 SG.POP.MIGR.FE.ZS Female migrants (% of international migrant stock) Percentage of female migrants out of total international migrant stock. International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees. Annual World…
## 2 SM.POP.TOTL International migrant stock, total International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees. The data used to estimate the international migrant stock at a particular time are obtained mainly from population censuses. The estimates are derived from the data on foreign-born population--people who have residence in one country but were born in another country. When data on the foreign-born population are not available, data on foreign population--that is, people who are citizens of a country other than the country in which they reside--are used as estimates. After the breakup of the Soviet Union in 1991 people living in one of the newly independent countries who were born in another were classified as international migrants. Estimates of migrant stock in the newly independent states from 1990 on are based on the 1989 census of the Soviet Union. For countries wit… Annual World…
## 3 SM.POP.TOTL.ZS International migrant stock (% of population) International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees. The data used to estimate the international migrant stock at a particular time are obtained mainly from population censuses. The estimates are derived from the data on foreign-born population--people who have residence in one country but were born in another country. When data on the foreign-born population are not available, data on foreign population--that is, people who are citizens of a country other than the country in which they reside--are used as estimates. After the breakup of the Soviet Union in 1991 people living in one of the newly independent countries who were born in another were classified as international migrants. Estimates of migrant stock in the newly independent states from 1990 on are based on the 1989 census of the Soviet Union. For countries wit… Annual World…
Based on the search results, users can add variable(s) to the baseline
data set using merge_sps_data. In our example, we add
SM.POP.TOTL.ZS, which captures the migrant share of the population.
sps_country_data3 <- merge_sps_data(data = sps_country_data2,
vars = c("SM.POP.TOTL.ZS"))
Arguments
- data: A data.frame onto which the new variables should be merged.
- vars: A vector with one or more variable names that should be merged onto data. We merge variables from V-Dem or World Bank data sets (use search_sps_data() to search for names of new variables).
head(sps_country_data3, n = 3)
## iso3 year country subregion NY.GDP.MKTP.CD SL.UEM.TOTL.ZS SM.POP.TOTL.ZS
## 1 SWE 2015 Sweden Northern Europe 5.051038e+11 7.43 16.76756
## 2 CHE 2015 Switzerland Western Europe 6.941182e+11 4.80 29.38669
## 3 FRA 2015 France Western Europe 2.439189e+12 10.35 12.08848
3.3 Merge External Data sets
Users can also merge their own external data set. In our example, 4
variables (the proportion of women, the mean age, the mean education,
and general support for immigration) come from European Social Survey
data. As an illustration, we include these external data as ess_data.
## # A tibble: 3 × 5
## # Groups: country [3]
## country female age education support_imm
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Austria 0.509 49.5 2.33 2.95
## 2 Belgium 0.492 47.5 2.74 3.13
## 3 Czech Republic 0.529 46.7 2.68 2.28
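First, we strongly recommend standardizing country names using
ISO3 codes, because spellings can differ across sources (e.g., ess_data
uses "Czech Republic", whereas sps_country_data uses "Czechia"). Users
can do this easily with the R package countrycode.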
library(countrycode)
ess_data$iso3 <- countrycode(sourcevar = ess_data$country,
origin = "country.name", destination = "iso3c")
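The later steps use a merged data set called sps_country_data4, but the call that creates it is not shown on this page. One way to build such an object, assuming a plain base R merge() on iso3 rather than any particular spsRdata function, is sketched below. The objects country_toy, ess_toy, ess_keep, and merged_toy are hypothetical stand-ins, and the gdp values are placeholders.

```r
# Toy stand-ins for sps_country_data3 and ess_data (gdp values are placeholders)
country_toy <- data.frame(
  iso3    = c("AUT", "BEL", "CZE"),
  year    = 2015,
  country = c("Austria", "Belgium", "Czechia"),
  gdp     = c(1, 2, 3),
  stringsAsFactors = FALSE
)
ess_toy <- data.frame(
  country = c("Austria", "Belgium", "Czech Republic"),
  iso3    = c("AUT", "BEL", "CZE"),
  female  = c(0.509, 0.492, 0.529),
  stringsAsFactors = FALSE
)

# Drop the external country name so only one name column survives,
# then left-join the external variables on the ISO3 code
ess_keep   <- ess_toy[, setdiff(names(ess_toy), "country")]
merged_toy <- merge(country_toy, ess_keep, by = "iso3", all.x = TRUE)

# Every site should still appear exactly once after the merge
stopifnot(nrow(merged_toy) == nrow(country_toy))
merged_toy
```

Joining on iso3 instead of country avoids losing rows to spelling differences such as "Czechia" versus "Czech Republic". If ess_data is a grouped tibble, converting it with as.data.frame() first keeps the column subsetting predictable.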
3.4 Check Data
Users can run simple descriptive analyses to examine distributions and
missing values. The function check_missing returns simple
descriptive statistics and the number of missing entries in each year in the
data set. Character and factor variables are automatically binarized to
obtain descriptive statistics:
check_missing(data = sps_country_data4[, !colnames(sps_country_data4) %in% c('iso3', 'country')],
id_time = 'year')
##
## -------------------------
## Descriptive Summary
## -------------------------
##
## variable mean stdev min p10 p25 p50 p75 p90 max n
## 1 year 2.015000e+03 0.000000e+00 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 15
## 2 NY.GDP.MKTP.CD 9.456782e+11 1.067538e+12 4.310751e+10 2.066336e+11 2.972241e+11 4.623356e+11 9.808649e+11 2.736590e+12 3.357586e+12 15
## 3 SL.UEM.TOTL.ZS 7.967333e+00 4.399676e+00 4.300000e+00 4.692000e+00 5.175000e+00 6.870000e+00 9.170000e+00 1.017400e+01 2.206000e+01 15
## 4 SM.POP.TOTL.ZS 1.344696e+01 5.756509e+00 3.842226e+00 7.483406e+00 1.155432e+01 1.269024e+01 1.539799e+01 1.718646e+01 2.938669e+01 15
## 5 female 5.097057e-01 2.845523e-02 4.704025e-01 4.768593e-01 4.885782e-01 5.072974e-01 5.330339e-01 5.490700e-01 5.564516e-01 15
## 6 age 4.942636e+01 1.888551e+00 4.669628e+01 4.746564e+01 4.777149e+01 4.948649e+01 5.037790e+01 5.172993e+01 5.336549e+01 15
## 7 education 2.646158e+00 1.831415e-01 2.298802e+00 2.403758e+00 2.528171e+00 2.677203e+00 2.777084e+00 2.851173e+00 2.903405e+00 15
## 8 support_imm 3.224444e+00 4.159755e-01 2.277581e+00 2.837090e+00 3.015142e+00 3.266733e+00 3.426971e+00 3.682870e+00 4.009691e+00 15
## 9 subregion:Northern Europe 4.000000e-01 5.070926e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 15
## 10 subregion:Western Europe 4.000000e-01 5.070926e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 15
## 11 subregion:Southern Europe 1.333333e-01 3.518658e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.000000e-01 1.000000e+00 15
## 12 subregion:Eastern Europe 6.666667e-02 2.581989e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 15
##
##
## -------------------------
## Missing Values
## -------------------------
##
## variable nmiss.total nmiss.2015
## 1 year 0 0
## 2 NY.GDP.MKTP.CD 0 0
## 3 SL.UEM.TOTL.ZS 0 0
## 4 SM.POP.TOTL.ZS 0 0
## 5 female 0 0
## 6 age 0 0
## 7 education 0 0
## 8 support_imm 0 0
## 9 subregion:Northern Europe 0 0
## 10 subregion:Western Europe 0 0
## 11 subregion:Southern Europe 0 0
## 12 subregion:Eastern Europe 0 0
In the returned list, the element desc holds a descriptive summary of each
variable in the data set, whereas miss holds the total counts
of missing values (i.e., the number of countries with missing entries) in each
year in the data set. If the years a user is interested in
contain many missing values, we suggest imputing them
using the impute_var function in the R package spsR. Please see Handle
Missing Data.
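To make the per-year missing-value count concrete, here is a base R sketch of the same idea on toy data. It is not the package's implementation; the data frame toy, its columns, and its values are placeholders.

```r
# Toy panel: two years, two variables, some missing entries (placeholders)
toy <- data.frame(
  year = c(2014, 2014, 2015, 2015),
  gdp  = c(1.0, NA, 2.0, 3.0),
  educ = c(NA, NA, 70, NA)
)

# All columns except the time identifier
vars <- setdiff(names(toy), "year")

# Number of missing entries per variable within each year
# (rows = variables, columns = years)
miss_by_year <- sapply(split(toy[vars], toy$year),
                       function(d) colSums(is.na(d)))
miss_by_year
```

A year whose column shows large counts for the variables of interest is a candidate for imputation or for choosing a different year.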
3.5 Clean Data for sps()
Finally, we clean the data for the function sps(). First, the data
need informative row names and column names.
# row names
row.names(sps_country_data4) <- sps_country_data4$country
# column names
colnames(sps_country_data4)[5:11] <-
c('GDP', 'Unemployment', 'Immigration',
'Female', 'Age', 'Education', 'Immig_Support')
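Renaming by position (5:11) depends on the column order produced by the earlier merges. A more defensive alternative, sketched here on a toy data frame, renames columns by their original names. The lookup vector new_names is our own illustration and pairs each original name with the label used above.

```r
# Toy data frame with two of the original column names (placeholder values)
toy <- data.frame(iso3 = "SWE", NY.GDP.MKTP.CD = 1, SL.UEM.TOTL.ZS = 2,
                  check.names = FALSE)

# Lookup: original name -> informative name
new_names <- c(NY.GDP.MKTP.CD = "GDP",
               SL.UEM.TOTL.ZS = "Unemployment",
               SM.POP.TOTL.ZS = "Immigration",
               female         = "Female",
               age            = "Age",
               education      = "Education",
               support_imm    = "Immig_Support")

# Rename only the columns that are present, wherever they sit
hit <- names(toy) %in% names(new_names)
names(toy)[hit] <- unname(new_names[names(toy)[hit]])
names(toy)
```

Columns that are absent from the data are simply skipped, so the same lookup can be reused after adding or dropping variables.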
Second, we can use clean_for_sps to prepare the data for the function sps().
column_for_sps <- c("subregion",
"GDP", "Unemployment", "Immigration",
"Female", "Age", "Education", "Immig_Support")
X_Imm <- clean_for_sps(X = sps_country_data4[, column_for_sps], scale = TRUE)
## ## subregion is a non-numeric variable, and is binarized.
## ## Standardizing numeric variables to make each variable mean zero and standard deviation one.
This function automatically (1) binarizes all non-numeric variables
and (2) standardizes numeric variables (mean zero and standard
deviation one) when the argument scale = TRUE. Please include
only the variables you want to diversify. For example, users should not include
site names in X.
This X_Imm is ready for the function sps().
head(X_Imm, n = 3)
## GDP Unemployment Immigration Female Age Education Immig_Support subregion_Eastern Europe subregion_Northern Europe subregion_Southern Europe subregion_Western Europe
## Sweden -0.4127015 -0.1221302 0.5768431 -0.4083383 0.2240878 1.0770300 1.8877250 0 1 0 0
## Switzerland -0.2356451 -0.7199015 2.7689923 -0.9268037 -0.7338824 -0.6687373 0.5298085 0 0 0 1
## France 1.3990235 0.5415550 -0.2359897 0.3020230 0.2073008 -0.4071176 -0.4658665 0 0 0 1
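For readers who want to see what binarizing and standardizing amount to, the following base R sketch reproduces the two steps on toy data. It is an illustration under our own assumptions, not the source code of clean_for_sps; the helper binarize() and the toy values are hypothetical.

```r
# Toy site-level data: one non-numeric and one numeric variable (placeholders)
toy_X <- data.frame(
  subregion = c("Northern Europe", "Western Europe", "Western Europe"),
  GDP       = c(1, 2, 3),
  row.names = c("Sweden", "Switzerland", "France"),
  stringsAsFactors = FALSE
)

# (1) Binarize: one 0/1 column per observed level of a non-numeric variable
binarize <- function(x, name) {
  lev <- sort(unique(x))
  out <- outer(x, lev, "==") * 1          # n x k indicator matrix
  colnames(out) <- paste(name, lev, sep = "_")
  out
}
dummies <- binarize(toy_X$subregion, "subregion")

# (2) Standardize: center each numeric variable to mean zero and
# rescale it to standard deviation one
num_scaled <- scale(toy_X[, "GDP", drop = FALSE])

# Combine numeric columns first, then indicator columns; keep site names
X_toy <- cbind(num_scaled, dummies)
rownames(X_toy) <- rownames(toy_X)
X_toy
```

After standardization, a one-unit difference means one standard deviation for every numeric variable, which puts variables measured on very different scales (such as GDP and unemployment) on a comparable footing.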
References
Michael Coppedge et al. 2023. “V-Dem Dataset V13.” Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/VDEMDS23.
University of Groningen. 2016. “World Languages.” University of Groningen Open Data. www.resourcewatch.org.