On this page, we explain how to prepare country-level variables with the spsRdata package (10-minute read).


Table of Contents
  1. Overview:
    We provide an overview of the spsRdata package.

  2. Default Country-Level Data set:
    We introduce a default country-level data set to ease data collection.

  3. Prepare Your Own Country-Level Data:
    We explain how to prepare your own country-level data using several functions in spsRdata.


1.  Overview

The purpose of R package spsRdata is to facilitate data collection for multi-country studies. Users can supply the data prepared on this page to function sps() for site selection.

Users can start with the default data set sps_country_data that contains a wide range of commonly-used country-level variables in the social sciences. R package spsRdata also offers several functions to merge additional country-level variables from well-known data sets (e.g., V-Dem and World Bank) as well as from any external data users want to add.

Note: We emphasize that R package spsR and the corresponding SPS algorithm can be applied to any type of study site (e.g., states, districts, villages, or schools). Because multi-country studies are the most popular in political science, R package spsRdata focuses on country-level data collection for now, but we are planning to add more functionality to facilitate data collection at other levels (e.g., states and cities) in the US as well.

2.  Default Country-Level Data set

Please first install R package spsRdata.

library(devtools)
if(!require(spsRdata)) install_github("naoki-egami/spsRdata", dependencies = TRUE)

Then, call both R packages spsR and spsRdata.
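
The page does not show this step, so the following is only a minimal sketch of loading both packages. It assumes spsR has been installed separately (the installation command above covers spsRdata); the names pkgs and loaded are illustrative.

```r
# Attach both packages if they are installed. require() returns FALSE
# instead of stopping when a package is missing.
pkgs <- c("spsR", "spsRdata")
loaded <- vapply(pkgs, function(p) {
  suppressWarnings(require(p, character.only = TRUE, quietly = TRUE))
}, logical(1))

# Report anything that still needs to be installed
if (!all(loaded)) {
  message("Not installed: ", paste(pkgs[!loaded], collapse = ", "))
}
```

In an interactive session with both packages installed, plain library(spsR) and library(spsRdata) calls are equivalent.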

In R package spsRdata, we provide a country-level data set (sps_country_data) comprising 22 variables for 179 countries between 2010 and 2022. It contains a wide range of variables commonly used in the social sciences, representing political, economic, and social characteristics of countries. These variables are collected from V-Dem, the World Bank, UN Statistics, and the University of Groningen, as listed in the Source column of the variable descriptions below.

sps_country_data includes the following variables.

data(sps_country_data)
head(sps_country_data, n = 3)
##   country iso3 year   region       subregion AG.LND.TOTL.K2 e_p_polity v2x_libdem v2x_polyarchy v2x_elecreg v2x_corr v2x_divparctrl v2x_gender v2x_rule v2x_ex_military SL.UEM.TOTL.ZS NY.GDP.MKTP.CD    lang SP.POP.TOTL SE.PRM.CUAT.ZS v2x_civlib v2xpe_exlsocgr
## 1  Mexico  MEX 2010 Americas Central America        1943950          8      0.459         0.649           1    0.637         -1.567      0.745    0.581           0.071           5.30   1.057801e+12 Spanish   112532401       74.74190      0.688          0.538
## 2  Mexico  MEX 2011 Americas Central America        1943950          8      0.457         0.647           1    0.637         -1.567      0.746    0.581           0.071           5.17   1.180487e+12 Spanish   114150481             NA      0.687          0.538
## 3  Mexico  MEX 2012 Americas Central America        1943950          8      0.449         0.645           1    0.662         -1.454      0.761    0.562           0.071           4.89   1.201094e+12 Spanish   115755909       77.18234      0.694          0.538

Descriptions of Variables:

Variable | Description | Scale | Coverage | Source

Key Identifiers
country | Country name used by V-Dem (original variable: country_name) | Text | 2010 - 2022 | V-Dem
iso3 | ISO3 country code (original variable: country_text_id) | 3-letter code | 2010 - 2022 | World Bank
year | Year coded annually from 1789-2022 | Integer (1789 to 2022) | 2010 - 2022 | V-Dem

Geographic Indicators
region | UN region category | Text (7 unique values) | 2010 - 2022 | UN Statistics
subregion | UN subregion category | Text (20 unique values) | 2010 - 2022 | UN Statistics
AG.LND.TOTL.K2 | A country’s total land area (excl. area under inland water bodies, national claims to continental shelf, and exclusive economic zones) | Continuous (in sq. km) | 2010 - 2022 | World Bank

Democracy Indicators
e_p_polity | Polity score computed by subtracting the autocracy score from the democracy score; data for 2019 and 2020 exist for the US only | Interval, from -10 (strongly autocratic) to +10 (strongly democratic) | 2010 - 2018 | V-Dem
v2x_libdem | Liberal democracy index: measures to what extent individual and minority rights are protected against the tyranny of the state and the tyranny of the majority | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem
v2x_polyarchy | Electoral democracy index: measures to what extent the ideal of electoral democracy is achieved | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem

Political Indicators
v2x_elecreg | Electoral regime index: measures whether regularly scheduled national elections are on course | Binary (0 = (a) the election was “aborted” (those elected did not resume power) or (b) an “electoral interruption”; 1 = an executive or legislative election is held) | 2010 - 2022 | V-Dem
v2x_corr | Political corruption index: measures pervasiveness of political corruption | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem
v2x_divparctrl | Divided party control index: measures whether executive and legislature are controlled by different political parties | Interval; the positive extreme signifies divided party control, the intermediate values signify unified coalition control, and the negative extreme signifies unified party control | 2010 - 2022 | V-Dem
v2x_gender | Women political empowerment index: measures level of increasing capacity for women, leading to greater choice, agency, and participation in societal decision-making | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem
v2x_rule | Rule of law index: measures to what extent laws are enforced transparently, independently, predictably, and impartially | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem
v2x_ex_military | Military dimension index: measures to what extent the power base of the chief executive is determined by the military | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem

Economic Indicators
SL.UEM.TOTL.ZS | Unemployment: share of the labor force that is without work but available for and seeking employment | Interval, from 0 to 100 (% of total labor force) | 2010 - 2022 | World Bank
NY.GDP.MKTP.CD | GDP | Continuous (current US dollar) | 2010 - 2022 | World Bank

Sociodemographic Indicators
lang | Official language used in the country: separated by a vertical bar if there is more than one official language | Text | 2010 - 2022 | Groningen
SP.POP.TOTL | Total population | Continuous | 2010 - 2022 | World Bank
SE.PRM.CUAT.ZS | Share of the population aged 25 and over that attained or completed primary education | Interval, from 0 to 100 (% of population 25+) | 2010 - 2020 | World Bank

Social Indicators
v2x_civlib | Civil liberties index: measures the extent to which civil liberty is respected | Interval, from low to high (0-1) | 2010 - 2022 | V-Dem
v2xpe_exlsocgr | Exclusion by social group index: measures to what extent individuals are denied access to services or participation in governed spaces based on their identity or belonging to a particular group | Interval, from low to high (0-1); lower scores indicate a normatively better situation (e.g., more democratic) and higher scores a normatively worse situation (e.g., less democratic) | 2010 - 2021 | V-Dem

Users can check the codebook using data(sps_country_codebook).

data(sps_country_codebook)
sps_country_codebook[sps_country_codebook$Variable %in% 
                       c("country", "year", "NY.GDP.MKTP.CD", "SP.POP.TOTL"), ]
##          Variable                                                  Description                          Scale    Coverage     Source
## 1         country Country name used by V-Dem (original variable: country_name)                           Text 2010 - 2022      V-Dem
## 3            year                           Year coded annually from 1789-2022         Integer (1789 to 2022) 2010 - 2022      V-Dem
## 17 NY.GDP.MKTP.CD                                                          GDP Continuous (current US dollar) 2010 - 2022 World Bank
## 19    SP.POP.TOTL                                             Total population                     Continuous 2010 - 2022 World Bank
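
Because the codebook is an ordinary data frame, it can also be filtered by source or searched by keyword. The sketch below uses a small hand-copied stand-in (toy_codebook, a hypothetical name) with a few rows taken from the descriptions above, so that it runs without the package. With the package loaded, the same expressions should apply to sps_country_codebook, assuming it has the columns shown in the output above.

```r
# Hypothetical stand-in for sps_country_codebook (subset of rows and columns)
toy_codebook <- data.frame(
  Variable = c("country", "NY.GDP.MKTP.CD", "SP.POP.TOTL", "v2x_corr"),
  Description = c(
    "Country name used by V-Dem (original variable: country_name)",
    "GDP",
    "Total population",
    "Political corruption index: measures pervasiveness of political corruption"
  ),
  Source = c("V-Dem", "World Bank", "World Bank", "V-Dem"),
  stringsAsFactors = FALSE
)

# All variables that come from the World Bank
wb_vars <- toy_codebook$Variable[toy_codebook$Source == "World Bank"]

# Case-insensitive keyword search in the descriptions
hits <- toy_codebook[grepl("corruption", toy_codebook$Description,
                           ignore.case = TRUE), ]
```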


3.  Prepare Your Own Country-Level Data

We follow our example in the Get Started Page and use a multi-country survey experiment by Naumann et al. (2018) to illustrate how to use sps_country_data and various functions in spsRdata.

3.1  Start with sps_country_data

In our example, we are interested in 8 country-level variables discussed in the original paper: GDP, unemployment rates, size of the migrant population, general support for immigration, the proportion of women, the mean age, the mean education, and sub-regions in Europe. We illustrate how to collect these variables step by step.

We are interested in 15 European countries and in variables measured in 2015 (the year in which the experiments of Naumann et al. (2018) were conducted). Users who are running experiments now can, of course, choose site-level variables measured in the most recent years.

country_use <- c("Austria", "Belgium", "Czechia", "Denmark", "Finland", 
                 "France", "Germany", "Ireland","Netherlands", "Norway", 
                 "Slovenia", "Spain", "Sweden", "Switzerland", "United Kingdom")

sps_country_data1 <- 
  sps_country_data[sps_country_data$year == 2015 & 
                     sps_country_data$country %in% country_use,]
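
Subsetting with %in% silently drops any country whose name does not match the spelling used in the data (the vector above uses "Czechia", for instance). A quick check with setdiff() can catch such mismatches. The sketch below runs on a hypothetical toy data frame (toy_data) rather than on sps_country_data; the names wanted, subset_2015, and unmatched are illustrative.

```r
# Hypothetical toy panel with the same key columns as sps_country_data
toy_data <- data.frame(
  country = c("Austria", "Czechia", "France", "Austria", "Czechia", "France"),
  year = c(2014, 2014, 2014, 2015, 2015, 2015),
  stringsAsFactors = FALSE
)

# "Czech Republic" is deliberately spelled differently from the data
wanted <- c("Austria", "Czech Republic", "France")

subset_2015 <- toy_data[toy_data$year == 2015 & toy_data$country %in% wanted, ]

# Names that were requested but not found in the subset
unmatched <- setdiff(wanted, subset_2015$country)
if (length(unmatched) > 0) {
  message("No rows found for: ", paste(unmatched, collapse = ", "))
}
```

The same check could be applied to country_use and sps_country_data1 after the subsetting step above.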

Three of the 8 variables are already in our default data set: sub-regions (subregion), GDP (NY.GDP.MKTP.CD), and unemployment rate (SL.UEM.TOTL.ZS). We therefore subset the data set to keep only the variables we are interested in, together with the identifiers.

sps_country_data2 <- sps_country_data1[, c("country", "iso3", 
                                           "subregion", "year", 
                                           "NY.GDP.MKTP.CD", "SL.UEM.TOTL.ZS")]


3.2  Merge Country-Level Variables from V-Dem or World Bank Data Sets

Users can also search for other country-level variables in V-Dem and World Bank using keywords via the function search_sps_data. Here, we look for data on the size of the migrant population:

search_result <- search_sps_data(keyword = "international migrant")
head(search_result, n = 3)
## # A tibble: 3 × 5
##   Variable          Name                                               Description Coverage Source
##   <chr>             <chr>                                              <chr> <chr>    <chr> 
## 1 SG.POP.MIGR.FE.ZS Female migrants (% of international migrant stock) Percentage of female migrants out of total international migrant stock. International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees… Annual   World…
## 2 SM.POP.TOTL       International migrant stock, total                 International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees. The data used to estimate the international migrant stock at a particular time are obtained mainly from population censuses. The estimates are derived from the data on foreign-born population--people who have residence in one country but were born in another country. When data on the foreign-born population are not available, data on foreign population--that is, people who are citizens of a country other than the country in which they reside--are used as estimates. After the breakup of the Soviet Union in 1991 people living in one of the newly independent countries who were born in another were classified as international migrants. Estimates of migrant stock in the newly independent states from 1990 on are based on the 1989 census of the Soviet Union. For countries wit… Annual   World…
## 3 SM.POP.TOTL.ZS    International migrant stock (% of population)      International migrant stock is the number of people born in a country other than that in which they live. It also includes refugees. The data used to estimate the international migrant stock at a particular time are obtained mainly from population censuses. The estimates are derived from the data on foreign-born population--people who have residence in one country but were born in another country. When data on the foreign-born population are not available, data on foreign population--that is, people who are citizens of a country other than the country in which they reside--are used as estimates. After the breakup of the Soviet Union in 1991 people living in one of the newly independent countries who were born in another were classified as international migrants. Estimates of migrant stock in the newly independent states from 1990 on are based on the 1989 census of the Soviet Union. For countries wit… Annual   World…

Based on the search results, users can add variable(s) to the baseline data set using merge_sps_data. In our example, we are interested in adding SM.POP.TOTL.ZS, which captures the proportion of the migrant population.

sps_country_data3 <- merge_sps_data(data = sps_country_data2, 
                                    vars = c("SM.POP.TOTL.ZS"))

Arguments

  • data: A data.frame onto which the new variables should be merged.
  • vars: A vector with one or more names of variables that should be merged onto data. Variables are merged from the V-Dem or World Bank data sets (use search_sps_data() to search for names of new variables).

head(sps_country_data3, n = 3)
##   iso3 year     country       subregion NY.GDP.MKTP.CD SL.UEM.TOTL.ZS SM.POP.TOTL.ZS
## 1  SWE 2015      Sweden Northern Europe   5.051038e+11           7.43       16.76756
## 2  CHE 2015 Switzerland  Western Europe   6.941182e+11           4.80       29.38669
## 3  FRA 2015      France  Western Europe   2.439189e+12          10.35       12.08848


3.3  Merge External Data sets

Users can also merge their own external data set. In our example, 4 variables (the proportion of women, the mean age, the mean education, and general support for immigration) come from European Social Survey data. As an illustration, we include these external data as ess_data.

data(ess_data)
head(ess_data, n = 3)
## # A tibble: 3 × 5
## # Groups:   country [3]
##   country        female   age education support_imm
##   <chr>           <dbl> <dbl>     <dbl>       <dbl>
## 1 Austria         0.509  49.5      2.33        2.95
## 2 Belgium         0.492  47.5      2.74        3.13
## 3 Czech Republic  0.529  46.7      2.68        2.28

First, we strongly recommend standardizing country names using ISO3 codes. Users can do this easily with R package countrycode and then merge the two data sets on the iso3 column.

library(countrycode)
ess_data$iso3 <- countrycode(sourcevar = ess_data$country, 
                             origin = "country.name", destination = "iso3c")
sps_country_data4 <- merge(sps_country_data3, 
                           ess_data[, c('iso3', 'female', 'age', 'education', 'support_imm')], 
                           by = 'iso3',
                           all.x = TRUE,
                           sort = FALSE)
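
With all.x = TRUE, countries that have no match in the external data are kept with NA in the merged columns, so it is worth checking for them after the merge. The sketch below uses two hypothetical toy data frames (base_toy and ext_toy) that already carry an iso3 column; it is an illustration, not part of the package.

```r
# Hypothetical baseline data and external data, both keyed by iso3
base_toy <- data.frame(iso3 = c("AUT", "BEL", "CZE"),
                       country = c("Austria", "Belgium", "Czechia"),
                       stringsAsFactors = FALSE)
ext_toy <- data.frame(iso3 = c("AUT", "CZE"),
                      female = c(0.509, 0.529),
                      stringsAsFactors = FALSE)

# Left join: every row of base_toy is kept
merged_toy <- merge(base_toy, ext_toy, by = "iso3", all.x = TRUE, sort = FALSE)

# Countries that found no partner in the external data
no_match <- merged_toy$country[is.na(merged_toy$female)]
if (length(no_match) > 0) {
  message("No external data for: ", paste(no_match, collapse = ", "))
}
```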


3.4  Check Data

Users can run simple descriptive analyses to examine distributions and missing values. The function check_missing, called below, returns simple descriptive statistics and the number of missing entries in each year of the data set. Character and factor variables are automatically binarized to obtain descriptive statistics:

check_missing(data = sps_country_data4[, !colnames(sps_country_data4) %in% c('iso3', 'country')],
              id_time = 'year')
## 
## -------------------------
##  Descriptive Summary
## -------------------------
## 
##                     variable         mean        stdev          min          p10          p25          p50          p75          p90          max  n
## 1                       year 2.015000e+03 0.000000e+00 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 2.015000e+03 15
## 2             NY.GDP.MKTP.CD 9.456782e+11 1.067538e+12 4.310751e+10 2.066336e+11 2.972241e+11 4.623356e+11 9.808649e+11 2.736590e+12 3.357586e+12 15
## 3             SL.UEM.TOTL.ZS 7.967333e+00 4.399676e+00 4.300000e+00 4.692000e+00 5.175000e+00 6.870000e+00 9.170000e+00 1.017400e+01 2.206000e+01 15
## 4             SM.POP.TOTL.ZS 1.344696e+01 5.756509e+00 3.842226e+00 7.483406e+00 1.155432e+01 1.269024e+01 1.539799e+01 1.718646e+01 2.938669e+01 15
## 5                     female 5.097057e-01 2.845523e-02 4.704025e-01 4.768593e-01 4.885782e-01 5.072974e-01 5.330339e-01 5.490700e-01 5.564516e-01 15
## 6                        age 4.942636e+01 1.888551e+00 4.669628e+01 4.746564e+01 4.777149e+01 4.948649e+01 5.037790e+01 5.172993e+01 5.336549e+01 15
## 7                  education 2.646158e+00 1.831415e-01 2.298802e+00 2.403758e+00 2.528171e+00 2.677203e+00 2.777084e+00 2.851173e+00 2.903405e+00 15
## 8                support_imm 3.224444e+00 4.159755e-01 2.277581e+00 2.837090e+00 3.015142e+00 3.266733e+00 3.426971e+00 3.682870e+00 4.009691e+00 15
## 9  subregion:Northern Europe 4.000000e-01 5.070926e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 15
## 10  subregion:Western Europe 4.000000e-01 5.070926e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 15
## 11 subregion:Southern Europe 1.333333e-01 3.518658e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.000000e-01 1.000000e+00 15
## 12  subregion:Eastern Europe 6.666667e-02 2.581989e-01 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 15
## 
## 
## -------------------------
##  Missing Values
## -------------------------
## 
##                     variable nmiss.total nmiss.2015
## 1                       year           0          0
## 2             NY.GDP.MKTP.CD           0          0
## 3             SL.UEM.TOTL.ZS           0          0
## 4             SM.POP.TOTL.ZS           0          0
## 5                     female           0          0
## 6                        age           0          0
## 7                  education           0          0
## 8                support_imm           0          0
## 9  subregion:Northern Europe           0          0
## 10  subregion:Western Europe           0          0
## 11 subregion:Southern Europe           0          0
## 12  subregion:Eastern Europe           0          0

The returned list contains desc, a descriptive summary of each variable in the data set, and miss, the number of missing values (i.e., the number of countries with missing data) for each year in the data set. If the years a user is interested in contain many missing values, we suggest imputing them with the impute_var function in R package spsR. Please see Handle Missing Data.
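
For intuition, per-year missing counts of this kind can be reproduced in base R. This is only a sketch of the idea on a hypothetical toy panel (toy_panel); it is not the package's implementation, and its output layout differs from the table above.

```r
# Hypothetical toy panel: three countries observed in two years
toy_panel <- data.frame(
  year = c(2014, 2014, 2014, 2015, 2015, 2015),
  gdp = c(1.2, NA, 3.1, 1.3, 2.2, 3.0),
  unemployment = c(5.1, NA, NA, 5.0, 6.2, 7.4)
)
vars <- setdiff(names(toy_panel), "year")

# Rows = years, columns = variables, cells = number of missing entries
nmiss_by_year <- sapply(vars, function(v) {
  tapply(is.na(toy_panel[[v]]), toy_panel$year, sum)
})

# Total number of missing entries per variable across all years
nmiss_total <- colSums(nmiss_by_year)
```

Here 2014 has missing entries while 2015 is complete, which is the kind of pattern that helps decide which year to use or whether imputation is needed.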

3.5  Clean Data for sps()

Finally, we clean data for function sps(). First, we need to have informative row names and column names for the data.

# row names
row.names(sps_country_data4) <- sps_country_data4$country

# column names
colnames(sps_country_data4)[5:11] <- 
  c('GDP', 'Unemployment', 'Immigration', 
    'Female', 'Age', 'Education', 'Immig_Support')
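
Renaming by position (5:11) depends on the column order produced by the earlier merges. A name-based mapping is a more defensive alternative. The sketch below shows the idea on a hypothetical one-row data frame (toy_df) with three of the columns; new_names and idx are illustrative names.

```r
# Hypothetical one-row example with three of the original column names
toy_df <- data.frame(iso3 = "SWE",
                     NY.GDP.MKTP.CD = 5.051038e+11,
                     SL.UEM.TOTL.ZS = 7.43,
                     SM.POP.TOTL.ZS = 16.76756)

# Mapping of the form: old name = new name
new_names <- c(NY.GDP.MKTP.CD = "GDP",
               SL.UEM.TOTL.ZS = "Unemployment",
               SM.POP.TOTL.ZS = "Immigration")

# Locate each old name and fail loudly if an expected column is absent
idx <- match(names(new_names), colnames(toy_df))
stopifnot(!anyNA(idx))
colnames(toy_df)[idx] <- unname(new_names)
```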

Second, we can use clean_for_sps to prepare data for function sps().

column_for_sps <- c("subregion", 
                    "GDP", "Unemployment", "Immigration", 
                    "Female", "Age", "Education", "Immig_Support")

X_Imm <- clean_for_sps(X = sps_country_data4[, column_for_sps], scale = TRUE)
## ## subregion is a non-numeric variable, and is binarized.
## ## Standardizing numeric variables to make each variable mean zero and standard deviation one.

This function automatically (1) binarizes all non-numeric variables and (2) standardizes numeric variables (mean zero and standard deviation one) when argument scale = TRUE. Please include only the variables you want to diversify. For example, users should not include site names in X.

This X_Imm is ready for function sps().

head(X_Imm, n = 3)
##                    GDP Unemployment Immigration     Female        Age  Education Immig_Support subregion_Eastern Europe subregion_Northern Europe subregion_Southern Europe subregion_Western Europe
## Sweden      -0.4127015   -0.1221302   0.5768431 -0.4083383  0.2240878  1.0770300     1.8877250                        0                         1                         0                        0
## Switzerland -0.2356451   -0.7199015   2.7689923 -0.9268037 -0.7338824 -0.6687373     0.5298085                        0                         0                         0                        1
## France       1.3990235    0.5415550  -0.2359897  0.3020230  0.2073008 -0.4071176    -0.4658665                        0                         0                         0                        1
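
To make the two steps concrete, the sketch below reproduces binarization and standardization in base R on a hypothetical toy data frame (toy_X). It illustrates the idea only: it is not the implementation of clean_for_sps, and details of the real function may differ.

```r
# Hypothetical toy input: one non-numeric and two numeric columns
toy_X <- data.frame(
  subregion = c("Northern Europe", "Western Europe", "Western Europe"),
  GDP = c(5.05e11, 6.94e11, 2.44e12),
  Unemployment = c(7.43, 4.80, 10.35),
  row.names = c("Sweden", "Switzerland", "France"),
  stringsAsFactors = FALSE
)

is_num <- vapply(toy_X, is.numeric, logical(1))

# (1) Standardize numeric columns to mean 0 and standard deviation 1
num_part <- scale(toy_X[, is_num, drop = FALSE])

# (2) One 0/1 indicator column per category of each non-numeric column
cat_part <- do.call(cbind, lapply(names(toy_X)[!is_num], function(v) {
  f <- factor(toy_X[[v]])
  m <- sapply(levels(f), function(l) as.numeric(f == l))
  colnames(m) <- paste(v, levels(f), sep = "_")
  m
}))

toy_clean <- cbind(num_part, cat_part)
rownames(toy_clean) <- rownames(toy_X)
```

Each row of the indicator columns sums to one, and each standardized column has mean zero, which mirrors the structure of X_Imm shown above.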


References

Michael Coppedge et al. 2023. “V-Dem Dataset V13.” Varieties of Democracy (V-Dem) Project. https://doi.org/10.23696/VDEMDS23.

University of Groningen. 2016. “World Languages.” University of Groningen Open Data. www.resourcewatch.org.