Skip to contents


On this page, we explain how to use stratify_sps() to incorporate various practical constraints and domain knowledge to sps().


Table of Contents


Example

On this page, we use Naumann et al. (2018) (data(X_Imm)) as an example.

library(spsR)
data("X_Imm")

Before we begin, we make the scale of variables comparable by standardizing continuous variables.

var_cont <-  c('GDP', 'Unemployment', 'Immigration', 
               'Female', 'Age', 'Education', 'Immig_Support')
X_Imm[, var_cont] <- scale(X_Imm[, var_cont])

Case 1: Improve Diversity of Site-Level Variables

In some cases, researchers might want to stratify some variables to make sure that study sites satisfy certain conditions.

For example, users might want to make sure that we select at least one country from low, middle, and high GDP countries:

  • at least one country from the top 20 percentile of the GDP distribution
  • at least one country between 40th and 60th percentiles of the GDP distribution
  • at least one country from the bottom 20 percentile of the GDP distribution

Users can enforce this condition using stratify_sps().

Researchers can ensure to select at least 1 site that have GDP larger than or equal to the 80 percentile (quantile(X_Imm[, "GDP"], prob = c(0.8))).

q_80 <- quantile(X_Imm[, "GDP"], prob = c(0.8))

st_GDP_80 <- 
  stratify_sps(X = X_Imm, 
               num_site = list("at least", 1), 
               condition = list("GDP", "larger than or equal to", q_80))
## 3 sites satisfy the specified `condition` and sps() will select at least 1 site from them.

Similarly, researchers can ensure to select at least 1 site that have GDP between the 40th and 60th percentile (quantile(X_Imm[, "GDP"], prob = c(0.4, 0.6))).

q_40_60 <- quantile(X_Imm[, "GDP"], prob = c(0.4, 0.6))
         
st_GDP_40_60 <- 
  stratify_sps(X = X_Imm, 
               num_site = list("at least", 1), 
               condition = list("GDP", "between", q_40_60))
## 3 sites satisfy the specified `condition` and sps() will select at least 1 site from them.

Users can also ensure to select at least 1 site that have GDP smaller than or equal to the 20 percentile (quantile(X_Imm[, "GDP"], prob = c(0.2))).

q_20 <- quantile(X_Imm[, "GDP"], prob = c(0.2))

st_GDP_20 <-  
  stratify_sps(X = X_Imm, 
               num_site = list("at least", 1), 
               condition = list("GDP", "smaller than or equal to", q_20))
## 3 sites satisfy the specified `condition` and sps() will select at least 1 site from them.

Arguments

  • num_site: A list of two elements, e.g., list("at least", 1). This argument specifies the number of sites that should satisfy condition specified below. The first element should be either at least or at most. The second element is integer. For example, list("at least", 1) means that we stratify SPS such that we select at least 1 site that satisfies condition (specified below).

  • condition: A list of three elements, e.g., list("GDP", "larger than or equal to", 1). This argument specifies conditions for stratification. The first element should be a name of a site-level variable. The second element should be either larger than or equal to, smaller than or equal to, or between. The third element is a vector of length 1 or 2. When the second element is between, the third element should be a vector of two values. For example, list("GDP", "larger than or equal to", 1) means that we stratify SPS such that we select num_site sites that have GDP larger than or equal to 1.

Finally, users can combine these different stratification into one list and supply it to function sps().

out_st <- sps(X = X_Imm, N_s = 6, 
              stratify = list(st_GDP_80, st_GDP_40_60, st_GDP_20))
## Selecting Study Sites...
out_st$selected_sites
## [1] "Switzerland" "Czechia"     "Spain"       "France"      "Ireland"    
## [6] "Norway"
Stratify Each Variable

In practice, it is often recommended to add simple stratification to every variable to make sure that we can cover low and high values in each dimension. For example, after standardizing each variable, we can make sure to select at least 1 site above 0.5 standard deviation and at least 1 site below −0.5 standard deviation.

st_large <- st_small <- list()
for(v in 1:7){
st_large[[v]] <- stratify_sps(X = X_Imm, 
                              num_site = list("at least", 1), 
                              condition = list(colnames(X_Imm)[v], "larger than or equal to", 0.5))
st_small[[v]] <- stratify_sps(X = X_Imm, 
                              num_site = list("at least", 1), 
                              condition = list(colnames(X_Imm)[v], "smaller than or equal to", -0.5))
}
st_combined <- c(st_large, st_small)
out_st <- sps(X = X_Imm, N_s = 6, stratify = st_combined)
## Selecting Study Sites...
out_st$selected_sites
## [1] "Switzerland" "Czechia"     "Germany"     "Denmark"     "Spain"      
## [6] "Netherlands"


Case 2: Select Sites from Different Subgroups

It is often useful to select sites from different subgroups. For example, users might want to make sure that we select at least one country from each of 4 sub-regions of Europe.

The following code ensures to select at least 1 site from countries that have Northern Europe = 1.

st_NE <- 
  stratify_sps(X=X_Imm, 
               num_site=list("at least", 1), 
               condition=list("Northern Europe", "larger than or equal to", 1))
## 6 sites satisfy the specified `condition` and sps() will select at least 1 site from them.

Similarly, the following codes ensure that we select at least 1 site from countries that have Eastern Europe = 1, Western Europe = 1 and Southern Europe = 1, respectively.

st_EE <- 
  stratify_sps(X=X_Imm, 
               num_site=list("at least", 1), 
               condition=list("Eastern Europe", "larger than or equal to", 1))
## 1 site satisfies the specified `condition` and sps() will select at least 1 site from them.
st_WE <- 
  stratify_sps(X=X_Imm, 
               num_site=list("at least", 1), 
               condition=list("Western Europe", "larger than or equal to", 1))
## 6 sites satisfy the specified `condition` and sps() will select at least 1 site from them.
st_SE <- 
  stratify_sps(X=X_Imm, 
               num_site=list("at least", 1), 
               condition=list("Southern Europe", "larger than or equal to", 1))
## 2 sites satisfy the specified `condition` and sps() will select at least 1 site from them.

Finally, users combine them into one list and supply it to function sps().

out_region <- sps(X = X_Imm, 
                  N_s = 6, 
                  stratify = list(st_NE, st_EE, st_WE, st_SE))
## Selecting Study Sites...
out_region$selected_sites
## [1] "Switzerland" "Czechia"     "Germany"     "Denmark"     "Spain"      
## [6] "Netherlands"

Case 3: Always Include or Always Exclude

In some cases, researchers might want to always include or always exclude certain sites.

Let’s start with cases when we always include certain sites. Users might want to conduct studies in certain sites, for example, because those sites are substantively important or provide “hard tests” for a given theory.

Suppose users always want to include “Sweden” as one of study sites. If so, users can explicitly incorporate this constraint as follows.

out <- sps(X = X_Imm, N_s = 6, site_include = c("Sweden"))
## Selecting Study Sites...
out$selected_sites
## [1] "Switzerland" "Czechia"     "Spain"       "France"      "Ireland"    
## [6] "Sweden"

In other cases, researchers might want to always exclude certain sites. For example, users might not be able to conduct studies in certain sites because survey firms do not offer service in certain countries.

Suppose users always want to exclude “Denmark” and “Spain” from study sites. If so, users can explicitly incorporate this constraint as follows.

out <- sps(X = X_Imm, N_s = 6, site_exclude = c("Denmark", "Spain"))
## Selecting Study Sites...
out$selected_sites
## [1] "Switzerland"    "Czechia"        "France"         "United Kingdom"
## [5] "Ireland"        "Norway"

Of course, users can incorporate both constraints together.

out <- sps(X = X_Imm, N_s = 6, 
           site_include = c("Sweden"), 
           site_exclude = c("Denmark", "Spain"))
## Selecting Study Sites...
out$selected_sites
## [1] "Belgium"     "Switzerland" "Czechia"     "France"      "Ireland"    
## [6] "Sweden"