What if the Unit of Analysis is not the same as the Unit of Labeling?
On this page, we explain how to set up data for `dsl` when the unit of analysis is not the same as the unit of labeling.
Overview
In some applications, the unit of analysis is not the same as the unit of labeling. Researchers often annotate each document, but the unit of analysis might be some aggregates of documents. For example, users might code whether each online post by political candidates mentions an economic policy: the unit of labeling is at the online post level. But researchers might be interested in how the proportion of posts mentioning an economic policy varies between different candidates: the unit of analysis is at the candidate level. Here, each candidate has multiple posts, and the main text-based variable is defined as the proportion of online posts mentioning an economic policy for each candidate.
In such applications, how can users apply DSL? In general, DSL should be applied at the unit of analysis (i.e., to the candidate-level data). How, then, can we prepare candidate-level data from expert-coded post-level data? We need two simple steps: (1) aggregating the expert coding and (2) aggregating the sampling probabilities.
1. Aggregating Expert-Coding
First, we discuss how to aggregate expert coding from the post level up to the candidate level.
We start with the post-level data.
##   cand_id post_id economy pred_economy post_sample_prob
## 1       1     1_1       0            0              0.3
## 2       1     1_2       0            1              0.3
## 6       2     2_1      NA            0              0.3
## 7       2     2_2      NA            1              0.3
## 3       3     3_1       1            1              0.3
## 4       3     3_2       0            0              0.3
## 5       3     3_3      NA            0              0.3
In this data, the variable `economy` contains expert-coded labels indicating whether a given post mentions an economic policy. The first candidate has two posts, both labeled, so we can compute the proportion of posts mentioning an economic policy as 0. For the second candidate, none of the posts are labeled, so this candidate is unlabeled. For the third candidate, two of the three posts are labeled. As long as posts are randomly selected for expert annotation, researchers can use the mean of the labeled posts to estimate the proportion of posts mentioning an economic policy, which is 0.5 for this candidate.
The variable `pred_economy` contains the labels predicted by a user-specified automated text annotation method. The variable `post_sample_prob` is the post-level sampling probability for labeling. In this example, every post has the same 30% probability of being sampled.
In general, when posts are randomly selected for expert annotation, users can compute the expert-coded `mean_economy` variable at the candidate level by ignoring `NA`s and computing the proportion of posts mentioning an economic policy among the labeled posts only. Candidates without any labeled posts are recorded as unlabeled in the candidate-level data. For `pred_economy`, users can directly compute the mean within each candidate.
Users can implement this step with the `group_by()` and `summarize()` functions in `tidyverse`, or with `tapply()`.
library(tidyverse)
# Aggregate post-level labels and predictions to the candidate level
economy_cand <- data_post %>%
  group_by(cand_id) %>%
  summarize(mean_economy = mean(economy, na.rm = TRUE),
            mean_pred_economy = mean(pred_economy, na.rm = TRUE))
# Turn NaN to NA (get ready for `dsl`)
economy_cand$mean_economy[is.nan(economy_cand$mean_economy)] <- NA
head(economy_cand)
## # A tibble: 6 × 3
## cand_id mean_economy mean_pred_economy
## <dbl> <dbl> <dbl>
## 1 1 0 0.5
## 2 2 NA 0.5
## 3 3 0.5 0.333
## 4 4 0 0.5
## 5 5 0 0.25
## 6 6 NA 0.667
At the candidate level, the variable `mean_economy` is the expert-coded proportion of posts mentioning an economic policy, and the variable `mean_pred_economy` is the predicted proportion of posts mentioning an economic policy.
Here, we are using the mean to aggregate posts within each candidate, but users can use any function of their choice to define their text-based variable. In general, as long as posts are randomly selected for expert annotations, users simply need to apply their definition of text-based variables for labeled documents (ignoring non-labeled documents) and treat candidates that have no labeled posts as non-labeled.
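The aggregation in this section can also be sketched in base R with `tapply()`. The snippet below is a minimal illustration that rebuilds the seven post-level rows printed earlier as a toy data frame; the names `toy_post` and `toy_cand` are ours and are not part of the package.

```r
# Toy post-level data copied from the seven rows printed earlier in this section
toy_post <- data.frame(
  cand_id      = c(1, 1, 2, 2, 3, 3, 3),
  economy      = c(0, 0, NA, NA, 1, 0, NA),
  pred_economy = c(0, 1, 0, 1, 1, 0, 0)
)

# Mean of the labeled posts within each candidate (NA labels are ignored)
mean_economy <- tapply(toy_post$economy, toy_post$cand_id, mean, na.rm = TRUE)
# Candidates with no labeled posts yield NaN; recode them to NA
mean_economy[is.nan(mean_economy)] <- NA

# Predictions exist for every post, so a plain mean is enough
mean_pred_economy <- tapply(toy_post$pred_economy, toy_post$cand_id, mean)

toy_cand <- data.frame(
  cand_id           = as.numeric(names(mean_economy)),
  mean_economy      = as.vector(mean_economy),
  mean_pred_economy = as.vector(mean_pred_economy)
)
toy_cand
```

For these three candidates, the result should agree with the first three rows of the `tidyverse` output above.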
2. Aggregating Sampling Probabilities
Users also need to translate the sampling probability from the post-level to the candidate-level.
In particular, the candidate-level sampling probability is the probability that at least one document of that candidate is expert-coded. When users randomly sample documents, they need to take into account that candidates have different numbers of posts: a candidate with more posts is more likely to have at least one expert-coded document.
To compute this properly, users can rely on the following simple code. Mathematically, it computes the probability of having at least one expert annotation for each candidate, treating the posts as sampled independently.
cand_sample_prob <- data_post %>%
  group_by(cand_id) %>%
  summarize(cand_sample_prob = 1 - prod(1 - post_sample_prob))
head(cand_sample_prob)
## # A tibble: 6 × 2
## cand_id cand_sample_prob
## <dbl> <dbl>
## 1 1 0.51
## 2 2 0.51
## 3 3 0.657
## 4 4 0.760
## 5 5 0.760
## 6 6 0.657
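As a check on these numbers, the calculation can be written out explicitly. The notation here is ours and assumes posts are sampled independently: $\pi_{cj}$ is the sampling probability of post $j$ of candidate $c$, $n_c$ is the number of posts by candidate $c$, and $\pi_c$ is the resulting candidate-level sampling probability.

```latex
\pi_c = 1 - \prod_{j=1}^{n_c} \left(1 - \pi_{cj}\right),
\qquad \text{and with } \pi_{cj} = 0.3 \text{ for all posts:} \qquad
\begin{aligned}
n_c = 2 &: \; 1 - 0.7^2 = 0.51 \\
n_c = 3 &: \; 1 - 0.7^3 = 0.657 \\
n_c = 4 &: \; 1 - 0.7^4 = 0.7599 \approx 0.760
\end{aligned}
```

The values 0.51 and 0.657 match candidates 1 to 3, who have two, two, and three posts in the data shown earlier; the value 0.760 is what a candidate with four posts would receive.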
If users already know that their main statistical analyses are at the candidate level, they can consider two-stage sampling, i.e., randomly sampling candidates first and then randomly sampling posts within each sampled candidate. Two-stage sampling makes the calculation of the candidate-level sampling probabilities even simpler: users only need the first-stage probability (the probability of sampling each candidate for expert annotation).
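A two-stage design could look roughly like the following base R sketch. Everything in it is hypothetical and for illustration only: the toy data, the first-stage probability of 0.5, and the choice to label exactly one post per sampled candidate (which guarantees that every sampled candidate ends up with at least one label).

```r
set.seed(1)

# Hypothetical toy post-level data: 5 candidates with 2 to 4 posts each
n_posts  <- c(2, 2, 3, 4, 4)
toy_post <- data.frame(cand_id = rep(1:5, times = n_posts))
toy_post$post_id <- paste(toy_post$cand_id, sequence(n_posts), sep = "_")

# Stage 1: sample each candidate independently with probability 0.5 (assumed)
cand_prob    <- 0.5
cand_ids     <- unique(toy_post$cand_id)
cand_sampled <- cand_ids[rbinom(length(cand_ids), 1, cand_prob) == 1]

# Stage 2: within each sampled candidate, pick one post at random to label
toy_post$labeled <- 0
for (id in cand_sampled) {
  rows <- which(toy_post$cand_id == id)
  pick <- rows[sample.int(length(rows), 1)]
  toy_post$labeled[pick] <- 1
}

# The candidate-level sampling probability is just the first-stage probability
toy_cand <- data.frame(cand_id = cand_ids, cand_sample_prob = cand_prob)
toy_cand
```

Under this design, the number of posts per candidate no longer enters the candidate-level sampling probability, which is why only the first stage matters.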
3. Merging Other Candidate-level Data
To prepare the final candidate-level data, we combine the two data sets created above with any other candidate-level data.
# First we merge (mean_economy, mean_pred_economy) and cand_sample_prob
cand_main <- merge(economy_cand, cand_sample_prob, by = "cand_id")
Then, users can merge in other candidate-level data, such as covariates.
data("data_cand") # example candidate-level data
data_cand_merged <- merge(cand_main, data_cand, by = "cand_id")
head(data_cand_merged)
##   cand_id mean_economy mean_pred_economy cand_sample_prob gender edu ideology
## 1       1          0.0         0.5000000           0.5100      0   2        5
## 2       2           NA         0.5000000           0.5100      0   4        3
## 3       3          0.5         0.3333333           0.6570      1   4        2
## 4       4          0.0         0.5000000           0.7599      1   4        3
## 5       5          0.0         0.2500000           0.7599      0   5        4
## 6       6           NA         0.6666667           0.6570      1   4        3
4. Fitting DSL with the Candidate-level Data
Finally, run `dsl` on the candidate-level data, specifying the candidate-level sampling probability.
out_agg <- dsl(model = "lm",
               formula = mean_economy ~ gender + edu + ideology,
               predicted_var = "mean_economy",
               prediction = "mean_pred_economy",
               sample_prob = "cand_sample_prob",
               data = data_cand_merged)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_agg)
## ==================
## DSL Specification:
## ==================
## Model: lm
## Call: mean_economy ~ gender + edu + ideology
##
## Predicted Variables: mean_economy
## Prediction: mean_pred_economy
##
## Number of Labeled Observations: 674
## Random Sampling for Labeling with Equal Probability: No
## (Sampling probabilities are defined in `sample_prob`)
##
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value
## (Intercept)   0.4056     0.0696   0.2692   0.5420  0.0000 ***
## gender       -0.0101     0.0348  -0.0782   0.0581  0.3861
## edu           0.0005     0.0128  -0.0246   0.0257  0.4834
## ideology     -0.0090     0.0134  -0.0352   0.0172  0.2494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.