

On this page, we answer frequently asked questions.



How Many Documents Should Experts Annotate?

In practice, researchers might wonder how many documents they need to have experts annotate. To help researchers in this common scenario, we developed a data-driven power analysis: after annotating a small number of documents, we can predict how many more documents researchers need to annotate to achieve a user-specified standard error.

As an example, we use the data introduced on the Get Started page.

library(dsl)
data("PanChen")
head(PanChen)
##   countyWrong pred_countyWrong SendOrNot prefecWrong connect2b prevalence
## 1           0                0         0           0         1          0
## 2           0                1         0           0         0          0
## 3           1                0         0           0         0          0
## 4          NA                0         0           0         1          0
## 5          NA                0         0           1         0          0
## 6          NA                0         0           0         1          0
##   regionj groupIssue
## 1       0          1
## 2       0          1
## 3       0          1
## 4       0          1
## 5       0          1
## 6       0          1

In particular, users can apply the function power_dsl() to predict standard errors at different sizes of expert-coded data. Researchers can use the same arguments as in dsl() and add one argument, labeled_size, which specifies the numbers of expert-coded documents for which standard errors are predicted. Recall that the current number of expert-coded documents is 500.

power_out <- power_dsl(labeled_size = c(600, 700, 800, 900, 1000),
                       model = "logit", 
                       formula = SendOrNot ~ countyWrong + prefecWrong + 
                         connect2b + prevalence + regionj + groupIssue,
                       predicted_var = "countyWrong",
                       prediction = "pred_countyWrong",
                       data = PanChen)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
## Power Analysis: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(power_out)
## 
## ==========================
## Predicted Standard Errors:
## ==========================
##      (Intercept) countyWrong prefecWrong connect2b prevalence   regionj
## 500    0.3621287   0.2229501   0.2969601 0.1196955  0.1519671 0.4565481
## 600    0.3616287   0.2045568   0.2962946 0.1189752  0.1514800 0.4563130
## 700    0.3612711   0.1903336   0.2958184 0.1184581  0.1511311 0.4561449
## 800    0.3610027   0.1789257   0.2954606 0.1180687  0.1508689 0.4560189
## 900    0.3607938   0.1695230   0.2951821 0.1177650  0.1506646 0.4559208
## 1000   0.3606266   0.1616074   0.2949591 0.1175214  0.1505010 0.4558423
##      groupIssue
## 500   0.3597029
## 600   0.3596288
## 700   0.3595758
## 800   0.3595361
## 900   0.3595052
## 1000  0.3594804
## ---

The first row shows the current standard errors of the coefficients (with 500 expert-coded documents). The remaining rows show the predicted standard errors at each value of labeled_size.

Importantly, as we increase the number of expert annotations for countyWrong, the standard error for countyWrong is predicted to decrease sharply. We do not see a similarly sharp reduction for the other variables because they are already observed for the entire data set; the small reductions in their standard errors come from correlations with countyWrong.

Researchers can use the function plot() to visualize the power analysis for a given coefficient.

plot(power_out, coef_name = "countyWrong")

Finally, researchers can also apply power_dsl() directly to the output of function dsl().

out <- dsl(model = "logit", 
           formula = SendOrNot ~ countyWrong + prefecWrong + 
             connect2b + prevalence + regionj + groupIssue,
           predicted_var = "countyWrong",
           prediction = "pred_countyWrong",
           data = PanChen)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
power_out2 <- power_dsl(dsl_out = out, 
                        labeled_size = c(600, 700, 800, 900, 1000))
## 
## Power Analysis: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(power_out2)
## 
## ==========================
## Predicted Standard Errors:
## ==========================
##      (Intercept) countyWrong prefecWrong connect2b prevalence   regionj
## 500    0.3621287   0.2229501   0.2969601 0.1196955  0.1519671 0.4565481
## 600    0.3616287   0.2045568   0.2962946 0.1189752  0.1514800 0.4563130
## 700    0.3612711   0.1903336   0.2958184 0.1184581  0.1511311 0.4561449
## 800    0.3610027   0.1789257   0.2954606 0.1180687  0.1508689 0.4560189
## 900    0.3607938   0.1695230   0.2951821 0.1177650  0.1506646 0.4559208
## 1000   0.3606266   0.1616074   0.2949591 0.1175214  0.1505010 0.4558423
##      groupIssue
## 500   0.3597029
## 600   0.3596288
## 700   0.3595758
## 800   0.3595361
## 900   0.3595052
## 1000  0.3594804
## ---

We can see that power_out2 is identical to power_out.
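One way to confirm this is to compare the printed summaries, which capture.output() from base R records as character vectors:

# The two power analyses produce the same predicted standard errors.
identical(capture.output(summary(power_out)),
          capture.output(summary(power_out2)))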


What Types of Predicted Variables Can DSL Handle?

As we have shown in various examples on the Get Started page and in Models Available in DSL, users can incorporate predicted variables (i.e., text-based variables) as the outcome and/or as independent variables.

For example, when users have text-based variables as both the outcome and independent variables, they can specify them as a vector in predicted_var.

data("data_logit") # example data 

head(data_logit)
##    Y         X1 pred_Y    pred_X1         X2          X3          X4
## 1  0 -0.1443742      0  0.9486148 -0.2079232 -0.06932595  1.65748976
## 2  1 -1.2864079      1 -1.8766233 -1.6961906 -0.27274691  1.54678252
## 3  1 -1.4665386      1 -2.4866911 -0.6212224 -0.70787173 -0.27913330
## 4  0 -1.3554115      1  0.5138155  1.2299990 -0.86519213  0.18177847
## 5 NA         NA      1 -2.1116582  0.6624512  1.19504665 -0.02724476
## 6 NA         NA      1 -0.6391120 -0.8971608  1.20072203 -0.51241357
##            X5          X6         X7        X8           X9        X10
## 1  0.96825927  0.05389556 -2.0762690 0.2653279  1.458743948  0.4127003
## 2 -0.46626141  0.73182950 -1.7677551 1.0790758 -0.003172833 -0.5313550
## 3 -0.26973904 -1.75925076 -0.1585703 0.6823146  0.050500221 -1.2679467
## 4  0.41947288 -0.32375937  0.3226157 0.5131867 -2.065367939  1.6111660
## 5 -0.94927647  1.01808653  0.7608077 1.6830265 -1.907959205 -0.6979879
## 6  0.09162689  1.15086391 -0.7917356 0.9765960 -0.144924483 -1.2848539
out_logit <- dsl(model = "logit", 
                 formula = Y ~ X1 + X2 + X4,
                 predicted_var = c("Y", "X1"),
                 prediction = c("pred_Y", "pred_X1"),
                 data = data_logit)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..

When there is more than one variable in predicted_var, rows in which all variables in predicted_var are observed are counted as labeled, and rows with at least one NA among those variables are counted as non-labeled.
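For instance, the labeled rows in the example above can be counted with base R's complete.cases():

# Rows where both predicted variables (Y and X1) are observed count as labeled;
# a single NA among them makes the row non-labeled.
labeled <- complete.cases(data_logit[, c("Y", "X1")])
table(labeled)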


(More Complex) Sampling Strategies?

Researchers might consider sampling strategies for expert annotations other than random sampling with equal probabilities. As long as the sampling probability for each document is decided by the researchers and is greater than zero, DSL is applicable. For example, DSL allows stratified or block sampling schemes (i.e., change the sampling probability of documents based on document-level observed covariates) and can cover any case where the sampling probability depends on the LLM annotation, document-level covariates, independent variables, or the outcome variable. This generality is important because researchers might want to over-sample documents that are difficult to annotate.
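As a minimal sketch (with hypothetical data), a stratified design like the one in the example data below might be constructed as follows:

# Hypothetical corpus: over-sample documents with X1 = 1 for expert annotation.
set.seed(1234)
docs <- data.frame(X1 = rbinom(1000, 1, 0.5))
docs$sample_prob <- ifelse(docs$X1 == 1, 0.3, 0.1)       # 30% vs 10%
docs$to_label <- runif(nrow(docs)) < docs$sample_prob    # documents sent to experts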

When using these more complex sampling strategies in function dsl(), users should specify the sampling probability of each document. For example, in the following data set, the sampling probability for expert annotation is 30% for documents with X1 = 1 and 10% for documents with X1 = 0. As you can see below, the variable sample_prob is 0.3 when X1 = 1 and 0.1 when X1 = 0; this is therefore random sampling with unequal probabilities.

data("data_unequal") # example data 
head(data_unequal)
##    Y pred_Y sample_prob X1         X2           X3
## 1  0      1         0.1  0 -0.7075315  0.774843489
## 2  0      0         0.1  0 -0.1282225 -0.351258120
## 3  0      1         0.3  1 -0.9908576  0.733213596
## 4 NA      1         0.3  1 -1.1068022 -0.403284103
## 5 NA      1         0.1  0  0.2493619  0.510531594
## 6 NA      1         0.1  0 -1.0298485  0.004936994

When running dsl(), users just need to specify this unequal sampling probability in the argument sample_prob.

out_unequal <- dsl(model = "logit", 
                   formula = Y ~ X1 + X2 + X3,
                   predicted_var = c("Y"),
                   prediction = c("pred_Y"),
                   sample_prob = "sample_prob",
                   data = data_unequal)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_unequal)
## ==================
## DSL Specification:
## ==================
## Model:  logit
## Call:  Y ~ X1 + X2 + X3
## 
## Predicted Variables:  Y
## Prediction:  pred_Y
## 
## Number of Labeled Observations:  334
## Random Sampling for Labeling with Equal Probability:  No
## (Sampling probabilities are defined in `sample_prob`)
## 
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value  
## (Intercept)  -0.2152     0.1185  -0.4475   0.0171  0.0347 *
## X1            0.1055     0.3351  -0.5512   0.7622  0.3764  
## X2            0.1072     0.1087  -0.1058   0.3203  0.1619  
## X3           -0.0453     0.1159  -0.2725   0.1819  0.3480  
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.

When users rely on random sampling with unequal probabilities, the summary reports Random Sampling for Labeling with Equal Probability: No.


How Can We Implement Cluster Standard Errors in DSL?

Users need to supply the name of a variable that defines clusters to the argument cluster in function dsl(). This option is available for all model types (lm, logit, and felm).

data("data_felm") # example data 

out_felm_one <- dsl(model = "felm", 
                    formula = log_pcap ~ log_gsp + log_pc + unemp,
                    predicted_var =  "log_gsp",
                    prediction = "pred_log_gsp",
                    fixed_effect = "oneway", 
                    index = c("state"), 
                    cluster = "state",
                    data = data_felm)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_felm_one)
## ==================
## DSL Specification:
## ==================
## Model:  felm (oneway)
## Call:  log_pcap ~ log_gsp + log_pc + unemp
## Fixed Effects:  state
## 
## Predicted Variables:  log_gsp
## Prediction:  pred_log_gsp
## 
## Number of Labeled Observations:  334
## Random Sampling for Labeling with Equal Probability: Yes
## 
## =============
## Coefficients:
## =============
##         Estimate Std. Error CI Lower CI Upper p value    
## log_gsp   0.0031     0.0020  -0.0007   0.0069  0.0563   .
## log_pc    0.5357     0.0419   0.4537   0.6178  0.0000 ***
## unemp     0.0067     0.0021   0.0025   0.0108  0.0008 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
## Standard errors are clustered by state.

When users apply clustered standard errors, the level of clustering is reported below the Coefficients table, e.g., Standard errors are clustered by state.


What if the Unit of Analysis is not the same as the Unit of Labeling?

In some applications, the unit of analysis is not the same as the unit of labeling. Researchers often annotate each document, but the unit of analysis might be some aggregates of documents. For example, users might code whether each online post by political candidates mentions an economic policy: the unit of labeling is at the online post level. But researchers might be interested in how the proportion of posts mentioning an economic policy varies between different candidates: the unit of analysis is at the candidate level. Here, each candidate has multiple posts, and the main text-based variable is defined as the proportion of online posts mentioning an economic policy for each candidate.

In such applications, how can users apply DSL? We provide a step-by-step guide here.
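To fix ideas, here is a minimal sketch (with hypothetical post-level data) of how post-level annotations aggregate to the candidate level:

# Hypothetical post-level data: one row per online post.
posts <- data.frame(candidate = c("A", "A", "B", "B", "B"),
                    mentions_econ = c(1, 0, 1, 1, 0))
# Candidate-level variable: proportion of posts mentioning economic policy.
aggregate(mentions_econ ~ candidate, data = posts, FUN = mean)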


How to use DSL for Estimating the Category Proportions over Time or across Groups?

Many scholars are interested in estimating the proportion of all documents in each user-specified category. For example, we might study how the proportion of censored documents changes over time, or how the proportion of social media posts containing hate speech differs across groups, such as Democrats and Republicans. These questions can be analyzed within the DSL framework, too.

We start with the basic data structure for dsl. In this data, Y denotes whether a document belongs to a category of interest, and pred_Y denotes the predicted value. We estimate the category proportions over time here as an example, but the same approach applies to estimating the category proportions across groups.

data("data_time")
head(data_time)
##    Y pred_Y year
## 1  1      1 2001
## 2  1      0 2001
## 3  0      1 2001
## 4 NA      1 2001
## 5 NA      0 2001
## 6 NA      0 2001

To estimate the category proportions over time, researchers can regress the text category on year indicators using linear regression without an intercept. This is numerically equivalent to computing the DSL estimates within each year separately. To implement this, researchers simply add -1 to formula to remove the intercept. Note that this approach relies on the numerical equivalence between linear regression (lm) and subgroup means (just as scholars use linear regression to compute differences-in-means in randomized experiments), so researchers do not need to use logistic regression (logit) even though the category is binary.
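This equivalence is easy to verify in base R on the labeled rows alone: the coefficients from an intercept-free regression equal the year-specific means (a quick illustration, not part of the DSL estimation itself).

# Base-R illustration of the equivalence, using only the labeled rows.
labeled <- !is.na(data_time$Y)
coef(lm(Y ~ as.factor(year) - 1, data = data_time[labeled, ]))
tapply(data_time$Y[labeled], data_time$year[labeled], mean)  # same values

The same formula inside dsl() yields the DSL estimates: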

out_time <- dsl(model = "lm",
                formula = Y ~ as.factor(year) - 1,
                predicted_var = "Y",
                prediction = "pred_Y",
                data = data_time)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_time)
## ==================
## DSL Specification:
## ==================
## Model:  lm
## Call:  Y ~ as.factor(year) - 1
## 
## Predicted Variables:  Y
## Prediction:  pred_Y
## 
## Number of Labeled Observations:  500
## Random Sampling for Labeling with Equal Probability: Yes
## 
## =============
## Coefficients:
## =============
##                     Estimate Std. Error CI Lower CI Upper p value    
## as.factor(year)2001   0.6447     0.0652   0.5169   0.7725   0e+00 ***
## as.factor(year)2002   0.6233     0.0653   0.4953   0.7514   0e+00 ***
## as.factor(year)2003   0.2959     0.0808   0.1376   0.4542   1e-04 ***
## as.factor(year)2004   0.4441     0.0693   0.3084   0.5799   0e+00 ***
## as.factor(year)2005   0.6290     0.0722   0.4876   0.7705   0e+00 ***
## as.factor(year)2006   0.4278     0.0727   0.2853   0.5702   0e+00 ***
## as.factor(year)2007   0.5710     0.0766   0.4209   0.7212   0e+00 ***
## as.factor(year)2008   0.6704     0.0674   0.5382   0.8026   0e+00 ***
## as.factor(year)2009   0.4957     0.0639   0.3704   0.6210   0e+00 ***
## as.factor(year)2010   0.4760     0.0704   0.3380   0.6139   0e+00 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.

Here, the coefficient on each year indicator is the estimated category proportion for that year.
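The same approach applies across groups. As a sketch (assuming a hypothetical data set data_group that contains Y, pred_Y, and a grouping variable party), the coefficients would recover the per-group proportions:

# Hypothetical data: `party` plays the role that `year` played above.
out_group <- dsl(model = "lm",
                 formula = Y ~ as.factor(party) - 1,
                 predicted_var = "Y",
                 prediction = "pred_Y",
                 data = data_group)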


How Can We Internally Implement Supervised Machine Learning for Predictions?

One of the most common use cases of the package is when researchers have LLM annotations as predictions of key text-based variables defined in predicted_var.

Alternatively, researchers can internally implement classical supervised machine learning methods to make predictions using the expert-coded data.

Building on the SuperLearner package, we offer the following 40 supervised ML methods.

##  [1] "grf"              "bartMachine"      "bayesglm"         "biglasso"        
##  [5] "caret"            "caret.rpart"      "cforest"          "earth"           
##  [9] "gam"              "gbm"              "glm"              "glm.interaction" 
## [13] "glmnet"           "ipredbagg"        "kernelKnn"        "knn"             
## [17] "ksvm"             "lda"              "leekasso"         "lm"              
## [21] "loess"            "logreg"           "mean"             "nnet"            
## [25] "nnls"             "polymars"         "qda"              "randomForest"    
## [29] "ranger"           "ridge"            "rpart"            "rpartPrune"      
## [33] "speedglm"         "speedlm"          "step"             "step.forward"    
## [37] "step.interaction" "stepAIC"          "svm"              "xgboost"

When researchers internally implement classical supervised machine learning methods, the data structure looks like that used in classical supervised machine learning. In this example, Y requires text annotation, and 10 variables are available to predict Y.

data("data_sl")  # example data 
head(data_sl)
##            Y          X1          X2         X3          X4         X5
## 1 -0.2050247 -0.09910396 -0.87960666  0.8015421  1.29028152  0.8532547
## 2 -1.4856037  1.23123941  1.99007602  0.3868839 -0.07223693 -1.2962462
## 3  0.9894281  0.46408759  0.42323988 -0.1097634  0.25241778 -0.9464925
## 4         NA -1.12227958 -0.06897893 -0.4601287  1.49975679  1.5618861
## 5         NA -0.67877168 -0.33290315 -1.4747134 -1.89504047  0.5137869
## 6         NA  0.04734624 -0.18367264 -1.1965101  0.11178831  0.1975828
##           X6         X7          X8         X9        X10
## 1 -0.4914913 -1.0640511  0.01688910  1.8391539 -2.3456977
## 2 -1.2428339  0.0644001  1.92829279 -1.3745543 -0.5644520
## 3  0.8812547  1.5719390 -1.31725336  0.9007558  0.4595894
## 4  0.2493619  0.5105316  0.02339191 -0.3850724 -1.2070657
## 5  0.2394639 -1.4016767 -0.30634290  1.5934046  0.4291247
## 6 -1.6337523 -0.3078537 -0.51741473 -1.1128772 -0.5747400

To internally implement supervised machine learning methods, users have to choose a supervised machine learning model (specified in sl_method) and predictors (specified in feature). In formula, researchers then specify the downstream statistical model they run after predicting Y. In this example, the main downstream analysis is a linear regression where the outcome is Y and the independent variables are X1, X2, and X4.

out_lm_grf <- dsl(model = "lm", 
                  formula = Y ~ X1 + X2 + X4,
                  predicted_var = c("Y"),
                  sl_method = "grf",
                  feature =  c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10"),
                  data = data_sl)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..

Here, we set sl_method = "grf" to use generalized random forest as the supervised machine learning model.

summary(out_lm_grf)
## ==================
## DSL Specification:
## ==================
## Model:  lm
## Call:  Y ~ X1 + X2 + X4
## 
## Predicted Variables:  Y
## Prediction:  with predictors specified in `covariates`
## 
## Number of Labeled Observations:  500
## Random Sampling for Labeling with Equal Probability: Yes
## 
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value    
## (Intercept)   0.0299     0.0497  -0.0676   0.1273  0.2741    
## X1            0.3980     0.0534   0.2934   0.5026  0.0000 ***
## X2           -0.2432     0.0496  -0.3405  -0.1460  0.0000 ***
## X4           -0.2711     0.0498  -0.3688  -0.1734  0.0000 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.

Finally, many existing methods that require stringent assumptions about prediction errors also impose strict rules about whether variables used in the prediction stage can overlap with variables in the main analyses. Fortunately, researchers using DSL do not need to worry about these issues because DSL does not require any assumptions about prediction errors. This is why the variables in the main statistical analysis (X1, X2, and X4) can also be included in the prediction step.