Skip to contents


On this page, we provide the introduction to dsl package.


Table of Contents
  • Overview
    We provide the overview of dsl package.

  • dsl: DSL Estimator
    We explain how to use function dsl() to use predicted variables in statistical analyses.


DSL is a general estimation framework for using predicted variables in statistical analyses. The package is especially useful for researchers trying to use large language models (LLMs) to annotate a large number of documents they analyze subsequently. DSL allows users to obtain statistically valid estimates and standard errors, even when LLM annotations contain arbitrary non-random prediction errors and biases.

Overview

In text-as-data applications, one of the most common tasks is text annotation (or text classification) to generate text-based variables for subsequent statistical analyses. For example, researchers might first annotate whether each online post contains hate speech such that they can later study who are more likely to post hate speech and what types of interventions can reduce hate speech. Over the last decade, scholars used a variety of supervised machine learning (ML) methods to automate this text annotation step by training machines to mimic expert-coding. More recently, a growing number of papers propose using large language models (LLMs), such as ChatGPT, to automate text annotations by predicting expert annotations. Given that researchers can adapt LLMs to perform a wide range of text annotation tasks by simply changing prompts, automated LLM annotations present exciting opportunities for the social sciences.

While text annotation is essential, it is only the first step. Social scientists are often primarily interested in using text labels predicted by automated methods as key variables in subsequent statistical analyses. In the vast majority of current applications, researchers treat predicted text-based variables as if they were observed without any error: they ignore prediction errors in the first step of automated text annotation. However, ignoring such prediction errors in the first step of text annotation, even if errors are small, leads to substantial bias, invalid confidence intervals, and wrong p-values in downstream statistical analyses of text-based variables. Biases from prediction errors exist even when the prediction accuracy in the text classification step is extremely high, e.g., above 90% or even at 95%. This is because prediction errors are not random—prediction errors are correlated with observed and unobserved variables we include in downstream analyses. In practice, this means that substantive and statistical conclusions can easily flip if researchers ignore prediction errors in automated text annotation methods.

Design-based supervised learning (DSL) is a general framework for using predicted variables in downstream statistical analyses without suffering from bias due to prediction errors. Unlike the existing approaches, DSL allows researchers to obtain statistically valid estimates and standard errors, even when automated text annotation methods have arbitrary non-random prediction errors. To do so, DSL combines large-scale (potentially biased) automated annotations and a smaller number of high-quality expensive expert annotations using a doubly robust bias-correction step. Please read Egami, Hinck, Stewart, and Wei (2023) for the methodological details.


dsl: DSL Estimator

As an example, we use Pan and Chen (2018), which studies whether online complaints accusing of corruption by local officials are reported to upper-level governments in China. In this example, variable countyWrong (whether a post accuses of corruption by county-level politicians) requires text annotation. We randomly sampled a subset of the data (500 documents) to provide expert annotations for this variable.

library(dsl)
data("PanChen")
head(PanChen)
##   countyWrong pred_countyWrong SendOrNot prefecWrong connect2b prevalence
## 1           0                0         0           0         1          0
## 2           0                1         0           0         0          0
## 3           1                0         0           0         0          0
## 4          NA                0         0           0         1          0
## 5          NA                0         0           1         0          0
## 6          NA                0         0           0         1          0
##   regionj groupIssue
## 1       0          1
## 2       0          1
## 3       0          1
## 4       0          1
## 5       0          1
## 6       0          1

Variable countyWrong represents expert labels, and it takes NA if a given observation is not labeled by experts.

Variable pred_countyWrong represents text labels predicted by a user-specified automated text annotation method, e.g., annotations given by a user-specified LLM. In this example, variable pred_countyWrong represent annotations given by GPT-4. Unlike variable countyWrong, variable pred_countyWrong is available for the entire data.

The remaining six columns represent other key variables included in the main statistical analyses.

If users include pred_countyWrong instead of countyWrong in the downstream analyses, they will suffer from biases due to prediction errors. If users worry about prediction errors and only use a subset of the data that have expert-coded countyWrong, they have to miss a large number of observations.

Researchers can use dsl to perform their main statistical analyses by analyzing all the data while taking into account prediction errors. In this example, we are interested in running a logistic regression model regressing binary outcome SendOrNot on a set of independent variables, including countyWrong.

out <- dsl(model = "logit", 
           formula = SendOrNot ~ countyWrong + prefecWrong + 
             connect2b + prevalence + regionj + groupIssue,
           predicted_var = "countyWrong",
           prediction = "pred_countyWrong",
           data = PanChen)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
Argument Description
model A regression model. dsl currently supports lm (linear regression), logit (logistic regression), and felm (fixed-effects regression).
formula A formula used in the specified regression model.
predicted_var A vector of column names in the data that correspond to variables that need to be predicted.
prediction A vector of column names in the data that correspond to predictions of predicted_var.
data A data frame. The class should be data.frame.

Researchers can obtain the summary of output using function summary().

summary(out)
## ==================
## DSL Specification:
## ==================
## Model:  logit
## Call:  SendOrNot ~ countyWrong + prefecWrong + connect2b + prevalence + regionj + groupIssue
## 
## Predicted Variables:  countyWrong
## Prediction:  pred_countyWrong
## 
## Number of Labeled Observations:  500
## Random Sampling for Labeling with Equal Probability: Yes
## 
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value    
## (Intercept)   2.0979     0.3621   1.3882   2.8077  0.0000 ***
## countyWrong  -0.2613     0.2230  -0.6983   0.1757  0.1206    
## prefecWrong  -1.1163     0.2970  -1.6983  -0.5342  0.0001 ***
## connect2b    -0.0789     0.1197  -0.3135   0.1557  0.2550    
## prevalence   -0.3269     0.1520  -0.6247  -0.0290  0.0157   *
## regionj       0.1254     0.4565  -0.7694   1.0203  0.3917    
## groupIssue   -2.3224     0.3597  -3.0274  -1.6174  0.0000 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.

It first outputs the specification of the DSL estimation, including the number of labeled observations and the sampling process for expert annotations. When users do not specify explicitly, function dsl() assumes random sampling with equal probabilities. If users deployed a sampling strategy other than random sampling with equal probabilities, they have to explicitly specify it using argument sample_prob when fitting dsl (please see this page).

As for coefficients, researchers can interpret estimates and standard errors as in the usual regression analyses. The package implements the hetroskedasticity-robust standard errors by default. When users want to compute cluster-robust standard errors, they can use argument cluster, which we explain more here.

The package supports different types of regression models: linear regression (lm), logistic regression (logit), and linear fixed-effects regression (felm). To learn more about each model, please visit Models Available in DSL.

To help users apply DSL in various practical settings, we answer frequently asked questions at Frequently Asked Questions, including how to determine the required number of expert annotations, how to incorporate more complex sampling strategies for expert annotations, and how to handle cluster standard errors.

References:

  • Egami, Hinck, Stewart, and Wei. (2024). “Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses.”

  • Egami, Hinck, Stewart, and Wei. (2023). Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models, Advances in Neural Information Processing Systems (NeurIPS).

  • Pan and Chen. (2018). Concealing Corruption: How Chinese Officials Distort Upward Reporting of Online Grievances. American Political Science Review 112, 3, 602–620.