# Get Started with dsl

`intro.Rmd`

**On this page, we provide an introduction to the `dsl` package.**

**Table of Contents**

- **Overview**: We provide an overview of the `dsl` package.
- **`dsl`: DSL Estimator**: We explain how to use the function `dsl()` to use predicted variables in statistical analyses.

DSL is a general estimation framework for using predicted variables in statistical analyses. The package is especially useful for researchers trying to use large language models (LLMs) to annotate a large number of documents they analyze subsequently. DSL allows users to obtain statistically valid estimates and standard errors, even when LLM annotations contain arbitrary non-random prediction errors and biases.

### Overview

In text-as-data applications, one of the most common tasks is text annotation (or text classification) to generate text-based variables for subsequent statistical analyses. For example, researchers might first annotate whether each online post contains hate speech such that they can later study who are more likely to post hate speech and what types of interventions can reduce hate speech. Over the last decade, scholars used a variety of supervised machine learning (ML) methods to automate this text annotation step by training machines to mimic expert-coding. More recently, a growing number of papers propose using large language models (LLMs), such as ChatGPT, to automate text annotations by predicting expert annotations. Given that researchers can adapt LLMs to perform a wide range of text annotation tasks by simply changing prompts, automated LLM annotations present exciting opportunities for the social sciences.

While text annotation is essential, it is only the first step. Social
scientists are often primarily interested in using text labels predicted
by automated methods as key variables in subsequent statistical
analyses. In the vast majority of current applications, researchers
treat predicted text-based variables as if they were observed without
any error: they ignore **prediction errors** in the first
step of automated text annotation. However, ignoring such prediction
errors in the first step of text annotation, even if errors are small,
leads to substantial bias, invalid confidence intervals, and wrong
p-values in downstream statistical analyses of text-based variables.
Biases from prediction errors exist even when the prediction accuracy in
the text classification step is extremely high, e.g., above 90% or even
at 95%. This is because prediction errors are not random—prediction
errors are correlated with observed and unobserved variables we include
in downstream analyses. In practice, this means that substantive and
statistical conclusions can easily flip if researchers ignore prediction
errors in automated text annotation methods.

**Design-based supervised learning (DSL)** is a general
framework for using predicted variables in downstream statistical
analyses without suffering from bias due to prediction errors. Unlike
the existing approaches, DSL allows researchers to obtain statistically
valid estimates and standard errors, even when automated text annotation
methods have arbitrary non-random prediction errors. To do so, DSL
combines large-scale (potentially biased) automated annotations and a
smaller number of high-quality expensive expert annotations using a
doubly robust bias-correction step. Please read Egami, Hinck, Stewart, and
Wei (2023) for the methodological details.
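The core idea of the doubly robust bias correction can be sketched as follows (a simplified version; see Egami, Hinck, Stewart, and Wei (2023) for the full derivation). Let $Y_i$ be the expert annotation, $\hat{Y}_i$ the automated (e.g., LLM) annotation, $R_i \in \{0, 1\}$ the indicator that document $i$ was sampled for expert annotation, and $\pi_i$ the known sampling probability. DSL constructs the bias-corrected pseudo-outcome

$$
\tilde{Y}_i = \hat{Y}_i + \frac{R_i}{\pi_i}\left(Y_i - \hat{Y}_i\right),
$$

which satisfies $E[\tilde{Y}_i \mid Y_i, \hat{Y}_i] = Y_i$ because $E[R_i] = \pi_i$ by design of the sampling step. Downstream analyses based on $\tilde{Y}_i$ therefore remain valid no matter how biased $\hat{Y}_i$ is, which is why the correction requires no assumptions about the structure of the prediction errors.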

### `dsl`: DSL Estimator

As an example, we use Pan and Chen (2018), which studies whether online complaints accusing local officials of corruption are reported to upper-level governments in China. In this example, the variable `countyWrong` (whether a post accuses county-level politicians of corruption) requires text annotation. We randomly sampled a subset of the data (500 documents) to provide expert annotations for this variable.
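A preview like the one below can be generated by loading the package and inspecting the first rows (assuming the `PanChen` data frame ships with the `dsl` package, as in this example):

```r
# Load the dsl package and the Pan and Chen (2018) example data
library(dsl)
data(PanChen)

# First six rows: expert labels (countyWrong) are NA outside the
# 500-document expert-coded subset, while pred_countyWrong is complete
head(PanChen)
```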

```
## countyWrong pred_countyWrong SendOrNot prefecWrong connect2b prevalence
## 1 0 0 0 0 1 0
## 2 0 1 0 0 0 0
## 3 1 0 0 0 0 0
## 4 NA 0 0 0 1 0
## 5 NA 0 0 1 0 0
## 6 NA 0 0 0 1 0
## regionj groupIssue
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
```

Variable `countyWrong` represents expert labels, and it takes `NA` if a given observation is not labeled by experts.

Variable `pred_countyWrong` represents text labels predicted by a user-specified automated text annotation method, e.g., annotations given by a user-specified LLM. In this example, variable `pred_countyWrong` represents annotations given by GPT-4. Unlike variable `countyWrong`, variable `pred_countyWrong` is available for the entire data.

The remaining six columns represent other key variables included in the main statistical analyses.
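The labeling structure can be inspected directly with base R (a minimal sketch; the agreement check is purely illustrative):

```r
# Number of expert-labeled documents (non-NA countyWrong)
sum(!is.na(PanChen$countyWrong))

# Agreement between expert labels and the automated (GPT-4)
# predictions on the expert-labeled subset
labeled <- !is.na(PanChen$countyWrong)
mean(PanChen$countyWrong[labeled] == PanChen$pred_countyWrong[labeled])
```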

If users include `pred_countyWrong` instead of `countyWrong` in the downstream analyses, they will suffer from biases due to prediction errors. If users worry about prediction errors and only use the subset of the data that has expert-coded `countyWrong`, they discard a large number of observations.

Researchers can use `dsl` to perform their main statistical analyses by analyzing all the data while taking into account prediction errors. In this example, we are interested in running a logistic regression model regressing the binary outcome `SendOrNot` on a set of independent variables, including `countyWrong`.

```
out <- dsl(model = "logit",
           formula = SendOrNot ~ countyWrong + prefecWrong +
             connect2b + prevalence + regionj + groupIssue,
           predicted_var = "countyWrong",
           prediction = "pred_countyWrong",
           data = PanChen)
```

`## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..`

| Argument | Description |
|---|---|
| `model` | A regression model. `dsl` currently supports `lm` (linear regression), `logit` (logistic regression), and `felm` (fixed-effects regression). |
| `formula` | A formula used in the specified regression model. |
| `predicted_var` | A vector of column names in the data that correspond to variables that need to be predicted. |
| `prediction` | A vector of column names in the data that correspond to predictions of `predicted_var`. |
| `data` | A data frame. The class should be `data.frame`. |

Researchers can obtain a summary of the output using the function `summary()`.

```
summary(out)
```

```
## ==================
## DSL Specification:
## ==================
## Model: logit
## Call: SendOrNot ~ countyWrong + prefecWrong + connect2b + prevalence + regionj + groupIssue
##
## Predicted Variables: countyWrong
## Prediction: pred_countyWrong
##
## Number of Labeled Observations: 500
## Random Sampling for Labeling with Equal Probability: Yes
##
## =============
## Coefficients:
## =============
## Estimate Std. Error CI Lower CI Upper p value
## (Intercept) 2.0979 0.3621 1.3882 2.8077 0.0000 ***
## countyWrong -0.2613 0.2230 -0.6983 0.1757 0.1206
## prefecWrong -1.1163 0.2970 -1.6983 -0.5342 0.0001 ***
## connect2b -0.0789 0.1197 -0.3135 0.1557 0.2550
## prevalence -0.3269 0.1520 -0.6247 -0.0290 0.0157 *
## regionj 0.1254 0.4565 -0.7694 1.0203 0.3917
## groupIssue -2.3224 0.3597 -3.0274 -1.6174 0.0000 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
```

It first outputs the specification of the DSL estimation, including the number of labeled observations and the sampling process for expert annotations. When users do not specify it explicitly, the function `dsl()` assumes random sampling with equal probabilities. If users deployed a sampling strategy other than random sampling with equal probabilities, they have to specify it explicitly using the argument `sample_prob` when fitting `dsl` (please see this page).
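For example, unequal sampling probabilities might be supplied as follows. This is only a sketch: `my_sample_prob` is a hypothetical column holding each document's known probability of being selected for expert labeling, and whether `sample_prob` expects a column name or a numeric vector should be checked on the linked page.

```r
# Sketch: DSL with unequal-probability sampling for expert annotation.
# 'my_sample_prob' is a hypothetical column of known inclusion probabilities.
out_unequal <- dsl(model = "logit",
                   formula = SendOrNot ~ countyWrong + prefecWrong +
                     connect2b + prevalence + regionj + groupIssue,
                   predicted_var = "countyWrong",
                   prediction = "pred_countyWrong",
                   sample_prob = "my_sample_prob",
                   data = PanChen)
```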

As for coefficients, researchers can interpret estimates and standard errors as in the usual regression analyses. The package implements heteroskedasticity-robust standard errors by default. When users want to compute cluster-robust standard errors, they can use the argument `cluster`, which we explain more here.
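A cluster-robust fit might look like the following sketch, assuming `cluster` accepts the name of a column identifying the clustering unit; `prefecture` is a hypothetical column used only for illustration.

```r
# Sketch: DSL with cluster-robust standard errors.
# 'prefecture' is a hypothetical column identifying the clustering unit.
out_cl <- dsl(model = "logit",
              formula = SendOrNot ~ countyWrong + prefecWrong +
                connect2b + prevalence + regionj + groupIssue,
              predicted_var = "countyWrong",
              prediction = "pred_countyWrong",
              cluster = "prefecture",
              data = PanChen)
```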

The package supports different types of regression models: linear regression (`lm`), logistic regression (`logit`), and linear fixed-effects regression (`felm`). To learn more about each model, please visit **Models Available in DSL**.
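Switching between the supported models only requires changing the `model` argument (shown here for `lm`; `felm` may require additional arguments for the fixed-effects specification, as described in **Models Available in DSL**):

```r
# Linear regression (linear probability model) version of the same analysis
out_lm <- dsl(model = "lm",
              formula = SendOrNot ~ countyWrong + prefecWrong +
                connect2b + prevalence + regionj + groupIssue,
              predicted_var = "countyWrong",
              prediction = "pred_countyWrong",
              data = PanChen)
summary(out_lm)
```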

To help users apply DSL in various practical settings, we answer
frequently asked questions at **Frequently Asked
Questions**, including how to determine the required number
of expert annotations, how to incorporate more complex sampling
strategies for expert annotations, and how to handle cluster standard
errors.

**References:**

Egami, Hinck, Stewart, and Wei. (2024). "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses."

Egami, Hinck, Stewart, and Wei. (2023). "Using Imperfect Surrogates for Downstream Inference: Design-based Supervised Learning for Social Science Applications of Large Language Models." Advances in Neural Information Processing Systems (NeurIPS).

Pan and Chen. (2018). "Concealing Corruption: How Chinese Officials Distort Upward Reporting of Online Grievances." American Political Science Review 112(3), 602–620.