Models Available in DSL
Note: We currently cover three models (`lm`, `logit`, `felm`). In the next update, we plan to add more models, including multinomial logit regression, Poisson regression, and two-stage least squares for the instrumental variable method.
`lm`: Linear Regression
Researchers can implement DSL linear regression by specifying `model = "lm"`.
##          Y   pred_Y         X1         X2         X3        X4         X5
## 1 3.337877 3.288469 0.84678708 0.10927346 0.15178520 0.4336862 0.07082542
## 2 1.283851 1.353864 0.59445495 0.19281988 0.77950672 0.4498675 0.30573402
## 3 3.618762 3.603551 0.91183977 0.97745425 0.19939829 0.3630903 0.97453388
## 4       NA 1.402226 0.24259237 0.03956956 0.02819825 0.8497914 0.95381503
## 5       NA 3.842970 0.51535594 0.78112420 0.31758914 0.5531493 0.46124773
## 6       NA 0.601493 0.09942167 0.28733200 0.44238525 0.5460361 0.83561538
In this example, variable `Y` requires text annotation (it takes `NA` if a given observation is not labeled by experts). Variable `pred_Y` represents predictions for `Y`.
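To make this data layout concrete, here is a minimal sketch that builds a small data frame of the same shape and counts the labeled rows. The object names (`toy_data`, `n_labeled`) and the simulated values are hypothetical and are not the package's `data_lm`; the check that predictions are complete is an assumption based on the rows printed above, where `pred_Y` is never missing.

```r
# Hypothetical toy data mimicking the layout above (not the package's data_lm)
set.seed(1)
n <- 10
toy_data <- data.frame(
  Y      = c(rnorm(4), rep(NA, n - 4)),  # expert labels, NA when unlabeled
  pred_Y = rnorm(n),                     # a prediction for every row
  X1     = runif(n)
)

# Rows with a non-missing Y are the expert-labeled observations
labeled   <- !is.na(toy_data$Y)
n_labeled <- sum(labeled)

# Assumption: predictions are available for all rows, labeled or not
stopifnot(!anyNA(toy_data$pred_Y))

n_labeled
```

Running the same `sum(!is.na(...))` check on real data is a quick way to compare against the number of labeled observations reported by `summary()`.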
out_lm <- dsl(model = "lm",
              formula = Y ~ X1 + X2 + X3 + X4 + X5,
              predicted_var = "Y",
              prediction = "pred_Y",
              data = data_lm)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
Argument | Description |
---|---|
`model` | A regression model. `dsl` currently supports `lm` (linear regression), `logit` (logistic regression), and `felm` (fixed-effects regression). |
`formula` | A formula used in the specified regression model. |
`predicted_var` | A vector of column names in the data that correspond to variables that need to be predicted. |
`prediction` | A vector of column names in the data that correspond to predictions of `predicted_var`. |
`data` | A data frame. The class should be `data.frame`. |
summary(out_lm)
## ==================
## DSL Specification:
## ==================
## Model: lm
## Call: Y ~ X1 + X2 + X3 + X4 + X5
##
## Predicted Variables: Y
## Prediction: pred_Y
##
## Number of Labeled Observations: 245
## Random Sampling for Labeling with Equal Probability: Yes
##
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value
## (Intercept)   0.6445     0.0713   0.5046   0.7843  0.0000 ***
## X1            2.1319     0.0684   1.9978   2.2661  0.0000 ***
## X2            2.1024     0.0642   1.9765   2.2283  0.0000 ***
## X3            0.0446     0.0686  -0.0900   0.1791  0.2581
## X4            0.0026     0.0635  -0.1218   0.1270  0.4837
## X5            0.0256     0.0593  -0.0906   0.1418  0.3331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
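The reported intervals look consistent with a normal approximation, estimate ± z × standard error. The sketch below checks this for the `(Intercept)` row, using the rounded values copied from the summary above. That `dsl` computes its intervals in exactly this way is an assumption, and small discrepancies in the last digit are expected because the inputs are already rounded.

```r
# Values copied from the (Intercept) row of the summary above
estimate  <- 0.6445
std_error <- 0.0713

# Two-sided 95% interval under a normal approximation
# (assumed here; not confirmed as the package's internal method)
z        <- qnorm(0.975)
ci_lower <- estimate - z * std_error
ci_upper <- estimate + z * std_error

round(c(lower = ci_lower, upper = ci_upper), 4)
```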
`logit`: Logistic Regression
Researchers can implement DSL logistic regression by specifying `model = "logit"`. It requires the same arguments as `model = "lm"`.
##    Y         X1 pred_Y    pred_X1         X2          X3          X4
## 1  0 -0.1443742      0  0.9486148 -0.2079232 -0.06932595  1.65748976
## 2  1 -1.2864079      1 -1.8766233 -1.6961906 -0.27274691  1.54678252
## 3  1 -1.4665386      1 -2.4866911 -0.6212224 -0.70787173 -0.27913330
## 4  0 -1.3554115      1  0.5138155  1.2299990 -0.86519213  0.18177847
## 5 NA         NA      1 -2.1116582  0.6624512  1.19504665 -0.02724476
## 6 NA         NA      1 -0.6391120 -0.8971608  1.20072203 -0.51241357
##            X5          X6         X7        X8           X9        X10
## 1  0.96825927  0.05389556 -2.0762690 0.2653279  1.458743948  0.4127003
## 2 -0.46626141  0.73182950 -1.7677551 1.0790758 -0.003172833 -0.5313550
## 3 -0.26973904 -1.75925076 -0.1585703 0.6823146  0.050500221 -1.2679467
## 4  0.41947288 -0.32375937  0.3226157 0.5131867 -2.065367939  1.6111660
## 5 -0.94927647  1.01808653  0.7608077 1.6830265 -1.907959205 -0.6979879
## 6  0.09162689  1.15086391 -0.7917356 0.9765960 -0.144924483 -1.2848539
In this example, variables `Y` and `X1` require text annotation (they take `NA` if a given observation is not labeled by experts). Variables `pred_Y` and `pred_X1` represent predictions for `Y` and `X1`, respectively. As this example shows, `dsl` can handle cases in which multiple variables require text annotation.
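When several variables need predictions, it can help to confirm up front that every annotated variable has a matching prediction column. The sketch below uses a hypothetical helper, `check_pairs()`, which is not part of `dsl`; it assumes that the two name vectors are paired by position (`Y` with `pred_Y`, `X1` with `pred_X1`), which the naming in this example suggests but which is stated here as an assumption.

```r
# Hypothetical helper (not part of dsl): check that each variable to be
# predicted has a prediction column, pairing the names by position
check_pairs <- function(predicted_var, prediction, data) {
  stopifnot(length(predicted_var) == length(prediction))
  stopifnot(all(c(predicted_var, prediction) %in% names(data)))
  data.frame(predicted_var = predicted_var, prediction = prediction)
}

# Toy data with the same column layout as the example (values are made up)
toy <- data.frame(Y       = c(0, 1, NA),
                  X1      = c(-0.1, -1.3, NA),
                  pred_Y  = c(0, 1, 1),
                  pred_X1 = c(0.9, -1.9, -2.1))

pairs <- check_pairs(c("Y", "X1"), c("pred_Y", "pred_X1"), toy)
pairs
```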
out_logit <- dsl(model = "logit",
                 formula = Y ~ X1 + X2 + X4,
                 predicted_var = c("Y", "X1"),
                 prediction = c("pred_Y", "pred_X1"),
                 data = data_logit)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_logit)
## ==================
## DSL Specification:
## ==================
## Model: logit
## Call: Y ~ X1 + X2 + X4
##
## Predicted Variables: Y, X1
## Prediction: pred_Y, pred_X1
##
## Number of Labeled Observations: 487
## Random Sampling for Labeling with Equal Probability: Yes
##
## =============
## Coefficients:
## =============
##             Estimate Std. Error CI Lower CI Upper p value
## (Intercept)  -0.7380     0.1001  -0.9342  -0.5419  0.0000 ***
## X1            0.2197     0.1045   0.0149   0.4245  0.0178 *
## X2            0.0233     0.0972  -0.1673   0.2139  0.4054
## X4            0.2771     0.1023   0.0766   0.4777  0.0034 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
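When reading the `p value` column, it may be worth checking which tail convention it follows. Using the rounded values in the `X1` row above, the printed 0.0178 appears to match a one-sided normal tail probability, `pnorm(-abs(z))`, which is half the usual two-sided p-value. This is an observation from the printed output only, not a documented property of `dsl`, so treat the sketch below as a check to rerun rather than a statement of the package's method.

```r
# Values copied from the X1 row of the logit summary above
estimate  <- 0.2197
std_error <- 0.1045

# z statistic under a normal approximation (assumed)
z <- estimate / std_error

p_one_sided <- pnorm(-abs(z))      # tail probability in one direction
p_two_sided <- 2 * pnorm(-abs(z))  # conventional two-sided p-value

round(c(one_sided = p_one_sided, two_sided = p_two_sided), 4)
```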
`felm`: Fixed Effects Linear Regression
Researchers can implement DSL fixed effects linear regression by specifying `model = "felm"`.
##     state year  log_gsp pred_log_gsp log_pcap   log_pc unemp
## 1 ALABAMA 1970       NA     9.738913 9.617981 10.48553   4.7
## 2 ALABAMA 1971 10.28790    10.287899 9.648720 10.52675   5.2
## 3 ALABAMA 1972 10.35147    10.351469 9.678618 10.56283   4.7
## 4 ALABAMA 1973 10.41721    10.417209 9.705418 10.59873   3.9
## 5 ALABAMA 1974 10.42671    10.426706 9.726910 10.64679   5.5
## 6 ALABAMA 1975 10.42240    10.422400 9.759401 10.69130   7.7
In this example, variable `log_gsp` requires text annotation (it takes `NA` if a given observation is not labeled by experts). Variable `pred_log_gsp` represents predictions for `log_gsp`.
One-way Fixed Effects
To implement DSL one-way fixed effects regression, set `fixed_effect = "oneway"`. Use `index` to denote which column defines the fixed effects. Users can also cluster standard errors by specifying a variable name in the `cluster` argument (this argument is also available for the other models).
out_felm_one <- dsl(model = "felm",
                    formula = log_pcap ~ log_gsp + log_pc + unemp,
                    predicted_var = "log_gsp",
                    prediction = "pred_log_gsp",
                    fixed_effect = "oneway",
                    index = c("state"),
                    cluster = "state",
                    data = data_felm)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
Argument | Description |
---|---|
`fixed_effect` | The type of fixed effects regression to run: `oneway` (one-way fixed effects) or `twoways` (two-way fixed effects). |
`index` | A vector of column names specifying fixed effects. When `fixed_effect = "oneway"`, it has one element. When `fixed_effect = "twoways"`, it has two elements, e.g., `index = c("state", "year")`. |
`cluster` | A column name in the data that indicates the level at which clustered standard errors are calculated. Default is `NULL`. |
summary(out_felm_one)
## ==================
## DSL Specification:
## ==================
## Model: felm (oneway)
## Call: log_pcap ~ log_gsp + log_pc + unemp
## Fixed Effects: state
##
## Predicted Variables: log_gsp
## Prediction: pred_log_gsp
##
## Number of Labeled Observations: 334
## Random Sampling for Labeling with Equal Probability: Yes
##
## =============
## Coefficients:
## =============
##         Estimate Std. Error CI Lower CI Upper p value
## log_gsp   0.0031     0.0020  -0.0007   0.0069  0.0563 .
## log_pc    0.5357     0.0419   0.4537   0.6178  0.0000 ***
## unemp     0.0067     0.0021   0.0025   0.0108  0.0008 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
## Standard errors are clustered by state.
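For readers less familiar with what one-way fixed effects do, the following sketch illustrates the general idea in base R: subtracting each unit's mean from the outcome and the regressor (the within transformation) gives the same slope as including one dummy variable per unit. It uses a simulated toy panel with hypothetical names (`toy`, `b_within`, `b_dummy`) and plain `lm()`; it is not a description of how `dsl` estimates `felm` models internally.

```r
# Hypothetical toy panel (not the package's data_felm)
set.seed(42)
toy   <- data.frame(state = rep(c("A", "B", "C"), each = 8))
toy$x <- rnorm(nrow(toy))
state_effect <- unname(c(A = 1, B = -2, C = 0.5)[toy$state])
toy$y <- 2 * toy$x + state_effect + rnorm(nrow(toy), sd = 0.1)

# Within transformation: subtract each state's mean from y and x
toy$y_dm <- toy$y - ave(toy$y, toy$state)
toy$x_dm <- toy$x - ave(toy$x, toy$state)

# Slope from the demeaned regression (no intercept needed after demeaning)
b_within <- unname(coef(lm(y_dm ~ x_dm - 1, data = toy))["x_dm"])

# Slope from a regression with one dummy per state
b_dummy <- unname(coef(lm(y ~ x + factor(state), data = toy))["x"])

# The two slopes agree up to numerical precision
all.equal(b_within, b_dummy)
```

This equivalence is also why the summary above reports no intercept or state-level coefficients: the fixed effects are absorbed rather than estimated as separate parameters of interest.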
Two-way Fixed Effects
To implement DSL two-way fixed effects regression, set `fixed_effect = "twoways"`. Use `index` (a vector of length 2) to denote which columns define the fixed effects.
out_felm_two <- dsl(model = "felm",
                    formula = log_pcap ~ log_gsp + log_pc + unemp,
                    predicted_var = "log_gsp",
                    prediction = "pred_log_gsp",
                    fixed_effect = "twoways",
                    index = c("state", "year"),
                    cluster = "state",
                    data = data_felm)
## Cross-Fitting: 1/10..2/10..3/10..4/10..5/10..6/10..7/10..8/10..9/10..10/10..
summary(out_felm_two)
## ==================
## DSL Specification:
## ==================
## Model: felm (twoways)
## Call: log_pcap ~ log_gsp + log_pc + unemp
## Fixed Effects: state and year
##
## Predicted Variables: log_gsp
## Prediction: pred_log_gsp
##
## Number of Labeled Observations: 334
## Random Sampling for Labeling with Equal Probability: Yes
##
## =============
## Coefficients:
## =============
##         Estimate Std. Error CI Lower CI Upper p value
## log_gsp   0.0026     0.0017  -0.0009   0.0060  0.0722 .
## log_pc    0.4275     0.0748   0.2809   0.5741  0.0000 ***
## unemp     0.0093     0.0034   0.0027   0.0159  0.0028 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 95% confidence intervals (CI) are reported.
## Standard errors are clustered by state.