
Estimating Regression using the DSL framework

Usage

dsl(
  model = "lm",
  formula,
  predicted_var,
  prediction = NULL,
  data,
  cluster = NULL,
  labeled = NULL,
  sample_prob = NULL,
  index = NULL,
  fixed_effect = "oneway",
  sl_method = "grf",
  feature = NULL,
  family = "gaussian",
  cross_fit = 5,
  sample_split = 10,
  seed = 1234
)

Arguments

model

The regression model to estimate. dsl currently supports lm (linear regression), logit (logistic regression), and felm (fixed-effects regression). Default is lm.

formula

A formula used in the specified regression model.

predicted_var

A vector of column names in the data that correspond to variables that need to be predicted.

prediction

A vector of column names in the data that correspond to predictions of predicted_var.

data

A data frame. The class should be data.frame.

cluster

A column name in the data that indicates the level at which clustered standard errors are calculated. Default is NULL.

labeled

(Optional) A column name in the data that indicates which observations are labeled. The column should contain only 1 (labeled) and 0 (non-labeled). When NULL, the function assumes that observations with NA in predicted_var are non-labeled and all other observations are labeled.
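
As a minimal sketch of the default rule described above, the indicator implied by labeled = NULL can be reproduced in base R. The data frame df and its columns y and pred_y are hypothetical and exist only for illustration.

```r
# Hypothetical data: `y` is the variable to be predicted (NA = not hand-labeled).
df <- data.frame(
  y      = c(1, NA, 0, NA, 1),
  pred_y = c(0.9, 0.2, 0.1, 0.7, 0.8)
)

# Rule applied when labeled = NULL: an observation counts as labeled
# if and only if predicted_var is not NA.
df$labeled <- as.integer(!is.na(df$y))

df$labeled
# [1] 1 0 1 0 1
```

Passing labeled = "labeled" with such a column should therefore describe the same labeling pattern as leaving labeled = NULL.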

sample_prob

(Optional) A column name in the data that corresponds to the sampling probability for labeling each observation. When NULL, the function assumes random sampling with equal probabilities.
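
When labels are not sampled with equal probabilities, the known sampling design can be stored in a column and passed via sample_prob. The sketch below assumes a hypothetical stratified design; the column names stratum, sample_prob, and labeled, as well as the sampling rates, are made up for illustration.

```r
# Hypothetical setup: 1000 documents in two strata, labeled at different rates.
set.seed(1234)
df <- data.frame(stratum = rep(c("a", "b"), times = c(800, 200)))

# Known design: label 10% of stratum "a" and 50% of stratum "b".
df$sample_prob <- ifelse(df$stratum == "a", 0.1, 0.5)

# Draw the labeled indicator (1 = labeled, 0 = non-labeled)
# according to those probabilities.
df$labeled <- rbinom(nrow(df), size = 1, prob = df$sample_prob)

# Share of labeled documents within each stratum.
tapply(df$labeled, df$stratum, mean)
```

The columns would then be referenced by name in the call, i.e., sample_prob = "sample_prob" and labeled = "labeled".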

index

(Used when model = "felm") A vector of column names specifying fixed effects. When fixed_effect = "oneway", it has one element. When fixed_effect = "twoways", it has two elements, e.g., index = c("state", "year").

fixed_effect

(Used when model = "felm") The type of fixed-effects regression to run: oneway (one-way fixed effects) or twoways (two-way fixed effects). Default is oneway.
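
The pairing of index and fixed_effect described above can be sketched as follows. This is only an illustration under stated assumptions: the panel data, its column names (state, year, y, x, z, pred_z), and the simulated values are all hypothetical, and the call is guarded so that it runs only if the dsl package is installed.

```r
# Hypothetical panel: 50 states observed over 10 years (500 rows).
set.seed(1234)
panel <- expand.grid(
  state = paste0("s", 1:50),
  year = 2001:2010,
  stringsAsFactors = FALSE
)
n <- nrow(panel)

z_true <- rnorm(n)                               # variable that needs labeling
panel$x <- rnorm(n)
panel$y <- 0.5 * panel$x + z_true + rnorm(n)
panel$pred_z <- z_true + rnorm(n, sd = 0.5)      # noisy automated prediction
panel$z <- z_true
panel$z[sample(n, size = 350)] <- NA             # only 150 rows keep a label

if (requireNamespace("dsl", quietly = TRUE)) {
  # Two-way fixed effects: index must name two columns.
  out_fe <- dsl::dsl(
    model = "felm",
    formula = y ~ x + z,
    predicted_var = "z",
    prediction = "pred_z",
    data = panel,
    index = c("state", "year"),
    fixed_effect = "twoways"
  )
}
```

For a one-way specification, index would instead name a single column, e.g., index = "state", with fixed_effect = "oneway".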

sl_method

The name of the supervised machine learning model used internally to predict predicted_var, either by fine-tuning prediction or, when prediction = NULL, by using the predictors specified in feature. Users can run available_method() to see the available supervised machine learning methods. Default is grf (generalized random forest).

feature

A vector of column names in the data that correspond to predictors used to fit the supervised machine learning model (specified in sl_method).

family

(Used when making predictions) The variable type of predicted_var. Default is gaussian.

cross_fit

The number of folds used in cross-fitting. Default is 5.

sample_split

The number of sample-splitting repetitions. Default is 10.

seed

Numeric seed used internally. Default is 1234.

Value

dsl returns an object of class dsl with the following components:

  • coefficients: Estimated coefficients.

  • standard_errors: Estimated standard errors.

  • vcov: Estimated variance-covariance matrix.

  • RMSE: Root mean squared error in the internal prediction step.

  • internal: Outputs intended for internal use only.
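
The following sketch shows how a call and the returned components might fit together. It rests on several assumptions: the simulated data and the column names (y, x, z, pred_z) are hypothetical, the returned object is assumed to be list-like so that the components above can be accessed with $, and the call is guarded so that it runs only if the dsl package is installed.

```r
set.seed(1234)
n <- 1000

# Hypothetical data: outcome `y`, covariate `x`, and a binary variable `z`
# that is hand-labeled for only part of the sample.
x <- rnorm(n)
z_true <- rbinom(n, size = 1, prob = plogis(0.5 * x))
dat <- data.frame(
  y      = 1 + 0.5 * x + 1.0 * z_true + rnorm(n),
  x      = x,
  z      = z_true,
  pred_z = plogis(2 * z_true - 1 + rnorm(n))  # noisy automated prediction of z
)

# Keep labels for a random 20% of rows; the rest are set to NA,
# which dsl treats as non-labeled when labeled = NULL.
unlabeled <- sample(n, size = 0.8 * n)
dat$z[unlabeled] <- NA

if (requireNamespace("dsl", quietly = TRUE)) {
  out <- dsl::dsl(
    model = "lm",
    formula = y ~ x + z,
    predicted_var = "z",
    prediction = "pred_z",
    data = dat,
    seed = 1234
  )
  # Assumption: components listed under Value are accessible with `$`.
  print(out$coefficients)
  print(out$standard_errors)
}
```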