Econometrics lecture notes

E(X) = The expected value/population mean of X. Weighted average of all possible values of X. Weights are based on their probability of occurring.

E(Y|X=12) = The expected value of Y when X = 12. Conditional expectation.

E(u|x) = E(u) The expected value of u given x equals the unconditional expected value of u, whatever value x takes. Formally, “u is mean independent of x”: the average of u does not change with x. (This is weaker than full independence, which would restrict the whole distribution of u, not only its mean.) Used to isolate a relationship in linear regression, so that u (the “other factors” we’re not interested in measuring) does not contaminate the estimated effect of x.

E(y|x) = Population regression function (PRF). In the simple linear model with E(u|x) = 0 it is a linear function of x: E(y|x) = B0 + B1*x.

Assumption: E(u|x) = E(u)

Together with the normalization E(u) = 0, this implies zero covariance between x and u. Can be rewritten as:

Cov(x, u) = E(xu) = 0

Ordinary Least Squares (OLS) = A method for estimating the unknown parameters in a linear regression model. It minimizes the sum of squared residuals (uhat), i.e. the squared differences between the observed y and the fitted values.

When the regression includes a constant, the sum, and hence the average, of the OLS residuals is zero (Sum(uhat_i) = 0)
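A minimal sketch of simple-regression OLS in Python (standard library only), to check the zero-sum property of the residuals numerically. The data values are made up for illustration and do not come from the course datasets.

```python
# Simple-regression OLS on hypothetical data (illustrative values only).
x = [8, 10, 12, 12, 14, 16, 18]            # e.g. years of education (made up)
y = [5.1, 6.0, 7.4, 6.8, 9.0, 10.2, 12.5]  # e.g. hourly wage (made up)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sample covariance of x and y divided by sample variation in x.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
# Intercept: makes the fitted line pass through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# Residuals: observed y minus fitted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print("b0 =", b0, "b1 =", b1)
print("sum of residuals =", sum(residuals))  # zero up to rounding error
```

The same first-order conditions also make the residuals uncorrelated with x in the sample (Sum(x_i * uhat_i) = 0).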

p-value = The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the predetermined significance level, which is often 0.05 or 0.01: such a result would be unlikely to arise from random chance alone if the null hypothesis were true.

SST = Total Sum of Squares. Total sample variation in y: Sum((y_i - ybar)^2).

SSE = Explained Sum of Squares. Sample variation in the fitted values yhat: Sum((yhat_i - ybar)^2). If B1hat = 0, then SSE = 0 (the fitted value does not change when x changes).

SSR = Residual Sum of Squares. Sample variation in the residuals: Sum(uhat_i^2). If it equals SST, x explains none of the sample variation in y, and u (the unobserved factors) accounts for everything.

SST = SSE + SSR. Same logic as for each observation (“hat” means predicted value):

y_i = yhat_i + uhat_i


R^2 = R-squared. Coefficient of determination. How much of the sample variation in y the linear regression explains. R^2 = 1 if all observations are on a straight line, so the linear regression perfectly matches the data points. If 0.5, the linear regression explains half of the variation in the data. High R^2 is important if you want to use the model for prediction. Often high in general time series data, but a high R^2 does not by itself tell us anything about causation. A more specific data set might generate a lower R^2, but we might still learn something important.

R^2 = SSE/SST = 1 - SSR/SST
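A small Python sketch (standard library only) that computes SST, SSE and SSR for a fitted line and checks that both expressions for R^2 agree. The helper name ols_simple and the data are assumptions made for this illustration.

```python
def ols_simple(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b1 * x_bar, b1


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 4.2, 4.8, 6.1, 6.9]

b0, b1 = ols_simple(x, y)
y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y
sse = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual variation

r2 = sse / sst
print("SST =", sst, " SSE + SSR =", sse + ssr)
print("R^2 =", r2, " 1 - SSR/SST =", 1 - ssr / sst)
```

The decomposition SST = SSE + SSR relies on the regression including a constant.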


regress dependent independent


regress y x

regress wage educ

regress salary roe_dec

Coef x = How much a change in one variable changes the other. Or rather, how an increase of 1 in x changes y, e.g. how much 1 year of education increases wage.

_cons coef = B0

x (e.g. “educ”) coef = B1

Nonlinearities in simple regression.

Measure the values in log() instead.


log(wage) = B0 + B1*educ + u

Logs generate (approximate) percentage differences:

delta log(x) * 100 ≈ Percentage change in x

delta log(x) = log(x1) - log(x0) = log(x1/x0)

Stata example of log:

generate logwage=ln(wage)

regress logwage educ

Important: Coef of educ now measures (approximate) percentage increases instead of actual unit changes. E.g. if the coef is 0.08, 1 extra year of education gives about 0.08 * 100 = 8% higher wage.

Converting from log to level:

log(wage) = B0 + B1*educ + u

wage = exp(B0 + B1*educ + u)

Constant Elasticity (CE) Model:

Logarithm of both y and x. Gives us percentage changes in both x and y. E.g. with salary regressed on sales and B1 = 0.25: if sales increase by 1%, salary will increase by about 0.25%.


log(salary) = B0 + B1*log(sales) + u

Why? Because:

delta log(salary)/delta log(sales) = B1 = (dy/y)/(dx/x)

Gives us elasticity

Unbiasedness of OLS

We want E(B1hat) = B1

The expected value of the estimated B1 equals the true B1. This holds under assumptions SLR1-SLR4 below, the key one being:

E[u|x] = 0


SLR = Simple linear regression (one explanatory variable)

MLR = Multiple linear regression (several explanatory variables)

Assumptions SLR:

SLR1: Must be linear in parameters: y = B0 + B1x + u.

SLR2: Must have a random sample.

SLR3: There is sample variation in the variable (i.e. the x are not all the same). Var(xi) > 0

SLR4: Zero conditional mean - the error u has an expected value of zero given x: E[u|x] = 0. This implies Corr(u, x) = 0: no correlation between the error term and the independent variable. If there is correlation, the coefficient is biased, because x also picks up the effect of the omitted factors. For example: if we regress wage on educ only, the effect of education will be overestimated compared to regressing wage on education and IQ (assuming IQ raises wage and is positively correlated with education), because the omitted IQ effect is attributed to education. Expectations about the dependent variable might also be factors that change the independent variable.

SLR4 is the most important in practice.

Assumptions MLR:

Assumption MLR.1: The population model is linear in parameters: y=β0+β1x1+β2x2 +…+ u.

Assumption MLR.2: Random sampling: We have a sample of n observations {(xi1, xi2, …, xik, yi): i = 1, 2, …, n} following the population model in Assumption MLR.1

Assumption MLR.3: No perfect collinearity: In the sample, none of the independent variables is constant and there are no exact linear relationships among the independent variables

Assumption MLR.4: Zero conditional mean: The error u has an expected value of zero, given any values of the independent variables: E(u|x1, x2,…,xk) = 0. Omitted variables that correlate with the explanatory/independent variables violate MLR.4.

Assumption MLR.5: Homoskedasticity: Constant variance across the data points. The error u has the same variance given any value of the explanatory variables.

Assumption MLR.6: Normality: The population error u is independent of the explanatory variables x1, x2,…,xk, and is normally distributed with zero mean and variance σ^2: u ~ Normal(0, σ^2). Non-normally distributed error terms violate MLR.6. Can you think of examples in which it’s obvious the dependent variable (conditional on the x-variables) does not follow a normal distribution?

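Homoskedastic = Constant variance of the error across the data points: the spread of the points around the regression line is the same for all values of x (the individual distances still differ). This is a separate property from normality. We want homoskedasticity; when it fails (heteroskedasticity), taking log() of the dependent variable often helps.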

If MLR 1-4 hold, the OLS estimator is said to be “unbiased”.

If MLR 1-5 hold, the OLS estimator is said to be the “Best Linear Unbiased Estimator” (BLUE).

MLR 1-5 are known as the Gauss-Markov assumptions (for cross-sectional regression).

MLR 1-6 are known as the classical linear model (CLM) assumptions (for cross-sectional regression).

If MLR 1-6 hold, we can do exact “statistical inference” using conventional OLS standard errors, t statistics and F statistics.

Exogenous - The explanatory variable is determined outside the model: it is uncorrelated with the error term (the unobserved factors). That’s what we want.

Endogenous - The explanatory variable is correlated with the error term, e.g. because of an omitted variable or because it is itself influenced by the dependent variable.

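t-distribution: Similar to a normal distribution, but with fatter tails. It is used because the error variance σ^2 is unknown and has to be estimated, so the standardized coefficient follows a t distribution with n-k-1 df instead of a standard normal. With many degrees of freedom the two are almost identical.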

Null hypothesis (H0): The hypothesis we test, typically that there is no relationship between two variables (e.g. Bj = 0). We keep it unless the data give enough evidence to reject it.

Alternative hypothesis (H1): The opposite of the null hypothesis.

Significance level: The probability of rejecting H0 when it is in fact true (a Type I error). Most commonly 5%. “With a 5% significance level, the definition of ”sufficiently large” is simply the 95th percentile in a t distribution with n-k-1 df. “ With a one-sided alternative, H0 is rejected if t > c; with a two-sided alternative, H0 is rejected if |t| > c, where c is then the 97.5th percentile. We get c by looking at tables for significance level and degrees of freedom (df).

Degrees of freedom (df): n-k-1, i.e. number of observations minus number of explanatory variables minus one.

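t-value: The estimated coefficient divided by its standard error. The closer it is to 0 (in absolute value), the weaker the evidence against H0, i.e. the less statistically significant the variable.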

p-value: Given the observed value of the t statistic, the smallest significance level at which the null hypothesis would be rejected. Equivalently, the probability of observing a t value at least as extreme as we did if the null hypothesis is true. Thus, small p-values are evidence against the null hypothesis. If the p-value is, say, 0.04, we might say there’s significance at the 5% level (actually at the 4% level) but not at the 1% level (or 3% or 2% level).

Confidence interval (CI): Provides a range of likely values for the unknown βj.

F test: Testing multiple restrictions. Similar to t-test, but with several hypotheses at the same time. Commonly used to test that all slope coefficients = 0. Example:

H0: B1 = 0, B2 = 0, B3 = 0, …, Bk = 0.

H1: H0 is not true

If all are equal to 0, none of the included variables would help explain y; we would only have the constant. We want to test whether our original model explains more than the model under the null hypothesis, i.e. without influence from any of the variables (the restricted model). If we get a low p-value, the coefficients are unlikely to all be 0, i.e. the null hypothesis can be rejected at low significance levels.

Unrestricted model: Original model with all variables

Restricted model: Some variables removed to impose restrictions on the model.

R^2: Never decreases when more variables are included, so it is never higher for the restricted model. R^2 = 1 - SSR/SST. In other words, SSR never increases when variables are added, and never decreases when we move to the restricted model.

If SSR increases a lot when you exclude variables, those variables have significant explanatory power, and should not be omitted.

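Chow test: An F test of whether the coefficients are equal across two groups (or time periods). The restricted model forces the same coefficients in both groups; the unrestricted model lets them differ.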

F-test in Stata:

First do the regression, then directly after type:

“test excluded_var_1 excluded_var_2”

Automatically defines the null hypothesis (coefficients = 0) for these variables and does the F-test. Returns F-value “F(x, y)” and p-value “Prob > F”. At the top right of the regression output, the F-value for the test that all coefficients = 0 is shown, as well as its p-value.

Compare with c (found in table), if F-value > c we can reject H0.

The critical value c depends on (q, df, p):

q = Number of restrictions (excluded variables)

df = Denominator degrees of freedom = n - k - 1

n = Number of observations

k = Number of variables in regression

p = Significance level between 0-1

Stata functions to get the critical value c:

invFtail(q, df, p) for the F test

invttail(df, p) for a one-sided t test (df = n - k - 1, not n)


When reporting the regression, we should include at least standard errors and t-statistics.


log(y) + log(x): dlog y / dlog x = (dy/y) / (dx/x) = Elasticity

log(y) + normal x: When x increases by 1, y increases by approximately 100 * coef %. This is an approximation, which will be less exact when the coef gets larger.

To get exact percent:

100 * [exp(coef) - 1]
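A short Python sketch comparing the approximate percentage effect (100 * coef) with the exact one (100 * [exp(coef) - 1]) in a log-level model. The coefficient values are chosen for illustration; only 0.08 appears in these notes.

```python
import math


def approx_pct(coef):
    """Approximate % change in y for a one-unit increase in x."""
    return 100 * coef


def exact_pct(coef):
    """Exact % change in y for a one-unit increase in x."""
    return 100 * (math.exp(coef) - 1)


for coef in (0.02, 0.08, 0.30):
    print(coef, approx_pct(coef), round(exact_pct(coef), 2))
```

The gap between the two columns widens as the coefficient grows, which is why the approximation is only safe for small coefficients.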


A model where x enters both in level and squared. Example: y = B0 + B1x + B2x^2 + u.

dy/dx = B1 + 2 * B2 * X

B1 = Coefficient of normal variable (not squared).

B2 = Coefficient of squared variable.

X = Value of x at which we evaluate the effect, e.g. 1 for the first year of experience.

Useful for calculating diminishing effects/returns on a variable.


wage = 3.73 + 0.298 * exper - 0.0061 * exper^2

exper has a diminishing effect on wage. When there’s a positive coef on x and a negative coef on x^2, the quadratic has an inverted-U (parabolic) shape.

To find the turning point X* where the effect of a further increase is 0:

B1 + 2 * B2 * X = 0

X* = -B1 / (2 * B2)
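A quick Python check of the turning point, obtained by solving B1 + 2 * B2 * X = 0 for X, using the coefficients from the wage example above (B1 = 0.298 on exper, B2 = -0.0061 on exper^2). The function names are made up for this sketch.

```python
def marginal_effect(b1, b2, x):
    """dy/dx of y = b0 + b1*x + b2*x^2, evaluated at x."""
    return b1 + 2 * b2 * x


def turning_point(b1, b2):
    """Value of x where the marginal effect is zero: x* = -b1 / (2*b2)."""
    return -b1 / (2 * b2)


b1, b2 = 0.298, -0.0061  # coefficients on exper and exper^2

x_star = turning_point(b1, b2)
print("effect of the 1st year:", marginal_effect(b1, b2, 1))
print("effect at 10 years:   ", marginal_effect(b1, b2, 10))
print("turning point (years):", x_star)
```

With these numbers the turning point is at roughly 24.4 years of experience; before that point an extra year raises the predicted wage, after it the predicted wage falls.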

In Stata:
gen exper2 = exper^2

In Stata to get coefficient from last regression:

_b[x]

x = name of variable

To test if B1 is significantly different from 0:

H0: B1 = 0

H1: B1 # 0


Interaction terms:

Sometimes partial effects depend on another explanatory variable. Example:

price = B0 + B1 * sqrft + B2 * bedrooms + B3(sqrft * bedrooms) + B4 * bthrms + u

Means that the effect on price of adding more bedrooms depends on the size of the house:

d price/d bedrooms = B2 + B3 * sqrft

If B3 > 0, an extra bedroom adds more to the price in a large house than in a small one (and the reverse if B3 < 0).

Should be tested with H0: B3 = 0, i.e. that house size does not influence the bedroom effect on price.

Adjusted R-squared:

Traditional R^2 = 1 - (SSR/n) / (SST/n)

Adjusted R^2 = 1 - (SSR/(n-k-1)) / (SST/(n-1))

Adjusted R^2 penalizes inclusion of more x-variables (since k increases).
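A small Python sketch of the penalty, reading the adjusted formula as 1 - (SSR/(n-k-1)) / (SST/(n-1)). The SSR, SST, n and k values are hypothetical: model B adds one variable to model A and lowers SSR only marginally.

```python
def r_squared(ssr, sst):
    """Traditional R^2 = 1 - SSR/SST."""
    return 1 - ssr / sst


def adjusted_r_squared(ssr, sst, n, k):
    """Adjusted R^2 = 1 - (SSR/(n-k-1)) / (SST/(n-1))."""
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))


n, sst = 50, 100.0  # hypothetical sample size and total sum of squares

r2_a = r_squared(40.0, sst)                     # model A: k = 2
adj_a = adjusted_r_squared(40.0, sst, n, k=2)

r2_b = r_squared(39.9, sst)                     # model B: k = 3, SSR barely lower
adj_b = adjusted_r_squared(39.9, sst, n, k=3)

print("R^2:     ", r2_a, "->", r2_b)
print("adj. R^2:", adj_a, "->", adj_b)
```

Here the traditional R^2 rises slightly with the extra variable while the adjusted R^2 falls, because the small drop in SSR does not compensate for the lost degree of freedom.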

Limited Dependent Variable (LDV):

A dependent variable (y) whose range is restricted. The case covered here: y is a dummy/binary/boolean/qualitative variable (i.e. it can only have the value 0 or 1).

Often ok to use OLS even with LDV.

If y can be 0 or 1, the expected value of y given x can be interpreted as the probability that y is equal to 1. Therefore, a multiple linear regression model with a binary dependent variable is called the linear probability model (LPM).

In Stata: plot the estimated relation between lwage and exper:


lwage = b0 + b1 * educ + b2 * exper + b3 * exper2

First do regression.

reg lwage educ exper exper2

Find the predicted log wage, holding educ at its mean:

gen lwage_hat = _b[_cons] + _b[educ] * mean_of_educ + _b[exper] * exper + _b[exper2] * exper2

Where mean_of_educ is the mean value of educ (a number you type in).

Find the mean value:

summarize educ

Then plot the prediction against exper:

scatter lwage_hat exper

Multiple generated dummy variables:

For example, we want wage differences across marital status and race:

We already have two dummy variables, married and black. Then we calculate the various combinations of these.

gen marrblk = 1 if married == 1 & black == 1

gen marrwhit = 1 if married == 1 & black == 0

gen singblk = 1 if married == 0 & black == 1

gen singwhit = 1 if married == 0 & black == 0

Doesn’t automatically set values to 0 in the other cases (they become missing), so need to do it manually, e.g:

replace marrblk = 0 if married == 0 | black == 0

“replace” updates the contents of a variable, not really replaces it.

Delete a variable, useful if we really want to replace the variable with something completely different :)

drop variable

Can then regress on these variables. Use one (any of them) as the base group and leave it out of the regression. All other coefs are then measured relative to the base group.
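A Python sketch of the same four group dummies, to show why one group has to be left out as the base. The observations are made up; the variable names follow the Stata example above.

```python
# Hypothetical observations: (married, black), each coded 0/1.
people = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 0)]

rows = []
for married, black in people:
    rows.append(
        {
            # int(condition) gives 1 when true and 0 otherwise,
            # so no separate "set to 0" step is needed here.
            "marrblk": int(married == 1 and black == 1),
            "marrwhit": int(married == 1 and black == 0),
            "singblk": int(married == 0 and black == 1),
            "singwhit": int(married == 0 and black == 0),
        }
    )

for row in rows:
    # Every person falls in exactly one group, so the dummies sum to 1.
    print(row, "sum =", sum(row.values()))
```

Because the four dummies always sum to 1, including all of them together with a constant would give an exact linear relationship among the regressors (perfect collinearity, violating MLR.3). Dropping one group avoids this.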


delta lwage/delta married = B1 +

tab variable

Stata commands:

tab x

Lists the number of observations for each possible value of x. Useful for analyzing dummy variables.

Predictions in Stata:

First do a regression, then do:

predict yhat, xb

xb means that it does a linear prediction. It generates yhat (the fitted values) using the coefficients from the regression:

yhat = B0_hat + B1_hat * x …

Where _hat are the generated coefficients from the regression.

Time-series data

Logarithmic form common to eliminate scale effects.

Dummy variables often used to identify an event or to isolate a shock. Also for capturing seasonality.

Index numbers (e.g. CPI) often used as an independent variable.

Static model:

Static Phillips curve:

inf_t = B0 + B1 * unem_t + u_t

Inflation and unemployment a given year.

Difference from cross-sectional model is replacing i with t. Only estimates immediate effects on the dependent variable, i.e. effects that take place in the same year.

Finite Distributed Lag Models (FDL)

y_t = B0 + B1 * z_t + B2 * z_t-1 + B3 * z_t-2 + u_t

This model states that y is affected by a change in z in period t, but also by changes in z that happened earlier (at times t-1 and t-2).

Has a high risk of omitted variable bias.

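Shortcomings: The higher the number of lags you use, the more data you lose: each lag needs earlier values of z, so the first observations in the sample (one per lag) cannot be used.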


How interest rate at time t is impacted by inflation at times t, t-1 and t-2. After running regression we get:

int_t = 1.6 + 0.48inf_t - 0.15inf_t-1 + 0.32inf_t-2 + u_t

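Impact propensity/multiplier: The coefficient on the current-period variable (inf_t), i.e. the immediate effect of a one-unit increase.

Impact propensity is 0.48.

Long-run propensity/multiplier: The sum of the coefficients on the current and all lagged values, i.e. the total effect of a permanent one-unit increase.

Long-run propensity is 0.48-0.15+0.32 = 0.65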
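The two propensities can be read straight off the lag coefficients. A minimal Python sketch using the estimates from the interest rate example above (0.48, -0.15, 0.32):

```python
from itertools import accumulate

# Coefficients on inf_t, inf_t-1, inf_t-2 from the estimated FDL model.
lag_coefs = [0.48, -0.15, 0.32]

impact_propensity = lag_coefs[0]     # immediate effect
long_run_propensity = sum(lag_coefs)  # total effect of a permanent increase

# Effect on int after 0, 1, 2 periods of a permanent one-unit rise in inf.
cumulative = list(accumulate(lag_coefs))

print("impact propensity:  ", impact_propensity)
print("long-run propensity:", round(long_run_propensity, 2))
print("cumulative effects: ", [round(c, 2) for c in cumulative])
```

The last cumulative effect equals the long-run propensity, since after two periods all lags have taken effect.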


Many economic time series display a trending behavior over time, which might be important to incorporate in our model. Two series might seem related just because they follow the same trend. Danger of ignoring trends: Omitted variable bias.

Linear time trend:

y_t = a0 + a1*t + e_t

Example: The average growth rate in GDP per capita for Sweden during 1971-2012 is 1.7%. Hence, if y_t = log(gdp_per_capita), then a1 = 0.017 (with a log dependent variable, a1 is approximately the growth rate per period).


For example: If we suspect seasonality each quarter, add quarterly dummies (Q1 is the base quarter):

y_t = B0 + g1*Q2_t + g2*Q3_t + g3*Q4_t + B1*x_t1 + B2*x_t2 + u_t

If there is no seasonality we would find that all g = 0 (g1 = g2 = g3 = 0), which can be tested with an F-test.

Time-series assumptions:

Assumption TS.1: The population model is linear in parameters: y_t=β0+β1x_t1+β2x_t2 +…+ u_t.

Assumption TS.2: No perfect collinearity: In the sample, none of the independent variables is constant and there are no exact linear relationships among the independent variables

Assumption TS.3: Zero conditional mean (strict exogeneity): For each t, the expected value of the error u_t, given the explanatory variables for all time periods, is zero: E(u_t|X) = 0 for all t = 1,2,...,n. This states both that the error term is contemporaneously uncorrelated with each independent variable, and that it is uncorrelated with all independent variables at every other point in time, i.e. past, present and future. Omitted variables that correlate with the explanatory/independent variables violate TS.3 (omitted variable bias).

Assumption TS.4: Homoskedasticity: Constant variance across the data points. The error u_t has the same variance given any value of the explanatory variables, for all t.

Assumption TS.5: No Serial Correlation. Conditional on X, the errors in two different time periods are uncorrelated, for all t different from s: Corr[u_t, u_s|X] = 0

Assumption TS.6: Normality and independence: The population error u is independent of the explanatory variables x1, x2,…,xk, and is normally distributed with zero mean and variance σ^2: u ~ Normal(0, σ^2). Non-normally distributed error terms violate TS.6. Can you think of examples in which it’s obvious the dependent variable (conditional on the x-variables) does not follow a normal distribution?

If TS.1-3 hold, the OLS estimator is unbiased.

If TS.1-5 hold, the OLS estimator is said to be the “Best Linear Unbiased Estimator” (BLUE).

TS 1-5 are known as the Gauss-Markov assumptions (for time-series regression).

TS 1-6 are known as the classical linear model (CLM) assumptions (for time-series regression).

If the independent variables are strictly exogenous, i.e. E[u_t|X] = 0, then the OLS estimator is unbiased.

If we relax the strict exogeneity assumption and instead assume only contemporaneous exogeneity, E[u_t|x_t] = 0, with stationary and weakly dependent data, OLS is still consistent but not necessarily unbiased.

Serial Correlation

Implies that the OLS estimator is no longer BLUE, only LUE. It is Linear and Unbiased, but no longer Best (lowest variance). Hence, there exists an alternative better estimator than OLS in this setting.

Serial correlation means the error terms are related across time, and the usual OLS variance formula (and hence standard errors and t statistics) is misleading.

If p # 0 in u_t = p*u_t-1 + e_t, we have serial correlation.

There is a pattern across the error terms. The error terms are then not independently distributed across the observations and are not strictly random.

Violates TS.5 because Corr(u_t, u_t-s) # 0

OLS can still be unbiased.


Testing for AR(1) serial correlation:

1) Estimate the model by running the regression

2) Extract the residuals uhat_t

3) Regress uhat_t on its own lag, and test the null hypothesis H0: p = 0 against the alternative hypothesis H1: p # 0

In equations:

1) y_t = bhat_0 + bhat_1 * x_t1 + bhat_2 * x_t2 + uhat_t

2) Model of serial correlation of u: u_t = p*u_t-1 + e_t

H0: If p = 0, we have no serial correlation

H1: If p # 0, we have serial correlation

3) uhat_t = p*uhat_t-1 + e_t
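A rough Python sketch of step 3 (standard library only). As an assumption for illustration, the residuals are not taken from a real regression: they are simulated directly as an AR(1) series with p = 0.6, and then regressed on their own lag without an intercept.

```python
import math
import random

random.seed(0)

# Stand-in for OLS residuals: simulate u_t = p * u_t-1 + e_t with an assumed p.
p_true, n = 0.6, 2000
u = [0.0]
for _ in range(n - 1):
    u.append(p_true * u[-1] + random.gauss(0, 1))

# Regress u_t on u_t-1 (no intercept): p_hat = sum(u_t * u_t-1) / sum(u_t-1^2).
lagged, current = u[:-1], u[1:]
sxx = sum(l * l for l in lagged)
p_hat = sum(c * l for c, l in zip(current, lagged)) / sxx

# t statistic for H0: p = 0, using the usual OLS standard error of the slope.
resid = [c - p_hat * l for c, l in zip(current, lagged)]
sigma2 = sum(r * r for r in resid) / (len(resid) - 1)
t_stat = p_hat / math.sqrt(sigma2 / sxx)

print("p_hat =", p_hat)
print("t statistic =", t_stat)  # a large |t| means we reject H0: p = 0
```

With real data, u would be the residuals saved after step 1, and the t statistic is compared with the critical value in the usual way.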

Autocorrelation function (ACF):

ACF for lag one (i.e. one time unit back):

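corr(rt, rt-1) = cov(rt, rt-1) / sqrt(var(rt) * var(rt-1)) = cov(rt, rt-1) / var(rt) = ACF(1)

(The last step uses stationarity: var(rt) = var(rt-1).)

ACF for lag s:

ACF(s) = sum(t=s+1 to T) of ((rt - rbar) * (rt-s - rbar)) / sum(t=1 to T) of (rt - rbar)^2, where rbar is the sample mean of r and T is the number of observations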

ACF(s) = E

Usually decreases at larger time gaps, i.e. ACF(1) is larger than ACF(2). If ACF(1) is small we have less dependency between time period t and t-1.
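A minimal Python version of the sample ACF at lag s: the sum over t = s+1..T of the products of deviations from the sample mean, divided by the sum of squared deviations over all T observations. The function name acf and the short test series are made up for this sketch.

```python
def acf(r, s):
    """Sample autocorrelation of the series r at lag s."""
    n = len(r)
    r_bar = sum(r) / n
    # Python indices start at 0, so t = s..n-1 here matches t = s+1..T.
    numerator = sum((r[t] - r_bar) * (r[t - s] - r_bar) for t in range(s, n))
    denominator = sum((rt - r_bar) ** 2 for rt in r)
    return numerator / denominator


series = [1, 2, 3, 4, 5]  # toy series, illustration only
for lag in range(3):
    print("ACF(%d) =" % lag, acf(series, lag))
```

By construction ACF(0) = 1, since the numerator and denominator are then the same sum.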

In Stata, this can be calculated automatically by

ac dependent_var, lags(s)


ac rus, lags(12)

ac dyus, lags(12)

Grey area is the non-rejection area: inside it we cannot reject that there is no dependency between time periods, i.e. there is a chance there is no dependency. Outside the area we reject the hypothesis of no dependency (at the confidence level used for the band), although we can never be completely sure.

In financial markets, if markets are efficient, there are no arbitrage opportunities, so returns should have zero predictability.

Stationary and Weak Dependence:
Time series observations can rarely if ever be assumed to be independent. This might imply that the CLM assumptions do not hold. However, the OLS estimator mi. If we assume that our data are
stationary and weakly dependent, we can modify TS.1-3.

We can replace, for example, TS.3 (strict exogeneity) with a weaker assumption: contemporaneous exogeneity, E(u_t|x_t) = 0.

Instrumental variables (IV):

When there’s a correlation between an independent/explanatory variable and the error term (i.e. MLR.4 doesn’t hold), we can use instrumental variables: a variable that is related to the explanatory variable but not to the error term, used to isolate the part of the explanatory variable’s variation that is unrelated to the error.

For example, if we want to control for a demand shock but cannot isolate the demand shock itself, we can use an instrumental variable instead.

Good instrument:

1) Relevant: Contains some information that has predictive power:

corr(Z, lfare) # 0

corr(bmktshr, lfare) > 0

2) Valid (exogenous): Uncorrelated with the error term: corr(Z, u) = 0

corr(bmktshr, u) = 0

Step 1) Predict the variable we want to replace using the new instrumental variable.

Step 2) Replace the variable with the predicted variable in the original OLS regression

Stata example:

Perform step 1 and 2 with one command, using bmktshr as an instrument for lfare (the option first also displays the first-stage regression):

ivregress 2sls lpassen (lfare=bmktshr) ldist ldist2, first
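To see what the two steps do, here is a rough Python sketch (standard library only) on simulated data. Everything in it is an assumption for illustration: one endogenous regressor x, one instrument z, a true coefficient of 2, and an unobserved shock e that affects both x and y. It is not the airline data used in the Stata example.

```python
import random

random.seed(1)


def slope_intercept(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum(
        (a - x_bar) ** 2 for a in x
    )
    return y_bar - b1 * x_bar, b1


n, beta_true = 5000, 2.0
z = [random.gauss(0, 1) for _ in range(n)]  # instrument: moves x, unrelated to the error
e = [random.gauss(0, 1) for _ in range(n)]  # unobserved shock hitting both x and y
x = [zi + ei for zi, ei in zip(z, e)]       # endogenous regressor
y = [beta_true * xi + ei + random.gauss(0, 1) for xi, ei in zip(x, e)]

# Plain OLS of y on x: biased, because x is correlated with the error (through e).
_, b_ols = slope_intercept(x, y)

# Step 1: predict x using the instrument z.
a0, a1 = slope_intercept(z, x)
x_hat = [a0 + a1 * zi for zi in z]

# Step 2: replace x with its prediction in the original regression.
_, b_2sls = slope_intercept(x_hat, y)

print("OLS estimate: ", b_ols)
print("2SLS estimate:", b_2sls)
```

In this simulation OLS overstates the coefficient while the two-step estimate lands close to 2. Doing step 2 by hand gives the right coefficient but not the right standard errors, which is one reason to use ivregress in practice.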