### Econometrics lecture notes 2

Warning: Bad notes below. May be misleading or contain serious errors.

Please see Regression analysis for an attempt to summarise the topic better.

Statistical program:
STATA - Easier and better than SPSS

## Statistical concepts

### Random variable

• Random variable: In probability and statistics, a random variable or stochastic variable is a variable whose value is determined by the outcome of a random process, so it is not known until it is observed.
• Two types:
• Discrete variable: Can only take a countable number of distinct values, e.g. number of computers in household (0,1,2,3).
• Continuous variable: Can take any value in an interval, e.g. the interest rate, stock market indexes, household income.
• Expected value: The expected value of a random variable is the weighted average of all possible values that this random variable can take on.
• E(x) = μ
• With constant: E(a) = a
• E(a * x) = a * E(x)
• E(a * x + b) = a * E(x) + b
• Variance: The larger the variance, the greater the spread of the values. E.g. values ranging over 0-40 have a larger variance than values ranging over 10-30. The variance can never be negative, because it is an average of squared deviations. Denoted by:
• σ^2
• var(x) = σ^2 = E[(x - E(x))^2] = E(x^2) - (E(x))^2
• Variance of a linear transformation, where a and b are constants:
• Var(ax + b) = a^2 * Var(x)
• Use definitions above: Var(ax + b) = E[(ax + b - E(ax + b))^2]
• Var(ax + b) = E[(ax + b - a * μ - b)^2]
• Var(ax + b) = E[(ax - aμ)^2] = E[(a(x - μ))^2]
• Var(ax + b) = a^2 * E[(x - μ)^2] = a^2 * Var(x)
• Two random variables in combination: E(ax + by) = a * E(x) + b * E(y)
• Any number n of random variables x1, x2, ..., xn:
• E(a1x1 + a2x2 + ... + anxn) = a1 * E(x1) + a2 * E(x2) + ... + an * E(xn)
• Covariance: cov(x, y) = E(xy) - E(x) * E(y)
• If x and y are independent, then E(xy) = E(x) * E(y), so cov(x, y) = 0
• If x and y are independent (or merely uncorrelated): var(ax + by) = a^2 * var(x) + b^2 * var(y)
• If they are correlated, add the covariance term: var(ax + by) = a^2 * var(x) + b^2 * var(y) + 2 * ab * cov(x, y). For a difference, var(ax - by), the last term enters with a minus sign.
• Correlation coefficient
• ρ(x, y) = cov(x, y) / sqrt(var(x) * var(y))
• -1 <= ρ <= 1
• If all points lie on a downward-sloping straight line, we have a perfect negative linear relationship (ρ = -1); if upward sloping, a perfect positive relationship (ρ = 1). If the points are scattered randomly, there is no correlation. If they form a U, there is a perfect non-linear relationship, even though the correlation coefficient can be zero, because it only measures linear association.
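
The rules in the list above can be checked numerically. The following is a minimal sketch on a small made-up data set; the helper names `mean`, `var` and `cov` are illustrative and use population (divide-by-n) moments, for which the identities hold exactly up to rounding.

```python
# Made-up data, purely for illustration.
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 5.0, 2.0]
a, b = 2.0, -3.0


def mean(v):
    return sum(v) / len(v)


def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)


def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)


# E(a*x + b) = a*E(x) + b
lhs_mean = mean([a * xi + b for xi in x])
rhs_mean = a * mean(x) + b

# Var(a*x + b) = a^2 * Var(x)
lhs_var = var([a * xi + b for xi in x])
rhs_var = a ** 2 * var(x)

# var(a*x + b*y) = a^2*var(x) + b^2*var(y) + 2*a*b*cov(x, y)
lhs_sum = var([a * xi + b * yi for xi, yi in zip(x, y)])
rhs_sum = a ** 2 * var(x) + b ** 2 * var(y) + 2 * a * b * cov(x, y)

# The correlation coefficient always lies between -1 and 1.
rho = cov(x, y) / (var(x) * var(y)) ** 0.5
print(lhs_mean, rhs_mean, lhs_var, rhs_var, lhs_sum, rhs_sum, rho)
```

Each left-hand side should agree with its right-hand side up to floating-point rounding.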

Example B7:
x1, x2, ..., xn are independent random variables with the same probability distribution, with mean μ (the expected value) and variance σ^2.
x̄ = sum(xi) / n
E(x̄) = E(sum(xi) / n) = 1/n * [E(x1) + E(x2) + ... + E(xn)]
= 1/n * [μ + μ + ... + μ] = n * μ / n = μ

We want to find the variance. Because they're independent, the covariance between each pair of random variables = 0:
Var(x̄) = Var(1/n * sum(xi)) = 1/n^2 * [Var(x1) + Var(x2) + ... + Var(xn)]
= 1/n^2 * [σ^2 + σ^2 + ... + σ^2] = nσ^2 / n^2 = σ^2 / n
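
The result of Example B7 can be illustrated by simulation. This is only a sketch: the values of `mu`, `sigma`, `n` and `reps` are hypothetical, and the normal distribution is an assumption made for convenience (the result itself only needs independence and a common mean and variance).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
mu, sigma, n = 5.0, 2.0, 10  # hypothetical values
reps = 20000

# Draw many samples of size n and record each sample mean.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

mean_of_means = statistics.fmean(sample_means)     # should be close to mu
var_of_means = statistics.pvariance(sample_means)  # should be close to sigma^2 / n
print(mean_of_means, var_of_means, sigma ** 2 / n)
```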

Example B8:
y1, y2, y3 is a sample of observations from N(μ, σ^2)
cov(y1, y2) = cov(y1, y3) = cov(y2, y3) = 0.5 * σ^2
ȳ = (y1 + y2 + y3) / 3

a) E(ȳ) = E((y1 + y2 + y3) / 3) = 1/3 * [E(y1) + E(y2) + E(y3)]
= 1/3 * (μ + μ + μ) = 3μ / 3 = μ

We want to find the variance. The variables are not independent, so we have covariance terms:
Var(ȳ) = Var((y1 + y2 + y3) / 3)
= 1/3^2 * [Var(y1) + Var(y2) + Var(y3) + 2 * cov(y1, y2) + 2 * cov(y1, y3) + 2 * cov(y2, y3)]
= 1/9 * [σ^2 + σ^2 + σ^2 + 2 * 0.5σ^2 + 2 * 0.5σ^2 + 2 * 0.5σ^2]
= 6σ^2 / 9 = 2/3 * σ^2
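
The same variance can be computed as a double sum over a covariance matrix. The sketch below works in units of σ^2 and uses exact fractions; the names `cov_matrix` and `weights` are illustrative.

```python
from fractions import Fraction

# Work in units of sigma^2, so Var(y_i) = 1 and cov(y_i, y_j) = 1/2.
half = Fraction(1, 2)
cov_matrix = [
    [1, half, half],
    [half, 1, half],
    [half, half, 1],
]
weights = [Fraction(1, 3)] * 3  # the sample mean weights each y_i by 1/3

# Var(sum_i w_i * y_i) = sum_i sum_j w_i * w_j * cov(y_i, y_j)
var_mean = sum(
    weights[i] * weights[j] * cov_matrix[i][j]
    for i in range(3)
    for j in range(3)
)
print(var_mean)  # 2/3, i.e. the variance of the mean is (2/3) * sigma^2
```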

### Distribution

Normal distribution. Wikipedia: "In probability theory, the normal (or Gaussian) distribution, is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value"

Density function: f(x) = 1/sqrt(2πσ^2) * exp(-(x-μ)^2 / (2σ^2))
x ~ N(μ, σ^2)
Z = (x - μ) / σ

E(Z) = E((x - μ)/σ) = (E(x) - μ)/σ
= (μ-μ)/σ = 0

Var(Z) = Var((x - μ)/σ) = Var(x/σ)
= 1/σ^2 * Var(x) = σ^2 / σ^2 = 1

Z ~ N(0,1)

Normal variable:
f(x) = 1/sqrt(2πσ^2) * e^(-(x-μ)^2/(2σ^2))
Z ~ N(0,1)
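
Standardisation can be checked with `statistics.NormalDist` from the Python standard library. In this sketch the parameters `mu`, `sigma` and the point `x0` are arbitrary illustrative values.

```python
from statistics import NormalDist

mu, sigma = 2.0, 3.0            # hypothetical parameters
x_dist = NormalDist(mu, sigma)  # x ~ N(mu, sigma^2)
z_dist = NormalDist(0.0, 1.0)   # standard normal

x0 = 5.0                # arbitrary point
z0 = (x0 - mu) / sigma  # standardised value, here 1.0

p_x = x_dist.cdf(x0)  # P(x < x0)
p_z = z_dist.cdf(z0)  # P(Z < z0)
print(p_x, p_z)
```

The two probabilities coincide, which is why a single table for N(0,1) is enough.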

Example B12:
A mutual fund. Its annual rate of return is normally distributed.

x = Annual rate of return for a certain mutual fund
x ~ N(0.10, 0.04^2)

a) What is the probability of getting a negative return for this fund?

We have to transform the x variable to standard normal:
P(x < 0) = P((x - μ)/σ < (0 - μ)/σ)
= P(Z < (0 - μ)/σ)
= P(Z < (0 - 0.10)/0.04)
= P(Z < -2.5)
By symmetry of the standard normal distribution, P(Z < -2.5) = 1 - P(Z < 2.5):
= 1 - P(Z < 2.5)
= 1 - 0.9938
= 0.0062

The probability of getting a negative one-year rate of return is 0.62%.

b) What is the probability of getting a return higher than 15%?

Transform x to standard normal:
P(x > 0.15) = P((x - μ)/σ > (0.15 - μ)/σ)
= P(Z > (0.15 - 0.10)/0.04)
= P(Z > 1.25)
Because the table gives the area to the left, P(Z < z), we use the complement:
= 1 - P(Z < 1.25)
= 1 - 0.8944
= 0.1056

The probability of a return higher than 15% is 10.56%.

c) The fund manager can raise the mean return to 12%, but the risk will increase as well (standard deviation = 5%). Would you advise the manager to make this portfolio change?

The expected value would increase to 12%, but the standard deviation would increase to 5%:
Before the change: x ~ N(0.10, 0.04^2). After the change: x ~ N(0.12, 0.05^2).

Repeating a) with the new parameters we get:
P(x < 0) = P((x - μ)/σ < (0 - μ)/σ)
= P(Z < (0 - μ)/σ)
= P(Z < (0 - 0.12)/0.05)
= P(Z < -2.4)
= 1 - P(Z < 2.4)
= 1 - 0.9918
= 0.0082
The probability of having a negative return is now 0.82%, which is higher than before the change.

Repeating b) with the new parameters we get:
P(x > 0.15) = P((x - μ)/σ > (0.15 - μ)/σ)
= P(Z > (0.15 - 0.12)/0.05)
= P(Z > 0.60)
Because the table gives the area to the left, we use the complement:
= 1 - P(Z < 0.60)
= 1 - 0.7257
= 0.2743
The probability of getting a return higher than 15% is now 27.43%, which is also higher than before the change.

Conclusion: Both the probability of a negative return and the probability of a return above 15% increase. The choice depends on the mutual fund manager's risk preference. If he's risk averse, he might want to choose the first option.
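
The four probabilities in Example B12 can be reproduced without a table. A minimal sketch using `statistics.NormalDist`; the helper `tail_probs` is a hypothetical name introduced here.

```python
from statistics import NormalDist


def tail_probs(mu, sigma):
    """Return (P(x < 0), P(x > 0.15)) for x ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(0.0), 1.0 - dist.cdf(0.15)


loss_old, high_old = tail_probs(0.10, 0.04)  # current portfolio
loss_new, high_new = tail_probs(0.12, 0.05)  # proposed portfolio

print(round(loss_old, 4), round(high_old, 4))
print(round(loss_new, 4), round(high_new, 4))
```

Small differences from the table-based figures in the last decimal come from rounding in the table.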

## Simple linear regression model

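A regression model has one dependent variable that depends on one or more explanatory variables. In the simple linear regression model there is exactly one explanatory variable.
The model measures the linear association between the two variables.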

Can answer questions like: How does economic growth affect poverty? What is the economic value of a public good like forests or clean air? What is the correlation between smoking and cancer?

We also want to explain how much.
That's the difference between correlation and regression analysis. We don't just want to know if there's a positive or negative relationship, but how much one variable changes if the other changes.

We have different types of data:
• Cross-sectional data: Data collected at a certain point in time. E.g. what was household income in Sweden at the end of 2009? What was unemployment on 1 Jan 2011?
• Time series data: Data collected on the same unit over successive time periods. E.g. the daily profits of a company or the inflation rate per month.
• Panel data: A mix between the two above. Multi-dimensional data (the two above are one-dimensional). Panel data contains observations on multiple phenomena observed over multiple time periods for the same firms or individuals. The database called Linda, containing records of many individuals in Sweden, is an example of such data.

Example:
Let's say we want to examine the relationship between household income and expenditure on food.
Simple economic theories state that the average household expenditures depend on household income.

E(y|x) = β1 + β2 * x

An upward sloping line.
β1 = The intercept. Even if people have 0 income they'll have some food expenditure (financed by taking loans etc).
β2 = The slope of the line

If income changes by 1 unit, the expenditures will change by the following:
β2 = dE(y|x)/dx

Because we're measuring average values, an individual household can deviate from the average. This deviation is the error term.
yi = The observed value
ei = yi - β1 - β2 * xi
This gives us the linear regression model:
yi = β1 + β2 * xi + ei
In this model, we have just one explanatory variable.

Assumptions in linear regression models:
1. Linear model: y = β1 + β2 * x + e
2. Expected value: E(e) = 0
3. Variance: var(e) = σ^2
4. Covariance: cov(ei, ej) = 0
5. The variable x is not random and must take at least two values. If there's no variance in x, it's a constant, and we cannot estimate any relation.
6. e ~ N(0, σ^2)
In cross-sectional data, the variance is most likely not constant, so we have problems with assumption 3.
In time-series data, we have dependency over time, so most likely we have covariance between the errors, which is a problem for assumption 4.

Least Squares Principle - We have a linear model with an error term. We want to find the values of β1 and β2 that minimize the sum of squared errors.

In the example above, we know that:
ei = yi - β1 - β2 * xi
We sum the squared errors over all observations:
sum(ei^2) = sum((yi - β1 - β2 * xi)^2)
And differentiate with respect to β1 and β2, setting each derivative to zero:
∂ sum(ei^2) / ∂β1 = 2 * sum((yi - β1 - β2 * xi) * (-1)) = 0
∂ sum(ei^2) / ∂β2 = 2 * sum((yi - β1 - β2 * xi) * (-xi)) = 0
-2 * sum(yi) + 2Nβ1 + 2β2 * sum(xi) = 0
-2 * sum(xi * yi) + 2β1 * sum(xi) + 2β2 * sum(xi^2) = 0
We write b1 and b2 for the values of β1 and β2 that solve these equations (the least squares estimates):
sum(yi) - Nb1 - b2 * sum(xi) = 0
sum(xi * yi) - b1 * sum(xi) - b2 * sum(xi^2) = 0
Nb1 + b2 * sum(xi) = sum(yi)                          (1)
b1 * sum(xi) + b2 * sum(xi^2) = sum(xi * yi)      (2)

From (1) we get that:
Nb1 + b2 * sum(xi) = sum(yi)
Nb1 = sum(yi) - b2 * sum(xi)
b1 = sum(yi)/N - b2 * sum(xi)/N
b1 = ȳ - b2 * x̄

We now want to find b2. We multiply (1) by sum(xi) and (2) by N:
Nb1 * sum(xi) + b2 * (sum(xi))^2 = sum(xi) * sum(yi)
Nb1 * sum(xi) + Nb2 * sum(xi^2) = N * sum(xi * yi)
Subtracting the first equation from the second, the b1 terms cancel:
Nb2 * sum(xi^2) - b2 * (sum(xi))^2 = N * sum(xi * yi) - sum(xi) * sum(yi)
We divide by N:
b2 * sum(xi^2) - b2 * (sum(xi))^2 / N = sum(xi * yi) - sum(xi) * sum(yi) / N
b2 * (sum(xi^2) - (sum(xi))^2 / N) = sum(xi * yi) - sum(xi) * sum(yi) / N
b2 = (sum(xi * yi) - sum(xi) * sum(yi) / N) / (sum(xi^2) - (sum(xi))^2 / N)
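
The closed-form solution can be evaluated directly from the four sums. The data below are hypothetical (x as income, y as food expenditure), chosen only so the arithmetic is easy to follow.

```python
# Hypothetical data: x = income, y = food expenditure.
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [4.0, 7.0, 8.0, 12.0, 14.0]
N = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope and intercept from the normal equations.
b2 = (sum_xy - sum_x * sum_y / N) / (sum_x2 - sum_x ** 2 / N)
b1 = sum_y / N - b2 * sum_x / N

# Equivalent deviation-from-mean form of the slope.
x_bar, y_bar = sum_x / N, sum_y / N
b2_dev = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)

# At the solution both first-order conditions hold:
# sum(e_i) = 0 and sum(x_i * e_i) = 0.
residuals = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]
sum_e = sum(residuals)
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))
print(b1, b2, b2_dev, sum_e, sum_xe)
```

For this data set the slope is 0.25 and the intercept is 1.5, and both residual sums are zero up to rounding.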

ŷ = b1 + b2 * x
Elasticity, evaluated at the sample means: ε = dy/dx * x̄/ȳ = b2 * x̄/ȳ

If x increases by 1% then the average y changes by approximately (b2 * x̄/ȳ)%
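
A short numerical illustration of the elasticity formula. The slope and the sample means are hypothetical values, not taken from these notes.

```python
# Hypothetical estimates and sample means.
b2 = 0.25
x_bar, y_bar = 30.0, 9.0

# Elasticity evaluated at the point of the means.
elasticity = b2 * x_bar / y_bar
print(round(elasticity, 3))
```

With these numbers, a 1% increase in x goes together with an increase in average y of roughly 0.83%.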

Summary - equations:
1. y = β1 + β2 * x + e
2. ŷ = b1 + b2 * x
3. b1 = ȳ - b2 * x̄
4. b2 = (sum(xi * yi) - sum(xi) * sum(yi) / N) / (sum(xi^2) - (sum(xi))^2 / N) = sum((xi - x̄)(yi - ȳ)) / sum((xi - x̄)^2)

Making use of 4) we get:
b2 = sum((xi - x̄)(yi - ȳ)) / sum((xi - x̄)^2)
= (sum((xi - x̄) * yi) - ȳ * sum(xi - x̄)) / sum((xi - x̄)^2)
= sum((xi - x̄) * yi) / sum((xi - x̄)^2)      (because sum(xi - x̄) = 0)
wi = (xi - x̄) / sum((xi - x̄)^2)
b2 = sum(wi * yi)
= sum(wi * (β1 + β2 * xi + ei))
= β1 * sum(wi) + β2 * sum(wi * xi) + sum(wi * ei)

sum(wi) = ?
sum(xi - x̄) / sum((xi - x̄)^2) = 1/sum((xi - x̄)^2) * sum(xi - x̄) = 0
sum(wi * xi) = ?
sum((xi - x̄) * xi) / sum((xi - x̄)^2) = (sum(xi^2) - x̄ * sum(xi)) / sum((xi - x̄)^2)
= (sum(xi^2) - 2 * x̄ * sum(xi) + x̄ * sum(xi)) / sum((xi - x̄)^2)
= (sum(xi^2) - 2 * x̄ * sum(xi) + N * x̄^2) / sum((xi - x̄)^2)      (because sum(xi) = N * x̄)
= sum((xi - x̄)^2) / sum((xi - x̄)^2) = 1

b2 = β2 + sum(wi * ei)
E(b2) = E(β2) + E(sum(wi * ei))
= β2 + sum(wi * E(ei))
We assume that E(ei) = 0 so:
= β2 + 0 = β2
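
The two properties of the weights and the unbiasedness of b2 can be illustrated by repeated sampling. This is a sketch under stated assumptions: the true parameters, the fixed x values and the normal errors are all hypothetical choices.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable
beta1, beta2, sigma = 1.0, 0.5, 2.0  # hypothetical true parameters
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # fixed (non-random) regressor

x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
w = [(xi - x_bar) / sxx for xi in x]  # the weights w_i

sum_w = sum(w)                                 # should be 0
sum_wx = sum(wi * xi for wi, xi in zip(w, x))  # should be 1

# Repeated sampling: new errors each time, same x values.
reps = 20000
estimates = []
for _ in range(reps):
    y = [beta1 + beta2 * xi + random.gauss(0.0, sigma) for xi in x]
    estimates.append(sum(wi * yi for wi, yi in zip(w, y)))

mean_b2 = sum(estimates) / reps  # should be close to beta2
print(sum_w, sum_wx, mean_b2)
```

Any single estimate can be far from β2, but the average over many samples settles near it.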

Gauss-Markov theorem: Wikipedia: "In a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares estimator. Here "best" means giving the lowest possible mean squared error of the estimate."

b1 ~ N(β1, σ^2 * sum(xi^2) / (N * sum((xi - x̄)^2)))
b2 ~ N(β2, σ^2 / sum((xi - x̄)^2))
σ̂^2 = sum(êi^2) / (N - 2)
êi = yi - ŷi

var(b1) = σ^2 * [sum(xi^2) / (N * sum((xi - x̄)^2))]
se(b1) = sqrt(var(b1))
var(b2) = σ^2 / sum((xi - x̄)^2)
se(b2) = sqrt(var(b2))
cov(b1, b2) = σ^2 * [-x̄ / sum((xi - x̄)^2)]

In matrix form, where X' is the transpose of X:
b = (X'X)^-1 * X'y
cov(b) = σ^2 * (X'X)^-1
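
For the simple regression the matrix expression reduces to a 2x2 inverse, which can be compared with the scalar formulas. In this sketch the x values and `sigma2` are hypothetical, and the inverse is written out by hand rather than taken from a linear algebra library.

```python
# Hypothetical regressor values and error variance.
x = [1.0, 2.0, 4.0, 7.0]
sigma2 = 3.0
N = len(x)

# X has a column of ones and a column of x, so the cross-product matrix is
# [[N, sum(x)], [sum(x), sum(x^2)]].
sx = sum(x)
sx2 = sum(xi ** 2 for xi in x)
det = N * sx2 - sx ** 2

# sigma^2 times the closed-form inverse of that 2x2 matrix.
cov_b = [
    [sigma2 * sx2 / det, -sigma2 * sx / det],
    [-sigma2 * sx / det, sigma2 * N / det],
]

# Scalar formulas.
x_bar = sx / N
sxx_dev = sum((xi - x_bar) ** 2 for xi in x)
var_b1 = sigma2 * sx2 / (N * sxx_dev)
var_b2 = sigma2 / sxx_dev
cov_b1_b2 = -sigma2 * x_bar / sxx_dev
print(cov_b, var_b1, var_b2, cov_b1_b2)
```

The diagonal of `cov_b` holds var(b1) and var(b2), and the off-diagonal element is cov(b1, b2).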

Example 2.7 - Least square principle:
y = β1 + β2 * x + e
ŷ = b1 + b2 * x
N = 51

a)
Estimated error variance:
σ̂^2 = 2.04672
The residual sum of squares (SSE) follows from:
σ̂^2 = SSE / (N - 2)
Gives us:
SSE = σ̂^2 * (N - 2) = 2.04672 * (51 - 2)
SSE = 100.29

b)
var(b2) = 0.00098
se(b2) = sqrt(var(b2)) = sqrt(0.00098) = 0.0313
What is:
sum((xi - x̄)^2) = ?
Definition of the estimated variance of b2:
var(b2) = σ̂^2 / sum((xi - x̄)^2)
sum((xi - x̄)^2) = σ̂^2 / var(b2)
sum((xi - x̄)^2) = 2.04672 / 0.00098 = 2088.5

c)
b2 = 0.18

d)
The average of x is:
x̄ = 69.139
The average of y is:
ȳ = 15.187
We want to calculate the intercept:
b1 = ȳ - b2 * x̄
b1 = 15.187 - 0.18 * 69.139 = 2.742
Our estimated line is:
ŷ = 2.742 + 0.18x

e)
We want to find the sum of squares of x:
sum(xi^2) = ?
We know that:
sum((xi - x̄)^2) = sum(xi^2) - N * x̄^2
So:
sum(xi^2) = sum((xi - x̄)^2) + N * x̄^2
We also know the values from earlier, so:
sum(xi^2) = 2088.5 + 51 * 69.139^2 = 245879

f)
We want to calculate the residual for the observation with x = 58.3 and y = 12.274.
y = b1 + b2 * x + ê
ê = y - ŷ
We can calculate the predicted value:
ŷ = 2.742 + 0.18 * 58.3
ŷ = 13.236
Now we can calculate the residual:
ê = y - ŷ = 12.274 - 13.236 = -0.962
The residual is negative, so this observation lies below the fitted line.
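
The arithmetic in Example 2.7 can be reproduced step by step. The inputs are the numbers given in the example; the variable names are illustrative.

```python
import math

N = 51
sigma2_hat = 2.04672  # estimated error variance
var_b2 = 0.00098
b2 = 0.18
x_bar, y_bar = 69.139, 15.187

sse = sigma2_hat * (N - 2)     # a) residual sum of squares
se_b2 = math.sqrt(var_b2)      # b) standard error of b2
sxx = sigma2_hat / var_b2      # b) sum of squared deviations of x
b1 = y_bar - b2 * x_bar        # d) intercept
sum_x2 = sxx + N * x_bar ** 2  # e) sum of squared x values
y_hat = b1 + b2 * 58.3         # f) fitted value at x = 58.3
resid = 12.274 - y_hat         # f) residual for the observed y = 12.274

print(round(sse, 2), round(se_b2, 4), round(sxx, 1))
print(round(b1, 3), round(sum_x2), round(y_hat, 3), round(resid, 3))
```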

Interval estimation / Confidence intervals - We create an interval in which each parameter could lie with a certain confidence. Wikipedia: "A particular kind of interval estimate of a population parameter used to indicate the reliability of an estimate."
Hypothesis tests - First of all, test if the parameters = 0. Secondly, test if they equal some other value of interest, e.g. if β1 or β2 = 1. Wikipedia: "A method of making decisions using data, whether from a controlled experiment or an observational study (not controlled)"

Confidence intervals
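b2 ~ N(β2, σ^2 / sum((xi - x̄)^2))
Z = (b2 - β2) / sqrt(σ^2 / sum((xi - x̄)^2)) ~ N(0, 1)
Because σ^2 is unknown, we replace σ^2 with its estimate σ̂^2 and use the t-distribution with N - 2 degrees of freedom:
t = (b2 - β2) / sqrt(σ̂^2 / sum((xi - x̄)^2)) = (b2 - β2) / se(b2)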

t-distribution: Wikipedia: "a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small"

P(-tc <= t <= tc) = 1 - α
The 100 * (1 - α)% confidence interval for β2 is: b2 +/- t(α/2, N-2) * se(b2)
E.g. α = 0.05 gives (1 - α) = 0.95, that is a 95% confidence interval.
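
As an illustration, the interval formula can be applied to the estimates from Example 2.7 (b2 = 0.18, se(b2) = 0.0313, N = 51). Applying it to that example is a choice made here, and the critical value `t_crit` is an assumption read from a t-table for 49 degrees of freedom rather than computed.

```python
# Estimates from Example 2.7; the critical value is read from a t-table.
b2 = 0.18
se_b2 = 0.0313
t_crit = 2.0096  # t(0.025, 49): two-sided 95% interval with N - 2 = 49 df

lower = b2 - t_crit * se_b2
upper = b2 + t_crit * se_b2
print(round(lower, 4), round(upper, 4))
```

The interval runs from about 0.117 to about 0.243 and does not contain 0.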

Hypothesis tests
1. Determine H0 and H1
2. Specify a test statistic
3. Select a significance level (α) and determine the rejection region
4. Calculate the value of the test statistic
5. State your conclusion

We could test that β2 = C:
H0 : β2 = C
We also want an alternative hypothesis:
H1 : β2 != C

Another test:
H0 : β2 = C
H1 : β2 > C

Or:
H0 : β2 = C
H1 : β2 < C
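E.g. to test whether a certain company's stock has a lower risk than the market.

Test statistic (the same for all three tests), evaluated under H0, where β2 = C:
t = (b2 - C) / se(b2)

Then we must choose the significance level α, the probability of rejecting H0 when it is in fact true.
This is normally α = 0.05
E.g. with α = 0.05, even 5% of the innocent will be sent to jail.

We need a critical value.
The rejection region can lie in both tails or in one tail, depending on H1. We go to the t-table:
Two-sided test (H1 : β2 != C): the critical values are +/- t(α/2, N-2), e.g. +/- t(0.025, N-2). We reject H0 if the absolute value of the observed t is larger than the critical value.
Right-tailed test (H1 : β2 > C): tc = t(α, N-2). We reject H0 if t > tc.
Left-tailed test (H1 : β2 < C): tc = -t(α, N-2). We reject H0 if t < tc.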
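
The five steps can be sketched as a small function. The name `t_test` and its arguments are hypothetical, the critical value must be supplied by the user from a t-table, and the numbers in the usage lines are the estimates from Example 2.7 with a table value for 49 degrees of freedom.

```python
def t_test(b, c, se, t_crit, alternative="two-sided"):
    """Return (t statistic, reject H0?) for H0: beta = c.

    t_crit must be the positive critical value matching the alternative:
    t(alpha/2, N-2) for "two-sided", t(alpha, N-2) for "greater" or "less".
    """
    t_stat = (b - c) / se
    if alternative == "two-sided":
        reject = abs(t_stat) > t_crit
    elif alternative == "greater":
        reject = t_stat > t_crit
    elif alternative == "less":
        reject = t_stat < -t_crit
    else:
        raise ValueError("unknown alternative")
    return t_stat, reject


# H0: beta2 = 0 against H1: beta2 != 0, with b2 = 0.18 and se(b2) = 0.0313.
t_stat, reject = t_test(0.18, 0.0, 0.0313, 2.0096)
print(round(t_stat, 2), reject)
```

Here the observed t of about 5.75 exceeds the critical value, so H0 is rejected at the 5% level.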

## The ordinary least squares (OLS) assumptions

1. Must be a linear relationship: yi = β0 + β1 * xi + ei
2. Expected value: E(e) = 0
3. Variance: var(ei) = σ^2 is constant for all i. If it's violated, then ei is said to be "heteroskedastic".
4. Covariance: cov(ei, ej) = 0 for all i ≠ j. If assumption 4 is violated, then ei is said to be "serially correlated" or "autocorrelated".
5. xi is non-random and takes on at least 2 values
6. Normal distribution: e ~ N(0, σ^2)

## Violations of the ordinary least squares (OLS) assumptions

Goal: Use our observations to estimate a simple relation between the variables.

### Assumption 1

We take our observation on y and x and run it through the linear model: yi = β0 + β1 * xi + ei

### Assumption 2

Most important, together with assumption 1.
Taking the expected value of the left-hand side and the right-hand side:
E(yi) = E(β0 + β1 * xi + ei)
Using the rule E(ax + b) = aE(x) + b and E(ei) = 0 we can transform this into:
E(yi) = β0 + β1 * E(xi) + E(ei) = β0 + β1 * E(xi)
A single sample will not give exactly this value, but on average over repeated samples we should receive it.

Example: If we draw repeated samples of the monetary policy rate from a group of countries, we should arrive at the same result.

The most important reason why we need assumption 2 is to ensure that the estimators of β0 and β1 are unbiased.

Which is better?
1. An estimator of β1 that is biased but with low variance, or;
2. An estimator β1* that is unbiased but with high variance
A trade-off between bias and variance. Which should we choose?

Answer: Number 2. A biased estimator with low variance is actually worse than an unbiased estimator with high variance, because the unbiased estimator is at least centred on the correct value, while a biased estimator with low variance is concentrated around the wrong value.
Unbiasedness is extremely important and thus we should choose the least biased estimator.
Unbiasedness: If x̄ is an estimator of μ and E(x̄) = μ then x̄ is unbiased.

To show that least squares is unbiased, we want to show that E(b1) = β1
To show this we need assumptions 1, 2 and 5.

### Assumption 3

If var(ei) is constant, that is: the variance of ei does not depend on xi, then ei is homoskedastic. Otherwise, it's heteroskedastic. Under heteroskedasticity, the usual formulas for the variances of the least squares estimators are no longer correct.

When might we expect heteroskedasticity to occur?

Consider the following regression:
Food expenditures = β0 + β1 * Income + error
For low income households we can expect food expenditures to be a large part of their budget, which leaves little room for variation. High income households spend a smaller part of their money on food and more on other things, and they have more freedom in how much they spend on food. So the error term will look very different, with lower variance for low income households and higher variance for high income households. The error variance depends on the level of income. This is heteroskedasticity, which is problematic.
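
A simulated illustration of such an error term. The income range and the assumption that the error's standard deviation is proportional to income are hypothetical choices made for this sketch, not something the notes specify.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical household incomes.
incomes = [random.uniform(10.0, 100.0) for _ in range(4000)]
# Assumption for illustration only: the standard deviation of the
# error term is proportional to income.
errors = [random.gauss(0.0, 0.05 * inc) for inc in incomes]

# Compare the spread of the errors for low- and high-income households.
low = [e for inc, e in zip(incomes, errors) if inc < 55.0]
high = [e for inc, e in zip(incomes, errors) if inc >= 55.0]
var_low = statistics.pvariance(low)
var_high = statistics.pvariance(high)
print(var_low, var_high)
```

The error variance in the high-income group comes out several times larger than in the low-income group, which is what a violation of assumption 3 looks like.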

### Scenarios

1. ei = β2 * zi + ui
   1. E(ei) = β2 * zi + E(ui) = β2 * zi ≠ 0
   2. The estimator is biased and not valid, the so-called omitted variable bias.
   3. To make it unbiased, we must include zi in the regression as an additional variable.
2. var(ei) = xi^2 * σ^2
   1. Because x varies with each i, the variance is not constant. The estimator is unbiased, but we have a problem with assumption 3.
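
The first scenario can be illustrated by simulation. This is a sketch under assumptions that go beyond the notes: the true parameters are hypothetical and the omitted variable z is made to be correlated with x, which is what causes the slope estimate to be biased. The helper `slope` is an illustrative name.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable
beta0, beta1, beta2 = 1.0, 2.0, 3.0  # hypothetical true parameters
n = 20000


def slope(x, y):
    """Least squares slope of y on x (regression with an intercept)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x = [random.gauss(0.0, 1.0) for _ in range(n)]
# Assumption: the omitted variable z is correlated with x.
z = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
u = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [beta0 + beta1 * xi + beta2 * zi + ui for xi, zi, ui in zip(x, z, u)]

b1_short = slope(x, y)  # z is left out, so it is absorbed into the error term
print(b1_short)         # close to beta1 + beta2 * 0.5 = 3.5, not to beta1 = 2.0
```

The estimate picks up part of the effect of z, so it does not settle near the true β1 even in a large sample.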

### Assumption 5

The assumption that xi is non-random is only for simplicity. It means that if we made repeated drawings of y1, ..., yN and x1, ..., xN from the population, then the values of xi will be the same in each drawing.

Example: If yi is money supply at time i for a particular country, and xi is the policy interest rate of the ECB, then xi would be the same if we were to sample different countries.

If xi is random, then we need to assume that cov(xi, ei) = 0
This is required to ensure unbiasedness.

The assumption that xi takes on at least two values means that it cannot be perfectly collinear with the intercept. If xi takes only one value, the least squares cannot be computed.

Static model - Contains only the contemporaneous values, that is, no lags
Lags - Variables that are lagged in time, e.g. x_t-1

The error terms are autocorrelated if cov(e_t, e_s) ≠ 0 for some t ≠ s.

Example:
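money demand_t = β0 + β1 * income_t + e_t
A monetary policy shock (r) is not included among the explanatory variables, so it ends up in the error term (e_t). If the shock affects money demand for more than one period:

e_t = f(r_t, r_t-1)
But
e_t-1 = f(r_t-1, r_t-2)

Both e_t and e_t-1 depend on r_t-1, so they are correlated: we have autocorrelation.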
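
A simulated version of this example. The notes leave the function f unspecified, so the linear form e_t = r_t + theta * r_t-1 used below is an assumption, as are the value of `theta` and the normally distributed shocks.

```python
import random

random.seed(4)  # fixed seed so the run is repeatable
theta = 0.5     # hypothetical weight on the lagged shock
T = 50000

# r_t: monetary policy shocks, independent over time.
r = [random.gauss(0.0, 1.0) for _ in range(T + 1)]
# Assumed linear form of f: e_t = r_t + theta * r_(t-1)
e = [r[t] + theta * r[t - 1] for t in range(1, T + 1)]

e_bar = sum(e) / T
cov_lag1 = sum((e[t] - e_bar) * (e[t - 1] - e_bar) for t in range(1, T)) / (T - 1)
var_e = sum((et - e_bar) ** 2 for et in e) / T
rho1 = cov_lag1 / var_e  # first-order autocorrelation of the errors
print(rho1)              # theory: theta / (1 + theta^2) = 0.4
```

Even though the shocks themselves are independent, the error terms of neighbouring periods are clearly correlated.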

Both heteroskedasticity and autocorrelation can be the result of omitted variables that end up in the error term because they're not included as separate variables. That is, the dependent variable and independent variables must have similar trends, otherwise the separate trends will be captured in the error term.

The countermeasures are the same for both:
We can use GLS (generalised least squares)
We can try to robustify the standard errors, that is, use robust standard errors