Warning: Bad notes below. May be misleading or contain serious errors. Please see Regression analysis for an attempt to summarise the topic better.
Statistical program: STATA - easier and better than SPSS
Statistical concepts
Random variable
Example B7: x1, x2, ..., xn are independent random variables with the same probability distribution, with mean μ (the expected value) and variance σ^2. The sample mean is x̄ = (1/n) * sum(xi).
E(x̄) = E((1/n) * sum(xi)) = (1/n) * [E(x1) + E(x2) + ... + E(xn)] = (1/n) * [μ + μ + ... + μ] = nμ/n = μ
We want to find the variance. Because the variables are independent, the covariance between each pair of them is 0:
Var(x̄) = Var((1/n) * sum(xi)) = (1/n^2) * [Var(x1) + Var(x2) + ... + Var(xn)] = (1/n^2) * [σ^2 + σ^2 + ... + σ^2] = nσ^2/n^2 = σ^2/n
Example B8: y1, y2, y3 is a sample of observations from N(μ, σ^2) with cov(y1, y2) = cov(y1, y3) = cov(y2, y3) = 0.5 * σ^2. Let ȳ = (y1 + y2 + y3)/3.
a) E(ȳ) = E((y1 + y2 + y3)/3) = (1/3) * [E(y1) + E(y2) + E(y3)] = (1/3) * (μ + μ + μ) = 3μ/3 = μ
We want to find the variance. The variables are not independent, so the covariances enter:
Var(ȳ) = Var((y1 + y2 + y3)/3) = (1/3^2) * [Var(y1) + Var(y2) + Var(y3) + 2*cov(y1, y2) + 2*cov(y1, y3) + 2*cov(y2, y3)]
= (1/9) * [σ^2 + σ^2 + σ^2 + 2*0.5σ^2 + 2*0.5σ^2 + 2*0.5σ^2] = 6σ^2/9 = (2/3)σ^2
Distribution
Normal distribution. Wikipedia: "In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value."
Density: f(x) = 1/sqrt(2πσ^2) * exp(-(x - μ)^2 / (2σ^2)), where x ~ N(μ, σ^2).
Standardisation: Z = (x - μ)/σ
E(Z) = E((x - μ)/σ) = (E(x) - μ)/σ = (μ - μ)/σ = 0
Var(Z) = Var((x - μ)/σ) = (1/σ^2) * Var(x) = σ^2/σ^2 = 1
So Z ~ N(0, 1).
Example B12: A mutual fund whose annual rate of return is normally distributed. x = annual rate of return for the fund, x ~ N(0.10, 0.04^2).
a) What is the probability of getting a negative return for this fund? We transform x to standard normal and use the symmetry of the distribution:
P(x < 0) = P((x - μ)/σ < (0 - μ)/σ) = P(Z < (0 - 0.10)/0.04) = P(Z < -2.5) = 1 - P(Z < 2.5) = 1 - 0.9938 = 0.0062
The probability of getting a negative one-year rate of return is 0.62%.
b) What is the probability of getting a return higher than 15%? Transform x to standard normal:
P(x > 0.15) = P((x - μ)/σ > (0.15 - 0.10)/0.04) = P(Z > 1.25) = 1 - P(Z < 1.25) = 1 - 0.8944 = 0.1056
The probability is 10.56%.
c) The fund manager can raise the mean return to 12%, but the risk will increase as well (standard deviation 5%), i.e. x ~ N(0.12, 0.05^2) instead of x ~ N(0.10, 0.04^2). Would you advise the manager to make this portfolio change?
Redoing a): P(x < 0) = P(Z < (0 - 0.12)/0.05) = P(Z < -2.4) = 1 - P(Z < 2.4) = 1 - 0.9918 = 0.0082
The probability of a negative return is now 0.82%, which is higher than before the change.
Redoing b): P(x > 0.15) = P(Z > (0.15 - 0.12)/0.05) = P(Z > 0.60) = 1 - P(Z < 0.60) = 1 - 0.7257 = 0.2743
The probability of a return above 15% is now 27.43%, also higher than before the change.
Conclusion: the probability of a negative return and the probability of a high return both increase. The choice depends on the fund manager's risk preference; if he is risk averse, he might prefer the original portfolio.
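These tail probabilities can be checked numerically. A minimal sketch, assuming Python with scipy is available (the figures are the ones from Example B12):

from scipy.stats import norm

# Original fund: x ~ N(0.10, 0.04^2)
print(norm.cdf(0, loc=0.10, scale=0.04))         # P(x < 0)    ~ 0.0062
print(1 - norm.cdf(0.15, loc=0.10, scale=0.04))  # P(x > 0.15) ~ 0.1056

# Alternative fund: x ~ N(0.12, 0.05^2)
print(norm.cdf(0, loc=0.12, scale=0.05))         # P(x < 0)    ~ 0.0082
print(1 - norm.cdf(0.15, loc=0.12, scale=0.05))  # P(x > 0.15) ~ 0.2743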
Simple linear regression model
One dependent variable that depends on one (or, in the general model, several) explanatory variables. Measures the linear association between two variables. Can answer questions like: How does economic growth affect poverty? What is the economic value of a public good like forests or clean air? What is the correlation between smoking and cancer? We also want to explain how much - that is the difference between correlation and regression analysis. We don't just want to know whether there is a positive or negative relationship, but how much one variable changes if the other changes. We have different types of data:
Example: Let's say we want to examine the relationship between household income and expenditure on food. Simple economic theory states that average household food expenditure depends on household income:
E(y|x) = β1 + β2 * x
An upward-sloping line.
β1 = the intercept: even a household with zero income will have some food expenditure (financed by loans, savings, etc.).
β2 = the slope of the line: if income changes by 1 unit, expected expenditure changes by β2 = dE(y|x)/dx.
Because the line only describes the average value, individual observations deviate from it. With y denoting the observed value, the error is ei = yi - β1 - β2 * xi, which gives us the linear regression model:
yi = β1 + β2 * xi + ei
In this model we have just one explanatory variable.
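A small simulation sketch of this model, assuming Python with numpy; the parameter values (intercept 40, slope 0.25, error spread 20) are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
beta1, beta2 = 40.0, 0.25           # hypothetical intercept and slope
income = rng.uniform(200, 2000, size=100)
e = rng.normal(0, 20, size=100)     # random error with mean zero
food = beta1 + beta2 * income + e   # yi = β1 + β2*xi + ei

# E(y|x) changes by β2 when income changes by one unit:
print((beta1 + beta2 * 1001) - (beta1 + beta2 * 1000))  # 0.25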
Assumptions in linear regression models:
In cross-sectional data the error variance is most likely not constant, so we have problems with assumption 3). In time-series data we have dependence over time, so most likely the errors are correlated, which is problematic for assumption 4).
Least Squares Principle - Given the linear model with an error term, we want to find the values of β1 and β2 that minimize the sum of squared errors. From the example above we know that:
ei = yi - β1 - β2 * xi
Sum the squared errors:
sum(ei^2) = sum(yi - β1 - β2 * xi)^2
And differentiate with respect to each parameter, setting the derivatives to zero:
∂ sum(ei^2)/∂β1 = 2 * sum(yi - β1 - β2 * xi) * (-1) = 0
∂ sum(ei^2)/∂β2 = 2 * sum(yi - β1 - β2 * xi) * (-xi) = 0
Which gives:
-2 * sum(yi) + 2Nβ1 + 2β2 * sum(xi) = 0
-2 * sum(xi * yi) + 2β1 * sum(xi) + 2β2 * sum(xi^2) = 0
Writing b for the estimators of β:
sum(yi) - N * b1 - b2 * sum(xi) = 0
sum(xi * yi) - b1 * sum(xi) - b2 * sum(xi^2) = 0
The normal equations:
N * b1 + b2 * sum(xi) = sum(yi)                  (1)
b1 * sum(xi) + b2 * sum(xi^2) = sum(xi * yi)     (2)
From (1) we get:
b1 = sum(yi)/N - b2 * sum(xi)/N
b1 = ȳ - b2 * x̄
To find b2, multiply (1) by sum(xi) and (2) by N:
N * b1 * sum(xi) + b2 * (sum(xi))^2 = sum(xi) * sum(yi)
N * b1 * sum(xi) + N * b2 * sum(xi^2) = N * sum(xi * yi)
Subtracting the first from the second:
N * b2 * sum(xi^2) - b2 * (sum(xi))^2 = N * sum(xi * yi) - sum(xi) * sum(yi)
Divide by N:
b2 * [sum(xi^2) - (sum(xi))^2 / N] = sum(xi * yi) - sum(xi) * sum(yi) / N
b2 = [sum(xi * yi) - sum(xi) * sum(yi) / N] / [sum(xi^2) - (sum(xi))^2 / N]
The fitted line is ŷ = b1 + b2 * x.
Elasticity: ε = dy/dx * x/y = b2 * x/y. If x increases by 1%, then on average y changes by (b2 * x/y)%.
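A numerical sketch of the formulas just derived, assuming Python with numpy; the data points are made up, and np.polyfit is used only as a cross-check:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
N = len(x)

b2 = (np.sum(x * y) - np.sum(x) * np.sum(y) / N) / (np.sum(x**2) - np.sum(x)**2 / N)
b1 = y.mean() - b2 * x.mean()
print(b1, b2)

# Cross-check with numpy's built-in least squares fit
print(np.polyfit(x, y, 1))   # returns [slope, intercept]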
Summary - equations:
Making use of (4) we get:
b2 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2) = [sum((x - x̄) * y) - ȳ * sum(x - x̄)] / sum((x - x̄)^2) = sum((x - x̄) * y) / sum((x - x̄)^2)
since sum(x - x̄) = 0. Define the weights
wi = (xi - x̄) / sum((xi - x̄)^2)
Then:
b2 = sum(wi * yi) = sum(wi * (β1 + β2 * xi + ei)) = β1 * sum(wi) + β2 * sum(wi * xi) + sum(wi * ei)
sum(wi) = sum(xi - x̄) / sum((xi - x̄)^2) = 0, because sum(xi - x̄) = 0.
sum(wi * xi) = sum((xi - x̄) * xi) / sum((xi - x̄)^2) = [sum(xi^2) - x̄ * sum(xi)] / sum((xi - x̄)^2) = [sum(xi^2) - 2 * x̄ * sum(xi) + N * x̄^2] / sum((xi - x̄)^2) = sum((xi - x̄)^2) / sum((xi - x̄)^2) = 1
(the second step uses x̄ * sum(xi) = N * x̄^2). Hence:
b2 = β2 + sum(wi * ei)
E(b2) = E(β2) + E(sum(wi * ei)) = β2 + sum(wi * E(ei)) = β2 + 0 = β2, assuming E(ei) = 0, so b2 is unbiased.
Gauss-Markov theorem. Wikipedia: "In a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares estimator. Here 'best' means giving the lowest possible mean squared error of the estimate."
Sampling distributions:
b1 ~ N(β1, σ^2 * sum(xi^2) / (N * sum((x - x̄)^2)))
b2 ~ N(β2, σ^2 / sum((x - x̄)^2))
Estimated error variance: σ̂^2 = sum(êi^2) / (N - 2), where ê = y - ŷ.
var(b1) = σ̂^2 * [sum(xi^2) / (N * sum((x - x̄)^2))], se(b1) = sqrt(var(b1))
var(b2) = σ̂^2 / sum((x - x̄)^2), se(b2) = sqrt(var(b2))
cov(b1, b2) = σ̂^2 * [-x̄ / sum((x - x̄)^2)]
In matrix form: b = (X'X)^(-1) * X'y and cov(b) = σ^2 * (X'X)^(-1).
Example 2.7 - Least squares principle:
y = β1 + β2 * x + e, fitted line ŷ = b1 + b2 * x, N = 51.
a) Estimated error variance: σ̂^2 = 2.04672. Since σ̂^2 = SSE / (N - 2), the residual sum of squares is SSE = σ̂^2 * (N - 2) = 2.04672 * (51 - 2) = 100.29.
b) var(b2) = 0.00098, so se(b2) = sqrt(var(b2)) = sqrt(0.00098) = 0.0313. What is sum((x - x̄)^2)? From the definition var(b2) = σ̂^2 / sum((x - x̄)^2):
sum((x - x̄)^2) = σ̂^2 / var(b2) = 2.04672 / 0.00098 = 2088.5
c) b2 = 0.18
d) With x̄ = 69.139 and ȳ = 15.187, the intercept is b1 = ȳ - b2 * x̄ = 15.187 - 0.18 * 69.139 = 2.742. The estimated line is ŷ = 2.742 + 0.18x.
e) We want the sum of squares of x. We know that sum((x - x̄)^2) = sum(x^2) - N * x̄^2, so sum(x^2) = sum((x - x̄)^2) + N * x̄^2 = 2088.5 + 51 * 69.139^2 = 245879.
f) We want the residual for an observation with x = 58.3 and y = 12.274. The predicted value is ŷ = 2.742 + 0.18 * 58.3 = 13.236, so the residual is ê = y - ŷ = 12.274 - 13.236 = -0.962. The observed value lies below the fitted line.
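A quick arithmetic check of Example 2.7, using only the figures quoted above (a sketch in Python/numpy):

import numpy as np

N = 51
sigma2_hat = 2.04672        # estimated error variance
var_b2 = 0.00098
b2 = 0.18
x_bar, y_bar = 69.139, 15.187

SSE = sigma2_hat * (N - 2)                 # ~100.29
se_b2 = np.sqrt(var_b2)                    # ~0.0313
sum_sq_dev_x = sigma2_hat / var_b2         # ~2088.5
b1 = y_bar - b2 * x_bar                    # ~2.742
sum_x_sq = sum_sq_dev_x + N * x_bar**2     # ~245879
resid = 12.274 - (b1 + b2 * 58.3)          # ~ -0.962
print(SSE, se_b2, sum_sq_dev_x, b1, sum_x_sq, resid)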
Wikipedia: "A method of making decisions using data, whether from a controlled experiment or an observational study (not controlled)" Confidence intervals b2 ~ N(β2, σ^2 / sum(x-x)^2) Z = (b2 - β2) / sqrt(σ^2 / sum(x-x)^2) We can use the t-distribution, which replaces σ with σ: t = (b2 - β2) / sqrt(σ^2 / sum(x-x)^2) t-distribution: Wikipedia: "a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small" P(-tc <= t <= tc) = 1 - α b2 +/- t * α/2, N-2 * se(b2) 100 * (1 - α)% α = 0.05 (1-α) = 0.95 Hypothesis tests
Hypothesis tests
We can test whether β2 = C:
H0: β2 = C
with the two-sided alternative hypothesis
H1: β2 != C
Another test:
H0: β2 = C, H1: β2 > C
Or:
H0: β2 = C, H1: β2 < C
e.g. testing whether a certain company's stock has lower risk than the market.
Test statistic (the same for all three tests):
t = (b2 - C) / se(b2)
Then we must choose the significance level α, normally α = 0.05 (analogy: we accept that 5% of the innocent will be sent to jail, i.e. a 5% risk of rejecting a true null hypothesis).
We need a critical value, and the critical region can lie on the positive side, the negative side, or both. From the t-table:
Two-sided test: reject H0 if |t| is larger than the critical value t(α/2, N-2), i.e. the critical values are +/- t(0.025, N-2) when α = 0.05.
One-sided test with H1: β2 > C: reject H0 if t > tc = t(α, N-2).
One-sided test with H1: β2 < C: reject H0 if t < tc = -t(α, N-2).
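A sketch of the two-sided test, reusing the Example 2.7 figures and testing the illustrative hypothesis H0: β2 = 0 at α = 0.05:

from scipy.stats import t

b2, se_b2, N, C = 0.18, 0.0313, 51, 0.0
alpha = 0.05

t_obs = (b2 - C) / se_b2                        # observed test statistic, ~5.75
t_crit = t.ppf(1 - alpha / 2, df=N - 2)         # two-sided critical value, ~2.01
p_value = 2 * (1 - t.cdf(abs(t_obs), df=N - 2))
print(t_obs, t_crit, p_value)
print("reject H0" if abs(t_obs) > t_crit else "do not reject H0")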
The ordinary least squares (OLS) assumptions and their violations
Goal: use our observations to explain a simple relation.
Assumption 1
We take our observations on y and x and assume they are generated by the linear model: yi = β0 + β1 * xi + ei
Assumption 2
The most important assumption, together with assumption 1. Take the expected value of both sides:
E(yi) = E(β0 + β1 * xi + ei)
Using the rule E(aX) = aE(X) we can transform this into:
E(yi) = β0 + β1 * E(xi) + E(ei) = β0 + β1 * E(xi)
since E(ei) = 0 (this is assumption 2). In a single sample the error takes on some value, but over repeated samples it should average out to zero, so we should arrive at the same result. Example: if we draw repeated samples of the monetary policy rate from a group of countries, we should arrive at the same result. The most important reason we need assumption 2 is to ensure that the estimators of β0 and β1 are unbiased.
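A small Monte Carlo sketch of this unbiasedness argument (made-up parameter values; x is held fixed across the repeated samples, and the average of the slope estimates should come out close to the true β1):

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 1.0, 0.5            # hypothetical true parameters
x = rng.uniform(0, 10, size=50)    # x is fixed across repeated samples

estimates = []
for _ in range(5000):
    e = rng.normal(0, 1, size=50)  # errors with E(e) = 0
    y = beta0 + beta1 * x + e
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

print(np.mean(estimates))          # close to 0.5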
Which is better?
A trade-off between bias and variance. Which should we choose? Answer: number 2. A biased estimator with low variance is actually worse than a biased estimator with high variance, because a biased estimator with high variance at least has some chance of ending up near the correct value, while that probability shrinks when its variance is low. Unbiasedness is extremely important, and thus we should choose the least biased estimator.
Unbiasedness: if x is an estimator of μ and E(x) = μ, then x is unbiased. To show that least squares is unbiased, we want to show that E(b1) = β1. To show this we need assumptions 1, 2 and 5.
Assumption 3
If var(ei) is constant, that is, the variance of ei does not depend on xi, then ei is homoskedastic. Otherwise it is heteroskedastic, and the usual formula for the variance of the estimators is no longer correct. When might we expect heteroskedasticity to occur? Consider the following regression:
Food expenditure = β0 + β1 * Income + error
For low-income households, food expenditure is a large part of the budget. For high-income households it is a smaller share (they spend less of their money on food and more on other things), so their expenditure is more spread out. The error term will therefore look very different across households, with lower variance for low-income households and higher variance for high-income households: the error variance depends on the value of the explanatory variable. This is heteroskedasticity, which is problematic.
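A minimal sketch of this kind of heteroskedasticity (made-up numbers; the error spread is constructed to grow with income):

import numpy as np

rng = np.random.default_rng(2)
income = rng.uniform(200, 2000, size=500)
e = rng.normal(0, 0.05 * income)            # error spread grows with income
food = 40 + 0.25 * income + e

# Residual spread in the low- and high-income halves of the sample
resid = food - (40 + 0.25 * income)
low, high = income < 1100, income >= 1100
print(resid[low].std(), resid[high].std())  # high-income group shows larger spread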
Scenarios
Assumption 5
The assumption that xi is non-random is only for simplicity. It means that if we made repeated drawings of y1, ..., yN and x1, ..., xN from the population, the values of xi would be the same in each drawing. Example: if yi is the money supply at time i for a particular country, and xi is the policy interest rate of the ECB, then xi would be the same if we were to sample different countries. If xi is random, then we instead need to assume that cov(xi, ei) = 0; this is required to ensure unbiasedness. The assumption that xi takes on at least two values means that it cannot be perfectly collinear with the intercept. If xi takes only one value, the least squares estimates cannot be computed.
Static model - contains only contemporaneous values, that is, no lags.
Lags - variables that are lagged in time.
The error terms are autocorrelated if e_t is correlated with its own lagged values, e.g. cov(e_t, e_t-1) != 0. Example:
money demand_t = β0 + β1 * income_t + e_t
A monetary policy shock (r_t) is not included among the regressors, so it must end up in the error term:
e_t = f(r_t, r_t-1)
But e_t-1 = f(r_t-1, r_t-2), so e_t and e_t-1 both depend on r_t-1 and are therefore correlated: we have autocorrelation.
Both heteroskedasticity and autocorrelation are the result of omitted variables that end up in the error term because they are not included as separate regressors. That is, the dependent and independent variables must have similar trends; otherwise the leftover trend is captured by the error term. The countermeasures are the same for both: we can use GLS, or we can try to robustify the standard errors.
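A minimal sketch of how such autocorrelation shows up in the residuals (made-up data; the omitted variable r_t follows a persistent AR(1) process, so it carries over into the error term):

import numpy as np

rng = np.random.default_rng(3)
T = 300
income = rng.uniform(100, 200, size=T)

# Omitted variable r_t with persistence, so e_t depends on both r_t and r_{t-1}
r = np.zeros(T)
for t in range(1, T):
    r[t] = 0.8 * r[t - 1] + rng.normal()
money_demand = 10 + 0.5 * income + r

# Fit by least squares and inspect the residuals
b2 = np.sum((income - income.mean()) * (money_demand - money_demand.mean())) / np.sum((income - income.mean()) ** 2)
b1 = money_demand.mean() - b2 * income.mean()
e_hat = money_demand - (b1 + b2 * income)

# Correlation between e_t and e_{t-1}: clearly positive -> autocorrelation
print(np.corrcoef(e_hat[1:], e_hat[:-1])[0, 1])

In practice, "robustifying the standard errors" means keeping the OLS estimates but computing heteroskedasticity- and autocorrelation-consistent (Newey-West type) standard errors instead of the usual variance formula.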