
1. Basics

This page describes some (but not yet all) basic terms and concepts in statistics and regression analysis.

Random variable

A random variable (or stochastic variable) is a variable that can take a random value from a set of possible values. Each value has an associated probability, which determines how likely that value is to occur.

The two most common types of random variables are:
  • Discrete random variables: Can take only a countable number of distinct values.
                  Examples: the roll of a die (1, 2, 3, 4, 5 or 6), the number of children in a family.
  • Continuous random variables: Can take any value within a continuous range.
                  Examples: the interest rate, stock market indexes, household income.

Probability distribution

A probability distribution is a mathematical function describing the possible values of a random variable and their associated probabilities. The most common type of probability distribution is the normal distribution.

Normal distribution

The normal distribution (or bell curve) is the most common probability distribution. It has a distinctive, symmetric bell shape centered on the mean, which makes it easy to recognize.


[Figure: Normal distribution of values in a random variable.]

Example
Consider an experiment of tossing two dice at the same time and recording the sum. The distribution of the sums already has the characteristic bell shape shown above (strictly speaking it is triangular, peaking at 7), and it gets closer to a true normal distribution as more dice are added per toss.

This is related to the central limit theorem (CLT): the sum (or average) of a large number of independent random variables tends toward a normal distribution, regardless of how the individual variables are distributed.
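To see this in action, here is a minimal Python sketch (an illustration added here, not part of the original example) that simulates the two-dice toss many times and prints a rough text histogram of the sums:

import random

# Roll two dice 100,000 times and record the sum of each toss.
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(100_000)]

# Print a rough text histogram: the counts form a bell-like shape,
# peaking at 7 and tapering off towards 2 and 12.
for total in range(2, 13):
    count = rolls.count(total)
    print(f"{total:2d}: {'#' * (count // 500)}")

Adding more dice to each toss would bring the distribution of the sums even closer to a true normal distribution.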

Expected value (population mean)

The expected value or population mean (μ) is the average value we would expect from a random variable if we repeated an experiment an infinite number of times. In the long run it coincides with the ordinary arithmetic mean (the sum of all values divided by the number of values), but it is calculated from probabilities rather than from observed data.

Definition
The expected value is the sum of all possible values for a random variable, each value multiplied by its probability of occurrence.
E[X] = x1p1 + x2p2 + ... + xnpn
Where:
E[X] is the expected value of random variable X
x1 to xn are all the possible values that X can take
p1 to pn are the probabilities of those values occurring

Example
The expected value of rolling a fair six-sided die is:
E[X] = 1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6 = 3.5
There's a 1/6 chance that the die lands on each side. If we roll the die an infinite number of times, the average value approaches 3.5.
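The same calculation can be written as a few lines of Python (a minimal sketch using only the standard library):

# Expected value of a fair six-sided die:
# the sum of each possible value times its probability.
values = [1, 2, 3, 4, 5, 6]
expected = sum(x * (1 / 6) for x in values)
print(expected)  # 3.5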

Variance

Variance is a measure of the spread of values in a random variable. The larger the variance, the greater the spread of values. For example, the pair of numbers 0 and 40 has a larger variance than the pair 10 and 30, because the values are further apart. A variance of zero means that all values are identical.

Definition
The variance of a random variable is the expected value of the squared deviation from the mean:
Var(X) = E[(X - μ)^2]
Where:
Var(X) is the variance of the random variable X
μ is the mean, which is the same as the expected value of X, i.e. μ = E[X]

Example (simplified)
Let's say we have eight data points with the values 2, 4, 4, 4, 5, 5, 7 and 9.
The mean (μ) of these values is:
μ = (2+4+4+4+5+5+7+9) / 8 = 5
For each value, we take its deviation from the mean and square it:
(2-5)^2 = 9
(4-5)^2 = 1
(4-5)^2 = 1
(4-5)^2 = 1
(5-5)^2 = 0
(5-5)^2 = 0
(7-5)^2 = 4
(9-5)^2 = 16
We then get the mean (expected value) of these to get the variance:
Var(X) = (9+1+1+1+0+0+4+16) / 8 = 32 / 8 = 4
The standard deviation (σ) is the square root of the variance:
σ = sqrt(4) = 2
This means that the variance of X is the standard deviation (σ) of X squared:
Var(X) = σ^2
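As a minimal Python sketch of the same calculation, using the eight data points from the example above:

import math

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population mean: sum of all values divided by the number of values.
mu = sum(data) / len(data)  # 5.0

# Population variance: mean of the squared deviations from the mean.
var = sum((x - mu) ** 2 for x in data) / len(data)  # 4.0

# Standard deviation: square root of the variance.
sigma = math.sqrt(var)  # 2.0

print(mu, var, sigma)

Note that this is the population variance (dividing by the number of values n); the sample variance divides by n - 1 instead.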

Standard deviation

The standard deviation (σ) is another way of expressing the variance, calculated as the square root of the variance.

Definition
The square root of the variance:
σ = sqrt(Var(X))
Or equivalently:
σ^2 = Var(X)
Where:
σ is the standard deviation of the random variable X
Var(X) is the variance of X

Covariance

Covariance is a measure of the relationship between two random variables: how much the two variables vary together, and in what direction. The covariance is positive when the variables tend to move in the same direction, and negative when they tend to move in opposite directions. Zero covariance means there is no linear relationship between them.

Covariance is measured in the units of the variables themselves, which makes it hard to compare across pairs of variables. Correlation fixes this by standardizing the value, giving us a fixed range of -1 to 1.

Definition
Cov[X,Y] = E[(X - E[X]) * (Y - E[Y])] = E[XY] - E[X] * E[Y]
If we take the covariance of a variable with itself, this simply equals its variance:
Cov[X,X] = Var[X]
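As a minimal Python sketch (the data points here are made up purely for illustration):

# Population covariance: E[(X - E[X]) * (Y - E[Y])]
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
print(cov)  # 1.6 (positive: x and y tend to move in the same direction)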

Correlation

Correlation is a measure of the relationship between two random variables: how much the two variables vary together, and in what direction. It is the same idea as covariance, except that it uses a standardized range of values from -1 to 1, while covariance is measured in the units of the variables. A value of 1 means a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 means no linear relationship at all.

Note that correlation does not imply causation. Just because there is a statistical relationship between two things does not mean that one causes the other, only that they tend to vary together. This also holds for covariance.

Definition
Corr(X,Y) = Cov[X,Y] / sqrt(Var[X] * Var[Y])
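And the corresponding Python sketch, reusing the made-up data points from the covariance example above:

import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)

# Population covariance and variances.
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
var_x = sum((a - mean_x) ** 2 for a in x) / len(x)
var_y = sum((b - mean_y) ** 2 for b in y) / len(y)

# Correlation: covariance scaled into the range -1 to 1.
corr = cov / math.sqrt(var_x * var_y)
print(corr)  # roughly 0.85, a strong positive relationship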

Data types

There are different types of data, which require different regression methods:
  • Cross-sectional data: Data on many subjects at a single point in time. The subjects could be individuals, firms, countries, regions or something else. Example: the income of households in Sweden in 2009.
  • Time series data: Data on a single subject over time. Examples: the daily profit & loss of a specific company over time, or the inflation rate of a country over many years.
  • Panel data: Data on many subjects over time. A mix between cross-sectional and time series data. Panel data is said to be "multi-dimensional" while the others are "one-dimensional". Examples: the income of many households in Sweden over time, the daily profit of multiple companies over time.