Closed Form Solutions

In mathematics and statistics, many well-known results describe the distribution of a statistic. The Central Limit Theorem is one of them. With simulation, we do not need complicated mathematics to approximate these distributions. This is especially helpful when finding a closed form solution is very difficult, if not impossible. A closed form solution to a mathematical/statistical distribution problem means that you can derive the distribution exactly through equations. Real world data can be very complicated, and an output may depend on several inputs that each have their own distribution. Simulation can reveal an approximation of these output distributions.

The next two sections work through some examples of this.

Central Limit Theorem

Assume you do not know the Central Limit Theorem, but want to study and understand the sampling distribution of sample means. You take samples of size 10, 50, and 100 from the following three population distributions and calculate the sample means repeatedly:

  1. Normal distribution

  2. Uniform distribution

  3. Exponential distribution

Let’s use each of our software packages to simulate the sampling distribution of the sample mean for each of the above population distributions and sample sizes!
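As one possible illustration, here is a minimal Python sketch of this simulation using NumPy. The population parameters (standard Normal, Uniform on 0 to 1, Exponential with mean 1), the 10,000 repetitions, the seed, and the helper names `simulate_sample_means` and `skewness` are assumptions made for this example; the text above does not specify them.

```python
import numpy as np

rng = np.random.default_rng(seed=12345)  # seed chosen arbitrarily for reproducibility

# Assumed population distributions; the parameters are illustrative only.
populations = {
    "normal": lambda size: rng.normal(loc=0.0, scale=1.0, size=size),
    "uniform": lambda size: rng.uniform(low=0.0, high=1.0, size=size),
    "exponential": lambda size: rng.exponential(scale=1.0, size=size),
}


def simulate_sample_means(draw, sample_size, n_sims=10_000):
    """Draw n_sims samples of the given size and return their sample means."""
    samples = draw((n_sims, sample_size))  # one row per simulated sample
    return samples.mean(axis=1)


def skewness(values):
    """Sample skewness: roughly 0 for a symmetric, Normal-looking shape."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


results = {}
for name, draw in populations.items():
    for n in (10, 50, 100):
        means = simulate_sample_means(draw, n)
        results[(name, n)] = means
        print(
            f"{name:12s} n={n:3d}  mean={means.mean():.4f}  "
            f"sd={means.std():.4f}  skewness={skewness(means):.3f}"
        )
```

A histogram of each entry in `results` shows the shape directly. In the printed summary, the standard deviation of the sample means should shrink as the sample size grows, and the skewness for the Exponential population should move toward 0.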

Omitted Variable Bias

What if you leave an important variable out of a linear regression? Statistical theory tells us that this can bias the coefficients in the model and change their variability, depending on whether the omitted variable is correlated with a variable that remains in the model. Simulation can help us show this, as well as determine how bad these problems could get under different circumstances.

Let’s simulate the following regression model:

\[ Y = -13 + 1.21X_1 + 3.45X_2 + \varepsilon \]

Assume that the errors, \(\varepsilon\), in the model above are normally distributed with a mean of 0 and standard deviation of 1.5. Also assume that the predictors \(X_1\) and \(X_2\) are normally distributed with mean of 0 and standard deviation of 1. Build 10,000 linear regressions (each of sample size 50) and record the coefficients from the regression models for each of these simulations:

  1. Model with both \(X_1\) and \(X_2\) in the model

  2. Model with only \(X_1\) in the model, omitting \(X_2\), when the correlation between \(X_1\) and \(X_2\) is 0.6

  3. Model with only \(X_1\) in the model, omitting \(X_2\), when the correlation between \(X_1\) and \(X_2\) is 0

Let’s see the simulations in each of our software packages!
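A minimal Python sketch of this simulation with NumPy might look like the following. The function name `simulate_ovb`, the seed, and the way the correlated predictors are built (\(X_2 = \rho X_1 + \sqrt{1-\rho^2}\,Z\) with \(Z\) an independent standard Normal) are assumptions made for this example. The full model is fit under both correlation settings because the text does not say which one to use for it.

```python
import numpy as np


def simulate_ovb(rho, n_sims=10_000, n=50, seed=12345):
    """Fit the full model (X1 and X2) and the short model (X1 only) n_sims times.

    Returns two arrays of coefficients:
    full  has columns (intercept, X1, X2); short has columns (intercept, X1).
    """
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal((n_sims, n))
    z = rng.standard_normal((n_sims, n))
    # Assumed construction: gives X2 variance 1 and correlation rho with X1.
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * z
    eps = rng.normal(loc=0.0, scale=1.5, size=(n_sims, n))
    y = -13 + 1.21 * x1 + 3.45 * x2 + eps

    full = np.empty((n_sims, 3))
    short = np.empty((n_sims, 2))
    ones = np.ones(n)
    for i in range(n_sims):
        design_full = np.column_stack([ones, x1[i], x2[i]])
        design_short = design_full[:, :2]  # drop the X2 column
        full[i] = np.linalg.lstsq(design_full, y[i], rcond=None)[0]
        short[i] = np.linalg.lstsq(design_short, y[i], rcond=None)[0]
    return full, short


full_corr, short_corr = simulate_ovb(rho=0.6)
full_uncorr, short_uncorr = simulate_ovb(rho=0.0)

for label, coefs in [
    ("both X1 and X2, rho=0.6", full_corr),
    ("X1 only, rho=0.6", short_corr),
    ("both X1 and X2, rho=0", full_uncorr),
    ("X1 only, rho=0", short_uncorr),
]:
    slope = coefs[:, 1]  # coefficient on X1
    print(f"{label:26s} mean={slope.mean():.3f}  sd={slope.std():.3f}")
```

Comparing the mean of the \(X_1\) coefficient with its true value of 1.21 shows the bias, and comparing the standard deviations shows how much more variable the estimate becomes when \(X_2\) is left out.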

Omitting important variables is a problem. If the omitted variables are correlated with inputs that remain in the model, then the coefficient estimates for those inputs are biased. However, even if they are not correlated and the estimates are therefore unbiased, the estimates are more variable, so we make many more mistakes about the significance of the inputs in our model.
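The size of the bias in the correlated case can be checked against the standard omitted variable bias result. As a sketch, assuming the setup above (both predictors have variance 1 and correlation \(\rho\)), the coefficient on \(X_1\) in the model without \(X_2\) is centered at

\[ \beta_1 + \beta_2 \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)} = 1.21 + 3.45\rho \]

With \(\rho = 0.6\) this gives \(1.21 + 3.45(0.6) = 3.28\), well above the true value of 1.21. With \(\rho = 0\) the extra term vanishes and the estimate is centered at 1.21.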