The normal (or Gaussian) distribution is probably the most important and widely used distribution in probability. Many populations have distributions that can be modeled very well by a normal distribution.
In general, whenever data arise as the sum of many independent random contributions (multiple random variables, possibly of unknown distribution), the result is approximately normal (e.g., height is the sum of many genetic and environmental factors; fruit production is the sum of effects from shading, soils, and moisture). This is also the idea behind the [[Central Limit Theorem]].
Properties of the normal distribution include:
- $f(x)$ is symmetric about the line $x=\mu$
- $f(x) > 0$ and $\int\limits_{-\infty}^{\infty}f(x)dx=1$
- $\mu+\sigma$ and $\mu - \sigma$ are the inflection points for $f(x)$
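
The properties above can be checked numerically with `dnorm()`. This is an illustrative sketch: the values $\mu = 2$, $\sigma = 1.5$ are arbitrary, and `d2` is a made-up helper for a finite-difference second derivative.

```R
mu <- 2; sigma <- 1.5  # arbitrary example values

# Symmetry about x = mu: f(mu - d) equals f(mu + d)
all.equal(dnorm(mu - 0.7, mu, sigma), dnorm(mu + 0.7, mu, sigma))

# Numerical second derivative of the density
d2 <- function(x, h = 1e-4) {
  (dnorm(x - h, mu, sigma) - 2 * dnorm(x, mu, sigma) + dnorm(x + h, mu, sigma)) / h^2
}

# The sign change at x = mu + sigma indicates an inflection point there
d2(mu + sigma - 0.1)  # negative (concave down)
d2(mu + sigma + 0.1)  # positive (concave up)
```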
## Notation
$X \sim N(\mu, \sigma^2)$
## Probability Density Function
$f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \Big \{{-\frac{(x-\mu)^2}{2\sigma^2} } \Big \}$
Note that the core function is $e^{-x^2}$; $\sigma$ controls the spread while $\mu$ controls the center. The scalar out front ensures the density integrates to $1$. For an intuition on where $\pi$ comes from, see [[Herschel-Maxwell]].
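
The role of the normalizing constant can be verified numerically: integrating the unnormalized kernel over the real line recovers $\sqrt{2\pi\sigma^2}$. A quick sketch (`kernel` is a made-up name; $\mu = 1$, $\sigma = 2$ are arbitrary):

```R
mu <- 1; sigma <- 2  # arbitrary example values

# Unnormalized kernel of the normal pdf
kernel <- function(x) exp(-(x - mu)^2 / (2 * sigma^2))

# The integral of the kernel equals the normalizing constant
integrate(kernel, -Inf, Inf)$value
sqrt(2 * pi * sigma^2)  # same value, approximately 5.013
```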
## Expected Value
$E(X) = \mu$
## Variance
$V(X) = \sigma^2$
where $\sigma$ is the standard deviation and is given when specifying the distribution.
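
These moments can be confirmed with a quick Monte Carlo sketch (illustrative values: $\mu = 5$, $\sigma^2 = 4$, so `sd = 2` is passed to `rnorm`):

```R
set.seed(42)

# One million draws from N(mu = 5, sigma^2 = 4); note sd = 2, not 4
x <- rnorm(1e6, mean = 5, sd = 2)

mean(x)  # close to mu = 5
var(x)   # close to sigma^2 = 4
```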
## Joint PDF
For an iid sample $X_1, \dots, X_n$ from $N(\mu, \sigma^2)$, the joint pdf is
$
f(\vec x) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp \Big \{ -\frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \Big \}
$
As a function of $\mu$, this is proportional to the density of the sample mean $\bar X \sim N\big(\mu, \frac{\sigma^2}{n}\big)$:
$
f(\bar x) = \frac{1}{\sqrt{2 \pi \frac{\sigma^2}{n}}} \exp \Big \{ -\frac{n}{2 \sigma^2} (\bar x - \mu)^2 \Big \}
$
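
The appearance of $\bar x$ reflects that the sample mean of $n$ iid normal draws is itself normal with standard deviation $\sigma/\sqrt{n}$. A simulation sketch (illustrative values: $\mu = 0$, $\sigma = 1$, $n = 25$):

```R
set.seed(1)
n <- 25

# Simulate 10,000 sample means, each from n iid N(0, 1) draws
xbar <- replicate(1e4, mean(rnorm(n, mean = 0, sd = 1)))

sd(xbar)  # close to sigma / sqrt(n) = 1 / 5 = 0.2
```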
## Multivariate form
The multivariate form arises for example in the context of [[base/Linear Regression/linear regression|linear regression]].
$
f(\vec \beta) = \frac{1}{\sqrt{(2 \pi)^p \, |\Sigma_p|}} \exp \Big \{ -\frac12 (\vec \beta - \vec \mu)^T \ \Sigma_p^{-1} (\vec \beta - \vec \mu) \Big \}
$
for $p$ parameters (the dimension of $\vec \beta$), where $\Sigma_p$ is the $p \times p$ covariance matrix and $|\Sigma_p|$ its determinant.
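
The formula translates directly into base R matrix operations. A minimal sketch (`dmvnorm_manual` is a made-up name, not from any package); with $p = 1$ it reduces to the univariate `dnorm`:

```R
# Multivariate normal density from the formula above
dmvnorm_manual <- function(beta, mu_vec, Sigma) {
  p <- length(beta)
  diff <- beta - mu_vec
  norm_const <- 1 / sqrt((2 * pi)^p * det(Sigma))
  norm_const * exp(-0.5 * t(diff) %*% solve(Sigma) %*% diff)
}

# With p = 1 and Sigma = sigma^2, this matches dnorm(x, mu, sigma)
dmvnorm_manual(0.3, 0, matrix(1.2))
dnorm(0.3, 0, sqrt(1.2))  # same value
```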
## Bivariate form
Assume that $(X,Y)$ follow a bivariate normal distribution with $E(X) = E(Y) = 0$ and $Var(X) = Var(Y) = 1$. The pdf of $(X,Y)$ is given as
$
f(x,y|\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right)
$
where $\rho$ is the correlation coefficient for $X$ and $Y$.
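
As a sanity check, this standardized bivariate density agrees with the general multivariate form using the correlation matrix $\Sigma = \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix}$. An illustrative sketch (`dbvnorm` is a made-up name; $\rho = 0.6$ and the evaluation point are arbitrary):

```R
# Bivariate normal density from the formula above
dbvnorm <- function(x, y, rho) {
  1 / (2 * pi * sqrt(1 - rho^2)) *
    exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2)))
}

rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
z <- c(0.5, -1)

# General multivariate form with p = 2
general <- 1 / sqrt((2 * pi)^2 * det(Sigma)) *
  exp(-0.5 * t(z) %*% solve(Sigma) %*% z)

dbvnorm(0.5, -1, rho)  # same value as `general`
```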
## R
```R
# Probability density function, implemented manually.
# Note the parentheses around (2 * sigma^2): without them, the
# exponent would be multiplied by sigma^2 rather than divided by it.
pdf.norm <- function(x, mu, sigma){
  f.x <- (1 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
  return(f.x)
}

# Built-in functions: parameterize with the standard deviation, not the variance
# Probability density function
prob <- dnorm(x, mu, sqrt(var))
# Cumulative distribution function
cum_prob <- pnorm(x, mu, sqrt(var))
# Quantile function (first argument is a probability p, not a value x)
quantile_val <- qnorm(p, mu, sqrt(var))
# Random number generation (first argument is the sample size n)
random_values <- rnorm(n, mu, sqrt(var))
```
> [!warning]
> The normal distribution is parameterized with the standard deviation $\sigma$, not the variance $\sigma^2$, in R!
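
A quick illustration of this parameterization: for $X \sim N(0, \sigma^2 = 4)$, pass `sd = 2` rather than `4`.

```R
# P(X <= 2) for X ~ N(0, 4) equals P(Z <= 1) for standard normal Z
pnorm(2, mean = 0, sd = 2)      # approximately 0.841

# 97.5th percentile scales with sd: 1.96 * 2
qnorm(0.975, mean = 0, sd = 2)  # approximately 3.92
```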