Nonparametric regression describes a family of regression models where the focus is not on estimating the parameters of the model. As with other nonparametric approaches, this allows us to avoid specifying the distribution of the data *a priori*. Where a parametric statistical model is a family of probability distributions with a finite set of parameters, a nonparametric statistical model is a family of probability distributions with infinitely many parameters.

Instead of trying to estimate the mean of the response given a function $f$ that specifies the form of the response, such as

- in normal regression: $f(x_i) = \beta_0 + \beta_1x_{i,1} + \dots + \beta_px_{i,p}$
- in Poisson regression: $f(x_i) = \exp(\beta_0 + \beta_1x_{i,1} + \dots + \beta_px_{i,p})$

in nonparametric regression we try to learn the function $f$ itself. We learn $f$ by assuming it comes from some smooth family of functions. Note that we must enforce a "smoothness" constraint, or else our best choice for the function $f$ would trivially be $f(x_i) = y_i$. Different nonparametric regression methods approach smoothing $f$ in different ways.

Nonparametric approaches are more flexible and require fewer distributional assumptions, which can avoid bias. However, they are less efficient when the structure of the relationship is known, and they can be more difficult to interpret.

> [!Note]
> In regression, we are interested in estimating the mean for the set of possible observations $Y_i$ that share the same covariate class as the observation in the data $y_i$. These include all of the observations in the dataset that have the exact same values for each covariate AND all possible unseen values in the population. Recall that the errors are assumed to be normally distributed, and so you are seeking the central tendency for repeated samples of the population.

## kernel estimator

A kernel estimator is a simple weighted moving average of the response

$\hat f(x) = \frac{\frac1n \sum w_i Y_i}{\frac1n \sum w_i}$

where $w_i$ is the weight, for example

$w_i = \frac{1}{h}K\left(\frac{x-x_i}{h}\right)$

Here $h$ is the **bandwidth** or **smoothing** parameter, which controls the smoothness or bumpiness of the fit, and $K$ is the kernel.

The choice of $h$ is important, and the best choice may come down to judgement. The best $h$ is the one that best fits the data without including any unreasonable fluctuations (i.e., a fit that is much too bumpy to be plausible). There are also automatic selection procedures, but these also cannot be relied on without judgement.
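To make the weighted-average formula concrete, here is a minimal hand-rolled sketch of the estimator. The data frame `df` (with columns `x` and `y`) and the bandwidth value are hypothetical, and a Gaussian density (defined below) stands in for $K$:

```R
# A hand-rolled kernel estimator, for illustration only
# (assumes a data frame `df` with columns x and y)
kernel_estimate <- function(x0, df, h) {
  w <- dnorm((x0 - df$x) / h) / h   # w_i = (1/h) K((x - x_i)/h), Gaussian K
  sum(w * df$y) / sum(w)            # weighted moving average of the response
}

# Evaluate the fit on a grid and overlay it on the data
grid <- seq(min(df$x), max(df$x), length.out = 200)
fit <- sapply(grid, kernel_estimate, df = df, h = 0.1)
plot(y ~ x, data = df, pch = 16, cex = 0.8)
lines(grid, fit, lwd = 3)
```

In practice you would use `ksmooth` (shown further below) rather than rolling your own, but the sketch makes clear that the fit at each point is just a locally weighted mean.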
### kernel

A kernel is a nonnegative, real-valued function $K$ that is symmetric and normalized (i.e., integrates to one). Formally,

- $K(x) = K(-x)$ for all values of $x$ (symmetry)
- $\int K(x)dx = 1$ (normalization)

> [!NOTE]
> These are also qualities of the [[probability density function]] for many distributions, and in fact these densities are often used as kernels!

### Commonly used kernels

#### Uniform

The uniform kernel gives a very straightforward weighted average of all values close to $x$.

$K(z) = \frac{1}{2}, \ -1 \le z \le 1, \quad \text{where } z = \frac{x - x_0}{h}$

This is also known as a "box" kernel, as it simply provides the same weight ($w_i = \frac{1}{2h}$) to all points within a distance $h$ from $x_0$. Every point that falls in the box gets included in the estimate.

#### Gaussian

The Gaussian kernel does provide some weight to all values, which can be computationally inefficient, but the weight falls off fairly quickly with increased distance from $x$.

$K(z) = \frac{1}{\sqrt{2 \pi}}\exp\left(-\frac{z^2}{2}\right)$

#### Epanechnikov

The Epanechnikov kernel can be shown to be optimal, but kernel estimation is not very sensitive to the choice of kernel.

$K(z) = \frac34(1 - z^2), \ -1 \le z \le 1$

In [[R]] use `ksmooth` with a kernel of either `box` (uniform) or `normal` (Gaussian). Test bandwidths from `0.01` to `1` and plot to find the best fit.

```R
library(scales)  # for alpha()

# Fit a kernel smoother on the training data (example kernel and bandwidth)
k <- ksmooth(train_df$x, train_df$y, kernel = "normal", bandwidth = 0.5)

# Predictions at the test x values
# Note that ksmooth() sorts x.points, so predictions are sorted by x as well
kernel_pred <- ksmooth(train_df$x, train_df$y, kernel = "normal",
                       bandwidth = 0.5, x.points = test_df$x)$y

# Sort the test response the same way so it lines up with the predictions
sorted_y <- test_df[order(test_df$x), "y"]

# Calculate MSPE
mspe <- mean((kernel_pred - sorted_y)^2, na.rm = TRUE)

# Plot the data and the fitted curve
plot(y ~ x, data = train_df, pch = 16, cex = 0.8, col = alpha("grey", 0.9))
lines(k, lwd = 3)
```
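The bandwidth search suggested above can be automated with a simple loop. This is only a sketch, reusing the hypothetical `train_df` and `test_df` from the block above:

```R
# Compute the MSPE for a grid of bandwidths (sketch)
bandwidths <- seq(0.01, 1, by = 0.01)
sorted_test_y <- test_df[order(test_df$x), "y"]

mspe_by_h <- sapply(bandwidths, function(h) {
  pred <- ksmooth(train_df$x, train_df$y, kernel = "normal",
                  bandwidth = h, x.points = test_df$x)$y
  mean((pred - sorted_test_y)^2, na.rm = TRUE)
})

# Inspect MSPE as a function of bandwidth and pick a candidate
plot(bandwidths, mspe_by_h, type = "l")
bandwidths[which.min(mspe_by_h)]
```

The numerically best bandwidth should still be checked visually against the fitted curve, as noted above.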
## smoothing splines

A **smoothing spline** is a spline designed to balance fit with smoothness. Smoothing splines are a nonparametric regression method that minimizes the mean squared error plus a penalty term for "roughness".

$\frac1n \sum_{i=1}^n \Big (Y_i - f(x_i) \Big )^2 + \lambda \int[f''(x)]^2 dx$

where $\lambda$ is a smoothing parameter: larger values of $\lambda$ weight the roughness penalty more heavily (smoother fits), while smaller values weight it less (bumpier fits).

Consider that the naive way to minimize the mean squared error by selecting a function $f$ is to simply select the function $f(x_i) = y_i$. In this case, the error is zero. However, by including the second derivative of the function $f$, we penalize functions for increasing their "acceleration" towards each data point. This helps us avoid overfitting to noise in the data.

The class of functions that minimizes the above equation is called **cubic smoothing splines**. Because we know the form of the solution, the estimation problem is reduced to a parametric problem of estimating the coefficients of the piecewise polynomials.

In [[R]] use the function `smooth.spline`, where the parameter `spar` is a monotone function of the smoothing parameter $\lambda$. By leaving `spar` unspecified, R will choose a smoothing parameter value using a [[cross validation]] procedure (generalized cross-validation by default, or "leave one out" cross-validation when `cv = TRUE`).

```R
# Fit a smoothing spline; leave spar unspecified to let R choose it by
# cross-validation, or set it manually (e.g., spar = 0.7)
s <- smooth.spline(data$x, data$y)

# Predict (provide x values explicitly and extract y values)
predict(s, data$x)$y

# Plot the data and the fitted spline
plot(y ~ x, data = data)
lines(s, lwd = 3)
```

### spline

A **spline** is a piecewise function where each segment is a polynomial. Splines are meant to be continuous and to have continuous derivatives. A **cubic spline** is a spline where the segment polynomials are each of degree three.

## loess

Locally estimated scatterplot smoothing (loess) is a nonparametric regression approach that assumes the function $f$ can be approximated by a weighted [[Taylor expansion]] evaluated around a point $x_0$.

$f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x - x_0)^3 + \dots + \frac{f^{(p)}(x_0)}{p!}(x - x_0)^p$

Note that this is simply a polynomial with coefficients equal to scaled derivatives of the function $f$. Thus, the coefficients can be estimated using [[ordinary least squares]] by minimizing

$MSE_{Taylor}^w = \frac1n \sum_{i=1}^n w_i(x_i - x_0)\Big (Y_i - \sum_{j=0}^p \beta_j(x_i-x_0)^j \Big)^2$

where

$\beta_j = \frac{f^{(j)}(x_0)}{j!}, \quad j = 0, \dots, p$

and $w_i$ is a weight function. A common weight function is the tricube weight

$w_i(z) = \begin{cases}(1 - |z|^3)^3 && \text{if } |z| < 1 \\ 0 && \text{if } |z| \ge 1 \end{cases}$

When $p = 0$, loess is the same as the [[kernel estimator]].

The advantages of loess include

1. flexible fit
2. relatively simple to implement
3. possible to quantify uncertainty (including standard errors, confidence intervals and prediction intervals)

Disadvantages include

1. can be computationally expensive
2. interpretation can be more difficult
3. requires relatively large, densely sampled data

```R
# Fit a loess model (by default a local quadratic fit, degree = 2)
l <- loess(y ~ x, data = data)
```

To calculate confidence intervals for a loess fit, predict on new data and set `se = TRUE` to also obtain the standard errors associated with the model. For example, `p$se.fit` and `p$df` return the standard error of the fitted mean and the estimated degrees of freedom of the loess model, which we can use to construct t-intervals.

```R
p <- predict(l, newdata, se = TRUE)

# 95% confidence intervals for the mean response
ce <- cbind(
  p$fit - qt(0.975, p$df) * p$se.fit,
  p$fit + qt(0.975, p$df) * p$se.fit
)

# Print predictions
p$fit

# Print confidence intervals
ce
```

However, recall that prediction intervals are different from confidence intervals and require a different standard error calculation. The `predict()` function does not calculate prediction standard errors.
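One rough approximation, reusing the hypothetical `l` and `newdata` from above, is to combine the standard error of the fitted mean with the residual standard deviation that `predict()` reports as `residual.scale`. This is a sketch of an approximate interval, not an exact procedure:

```R
p <- predict(l, newdata, se = TRUE)

# Approximate prediction standard error: uncertainty in the fitted mean
# (se.fit) plus the residual variability (residual.scale), added in quadrature
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)

# Approximate 95% prediction intervals
pi <- cbind(
  p$fit - qt(0.975, p$df) * se_pred,
  p$fit + qt(0.975, p$df) * se_pred
)
pi
```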