
Polynomial Regression in R

At first glance, polynomial fits would appear to involve nonlinear regression. In fact, polynomial fits are just linear fits involving predictors of the form x, x^2, …, x^d.

Example 1: Polynomial fit

First, let's generate a data set:

set.seed(16)
x <- 0:50
y <- 2.3 - 15.1*x + 1.2*x^2 + rnorm(length(x), 20, 50)
plot(x, y)

(rnorm just generates random, normally distributed noise; here the mean is 20 and the standard deviation is 50, so the noise shifts the effective intercept of the data-generating process from 2.3 to 2.3 + 20 = 22.3.)

Now let's fit a model using linear regression. Note the I() wrapper: inside a model formula, ^ has a special meaning (factor crossing), so I(x^2) tells R to treat x^2 as arithmetic squaring:

fit <- lm(y ~ 1 + x + I(x^2))
points(x, predict(fit), type="l")

[Figure: Polynomial fit]
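
As an aside, poly() gives an equivalent way to specify the same model; with raw = TRUE it reproduces the coefficients above (the default is orthogonal polynomials, which fit the same curve but report different coefficients):

fit2 <- lm(y ~ poly(x, 2, raw = TRUE))  # same fit as y ~ 1 + x + I(x^2)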

Here's a summary of the fit:

> summary(fit)

Call:
lm(formula = y ~ 1 + x + I(x^2))

Residuals:
    Min      1Q  Median      3Q     Max
-92.173 -28.968   3.673  24.953  97.269

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.84216   18.36178   1.843   0.0715 .
x           -15.28705    1.69836  -9.001 7.07e-12 ***
I(x^2)        1.20126    0.03285  36.569  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 45.44 on 48 degrees of freedom
Multiple R-squared:  0.996,    Adjusted R-squared:  0.9959
F-statistic:  6034 on 2 and 48 DF,  p-value: < 2.2e-16
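
As a sanity check, recall the data-generating coefficients: slope -15.1, quadratic term 1.2, and intercept 2.3 + 20 = 22.3 once we fold the noise mean into the constant. confint() reports 95% confidence intervals, which here should bracket all three:

confint(fit)  # 95% confidence intervals for the three coefficients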

This example shows that we can apply linear regression outside of cases involving linear data-generating mechanisms. Indeed, the word linear in linear regression doesn't refer to the shape of the fitted curve or to the regressors themselves. It refers to the fact that the prediction is a linear combination of the regressors (equivalently, the model is linear in its coefficients), even when the regressors are nonlinear transformations of x.
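
To make that concrete, here's a minimal sketch showing that the same coefficients fall out of the ordinary least squares normal equations when we treat 1, x, and x^2 as three ordinary columns of a design matrix (X and beta are just illustrative names, reusing x and y from Example 1):

X <- cbind(1, x, x^2)                  # design matrix: intercept, linear, quadratic columns
beta <- solve(t(X) %*% X, t(X) %*% y)  # least squares solution to the normal equations
beta                                   # matches coef(fit)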

Example 2: Fractional exponents

We can even generalize this beyond polynomials to include non-polynomial expressions involving (say) fractional exponents.

set.seed(16)
x <- 0:100
y <- 5 + 5*x^0.5 - 1.2*x^0.7 + rnorm(length(x), 0, 1)
plot(x, y)
fit <- lm(y ~ 1 + I(x^0.5) + I(x^0.7))
points(x, predict(fit), type="l")

[Figure: Fractional power fit]
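
Since we know the true mean function in this simulation, we can overlay it on the same plot as a quick visual check:

lines(x, 5 + 5*x^0.5 - 1.2*x^0.7, lty = 2)  # true mean function, dashed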

Summary:

> summary(fit)

Call:
lm(formula = y ~ 1 + I(x^0.5) + I(x^0.7))

Residuals:
    Min      1Q  Median      3Q     Max
-2.3337 -0.7138  0.0494  0.7518  3.2104

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.3829     0.6394   8.418 3.23e-13 ***
I(x^0.5)      4.8859     0.4149  11.775  < 2e-16 ***
I(x^0.7)     -1.1694     0.1477  -7.917 3.81e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.024 on 98 degrees of freedom
Multiple R-squared:  0.9396,    Adjusted R-squared:  0.9384
F-statistic: 762.6 on 2 and 98 DF,  p-value: < 2.2e-16

Looking ahead

The models above are either polynomial or polynomial-like, and so linear regression works. There are, however, cases where instead of models whose unknown parameters are all coefficients, we have models in which an exponent is itself a parameter to be estimated. An example would be

|r| ~ a * L^b

where a and b are the parameters to be estimated. For that we'll need actual nonlinear regression, which I'll write about in a follow-up post.
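
As a preview, base R's nls() can fit this kind of model. Here's a minimal sketch on simulated data, with made-up true values a = 2 and b = 0.8; the follow-up post will go into the details:

set.seed(16)
L <- 1:50
r <- 2 * L^0.8 + rnorm(length(L), 0, 1)               # simulate r = a * L^b plus noise
nfit <- nls(r ~ a * L^b, start = list(a = 1, b = 1))  # nonlinear least squares
coef(nfit)                                            # estimates of a and b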
