During a couple of the lectures in the

Machine Learning MOOC offered by Prof.

Andrew Ng of Stanford University, I came across two statements about

ordinary least squares linear regression (henceforth OLS) that surprised me. Given that I taught regression for years, I was surprised that I could be surprised (meta-surprised?), but these two facts are not ones you're likely to see in a statistics course.

The first is not too shocking in hindsight. Let $X$ be an $m \times n$ matrix containing $m$ observations of $n$ predictors (including a column of ones if you want a constant term in the model) and let $y$ be an $m \times 1$ vector of observations of the dependent variable. A linear model looks like $$y = Xb + e$$where $b$ is the estimated coefficient vector ($n \times 1$) and $e$ is the vector of residuals ($m \times 1$). By definition OLS computes $b$ to minimize the sum of squared residuals $$J(b) = e^\prime e.$$The so-called

normal equations are derived by computing the gradient of $J$ and setting it equal to zero. Assuming $X$ has full column rank, this yields $$b = (X^\prime X)^{-1}X^\prime y.$$
Surprising (to me) fact #1: when dealing with large data sets, you may not want to use the nice, closed form, normal equations to compute $b$. The alternative is an iterative approach, using a nonlinear programming algorithm such as gradient descent, or the conjugate gradient method, or some other alternative. The reason is that $(X^\prime X)^{-1}$ may be computationally expensive to compute when $m$ and particularly $n$ are large. In practice, good statistical software would not invert the matrix anyway; it would do some sort of decomposition (LU, decomposition, whatever) both for efficiency and to head off potential rounding problems. Apparently, though, the comparatively small amount of arithmetic required per iteration of gradient descent (mainly computing $X b$ and $X^\prime e$, which are $O(mn)$ operations) offsets the cost of running a bunch of iterations rather doing a one-and-done solution of a matrix equation.

The bigger surprise, though, had to do with

multicollinearity, which occurs when $X$ has less than full column rank. Multicollinearity means $X^\prime X$ is singular and cannot be inverted. It also means the model contains redundant predictors (some predictors are linear combinations of others), and I always gave my students the standard prescription: figure out which predictors were redundant and eliminate them. Other people sometimes recommend a perturbation approach (

ridge regression). For that matter, gradient descent should work properly with multicollinear data.

Before continuing, I need to inject a well-established if not necessarily well-known fact about multicollinearity. Whether $X$ is multicollinear or not, there is a unique vector $\hat{y}$ of fitted values that minimizes the sum of squared errors. This is because $\hat{y}$ is the orthogonal projection of $y$ onto the column space of $X$, and the orthogonal project of any vector onto any linear subspace is unique. When $X$ has full rank, $\hat{y} =Xb$ where $b$ is given by the normal equations (and is the unique coefficient vector that generates $\hat{y}$. When $X$ is less than full rank (multicollinear), the optimal fitted values $\hat{y}$ can be obtained from any of an infinite number of coefficient vectors ... but the fits are still uniquely determined.

The big surprise for me: it turns out that you can fit a version of the normal equations using the

Moore-Penrose pseudoinverse of $X$. I'd come across the pseudoinverse in some long ago linear algebra or numerical analysis class, filed it away as another mathematical curiosity of no particular use, and forgotten it. Oops.

Let $A$ be the pseudoinverse of $X$ (which always exists, is unique, and has dimensions $n \times m$). Then $\hat{b} = Ay$ yields the correct fitted values $X \hat{b}$. A proof begins with one of the

identities true of the pseudoinverse: $$X^\prime = X^\prime X A.$$Now the gradient $\nabla_b J(b)$ works out to $-2X^\prime e$; substitute $y - XAy$ for $e$ and $\hat{b}$ for $b$, and we have \begin{alignat*}{1}
\nabla_{b}J(\hat{b}) & =-2X^{\prime}e\\
& =-2X^{\prime}\left(y-X\hat{b}\right)\\
& =-2X^{\prime}\left(y-XAy\right)\\
& =-2X^{\prime}(I-XA)y\\
& =-2\left(X^{\prime}-X^{\prime}XA\right)y.
\end{alignat*}Substitute the aforementioned identity for the middle expression and we have $
\nabla_{b}J(\hat{b}) = 0$. So choosing $b=\hat{b}$ minimizes squared errors and thus gives the proper fitted values.

All that said, I would still rather eliminate redundant predictors. That process can be automated, but if you are writing a machine learning algorithm that will chew through potentially large data sets with no human supervision, and you want to use linear (or polynomial) models, I suppose you should at least consider using the entire data set with the pseudoinverse taking care of the multicollinearity.