Linear Regression

First, we introduce some definitions and theorems that will be used later:

Definition 1: For a function \(f: \Bbb R^{m \times n} \mapsto \Bbb R\) and \(A \in \Bbb R^{m \times n}\), define
\[
\nabla_A f(A)=
\begin{bmatrix}
\frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}}\\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}}
\end{bmatrix}
\]

Definition 2 (trace): If \(A \in R^{n\times n}\) is a square matrix, the trace of \(A\) is the sum of its diagonal entries:
\[
tr A = \sum_{i=1}^n A_{ii}
\]

Theorem 1: \(tr AB = tr BA\)

Theorem 2: \(tr ABC = tr CAB = tr BCA\)

Theorem 3: \(f(A)=tr AB \Rightarrow \nabla_Af(A)=B^T\)

Theorem 4: \(tr A = tr A^T\)

Theorem 5: \(a \in R \Rightarrow tr a = a\)

Theorem 6: \(\nabla_A tr ABA^TC = CAB + C^TAB^T\)
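These trace identities are easy to sanity-check numerically. Below is a minimal NumPy sketch (the random matrices and all names are illustrative, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))

# Theorem 1: tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Theorem 2: tr(ABC) = tr(CAB) = tr(BCA)
M, N, P = (rng.normal(size=(3, 3)) for _ in range(3))
assert np.isclose(np.trace(M @ N @ P), np.trace(P @ M @ N))
assert np.isclose(np.trace(M @ N @ P), np.trace(N @ P @ M))

# Theorem 4: tr(A) = tr(A^T)
S = rng.normal(size=(5, 5))
assert np.isclose(np.trace(S), np.trace(S.T))
```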

Linear Regression

Rewriting the Notation

As mentioned in the previous post, each step of gradient descent updates each parameter \(\theta_i\) as:
\[
\theta_i := \theta_i - \alpha\frac{\partial}{\partial \theta_i}J(\theta)
\]
All the parameters \(\theta\) of \(h_\theta(x)\) can then be collected into a column vector:
\[
\theta = \left[ \begin{array}{c} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{array} \right] \in R^{n+1}
\]

Similarly, we can define the gradient vector:
\[
\nabla_\theta J = \left[ \begin{array}{c} \frac{\partial}{\partial \theta_0}J\\ \frac{\partial}{\partial \theta_1}J\\ \vdots\\ \frac{\partial}{\partial \theta_n}J \end{array} \right] \in R^{n+1}
\]

The gradient descent update can then be written as:
\[
\theta := \theta - \alpha\nabla_\theta J
\]
where both \(\theta\) and \(\nabla_\theta J\) are \(n+1\)-dimensional vectors.
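As a sketch of what this vectorized update looks like in code, here is a minimal NumPy version (the names `grad_J`, `theta0`, `alpha`, and `num_steps` are my own, not from the original post):

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.01, num_steps=1000):
    """Iterate the vectorized update theta := theta - alpha * grad_J(theta).

    grad_J -- callable mapping an (n+1)-vector theta to the gradient nabla_theta J
    theta0 -- initial (n+1)-vector of parameters
    """
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(num_steps):
        theta = theta - alpha * grad_J(theta)
    return theta
```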

For all the inputs \(x^{(1)}, x^{(2)}, \dots, x^{(m)}\) in the training set, each input is a column vector
\[
x^{(i)} = \left[ \begin{array}{c} 1\\ x_1^{(i)}\\ \vdots\\ x_n^{(i)} \end{array} \right] \in R^{n+1}
\]
where the leading 1 matches the intercept term \(\theta_0\).

\(h(x)=h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n\),可以表示成向量: \[ \left[ \begin{array}{c} h_\theta(x^{(1)})\\\ h_\theta(x^{(2)})\\\ \vdots\\\ h_\theta(x^{(m)})\\\ \end{array} \right] = \left[ \begin{array}{c} (x^{(1)})^T\theta\\\ (x^{(2)})^T\theta\\\ \vdots\\\ (x^{(m)})^T\theta\\\ \end{array} \right] = \left[ \begin{array}{c} (x^{(1)})^T\\\ (x^{(2)})^T\\\ \vdots\\\ (x^{(m)})^T \end{array} \right] \cdot \theta = X \cdot \theta \]

Stack the targets into
\[
Y = \left[ \begin{array}{c} y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)} \end{array} \right]
\]
Then the cost function can be written as
\[
J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2}(X \cdot \theta - Y)^T(X \cdot \theta - Y)
\]
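A quick numerical check that the summation form and the matrix form of \(J(\theta)\) agree, as a sketch on made-up data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # leading column of ones
Y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Summation form: J = 1/2 * sum_i (h(x^(i)) - y^(i))^2
J_sum = 0.5 * np.sum((X @ theta - Y) ** 2)

# Matrix form: J = 1/2 * (X theta - Y)^T (X theta - Y)
r = X @ theta - Y
J_mat = 0.5 * (r @ r)

assert np.isclose(J_sum, J_mat)
```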

Derivation

Rather than iterating gradient descent, we can solve directly for the point where the gradient vanishes, i.e. \(\nabla_\theta J(\theta) = \vec{0}\).

First, expand the gradient:
\[
\begin{align}
\nabla_\theta J(\theta) & = \nabla_\theta\frac{1}{2}(X \cdot \theta - Y)^T(X \cdot \theta - Y)\\
& = \frac{1}{2}\nabla_\theta \, tr(\theta^TX^TX\theta - \theta^TX^TY - Y^TX\theta + Y^TY)\\
& = \frac{1}{2}\left[\nabla_\theta \, tr(\theta\theta^TX^TX) - \nabla_\theta \, tr(Y^TX\theta) - \nabla_\theta \, tr(Y^TX\theta)\right]
\end{align}
\]
The second line uses Theorem 5 (a scalar equals its own trace). The third line uses Theorem 2 to cycle \(tr(\theta^TX^TX\theta)\) into \(tr(\theta\theta^TX^TX)\), uses Theorem 4 to rewrite \(tr(\theta^TX^TY)\) as \(tr(Y^TX\theta)\), and drops \(Y^TY\), which does not depend on \(\theta\).

For the first term, apply Theorem 6 with \(A = \theta\), \(B = I\), \(C = X^TX\):
\[
\begin{align}
\nabla_\theta \, tr(\theta\theta^TX^TX) & = \nabla_\theta \, tr(\theta I \theta^TX^TX)\\
& = X^TX\theta I + (X^TX)^T\theta I^T & \text{$CAB+C^TAB^T$}\\
& = X^TX\theta + X^TX\theta
\end{align}
\]
For the second and third terms, apply Theorem 3 with \(A = \theta\), \(B = Y^TX\):
\[
\nabla_\theta \, tr(Y^TX\theta) = (Y^TX)^T = X^TY
\]
Putting the pieces together and setting the gradient to zero:
\[
\nabla_\theta J(\theta) = \frac{1}{2}\left[2X^TX\theta - 2X^TY\right] = X^TX\theta - X^TY = 0
\Rightarrow X^TX\theta = X^TY
\]
Solving gives
\[
\theta = (X^TX)^{-1}X^TY
\]
Of course, this closed-form solution only applies when \(X^TX\) is full rank, so that its inverse exists.
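In practice, the normal equation is usually solved with a linear solver rather than by forming the explicit inverse, which is cheaper and more numerically stable. A minimal sketch, assuming \(X^TX\) is full rank (all names illustrative):

```python
import numpy as np

def normal_equation(X, Y):
    """Solve X^T X theta = X^T Y for theta; assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Usage on made-up data: the gradient X^T X theta - X^T Y vanishes at the solution.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 3))])
Y = rng.normal(size=10)
theta = normal_equation(X, Y)
assert np.allclose(X.T @ X @ theta, X.T @ Y)
```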

If \(X^TX\) is not full rank, the system \(X^TX\theta = X^TY\) is underdetermined. This happens, for example, when there are fewer training examples than parameters (fewer than \(n+1\) samples for \(n+1\) parameters), or when some features are linearly dependent.
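One common workaround, not part of the original derivation: the Moore-Penrose pseudoinverse still yields a least-squares solution (the minimum-norm one) even when \(X^TX\) is singular. A short sketch on deliberately rank-deficient data:

```python
import numpy as np

# A rank-deficient example: the duplicated feature column makes X^T X singular.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
Y = np.array([1.0, 2.0, 4.0])

theta = np.linalg.pinv(X) @ Y  # minimum-norm least-squares solution
```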