LaTeX Formulas

^ superscript
_{} subscript
\sum summation symbol; similarly \max, \min (there is no built-in \avg; define one if needed, e.g. \operatorname{avg})
\geq ≥ (greater than or equal)
\frac{10^4}{2^j} fraction
\limits places an operator's sub/superscripts above and below it, e.g. \sum\limits_{i=0}^{n}
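For example, several of the commands above combined into one displayed formula (a purely illustrative expression):

\sum\limits_{j=0}^{n} \frac{10^4}{2^j} \geq 10^4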

Large brace for case/piecewise expressions:
\left
\{     opening brace
\begin{matrix}
1, &case 1
\\ line break
2, &case 2
\end{matrix}
\right.   (the trailing period is an invisible right delimiter; every \left must be balanced by a \right)
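Assembled into a single expression, a minimal working example of the construct above (to be placed inside math mode; the labels are illustrative):

f(x) =
\left\{
\begin{matrix}
1, & \text{decision 1} \\
2, & \text{decision 2}
\end{matrix}
\right.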

\varepsilon  ε
\theta θ
\sigma σ

Α, α, Β, β, Γ, γ, Δ, δ, Ε, ε, Ζ, ζ, Η, η, Θ, θ, Ι, ι, Κ, κ, Λ, λ, Μ, μ, Ν, ν, Ξ, ξ, Ο, ο, Π, π, Ρ, ρ, Σ, σ/ς, Τ, τ, Υ, υ, Φ, φ, Χ, χ, Ψ, ψ, Ω, ω

Linear Regression

Premise: for linear regression with independent, identically distributed inputs X (an m × n matrix: m samples, each with n features), the errors \varepsilon^{(i)} are independent and identically distributed, following a Gaussian distribution with mean 0 and variance \sigma^2.

Goal: with the observed value of each sample denoted y^{(i)} (stacked into an m × 1 vector), solve for the parameter vector θ (n × 1).

Mathematical Derivation

For the fitted regression plane h: h_{\theta}(x) = \sum\limits_{i = 0}^{n}\theta_{i}x_{i} = \theta^Tx

For the error ε between the true value and the predicted value: y^{(i)} = \theta^Tx^{(i)} + \varepsilon^{(i)}

The error follows a Gaussian distribution: p(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(\varepsilon^{(i)})^{2}}{2\sigma^{2}}\right)

Substituting y^{(i)} - \theta^T x^{(i)} for the error gives: p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^{2}}{2\sigma^{2}}\right)

The question: which parameters, combined with our data, are most likely to produce the observed true values?

Likelihood function: L(\theta) = \prod\limits_{i=1}^{m} p(y^{(i)} | x^{(i)}; \theta) = \prod\limits_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Log-likelihood: \log L(\theta) = \log \prod\limits_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)

Expanding and simplifying: \sum\limits_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\sigma} \exp \left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) = m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum\limits_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2

To make the error small, the likelihood should be as large as possible, so the subtracted term must be as small as possible: J(\theta) = \frac{1}{2} \sum\limits_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2 (least squares)

Objective function: J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right)^2 = \frac{1}{2} (X\theta - y)^T (X\theta - y)

Taking the gradient with respect to θ:

\begin{align*} \nabla_\theta J(\theta) &= \nabla_\theta \left( \frac{1}{2} (X\theta - y)^T (X\theta - y) \right) \\ &= \nabla_\theta \left( \frac{1}{2} (\theta^T X^T - y^T)(X\theta - y) \right) \\ &= \nabla_\theta \left( \frac{1}{2} (\theta^T X^T X\theta - \theta^T X^T y - y^T X\theta + y^T y) \right) \\ &= \frac{1}{2} \left( 2X^T X\theta - X^T y - (y^T X)^T \right) = X^T X\theta - X^T y \\ \end{align*}

Setting the gradient to zero yields the normal equation: \theta = \left( X^T X \right)^{-1} X^T y
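A minimal NumPy sketch of this closed-form solution (assuming X already contains a bias column; the function and variable names are illustrative):

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^{-1} X^T y, computed with a linear solve
    # instead of an explicit inverse for numerical stability.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: fit y = 1 + 2x on a few points.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # bias column + feature
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)              # close to [1.0, 2.0]
```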

Gradient Descent

\begin{align*} \text{Gradient descent, objective function:} & \quad J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)}))^2 \\ \text{Batch gradient descent:} & \quad \frac{\partial J(\theta)}{\partial \theta_j} = -\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}, \quad \theta_j' = \theta_j + \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)} \\ \text{Stochastic gradient descent:} & \quad \theta_j' = \theta_j + (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)} \\ \text{Mini-batch gradient descent:} & \quad \theta_j := \theta_j - \alpha \frac{1}{10} \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)}) x_j^{(k)} \end{align*}

(where α denotes the learning rate)
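A small NumPy sketch of the three update rules for linear regression, with h_theta(x) = X @ theta (the function names are illustrative, not part of any library):

```python
import numpy as np

def batch_step(theta, X, y, alpha):
    # Full-batch update: average the gradient over all m samples.
    m = len(y)
    return theta - alpha * (X.T @ (X @ theta - y)) / m

def sgd_step(theta, x_i, y_i, alpha):
    # Stochastic update: a single sample (x_i, y_i).
    return theta - alpha * (x_i @ theta - y_i) * x_i

def minibatch_step(theta, X_b, y_b, alpha):
    # Mini-batch update: a small slice of the data (e.g. 10 samples).
    return theta - alpha * (X_b.T @ (X_b @ theta - y_b)) / len(y_b)
```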

Logistic Regression

Sigmoid function: g(z) = \frac{1}{1 + e^{-z}}

Interpretation: the sigmoid maps any input into the interval [0, 1]. Linear regression produces a real-valued prediction; passing that value through the sigmoid converts it into a probability, turning the regression output into a classification decision.

Prediction function: h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
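A minimal sketch of the sigmoid and the resulting prediction function in NumPy (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}): maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # h_theta(x) = g(theta^T x), computed for every row of X at once.
    return sigmoid(X @ theta)
```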

For a binary classification task:

\begin{align*} P(y = 1 \mid x; \theta) &= h_\theta(x) \\ P(y = 0 \mid x; \theta) &= 1 - h_\theta(x) \end{align*}

Combining the two cases: P(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1-y}

Likelihood function: L(\theta) = \prod_{i=1}^{m} P(y_i \mid x_i; \theta) = \prod_{i=1}^{m} (h_\theta(x_i))^{y_i} (1-h_\theta(x_i))^{1-y_i}

Log-likelihood: l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left( y_i \log h_\theta(x_i) + (1-y_i) \log(1-h_\theta(x_i)) \right)

Maximizing this would require gradient ascent, so introduce J(\theta) = -\frac{1}{m} l(\theta) to convert it into a gradient descent task.

Derivation of the gradient (differentiating J(\theta)):

\begin{align*}\frac{\partial}{\partial \theta_j} J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \frac{1}{h_\theta(x_i)} \frac{\partial}{\partial \theta_j} h_\theta(x_i) - (1-y_i) \frac{1}{1-h_\theta(x_i)} \frac{\partial}{\partial \theta_j} h_\theta(x_i) \right) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \frac{1}{g(\theta^\top x_i)} - (1-y_i) \frac{1}{1-g(\theta^\top x_i)} \right) \frac{\partial}{\partial \theta_j} g(\theta^\top x_i) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \frac{1}{g(\theta^\top x_i)} - (1-y_i) \frac{1}{1-g(\theta^\top x_i)} \right) g(\theta^\top x_i) (1-g(\theta^\top x_i)) \frac{\partial}{\partial \theta_j} \theta^\top x_i \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( y_i (1-g(\theta^\top x_i)) - (1-y_i) g(\theta^\top x_i) \right) x_i^j \\ &= -\frac{1}{m} \sum_{i=1}^{m} (y_i - g(\theta^\top x_i)) x_i^j \\ &= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i) x_i^j \end{align*}

Parameter update: \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i) x_i^j
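A self-contained NumPy sketch of this update rule, vectorized over all m samples (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_step(theta, X, y, alpha):
    # theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_i_j,
    # written in vectorized form over the whole data matrix X.
    m = len(y)
    return theta - alpha * (X.T @ (sigmoid(X @ theta) - y)) / m
```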

Softmax Regression

h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}
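A minimal NumPy sketch of this hypothesis for one sample (the max-logit shift is an added numerical-stability step that does not change the probabilities; names are illustrative):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    # Theta has shape (k, n): one parameter vector theta_j per class.
    # Returns the k class probabilities [p(y=1|x), ..., p(y=k|x)] for sample x.
    logits = Theta @ x              # theta_j^T x for each class j
    logits = logits - logits.max()  # shift by the max logit for stability
    exps = np.exp(logits)
    return exps / exps.sum()
```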