Lecture 2: Math Review
Linear Algebra Related
Gradient Vector
For $f:\mathbb{R}^n\to\mathbb{R}$, the gradient is the vector of partial derivatives $\nabla_{\mathbf{x}}f = \left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right)^T$.
Properties
- $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{y}) = \nabla_{\mathbf{x}}(\mathbf{y}^T\mathbf{x}) = \mathbf{y}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{x}) = 2\mathbf{x}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{y}) = \mathbf{A}\mathbf{y}$
- $\nabla_{\mathbf{x}}(\mathbf{y}^T\mathbf{A}\mathbf{x}) = \mathbf{A}^T\mathbf{y}$
- $\nabla_{\mathbf{x}}(\mathbf{x}^T\mathbf{A}\mathbf{x}) = \mathbf{A}\mathbf{x}+\mathbf{A}^T\mathbf{x}$
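As a quick sanity check of the last identity, here is a minimal NumPy sketch (the matrix and vector are arbitrary illustrative values) comparing the closed-form gradient $(\mathbf{A}+\mathbf{A}^T)\mathbf{x}$ against a finite-difference approximation:

```python
import numpy as np

# Numerical check of  grad_x (x^T A x) = (A + A^T) x
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))              # arbitrary, not necessarily symmetric
x = rng.normal(size=4)

f = lambda v: v @ A @ v                  # scalar function v^T A v
analytic = (A + A.T) @ x                 # closed-form gradient at x

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```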
Eigenvalue $\lambda$ & Eigenvector $\mathbf{v}$
For a square matrix $\mathbf{A}$, if $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ for some nonzero vector $\mathbf{v}$, then $\lambda$ is called an eigenvalue and $\mathbf{v}$ is called an eigenvector of $\mathbf{A}$.
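A small NumPy sketch checking the definition on an example matrix (the matrix values are made up for illustration):

```python
import numpy as np

# Check A v = lambda v for each eigenpair of a small example matrix
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are the eigenvectors

for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # True for every eigenpair
```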
Probability Related
Useful formula
$P(B|A) = \frac{P(A\cap B)}{P(A)}$
$P(A\cap B) = P(A)P(B|A)$
Formula of total probability: if $B_1,\dots,B_n$ partition the sample space, then $P(A) = \sum_{i=1}^n P(A|B_i)P(B_i)$
Bayes Theorem: $P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^n P(A|B_j)P(B_j)}$
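A small worked example of Bayes Theorem (the numbers below are made up for illustration): a test has sensitivity $P(+\mid D)=0.9$, false-positive rate $P(+\mid \bar{D})=0.05$, and the prevalence is $P(D)=0.01$. Then

$$P(D\mid +) = \frac{P(+\mid D)P(D)}{P(+\mid D)P(D) + P(+\mid \bar{D})P(\bar{D})} = \frac{0.9\times 0.01}{0.9\times 0.01 + 0.05\times 0.99} \approx 0.154$$

so even a positive result leaves the probability of disease fairly low when the prior is small.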
PDF (or PMF in the discrete case) and CDF, in both discrete and continuous form.
Moments:
$r$-th moment: $E(X^r)$; $\mu=E(X)$ is the first moment ($r=1$)
$r$-th central moment: $\mu_r = E((X-\mu)^r)$; $\mu_0=1$, $\mu_1=0$, $\mu_2 = \sigma^2$ (the variance)
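For example, expanding the second central moment shows how it relates to the raw moments:

$$\mu_2 = E\big((X-\mu)^2\big) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2 = \sigma^2$$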
Optimization Related
An optimization problem contains three parts: the objective function; the variables to optimize; the constraints.
minimize over $\mathbf{x}$: $f_0(\mathbf{x})$ (objective function)
subject to $f_i(\mathbf{x}) \le 0,\ i=1,\dots,m$ (inequality constraints)
$h_i(\mathbf{x}) = 0,\ i=1,\dots,p$ (equality constraints)
Active constraint
A constraint $f_i$ is active at $\mathbf{x}$ if $\mathbf{x}$ lies on the boundary of the region it defines, i.e. $f_i(\mathbf{x})=0$.
Lagrangian
Consider the standard optimization problem without the equality constraints, and let $p^*$ be its optimal value.
The Lagrangian $\mathcal{L}$ is defined as $\mathcal{L}(\mathbf{x}, \mathbf{\lambda}) = f_0(\mathbf{x}) + \sum_{i=1}^m \lambda_i f_i(\mathbf{x})$, where $\lambda_i$ is the Lagrange multiplier associated with constraint $f_i$.
Lagrangian Dual Function
Define the dual function $g$ as the minimum of the Lagrangian over $\mathbf{x}$: $g(\mathbf{\lambda}) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \mathbf{\lambda})$
Lower Bound Property
If $\mathbf{\lambda}\ge0$ and $\mathbf{x}$ is primal feasible, then $g(\mathbf{\lambda})\le f_0(\mathbf{x})$: since $\lambda_i \ge 0$ and $f_i(\mathbf{x})\le 0$, we have $g(\mathbf{\lambda}) \le \mathcal{L}(\mathbf{x}, \mathbf{\lambda}) = f_0(\mathbf{x}) + \sum_i \lambda_i f_i(\mathbf{x}) \le f_0(\mathbf{x})$.
Lagrange Dual Problem
Find the best lower bound on $p^*$:
maximize over $\mathbf{\lambda}$: $g(\mathbf{\lambda})$
subject to $\mathbf{\lambda}\ge 0$
We define $d^*$ as the optimal value of this dual problem, and we always have $d^* \le p^*$ (weak duality), since $g(\mathbf{\lambda})\le f_0(\mathbf{x})$ for every primal-feasible $\mathbf{x}$. The difference $p^* - d^*$ is called the *optimal duality gap*.
Important conclusion: for convex problems we usually have strong duality, $p^* = d^*$ (e.g. under Slater's condition).
Optimizing a convex problem via duality (examples can be seen in the slides; a small worked sketch also follows below):
- Find the objective function to optimize and the constraints
- Form the Lagrangian and derive the dual function
- Maximize the dual function; under strong duality this attains the same optimal value as minimizing the primal problem.
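A minimal worked sketch of these steps on a toy problem (chosen here for illustration, not taken from the slides): minimize $x^2$ subject to $x \ge 1$, i.e. $f_0(x)=x^2$ and $f_1(x)=1-x\le 0$.

$$\begin{aligned}
\mathcal{L}(x,\lambda) &= x^2 + \lambda(1-x), \qquad \lambda \ge 0\\
g(\lambda) &= \inf_x \mathcal{L}(x,\lambda) = \lambda - \tfrac{\lambda^2}{4} \qquad (\text{minimizer } x = \lambda/2)\\
\max_{\lambda\ge 0} g(\lambda) &= 1 \ \text{ at } \ \lambda^* = 2, \qquad x^* = \lambda^*/2 = 1, \qquad d^* = p^* = 1
\end{aligned}$$

The problem is convex and strictly feasible (e.g. $x=2$), so strong duality holds and the dual maximum equals the primal minimum.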
Duality in Algorithm
Duality can be used to decide when to stop an iterative algorithm:
maximize $g(\mathbf{\lambda})$ while minimizing $f_0(\mathbf{x})$, and stop once the duality gap $f_0(\mathbf{x}) - g(\mathbf{\lambda})$ falls below some threshold.
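A minimal Python sketch of this stopping rule, reusing the toy problem from the worked example above (dual ascent on $g$, with the primal iterate projected onto the feasible set so the gap is a valid bound; the step size and tolerance are arbitrary choices):

```python
# Toy problem: minimize x^2 subject to x >= 1, i.e. f1(x) = 1 - x <= 0.
# Dual ascent on g(lambda) = lambda - lambda^2 / 4, stopping on the duality gap.

f0 = lambda x: x ** 2                        # primal objective
g = lambda lam: lam - lam ** 2 / 4           # dual function (derived above)
g_grad = lambda lam: 1 - lam / 2             # derivative of the dual function

lam, step, tol = 0.5, 0.5, 1e-6
for it in range(1000):
    x = lam / 2                              # minimizer of the Lagrangian for this lambda
    x_feas = max(x, 1.0)                     # project onto the feasible set x >= 1
    gap = f0(x_feas) - g(lam)                # duality gap: bound on suboptimality
    if gap < tol:
        break
    lam = max(0.0, lam + step * g_grad(lam)) # gradient ascent step, keep lambda >= 0

print(it, lam, x_feas, gap)                  # approaches lambda* = 2, x* = 1, gap ~ 0
```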
Complementary Slackness
Suppose $\mathbf{x}^*$ and $\mathbf{\lambda}^*$ are primal optimal and dual optimal with zero duality gap. Then we have:
$f_0(\mathbf{x}^*) = g(\mathbf{\lambda}^*) = \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \mathbf{\lambda}^*) \le f_0(\mathbf{x}^*) + \sum_i \lambda_i^* f_i(\mathbf{x}^*) \le f_0(\mathbf{x}^*)$
so $\sum_i \lambda_i^* f_i(\mathbf{x}^*) = 0$. Since $\lambda_i^* \ge 0$ and $f_i(\mathbf{x}^*)\le 0$, every term is non-positive, hence each term must vanish: $\lambda_i^* f_i(\mathbf{x}^*) = 0$ for all $i$.
So we know that if $\lambda_i^* > 0$ then the $i$-th constraint is active ($f_i(\mathbf{x}^*)=0$), and if the $i$-th constraint is inactive ($f_i(\mathbf{x}^*)<0$) then $\lambda_i^* = 0$.
KKT Optimality Conditions
Suppose that the $f_i$ are differentiable and $\mathbf{x}^*$, $\mathbf{\lambda}^*$ are primal and dual optimal with zero duality gap.
From complementary slackness, we know that $\lambda_i^* f_i(\mathbf{x}^*) = 0$ for all $i$.
We also know that $\mathbf{x}^*$ minimizes $\mathcal{L}(\mathbf{x}, \mathbf{\lambda}^*)$ over $\mathbf{x}$, so its gradient vanishes there:
$\nabla_{\mathbf{x}}f_0(\mathbf{x}^*)+\sum_i\lambda_i^*\nabla_{\mathbf{x}}f_i(\mathbf{x}^*) = 0$ (stationarity)
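Collecting the pieces, the KKT conditions for this inequality-constrained problem, in their standard statement, are:

$$\begin{aligned}
f_i(\mathbf{x}^*) &\le 0, \quad i=1,\dots,m && \text{(primal feasibility)}\\
\lambda_i^* &\ge 0, \quad i=1,\dots,m && \text{(dual feasibility)}\\
\lambda_i^* f_i(\mathbf{x}^*) &= 0, \quad i=1,\dots,m && \text{(complementary slackness)}\\
\nabla_{\mathbf{x}} f_0(\mathbf{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla_{\mathbf{x}} f_i(\mathbf{x}^*) &= 0 && \text{(stationarity)}
\end{aligned}$$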
Conclusion on KKT