Data-Mining-Lec11

Vanishing/Exploding gradients

The loss depends on the last-layer activation:

\[\ell = \ell(a_L)\]

Gradient with respect to the parameters of layer \(j\):

\[\nabla_{w_j, b_j} \ell\]

Parameter update (gradient step), with \(\theta = (w_j, b_j)\):

\[\theta_{i+1} := \theta_i - \eta \nabla_{\theta_i} \ell\]
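A minimal NumPy sketch of this update (hypothetical shapes and names, just to fix the convention that the scaled gradient is subtracted):

```python
import numpy as np

def gradient_step(w, b, grad_w, grad_b, eta=0.01):
    """One update theta_{i+1} = theta_i - eta * grad(theta_i) for theta = (w, b)."""
    return w - eta * grad_w, b - eta * grad_b

# toy usage: random values stand in for real parameters and gradients
rng = np.random.default_rng(0)
w, b = rng.normal(size=(4, 3)), rng.normal(size=4)
grad_w, grad_b = rng.normal(size=(4, 3)), rng.normal(size=4)
w, b = gradient_step(w, b, grad_w, grad_b, eta=0.1)
```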

Consider consecutive activations \(a_t\) and \(a_{t+1}\), with some matrix \(A\) mapping one to the other.

If the transformation is linear,

\[a_{t+1} = A a_t, \quad A \in \mathbb{R}^{m\times m}\]

Chain rule:

\[\frac{\partial \ell}{\partial a_t^i} = \sum_{j=1}^{m}{\frac{\partial \ell}{\partial a_{t+1}^j} \cdot \frac{\partial a_{t+1}^j}{\partial a_t^i}}\]

And

\[a_{t+1}^j = \sum_{k=1}^{m}A_{jk} a_t^k\]

\[\frac{\partial \ell}{\partial a_t^i} = \sum_{j=1}^m \frac{\partial \ell}{\partial a_{t+1}^j}\, A_{ji}\]

\[\frac{\partial \ell}{\partial a_t} = A^T \frac{\partial \ell}{\partial a_{t+1}}\]

Then, writing \(A_t\) for the matrix between \(a_{t-1}\) and \(a_t\),

\[\frac{\partial \ell}{\partial a_k} = A_{k+1}^T A_{k+2}^T \cdots A_L^T \, \frac{\partial \ell}{\partial a_L}\]

Gradient vanishing / exploding

Depending on the spectrum of the matrices in this product, the gradient norm can shrink toward zero (vanishing) or blow up (exploding). We want the eigenvalues (singular values) to stay bounded, close to 1.

One way is to require every \(A\) to be an orthogonal matrix (orthonormal rows and columns), so that multiplication by \(A^T\) preserves the gradient's norm.
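As an illustrative NumPy sketch (a toy experiment, not from the lecture): repeatedly multiplying a gradient by the transposes of generic matrices shrinks or blows up its norm exponentially in depth, whereas orthogonal matrices preserve it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, L = 64, 100
g = rng.normal(size=m)                    # stands in for dl/da_L
g_small, g_big, g_orth = g.copy(), g.copy(), g.copy()

for _ in range(L):
    A_small = 0.9 * rng.normal(size=(m, m)) / np.sqrt(m)  # shrinks norms on average
    A_big = 1.1 * rng.normal(size=(m, m)) / np.sqrt(m)    # grows norms on average
    Q, _ = np.linalg.qr(rng.normal(size=(m, m)))          # random orthogonal matrix
    g_small = A_small.T @ g_small
    g_big = A_big.T @ g_big
    g_orth = Q.T @ g_orth

print(np.linalg.norm(g))        # original gradient norm
print(np.linalg.norm(g_small))  # roughly 0.9^100 times smaller: vanishing
print(np.linalg.norm(g_big))    # roughly 1.1^100 times larger: exploding
print(np.linalg.norm(g_orth))   # unchanged: orthogonal maps preserve the norm
```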

Orthogonal neural networks (Orthogonal RNNs).

Problems:

1. Expressiveness: orthogonal maps are only rotations/reflections.
2. We can orthogonalize at initialization, but orthogonality is generally lost during training; it is not invariant under gradient updates.

Ways to parameterize or maintain orthogonality:

1. Unrestricted parameterization with bounded entries, \(|a_{ij}| \leq 1\).
2. Givens rotations (see the sketch after this list): \[A = \text{Giv}_1\text{Giv}_2\cdots\text{Giv}_k\]

    Each factor is parameterized as \(\text{Giv} = \text{Giv}(i, j, \theta)\).
3. Riemannian optimization
    1. Make a standard gradient step and project back onto the orthogonal group.
    2. Compute the so-called Riemannian gradient (expensive).
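A NumPy sketch of the Givens-rotation parameterization mentioned above (illustrative, not lecture code): each \(\text{Giv}(i, j, \theta)\) rotates coordinates \(i\) and \(j\) by an angle \(\theta\), and any product of such factors is orthogonal by construction.

```python
import numpy as np

def givens(m, i, j, theta):
    """Return the m x m Givens rotation Giv(i, j, theta): the identity
    except for a 2x2 rotation acting on coordinates i and j."""
    G = np.eye(m)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c
    G[j, j] = c
    G[i, j] = -s
    G[j, i] = s
    return G

# A = Giv_1 Giv_2 ... Giv_k is orthogonal by construction,
# parameterized only by the index pairs and angles.
m, k = 5, 8
rng = np.random.default_rng(0)
A = np.eye(m)
for _ in range(k):
    i, j = rng.choice(m, size=2, replace=False)
    A = A @ givens(m, i, j, rng.uniform(0, 2 * np.pi))

print(np.allclose(A.T @ A, np.eye(m)))  # True: A stays orthogonal
```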

In the general case, with a nonlinearity \(\sigma\) applied between the linear maps:

Gradient norms are preserved if \(A \in \mathbb{R}^{m\times m}\) is orthogonal and \(\sigma : \mathbb{R} \to \mathbb{R}\) satisfies \(\sigma'(x) = 1\).

\[ \left\| \frac{\partial \ell}{\partial a_t} \right\|_2 \geq \frac{t}{L} \left\| \frac{\partial \ell}{\partial a_L} \right\|_2 \]

HiPPO: Recurrent Memory with Optimal Polynomial Projections

SVM

Goal: find a classifier for a set of labeled training data.

Margins:

Optimization: find a hyperplane \(H\) that separates the points with label +1 from the points with label -1, with the margin maximized.

Feasibility

Define a vector \(\omega\) orthogonal to the hyperplane; it encodes the orientation of the hyperplane.

Without loss of generality, we assume that \(\omega\) points toward the feature vectors with label +1.

\[H = \{ x \in \mathbb{R}^d : \omega^T x + b = 0 \}\]

\((\omega, b)\) is the parametric form of the hyperplane in \(\mathbb{R}^d\).
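As an illustration (hypothetical helper functions, not from the lecture), classification and margin computation for a given \((\omega, b)\):

```python
import numpy as np

def classify(X, w, b):
    """Assign +1 or -1 to each row of X according to which side of the
    hyperplane {x : w^T x + b = 0} it lies on; w points toward the +1 side."""
    return np.sign(X @ w + b)

def margin(X, labels, w, b):
    """Smallest signed distance of the labeled points (labels in {-1, +1})
    to the hyperplane; positive exactly when the hyperplane separates them."""
    return np.min(labels * (X @ w + b)) / np.linalg.norm(w)
```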

Training procedure

Find \(\omega, b\) such that

1. \(H\) is a separating hyperplane;
2. the margin \(d_H(X_{\text{train}})\), i.e. the distance from the training points to \(H\), is maximized given (1).

Optimization trick: convert to a convex program.

\[\min_{\omega, b} \ \frac12 \|\omega\|_2^2\]
\[\text{s.t.}\quad l_i(\omega^T x_i + b) \geq 1, \quad \text{for } i = 1, \dots, N\]
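A minimal sketch of this convex program with cvxpy (toy data assumed linearly separable; the dataset and variable names are mine, not the lecture's):

```python
import cvxpy as cp
import numpy as np

# Toy 2-D data with labels l_i in {-1, +1}; the hard-margin SVM assumes separability.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+3.0, size=(20, 2)),
               rng.normal(loc=-3.0, size=(20, 2))])
l = np.concatenate([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||_2^2   s.t.   l_i (w^T x_i + b) >= 1  for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(l, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # maximum-margin hyperplane parameters
```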

Non-Linear classifier

\[x_i \to \phi(x_i)\]

Here \(\phi\) is some nonlinear transformation (feature map); substitute each \(x_i\) with \(\phi(x_i)\) in the optimization problem above.

The resulting classifier is linear in the feature space but nonlinear in the original space.
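A small sketch of the idea with a hand-picked quadratic feature map (my example, not the lecture's): a hyperplane in the 5-dimensional feature space corresponds to a conic section, hence a nonlinear decision boundary, in the original 2-D space.

```python
import numpy as np

def phi(X):
    """Quadratic feature map: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2], axis=1)

# The same max-margin program as above is then solved on phi(X) instead of X;
# a new point x is classified by sign(w^T phi(x) + b).
```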

Lagrange Duality

Lagrangian and KKT conditions:
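For reference (standard convex-optimization material, not derived in these notes), the Lagrangian of the margin problem above, with multipliers \(\alpha_i \geq 0\), is

\[\mathcal{L}(\omega, b, \alpha) = \frac12 \|\omega\|_2^2 - \sum_{i=1}^{N} \alpha_i \left( l_i(\omega^T x_i + b) - 1 \right)\]

and the KKT conditions are: stationarity \(\omega = \sum_i \alpha_i l_i x_i\) and \(\sum_i \alpha_i l_i = 0\), primal feasibility \(l_i(\omega^T x_i + b) \geq 1\), dual feasibility \(\alpha_i \geq 0\), and complementary slackness \(\alpha_i \left( l_i(\omega^T x_i + b) - 1 \right) = 0\).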