CS229 Notes 11: Independent Components Analysis

The goal of Independent Components Analysis (ICA) is to find a new basis in which to represent the data.

Cocktail party problem:

  • sources: \(s\in R^d\)
  • microphone recordings (receivers): \(x\in R^d\)
  • mixing matrix \(A\) (square): \(x = As\)
  • unmixing matrix \(W = A^{-1}\): \(s^{(i)} = Wx^{(i)}\), where the superscript \((i)\) indexes time
  • j-th source at time i: \(s_j^{(i)} = w_j^Tx^{(i)}\), where \(W = [w_1^T, \dots, w_d^T]^T\), i.e. each row of \(W\) is a \(w_j^T\) (a small simulation of this setup is sketched after this list)
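As a quick illustration of this setup (my own sketch, not from the original notes: the two source signals and the 2×2 mixing matrix below are arbitrary), knowing the true \(A\) makes unmixing trivial with \(W = A^{-1}\); ICA is about estimating \(W\) when \(A\) is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

# d sources observed over n time steps; row i of S is s^{(i)}
n, d = 1000, 2
t = np.linspace(0, 1, n)
S = np.column_stack([np.sin(2 * np.pi * 5 * t),      # source 1: a sine wave (assumed)
                     rng.uniform(-1, 1, size=n)])    # source 2: uniform noise (assumed)

# hypothetical square mixing matrix A; each recording is x^{(i)} = A s^{(i)}
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = S @ A.T

# if A were known, W = A^{-1} recovers the sources exactly: s^{(i)} = W x^{(i)}
W = np.linalg.inv(A)
S_rec = X @ W.T
print(np.allclose(S_rec, S))    # True
```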

ICA ambiguities

Scale of \(A\):

\[x = 2A(0.5s) = As\]

Intuitively, the scale of a source and its distance to the microphone are coupled: a louder source farther away can produce the same recording as a quieter source nearby.

There is no way to recover the correct scaling of the \(w_i\)'s.
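A minimal numerical check of this ambiguity (my own example, with an arbitrary diagonal scaling \(D\)): scaling the sources by \(D\) while scaling \(A\)'s columns by \(D^{-1}\) leaves the recordings unchanged.

```python
import numpy as np

# arbitrary (assumed) mixing matrix and source vector
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
s = np.array([2.0, -1.0])

# scale each source by some factor and the matching column of A by its inverse:
# the recording x is identical, so the per-source scale cannot be identified
D = np.diag([2.0, 0.5])
x_original = A @ s
x_rescaled = (A @ np.linalg.inv(D)) @ (D @ s)
print(np.allclose(x_original, x_rescaled))    # True
```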

Non-Gaussian

The standard Gaussian is isotropic: its density is rotationally symmetric, which creates an ambiguity.

If \(s \sim N(0,I)\), then

\[E_s[x] = E[As] = AE[s] = 0\]

\[\begin{aligned}Cov[x] & = E[xx^T] - E[x]E[x]^T \\ & = E[xx^T]\\ & = E[Ass^TA^T]\\ & = A\cdot Cov[s] \cdot A^T\\ & = AA^T \end{aligned}\]

Basically, \[x \sim N(0, AA^T)\]

Let \(R\) be an arbitrary orthogonal (e.g. rotation/reflection) matrix and \(A' = AR\), then \(x' = A's\).

The distribution of \(x'\) is also Gaussian, \(x'\sim N(0, AA^T)\), because

\[A'A'^T = ARR^TA^T = AA^T\]

Then we could never figure out whether the mixing matrix is \(A\) or \(A'\).

However, so long as \(s\) is distributed anisotropically (non-Gaussian), we can recover the mixing matrix. (Conversely, if the joint density of the sources were rotationally symmetric, rotating all the sources simultaneously would leave the distribution of the recordings unchanged, so the rotation could never be detected.)
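A small sketch of the Gaussian case (the matrices below are made up): for any orthogonal \(R\), the covariances of \(x = As\) and \(x' = (AR)s\) coincide, so no procedure that only sees \(x\) can distinguish \(A\) from \(AR\).

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed mixing matrix A and a random orthogonal matrix R (the Q factor of a QR)
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))

# with Gaussian sources, x = A s and x' = (A R) s are both zero-mean Gaussian
# with the same covariance, so A and A' = AR are indistinguishable
cov_x  = A @ A.T
cov_x2 = (A @ R) @ (A @ R).T
print(np.allclose(cov_x, cov_x2))    # True: A R R^T A^T = A A^T
```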

Densities and linear transforms

\[p_x(x) = p_s(Wx)\cdot|W|\]

E.g. the uniform distribution: if \(s \sim \mathrm{Uniform}[0,1]\) and \(x = 2s\) (so \(W = 0.5\)), then \(p_x(x) = p_s(0.5x)\cdot 0.5 = 0.5\) on \([0,2]\).

Remember the determinant measures the volume of the parallelepiped spanned by the matrix's columns, i.e. how the linear map scales volume (see the linear algebra recap). Imagine the density being diluted as the space is stretched.
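A one-dimensional sanity check of \(p_x(x) = p_s(Wx)\cdot|W|\) (my own example, matching the uniform case above):

```python
import numpy as np

rng = np.random.default_rng(0)

# s ~ Uniform[0, 1] and x = 2s, so W = 0.5 and p_x(x) = p_s(0.5 x) * 0.5 = 0.5 on [0, 2]
s = rng.uniform(0, 1, size=100_000)
x = 2.0 * s

# the empirical density of x should be close to 0.5 everywhere on [0, 2]
hist, _ = np.histogram(x, bins=20, range=(0.0, 2.0), density=True)
print(hist.round(2))    # entries close to 0.5
```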

ICA Algorithm

Joint distribution of sources \(s\) (with the assumption that sources are independent):

\[p(s) = \prod_{j=1}^{d}{p_s(s_j)}\]

Then transform to the density of \(x\), using \(s_j^{(i)} = w_j^Tx^{(i)}\) and the density-transformation formula above:

\[p(x) = \bigg(\prod_{j=1}^{d}{p_s(w_j^Tx)}\bigg)\cdot |W|\]

We purposely choose the sigmoid function \(g(s) = 1/(1+e^{-s})\) as the CDF of each source, so the density is \(p_s(s) = g'(s)\) (the logistic density).

Given a training set \(\{x^{(i)}; i = 1, \dots, n\}\), the log-likelihood is

\[\ell(W) = \sum_{i=1}^n \bigg(\sum_{j=1}^d{\log g'(w_j^Tx^{(i)})} + \log|W|\bigg)\]
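A direct transcription of this log-likelihood into code, as a sketch (assuming the recordings are stacked into an \(n \times d\) array `X`; the form \(\log g'(z) = -z - 2\log(1+e^{-z})\) is just the logistic log-density written stably):

```python
import numpy as np

def log_likelihood(W, X):
    """l(W) = sum_i ( sum_j log g'(w_j^T x^(i)) + log|W| ), for X of shape (n, d)."""
    Z = X @ W.T                                    # Z[i, j] = w_j^T x^(i)
    log_gprime = -Z - 2.0 * np.logaddexp(0.0, -Z)  # log g'(z) for the sigmoid g
    _, logabsdet = np.linalg.slogdet(W)            # log|W| (log of absolute determinant)
    return np.sum(log_gprime) + X.shape[0] * logabsdet
```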

Lemma:

\[\nabla_W|W| = |W|(W^{-1})^T\]

\[g'' = \frac{\partial}{\partial x}g' = \frac{\partial}{\partial x}\big(g(1-g)\big) = g'(1-2g) \]

Then compute the gradient of \(\ell(W)\):

\[\begin{aligned} \nabla_W \ell(W) &= \sum_{i=1}^n \Bigg(\begin{bmatrix}\frac{g''(w_1^Tx^{(i)})}{g'(w_1^Tx^{(i)})}\\ \vdots \\ \frac{g''(w_d^Tx^{(i)})}{g'(w_d^Tx^{(i)})}\end{bmatrix}x^{(i)T} + (W^{-1})^T\Bigg) \\ &= \sum_{i=1}^n \Bigg(\begin{bmatrix}1-2g(w_1^Tx^{(i)})\\ \vdots \\1-2g(w_d^Tx^{(i)})\end{bmatrix}x^{(i)T} + (W^{-1})^T\Bigg) \end{aligned}\]
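The batch gradient above, written as a code sketch (same assumptions as the `log_likelihood` sketch; it can be checked numerically against that function with finite differences):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_likelihood(W, X):
    """Batch gradient: sum_i ( (1 - 2 g(W x^(i))) x^(i)T + (W^{-1})^T ), X of shape (n, d)."""
    Z = X @ W.T                                  # Z[i, j] = w_j^T x^(i)
    outer_sum = (1.0 - 2.0 * sigmoid(Z)).T @ X   # sum over i of the stacked outer products
    return outer_sum + X.shape[0] * np.linalg.inv(W).T
```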

If we use stochastic gradient ascent, we update on a single training example \(x^{(i)}\) at a time, using the per-example gradient

\[\begin{bmatrix}1-2g(w_1^Tx^{(i)})\\1-2g(w_2^Tx^{(i)})\\ \vdots \\1-2g(w_d^Tx^{(i)})\end{bmatrix}x^{(i)T} + (W^{-1})^T\]

The updating rule:

\[W:= W + \alpha \Bigg( \begin{bmatrix}1-2g(w_1^Tx^{(i)})\\1-2g(w_2^Tx^{(i)})\\ \vdots \\1-2g(w_d^Tx^{(i)})\end{bmatrix}x^{(i)T} + (W^{-1})^T\Bigg)\]

Finally, after convergence, we take

\[s^{(i)} = Wx^{(i)}\]

to recover sources.
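Putting it all together, a minimal stochastic gradient ascent sketch of the whole algorithm (the learning rate, epoch count, demo sources, and mixing matrix are all illustrative assumptions, not values from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_epochs=20, seed=0):
    """Stochastic gradient ascent on l(W); X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):               # visit examples in random order
            x = X[i]
            # W := W + alpha * ( (1 - 2 g(W x)) x^T + (W^{-1})^T )
            W += alpha * (np.outer(1.0 - 2.0 * sigmoid(W @ x), x)
                          + np.linalg.inv(W).T)
    return W

# demo with two made-up non-Gaussian sources and an assumed mixing matrix
rng = np.random.default_rng(0)
n = 2000
S = np.column_stack([np.sign(rng.normal(size=n)),   # binary +/-1 source
                     rng.uniform(-1, 1, size=n)])   # uniform source
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = S @ A.T
W_hat = ica(X)
S_rec = X @ W_hat.T    # recovered sources, up to permutation and scaling
```

The recovered sources only match the true ones up to the scaling (and sign/permutation) ambiguities discussed earlier.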