#### 1. KL Divergence

Intuitively, $$\text{KL}(p||q)$$ is the information of $$q(x)$$ minus the information of $$p(x)$$, so a first guess would be:

$\text{KL}(p||q)=-\sum q(x)\log q(x) +\sum p(x) \log p(x)$

But the difference has to be taken with respect to the distribution $$p(x)$$. So the exact definition changes the first term from $$\sum q(x)\log q(x)$$ to $$\sum p(x)\log q(x)$$, and $$\text{KL}(p\|q)$$ becomes:

\begin{aligned} \text{KL}(p||q)&=-\sum p(x)\log q(x) +\sum p(x) \log p(x) \\ &=\sum p(x)\log \frac{p(x)}{q(x)} \\ &=-\sum p(x)\log \frac{q(x)}{p(x)} \end{aligned}

It has two key properties: 1) $$\text{KL}(p||q)\ge 0$$, with equality exactly when $$p=q$$; 2) in general $$\text{KL}(p||q)\neq \text{KL}(q||p)$$. So it is a divergence rather than a distance, since a distance must be symmetric.
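To make the two properties concrete, here is a minimal sketch for discrete distributions. The function name `kl_divergence` and the example distributions `p` and `q` are made up for illustration; they are not from any particular library.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p(x) = 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

kl_pq = kl_divergence(p, q)  # KL(p || q)
kl_qp = kl_divergence(q, p)  # KL(q || p)

print("KL(p||q) =", kl_pq)
print("KL(q||p) =", kl_qp)
print("KL(p||p) =", kl_divergence(p, p))
```

Both values come out non-negative and they differ from each other, while the divergence of a distribution from itself is zero.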

#### 2. Graphical Modeling

Variational techniques are mostly used in graphical models. Suppose we have two variables $$z$$ and $$x$$, where $$z$$ is the hidden variable and $$x$$ is the observation. The posterior probability is:

$p(z|x)=\frac{p(x|z)p(z)}{p(x)}=\frac{p(x,z)}{p(x)}$

Computing $$p(x)$$ is the hard part, because it requires marginalizing over $$z$$:

$p(x)=\int p(x|z)p(z)dz$

This integral is intractable in many cases. Roughly, there are two main approaches to handle it: 1) Monte Carlo sampling; 2) variational inference. The first is unbiased but has high variance, while the second has no sampling variance but gives a biased estimate.
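As a rough illustration of the Monte Carlo route, the sketch below estimates $$p(x)=\int p(x|z)p(z)dz$$ by averaging $$p(x|z)$$ over samples of $$z$$ drawn from the prior. The toy model is an assumption chosen only because its marginal is known in closed form: $$z\sim\mathcal{N}(0,1)$$ and $$x|z\sim\mathcal{N}(z,1)$$, which gives $$x\sim\mathcal{N}(0,2)$$. The helper names `normal_pdf` and `mc_marginal` are hypothetical.

```python
import math
import random

def normal_pdf(value, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, num_samples, rng):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # sample z from the prior N(0, 1)
        total += normal_pdf(x, z, 1.0)   # evaluate the likelihood p(x | z)
    return total / num_samples

rng = random.Random(0)                   # fixed seed so the run is repeatable
x = 1.0
exact = normal_pdf(x, 0.0, 2.0)          # closed-form marginal for this toy model
estimate = mc_marginal(x, 20000, rng)

print("exact p(x)   =", exact)
print("MC estimate  =", estimate)
```

The estimate is correct on average, but any single run fluctuates around the exact value, and the fluctuation only shrinks as the number of samples grows.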

#### 3. Variational Inference

The idea is to approximate $$p(z|x)$$ by another distribution $$q(z)$$, which is chosen to be tractable. So we want to minimize the KL divergence between the two:

\begin{aligned} \text{KL}(q(z)\|p(z|x))&=-\sum q(z)\log \frac{p(z|x)}{q(z)} \\ &=-\sum q(z)\log \left[\frac{p(x,z)}{p(x)}\times \frac{1}{q(z)}\right]\\ &=-\sum q(z)\log \left[\frac{p(x,z)}{q(z)} \times \frac{1}{p(x)}\right]\\ &=-\sum q(z)\left[\log \frac{p(x,z)}{q(z)}-\log p(x)\right]\\ &=-\sum q(z)\log \frac{p(x,z)}{q(z)}+\sum_z q(z)\log p(x)\end{aligned}

Since $$\sum_z q(z)\log p(x)=\log p(x)\sum_z q(z) =\log p(x)$$, we can rearrange the equation into this form:

$\log p(x)=\text{KL}(q(z)\|p(z|x))+\sum q(z)\log \frac{p(x,z)}{q(z)}$

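where $$\log p(x)$$ is a constant with respect to $$q$$. The $$\text{KL}$$ term is what we wanted to minimize in the first place. Because the two terms on the right sum to a constant, minimizing the $$\text{KL}$$ term is equivalent to maximizing the second term, which is called the variational lower bound: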

$\ell=\sum q(z)\log \frac{p(x,z)}{q(z)}$

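Since $$\text{KL}\ge 0$$, it follows that $$\ell \leq \log p(x)$$.

So by maximizing the lower bound $$\ell$$ over $$q$$, we push it up toward $$\log p(x)$$ and, equivalently, drive $$q(z)$$ toward the true posterior $$p(z|x)$$.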
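The decomposition of $$\log p(x)$$ into the $$\text{KL}$$ term plus the lower bound $$\ell$$ can be checked numerically on a small discrete example. The sketch below assumes a hypothetical joint table `p_xz` for one fixed observation $$x$$ and three values of $$z$$; the numbers and the helper names `elbo` and `kl` are made up for illustration.

```python
import math

# Hypothetical values of the joint p(x, z) for one fixed x and z = 0, 1, 2.
p_xz = [0.10, 0.25, 0.05]
p_x = sum(p_xz)                        # marginal p(x), summing out z
posterior = [v / p_x for v in p_xz]    # exact posterior p(z | x)

def elbo(q):
    """Variational lower bound: sum_z q(z) * log(p(x, z) / q(z))."""
    return sum(qz * math.log(pxz / qz) for qz, pxz in zip(q, p_xz) if qz > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.2, 0.5, 0.3]                    # an arbitrary approximating distribution
log_px = math.log(p_x)

# The identity log p(x) = KL(q || p(z|x)) + ELBO should hold up to rounding.
gap = log_px - (kl(q, posterior) + elbo(q))

print("log p(x)           =", log_px)
print("ELBO at q          =", elbo(q))
print("KL(q || posterior) =", kl(q, posterior))
print("ELBO at posterior  =", elbo(posterior))
```

For any choice of `q` the bound stays below $$\log p(x)$$, and it becomes tight when `q` equals the exact posterior, which is the case where the $$\text{KL}$$ term vanishes.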