Variational Inference Review

Idea: posit a family of densities and find the member of the family that is closest (in KL divergence) to the target density.
For statisticians: VI provides a method to approximate complicated densities. Compared with MCMC, it is easier to compute and scales to big data.

Problem of Approximate Inference

$x$: observations
$z$: latent variables
We want to estimate the posterior
$$p(z|x) = \frac{p(z,x)}{p(x)},$$
where the evidence $p(x) = \int p(z,x)\,dz$ is typically intractable.

Pick a family of distributions $\mathcal{Q}$ and approximate $p(z|x)$ by
$$q^*(z) = \argmin_{q(z) \in \mathcal{Q}}\ KL(q(z)\,\|\,p(z|x))$$

Variational Objective Function

$$\begin{aligned} KL(q(z)\,\|\,p(z|x)) &= E_q[\log q(z)] - E_q[\log p(z|x)] \\ &= E_q[\log q(z)] - E_q[\log p(z,x)] + \log p(x) \\ &= -ELBO(q) + const. \end{aligned}$$

ELBO stands for evidence lower bound; since $\log p(x)$ does not depend on $q$, minimizing the KL divergence is equivalent to maximizing the ELBO:
$$\begin{aligned} ELBO(q) &= E_q[\log p(z,x)] - E_q[\log q(z)] \\ &= E_q[\log p(z)] + E_q[\log p(x|z)] - E_q[\log q(z)] \\ &= \underbrace{E_q[\log p(x|z)]}_{expected\ likelihood} - \underbrace{KL(q(z)\,\|\,p(z))}_{penalizes\ deviation\ from\ prior} \end{aligned}$$
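Since the ELBO only involves expectations under $q$, it can be estimated by Monte Carlo whenever $q$ can be sampled. A minimal sketch; the toy model, observation, and variational family below are illustrative assumptions, not from the text:

```python
import numpy as np

def elbo_mc(log_joint, q_sample, q_logpdf, n_samples=100_000, rng=None):
    """Monte Carlo estimate of ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]."""
    rng = np.random.default_rng(rng)
    z = q_sample(rng, n_samples)                 # draw z ~ q
    return np.mean(log_joint(z) - q_logpdf(z))   # sample average of the integrand

# Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1), one observation x = 2.0,
# with variational family q(z) = N(m, s^2); the true posterior is N(1, 1/2).
x = 2.0
log_joint = lambda z: -0.5 * z**2 - 0.5 * (x - z)**2      # log p(z, x) up to constants
m, s = 1.0, np.sqrt(0.5)
q_sample = lambda rng, n: m + s * rng.standard_normal(n)
q_logpdf = lambda z: -0.5 * ((z - m) / s)**2 - np.log(s)  # log q(z) up to constants

print(elbo_mc(log_joint, q_sample, q_logpdf, rng=0))      # ELBO up to an additive const.
```

Dropping the normalizing constants only shifts the estimate by a constant, which does not affect comparisons between members of $\mathcal{Q}$.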

Mean-field Variational Family

A widely used $\mathcal{Q}$ is the mean-field variational family, in which the latent variables are mutually independent:
$$q(z) = \prod_{j=1}^m q_j(z_j)$$
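For example, with $m = 2$ and Gaussian factors (an illustrative choice, not required by the family):
$$q(z_1, z_2) = q_1(z_1)\, q_2(z_2) = N(z_1; m_1, s_1^2)\, N(z_2; m_2, s_2^2).$$
Because the factors are independent, a mean-field $q$ cannot capture posterior correlation between $z_1$ and $z_2$.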

CAVI (Coordinate Ascent Variational Inference, Bishop 2006)

Now, for
$$q^*(z) = \argmin_{q(z) \in \mathcal{Q}}\ KL(q(z)\,\|\,p(z|x)), \qquad q(z) = \prod_{j=1}^m q_j(z_j),$$

we optimize iteratively, one factor at a time:
$$q_j^*(z_j) = \argmin_{q(z) \in \mathcal{Q}}\ KL(q(z)\,\|\,p(z|x)), \quad \text{given } q_l(z_l),\ l \ne j$$

For this KL minimization, with the other latent factors fixed, the optimal $q_j$ satisfies
$$q_j^*(z_j) \propto \exp\left(E_{-j}[\log p(z_j | z_{-j}, x)]\right)$$
where $E_{-j}$ denotes expectation with respect to $\prod_{l \ne j} q_l(z_l)$.

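A minimal sketch of the CAVI loop implied by the update above; the function names are placeholders, and `update_factor` must implement $q_j^*(z_j) \propto \exp(E_{-j}[\log p(z_j | z_{-j}, x)])$ for the model at hand:

```python
def cavi(factors, update_factor, elbo, tol=1e-6, max_iter=100):
    """Generic CAVI loop: cycle through the mean-field factors until the ELBO stalls.

    factors       : list of current variational factors q_1, ..., q_m
    update_factor : update_factor(j, factors) -> new q_j, holding q_l (l != j) fixed
    elbo          : elbo(factors) -> float, monitored for convergence
    """
    prev = -float("inf")
    for _ in range(max_iter):
        for j in range(len(factors)):
            factors[j] = update_factor(j, factors)   # coordinate update of q_j
        cur = elbo(factors)
        if abs(cur - prev) < tol:   # each coordinate update can only increase the ELBO
            return factors
        prev = cur
    return factors
```

Since each update maximizes the ELBO over one factor with the rest held fixed, CAVI is coordinate ascent on the ELBO and converges to a local optimum.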

Example: Bayesian Mixture of Gaussians

The joint density of the latent variables and observations is
$$p(\mu, c, x) = p(\mu) \prod_{i=1}^n p(c_i)\, p(x_i | c_i, \mu)$$
$$\mu_k \sim N(0, \sigma^2), \quad k = 1, \cdots, K$$
$$c_i \sim \mathrm{Categorical}(1/K, \cdots, 1/K), \qquad x_i | c_i, \mu \sim N(c_i^T \mu, 1)$$
where each $c_i$ is a one-hot indicator vector over the $K$ clusters.

The variational density of $\mu, c$ is
$$q(\mu, c) = \prod_{k=1}^K q(\mu_k; m_k, s_k^2) \prod_{i=1}^n q(c_i; \psi_i)$$

where $q(\mu_k; m_k, s_k^2)$ is Gaussian and $\psi_i$ is a $K$-dimensional vector of cluster-assignment probabilities.

Fixing the factors on $\mu$ (i.e., given $m, s^2$), and absorbing additive constants in the exponent into the proportionality,
$$\begin{aligned} q^*(c_i; \psi_i) &\propto \exp\{\log p(c_i) + E[\log p(x_i|c_i,\mu); m, s^2]\} \\ &\propto \exp\{-\log K + \sum_k c_{ik}\, E[\log p(x_i|\mu_k); m_k, s_k^2]\} \\ &\propto \exp\{\sum_k c_{ik}\, E[-\tfrac{1}{2}(x_i - \mu_k)^2; m_k, s_k^2]\} \\ &\propto \exp\{\sum_k c_{ik}\,(E[\mu_k; m_k, s_k^2]\, x_i - \tfrac{1}{2} E[\mu_k^2; m_k, s_k^2])\} \end{aligned}$$
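Since the Gaussian factor gives $E[\mu_k; m_k, s_k^2] = m_k$ and $E[\mu_k^2; m_k, s_k^2] = m_k^2 + s_k^2$, the update has the closed form (normalized over $k$)
$$\psi_{ik} \propto \exp\{ m_k x_i - \tfrac{1}{2}(m_k^2 + s_k^2) \}.$$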

Given $\psi$,
$$\begin{aligned} q^*(\mu_k) &\propto \exp\{\log p(\mu_k) + \sum_{i=1}^n E[\log p(x_i|c_i,\mu); \psi_i, m_{-k}, s_{-k}^2]\} \\ &\propto \exp\{-\frac{\mu_k^2}{2\sigma^2} + \sum_{i=1}^n E[c_{ik}; \psi_i] \log p(x_i|\mu_k)\} \\ &\propto \exp\{-\frac{\mu_k^2}{2\sigma^2} - \sum_{i=1}^n \psi_{ik}\, \frac{(x_i - \mu_k)^2}{2}\} \\ &\propto \exp\{(\sum_{i=1}^n \psi_{ik} x_i)\,\mu_k - (\frac{1}{2\sigma^2} + \sum_{i=1}^n \frac{\psi_{ik}}{2})\,\mu_k^2\} \\ &\propto \exp\{-\frac{(\mu_k - m_k)^2}{2 s_k^2}\} \end{aligned}$$

where
$$m_k = \frac{\sum_{i=1}^n \psi_{ik} x_i}{1/\sigma^2 + \sum_{i=1}^n \psi_{ik}}, \qquad s_k^2 = \frac{1}{1/\sigma^2 + \sum_{i=1}^n \psi_{ik}}$$
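Putting the two updates together gives CAVI for this mixture model. A minimal runnable sketch; the function names, hyperparameters, and toy data are my own illustrative choices:

```python
import numpy as np

def update_psi(x, m, s2):
    """psi[i, k] ∝ exp(m_k x_i - (m_k^2 + s2_k) / 2), normalized over k."""
    logits = np.outer(x, m) - 0.5 * (m**2 + s2)   # shape (n, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
    psi = np.exp(logits)
    return psi / psi.sum(axis=1, keepdims=True)

def update_mu(x, psi, sigma2):
    """m_k = (sum_i psi_ik x_i) / (1/sigma^2 + sum_i psi_ik); s2_k = 1 / (same denominator)."""
    denom = 1.0 / sigma2 + psi.sum(axis=0)        # shape (K,)
    return (psi.T @ x) / denom, 1.0 / denom

def cavi_gmm(x, K, sigma2=10.0, n_iter=50, seed=None):
    rng = np.random.default_rng(seed)
    m, s2 = rng.standard_normal(K), np.ones(K)    # random initialization of q(mu_k)
    for _ in range(n_iter):
        psi = update_psi(x, m, s2)                # update assignment factors q(c_i)
        m, s2 = update_mu(x, psi, sigma2)         # update Gaussian factors q(mu_k)
    return m, s2, psi

# Toy data: three well-separated unit-variance clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(mu, 1.0, 100) for mu in (-5.0, 0.0, 5.0)])
m, s2, psi = cavi_gmm(x, K=3, seed=1)
print(np.sort(m))   # variational means should land near [-5, 0, 5]
```

Like all CAVI runs, this converges only to a local optimum, so in practice one restarts from several random initializations and keeps the run with the highest ELBO.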
