This is about basic machine learning, which I do not understand clearly.
I have 2 Gaussian models $G_1$ and $G_2$ and the list of data below.
(1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23)
I do not know which numbers were generated by which model $G_i$, but I want to use Expectation Maximization for clustering.
So, what are the hidden variables, and how do I perform the Expectation and Maximization steps?
Could you show some calculation or short Python code for the likelihood? (Perhaps 2 iterations.)
I got the basics of the EM algorithm from here, but I cannot apply it to this question: http://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html
Asked By : Karyo
Answered By : Nicholas Mancuso
$\newcommand{\I}{\mathbb{I}}\newcommand{\E}{\mathbb{E}}$Given our observed variables $x = \{x_i\}$, hidden variables $z = \{z_i\}$, and distribution parameters $\theta = \{(\mu_j, \sigma_j)\}$ (where $\sigma_j$ denotes the variance of the $j$th Gaussian), let's define the complete-data log-likelihood as
$$
\begin{align*}
\ell_c(\theta) &= \sum_{i=1}^n \log \Pr[x_i, z_i|\theta] = \sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log (\Pr[x_i|\theta_j] \Pr[z_i=j|\theta]),
\end{align*}
$$
where $\I[\cdot]$ is the indicator function, $\Pr[x_i|\theta_j]$ is the probability of observing $x_i$ under the $j$th Gaussian, and $\Pr[z_i=j|\theta]$ is the mixing weight of the $j$th distribution (i.e., the probability that the hidden variable takes value $j$).

This is the model we would have if we somehow knew which Gaussian generated each data point. We obviously don't, so we instead compute the expected value of this complete-data log-likelihood: what is the expected outcome of the full-knowledge formulation? To do that, define the function $Q(\theta^t|\theta^{t-1})$ as
$$
\begin{align*}
Q(\theta^t|\theta^{t-1}) &= \E\left[\sum_{i=1}^n \sum_{j=1}^k \I[z_i = j] \log \Pr[x_i, z_i = j| \theta]\right]\\
&= \sum_{i=1}^n \sum_{j=1}^k \E[\I[z_i = j]] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \Pr[z_i = j | x_i] \log \Pr[x_i, z_i = j| \theta]\\
&= \sum_{i=1}^n \sum_{j=1}^k \underbrace{\Pr[z_i = j | x_i]}_{r_{ij}} \log (\Pr[x_i|\theta_j]\underbrace{\Pr[z_i = j| \theta]}_{\pi_j})\\
&= \sum_{i=1}^n \sum_{j=1}^k r_{ij} \log \Pr[x_i|\theta_j] + \sum_{i=1}^n \sum_{j=1}^k r_{ij} \log \pi_j.
\end{align*}
$$

Computing $r_{ij}$ for the E-step is a simple application of Bayes' theorem:
$$ r_{ij} = \frac{\pi_j \Pr[x_i|\theta_j^{t-1}]}{\sum_{j'=1}^k \pi_{j'} \Pr[x_i|\theta_{j'}^{t-1}]}.$$

For the M-step we would like to compute the mixing weights $\pi$ and parameters $\theta$ that maximize $Q$. Luckily for us, the final formulation above shows we can optimize each independently. Taking the derivative of $Q$ with respect to $\pi$ under the Lagrange constraint $\sum_j \pi_j = 1$ gives
$$ \pi_j = \frac{\sum_{i=1}^n r_{ij}}{n}. $$
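To make the E-step and the mixing-weight update concrete, here is a minimal NumPy sketch. The function names, the array layout, and the use of `var` for the variance $\sigma_j$ are my own conventions, not from the original answer:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """N(x | mu, var) density; var is the variance, matching the text's
    use of sigma_j for the variance of the j-th component."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(x, pi, mu, var):
    """Responsibilities r[i, j] = Pr[z_i = j | x_i], by Bayes' theorem.

    x: (n,) data; pi, mu, var: (k,) mixing weights, means, variances.
    """
    # Unnormalized posterior pi_j * N(x_i | mu_j, var_j), shape (n, k)
    r = pi * gaussian_pdf(x[:, None], mu, var)
    return r / r.sum(axis=1, keepdims=True)

def update_weights(r):
    """M-step update for the mixing weights: pi_j = (1/n) * sum_i r_ij."""
    return r.mean(axis=0)
```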
To optimize $Q$ with respect to $(\mu_j, \sigma_j)$, let's look at the part that depends on them (ignoring additive constants):
$$ \sum_{i=1}^n r_{ij} \log \Pr[x_i|\theta_j] = -\frac{1}{2}\sum_{i=1}^n r_{ij} \left(\log \sigma_j + \frac{(x_i - \mu_j)^2}{\sigma_j}\right). $$
Taking the derivative of $Q$ with respect to each parameter and setting it to zero, we finally have
$$
\begin{align*}
\mu_j &= \frac{\sum_{i=1}^n r_{ij} x_i}{\sum_{i=1}^n r_{ij}}, &\sigma_j &= \frac{\sum_{i=1}^n r_{ij} x_i^2}{\sum_{i=1}^n r_{ij}} - \mu_j^2.
\end{align*}
$$
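And a matching sketch of the remaining M-step updates (again, the names are mine; `var` is the variance, which is what $\sigma_j$ denotes above):

```python
import numpy as np

def update_params(x, r):
    """M-step updates for each component's mean and variance:

    mu_j  = sum_i r_ij x_i   / sum_i r_ij
    var_j = sum_i r_ij x_i^2 / sum_i r_ij - mu_j^2
    """
    weights = r.sum(axis=0)                          # sum_i r_ij, shape (k,)
    mu = (r * x[:, None]).sum(axis=0) / weights
    var = (r * x[:, None] ** 2).sum(axis=0) / weights - mu ** 2
    return mu, var
```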
In your case, $x = (1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23)$, $z = (z_1,\dotsc, z_{13})$, $k=2$, and $\theta = ((\mu_1, \sigma_1),(\mu_2, \sigma_2))$. EM begins by randomly initializing $\pi = (\pi_1, \pi_2)$ and $\theta$, and then alternates between the E-step and M-step. Once the parameters haven't changed much (or, better, once the observed-data log-likelihood hasn't changed much), you can stop and output your current values $\theta^t$ and $\pi$.
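To show roughly what two iterations look like on this data (as you asked), here is a self-contained sketch; the starting values for $\pi$, $\mu$, and the variances are hand-picked assumptions rather than random, purely so the run is reproducible:

```python
import numpy as np

# The data from the question.
x = np.array([1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23], dtype=float)

# Hand-picked starting values (ordinarily these would be random);
# var plays the role of sigma_j (the variance) in the text.
pi = np.array([0.5, 0.5])
mu = np.array([5.0, 15.0])
var = np.array([10.0, 10.0])

for t in range(2):                      # two iterations, as requested
    # E-step: responsibilities r[i, j] = Pr[z_i = j | x_i]
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate mixing weights, means, variances
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * x[:, None] ** 2).sum(axis=0) / nk - mu ** 2

    print(t, pi, mu, var)
```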
One benefit of using $|\ell(\theta^t) - \ell(\theta^{t-1})|$ rather than $||\theta^t - \theta^{t-1}||$ as the stopping condition is bug checking. EM guarantees that the likelihood increases monotonically; therefore, you can simply check each iteration that the likelihood is greater than or equal to the previous iteration's. If it isn't, something is wrong.
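Here the observed-data log-likelihood is $\ell(\theta) = \sum_{i=1}^n \log \sum_{j=1}^k \pi_j \Pr[x_i|\theta_j]$. A small sketch of computing it and of the monotonicity check (the function name is mine):

```python
import numpy as np

def log_likelihood(x, pi, mu, var):
    """Observed-data log-likelihood: sum_i log sum_j pi_j * N(x_i | mu_j, var_j)."""
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log((pi * dens).sum(axis=1)).sum()

# Inside the EM loop, a cheap sanity check that the likelihood never decreases:
#   ll = log_likelihood(x, pi, mu, var)
#   assert ll >= prev_ll - 1e-9, "likelihood decreased, something is wrong"
#   prev_ll = ll
```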
Finally, one important thing to note is that computing an MLE for a mixture of Gaussians tends to overfit. One way to overcome this is to incorporate priors for $\pi$ and $\theta$ and compute the MAP estimate instead. This only changes the M-step; the exact derivation depends on which prior you choose.
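For example (one common choice, not the only one), a symmetric Dirichlet$(\alpha, \dotsc, \alpha)$ prior on $\pi$ changes only the mixing-weight update, to
$$ \pi_j = \frac{\sum_{i=1}^n r_{ij} + \alpha - 1}{n + k(\alpha - 1)}, $$
which for $\alpha > 1$ pulls the weights toward uniform; conjugate priors on $(\mu_j, \sigma_j)$ add similar pseudo-count terms to the mean and variance updates.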
EDIT: Since it is clear you are more interested in a working example than in the derivation, here is Python code that runs EM for a basic GMM.
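The code the original answer pointed to is not reproduced in this post, so the following is a minimal stand-in, assuming the same interface `em.em(x, k)` and the same return order (mixing weights, means, standard deviations) as the session below. The initialization, convergence tolerance, and the helper name `_densities` are my own choices; since the initialization is random, a run of this sketch will generally not reproduce the exact numbers shown, which come from the original code.

```python
"""Minimal EM for a 1-D Gaussian mixture model (save as em.py)."""
import numpy as np

def _densities(x, mu, var):
    """N(x_i | mu_j, var_j) for every data point / component pair, shape (n, k)."""
    return np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em(x, k, max_iter=1000, tol=1e-8, seed=None):
    """Fit a k-component GMM to 1-D data x; returns (weights, means, stdevs)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = np.random.default_rng(seed)

    # Random initialization: uniform weights, means drawn from the data,
    # variances set to the overall data variance.
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[i, j] = Pr[z_i = j | x_i]
        weighted = pi * _densities(x, mu, var)
        ll = np.log(weighted.sum(axis=1)).sum()
        r = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixing weights, means, variances
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * x[:, None] ** 2).sum(axis=0) / nk - mu ** 2

        # Stop once the observed-data log-likelihood has converged.
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return pi, mu, np.sqrt(var)
```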
Running this in ipython gives:

```
In [1]: import em

In [2]: x = (1, 3, 4, 5, 7, 8, 9, 13, 14, 15, 16, 18, 23)

In [3]: em.em(x, 2)
Out[3]:
(array([ 0.49645528,  0.50354472]),
 array([ 15.89753149,  5.10207907]),
 array([ 3.8611819 ,  2.65861623]))
```
where the outputs are the mixing weights, the mean of each Gaussian, and the standard deviation of each Gaussian.
Question Source : http://cs.stackexchange.com/questions/23673