Let us consider observable random variables $\mathbf{x} \in \mathcal{X}$ and latent random variables $\mathbf{z} \in \mathcal{Z}$. We are given an empirical distribution $p_{\mathrm{data}} (\mathbf{x})$. We use a latent variable model with parameters $\vartheta = \{\theta, \lambda\}$, namely: \begin{equation} p_{\mathrm{model}} (\mathbf{x}) = \int p_{\theta}(\mathbf{x} | \mathbf{z})\ p_{\lambda}(\mathbf{z}) \mathrm{d}\mathbf{z} , \end{equation} which can be interpreted as a mixture model. For $\mathcal{Z} \equiv \{0, 1, \ldots, K-1\}$, we get a finite mixture model, and for $\mathcal{Z} \equiv \mathbb{R}^{K}$ we get an infinite mixture model.
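To make the mixture interpretation concrete, here is a minimal sketch (in PyTorch) of the continuous case: the marginal $p_{\mathrm{model}}(\mathbf{x})$ is approximated by Monte Carlo, averaging the conditional likelihood over components $\mathbf{z}$ sampled from the prior. The decoder architecture, dimensions, and the unit-variance Gaussian likelihood below are illustrative assumptions, not choices made in the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_dim, z_dim, n_mc = 5, 2, 10_000                 # illustrative sizes

# A toy decoder p_theta(x|z): a unit-variance Gaussian whose mean comes from a small MLP.
decoder_net = nn.Sequential(nn.Linear(z_dim, 32), nn.Tanh(), nn.Linear(32, x_dim))

def log_p_x_given_z(x, z):
    return torch.distributions.Normal(decoder_net(z), 1.0).log_prob(x).sum(-1)

# Standard Gaussian prior p_lambda(z) over Z = R^K (the "infinite mixture" case).
prior = torch.distributions.Normal(torch.zeros(z_dim), torch.ones(z_dim))

def log_p_model(x, n_samples=n_mc):
    # log p(x) ~= log (1/S) sum_s p_theta(x | z_s), with z_s ~ p_lambda(z):
    # an average over Monte Carlo-sampled mixture components.
    z = prior.sample((n_samples,))
    log_px_z = log_p_x_given_z(x.unsqueeze(0), z)                     # (n_samples,)
    return torch.logsumexp(log_px_z, dim=0) - torch.log(torch.tensor(float(n_samples)))

x = torch.randn(x_dim)                            # a dummy observation
print(f"MC estimate of ln p_model(x): {log_p_model(x).item():.3f}")
```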
Further, let us assume that we are given data $\mathcal{D}$, or, equivalently, the empirical distribution $p_{\mathrm{data}}(\mathbf{x})$, i.e., $\mathcal{D} \sim p_{\mathrm{data}}(\mathbf{x})$. Later on, we will denote $p_{\mathrm{data}}(\mathbf{x})$ by $q(\mathbf{x})$ for simplicity. Typically, learning $p_{\mathrm{model}} (\mathbf{x})$ corresponds to maximizing the log-likelihood function or, equivalently, minimizing the Kullback-Leibler divergence between $p_{\mathrm{data}}$ and $p_{\mathrm{model}}$:
\begin{align}
D_{\mathrm{KL}}\left[p_{\mathrm{data}} \| p_{\mathrm{model}} \right] &= \sum_{\mathbf{x}} p_{\mathrm{data}}(\mathbf{x}) \ln \frac{p_{\mathrm{data}}(\mathbf{x})}{p_{\mathrm{model}}(\mathbf{x})} \\
&= -\mathbb{H}_{p_{\mathrm{data}}}\left[ \mathbf{x} \right] - \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})}\left[ \ln p_{\mathrm{model}}(\mathbf{x}) \right] \\
&= -\mathbb{H}_{p_{\mathrm{data}}}\left[ \mathbf{x} \right] + \mathbb{C}\mathbb{E}\left[p_{\mathrm{data}}(\mathbf{x}) \| p_{\mathrm{model}}(\mathbf{x}) \right].
\end{align}
In other words, since the entropy of the data distribution does not depend on the model parameters, minimizing the Kullback-Leibler divergence with respect to the model parameters corresponds to minimizing the cross-entropy between the data (empirical) distribution and the model, or, equivalently, maximizing the expected log-likelihood.
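As a quick numerical sanity check of this decomposition, the toy snippet below (with arbitrary probability vectors) verifies that $D_{\mathrm{KL}} = -\mathbb{H}_{p_{\mathrm{data}}} + \mathbb{C}\mathbb{E}$; because $\mathbb{H}_{p_{\mathrm{data}}}\left[ \mathbf{x} \right]$ is a constant with respect to the model, minimizing the cross-entropy and minimizing the KL divergence are the same optimization problem.

```python
import torch

# Arbitrary toy distributions over 4 outcomes.
p_data  = torch.tensor([0.1, 0.2, 0.3, 0.4])
p_model = torch.tensor([0.25, 0.25, 0.25, 0.25])

entropy       = -(p_data * p_data.log()).sum()             # H[p_data]
cross_entropy = -(p_data * p_model.log()).sum()            # CE[p_data || p_model]
kl            = (p_data * (p_data / p_model).log()).sum()  # KL[p_data || p_model]

# KL = -H + CE; H does not depend on p_model, so minimizing CE minimizes KL.
print(kl.item(), (-entropy + cross_entropy).item())        # the two values coincide
```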
Since our model is a latent variable model, calculating the cross-entropy term becomes troublesome. One possible approach to learning such a model is to use (amortized) variational inference. Here, we consider an amortized family of variational posteriors with parameters $\phi$, $q_{\phi}(\mathbf{z} | \mathbf{x})$. Then, we can derive a lower bound on the expected log-likelihood (i.e., the negative cross-entropy) that we want to maximize, namely:
\begin{align}
\mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} (\mathbf{x})}\left[ \ln p_{\mathrm{model}} (\mathbf{x}) \right] &= \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} (\mathbf{x})}\left[ \ln \int p_{\theta}(\mathbf{x} | \mathbf{z})\ p_{\lambda}(\mathbf{z}) \mathrm{d}\mathbf{z} \right] \\
&\geq \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} (\mathbf{x})}\left[ \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) + \ln p_{\lambda}(\mathbf{z}) - \ln q_{\phi}(\mathbf{z} | \mathbf{x}) \right] \right] \\
&= \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})p_{\mathrm{data}} (\mathbf{x})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) \right] + \mathbb{E}_{\mathbf{z} \sim q_{\phi}(\mathbf{z})}\left[\ln p_{\lambda}(\mathbf{z})\right] \notag\\
&\quad - \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})p_{\mathrm{data}} (\mathbf{x})}\left[\ln q_{\phi}(\mathbf{z} | \mathbf{x})\right] \\
&= \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{x},\mathbf{z})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) \right] - \mathbb{C}\mathbb{E}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right] + \mathbb{H}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{z} | \mathbf{x}\right] \\
&= \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{x},\mathbf{z})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) \right] - \mathbb{C}\mathbb{E}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right] + \mathbb{H}_{q_{\phi}(\mathbf{z})}\left[\mathbf{z} \right] - \mathbb{I}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{x} ; \mathbf{z} \right] \\
&= \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{x},\mathbf{z})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) \right] - D_{\mathrm{KL}}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right] - \mathbb{I}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{x} ; \mathbf{z} \right] .
\end{align}
In the above, we used a few facts:
(i) Jensen's inequality: $\ln \mathbb{E}_{p}[f(\mathbf{x})] \geq \mathbb{E}_{p}[\ln f(\mathbf{x})]$,
(ii) $\mathbb{I}[\mathbf{x} ; \mathbf{z}] = \mathbb{H}[\mathbf{z}] - \mathbb{H}[\mathbf{z}|\mathbf{x}]$,
(iii) $\mathbb{C}\mathbb{E}[q(\mathbf{z}) \| p(\mathbf{z})] = \mathbb{H}_{q(\mathbf{z})}[\mathbf{z}] + D_{\mathrm{KL}}[q(\mathbf{z}) \| p(\mathbf{z})]$,
(iv) the aggregated posterior: $q_{\phi}(\mathbf{z}) = \mathbb{E}_{\mathbf{x} \sim q(\mathbf{x})} \left[ q_{\phi}(\mathbf{z} | \mathbf{x}) \right] $.
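Fact (iv) deserves a short, concrete illustration: for a finite dataset, the aggregated posterior is a uniform mixture of the per-example posteriors, and in practice it is typically approximated over a (mini)batch. Below is a minimal sketch of that approximation with a made-up Gaussian encoder; the network, dimensions, and batch size are illustrative assumptions rather than anything prescribed in the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_dim, z_dim, batch_size = 5, 2, 128              # illustrative sizes

# A toy Gaussian encoder q_phi(z|x) with mean and log-variance heads.
encoder_net = nn.Linear(x_dim, 2 * z_dim)

def q_z_given_x(x):
    mu, log_var = encoder_net(x).chunk(2, dim=-1)
    return torch.distributions.Normal(mu, (0.5 * log_var).exp())

def log_q_aggregated(z, x_batch):
    # q_phi(z) = E_{x ~ q(x)}[ q_phi(z|x) ] ~= (1/N) sum_n q_phi(z | x_n),
    # i.e., a uniform mixture of the per-example posteriors in the batch.
    log_qzx = q_z_given_x(x_batch).log_prob(z.unsqueeze(1)).sum(-1)   # (n_z, N)
    n = x_batch.shape[0]
    return torch.logsumexp(log_qzx, dim=1) - torch.log(torch.tensor(float(n)))

x_batch = torch.randn(batch_size, x_dim)          # dummy "data"
z = q_z_given_x(x_batch[:4]).sample()             # a few posterior samples
print(log_q_aggregated(z, x_batch))               # their log-density under q_phi(z)
```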
Thus, we can write the ELBO as follows:
\begin{equation}
\mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}} (\mathbf{x})}\left[ \ln p_{\mathrm{model}} (\mathbf{x}) \right] \geq \mathbb{E}_{\mathbf{x},\mathbf{z} \sim q_{\phi}(\mathbf{x},\mathbf{z})}\left[ \ln p_{\theta}(\mathbf{x} | \mathbf{z}) \right] - D_{\mathrm{KL}}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right] - \mathbb{I}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{x} ; \mathbf{z} \right] .
\end{equation}
This form of the ELBO is not necessarily the form that is optimized in practice. However, it gives us very interesting insight into how the latent variable model is trained by maximizing the ELBO. Let us consider each component separately.
The first component of the ELBO is the negative reconstruction error. In other words, $\mathbf{z}$ is sampled from the variational posterior for a given $\mathbf{x}$, and then $\mathbf{x}$ is stochastically reconstructed by $p_{\theta}(\mathbf{x}|\mathbf{z})$. That is, the conditional likelihood is used to calculate the reconstruction error. This term pushes $q_\phi$ to be as peaky as possible: if $q_{\phi}$ is not concentrated, it is hard to learn a stochastic mapping from multiple $\mathbf{z}$'s to a single $\mathbf{x}$.
The second component of the ELBO, $D_{\mathrm{KL}}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right]$, measures the mismatch between the aggregated posterior and the marginal over $\mathbf{z}$ (a.k.a. the \textit{prior}). Since we maximize the ELBO, and the term $D_{\mathrm{KL}}\left[ q_{\phi}(\mathbf{z}) \| p_{\lambda}(\mathbf{z}) \right]$ enters with a negative sign, we want to minimize it. Thus, the ELBO tells us that the difference between $q_{\phi}(\mathbf{z})$ and $p_{\lambda}(\mathbf{z})$ should be as small as possible. This makes perfect sense because we want the aggregated posterior to assign probability mass to the same regions as the prior.
The last component of the ELBO, $\mathbb{I}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{x} ; \mathbf{z} \right]$, is the mutual information between $\mathbf{x}$ and $\mathbf{z}$ under the joint defined by the variational posterior and the empirical distribution. This term enters with a negative sign, hence the ELBO indicates that this mutual information should be minimized. This is a very puzzling result: taken literally, it says there should be no stochastic dependency between $\mathbf{z}$ and $\mathbf{x}$. Commonly, $\mathbf{z}$ is interpreted as a representation of $\mathbf{x}$; however, optimizing the ELBO leads to a completely different outcome! $\mathbb{I}_{q_{\phi}(\mathbf{x},\mathbf{z})}\left[\mathbf{x} ; \mathbf{z} \right]$ is a cumbersome component, and it may explain why training latent variable models with variational inference is typically hard.
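To make the three components more tangible, here is a rough Monte Carlo sketch of how one could estimate them for a toy Gaussian encoder and decoder. It reuses the batch-mixture approximation of $q_{\phi}(\mathbf{z})$ from the earlier sketch; the networks, dimensions, and single-sample estimators are illustrative assumptions, and the resulting KL and mutual-information values are crude batch approximations rather than the exact quantities.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_dim, z_dim, N = 5, 2, 256                      # illustrative sizes

# Toy Gaussian encoder q_phi(z|x) and decoder p_theta(x|z) (unit-variance likelihood).
encoder_net = nn.Linear(x_dim, 2 * z_dim)
decoder_net = nn.Linear(z_dim, x_dim)
prior = torch.distributions.Normal(torch.zeros(z_dim), torch.ones(z_dim))

def q_z_given_x(x):
    mu, log_var = encoder_net(x).chunk(2, dim=-1)
    return torch.distributions.Normal(mu, (0.5 * log_var).exp())

def log_p_x_given_z(x, z):
    return torch.distributions.Normal(decoder_net(z), 1.0).log_prob(x).sum(-1)

def log_q_aggregated(z, x_batch):
    # q_phi(z) approximated as a uniform mixture of the per-example posteriors.
    log_qzx = q_z_given_x(x_batch).log_prob(z.unsqueeze(1)).sum(-1)   # (N, N)
    n = x_batch.shape[0]
    return torch.logsumexp(log_qzx, dim=1) - torch.log(torch.tensor(float(n)))

x = torch.randn(N, x_dim)                        # dummy "data" batch
q = q_z_given_x(x)
z = q.rsample()                                  # one z per x, i.e., (x, z) ~ q_phi(x, z)

log_qzx = q.log_prob(z).sum(-1)                  # log q_phi(z|x)
log_qz  = log_q_aggregated(z, x)                 # log q_phi(z), batch approximation
log_pz  = prior.log_prob(z).sum(-1)              # log p_lambda(z)

recon  = log_p_x_given_z(x, z).mean()            # E[ log p_theta(x|z) ]
kl_agg = (log_qz - log_pz).mean()                # ~ D_KL[ q_phi(z) || p_lambda(z) ]
mi     = (log_qzx - log_qz).mean()               # ~ I[ x ; z ] under q_phi(x, z)

elbo = recon - kl_agg - mi
print(f"recon={recon.item():.2f}  KL={kl_agg.item():.2f}  MI={mi.item():.2f}  ELBO={elbo.item():.2f}")
```

Note that `recon - kl_agg - mi` recovers exactly the form of the ELBO above, so a sketch like this could be used to monitor how much of the bound is "spent" on the mutual information term during training.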
In conclusion, the ELBO pushes the model to fit the aggregated posterior to the prior and, on the other hand, to learn no stochastic dependency between $\mathbf{x}$ and $\mathbf{z}$, while maintaining a one-to-one mapping from $\mathbf{z}$ to $\mathbf{x}$. This contradiction may make training the model difficult. As reported in the literature (e.g., (Alemi et al., 2018)), the choice of parameterizations of the distributions plays a crucial role and can complicate the training. For instance, taking a very powerful parameterization of the conditional likelihood $p_{\theta}(\mathbf{x}|\mathbf{z})$, e.g., a deep autoregressive model, gives a model that completely disregards $\mathbf{z}$.
It seems that training a latent variable model with variational inference could be challenging. Moreover, it is rather apparent that treating latent variables as a data representation and using variational inference for this purpose make little sense. The mutual information term in the ELBO clearly indicates that the goal is the very opposite: make the stochastic dependency between $\mathbf{x}$ and $\mathbf{z}$ as small as possible! I think we can all agree that this statement has clearly nothing in common with representation learning.
At this point, one could question whether there is any sense in using latent variables with or even without variational inference. However, we know that models like (hierarchical) Variational Auto-Encoders or diffusion-based models do work. Therefore, instead, we should ask the following questions:
We claim here that there is no problem with latent variable models and variational inference. Hence, the ELBO itself is not the issue, contrary to what some may argue. The problem is the family of variational posteriors, which may lead to a (nearly) perfect minimization of the mutual information. As a consequence, one gets a model that does not take advantage of the potential capacity of $\mathbf{z}$ and effectively learns an unconditional model, with $p_{\theta}(\mathbf{x}|\mathbf{z})$ treating $\mathbf{z}$ as noise. Therefore, we formulate the following hypothesis:
We will look into these questions and the hypothesis in the (near) future. Please stay tuned!
(Alemi et al., 2018) Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., & Murphy, K. (2018, July). Fixing a broken ELBO. In International Conference on Machine Learning (pp. 159-168). PMLR.