A Crash Course in Diffusion-based Models¶

An introduction to flow matching¶

by Jakub Tomczak

Agenda¶

  1. Continuous Normalizing Flows
  2. Flow Matching
  3. An example
  4. Appendix

A different perspective on generative models with ODEs: Continuous Normalizing Flows (CNFs)¶

About ODEs, again¶

Previously:

  • Generative models can be defined through Stochastic Differential Equations (SDEs) or, equivalently, corresponding Probability Flow Ordinary Differential Equations (PF-ODEs).
  • Solving SDEs/ODEs using a numerical solver, like the backward Euler method, results in an iterative generative procedure that turns noise into data.

Question

Do we need to formulate an SDE and its PF-ODE equivalent, or can we take any ODE to define a generative model?

Recall the definition of an ODE:

\begin{equation} \frac{\mathrm{d} \mathbf{x}_t}{\mathrm{d} t} = v(\mathbf{x}_{t}, t) , \end{equation}

where the vector field, $v(\mathbf{x}_{t}, t)$, defines the dynamics. Parameterizing the vector field with a neural network with weights $\theta$, $v_{\theta}(\mathbf{x}_{t}, t)$, leads to a so-called neural ODE (Chen et al., 2018). If we denote by $\mathbf{x}_{0}$ the initial condition for this neural ODE, e.g., noise, then by solving it, i.e., integrating over time $t$, we get the output (e.g., data):

\begin{equation} \mathbf{x}_{1} = \mathbf{x}_{0} + \int_{0}^{1} v_{\theta}(\mathbf{x}_{t}, t) \mathrm{d} t . \end{equation}
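To make this concrete, here is a minimal PyTorch-style sketch of a neural ODE: a small network for $v_{\theta}(\mathbf{x}_t, t)$ and a forward Euler integrator mapping $\mathbf{x}_0$ to $\mathbf{x}_1$. The names `VectorField` and `integrate` are assumptions of this tutorial, not part of any specific library.

```python
# A minimal neural ODE sketch (assumed names: VectorField, integrate).
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """A small MLP for v_theta(x, t); time t is appended as an extra input."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t = t.expand(x.shape[0], 1)                # broadcast time over the batch
        return self.net(torch.cat([x, t], dim=-1))

@torch.no_grad()
def integrate(v_theta: VectorField, x0: torch.Tensor, steps: int = 100) -> torch.Tensor:
    """Forward Euler: x_{t+dt} = x_t + v_theta(x_t, t) * dt, from t=0 to t=1."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = torch.full((x.shape[0], 1), k * dt)
        x = x + v_theta(x, t) * dt
    return x
```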

From the continuity equation (conservation of mass) to the instantaneous change of variables¶

Question Starting with a known distribution $\mathbf{x}_{0} \sim \pi(\mathbf{x})$, like the standard Gaussian, and then solving the ODE, what is the final distribution $p_1(\mathbf{x})$?

Answer We can express this induced distribution analytically using the continuity equation.

Let's imagine that probability is a mass like water.

Now let's visualize a pipe with the same cross-section along its entire length in which our mass (e.g., water) flows. At each moment of time, we have some flux of this mass, $f_t$, i.e., our (probability) mass is moved according to the vector field (or velocity): $f_t(\mathbf{x}_t) = p_t(\mathbf{x}_t) v(\mathbf{x}_{t}, t)$.

Since we talk about probability mass (or water flowing through the pipe of the same volume of cross-sections everywhere), the mass is conserved, i.e., no new mass (water) (dis)appears (no leaking or pouring in).

Mathematically, it means that the change of the mass over time, $\frac{\partial p_t(\mathbf{x}_t)}{\partial t}$, plus the change of the flux in all directions (a.k.a. the divergence of the flux) equals zero (i.e., the mass is conserved):

\begin{equation} \frac{\partial p_t(\mathbf{x}_t)}{\partial t} + \mathrm{div} \left( p_t(\mathbf{x}_t) v(\mathbf{x}_{t}, t) \right) = 0, \end{equation}

where $\mathrm{div} \left(\cdot\right)$ is the divergence defined as follows: $\mathrm{div} \left( V(x_1, \ldots, x_D) \right) = \sum_{d=1}^{D} \frac{\partial V_d(x_1, \ldots, x_D)}{\partial x_d}$, i.e., the sum of first derivatives of $V$ over all variables separately.

It turns out that applying identities of vector calculus and the properties of the divergence allows us to write the continuity equation using the logarithm of the probability distribution (a.k.a. the instantaneous change of variables (Chen et al., 2018)):

\begin{equation} \frac{\mathrm{d} \ln p(\mathbf{x}_t)}{\mathrm{d} t} + \mathrm{Tr}\left( \frac{\partial v(\mathbf{x}_{t}, t)}{\partial \mathbf{x}_t} \right) = 0 . \end{equation}

Then, by integrating across time, we can compute the total change in log-density as follows:

\begin{equation} \ln p(\mathbf{x}_1) = \ln \pi(\mathbf{x}_0) - \int_0^1 \mathrm{Tr}\left( \frac{\partial v(\mathbf{x}_{t}, t)}{\partial \mathbf{x}_t} \right) \mathrm{d} t . \end{equation}

Why do we bother to calculate everything as log-probabilities? Because the last line is a continuous version of the change of variables used for normalizing flows! Here, we have the integral over time of the trace of the Jacobian matrix instead of the sum of the log-determinants of the Jacobian matrices. Therefore, training neural ODEs is similar to training normalizing flows but with continuous time. As a result, neural ODEs in this context are referred to as continuous normalizing flows (CNFs).

Calculating the log-likelihood for CNFs. Unlike in discrete-time normalizing flows, we do not require invertibility of $v$; thus, for a given datapoint $\mathbf{x}_1$, we typically cannot simply invert the transformation to obtain $\mathbf{x}_0$. However, under pretty mild conditions (namely, $v$ and its first derivative are Lipschitz continuous, e.g., for a neural net with Lipschitz continuous activation functions like SELU or SiLU, among others), we can uniquely solve the following problem (Grathwohl et al., 2018):

\begin{equation} \begin{bmatrix} \mathbf{x}_0 \\ \ln p_1(\mathbf{x}_1) - \ln \pi(\mathbf{x}_0) \end{bmatrix} = \int_{1}^{0} \begin{bmatrix} v_{\theta}(\mathbf{x}_{t}, t) \\ -\mathrm{Tr}\left( \frac{\partial v_{\theta}(\mathbf{x}_{t}, t)}{\partial \mathbf{x}_t} \right) \end{bmatrix} \mathrm{d} t , \end{equation}

with the following initial conditions:

\begin{equation} \begin{bmatrix} \mathbf{x}_1 \\ \ln p_1(\mathbf{x}_{data}) - \ln p_{1}(\mathbf{x}_1) \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{data} \\ 0 \end{bmatrix} , \end{equation}

in which $\mathbf{x}_1$ is a datapoint $\mathbf{x}_{data}$, and the difference in log-probability is zero. Note that we solve the problem in the reverse order, namely, from data $\mathbf{x}_1$ to noise $\mathbf{x}_0$.

To sum up, we need to do the following:

  1. Take a datapoint $\mathbf{x}_1 = \mathbf{x}_{data}$.
  2. Solve the ODE above by applying a numerical solver to find $\mathbf{x}_0$ while keeping track of the trace over time.
  3. Calculate the log-likelihood by adding $\ln \pi(\mathbf{x}_0)$ to the sum of negative traces $- \int_0^1 \mathrm{Tr}\left( \frac{\partial v_{\theta}(\mathbf{x}_{t}, t)}{\partial \mathbf{x}_t} \right) \mathrm{d} t$.

Now we can backpropagate through a solver, but it is very expensive.

Hutchinson's trace estimator.

The problem with normalizing flows is calculating the log-determinant of the Jacobian matrix of size $D \times D$, which in the general case costs $\mathcal{O}(D^3)$. Computing the trace instead requires $\mathcal{O}(D^2)$ operations: we only need the sum of the diagonal, but each diagonal entry requires a separate pass through the network, hence the quadratic complexity.

One trick we can apply concerns calculating the trace. By utilizing Hutchinson's trace estimator (Grathwohl et al., 2018), the quadratic complexity is decreased to $\mathcal{O}(D)$. The estimator is relatively easy to calculate for any square matrix $\mathbf{A}$, namely:

\begin{equation} \mathrm{Tr}(\mathbf{A}) = \mathbb{E}_{\epsilon}\left[ \epsilon^{\top} \mathbf{A} \epsilon \right] , \end{equation}

where $\epsilon$ follows a distribution with zero mean and unit variance, e.g., $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.

For a specific $\epsilon$, the product of $\mathbf{A} \epsilon$ could be calculated in a single forward pass, and it is "backpropagatable"; therefore, we can estimate the trace by taking $M$ Monte Carlo samples:

\begin{equation} \mathrm{Tr}(\mathbf{A}) \approx \frac{1}{M} \sum_{m=1}^{M} \epsilon_m^{\top} \mathbf{A} \epsilon_m . \end{equation}

In practice, we take $M=1$, namely, a single sample of $\epsilon$ for each incoming datapoint. This is obviously a noisy estimate; however, it is unbiased. As a result, during training with a stochastic gradient-based method, it does not matter too much.
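For illustration, here is a minimal PyTorch sketch of the estimator applied to the Jacobian $\frac{\partial v}{\partial \mathbf{x}_t}$, using a single vector-Jacobian product from autograd. The function name and signature are assumptions of this tutorial.

```python
# Hutchinson's estimator of Tr(dv/dx) with M = 1 (a minimal sketch).
import torch

def hutchinson_trace(v, x, t, create_graph=False):
    """Returns a per-example estimate eps^T (dv/dx) eps, with eps ~ N(0, I)."""
    x = x.detach().requires_grad_(True)     # leaf tensor so we can differentiate w.r.t. x
    eps = torch.randn_like(x)
    out = v(x, t)                           # v(x_t, t), shape (batch, D)
    # one backward pass gives the vector-Jacobian product eps^T (dv/dx)
    vjp = torch.autograd.grad(out, x, grad_outputs=eps, create_graph=create_graph)[0]
    return (vjp * eps).sum(dim=-1)
```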

Eventually, we obtain a procedure that costs $\mathcal{O}(D)$ plus the cost of running the adjoint sensitivity method (a memory-efficient way of backpropagating through the ODE solver). Even so, scaling up CNFs remains a known problem.

Going with the flow: Flow Matching¶

Let us take another look at the ODE we introduced earlier:

\begin{equation} \frac{\mathrm{d} \mathbf{x}_t}{\mathrm{d} t} = v(\mathbf{x}_{t}, t) . \end{equation}

In addition to this ODE, we assume a known distribution $q_{0}(\mathbf{x})$ (e.g., the standard Gaussian) and a data distribution $q_{1}(\mathbf{x})$.

The distribution defined at any moment $t$ is characterized by the continuity equation.

By applying the instantaneous change of variables, we can find a solution, i.e., a probability distribution.

However, all of this results in pretty computationally heavy training.

Questions

  • What could we do if we knew the vector field $v(\mathbf{x}_{t}, t)$ and distributions $p_{t}(\mathbf{x})$?
  • How could we train our model then?
  • What would be our model? Do you remember the score matching approach?

Let's take a look at a loss analogous to the denoising score matching loss, this time for finding a model of the vector field $v_{\theta}(\mathbf{x}_{t}, t)$:

\begin{equation} \ell_{FM}(\theta) = \mathbb{E}_{t\sim U(0,1), \mathbf{x}_t \sim p_{t}(\mathbf{x})}\left[ \|v_{\theta}(\mathbf{x}_{t}, t) - v(\mathbf{x}_{t}, t) \|^2 \right] , \end{equation}

instead of looking for a distribution like in CNFs.

In plain words, for any time $t$ sampled uniformly at random, we sample $\mathbf{x}_{t}$ from the distribution $p_{t}(\mathbf{x})$ (we assume we know it!) and aim at minimizing the difference between the model $v_{\theta}(\mathbf{x}_{t}, t)$ and the real vector field $v(\mathbf{x}_{t}, t)$ (we assume we know it!). We refer to this objective as flow matching (FM).

Why would this work? For a very simple reason: If $\ell_{FM}(\theta) = 0$, i.e., our model perfectly imitates the real vector field, we can transform any noise distribution to the data distribution! Why? Because the vector field pushes points towards the data distribution over time. Take a look at Figure 1 where blue arrows (the vector field) indicate how points should evolve over time from a noise distribution $q_0(\mathbf{x})$ (e.g., the standard Gaussian) to the data distribution $q_1(\mathbf{x})$ (orange half-moons in Figure 1).

Figure 1. An example of how a model of the vector field (blue arrows) changes over time around datapoints (orange two moons).

The follow-up question is: why is this so great? The answer is (again) simple: This is a regression problem with the mean squared error loss! Nothing complicated, nothing tricky, a well-behaved objective. One can run autograd and use any deep learning library to implement it. Fantastic!

But... we do not know $p_{t}(\mathbf{x})$ and $v(\mathbf{x}_{t}, t)$. How can we overcome this?

Conditional Flow Matching. First, let us consider a modified problem in which we introduce additional variables $\mathbf{z}$ sampled from a given distribution $q(\mathbf{z})$. The conditional ODE takes the following form:

\begin{equation} \frac{\mathrm{d} \mathbf{x}_t}{\mathrm{d} t} = v(\mathbf{x}_{t}, t; \mathbf{z}) . \end{equation}

For now, please think of this problem as a proxy for the unconditional ODE introduced before. In general, it is typically easier to work with conditional problems as long as the conditioning information is relevant. Regarding $\mathbf{z}$, we can think of it as extra information like data $\mathbf{x}_1$, or anything else like a class label, a piece of text, an audio signal, or an additional image.

Then, since we have to also sample $\mathbf{z}$'s from some distribution $q(\mathbf{z})$, the conditional flow matching (CFM) loss can be defined as follows:

\begin{equation} \ell_{CFM}(\theta) = \mathbb{E}_{t\sim U(0,1), \mathbf{x}_t \sim p_{t}(\mathbf{x} | \mathbf{z}), \mathbf{z} \sim q(\mathbf{z})}\left[ \|v_{\theta}(\mathbf{x}_{t}, t) - v(\mathbf{x}_{t}, t; \mathbf{z}) \|^2 \right] , \end{equation}

where, importantly, we still use an unconditional model $v_{\theta}(\mathbf{x}_{t}, t)$ to match the conditional vector field.

In the CFM loss, we need to define the conditional distribution at every $t$, $p_{t}(\mathbf{x} | \mathbf{z})$, and the real vector field is conditioned on $\mathbf{z}$, $v(\mathbf{x}_{t}, t; \mathbf{z})$.

Question

Why would adding conditioning help to learn a model that should work for the unconditional case?

As proved in (Lipman et al., 2022) and (Tong et al., 2023), both losses are equal up to a constant independent of $\theta$, and, thus, their gradients are equal!

Theorem 1 (Lipman et al., 2022) If $p_{t}(\mathbf{x}) > 0$ for all $\mathbf{x} \in \mathbb{R}^D$ and $t \in [0, 1]$, then, up to a constant independent of $\theta$, $\ell_{FM}$ and $\ell_{CFM}$ are equal, and hence $\nabla_{\theta} \ell_{FM}(\theta) = \nabla_{\theta} \ell_{CFM}(\theta)$.

This result means that the model $v_{\theta}(\mathbf{x}, t)$ trained with the conditional version of the loss, but which is unconditional, coincides with the solution of the unconditional flow matching problem. Fantastic! One problem is gone; we can use CFM instead of FM!

Now we have another problem, namely, what this conditioning $\mathbf{z}$ should be and what its distribution $q(\mathbf{z})$ is. Fortunately, there are multiple options (see, e.g., (Tong et al., 2023)); here we focus on two of them:

  1. In Lipman et al. CFM, $\mathbf{z}$ is a datapoint $\mathbf{x}_1$, and thus, $q(\mathbf{z}) = q_{1}(\mathbf{z})$; in other words, $q(\mathbf{z})$ is the data distribution.
  2. In Tong et al. CFM (a.k.a. independent CFM, iCFM), $\mathbf{z}$ is a pair of noise and data, $\mathbf{z} = (\mathbf{x}_0, \mathbf{x}_1)$, sampled independently from each other, i.e., $q(\mathbf{z}) = q_{0}(\mathbf{x}_{0})\ q_{1}(\mathbf{x}_{1})$.

An extension of iCFM is sampling $\mathbf{z} = (\mathbf{x}_0, \mathbf{x}_1)$ by solving the optimal transport problem (Tong et al., 2023).

Conditional probability paths¶

The last piece of the puzzle is how to obtain conditional distributions $p_{t}(\mathbf{x} | \mathbf{z})$ a.k.a. (conditional) probability paths.

Let's consider the conditional probability path of the following form:

\begin{equation} p_{t}(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x} | \mu(\mathbf{z}, t), \sigma^2(\mathbf{z}, t) \mathbf{I}) , \end{equation}

which is a Gaussian distribution with the mean function $\mu(\mathbf{z}, t)$ and a diagonal covariance matrix with the standard deviation function $\sigma(\mathbf{z}, t)$. In general, there is no unique ODE that generates these distributions. However, the following theorem shows that there is a specific vector field that leads to them!

Theorem 2 (Lipman et al., 2022) The unique vector field with initial condition $p_{0}(\mathbf{x}) = \mathcal{N}(\mu_0, \sigma_0^2 \mathbf{I})$ that generates $p_{t}(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x} | \mu(\mathbf{z}, t), \sigma^2(\mathbf{z}, t) \mathbf{I})$ has the following form: \begin{equation} v(\mathbf{x}, t; \mathbf{z}) = \frac{\sigma^{'}(\mathbf{z}, t)}{\sigma(\mathbf{z}, t)} \left(\mathbf{x} - \mu(\mathbf{z}, t) \right) + \mu^{'}(\mathbf{z}, t), \end{equation} where $\sigma^{'}(\mathbf{z}, t)$ and $\mu^{'}(\mathbf{z}, t)$ denote the time derivatives of $\sigma(\mathbf{z}, t)$ and $\mu(\mathbf{z}, t)$, respectively.

Let's look into two specific forms of conditional flow matching, namely:

  1. Lipman et al. CFM (we refer to it as 'fm' later on in the code): We take $\mathbf{z} \equiv \mathbf{x}_1$ to be data sampled from the data distribution, $\mathbf{x}_1 \sim q_{1}(\mathbf{x})$. Then, we define the mean and the standard deviation functions as follows:

\begin{align} \mu(\mathbf{z}, t) &= t \mathbf{x}_1 ,\\ \sigma(\mathbf{z}, t) &= t \sigma_{const} - t + 1 , \end{align}

where $\sigma_{const} > 0$ is a smoothing constant. As a result, we obtain the following conditional probability path and the conditional vector field:

\begin{align} p_{t}(\mathbf{x} | \mathbf{z}) &= \mathcal{N}\left( \mathbf{x} | t \mathbf{x}_1, (t \sigma_{const} - t + 1)^2 \mathbf{I} \right) , \\ v(\mathbf{x}, t ; \mathbf{z}) &= \frac{ \mathbf{x}_1 - (1 - \sigma_{const}) \mathbf{x} }{ 1 - (1 - \sigma_{const}) t } . \end{align}

To obtain the analytical form of $v(\mathbf{x}, t ; \mathbf{z})$, we need to apply the theorem presented above. It turns out that we get a probability path from the standard Gaussian distribution, $p_0(\mathbf{x}) = \mathcal{N}(\mathbf{x} | 0, \mathbf{I})$, to a Gaussian distribution centered at a datapoint with standard deviation $\sigma_{const}$, $p_1(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{x}_1, \sigma_{const}^2 \mathbf{I})$ (Lipman et al., 2022).
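For the curious reader, the calculation is short (a worked step added here for completeness): plugging $\mu(\mathbf{z}, t) = t \mathbf{x}_1$ and $\sigma(\mathbf{z}, t) = 1 - (1 - \sigma_{const}) t$, so that $\mu^{'}(\mathbf{z}, t) = \mathbf{x}_1$ and $\sigma^{'}(\mathbf{z}, t) = -(1 - \sigma_{const})$, into Theorem 2 gives:

\begin{align} v(\mathbf{x}, t; \mathbf{z}) &= \frac{-(1 - \sigma_{const})}{1 - (1 - \sigma_{const}) t} \left(\mathbf{x} - t \mathbf{x}_1 \right) + \mathbf{x}_1 \\ &= \frac{\mathbf{x}_1 - (1 - \sigma_{const}) \mathbf{x}}{1 - (1 - \sigma_{const}) t} . \end{align}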

  2. Tong et al. iCFM: We consider $\mathbf{z} \equiv (\mathbf{x}_0, \mathbf{x}_1)$ and $q(\mathbf{z}) = q_0(\mathbf{x}_0) q_1(\mathbf{x}_1)$. Next, we choose the following means and standard deviations:

\begin{align} \mu(\mathbf{z}, t) &= t \mathbf{x}_1 + (1 - t) \mathbf{x}_0 ,\\ \sigma(\mathbf{z}, t) &= \sigma_{const} , \end{align}

where $\sigma_{const} > 0$ is a smoothing constant. The mean function is an interpolation between noise and data since $t \in [0, 1]$. The resulting conditional probability path and the conditional vector fields are the following:

\begin{align} p_{t}(\mathbf{x} | \mathbf{z}) &= \mathcal{N}\left( \mathbf{x} | t \mathbf{x}_1 + (1 - t) \mathbf{x}_0, \sigma_{const}^2 \mathbf{I} \right) , \\ v(\mathbf{x}, t ; \mathbf{z}) &= \mathbf{x}_1 - \mathbf{x}_0 . \end{align}

Side Note

Interestingly, the vector field is simply the difference between a datapoint and a sampled noise. This comes from the fact that we assume a fixed standard deviation in the probability path, so after applying Theorem 2, we are left with the derivative of the mean function only.

(Tong et al., 2023) showed (see Proposition 3.3 therein) that the boundary distributions are $p_0(\mathbf{x}) = q_0(\mathbf{x}) * \mathcal{N}(\mathbf{x} | 0, \sigma_{const}^2 \mathbf{I})$ and $p_1(\mathbf{x}) = q_1(\mathbf{x}) * \mathcal{N}(\mathbf{x} | 0, \sigma_{const}^2 \mathbf{I})$, where $*$ denotes the convolution operator. For instance, if we take $q_0(\mathbf{x}) = \mathcal{N}(\mathbf{x} | 0, \mathbf{I})$, then $p_0(\mathbf{x}) = \mathcal{N}(\mathbf{x} | 0, (\sigma_{const}^2 + 1) \mathbf{I})$. However, $q_0(\mathbf{x})$ could be any distribution, not only a Gaussian. For $q_1(\mathbf{x})$ being the data distribution, the other boundary distribution is the data distribution smoothed with Gaussian noise, i.e., per datapoint, $\mathcal{N}(\mathbf{x} | \mathbf{x}_1, \sigma_{const}^2 \mathbf{I})$, which is the same as in the case of the Lipman et al. CFM.
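To make both choices concrete, here is a minimal sketch that, given $t$ and the conditioning $\mathbf{z}$, samples $\mathbf{x}_t \sim p_{t}(\mathbf{x} | \mathbf{z})$ and returns the target conditional vector field, for the 'fm' (Lipman et al.) and 'icfm' (Tong et al.) variants. The function and argument names are assumptions of this tutorial.

```python
# Conditional probability paths and target vector fields (a minimal sketch).
import torch

def conditional_sample_and_target(x1, t, sigma_const=0.1, x0=None, variant="icfm"):
    """x1: data batch (batch, D); t: times (batch, 1) in [0, 1]."""
    if variant == "fm":                                   # Lipman et al. CFM
        mu = t * x1
        sigma = 1.0 - (1.0 - sigma_const) * t             # = t * sigma_const - t + 1
        x_t = mu + sigma * torch.randn_like(x1)
        v_target = (x1 - (1.0 - sigma_const) * x_t) / (1.0 - (1.0 - sigma_const) * t)
    else:                                                 # Tong et al. iCFM
        x0 = torch.randn_like(x1) if x0 is None else x0   # x0 ~ q0 (here: standard Gaussian)
        mu = t * x1 + (1.0 - t) * x0
        x_t = mu + sigma_const * torch.randn_like(x1)
        v_target = x1 - x0
    return x_t, v_target
```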

The differences between Lipman et al. CFM and Tong et al. iCFM are rather subtle. However, these subtle differences lead to different behavior of the probability paths. In Figure 2, examples of these two CFMs are presented. Lipman et al. CFM starts with the standard Gaussian that evolves over time into a small Gaussian (i.e., the standard deviation decreases) in the data space (see Figure 2, left). Tong et al. iCFM, on the other hand, defines a small Gaussian that is "moved" over time to the data space (see Figure 2, right).

Figure 2. (left) An example of Lipman et al. CFM. (right) An example of Tong et al. iCFM. The dotted line indicates the interpolation between noise and data.

Training algorithms¶

  1. Sample $t \sim \text{Uniform}(0, 1)$.
  2. Sample $\mathbf{z} \sim q(\mathbf{z})$.
  3. Calculate $\mu(\mathbf{z}, t)$ and $\sigma(\mathbf{z}, t)$.
  4. Sample $$\mathbf{x}_t \sim \mathcal{N}\left(\mathbf{x} | \mu(\mathbf{z}, t), \sigma^2(\mathbf{z}, t) \mathbf{I} \right)$$
  5. Calculate the vector field $v(\mathbf{x}_{t}, t; \mathbf{z})$.
  6. Calculate loss $$\ell_{CFM}(\theta) = \| v_{\theta}(\mathbf{x}_t, t) - v(\mathbf{x}_{t}, t; \mathbf{z})\|^2$$
  7. Update parameters: $$\theta \leftarrow \text{Update}\left(\theta, \nabla_{\theta} \ell_{CFM}(\theta) \right)$$
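Putting the steps above together, here is a minimal PyTorch-style training sketch using the iCFM path. It reuses the hypothetical `VectorField` and `conditional_sample_and_target` helpers from the earlier sketches, and `data_loader` is assumed to yield batches of datapoints $\mathbf{x}_1$.

```python
# A minimal CFM training loop (iCFM path), following the steps above.
import torch

def train_cfm(v_theta, data_loader, epochs=100, lr=1e-3, sigma_const=0.1):
    opt = torch.optim.Adam(v_theta.parameters(), lr=lr)
    for _ in range(epochs):
        for x1 in data_loader:
            t = torch.rand(x1.shape[0], 1)                  # 1. t ~ Uniform(0, 1)
            x0 = torch.randn_like(x1)                       # 2. z = (x0, x1), x0 ~ q0
            x_t, v_target = conditional_sample_and_target(  # 3.-5. path sample and target field
                x1, t, sigma_const=sigma_const, x0=x0, variant="icfm")
            loss = ((v_theta(x_t, t) - v_target) ** 2).sum(dim=-1).mean()  # 6. CFM loss
            opt.zero_grad()
            loss.backward()
            opt.step()                                      # 7. update parameters
    return v_theta
```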

Sampling¶

Let me remind you, my curious reader, that we learn a model of the vector field. Once we have it, sampling is straightforward, namely:

  1. Sample $\mathbf{x} \sim q_0(\mathbf{x})$.
  2. Run the forward Euler method until $t=1$ with a step size $\Delta$: $$ \mathbf{x}_{t+\Delta} = \mathbf{x}_{t} + v_{\theta}(\mathbf{x}_{t}, t)\ \Delta .$$

Please note that in the case of flow matching, unlike in score-based generative models, we assume that time flows from $t=0$ (i.e., noise), to $t=1$ (i.e., data). As a result, once the model of the vector field is trained, we run the forward Euler method, not the backward Euler method like in score-based generative models.
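For concreteness, here is the sampling step as code, reusing the hypothetical `VectorField` and `integrate` helpers sketched earlier.

```python
# Sampling sketch: noise -> data with the forward Euler method.
import torch

v_theta = VectorField(dim=2)                  # in practice: a model trained with train_cfm
x0 = torch.randn(1000, 2)                     # 1. x ~ q0(x), e.g., a 2D standard Gaussian
samples = integrate(v_theta, x0, steps=100)   # 2. forward Euler from t=0 to t=1
```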

Calculating the log-likelihood function¶

In the discussion on CNFs, we showed that calculating the (log-)likelihood function is possible. The full derivation of the log-likelihood function for flow matching is presented in Appendix C of (Lipman et al., 2022). The main idea is similar to CNFs; to make the calculations practical, we use Hutchinson's trace estimator, which yields:

\begin{equation} \ln p_{1}(\hat{\mathbf{x}}_1) \approx \ln p_{0}(\hat{\mathbf{x}}_0) - f_{0} , \end{equation}

where $\hat{\mathbf{x}}_1$ is a datapoint, $\hat{\mathbf{x}}_0$ is the noise that corresponds to $\hat{\mathbf{x}}_1$, and $f_{0}$ is an approximation of the time integral of the trace of the vector field's Jacobian. Note that we are given only $\hat{\mathbf{x}}_1$, but we can obtain $\hat{\mathbf{x}}_0$ and $f_{0}$ by running the following procedure (Lipman et al., 2022):

  1. For given data point $\hat{\mathbf{x}}_1$, set the initial conditions: $$ \begin{bmatrix} \phi_1 \\ f_1 \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{x}}_1 \\ 0 \end{bmatrix} . $$
  2. Define the following ODE: $$ \frac{\mathrm{d}}{\mathrm{d}s}\begin{bmatrix} \phi_{1-s} \\ f_{1-s} \end{bmatrix} = \begin{bmatrix} -v_{\theta}(\phi_{1-s}, 1-s) \\ \epsilon^{\top} \nabla_{\phi} v_{\theta}(\phi_{1-s}, 1-s) \epsilon \end{bmatrix} $$ where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$.

NOTE: We calculate $\nabla_{\phi} v_{\theta}(\phi_{1-s}, 1-s)$ by using autograd.

  3. Solve the ODE in step 2 by running the backward Euler method.

NOTE: Here we need to go from $t=1$ to $t=0$, thus, the backward Euler.

  4. Output the result: $$ \begin{bmatrix} \phi_0 \\ f_0 \end{bmatrix} = \begin{bmatrix} \hat{\mathbf{x}}_0 \\ \hat{c} \end{bmatrix} . $$

The outputs of this procedure are then plugged into the earlier equation, which yields: \begin{equation} \ln p_{1}(\hat{\mathbf{x}}_1) \approx \ln p_{0}(\hat{\mathbf{x}}_0) - \hat{c} . \end{equation}
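Here is a minimal sketch of this procedure, reusing the hypothetical `hutchinson_trace` helper from the CNF section; `log_p0` is assumed to be a callable returning the log-density of the base distribution $p_0$, and a fixed-step Euler discretization is used.

```python
# Likelihood estimation sketch: integrate backward from t=1 to t=0 while
# accumulating the Hutchinson estimate of the trace.
import torch

def log_likelihood(v_theta, x1, log_p0, steps=100):
    """x1: data batch (batch, D); log_p0: callable, log-density of the base distribution."""
    x = x1.clone()
    f = torch.zeros(x1.shape[0])                       # accumulates the trace integral
    ds = 1.0 / steps
    for k in range(steps):
        t = torch.full((x1.shape[0], 1), 1.0 - k * ds) # time runs from 1 down to 0
        f = f + hutchinson_trace(v_theta, x, t).detach() * ds
        with torch.no_grad():
            x = x - v_theta(x, t) * ds                 # Euler step backward in time
    return log_p0(x) - f                               # ln p1(x1) ~ ln p0(x0) - f

# usage (2D example): log_p0 of a standard Gaussian base distribution
# base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))
# nll = -log_likelihood(v_theta, batch_of_data, base.log_prob).mean()
```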

Example¶

Let's see how flow matching behaves on the two moons dataset in Figure 3. We start with points sampled from the standard Gaussian, and then the vector field model (blue arrows) pushes them towards the two orange half-moons.

Figure 3. An example of how a model of the vector field (blue arrows) pushes points (black crosses) towards data (orange two moons) over time.

After running the code with an MLP-based vector field model and the following hyperparameter values, $\sigma_{const} = 0.1$ and $T=100$, we can expect results like those in Figure 4. Note that we report the negative log-likelihood estimated according to the procedure presented above.

Figure 4. A sample of generated images.

What is the future of flow matching?¶

  • Similarly to Latent Diffusion, in (Dao et al., 2023) flow matching is used as a "prior" in an auto-encoder setting. First, the auto-encoder is trained, and then the vector field model is trained in the latent space. Afterwards, a sample from the FM model is decoded back to the data space.
  • The idea of interpolations in CFM was further extended to stochastic interpolants, proposed by (Albergo et al., 2023a; Albergo et al., 2023b). I highly recommend looking these papers up since they provide many interesting extensions, both theoretical and practical.
  • As I mentioned earlier, one can propose a better distribution $q(\mathbf{z})$ by using Optimal Transport (OT). This results in OT-CFM (Tong et al., 2023) and Schrödinger Bridge CFM (De Bortoli et al., 2021; Vargas et al., 2021).
  • Action matching, a closely related approach to CFM, proposed by (Neklyudov et al., 2023), allows learning an underlying mechanism of moving points in time without modeling the distributions at each step.
  • Here, we considered interpolations in Euclidean spaces. (Chen & Lipman, 2023) proposed an extension of CFMs to general Riemannian manifolds.
  • To take advantage of symmetries in data, (Klein et al., 2023) modified the cost function in OT-CFM to account for those. Additionally, they used equivariant graph neural networks to formulate equivariant flow matching.
  • Here, we discussed the case of continuous random variables. However, in practice, we often deal with discrete data, e.g., molecules, proteins, pixel values. (Campbell et al., 2024) proposed Discrete Flow Models that could be seen as a version of CFM for handling discrete data.
  • CFM was also proposed as a method for simulation-based inference (Wildberger et al., 2023), i.e., a problem in which one has access to a simulator but the likelihood function is unknown or intractable.
  • There are very close connections between score-based generative models and flow matching. I highly recommend looking into (Tong et al., 2023) and (Kingma & Gao, 2023) (the appendix therein is simply marvelous!) for further details.

References¶

(Albergo et al., 2023a) Albergo, M. S., Boffi, N. M., & Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

(Albergo et al., 2023b) Albergo, M. S., Goldstein, M., Boffi, N. M., Ranganath, R., & Vanden-Eijnden, E. (2023). Stochastic interpolants with data-dependent couplings. arXiv preprint arXiv:2310.03725.

(Ben-Hamu et al., 2022) Ben-Hamu, H., Cohen, S., Bose, J., Amos, B., Grover, A., Nickel, M., Chen, R.T. and Lipman, Y., 2022. Matching normalizing flows and probability paths on manifolds. arXiv preprint arXiv:2207.04711.

(Campbell et al., 2024) Campbell, A., Yim, J., Barzilay, R., Rainforth, T., & Jaakkola, T. (2024). Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design. arXiv preprint arXiv:2402.04997.

(Chen et al., 2018) Chen, R.T., Rubanova, Y., Bettencourt, J. and Duvenaud, D.K., 2018. Neural ordinary differential equations. Advances in neural information processing systems, 31.

(Chen & Lipman, 2023) Chen, R. T., & Lipman, Y. (2023). Riemannian flow matching on general geometries. arXiv preprint arXiv:2302.03660.

(Dao et al., 2023) Dao, Q., Phung, H., Nguyen, B., & Tran, A. (2023). Flow matching in latent space. arXiv preprint arXiv:2307.08698.

(De Bortoli et al., 2021) De Bortoli, V., Thornton, J., Heng, J., & Doucet, A. (2021). Diffusion Schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34, 17695-17709.

(Grathwohl et al., 2018) Grathwohl, W., Chen, R.T., Bettencourt, J., Sutskever, I. and Duvenaud, D., 2018, September. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. In International Conference on Learning Representations.

(Kingma & Gao, 2023) Kingma, D.P. and Gao, R., 2023, November. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems.

(Klein et al., 2023) Klein, L., Krämer, A., & Noé, F. (2023). Equivariant flow matching. Advances in Neural Information Processing Systems, 36.

(Lipman et al., 2022) Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

(Neklyudov et al., 2023) Neklyudov, K., Severo, D., & Makhzani, A. (2023). Action matching: A variational method for learning stochastic dynamics from samples. ICML 2023.

(Tong et al., 2023) Tong, A., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Fatras, K., ... & Bengio, Y. (2023). Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482.

(Vargas et al., 2021) Vargas, F., Thodoroff, P., Lamacraft, A., & Lawrence, N. (2021). Solving schrödinger bridges via maximum likelihood. Entropy, 23(9), 1134.

(Wildberger et al., 2023) Wildberger, J., Dax, M., Buchholz, S., Green, S., Macke, J. H., & Schölkopf, B. (2023). Flow Matching for Scalable Simulation-Based Inference. Advances in Neural Information Processing Systems, 36.