Discretized Logistic Mixture Likelihood: The Why, What and How
In this post I will explain what the Discretized Logistic Mixture Likelihood (DLML)^{1} is. This is a modeling method particularly relevant to my MSc thesis, where I use it to model the continuous actions my imitation learning agent should take. While there are already some great posts explaining the concept, the information is scattered, which can make understanding it a bit painful. I will start by motivating why we need DLML. I will then present what DLML is. Finally I will outline how we can implement DLML in PyTorch.
The Why
Suppose you wish to predict some variable that happens to be continuous, conditional on some other quantity. For example, you are interested in predicting the value of a given pixel in an image, given the values of neighbouring pixels.
With a bit of domain knowledge, this class of problem can typically be reformulated by discretizing the target variable and modeling the resulting (conditional) probability distribution. The prediction task can then be posed as a classification one over the discretized bins: apply a softmax and train using cross-entropy loss.
This is (in part) what the authors of PixelCNN did: for the task of conditional image generation, they model each (sub)pixel of an image with a softmax over a 256-dimensional vector, where each dimension represents an 8-bit intensity value that the pixel may take. There are more details, particularly around the conditioning, but that’s all you need to know for now for the premise of this tutorial.
Immediately, we face a number of limitations:

Softmax can be computationally expensive and unstable. This is particularly problematic for high-dimensional outputs, which are usually the case when dealing with (discretized) continuous variables, and even more so if you plan to repeat the computation over several output variables (e.g. several pixels of an output image, several dimensions of a robotic arm's rotation, etc.), which is usually the case when interested in conditional generation of continuous variables.

Softmax can lead to sparse gradients, especially at the beginning of training, which can slow down learning. This is also especially the case with high-dimensional inputs.

Softmax does not model any sort of ordinality in the random variable that is being considered: every single dimension in the input vector is considered independently. There is no notion that a value of 127 is close to 128. This ordinality is typically present when dealing with (discretized) continuous variables by virtue of their nature. Rather than relying on some inductive bias, the model has to devote more training time to learn this aspect of the data, leading to slower training.

Softmax fails to properly model values that are never observed, assigning a probability of 0 to values that may otherwise be likely to occur.
These issues are at least some of the motivations for using DLML, which I will introduce in the next section.
The What (and some more why)
In DLML, for a given output variable $y$ we do the following:

We assume that there is a latent value $v$ with a continuous distribution.

We take $y$ to come from a discretization of this continuous distribution of $v$. We do this discretization in some arbitrary way, but usually by rounding to the nearest 8-bit representation. What this means is that if e.g. $v$ can be any value between 0 and 255, then $y$ will be any integer between those two numbers.

We model $v$ using a simple continuous distribution, e.g. the logistic distribution.

We then take a further step, choosing to model $v$ as a mixture of $K$ logistic distributions:
$$ v \sim \sum_{i=1}^K \pi_i \, \text{logistic}(\mu_i, s_i), $$
(equation 1) where $\pi_i$ is a mixture coefficient weighting the likelihood of the $i$th distribution (with $\sum_{i=1}^K \pi_i = 1$), while $\mu_i$ and $s_i$ are the mean and scale parametrizing it.

To compute the likelihood of $y$, we sum its (weighted) probability masses over the $K$ mixture components. We can obtain the probability masses by computing the difference between consecutive cumulative distribution function (CDF) values of equation (1). Note that the CDF of the logistic distribution is a sigmoid function. We therefore write:
$$ p(y \mid \mathbf{\pi}, \mathbf{\mu}, \mathbf{s}) = \sum_{i=1}^K \pi_i \left[\sigma\left(\frac{y + 0.5 - \mu_i}{s_i}\right) - \sigma\left(\frac{y - 0.5 - \mu_i}{s_i}\right)\right], $$
(equation 2) where $\sigma$ is the logistic sigmoid. The $0.5$ comes from the fact that we discretized $v$ into $y$ through rounding, so the bin of each discrete value $y$ extends $0.5$ on either side of it.

We can additionally model the edge cases, replacing $y - 0.5$ with $-\infty$ when $y = 0$ and $y + 0.5$ with $+\infty$ when $y = 2^8 - 1 = 255$.
This is nothing more than a likelihood, so we can use it in a maximum likelihood estimation (MLE) process to estimate our parameters. In the case of Deep Learning, we use the negative log likelihood as our loss function. This comment on GitHub provides a different perspective to what’s going on.
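As a quick sanity check, equation (2) can be evaluated directly. Here is a minimal sketch in PyTorch; all mixture parameter values below are made up for illustration:

```python
import torch

# Toy evaluation of equation (2): the probability mass a K = 2 mixture
# assigns to the bin of y = 128. Parameter values are made up.
pi = torch.tensor([0.3, 0.7])        # mixture weights, sum to 1
mu = torch.tensor([100.0, 180.0])    # component means
s = torch.tensor([5.0, 10.0])        # component scales
y = 128.0

# Difference of the sigmoids (logistic CDFs) at the bin edges,
# weighted by the mixture coefficients.
p = (pi * (torch.sigmoid((y + 0.5 - mu) / s)
           - torch.sigmoid((y - 0.5 - mu) / s))).sum()
```

Summing this quantity over all 256 bins (with the $\pm\infty$ edge cases from step 6) yields 1, as a proper probability mass function should.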
Some more why
This approach provides a number of advantages, many of which address the shortcomings of the softmax approach described in the previous section. In particular:

It avoids assigning probability mass outside the valid range of [0, 255] by explicitly modeling the rounding and edge cases.

Edge values (here 0 and 255) are naturally assigned higher probability, which tends to align with what is observed in data of this nature.

We rely on the simple sigmoid function, which is less computationally expensive than its multi-class cousin, the softmax. This addresses limitation 1 from above.

Because we now use the logistic distribution to model the latent value $v$, we implicitly also model ordinality when discretizing, since the logistic distribution is continuous: nearby values of $y$ receive similar probability. This addresses limitation 3 from above.

Our reliance on a continuous distribution similarly addresses limitation 4 from above: values that are never observed during training still receive sensible, nonzero probability.

Empirically it has been found that a small number of mixture components, $K \le 10$, is enough. What this means is that we can work with much lower-dimensional network outputs (3 parameters per mixture component: $\mu_i$, $s_i$ and $\pi_i$, so $3K$ values in total), leading to denser gradients. This addresses limitation 2 from above.

Because we make use of a mixture, we can more easily model multimodal data. This can be desirable when learning skills from imitation, where the same skill can be shown to be completed through different action sequences. It is exactly for this reason that Lynch et al. 2020 and Mees et al. 2022 make use of DLML in their action decoders.
The How
So how do we actually go about implementing this? This is one of those techniques where we do slightly different things depending on whether we are training or whether we just want outputs from our model (sampling). For completeness, I provide a full reference to the code below on my GitHub.
Training
Earlier, we defined a likelihood for $y$. We can use this to train our model to output the appropriate parameters $\mu$, $s$, and $\pi$ for a given input using MLE. In practice we will calculate the likelihood and then minimize the negative log likelihood.
For a given output variable $y$, using $K$ mixture elements, our model should therefore output $K$ means, scales and mixture logits:
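For concreteness, here is a minimal sketch of such an output head; the layer sizes and names are my own, for illustration only:

```python
import torch
import torch.nn as nn

# Minimal sketch of the output head: for one target variable and K mixture
# components, the network predicts 3 * K numbers. Sizes are hypothetical.
K = 10
head = nn.Linear(64, 3 * K)           # hypothetical final layer
features = torch.randn(4, 64)         # toy backbone features for a batch of 4

# Split into K means, K log scales and K (unnormalized) mixture logits.
means, log_scales, logits = torch.chunk(head(features), 3, dim=-1)
```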


We treat the predicted scales as $\log(s)$, which we can then exponentiate to recover $s$. This enforces positive values of $s$ and helps numerical stability. In practice we take an exponential of the negative `log_scales` to obtain `inv_scales`, i.e. $\frac{1}{s}$, since $s$ always appears in the denominator in our formulas.
We can then start computing the rest of the terms for our likelihood from equation (2).
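Here is a sketch of those terms, with toy tensors standing in for the network outputs of the previous step; it assumes targets rescaled to $[-1, 1]$, as discussed just below:

```python
import torch

# Sketch of the per-component CDF-difference terms from equation (2).
K, num_classes = 10, 256
y_range = 2.0                                  # 1 - (-1): targets scaled to [-1, 1]
epsilon = (y_range / 2) / (num_classes - 1)    # the 0.5 boundary after rescaling

y = torch.rand(4, 1) * 2 - 1                   # toy targets in [-1, 1]
means = torch.randn(4, K)
log_scales = torch.clamp(torch.randn(4, K), min=-7.0)
inv_scales = torch.exp(-log_scales)            # 1 / s

# Repeat the single target across the K mixture components, then subtract.
centered_y = y.repeat(1, K) - means
upper_in = inv_scales * (centered_y + epsilon)
lower_in = inv_scales * (centered_y - epsilon)
cdf_delta = torch.sigmoid(upper_in) - torch.sigmoid(lower_in)  # bin mass per component
```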


Before I go on, you may be asking: “what is this epsilon? Weren’t we adding/subtracting $0.5$ instead?”. Indeed, this is actually still the case. However, here we are operating on the assumption that we have scaled our $y$’s to be in the range $[-1, 1]$. For 8-bit data, that is equivalent to scaling by $\frac{2}{2^8 - 1}$. We have to apply this same scaling to our $0.5$ boundaries for consistency. Note that `y_range` is simply $1 - (-1) = 2$, the $2$ in the numerator, and that `num_classes` for $y$ is $2^8 = 256$. A final note: we need to play with `y`’s shape a little bit when computing `centered_y`: remember, for each target variable we have multiple means (one for each mixture component), while we have a single target value. We therefore need to repeat the target value for each mixture component to complete the subtraction. We now move on to the edge cases described in step 6 of the what.
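A sketch of these edge-case terms follows, again with toy stand-ins for the earlier variables; the softplus forms are explained next:

```python
import torch
import torch.nn.functional as F

# Edge-case log probabilities, using log sigma(x) = x - softplus(x) and
# log(1 - sigma(x)) = -softplus(x). Toy inputs stand in for earlier steps.
K = 10
epsilon = 1.0 / 255
y = torch.rand(4, 1) * 2 - 1
means = torch.randn(4, K)
log_scales = torch.clamp(torch.randn(4, K), min=-7.0)
inv_scales = torch.exp(-log_scales)

upper_in = inv_scales * (y - means + epsilon)
lower_in = inv_scales * (y - means - epsilon)

log_cdf_upper = upper_in - F.softplus(upper_in)  # log CDF: case y = minimum value
log_one_minus_cdf_lower = -F.softplus(lower_in)  # log(1 - CDF): case y = maximum value

# Log density at the bin centre, used later as a numerical backup.
mid_in = inv_scales * (y - means)
log_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)
```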


“Woah, what is going on here?” you may ask. Here we are making use of the softplus-sigmoid relation, in particular that
$$ \zeta(x) = \int_{-\infty}^x \sigma(y)\,dy, $$
(equation 3) where $\zeta$ is the softplus function. We also approximate the log probability at the center of the bin, based on the assumption that the log density is constant within the bin of the observed value. This is used as a backup in cases where the calculated probabilities fall below 1e-5, which could happen due to numerical instability. This case is extremely rare and I would not dedicate too much thought to it; it is just there as a (rarely used) backup.
We can now put all these terms together into a single log likelihood tensor:
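One way of combining everything, roughly following the PixelCNN++ reference code, with toy inputs standing in for the quantities computed in the previous steps:

```python
import math

import torch
import torch.nn.functional as F

# Toy stand-ins for the previous steps.
K, num_classes = 10, 256
epsilon = 1.0 / (num_classes - 1)
y = torch.rand(4, 1) * 2 - 1
means = torch.randn(4, K)
log_scales = torch.clamp(torch.randn(4, K), min=-7.0)
inv_scales = torch.exp(-log_scales)

centered_y = y - means                      # broadcasting repeats y over components
upper_in = inv_scales * (centered_y + epsilon)
lower_in = inv_scales * (centered_y - epsilon)
cdf_delta = torch.sigmoid(upper_in) - torch.sigmoid(lower_in)
log_cdf_upper = upper_in - F.softplus(upper_in)
log_one_minus_cdf_lower = -F.softplus(lower_in)
mid_in = inv_scales * centered_y
log_pdf_mid = mid_in - log_scales - 2.0 * F.softplus(mid_in)

# Select the appropriate term per element.
log_probs = torch.where(
    y < -0.999,                             # leftmost bin: lower edge -> -inf
    log_cdf_upper,
    torch.where(
        y > 0.999,                          # rightmost bin: upper edge -> +inf
        log_one_minus_cdf_lower,
        torch.where(
            cdf_delta > 1e-5,
            torch.log(torch.clamp(cdf_delta, min=1e-12)),
            # backup: centre density times bin width, in log space
            log_pdf_mid - math.log((num_classes - 1) / 2),
        ),
    ),
)
```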


We are almost done, but there is one last piece. So far we have computed the terms to minimize for learning the distribution(s), i.e. learning $\mu$ and $s$. We also need to learn which mixture component to sample from, i.e. we have to learn $\pi$. This is very simple, and consists of adding the log of the softmax over the logits (the unnormalized $\pi_i$) output by our model:
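A sketch, where `log_probs` stands in for the per-component log masses computed above:

```python
import torch
import torch.nn.functional as F

K = 10
log_probs = -torch.rand(4, K)       # toy per-component log probability masses
logits = torch.randn(4, K)          # the unnormalized pi_i from the network

# log(softmax(pi)_i * mass_i) = log_softmax(pi)_i + log(mass_i)
log_probs = log_probs + F.log_softmax(logits, dim=-1)
```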


We add a log of the softmax because we are after $\log(\text{softmax}(\pi) \cdot \text{likelihood})$ which, by applying log properties, is equivalent to $\log(\text{softmax}(\pi)) + \log(\text{likelihood})$, which is what we get.
All that’s left to do now is summing over our mixtures. We do this after applying the LogSumExp trick for numerical stability:
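With toy values in place of the previous step’s result:

```python
import torch

log_probs = -torch.rand(4, 10)      # toy weighted per-component log masses
# logsumexp sums the mixture components in log space without under/overflow.
log_likelihood = torch.logsumexp(log_probs, dim=-1)
```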


Our loss is the negative log likelihood, which we can choose to reduce across the batch or return unreduced:
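For example:

```python
import torch

log_likelihood = -torch.rand(4)     # toy per-example log likelihoods
loss = -log_likelihood              # negative log likelihood
loss = loss.mean()                  # reduce across the batch (skip to keep it unreduced)
```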


And that’s it for training. Once you have your loss, you can run loss.backward() and use all the cool stuff torch provides with autodiff.
Sampling
Sampling is fortunately a bit easier, and some explanations of DLML start from here. We first sample a distribution from our mixture, and then sample a value from the sampled distribution. We have logits for each distribution in our mixture, so we can sample an index from a softmax over these logits. In practice, we make use of the Gumbel-Max trick, which samples from the categorical directly from the logits without explicitly computing a softmax.
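A sketch of the Gumbel-Max step, with toy logits standing in for the network output:

```python
import torch

K = 10
logits = torch.randn(4, K)                           # toy mixture logits
u = torch.rand_like(logits).clamp(1e-5, 1.0 - 1e-5)  # avoid log(0)
gumbel_noise = -torch.log(-torch.log(u))
# argmax over (logits + Gumbel noise) is a sample from softmax(logits)
argmax = torch.argmax(logits + gumbel_noise, dim=-1)
```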


`argmax` is the index of our sampled distribution. We can use it to get the distribution’s mean and scale from our model’s outputs:
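One way of doing the selection, with toy tensors standing in for the model outputs:

```python
import torch

K = 10
means = torch.randn(4, K)
log_scales = torch.randn(4, K)
argmax = torch.randint(0, K, (4,))       # sampled indices from the Gumbel-Max step

idx = argmax.unsqueeze(-1)
mu = means.gather(-1, idx).squeeze(-1)          # mean of the chosen component
log_s = log_scales.gather(-1, idx).squeeze(-1)  # log scale of the chosen component
```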


We can then sample from our logistic distribution using inverse transform sampling. For a logistic distribution, this consists in
$$ X = \mu + s \log \left(\frac{u}{1-u} \right), $$
(equation 4) where we draw $u$ from a uniform distribution on $(0, 1)$. In code:
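A sketch of this inverse-CDF step, with toy parameters for the selected component:

```python
import torch

mu = torch.zeros(4)                  # toy selected means
log_s = torch.full((4,), -3.0)       # toy selected log scales
u = torch.rand_like(mu).clamp(1e-5, 1.0 - 1e-5)  # uniform sample, away from {0, 1}
x = mu + torch.exp(log_s) * (torch.log(u) - torch.log(1.0 - u))
x = torch.clamp(x, -1.0, 1.0)        # keep samples inside the modeled range
```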


And just like that, we have a way of sampling from our model.
Closing words
I hope this post was helpful. I came across DLML during my MSc thesis on language-enabled imitation learning and, while there are several high-quality posts online, I couldn’t find a single one that summarized the process in its entirety, from motivation to implementation, so I decided to write it myself, also as a way to help me understand the concept. As a reminder, the complete code accompanying this post is available on my GitHub profile.
More Resources
- Great comment on Tacotron GitHub
- Somewhat outdated Google Colab
- PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications, Salimans et al., 2017.
- Learning Latent Plans from Play, Lynch et al., 2020.
- What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data, Mees et al., 2022.

Also known as “Discretized Mixture of Logistics (DMoL)”, “Discretized Logistic Mixture (DLM)”, “Mixture of Discretized Logistics (MDL)” ↩︎