Some Observations on Relaxed Bernoulli Variables.

Lately I have been making use of a continuous relaxation of discrete random variables proposed in two recent papers: The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, and Categorical Reparameterization with Gumbel-Softmax.  I decided to write a blog post giving some motivation for the method, as well as some minor clarification of its properties.

Consider a Bernoulli variable $x \in \lbrace 0,1 \rbrace$ where the distribution of $x$ conditioned on model inputs $\vec{s}$ and weights $\theta$ has the following form: $P(x=1|\vec{s},\theta) = \frac{\alpha(\vec{s},\theta)}{1 + \alpha(\vec{s},\theta)}$, where $\alpha(\vec{s},\theta) > 0$.  For notational clarity, I will suppress the dependence of $\alpha$ on $\vec{s}$ and $\theta$.  Additionally, I am specifically excluding the $\alpha = 0$ case because deterministic variables are not very interesting.  We can also view a Bernoulli variable with parameter $\alpha$ as having the following density on $\mathbb{R}$:

$P(x;\alpha) = \frac{1}{1+ \alpha}\delta(x) + \frac{\alpha}{1 + \alpha}\delta(1 - x)$

where $\delta()$ is the Dirac delta function.  The cumulative distribution function (CDF) of a Bernoulli variable parameterized by $\alpha$ has a simple piecewise-constant form (disregarding some technical details about delta functions and measure theory):

$F_B(\tau;\alpha) = \begin{cases} 0 & \quad \tau < 0 \\ \frac{1}{1 + \alpha} & \quad 0 \le \tau < 1 \\ 1 & \quad 1 \le \tau \end{cases}$

We note that $F_B(\tau;\alpha)$ is continuous everywhere except $\tau=0,1$.

Imagine that we had a function of $x$, $g(x)$, and we wanted to estimate $\nabla_{\theta} \langle g(x) \rangle_{P(x|\vec{s},\theta)}$ in terms of a finite number of samples drawn from $P(x|\vec{s},\theta)$.  This might seem frivolous for a single Bernoulli variable, since we could just write out the full formula for the expectation exactly, but it is easy to see how intractable this gets once we consider high-dimensional vectors of such variables.  One previous method for computing this estimate is the REINFORCE estimator, but it suffers from high variance and does not integrate easily with automatic differentiation frameworks such as TensorFlow or Theano.
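
To make REINFORCE concrete, here is a minimal sketch for a single Bernoulli variable, where the parameter is taken to be the logit $\theta = \log \alpha$ (so $P(x=1) = s(\theta)$); the function $g()$ and all constants are illustrative choices of mine, not anything from the papers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single Bernoulli variable with P(x = 1) = sigmoid(theta), i.e. theta = log(alpha).
theta = 0.3
g = lambda x: (x - 0.25) ** 2  # arbitrary function of the sample

rng = np.random.default_rng(0)
x = rng.binomial(1, sigmoid(theta), size=200_000)

# REINFORCE: grad_theta E[g(x)] = E[g(x) * grad_theta log P(x|theta)].
# For a Bernoulli with logit theta, grad_theta log P(x|theta) = x - sigmoid(theta).
grad_est = np.mean(g(x) * (x - sigmoid(theta)))

# Exact gradient for comparison: E[g(x)] = (1 - p) g(0) + p g(1) with p = sigmoid(theta),
# so grad_theta E[g(x)] = p * (1 - p) * (g(1) - g(0)).
p = sigmoid(theta)
grad_exact = p * (1 - p) * (g(1) - g(0))
```

With 200,000 samples the estimate lands close to the exact gradient; with realistically small sample counts the variance of this estimator is the problem.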

If $x$ were a continuous-valued random variable then we could probably use some form of the reparameterisation trick, where $x$ is represented as a function of the parameters $\theta$ and a noise term $\varepsilon$, such that $x = t(\theta,\varepsilon)$.  Importantly, $t(\theta,\varepsilon)$ must be differentiable in both of its inputs.  Gradients with respect to $\theta$ can be expressed in terms of $\frac{1}{N} \sum_i \nabla_{\theta}t(\theta,\varepsilon_i)$ where $\varepsilon_i \sim p(\varepsilon)$.  Here $p(\varepsilon)$ is generally an easy-to-sample distribution with a smooth density (such as a standard Gaussian) that explicitly does not depend on $\theta$.  See this blog post for a great overview of the reparameterisation trick.
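
Here is a minimal sketch of the trick for a Gaussian; the choice of $g(x) = x^2$ and the values of the location and scale parameters are my own illustrative assumptions:

```python
import numpy as np

# Reparameterisation trick for a Gaussian: x = t(mu, eps) = mu + sigma * eps
# with eps ~ N(0, 1).  For g(x) = x**2 we have
# d/dmu E[g(x)] = E[g'(x) * dx/dmu] = E[2 * x], and the exact answer is 2 * mu.
mu, sigma = 1.5, 0.7
rng = np.random.default_rng(0)
eps = rng.standard_normal(200_000)  # noise does not depend on mu

x = mu + sigma * eps         # differentiable in mu
grad_est = np.mean(2.0 * x)  # per-sample gradients, averaged
```

An autodiff framework performs the chain-rule step automatically: because `x` is a differentiable function of `mu`, gradients flow through the samples without any score-function term.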

Unfortunately, no such function $t()$ exists for discrete variables like Bernoulli or categorical random variables.  The solution proposed (simultaneously, it seems) by Maddison et al. and Jang et al. is to construct continuous random variables that can be expressed using a reparameterisation trick, share certain statistical properties with the discrete variables they approximate, and become close (in a sense) to those variables as a “temperature” parameter tends to zero.  I will focus on the application of this method to a Bernoulli variable (henceforth referred to as the relaxed Bernoulli variable), but I note that the method generalizes to categorical variables with any number of states.  Additionally, I will mostly follow the convention of Maddison et al., where this variable is referred to as a Binary Concrete variable.  A sample of a relaxed Bernoulli variable $x$ with parameter $\alpha$ can be defined in terms of a new temperature parameter $\lambda > 0$ and a variable $L$ distributed according to a standard logistic distribution:

$x = \frac{1}{1 + \exp(-(\log \alpha + L)/\lambda)}$

In other words, a relaxed Bernoulli variable is obtained by a) sampling a standard logistic variable, b) offsetting it by $\log \alpha$, c) dividing the result by $\lambda$, and finally d) passing the whole thing through a logistic sigmoid.  For notational expediency I will henceforth let $s(z)$ stand for the logistic sigmoid applied to the variable $z$.  A relaxed Bernoulli variate $x$ with parameter $\alpha$ and temperature $\lambda$ has the following density:
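
This recipe is straightforward to sketch in NumPy (the values of $\alpha$ and $\lambda$ below are arbitrary illustrative choices); a standard logistic variate can be generated as $\log u - \log(1-u)$ for $u \sim \mathrm{Uniform}(0,1)$:

```python
import numpy as np

def sample_relaxed_bernoulli(alpha, lam, size, rng):
    """Sample relaxed Bernoulli (Binary Concrete) variates."""
    u = rng.uniform(size=size)
    L = np.log(u) - np.log1p(-u)  # a) standard logistic noise
    # b) offset by log(alpha), c) divide by lam, d) logistic sigmoid
    return 1.0 / (1.0 + np.exp(-(np.log(alpha) + L) / lam))

rng = np.random.default_rng(0)
x = sample_relaxed_bernoulli(alpha=2.0, lam=0.1, size=100_000, rng=rng)
# At this low temperature most samples sit very near 0 or 1.
```

Since the sample is a differentiable function of $\log \alpha$ and $\lambda$, this is exactly the reparameterised form needed for gradient estimation.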

$P(x;\alpha,\lambda) = \begin{cases} 0 & \quad x \le 0 \\ \frac{\lambda \alpha x^{-\lambda - 1} (1 - x)^{-\lambda - 1}}{\left( \alpha x^{-\lambda} + (1-x)^{-\lambda} \right)^2 } & \quad 0 < x < 1 \\ 0 & \quad x \ge 1 \end{cases}$
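
As a quick sanity check of this formula (a minimal sketch using SciPy's `quad`; the values $\alpha = 2$ and $\lambda = 0.5$ are arbitrary), the density should integrate to 1 over the unit interval despite its endpoint singularities:

```python
from scipy.integrate import quad

def relaxed_bernoulli_pdf(x, alpha, lam):
    """Density of a relaxed Bernoulli variable for x in (0, 1)."""
    num = lam * alpha * x ** (-lam - 1) * (1 - x) ** (-lam - 1)
    den = (alpha * x ** (-lam) + (1 - x) ** (-lam)) ** 2
    return num / den

# Integrate over (0, 1); quad's adaptive scheme copes with the
# integrable singularities at the endpoints.
total, _ = quad(relaxed_bernoulli_pdf, 0.0, 1.0, args=(2.0, 0.5), limit=200)
print(total)
```

The printed total should be very close to 1.0, confirming the density is properly normalized.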

In the appendix of Maddison et al., the authors state several properties of the relaxed Bernoulli variable, two of which are of particular interest to me:

1. (Rounding) $P(x > 0.5) = \alpha/(1 + \alpha)$.  Intuitively, if we were to round a relaxed Bernoulli variable to the nearest integer, it would have the same distribution as an ordinary Bernoulli variable with the same parameter $\alpha$.  This is easy to see by noting that $s(z) > 0.5$ if and only if $z > 0$.  Thus $P(x > 0.5) = P(L > -\log \alpha)$.  Since the CDF of $L$ is the logistic sigmoid (by definition), we have $P(L > -\log \alpha) = 1 - s(-\log \alpha) = \alpha/(1 + \alpha)$.
2. (Zero Temperature) $P(\lim_{\lambda \to 0} x = 1) = \alpha/(1 + \alpha)$.  This statement was not immediately clear to me, but I took it to mean that if one sampled many relaxed Bernoulli variables while keeping $\lambda$ as an adjustable parameter, then in the limit $\lambda \to 0$ approximately a fraction $\frac{\alpha}{1 + \alpha}$ of them would converge to 1 and the rest to 0.  This follows directly from the previous property by noting that $\lim_{\lambda \to 0} s(z/\lambda) = 1$ if $z > 0$ and $\lim_{\lambda \to 0} s(z/\lambda) = 0$ if $z < 0$.
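
Both properties are easy to check empirically; the sketch below uses $\alpha = 2$ (so $\alpha/(1+\alpha) = 2/3$) as an arbitrary illustrative value, and SciPy's `expit` as a numerically careful logistic sigmoid:

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid s(z)

alpha = 2.0
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
L = np.log(u) - np.log1p(-u)  # standard logistic noise, held fixed below

def relaxed(lam):
    return expit((np.log(alpha) + L) / lam)

# Rounding: at any temperature, P(x > 0.5) = alpha / (1 + alpha).
frac_above = np.mean(relaxed(1.0) > 0.5)

# Zero temperature: with the noise held fixed, samples are squashed toward
# {0, 1} as lam -> 0, and the fraction ending up near 1 is again 2/3.
frac_one = np.mean(relaxed(1e-3) > 0.99)
```

Both fractions come out near $2/3$, matching the stated properties.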

The zero temperature property is probably good enough to convince most users that a relaxed Bernoulli variable becomes a good approximation to a standard Bernoulli variable with the same $\alpha$ as $\lambda \to 0$.  However, I was left wondering whether the stronger property of convergence in distribution also holds.  Convergence in distribution guarantees some other useful properties through the portmanteau lemma, such as $\mathbb{E}[g(x_{\lambda})] \to \mathbb{E}[g(x)]$ for all bounded continuous functions $g()$.

The relaxed Bernoulli variate $x$ with CDF $F_r(\tau;\alpha,\lambda)$ converges in distribution to a Bernoulli variable if $F_r(\tau;\alpha,\lambda) \to F_B(\tau;\alpha)$ as $\lambda \to 0$ for every $\tau$ where $F_B(\tau;\alpha)$ is continuous.

For the relaxed Bernoulli variable, obtaining an expression for $F_r(\tau;\alpha,\lambda)$ takes a little bit of trickery.  Normally one could just define the CDF in terms of an integral of the probability density function:

$F_r(\tau;\alpha,\lambda) = \int_{-\infty}^{\tau} P(x;\alpha,\lambda) dx$

However, $P(x;\alpha,\lambda)$ is not defined at $x=0,1$, and for small enough $\lambda$ it diverges as $x$ approaches those points.  This makes even numerical evaluation of $F_r(\tau;\alpha,\lambda)$ difficult.  Fortunately, we can take advantage of the rounding property above to make the following claim (once again suspending some technical details about measure theory):

$\int_{0}^{\frac{1}{2}}P(x;\alpha,\lambda) dx = \frac{1}{1+\alpha}$

We can use this expression and standard properties of integrals to get a more tractable formulation of $F_r(\tau;\alpha,\lambda)$.  First we define the following notation for $a,b \in (0,1)$:

$f_{\alpha,\lambda}(a,b) = \int_a^b P(x;\alpha,\lambda) dx$

Finally, we have an expression for $F_r(\tau;\alpha,\lambda)$ that can be evaluated with standard numerical integration techniques.

$F_r(\tau;\alpha,\lambda) = \begin{cases} 0 & \quad \tau \le 0 \\ \frac{1}{1 + \alpha} - f_{\alpha,\lambda}(\tau,\frac{1}{2}) & \quad 0 < \tau \le \frac{1}{2} \\ \frac{1}{1 + \alpha } + f_{\alpha,\lambda}(\frac{1}{2},\tau) & \quad \frac{1}{2} \le \tau < 1 \\ 1 & \quad 1 \le \tau \end{cases}$
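
This scheme is easy to implement; the sketch below uses SciPy's `quad` for $f_{\alpha,\lambda}(a,b)$, with $\alpha = 2$ and the probe point $\tau = 0.25$ as arbitrary illustrative choices:

```python
from scipy.integrate import quad

def pdf(x, alpha, lam):
    """Relaxed Bernoulli density on (0, 1)."""
    return (lam * alpha * x ** (-lam - 1) * (1 - x) ** (-lam - 1)
            / (alpha * x ** (-lam) + (1 - x) ** (-lam)) ** 2)

def relaxed_cdf(tau, alpha, lam):
    """F_r(tau; alpha, lam), integrating away from the troublesome endpoints."""
    if tau <= 0.0:
        return 0.0
    if tau >= 1.0:
        return 1.0
    if tau <= 0.5:
        tail, _ = quad(pdf, tau, 0.5, args=(alpha, lam))
        return 1.0 / (1.0 + alpha) - tail
    head, _ = quad(pdf, 0.5, tau, args=(alpha, lam))
    return 1.0 / (1.0 + alpha) + head

# For tau = 0.25 < 1/2 the Bernoulli CDF is 1/(1 + alpha) = 1/3; F_r should
# climb toward that value from below as the temperature drops.
alpha = 2.0
vals = [relaxed_cdf(0.25, alpha, lam) for lam in (1.0, 0.3, 0.05)]
```

Since both integration limits stay inside $(0,1)$, the integrand never touches the singular endpoints and standard quadrature converges without trouble.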

By inspection it is clear that $F_r(\tau;\alpha,\lambda) = F_B(\tau;\alpha)$ for $\tau < 0$ or $\tau > 1$.  Additionally, since $F_B(\tau;\alpha)$ is discontinuous at $\tau = 0,1$, those points are excluded from the convergence criterion, and all that remains is to show that $F_r(\tau;\alpha,\lambda) \to F_B(\tau;\alpha)$ as $\lambda \to 0$ for $\tau \in (0,1)$.  $F_r(\tau;\alpha,\lambda)$ is plotted below for several values of $\alpha$ (translated into different values of $P(x=1)$) and for different values of $\lambda$.  $F_r(\tau;\alpha,\lambda)$ does appear to converge to $F_B(\tau;\alpha)$, although not uniformly.

To make the statement of convergence a little more concrete, I note that it suffices to show that $f_{\alpha,\lambda}(\tau,\frac{1}{2}) \to 0$ and $f_{\alpha,\lambda}(\frac{1}{2},\tau) \to 0$ as $\lambda \to 0$ for all $\tau \in (0,1)$.  Furthermore, these limits can be established by showing that $P(x;\alpha,\lambda) \to 0$ as $\lambda \to 0$ for all $x \in (0,1)$.  Both the numerator and denominator of $P(x;\alpha,\lambda)$ are continuous functions of $\lambda$ as long as $x \in (0,1)$ and $\lambda \ge 0$.  As $\lambda \to 0$, the denominator converges to $(1 + \alpha)^2$ and the numerator to $0$.  Thus $P(x;\alpha,\lambda) \to 0$ as $\lambda \to 0$ for all $x \in (0,1)$.
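
This pointwise decay is also visible numerically; the probe point $x = 0.3$ and $\alpha = 2$ below are arbitrary illustrative choices:

```python
def pdf(x, alpha, lam):
    """Relaxed Bernoulli density on (0, 1), as given above."""
    return (lam * alpha * x ** (-lam - 1) * (1 - x) ** (-lam - 1)
            / (alpha * x ** (-lam) + (1 - x) ** (-lam)) ** 2)

# For fixed x in (0, 1) the numerator is O(lam) while the denominator
# tends to (1 + alpha)**2, so the density vanishes as lam -> 0.
densities = [pdf(0.3, 2.0, lam) for lam in (1.0, 0.1, 0.01, 0.001)]
print(densities)  # shrinks roughly in proportion to lam
```

The values decrease roughly linearly in $\lambda$, exactly as the limiting argument predicts.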