Some Observations on Relaxed Bernoulli Variables.

Lately I have been making use of a continuous relaxation of discrete random variables proposed in two recent papers:  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables and Categorical Reparameterization with Gumbel-Softmax.  I decided to write a blog post with some motivation of the method, as well as providing some minor clarification on this method’s properties.

Consider a bernoulli variable x \in \lbrace 0,1 \rbrace where the distribution of x conditioned on model inputs \vec{s} and weights \theta has the following form: P(x=1|\vec{s},\theta) = \frac{\alpha(\vec{s},\theta)}{1 + \alpha(\vec{s},\theta)}, where \alpha(\vec{s},\theta) > 0.  For notational clarity,  I will suppress the dependence of \alpha on \vec{s} and \theta.  Additionally, I am specifically excluding the \alpha = 0 case because deterministic variables are not very interesting.  We can also view a bernoulli variable with parameter \alpha as having the following density on \mathbb{R}:

P(x;\alpha) = \frac{1}{1+ \alpha}\delta(x) + \frac{\alpha}{1 + \alpha}\delta(1 - x)

Where \delta() is the dirac delta function.  The cumulative distribution function (CDF) of a bernoulli variable parameterized with \alpha has a simple piecewise constant form (disregarding some technical details about delta functions and measure theory):

F_B(\tau;\alpha) =  \begin{cases}  0 & \quad \tau < 0 \\  \frac{1}{1 + \alpha} & \quad 0 \le \tau \le 1 \\  1 & \quad 1 < \tau  \end{cases}

We note that F_B(\tau;\alpha) is continuous everywhere except \tau=0,1.

Imagine that we had a function of x, g(x), and we wanted to estimate  \nabla_{\theta} \langle g(x) \rangle_{P(x|\vec{s},\theta)} in terms of a finite number of samples drawn from P(x|\vec{s},\theta).  This might seem frivolous to do for a single bernoulli variable since we could just write out the full formula for the expectation exactly but it is easier to see how intractable this process gets if we are considering high dimensional vectors of such variables.  One previous method for computing this estimate is the REINFORCE estimator, but this estimator suffers from high variance and does not easily integrate well with automatic differentiation methods like those found in Tensorflow or Theano.

If x were a continuous valued random variable then we could probably use some form of the reparameterisation trick, where x is represented as a function of the parameters \theta and a noise term \varepsilon, such that x = t(\theta,\varepsilon).  Importantly t(\theta,\varepsilon) must be differentiable in both of its inputs.  Gradients with respect to \theta can be expressed in terms of \frac{1}{N} \sum_i \nabla_{\theta}t(\theta,\varepsilon_i) where \varepsilon_i \sim p(\varepsilon)p(\varepsilon) is generally an easy to sample from distribution with a smooth density (such as a standard gaussian) and explicitly does not depend on \theta.  See this blog post for a great overview of the reparameterisation trick.

Unfortunately no such function t() exists for discrete variables like bernoulli or categorical random variables.  The solution proposed (simultaneously it seems) by Maddison, Et al and Jang, Et al is to come up with continuous random variables that can be expressed using a reparameterisation trick, share certain statistical properties with the discrete variables they are approximating, and become close (in a sense) to these variables when a “temperature” parameter tends to zero.  I will focus on the application of this method to a bernoulli variable (henceforth referred to as the relaxed bernoulli variable) but I note that the method generalizes to categorical variables of any number of states.  Additionally, I will mostly follow the convention of Maddison, Et al, where this variable is referred to as a Binary Concrete variable.  A sample of a relaxed bernoulli variable x with parameter \alpha can be defined in terms of a new temperature parameter \lambda > 0 and a variable L distributed according to a standard logistic distribution:

x \sim \frac{1}{1 + \exp(-(\log \alpha + L)/\lambda)}

In (other) words a relaxed bernoulli variable is obtained by a) sampling a standard logistic variable b) offsetting that variable by \log \alpha c) scaling the result by \lambda and finally d) passing the whole thing through a logistic sigmoid.  For notational expediency I will henceforth let s(z) stand for a logistic sigmoid applied to the variable z.  A relaxed bernoulli variate x with parameter \alpha and temperature \lambda has the following density:

P(x;\alpha,\lambda) =  \begin{cases}  0  & \quad x \le 0 \\  \frac{\lambda \alpha x^{-\lambda - 1} (1 - x)^{-\lambda - 1}}{\left( \alpha x^{-\lambda} + (1-x)^{-\lambda} \right)^2 } & \quad 0 < x < 1 \\  0 & \quad x \ge 1  \end{cases}

In the appendix of Maddison, Et al, the authors state several properties of the relaxed bernoulli variable, two of which are of particular interest to me:

  1. (Rounding) P(x > 0.5) = \alpha/(1 + \alpha).  Intuitively, if we were rounding a relaxed bernoulli variable to the nearest integer, then it would have the same distribution as a normal bernoulli variable with the same parameter \alpha.  This is easy to see by noting that s(z) > 0.5 if and only if $z > 0$.  Thus P(x > 0.5) = P(L > -\log \alpha).  Since the CDF of L is the logistic sigmoid (by definition) then we have, P(L > -\log \alpha) = 1 - s(-\log \alpha) = \alpha/(1 + \alpha).
  2. (Zero Temperature) P(lim_{\lambda \to 0} \, x = 1) = \alpha/(1 + \alpha).  This statement was not immediately clear to me but I took it to mean that if one sampled many relaxed bernoulli variables but kept \lambda as an adjustable parameter, then took the limit of \lambda \to 0 then approximately \frac{\alpha}{1 + \alpha} would converge to 1 and the rest to 0.  This statement follows directly from the previous one by noting that lim_{\lambda \to 0} \, s(z/\lambda) = 1 if z > 0 and lim_{\lambda \to 0} \, s(z/\lambda) = 0 if z < 0.

The zero temperature property is probably good enough to convince most users that a relaxed bernoulli variable becomes a good approximation to a standard bernoulli variable with the same \alpha as  \lambda \to 0.  However, I was left wondering if the stronger property of convergence in distribution also held. Convergence in distribution guarantees some other useful properties through the portmanteau lemma, such as E g(x_{\lambda}) \to E g(x) for all bounded and continuous functions g().

The relaxed bernoulli variate x with CDF F_r(\tau;\alpha,\lambda) converges in distribution to a bernoulli variable if F_r(\tau;\alpha,\lambda) \to F_B(\tau;\alpha) as \lambda \to 0 for every \tau where F_B(\tau;\alpha) is continuous.

For the relaxed bernoulli variable, obtaining an expression for F_r(\tau;\alpha,\lambda) takes a little bit of trickery.  Normally one could just define the CDF in terms of an integral of the probability density function:

F_r(\tau;\alpha,\lambda) = \int_{-\infty}^{\tau}  P(x;\alpha,\lambda) dx

However, P(x;\alpha,\lambda) is not defined, and even diverges for small enough \lambda, at x=0,1.  This makes even numerical evaluation of F_r(\tau;\alpha,\lambda) difficult.  However, we can take advantage of the rounding property above to make the following claim (once again suspending some technical details about measure theory):

\int_{0}^{\frac{1}{2}}P(x;\alpha,\lambda) dx = \frac{1}{1+\alpha}

We can use this expression and standard properties of integrals to get a more tractable formulation of F_r(\tau;\alpha,\lambda).  First we define the following notation for a,b \in (0,1):

f_{\alpha,\lambda}(a,b) = \int_a^b P(x;\alpha,\lambda) dx

Finally, we have an expression for F_r(\tau;\alpha,\lambda) that can be evaluated with standard numerical integration techniques.

F_r(\tau;\alpha,\lambda) =  \begin{cases}  0 & \quad \tau \le 0 \\  \frac{1}{1 + \alpha} -  f_{\alpha,\lambda}(\tau,\frac{1}{2}) & \quad 0 < \tau \le \frac{1}{2} \\  \frac{1}{1 + \alpha } + f_{\alpha,\lambda}(\frac{1}{2},\tau) & \quad \frac{1}{2} \le \tau < 1 \\  1 & \quad 1 \le \tau  \end{cases}

By inspection it is clear that F_r(\tau;\alpha,\lambda) = F_B(\tau;\alpha) for \tau < 0 or \tau > 1.  Additionally, since F_B(\tau;\alpha) is discontinuous at \tau = 0,1 all that remains to show is that F_r(\tau;\alpha,\lambda) \to F_B(\tau;\alpha) as \lambda \to 0 for x \in (0,1)F_r(\tau;\alpha,\lambda) is plotted below for several values of \alpha (translated into different values of P(x=1)), and for different values of \lambdaF_r(\tau;\alpha,\lambda) does appear to converge to F_B(\tau;\alpha), although not uniformly.

 

P_08P_05P_02

To make the statement of convergence a little more concrete, I note that it suffices to show that f_{\alpha,\lambda}(\tau,\frac{1}{2}) \to 0 and f_{\alpha,\lambda}(\frac{1}{2},\tau) \to 0 as \lambda \to 0 for all \tau \in (0,1).  Furthermore, these limits can be established by showing that P(x;\alpha,\lambda) \to 0 as \lambda \to 0 for all x \in (0,1).  It is clear that both the numerator and denominator of P(x;\alpha,\lambda) are continuous functions of \lambda as long as x \in (0,1) and \lambda \ge 0.  The denominator converges to (1 + \alpha)^2 and the numerator to 0 as \lambda \to 0.  Thus P(x;\alpha,\lambda) \to 0 as \lambda \to 0 for all x \in (0,1).

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s