Good news, everyone!

Out of memory, or gradient accumulation for larger models

2024-01-31T07:00:00+00:00

When the model grows large and does not fit on a single device, and there are no more devices to spare, the common mitigation strategy is to reduce the batch size, thereby allowing more space for the model at the expense of the data. However, smaller batches lead to noisier weight updates, which is undesirable. One solution is gradient accumulation, where the weights are updated only after the gradients have been evaluated for several batches. In this article, we show how it can be implemented in practice.

Solution

Long story short:

import tensorflow as tf

# Inherit from any optimizer of choice, such as Adam.
class Optimizer(tf.keras.optimizers.Adam):
    """Optimizer that implements gradient accumulation."""

    def __init__(self, accumulation: int = 1, **options) -> None:
        """Create an instance.

        Arguments:
          accumulation: The number of iterations to accumulate gradients over.
          If it is set to one, no accumulation is performed, and the gradients
          are applied as soon as they are computed. If it is set to a value
          greater than one, the gradients will be accumulated for the specified
          number of iterations and only then applied, starting a new cycle.

        All other arguments are passed to the base optimizer.
        """
        super().__init__(**options)
        self.accumulation = accumulation
        self._accumulation = None
        self._gradients = None

    def apply_gradients(
        self, gradients_variables: list[tuple[tf.Tensor, tf.Tensor]]
    ) -> tf.Tensor:
        """Apply the gradients according to the accumulation scheme."""
        # Split off the gradients from the trainable variables.
        gradients, variables = zip(*list(gradients_variables))
        # Perform the initialization if needed.
        with tf.init_scope():
            self.build(variables)
        first = self._accumulation % self.accumulation == 0
        last = (self._accumulation + 1) % self.accumulation == 0
        # Add the new gradients to the old ones with resetting if needed.
        for gradient, increment in zip(self._gradients, gradients):
            gradient.assign(tf.cast(~first, tf.float32) * gradient + increment)
        # Apply the average accumulated gradients to the trainable variables.
        gradients = [gradient / self.accumulation for gradient in self._gradients]
        super().apply_gradients(zip(gradients, variables))
        # Decrement the base counter incremented by the application if needed.
        self.iterations.assign_sub(tf.cast(~last, tf.int64))
        # Increment the accumulation counter.
        self._accumulation.assign_add(1)
        return self.iterations

    def update_step(self, gradient: tf.Tensor, variable: tf.Tensor) -> None:
        """Update the trainable variable with the gradient."""
        update_step = super().update_step
        last = (self._accumulation + 1) % self.accumulation == 0
        # Allow the update to happen only at the end of each cycle.
        tf.cond(last, lambda: update_step(gradient, variable), lambda: None)

    def build(self, variables: list[tf.Tensor]) -> None:
        """Initialize the internal state."""
        super().build(variables)
        if self._gradients is None:
            # Create a counter for tracking accumulation.
            self._accumulation = self.add_variable(shape=(), dtype=tf.int64)
            # Allocate memory for accumulation.
            self._gradients = [
                self.add_variable_from_reference(
                    model_variable=variable,
                    variable_name="gradient",
                )
                for variable in variables
            ]

It is important to note that the learning rate is not held constant during accumulation. However, since it is not expected to change much from one iteration to another, it is an adequate simplification.
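The reason averaging the accumulated gradients is sound can be checked numerically. The following is a minimal sketch in plain Python, not the TensorFlow optimizer above: the one-parameter model, the data, and all names are made up for illustration, and the batches are assumed to be of equal size.

```python
def gradient(weight, batch):
    """Gradient of the mean squared error of y = weight * x over a batch."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs standing in for one large batch.
data = [
    (1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
    (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0),
]
weight = 0.5
accumulation = 4
size = len(data) // accumulation

# The gradient computed over the large batch in one go.
full = gradient(weight, data)

# The same gradient obtained by accumulating over smaller batches.
accumulated = 0.0
for i in range(accumulation):
    batch = data[i * size:(i + 1) * size]
    accumulated += gradient(weight, batch)
averaged = accumulated / accumulation

print(full, averaged)
```

With equally sized batches and a loss that is a mean over examples, the two numbers agree up to floating-point error, which is why the optimizer divides the accumulated sum by the number of accumulation steps.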

Acknowledgments

I would like to thank André Pedersen, Axel Roebel, and Tor-Arne Nordmo for their help with the implementation.

Relative positional embedding for any attention mechanism

2024-01-17T07:00:00+00:00

In Shaw et al. (2018), the authors introduce relative positional embedding for self-attention in transformer models, and in Huang et al. (2018), the authors present a memory-efficient approach to calculating this embedding in decoder blocks, in which the self-attention is causal. In this article, the approach is generalized to any attention mechanism, be it self- or cross-attention, full or causal.

Background

The classical attention is formalized as follows:

\[A = \text{softmax}\left( \frac{QK^{T}}{\sqrt{n_d}} \right) V\]

where \(K\), \(V\), and \(Q\) are the keys, values, and queries, respectively. The keys and values are of shape \(n_s \times n_h \times n_{t_1} \times n_d\) where \(n_s\) is the batch size (s for space), \(n_h\) is the number of attention heads, \(n_{t_1}\) is the window size (t for time) of the input sequence, and \(n_d\) is the head size. The queries are of shape \(n_s \times n_h \times n_{t_2} \times n_d\) where \(n_{t_2}\) is the window size of the output sequence.

The relative attention obtains one additional term in the numerator:

\[A = \text{softmax}\left( \frac{QK^T + S}{\sqrt{n_d}} \right) V. \tag{1}\]

In the above, \(S\) is of shape \(n_s \times n_h \times n_{t_2} \times n_{t_1}\) and calculated based on \(Q\) and a matrix \(E\) of shape \(n_d \times n_{t_3}\) containing relative positional embeddings. The typical context is causal self-attention, in which \(n_{t_3}\) is thought of as the maximum allowed length of the input sequence and set to \(n_{t_1}\), with the interpretation that the embeddings are running from position \(-n_{t_1} + 1\) (the most distant past) up to \(0\) (the present moment). Then \(S\) is a specific arrangement of the inner products between the queries in \(Q\) and the embeddings in \(E\) so as to respect the arrangement in \(QK^T\).

The original and the more memory-efficient calculations of \(S\) in the case of causal attention are illustrated in the figure below, which is taken from Huang et al. (2018).

The matrix on the far right shows how \(S\) is arranged. Since the use case is causal attention, the upper triangle above the main diagonal (gray circles) is irrelevant and can contain arbitrary values, which it does in the algorithm proposed in Huang et al. (2018). The main diagonal (green circles) contains the inner products of the queries and the embedding corresponding to position \(0\). The first subdiagonal (pink circles) contains the inner products of the embedding corresponding to position \(-1\) and all the queries except for the first one, which has no past. It continues in this way down to position \(-n_{t_1} + 1\), in which case only the last query is involved, since it comes last in the sequence and has the longest past.

The calculation given in Huang et al. (2018) reduces the intermediate memory requirement from \(\mathcal{O}(n_h \, n_d \, n_t^2)\) to \(\mathcal{O}(n_h \, n_d \, n_t)\) where \(n_t\) is a general sequence length. However, it is limited to self-attention with causal connectivity, which is what is found in decoder blocks. It is not suitable for other attention patterns. Therefore, it cannot be used in, for instance, encoder blocks and decoder blocks with cross-attention, which usually have non-causal attention. In what follows, the limitation is lifted.

Algorithm

Let us extend \(E\) to be of shape \(n_d \times (2 n_{t_3} - 1)\) so that it has an embedding for any relative position, not only when looking back into the past but also when looking forward into the future, with \(n_{t_3}\) being the maximum allowed length of the input sequence as before, that is, \(n_{t_1} \leq n_{t_3}\). Let us also interpret \(E\)’s columns as running from position \(n_{t_3} - 1\) (the most distant future) to position \(-n_{t_3} + 1\) (the most distant past). For instance, when the output sequence is of length \(n_{t_3}\) (the longest possible), the first query (position \(0\)) will be “interested” only in columns \(0\) through \(n_{t_3} - 1\) inclusive, while the last (position \(n_{t_3} - 1\)) only in columns \(n_{t_3} - 1\) through \(2 n_{t_3} - 2\) inclusive.

Similarly to Huang et al. (2018), we note that multiplying \(Q\) by \(E\) results in a matrix that contains all the inner products necessary for assembling \(S\) in the general case. For instance, for \(n_{t_3} = 4\) and dropping the batch and head dimensions for clearer visualization, the product is as follows:

\[QE = \left( \begin{matrix} s_{0 + 3} & s_{0 + 2} & s_{0 + 1} & s_{0 + 0} & & & \\ & s_{1 + 2} & s_{1 + 1} & s_{1 + 0} & s_{1 - 1} & & \\ & & s_{2 + 1} & s_{2 + 0} & s_{2 - 1} & s_{2 - 2} & \\ & & & s_{3 + 0} & s_{3 - 1} & s_{3 - 2} & s_{3 - 3} \\ \end{matrix} \right)\]

where \(s_{i + t}\) denotes query \(i\) embedded to look at relative time \(t\), that is, the inner product between the query at position \(i\) and the embedding corresponding to a relative attention shift of \(t\), which is stored in column \(n_{t_3} - 1 - t\) of \(E\). For instance, for \(s_{2 - 1}\) with \(n_{t_3} = 4\) still, the inner product is between row \(2\) of \(Q\) and column \(4 - 1 - (-1) = 4\) of \(E\).

The target arrangement is then simply the one where we stack the “interesting” diagonals of \(QE\) on top of each other, from diagonal \(0\) (the main diagonal) at the bottom to diagonal \(n_{t_3} - 1\) (the rightmost relevant superdiagonal) at the top

\[\bar{S} = \left( \begin{matrix} s_{0 + 0} & s_{1 - 1} & s_{2 - 2} & s_{3 - 3} \\ s_{0 + 1} & s_{1 + 0} & s_{2 - 1} & s_{3 - 2} \\ s_{0 + 2} & s_{1 + 1} & s_{2 + 0} & s_{3 - 1} \\ s_{0 + 3} & s_{1 + 2} & s_{2 + 1} & s_{3 + 0} \\ \end{matrix} \right)\]

and then transpose the result

\[S = \left( \begin{matrix} s_{0 + 0} & s_{0 + 1} & s_{0 + 2} & s_{0 + 3} \\ s_{1 - 1} & s_{1 + 0} & s_{1 + 1} & s_{1 + 2} \\ s_{2 - 2} & s_{2 - 1} & s_{2 + 0} & s_{2 + 1} \\ s_{3 - 3} & s_{3 - 2} & s_{3 - 1} & s_{3 + 0} \\ \end{matrix} \right).\]

More generally, the algorithm can be summarized as follows:

\[S = \text{transpose}\left( \text{diagonal}\left( QE, \, \text{lower}=0, \, \text{upper}=n_{t_3} - 1 \right) \right)\]

where \(\text{diagonal}\) is a function taking a tensor and stacking its diagonals (specified by a range with two offsets relative to the main diagonal) from the bottom up, and \(\text{transpose}\) is a function taking a tensor and transposing it. Both functions operate on the last two dimensions of the given tensor. The resulting matrix can then be plugged into Equation (1) to complete the calculation.

In case the keys and values are shorter than the maximum allowed relative position, that is, \(n_{t_1} < n_{t_3}\), \(S\) should be truncated to its intended shape, \(n_s \times n_h \times n_{t_2} \times n_{t_1}\):

\[S = \text{truncate}\left( \text{transpose}\left( \text{diagonal}\left( QE, \, \text{lower}=0, \, \text{upper}=n_{t_3} - 1 \right) \right), \text{keep} = n_{t_1} \right)\]

where \(\text{truncate}\) is a function taking a tensor and keeping only the specified number of its first elements in the last dimension, discarding the rest.
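The assembly of \(S\) can be sketched in plain Python as follows. This is only an illustration under assumptions: the batch and head dimensions are dropped, matrices are lists of lists, and the helpers `matmul`, `diagonal`, `transpose`, and `relative_scores` are hypothetical stand-ins for the corresponding tensor operations. It also assumes \(n_{t_2} \leq n_{t_3}\) so that every requested diagonal spans all rows of \(QE\).

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)] for row in A]


def diagonal(M, lower, upper):
    """Stack the diagonals upper, ..., lower of M from top to bottom."""
    return [[M[i][i + k] for i in range(len(M))] for k in range(upper, lower - 1, -1)]


def transpose(M):
    """Swap the two dimensions of a matrix."""
    return [list(row) for row in zip(*M)]


def relative_scores(Q, E, n_t3, n_t1):
    """Assemble S from QE by stacking diagonals, transposing, and truncating."""
    S = transpose(diagonal(matmul(Q, E), lower=0, upper=n_t3 - 1))
    return [row[:n_t1] for row in S]


random.seed(0)
n_d, n_t1, n_t2, n_t3 = 3, 3, 4, 4
Q = [[random.gauss(0, 1) for _ in range(n_d)] for _ in range(n_t2)]
E = [[random.gauss(0, 1) for _ in range(2 * n_t3 - 1)] for _ in range(n_d)]
S = relative_scores(Q, E, n_t3, n_t1)

# Direct definition: query i and key j are separated by the relative shift
# t = j - i, whose embedding sits in column n_t3 - 1 - t of E.
expected = [
    [sum(Q[i][d] * E[d][n_t3 - 1 - (j - i)] for d in range(n_d)) for j in range(n_t1)]
    for i in range(n_t2)
]
```

The comparison with `expected` spells out what the arrangement achieves: entry \((i, j)\) of \(S\) pairs query \(i\) with the embedding of the relative shift \(j - i\), which can be negative (the past) as well as positive (the future).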

It can be seen that the algorithm has the same intermediate memory requirement as the one proposed in Huang et al. (2018), that is, \(\mathcal{O}(n_h \, n_d \, n_t)\); however, its application scope is larger.

Implementation

In TensorFlow, the algorithm can be implemented as an embedding layer as follows:

class RelativePositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, head_size: int, sequence_length: int) -> None:
        super().__init__()
        self.projection = self.add_weight(
            shape=(head_size, 2 * sequence_length - 1),
            initializer="glorot_uniform",
            trainable=True,
        )
        self.sequence_length = sequence_length

    def call(self, Q: tf.Tensor) -> tf.Tensor:
        S = tf.matmul(Q, self.projection)
        S = tf.linalg.diag_part(S, k=(0, self.sequence_length - 1))
        S = tf.transpose(S, perm=[0, 1, 3, 2])
        return S

The above layer can be invoked as part of an attention layer as illustrated below:

class Attention(tf.keras.layers.Layer):
    def __init__(self, head_size: int, sequence_length: int) -> None:
        super().__init__()
        self.head_size = head_size
        self.positional_embedding = RelativePositionalEmbedding(
            head_size=head_size,
            sequence_length=sequence_length,
        )

    def call(self, K: tf.Tensor, V: tf.Tensor, Q: tf.Tensor) -> tf.Tensor:
        # TODO: Add permutation if needed.
        S = self.positional_embedding(Q)
        W = tf.matmul(Q, K, transpose_b=True)
        W = W + S[:, :, :, : K.shape[2]]
        W = W * self.head_size**-0.5
        # TODO: Add masking if needed.
        W = tf.nn.softmax(W, axis=-1)
        # TODO: Add dropout if needed.
        A = tf.matmul(W, V)
        # TODO: Add dropout if needed.
        return A

References

Huang et al., “Music transformer: Generating music with long-term structure,” Google Brain, 2018.
Shaw et al., “Self-attention with relative position representations,” Google Brain, 2018.

Breaking sticks, or estimation of probability distributions using the Dirichlet process

2021-01-25T06:00:00+00:00

Recall the last time you wanted to understand the distribution of given data. One alternative was to plot a histogram. However, it resulted in frustration due to the choice of the number of bins to use, which led to drastically different outcomes. Another alternative was kernel density estimation. Despite having a similar choice to make, it has the advantage of producing smooth estimates, which are more realistic for continuous quantities with regularities. However, kernel density estimation was unsatisfactory too: it did not aid in understanding the underlying structure of the data and, moreover, provided no means of quantifying the uncertainty associated with the results. In this article, we discuss a Bayesian approach to the estimation of data-generating distributions that addresses the aforementioned concerns.

The approach we shall discuss is based on the family of Dirichlet processes. How specifically such processes are constructed will be described in the next section; here, we focus on the big picture.

A Dirichlet process is a stochastic process, that is, an indexed sequence of random variables. Each realization of this process is a discrete probability distribution, which makes the process a distribution over distributions, similarly to a Dirichlet distribution. The process has only one parameter: a measure \(\nu: \mathcal{B} \to [0, \infty]\) in a suitable finite measure space \((\mathcal{X}, \mathcal{B}, \nu)\) where \(\mathcal{X}\) is a set, and \(\mathcal{B}\) is a \(\sigma\)-algebra on \(\mathcal{X}\). We shall adopt the following notation:

\[P \sim \text{Dirichlet Process}(\nu)\]

where \(P\) is a random probability distribution that is distributed according to the Dirichlet process. Note that measure \(\nu\) does not have to be a probability measure; that is, \(\nu(\mathcal{X}) = 1\) is not required. To obtain a probability measure, one can divide \(\nu\) by the total volume \(\lambda = \nu(\mathcal{X})\):

\[P_0(\cdot) = \frac{1}{\lambda} \nu(\cdot).\]

Since this normalization is always possible, it is common and convenient to replace \(\nu\) with \(\lambda P_0\) and consider the process to be parametrized by two quantities instead of one:

\[P \sim \text{Dirichlet Process}(\lambda P_0).\]

Parameter \(\lambda\) is referred to as the concentration parameter of the process.

There are two major ways of using the Dirichlet process for estimating distributions: as a direct prior for the data at hand and as a mixing prior. We begin with the former.

Direct prior

Given a data set of \(n\) observations \(\{ x_i \}_{i = 1}^n\), a Dirichlet process can be used as a prior:

\[\begin{align} x_i | P_x & \sim P_x, \text{ for } i = 1, \dots, n; \text{ and} \\ P_x & \sim \text{Dirichlet Process}(\lambda P_0). \tag{1} \end{align}\]

It is important to realize that the \(x_i\)’s are assumed to be distributed not according to the Dirichlet process but according to a distribution drawn from the Dirichlet process. Parameter \(\lambda\) allows one to control the strength of the prior: the larger it is, the more shrinkage toward the prior is induced.

Inference

Due to the conjugacy property of the Dirichlet process in the above setting, the posterior is also a Dirichlet process and has the following simple form:

\[P_x | \{ x_i \}_{i = 1}^n \sim \text{Dirichlet Process}\left( \lambda P_0 + \sum_{i = 1}^n \delta_{x_i} \right). \tag{2}\]

That is, the total volume and normalized measure are updated as follows:

\[\begin{align} \lambda & := \lambda + n \quad \text{and} \\ P_0 & := \frac{\lambda}{\lambda + n} P_0 + \frac{1}{\lambda + n} \sum_{i = 1}^n \delta_{x_i}. \end{align}\]

Here, \(\delta_x(\cdot)\) is the Dirac measure, meaning that \(\delta_x(X) = 1\) if \(x \in X\) for any \(X \subseteq \mathcal{X}\), and otherwise, it is zero. It can be seen in Equation (2) that the base measure has simply been augmented with unit masses placed at the \(n\) observed data points.
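To make the update concrete, drawing an atom from the updated normalized measure amounts to flipping a biased coin: with probability \(\lambda / (\lambda + n)\), the atom comes from the prior \(P_0\), and otherwise, it is one of the observations chosen uniformly at random. The following is a minimal sketch assuming a Gaussian \(P_0\); the function name, the data, and the parameter values are made up for illustration.

```python
import random


def sample_posterior_atom(data, lam, mu_0, sigma_0, rng):
    """Draw one atom from the updated normalized base measure."""
    n = len(data)
    # With probability lam / (lam + n), fall back to the prior (assumed Gaussian).
    if rng.random() < lam / (lam + n):
        return rng.gauss(mu_0, sigma_0)
    # Otherwise, pick one of the observations uniformly at random.
    return rng.choice(data)


rng = random.Random(0)
data = [9.5, 19.8, 20.3, 22.1, 33.0]  # Made-up observations.
atoms = [sample_posterior_atom(data, 1.0, 20.0, 5.0, rng) for _ in range(1000)]
# The share of atoms that coincide with observations; it should be in the
# vicinity of n / (lam + n).
share = sum(atom in data for atom in atoms) / len(atoms)
```

The larger \(\lambda\) is relative to \(n\), the more often the prior is consulted, which is the shrinkage mentioned earlier.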

The main question now is how to draw samples from a Dirichlet process given \(\lambda\) and \(P_0\).

As noted earlier, a draw from a Dirichlet process is a discrete probability distribution \(P_x\). The probability measure of this distribution admits the following representation:

\[P_x(\cdot) = \sum_{i = 1}^\infty p_i \delta_{x_i}(\cdot) \tag{3}\]

where \(\{ p_i \}\) is a set of probabilities that sum up to one, and \(\{ x_i \}\) is a set of points in \(\mathcal{X}\). Such a draw can be obtained using the so-called stick-breaking construction, which prescribes \(\{ p_i \}\) and \(\{ x_i \}\). To begin with, for practical computations, the infinite summation is truncated to retain only the first \(m\) elements:

\[P_x(\cdot) = \sum_{i = 1}^m p_i \delta_{x_i}(\cdot).\]

Atoms \(\{ x_i \}_{i = 1}^m\) are drawn independently from the normalized base measure \(P_0\). The calculation of probabilities \(\{ p_i \}\) is more elaborate, and this is where the construction and this article get their name, “stick breaking.” Imagine a stick of unit length, representing the total probability. The procedure is to keep breaking the stick into two parts where, for each iteration, the left part yields \(p_i\), and the right one, the remainder, is carried over to the next iteration. How much to break off is decided on by drawing \(m\) independent realizations from a carefully chosen beta distribution:

\[q_i \sim \text{Beta}(1, \lambda), \text{ for } i = 1, \dots, m. \tag{4}\]

All of them lie in the unit interval and are the proportions to break off of the remainder. When \(\lambda = 1\), these proportions (of the remainder) are uniformly distributed. When \(\lambda < 1\), the probability mass is shifted to the right, which means that there are likely to be a small number of large pieces, covering virtually the entire stick. When \(\lambda > 1\), the probability mass is shifted to the left, which means that there are likely to be a large number of small pieces, struggling to reach the end of the stick.

Formally, the desired probabilities are given by the following expression:

\[p_i = q_i \prod_{j = 1}^{i - 1} (1 - q_j), \text{ for } i = 1, \dots, m,\]

which, as noted earlier, are the left parts of the remainder of the stick during each iteration. For instance, \(p_1 = q_1\), \(p_2 = q_2 (1 - q_1)\), and so on. Due to the truncation, the probabilities \(\{ p_i \}_{i = 1}^m\) do not sum up to one, and it is common to set \(q_m := 1\) so that \(p_m\) takes up the remaining probability mass.

To recapitulate, a single draw from a Dirichlet process is obtained in two steps: prescribe atoms \(\{ x_i \}\) via draws from the normalized base measure and prescribe the corresponding probabilities \(\{ p_i \}\) via the stick-breaking construction. The two give a complete description of a discrete probability distribution. Recall that this distribution is still a single draw. By repeating this process many times, one obtains the distribution of this distribution, which can be used to, for instance, quantify uncertainty in the estimation.
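The two steps can be sketched in plain Python as follows. This is an illustration rather than the code behind the figures: the function names are hypothetical, and the normalized base measure is assumed to be Gaussian with made-up parameters.

```python
import random


def stick_breaking(lam, m, rng):
    """Compute m probabilities via the truncated stick-breaking construction."""
    probabilities = []
    remainder = 1.0
    for i in range(m):
        # The last proportion is set to one so that the probabilities sum to one.
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        probabilities.append(q * remainder)
        remainder *= 1.0 - q
    return probabilities


def draw_distribution(lam, m, mu_0, sigma_0, rng):
    """Draw one discrete distribution: atoms and their probabilities."""
    atoms = [rng.gauss(mu_0, sigma_0) for _ in range(m)]
    return atoms, stick_breaking(lam, m, rng)


rng = random.Random(42)
atoms, probabilities = draw_distribution(lam=5.0, m=50, mu_0=20.0, sigma_0=5.0, rng=rng)
```

Calling `draw_distribution` repeatedly yields a collection of discrete distributions, from which, for instance, pointwise bands for the cumulative distribution function can be computed.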

Illustration

It is time to demonstrate how the Dirichlet process behaves as a direct prior. To this end, we shall use a data set containing velocities of “82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region.” It was studied in Roeder (1990), which gives us a reference point.

For the curious reader, the source code of this notebook along with auxiliary scripts that are used for performing all the calculations presented below can be found on GitHub.

The empirical cumulative distribution function of the velocity is as follows:

Already here, it is apparent that the distribution is multimodal: there are two distinct regions, one to the left and one to the right, where the curve is flat, meaning there are no observations there. The proverbial histogram gives a confirmation:

It can be seen that there is a handful of galaxies moving relatively slowly and relatively fast compared to the majority located somewhere in the middle around twenty thousand kilometers per second. For completeness, kernel density estimation results in the following plot:

How many clusters of galaxies are there? What are their average velocities? How uncertain are these estimates? Our goal is to answer these questions by virtue of the Dirichlet process.

Now that the intention is to apply the presented theory in practice, we have to make all the choices we have conveniently glossed over. Specifically, \(P_0\) has to be chosen, and we shall use the following:

\[P_0(\cdot) = \text{Gaussian}(\, \cdot \, | \mu_0, \sigma_0^2). \tag{5}\]

In the above, \(\text{Gaussian}(\cdot)\) refers to the probability measure of a Gaussian distribution with parameters \(\mu_0\) and \(\sigma_0\). In addition to these two, there is one more: \(\lambda\). We shall set \(\mu_0\) and \(\sigma_0\) to 20 and 5, respectively—which correspond roughly to the mean and standard deviation of the data—and present results for different \(\lambda\)’s to investigate how the prior volume affects shrinkage toward the prior.

First, we do not condition on the data to get a better understanding of the prior itself, which corresponds to Equation (1). The following figure shows a single draw from four Dirichlet processes with different \(\lambda\)’s (the gray curves show the cumulative distribution function of the data as a reference):

It can be seen that the larger the prior volume, the smoother the curve. This is because larger \(\lambda\)’s “break” the stick into more pieces, allowing the normalized base measure to be extensively sampled, which, in the limit, converges to this very measure; see Equation (5).

Now, conditioning on the observed velocities of galaxies, that is, sampling as shown in Equation (2), we obtain the following draws from the posterior Dirichlet processes with different \(\lambda\)’s:

When the prior volume is small, virtually no data points come from \(P_0\); instead, they are mostly uniform draws from the observed data set, leading to a curve that is nearly indistinguishable from the one of the data (the top curve). As \(\lambda\) gets larger, the prior gets stronger, and the estimate gets shrunk toward it, up to a point where the observations appear to be entirely ignored (the bottom curve).

The above model has a serious limitation: it assumes a discrete probability distribution for the data-generating process, which can be seen in the prior and posterior given in Equations (1) and (2), respectively, and it is also apparent in the decomposition given in Equation (3). In some cases, it might be appropriate; however, there are arguably more situations where it is inadequate, including the running example.

Mixing prior

Instead of using a Dirichlet process as a direct prior for the given data, it can be used as a prior for mixing distributions from a given family. The resulting posterior will then naturally inherit the properties of the family, such as continuity. The general structure is as follows:

\[\begin{align} x_i | \theta_i & \sim P_x \left( \theta_i \right), \text{ for } i = 1, \dots, n; \tag{6} \\ \theta_i | P_\theta & \sim P_\theta, \text{ for } i = 1, \dots, n; \text{ and} \\ P_\theta & \sim \text{Dirichlet Process}(\lambda P_0). \\ \end{align}\]

The \(i\)th data point, \(x_i\), is distributed according to distribution \(P_x\) with parameters \(\theta_i\). For instance, \(P_x\) could refer to the Gaussian family with \(\theta_i = (\mu_i, \sigma_i)\) identifying a particular member of the family by its mean and standard deviation. Parameters \(\{ \theta_i \}_{i = 1}^n\) are unknown and distributed according to distribution \(P_\theta\). Distribution \(P_\theta\) is not known either and gets a Dirichlet process prior with measure \(\lambda P_0\).

It can be seen in Equation (6) that each data point can potentially have its own unique set of parameters. However, this is not what usually happens in practice. If \(\lambda\) is reasonably small, the vast majority of the stick—the one we explained how to break in the previous section—tends to be consumed by a small number of pieces. This makes many data points share the same parameters, which is akin to clustering. In fact, clustering is a prominent use case for the Dirichlet process.

Inference

Unlike the previous model, there is no conjugacy in this case, and hence the posterior is not a Dirichlet process. There is, however, a simple Markov chain Monte Carlo sampling strategy based on the stick-breaking construction. It belongs to the class of Gibbs samplers and is as follows.

Similarly to Equation (3), we have the following decomposition:

\[P_m(\cdot) = \sum_{i = 1}^\infty p_i P_x(\cdot | \theta_i)\]

where \(P_m\) is the probability measure of the mixture. As before, the infinite decomposition has to be made finite to be usable in practice:

\[P_m(\cdot) = \sum_{i = 1}^m p_i P_x(\cdot | \theta_i).\]

Here, \(m\) represents an upper limit on the number of mixture components. Each data point \(x_i\), for \(i = 1, \dots, n\), is mapped to one of the \(m\) components, which we denote by \(k_i \in \{ 1, \dots, m \}\). In other words, \(k_i\) takes values from 1 to \(m\) and gives the index of the component of the \(i\)th observation.

There are \(m + m \times |\theta| + n\) parameters to be inferred where \(|\theta|\) denotes the number of parameters of \(P_x\). These parameters are \(\{ p_i \}_{i = 1}^m\), \(\{ \theta_i \}_{i = 1}^m\), and \(\{ k_i \}_{i = 1}^n\). As usual in Gibbs sampling, the parameters assume arbitrary but compatible initial values. The sampler has the following three steps.

First, given \(\{ p_i \}\), \(\{ \theta_i \}\), and \(\{ x_i \}\), the mapping of the observations to the mixture components, \(\{ k_i \}\), is updated as follows:

\[k_i \sim \text{Categorical}\left( m, \left\{ \frac{p_j P_x(x_i | \theta_j)}{\sum_{l = 1}^m p_l P_x(x_i | \theta_l)} \right\}_{j = 1}^m \right), \text{ for } i = 1, \dots, n.\]

That is, \(k_i\) is a draw from a categorical distribution with \(m\) categories whose unnormalized probabilities are given by \(p_j P_x(x_i | \theta_j)\), for \(j = 1, \dots, m\).
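A minimal sketch of this first step is given below, assuming that \(P_x\) is Gaussian with \(\theta_j = (\mu_j, \sigma_j)\). The function names and the toy values are made up, and the components are indexed from zero rather than from one.

```python
import math
import random


def gaussian_density(x, mu, sigma):
    """Evaluate the Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def sample_assignments(data, probabilities, parameters, rng):
    """Assign each observation to a component given the current state."""
    assignments = []
    for x in data:
        # The unnormalized probabilities; random.choices normalizes them itself.
        weights = [
            p * gaussian_density(x, mu, sigma)
            for p, (mu, sigma) in zip(probabilities, parameters)
        ]
        assignments.append(rng.choices(range(len(weights)), weights=weights)[0])
    return assignments


rng = random.Random(3)
data = [10.0, 10.2, 30.0]  # Made-up observations.
probabilities = [0.5, 0.5]
parameters = [(10.0, 0.5), (30.0, 0.5)]
assignments = sample_assignments(data, probabilities, parameters, rng)
```

In this toy setting, the two components are so far apart that the assignment is effectively deterministic. In a real run, an observation far from all components could make every weight underflow to zero, which is why working with logarithms of densities is a common precaution.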

Second, given \(\{ k_i \}\), the probabilities of the mixture components, \(\{ p_i \}\), are updated using the stick-breaking construction described earlier. This time, however, the beta distribution for sampling \(\{ q_i \}\) in Equation (4) is replaced with the following:

\[q_i \sim \text{Beta}\left( 1 + n_i, \lambda + \sum_{j = i + 1}^m n_j \right), \text{ for } i = 1, \dots, m,\]

where

\[n_i = \sum_{j = 1}^n I_{\{i\}}(k_j), \text{ for } i = 1, \dots, m,\]

is the number of data points that are currently allocated to component \(i\). Here, \(I_A\) is the indicator function of a set \(A\). As before, in order for the \(p_i\)’s to sum up to one, it is common to set \(q_m := 1\).
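The second step can be sketched as follows. The function name and the toy assignments are hypothetical, and the components are indexed from zero.

```python
import random


def update_probabilities(assignments, lam, m, rng):
    """Resample the stick-breaking proportions and probabilities."""
    # Count the observations currently allocated to each component.
    counts = [0] * m
    for k in assignments:
        counts[k] += 1
    remaining = len(assignments)
    remainder = 1.0
    proportions, probabilities = [], []
    for i in range(m):
        # After the subtraction, this is the number of observations in the
        # components that come after component i.
        remaining -= counts[i]
        q = 1.0 if i == m - 1 else rng.betavariate(1.0 + counts[i], lam + remaining)
        proportions.append(q)
        probabilities.append(q * remainder)
        remainder *= 1.0 - q
    return proportions, probabilities


rng = random.Random(7)
proportions, probabilities = update_probabilities([0, 0, 0, 1, 1, 4], lam=1.0, m=5, rng=rng)
```

The proportions are returned alongside the probabilities because they are needed again if the concentration parameter is to be inferred as well.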

Third, given \(\{ k_i \}\) and \(\{ x_i \}\), the parameters of the mixture components, \(\{ \theta_i \}\), are updated. This is done by sampling from the posterior distribution of each component. In this case, the posterior is a prior of choice that is updated using the data points that are currently allocated to the corresponding component. To streamline this step, a conjugate prior for the data distribution, \(P_x\), is commonly utilized, which we shall illustrate shortly.

To recapitulate, a single draw from the posterior is obtained in a number of steps where parameters or groups of parameters are updated in turn, while treating the other parameters as known. This Gibbs procedure is very flexible. Other parameters can be inferred too, instead of setting them to fixed values. An important example is the concentration parameter, \(\lambda\). This parameter controls the formation of clusters, and one might let the data decide what the value should be, in which case a step similar to the third one is added to the procedure to update \(\lambda\). This will also be illustrated below.

Illustration

We continue working with the galaxy data. For concreteness, consider the following choices:

\[\begin{align} \theta_i &= (\mu_i, \sigma_i), \text{ for } i = 1, \dots, n; \\ P_x (\theta_i) &= \text{Gaussian}(\mu_i, \sigma_i^2), \text{ for } i = 1, \dots, n; \text{ and} \\ P_0(\cdot) &= \text{Gaussian–Scaled-Inverse-}\chi^2(\, \cdot \, | \mu_0, \kappa_0, \nu_0, \sigma_0^2). \end{align} \tag{7}\]

In the above, \(\text{Gaussian–Scaled-Inverse-}\chi^2(\cdot)\) refers to the probability measure of a bivariate distribution composed of a conditional Gaussian and an unconditional scaled inverse chi-squared distribution. Some intuition about this distribution can be built via the following decomposition:

\[\begin{align} \mu_i | \sigma_i^2 & \sim \text{Gaussian}\left(\mu_0, \frac{\sigma_i^2}{\kappa_0}\right) \text{ and} \\ \sigma_i^2 & \sim \text{Scaled-Inverse-}\chi^2(\nu_0, \sigma_0^2). \end{align} \tag{8}\]

This prior is a conjugate prior for a Gaussian data distribution with unknown mean and variance, which we assume here. This means that the posterior is also a Gaussian–scaled-inverse-chi-squared distribution. Given a data set with \(n\) observations \(x_1, \dots, x_n\), the four parameters of the prior are updated simultaneously (not sequentially) as follows:

\[\begin{align} \mu_0 & := \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{n}{\kappa_0 + n} \mu_x, \\ \kappa_0 & := \kappa_0 + n, \\ \nu_0 & := \nu_0 + n, \text{ and} \\ \sigma_0^2 & := \frac{1}{\nu_0 + n} \left( \nu_0 \sigma_0^2 + ss_x + \frac{\kappa_0 n}{\kappa_0 + n}(\mu_x - \mu_0)^2 \right) \end{align}\]

where \(\mu_x = \sum_{i = 1}^n x_i / n\) and \(ss_x = \sum_{i = 1}^n (x_i - \mu_x)^2\). It can be seen that \(\kappa_0\) and \(\nu_0\) act as counters of the number of observations; \(\mu_0\) is a weighted sum of two means; and \(\nu_0 \sigma_0^2\) is a sum of two sums of squares and a third term increasing the uncertainty due to the difference in the means. In the Gibbs sampler, each component (each cluster of galaxies) will have its own posterior based on the data points that are assigned to that component during each iteration of the process. Therefore, \(n\), \(\mu_x\), and \(ss_x\) will generally be different for different components and, moreover, will vary from iteration to iteration.
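The update and the subsequent sampling of a component’s parameters can be sketched as follows. The function names are hypothetical, and two assumptions go beyond the text: a component with no observations simply keeps the prior, and the scaled inverse chi-squared draw is obtained as \(\nu_0 \sigma_0^2\) divided by a chi-squared variate with \(\nu_0\) degrees of freedom.

```python
import math
import random


def update(mu_0, kappa_0, nu_0, sigma2_0, data):
    """Update the four parameters of the prior given a component's data."""
    n = len(data)
    if n == 0:
        # Assumption: an empty component falls back to the prior.
        return mu_0, kappa_0, nu_0, sigma2_0
    mu_x = sum(data) / n
    ss_x = sum((x - mu_x) ** 2 for x in data)
    # All four updates use the old values, that is, they are simultaneous.
    mu_n = (kappa_0 * mu_0 + n * mu_x) / (kappa_0 + n)
    sigma2_n = (
        nu_0 * sigma2_0 + ss_x + kappa_0 * n / (kappa_0 + n) * (mu_x - mu_0) ** 2
    ) / (nu_0 + n)
    return mu_n, kappa_0 + n, nu_0 + n, sigma2_n


def sample(mu_0, kappa_0, nu_0, sigma2_0, rng):
    """Draw a mean and a standard deviation given the four parameters."""
    # A chi-squared variate is a gamma variate with shape nu / 2 and scale 2.
    sigma2 = nu_0 * sigma2_0 / rng.gammavariate(nu_0 / 2.0, 2.0)
    mu = rng.gauss(mu_0, math.sqrt(sigma2 / kappa_0))
    return mu, math.sqrt(sigma2)


rng = random.Random(11)
posterior = update(20.0, 0.01, 3, 1.0, [19.0, 21.0])  # Made-up observations.
mu, sigma = sample(*posterior, rng)
```

In the Gibbs sampler, `update` would be called once per component and iteration with the observations currently allocated to that component, followed by `sample`.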

We set \(\mu_0\) to 20, which is roughly the mean of the data, and \(\nu_0\) to 3, which is the smallest integer that allows the scaled inverse chi-squared distribution to have a finite expectation. The choice of \(\kappa_0\) and \(\sigma_0\) is more subtle. Recall Equation (8). What we would like from the prior is to allow for free formation of clusters in a region generously covering the support of the data. To this end, the uncertainty in the mean, \(\mu_i\), has to be high; however, it should not come from \(\sigma_i\), since it would produce very diffuse clusters. We set \(\kappa_0\) to 0.01 to magnify the variance of \(\mu_i\) without affecting \(\sigma_i\), and \(\sigma_0\) to 1 to keep clusters compact.

Now, let us take a look at what the above choices entail. The following figure illustrates the prior for the mean of a component:

The negative part is unrealistic for velocity; however, it is rarely a problem in practice. What is important is that there is a generous coverage of the plausible values. The following figure shows the prior for the standard deviation of a component:

The bulk is below the standard deviation of the data; however, this is by choice: we expect more than one cluster of galaxies with similar velocities.

As mentioned earlier, we intend to include \(\lambda\) in the inference. First, we put the following prior:

\[\lambda \sim \text{Gamma}(\alpha_0, \beta_0). \tag{9}\]

Note this is the rate parameterization of the Gamma family. Conditionally, this is a conjugate prior with the following update rule for the two parameters:

\[\begin{align} \alpha_0 & := \alpha_0 + m - 1 \quad \text{and} \\ \beta_0 & := \beta_0 - \sum_{i = 1}^{m - 1} \ln(1 - q_i) \end{align}\]

where \(\{ q_i \}\) come from the stick-breaking construction. This is a fourth step in the Gibbs sampler. We set \(\alpha_0\) and \(\beta_0\) to 2 and 0.1, respectively, which entails the following prior assumption about \(\lambda\):

The parameter is allowed to vary freely from small to large values, as desired.
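The conditional update of the two parameters of the Gamma prior given earlier can be sketched as follows. The function name and the example fractions are hypothetical; the only inputs are the prior parameters and the \(m - 1\) stick-breaking fractions \(\{ q_i \}\).

```python
import math


def update_concentration_prior(alpha_0, beta_0, q):
    """Return the updated Gamma(shape, rate) parameters given stick fractions q."""
    # q holds the m - 1 stick-breaking fractions, each strictly between 0 and 1.
    alpha = alpha_0 + len(q)
    beta = beta_0 - sum(math.log(1.0 - q_i) for q_i in q)
    return alpha, beta


# Hypothetical fractions for m = 4 components (m - 1 = 3 breaks).
alpha, beta = update_concentration_prior(2.0, 0.1, [0.5, 0.5, 0.5])
```

Since \(\ln(1 - q_i)\) is negative, the rate can only grow; large fractions, which leave little of the stick for the remaining components, therefore pull \(\lambda\) toward smaller values.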

Having chosen all priors and their hyperparameters, we are ready to investigate the behavior of the entire model; see Equations (6), (7), and (9). In what follows, we shall limit the number of mixture components to 25; that is, \(m = 25\). Furthermore, we shall perform 2000 Gibbs iterations and discard the first half as a warm-up period. As before, we start without conditioning on the data to observe draws from the prior itself. The following figure shows two sample draws:

It can be seen that clusters of galaxies can appear anywhere in the region of interest and can be of various sizes. We conclude that the prior is adequate. When taking the observed velocities into account, we obtain a full posterior distribution in the form of 1000 draws. The following shows two random draws:

Indeed, mixture components have started to appear in the regions where there are observations.

Before we proceed to the final summary of results, it is prudent to inspect sample chains for a few parameters in order to ensure there are no problems with convergence to the stationary distribution. The following shows the number of occupied components among the 25 permitted:

The chain fluctuates around a fixed level without any prominent pattern, as it should. One can plot the actual marginal posterior distribution for the number of components; however, it is already clear that the distribution of the number of clusters of galaxies is mostly between 5 and 10 with a median of 7.

As for the concentration parameter, \(\lambda\), the chain is as follows:

The behavior is uneventful, which is a good sign.

Let us now take a look at the posterior distributions of the first seven components highlighted earlier (note the different scales on the vertical axes):

The components clearly change roles, which can be seen by the multimodal nature of the distributions. Component 1 is most often at 10 (times \(10^6\) m/s); however, it also peaks between 24 and 25 and even above 30. Components 2 and 3 are the most certain ones, which is due to a relatively large number of samples present in the corresponding region. They seem to exchange roles and capture velocities of around 20 and 23. Components 4 and 5, on the other hand, appear to play the same role. Unlike Component 1, they are most likely to be found at around 33. Components 6 and 7 are similar too. They seem to be responsible for the small formation to the left and right next to the bulk in the middle (at 16); recall the histogram of the data. The small formation on the other side of the bulk at around 26 is captured as well, which is mostly done by Component 6.

Lastly, we summarize the inference using the following figure where the median distribution and a 95% uncertainty band—composed of distributions at the 0.025 and 0.975 quantiles—are plotted:

In this view, only five components are visible to the naked eye. The median curve matches well the findings in Roeder (1990). Judging by the width of the uncertainty band, there are many plausible alternatives, and it is important to communicate this uncertainty to those who base decisions on the inference. The ability to quantify uncertainty with such ease is a prominent advantage of Bayesian inference.

Conclusion

In this article, the family of Dirichlet processes has been presented in the context of Bayesian inference. More specifically, it has been shown how a Dirichlet process can be utilized as a prior for an unknown discrete distribution and as a prior for mixing distributions from a given family. In both cases, it has been illustrated how to perform inference via a finite approximation and the stick-breaking construction.

Clearly, the overall procedure is more complicated than counting observations falling in a number of fixed bins, which is what a histogram does, or placing kernels all over the place, which is what a kernel density estimator does. However, “anything in life worth having is worth working for.” The advantages of the Bayesian approach include the ability to incorporate prior knowledge, which is crucial in situations with little data, and the ability to propagate and quantify uncertainty, which is a must.

Recall that the source code of this notebook along with auxiliary scripts that were used for performing the calculations presented above can be found on GitHub. Any feedback is welcome!

Acknowledgments

I would like to thank Mattias Villani for the insightful and informative graduate course in Bayesian statistics titled “Advanced Bayesian learning,” which was the inspiration behind writing this article, and for his guidance regarding the implementation.

References

Andrew Gelman et al., Bayesian Data Analysis, Chapman and Hall/CRC, 2014.
Kathryn Roeder, “Density estimation with confidence sets exemplified by superclusters and voids in galaxies,” Journal of the American Statistical Association, 1990.
Rick Durrett, Probability: Theory and Examples, Cambridge University Press, 2010.

Heteroscedastic Gaussian process regression

2020-06-22T06:00:00+00:00

Gaussian process regression is a nonparametric Bayesian technique for modeling relationships between variables of interest. The vast flexibility and rigorous mathematical foundation of this approach make it the default choice in many problems involving small- to medium-sized data sets. In this article, we illustrate how Gaussian process regression can be utilized in practice. To make the case more compelling, we consider a setting where linear regression would be inadequate. The focus will be not on getting the job done as fast as possible but on learning the technique and understanding the choices being made.

Data

Consider the following example taken from Semiparametric Regression by Ruppert et al.:

The figure shows 221 observations collected in a light detection and ranging experiment. Each observation can be interpreted as the sum of the true underlying response at the corresponding distance and random noise. It can be clearly seen that the variance of the noise varies with the distance: the spread is substantially larger toward the right-hand side. This phenomenon is known as heteroscedasticity. Homoscedasticity (the absence of heteroscedasticity) is one of the key assumptions of linear regression. Applying linear regression to the above problem would yield suboptimal results. The estimates of the regression coefficients would still be unbiased; however, the standard errors of the coefficients would be incorrect and hence misleading. A different modeling technique is needed in this case.

The above data set will be our running example. More formally and slightly more generally, we assume that there is a data set of \(m\) observations:

\[\left\{ (\mathbf{x}_i, y_i): \, \mathbf{x}_i \in \mathbb{R}^d; \, y_i \in \mathbb{R}; \, i = 1, \dots, m \right\}\]

where the independent variable, \(\mathbf{x}\), is \(d\)-dimensional, and the dependent variable, \(y\), is scalar. In the running example, \(d\) is 1, and \(m\) is 221. It is time for modeling.

Model

To begin with, consider the following model with additive noise:

\[y_i = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m. \tag{1}\]

In the above, \(f: \mathbb{R}^d \to \mathbb{R}\) represents the true but unknown underlying function, and \(\epsilon_i\) represents the perturbation of the \(i\)th observation by random noise. In the classical linear-regression setting, the unknown function is modeled as a linear combination of (arbitrary transformations of) the \(d\) covariates. Instead of assuming any particular functional form, we put a Gaussian process prior on the function:

\[f(\mathbf{x}) \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right).\]

The above notation means that, before observing any data, the function is a draw from a Gaussian process with zero mean and a covariance function \(k\). The covariance function dictates the degree of correlation between two arbitrary locations \(\mathbf{x}\) and \(\mathbf{x}'\) in \(\mathbb{R}^d\). For instance, a frequent choice for \(k\) is the squared-exponential covariance function:

\[k(\mathbf{x}, \mathbf{x}') = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right)\]

where \(\|\cdot\|_2\) stands for the Euclidean norm, \(\sigma_\text{process}^2\) is the variance (to see this, substitute \(\mathbf{x}\) for \(\mathbf{x}'\)), and \(\ell_\text{process}\) is known as the length scale. While the variance parameter is intuitive, the length-scale one requires an illustration. The parameter controls the speed with which the correlation fades with the distance. The following figure shows 10 random draws for \(\ell_\text{process} = 0.1\):

With \(\ell_\text{process} = 0.5\), the behavior changes to the following:

It can be seen that it takes a greater distance for a function with a larger length scale (bottom) to change to the same extent compared to a function with a smaller length scale (top).
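The effect of the length scale can also be checked numerically. Below is a minimal sketch of the squared-exponential covariance function in plain Python; the function name and the distance of 0.2 are chosen purely for illustration.

```python
import math


def squared_exponential(x, x_prime, sigma_process=1.0, ell_process=1.0):
    """Evaluate the squared-exponential covariance between two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return sigma_process**2 * math.exp(-squared_distance / (2.0 * ell_process**2))


# Correlation between two locations that are 0.2 apart (unit variance).
for ell_process in (0.1, 0.5):
    value = squared_exponential([0.0], [0.2], ell_process=ell_process)
    print(f"ell = {ell_process}: {value:.3f}")
```

With unit variance, the correlation at this distance is \(e^{-2} \approx 0.135\) for a length scale of 0.1 and \(e^{-0.08} \approx 0.923\) for a length scale of 0.5: the function values at the two locations are nearly unrelated in the former case and tightly coupled in the latter.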

Let us now return to Equation (1) and discuss the error terms, \(\epsilon_i\). In linear regression, they are modeled as independent identically distributed Gaussian random variables:

\[\epsilon_i \sim \text{Gaussian}\left( 0, \sigma_\text{noise}^2 \right), \text{ for } i = 1, \dots, m. \tag{2}\]

This is also the approach one can take with Gaussian process regression; however, one does not have to. There are reasons to believe the problem at hand is heteroscedastic, and it should be reflected in the model. To this end, the magnitude of the noise is allowed to vary with the covariates:

\[\epsilon_i | \mathbf{x}_i \sim \text{Gaussian}\left(0, \sigma^2_{\text{noise}, i}\right), \text{ for } i = 1, \dots, m. \tag{3}\]

The error terms are still independent (given the covariates) but not identically distributed. At this point, one has to make a choice about the dependence of \(\sigma_{\text{noise}, i}\) on \(\mathbf{x}_i\). This dependence could be modeled with another Gaussian process with an appropriate link function to ensure \(\sigma_{\text{noise}, i}\) is nonnegative. Another reasonable choice is a generalized linear model, which is what we shall use:

\[\ln \sigma^2_{\text{noise}, i} = \alpha_\text{noise} + \boldsymbol{\beta}^\intercal_\text{noise} \, \mathbf{x}_i, \text{ for } i = 1, \dots, m, \tag{4}\]

where \(\alpha_\text{noise}\) is the intercept of the regression line, and \(\boldsymbol{\beta}_\text{noise} \in \mathbb{R}^d\) contains the slopes.
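To make the role of the logarithmic link in Equation (4) concrete, here is a small sketch that maps the linear predictor to a standard deviation. The function name and the coefficient values are hypothetical and are not estimates from the data.

```python
import math


def noise_standard_deviation(x, alpha_noise, beta_noise):
    """Map the linear predictor of Equation (4) to a standard deviation."""
    log_variance = alpha_noise + sum(b * x_j for b, x_j in zip(beta_noise, x))
    # The exponential keeps the variance positive for any real coefficients.
    return math.sqrt(math.exp(log_variance))


# Hypothetical coefficients for a one-dimensional covariate.
left = noise_standard_deviation([0.0], alpha_noise=-4.0, beta_noise=[2.0])
right = noise_standard_deviation([1.0], alpha_noise=-4.0, beta_noise=[2.0])
```

A positive slope makes the noise grow with the covariate: with these values, the standard deviation rises from \(e^{-2} \approx 0.14\) at 0 to \(e^{-1} \approx 0.37\) at 1.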

Thus far, a model for the unknown function \(f\) and a model for the noise have been prescribed. In total, there are \(d + 3\) parameters: \(\sigma_\text{process}\), \(\ell_\text{process}\), \(\alpha_\text{noise}\), and \(\beta_{\text{noise}, i}\) for \(i = 1, \dots, d\). The first two are positive, and the rest are arbitrary. The final piece is prior distributions for these parameters.

The variance of the covariance function, \(\sigma^2_\text{process}\), corresponds to the amount of variance in the data that is explained by the Gaussian process. It poses no particular problem and can be tackled with a half-Gaussian or a half-Student’s t distribution:

\[\sigma_\text{process} \sim \text{Half-Gaussian}\left( 0, 1 \right).\]

The notation means that the standard Gaussian distribution is truncated at zero and renormalized. The nontrivial mass around zero implied by the prior is considered to be beneficial in this case.¹

A prior for the length scale of the covariance function, \(\ell_\text{process}\), should be chosen with care. Small values, especially those below the resolution of the data, give the Gaussian process extreme flexibility and easily lead to overfitting. Moreover, there are numerical ramifications of the length scale approaching zero: the quality of Hamiltonian Monte Carlo sampling degrades.² The bottom line is that a prior penalizing values close to zero is needed. A reasonable choice is an inverse gamma distribution:

\[\ell_\text{process} \sim \text{Inverse Gamma}\left( 1, 1 \right).\]

To understand the implications, let us perform a prior predictive check for this component in isolation:

It can be seen that the density is very low in the region close to zero, while being rather permissive to the right of that region, especially considering the scale of the distance in the data; recall the very first figure. Consequently, the choice is adequate.
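To put a number on this observation, note that, for unit shape and unit scale, the cumulative distribution function of the inverse gamma distribution has a closed form. This is a standard result rather than something taken from the accompanying notebook, and the thresholds below are arbitrary:

\[P(\ell_\text{process} \leq x) = \int_0^x t^{-2} e^{-1 / t} \, dt = e^{-1 / x}, \text{ for } x > 0.\]

For instance, the prior probability of a length scale below 0.1 is \(e^{-10} \approx 4.5 \times 10^{-5}\), whereas the probability of a length scale below 1 is \(e^{-1} \approx 0.37\).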

The choice of priors for the parameters of the noise is complicated by the nonlinear link function; see Equation (4). What is important to realize is that small amounts of noise correspond to negative values in the linear space, which is probably what one should be expecting given the scale of the response. Therefore, the priors should allow for large negative values. Let us make an educated assumption and perform a prior predictive check to understand the consequences. Consider the following:

\[\begin{align} \alpha_\text{noise} & \sim \text{Gaussian}\left( -1, 1 \right) \text{ and} \\ \beta_{\text{noise}, i} & \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align}\]

The density of \(\sigma_\text{noise}\) without considering the regression slopes is depicted below (note the logarithmic scale on the horizontal axis):

The variability in the intercept, \(\alpha_\text{noise}\), allows the standard deviation, \(\sigma_\text{noise}\), to comfortably vary from small to large values, keeping in mind the scale of the response. Here are two draws from the prior distribution of the noise, including Equations (3) and (4):

The large ones are perhaps unrealistic and could be addressed by further shifting the distribution of the intercept. However, they should not cause problems for the inference.
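The range of \(\sigma_\text{noise}\) implied by the prior on the intercept alone can be computed without sampling, since the link function is monotone. The sketch below uses only the Python standard library and ignores the slopes, as in the density shown earlier; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Prior of the intercept; the slopes are ignored (set to zero) here.
alpha_prior = NormalDist(mu=-1.0, sigma=1.0)

# sigma_noise = exp(alpha_noise / 2) is monotone in alpha_noise, so
# quantiles of the intercept map directly to quantiles of sigma_noise.
quantiles = {
    p: math.exp(alpha_prior.inv_cdf(p) / 2.0) for p in (0.025, 0.5, 0.975)
}
```

The prior median of the standard deviation is \(e^{-0.5} \approx 0.61\), and the central 95% of the prior mass lies roughly between 0.23 and 1.62.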

Putting everything together, the final model is as follows:

\[\begin{align} y_i & = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m; \\ f(\mathbf{x}) & \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right); \\ k(\mathbf{x}, \mathbf{x}') & = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right); \\ \epsilon_i | \mathbf{x}_i & \sim \text{Gaussian}\left( 0, \sigma^2_{\text{noise}, i} \right), \text{ for } i = 1, \dots, m; \\ \ln \sigma^2_{\text{noise}, i} & = \alpha_\text{noise} + \boldsymbol{\beta}_\text{noise}^\intercal \, \mathbf{x}_i, \text{ for } i = 1, \dots, m; \\ \sigma_\text{process} & \sim \text{Half-Gaussian}\left( 0, 1 \right); \\ \ell_\text{process} & \sim \text{Inverse Gamma}\left( 1, 1 \right); \\ \alpha_\text{noise} & \sim \text{Gaussian}\left( -1, 1 \right); \text{ and} \\ \beta_{\text{noise}, i} & \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align}\]

This concludes the modeling part. The remaining two steps are to infer the parameters and to make predictions using the posterior predictive distribution.

Inference

The model is analytically intractable; one has to resort to sampling or variational methods for inferring the parameters. We shall use Hamiltonian Markov chain Monte Carlo sampling via Stan. The model can be seen in the following listing, where the notation closely follows the one used throughout the article:

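data {
  int<lower = 1> d; // Number of covariates
  int<lower = 1> m; // Number of observations
  vector[d] x[m];
  vector[m] y;
}

transformed data {
  // Zero mean of the Gaussian process
  vector[m] mu = rep_vector(0, m);
  // Covariates as a matrix for the linear model of the noise
  matrix[m, d] X;
  for (i in 1:m) {
    X[i] = x[i]';
  }
}

parameters {
  real<lower = 0> sigma_process;
  real<lower = 0> ell_process;
  real alpha_noise;
  vector[d] beta_noise;
}

model {
  // Squared-exponential covariance matrix, K
  matrix[m, m] K = cov_exp_quad(x, sigma_process, ell_process);
  // Noise variances as per Equation (4), the diagonal of D
  vector[m] sigma_noise_squared = exp(alpha_noise + X * beta_noise);
  // Cholesky factor of K + D
  matrix[m, m] L = cholesky_decompose(add_diag(K, sigma_noise_squared));

  y ~ multi_normal_cholesky(mu, L);
  // The lower bound in the declaration makes this a half-Gaussian prior
  sigma_process ~ normal(0, 1);
  ell_process ~ inv_gamma(1, 1);
  alpha_noise ~ normal(-1, 1);
  beta_noise ~ normal(0, 1);
}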

In the parameters block, one can find the \(d + 3\) parameters identified earlier. With regard to the model block, it is worth noting that there is no Gaussian process distribution in Stan. Instead, a multivariate Gaussian distribution is utilized to model \(f\) at \(\mathbf{X} = (\mathbf{x}_i)_{i = 1}^m \in \mathbb{R}^{m \times d}\) and eventually \(\mathbf{y} = (y_i)_{i = 1}^m\), which is for a good reason. Even though a Gaussian process is an infinite-dimensional object, in practice, one always works with finite amounts of data. For instance, in the running example, there are only 221 data points. By definition, a Gaussian process is a stochastic process with the condition that any finite collection of points from this process has a multivariate Gaussian distribution. This fact combined with the conditional independence of the process and the noise given the covariates yields the following and explains the usage of a multivariate Gaussian distribution:

\[\mathbf{y} | \mathbf{X}, \sigma_\text{process}, \ell_\text{process}, \alpha_\text{noise}, \boldsymbol{\beta}_\text{noise} \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \mathbf{K} + \mathbf{D} \right)\]

where \(\mathbf{K} \in \mathbb{R}^{m \times m}\) is a covariance matrix computed by evaluating the covariance function \(k\) at all pairs of locations in the observed data, and \(\mathbf{D} = \text{diag}(\sigma^2_{\text{noise}, i})_{i = 1}^m \in \mathbb{R}^{m \times m}\) is a diagonal matrix of the variances of the noise at the corresponding locations.

After running the inference, the following posterior distributions are obtained:

The intervals at the bottom of the densities are 66% and 95% equal-tailed probability intervals, and the dots indicate the medians. Let us also take a look at the 95% probability interval for the noise with respect to the distance:

As expected, the variance of the noise increases with the distance.

Prediction

Suppose there are \(n\) locations \(\mathbf{X}_\text{new} = (\mathbf{x}_{\text{new}, i})_{i = 1}^n \in \mathbb{R}^{n \times d}\) where one wishes to make predictions. Let \(\mathbf{f}_\text{new} \in \mathbb{R}^n\) be the values of \(f\) at those locations. Assuming all the data and parameters given, the joint distribution of \(\mathbf{y}\) and \(\mathbf{f}_\text{new}\) is as follows:

\[\left[ \begin{matrix} \mathbf{y} \\ \mathbf{f}_\text{new} \end{matrix} \right] \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \left[ \begin{matrix} \mathbf{K} + \mathbf{D} & k(\mathbf{X}, \mathbf{X}_\text{new}) \\ k(\mathbf{X}_\text{new}, \mathbf{X}) & k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) \end{matrix} \right] \right)\]

where, with a slight abuse of notation, \(k(\cdot, \cdot)\) stands for a covariance matrix computed by evaluating the covariance function \(k\) at the specified locations, which is analogous to \(\mathbf{K}\). It is well known (see Rasmussen et al. 2006, for instance) that the conditional distribution of \(\mathbf{f}_\text{new}\) given \(\mathbf{y}\) is a multivariate Gaussian with the following mean vector and covariance matrix, respectively:

\[\begin{align} E(\mathbf{f}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{f}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}). \end{align}\]

The final component is the noise, as per Equation (1). The noise does not change the mean of the multivariate Gaussian distribution but does magnify the variance:

\[\begin{align} E(\mathbf{y}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{y}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}) + \text{diag}(\sigma^2_\text{noise}(\mathbf{X}_\text{new})) \end{align}\]

where \(\text{diag}(\sigma^2_\text{noise}(\cdot))\) stands for a diagonal matrix composed of the noise variance evaluated at the specified locations, which is analogous to \(\mathbf{D}\).
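The last two expressions can be evaluated with a Cholesky factorization of \(\mathbf{K} + \mathbf{D}\). What follows is a minimal sketch for one draw of the parameters, restricted to \(d = 1\) and a single new location and written in plain Python to stay self-contained. The helper names (`kernel`, `cholesky`, `solve`, and `predict`) and the toy data are assumptions made for illustration and are not the code behind the figures in this article; in practice, one would rely on a linear-algebra library.

```python
import math


def kernel(a, b, sigma_process, ell_process):
    """Squared-exponential covariance between two scalar locations."""
    return sigma_process**2 * math.exp(-((a - b) ** 2) / (2.0 * ell_process**2))


def cholesky(A):
    """Return the lower Cholesky factor of a positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve(L, b):
    """Solve (L L^T) z = b by forward and backward substitution."""
    n = len(L)
    w = [0.0] * n
    for i in range(n):
        w[i] = (b[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (w[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def predict(x, y, x_new, sigma_process, ell_process, alpha_noise, beta_noise):
    """Return the mean and the variance of y_new at a single location x_new."""

    def noise_variance(a):
        return math.exp(alpha_noise + beta_noise * a)

    m = len(x)
    # K + D: process covariance plus the diagonal of noise variances.
    C = [
        [
            kernel(x[i], x[j], sigma_process, ell_process)
            + (noise_variance(x[i]) if i == j else 0.0)
            for j in range(m)
        ]
        for i in range(m)
    ]
    L = cholesky(C)
    k_new = [kernel(x_new, x[i], sigma_process, ell_process) for i in range(m)]
    mean = sum(k * z for k, z in zip(k_new, solve(L, y)))
    variance_f = kernel(x_new, x_new, sigma_process, ell_process) - sum(
        k * z for k, z in zip(k_new, solve(L, k_new))
    )
    # The noise leaves the mean intact and inflates the variance.
    return mean, variance_f + noise_variance(x_new)


# Toy data and one hypothetical posterior draw of the parameters.
mean, variance = predict(
    x=[0.0, 0.5, 1.0],
    y=[0.1, 0.4, 0.2],
    x_new=0.5,
    sigma_process=1.0,
    ell_process=0.5,
    alpha_noise=-4.0,
    beta_noise=2.0,
)
```

Working with the Cholesky factor avoids forming the inverse of \(\mathbf{K} + \mathbf{D}\) explicitly, which mirrors the use of `multi_normal_cholesky` in the Stan model.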

Given a set of draws from the joint posterior distribution of the parameters and the last two expressions, it is now straightforward to draw samples from the posterior predictive distribution of the response: for each draw of the parameters, one has to evaluate the mean vector and the covariance matrix and sample the corresponding multivariate Gaussian distribution. The result is given in the following figure:

The graph shows the mean value of the posterior predictive distribution given by the black line along with a 95% equal-tailed probability band about the mean. It can be seen that the uncertainty in the predictions is adequately captured along the entire support. Naturally, the full posterior predictive distribution is available at any location of interest.

Before we conclude, let us illustrate what would happen if the data were modeled as having homogeneous noise. To this end, the variance of the noise is assumed to be independent of the covariates, as in Equation (2). After repeating the inference and prediction processes, the following is obtained:

The inference is inadequate, which can be seen by the probability band: the variance is largely overestimated on the left-hand side and underestimated on the right-hand side. This justifies well the choice of heteroscedastic regression presented earlier.

Conclusion

In this article, it has been illustrated how a functional relationship can be modeled using a Gaussian process as a prior. Particular attention has been dedicated to adequately capturing error terms in the presence of heteroscedasticity. In addition, a practical implementation has been discussed, and the experimental results have demonstrated the appropriateness of this approach.

For the curious reader, the source code of this notebook along with a number of auxiliary scripts, such as the definition of the model in Stan, can be found on GitHub.

Acknowledgments

I would like to thank Mattias Villani for the insightful and informative graduate course in statistics titled “Advanced Bayesian learning,” which was the inspiration behind writing this article.

References

Carl Rasmussen et al., Gaussian Processes for Machine Learning, the MIT Press, 2006.
David Ruppert et al., Semiparametric Regression, Cambridge University Press, 2003.

Footnotes

¹ “Priors for marginal standard deviation,” Stan User’s Guide, 2020.
² “Priors for length-scale,” Stan User’s Guide, 2020.

What is the easiest way to compare two data sets?

2020-04-10T06:00:00+00:00

One has probably come across this problem numerous times. There are two versions of a tabular data set with a lot of columns of different types, and one wants to quickly identify any differences between the two. For example, the pipeline providing data to a predictive model might have been updated, and the goal is to understand if there have been any side effects of this update for the training data.

One solution is to start to iterate over the columns of the two tables, computing five-number summaries and plotting histograms or identifying distinct values and plotting bar charts, depending on the column’s type. However, this can quickly get out of hand and evolve into an endeavor for the rest of the day.

An alternative is to leverage the amazing tools that already exist in the data community.

Solution

The key takeaway is the following three lines of code, excluding the import:

import tensorflow_data_validation as dv

statistics_1 = dv.generate_statistics_from_dataframe(data_1)
statistics_2 = dv.generate_statistics_from_dataframe(data_2)
dv.visualize_statistics(lhs_statistics=statistics_1,
                        rhs_statistics=statistics_2)

This is all it takes to get a versatile dashboard embedded right into a cell of a Jupyter notebook. The visualization itself is based on Facets, and it is conveniently provided by TensorFlow Data Validation (which does not have much to do with TensorFlow and can be used stand-alone).

It is pointless to try to describe in words what the dashboard can do; instead, here is a demonstration taken from Facets where the tool is applied to the UCI Census Income data set:

Go ahead and try out all the different controls!

In this case, it is helpful to toggle the “percentages” checkbox, since the data sets are of different sizes. Then it becomes apparent that the two partitions are fairly balanced. The only problem is that Target, which represents income, happened to be encoded incorrectly in the partition for testing.

Lastly, an example in a Jupyter notebook can be found on GitHub.

Conclusion

It can be difficult to navigate and particularly challenging to compare wide data sets. A lot of effort can be put into this exercise. However, the landscape of open-source tools has a lot to offer too. Facets is one such example. The library and its straightforward availability via TensorFlow Data Validation are arguably less known. This short note can hopefully rectify this to some extent.

Bayesian inference of the net promoter score via multilevel regression with poststratification

2020-02-03T07:00:00+00:00

Customer surveys are naturally prone to biases. One prominent example is participation bias, which arises when individuals decide not to respond to the survey, and this pattern is not random. For instance, new customers might reply less eagerly than those who are senior. This renders the obtained responses unrepresentative of the target population. In this article, we tackle participation bias for the case of the net promoter survey by means of multilevel regression and poststratification.

More specifically, the discussion here is a sequel to “A Bayesian approach to the inference of the net promoter score,” where we built a hierarchical model for inferring the net promoter score for an arbitrary segmentation of a customer base. The reader is encouraged to skim over that article to recall the mechanics of the score and the structure of the model that was constructed. In that article, there was an assumption made that the sample was representative of the population, which, as mentioned earlier, is often not the case. In what follows, we mitigate this problem using a technique called poststratification. The technique works by matching proportions observed in the sample with those observed in the population with respect to several dimensions, such as age, country, and gender. However, in order to be able to poststratify, the model has to have access to all these dimensions at once, which the model built earlier is not suited for. To enable this, we switch gears to multilevel multinomial regression.

Problem

Suppose the survey is to measure the net promoter score for a population that consists of \(N\) customers. The score is to be reported with respect to individual values of \(M\) grouping variables where variable \(i\) has \(m_i\) possible values, for \(i = 1, \dots, M\). For instance, it might be important to know the score for different age groups, in which case the variable would be the customer’s age with values such as 18–25, 26–35, and so on. This implies that, in total, \(\sum_i m_i\) scores have to be estimated.

Depending on the size of the business, one might or might not try to reach out to all customers, except for those who have opted out of communications. Regardless of the decision, the resulting sample size, which is denoted by \(n\), is likely to be substantially smaller than \(N\), as the response rate is typically low. Therefore, there is uncertainty about the opinion of those who abstained or were not targeted.

More importantly, a random sample is desired; however, certain subpopulations of customers might end up being significantly overrepresented due to participation bias, driving the score astray. Let us quantify this concern. We begin by taking the Cartesian product of the aforementioned \(M\) variables. This results in \(K = \prod_i m_i\) distinct combinations of the variables’ values, which are referred to as cells in what follows. For each cell, the numbers of detractors, neutrals, and promoters observed in the sample are computed and denoted by \(d_i\), \(u_i\), and \(p_i\), respectively. The number of respondents in cell \(i\) is then

\[n_i = d_i + u_i + p_i \tag{1}\]

for \(i = 1, \dots, K\). For convenience, all counts are arranged in the following matrix:

\[y = \left( \begin{matrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_K \end{matrix} \right) = \left( \begin{matrix} d_1 & u_1 & p_1 \\ \vdots & \vdots & \vdots \\ d_i & u_i & p_i \\ \vdots & \vdots & \vdots \\ d_K & u_K & p_K \end{matrix} \right). \tag{2}\]

Given \(y\), the observed net promoter score for value \(j\) of variable \(i\) can be evaluated as follows:

\[s^i_j = 100 \times \frac{\sum_{k \in I^i_j}(p_k - d_k)}{\sum_{k \in I^i_j} n_k} \tag{3}\]

where \(I^i_j\) is an index set traversing cells with variable \(i\) set to value \(j\), which has the effect of marginalizing out other variables conditioned on the chosen value of variable \(i\), that is, on value \(j\).
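Equation (3) can be sketched in a few lines of Python. The cell labels, the counts, and the function name below are made up for illustration; each row of `counts` corresponds to a row of \(y\) in Equation (2).

```python
def observed_score(cells, counts, variable, value):
    """Evaluate Equation (3) for one value of one grouping variable."""
    # Index set: cells whose chosen variable takes the chosen value.
    index = [k for k, cell in enumerate(cells) if cell[variable] == value]
    respondents = sum(sum(counts[k]) for k in index)
    if respondents == 0:
        return None  # The score is undefined without respondents.
    difference = sum(counts[k][2] - counts[k][0] for k in index)
    return 100.0 * difference / respondents


# Hypothetical cells for two variables (age, country) with two values each.
cells = [("18-25", "SE"), ("18-25", "NO"), ("26-35", "SE"), ("26-35", "NO")]
# Rows of the matrix y: detractors, neutrals, and promoters per cell.
counts = [(2, 3, 5), (1, 1, 8), (4, 4, 2), (3, 2, 5)]

score = observed_score(cells, counts, variable=0, value="18-25")
```

For the age group 18-25, the two matching cells contain 20 respondents with 13 promoters and 3 detractors, which gives a score of 50.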

We can now compare \(n_i\), computed according to Equation (1), with its counterpart in the population (the total number of customers who belong to cell \(i\)), which is denoted by \(N_i\), taking into consideration the sample size \(n\) and the population size \(N\). Problems occur when the ratios within one or more of the following tuples largely disagree:

\[\left(\frac{n_i}{n}, \frac{N_i}{N}\right) \tag{4}\]

for \(i = 1, \dots, K\). When this happens, the scores given by Equation (3) or any analyses oblivious of this disagreement cannot be trusted, since they misrepresent the population. (It should be noted, however, that equality within each tuple does not guarantee the absence of participation bias, since there might be other, potentially unobserved, dimensions along which there are deviations.)
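A quick way to inspect the tuples in Equation (4) is to compute the difference between the two ratios for each cell. The sketch below uses hypothetical cell sizes; the function name is an assumption.

```python
def proportion_gaps(sample_counts, population_counts):
    """Return n_i / n - N_i / N for each cell, as in Equation (4)."""
    n = sum(sample_counts)
    N = sum(population_counts)
    return [
        n_i / n - N_i / N for n_i, N_i in zip(sample_counts, population_counts)
    ]


# Hypothetical cell sizes in the sample and in the population.
gaps = proportion_gaps([10, 10, 10, 10], [100, 400, 300, 200])
```

A positive gap indicates a cell that is overrepresented in the sample, and a negative one indicates a cell that is underrepresented. Here the first cell makes up 25% of the sample but only 10% of the population.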

The survey has been conducted, and there are deviations. What do we do with all these responses that have come in? Should we discard them and run a new survey, hoping that, this time, it would be different?

Solution

The fact that the sample covers only a fraction of the population is, of course, no news, and the solution is standard: one has to infer the net promoter score for the population given the sample and domain knowledge. This is what was done in the previous article for one grouping variable. However, due to participation bias, additional measures are needed as follows.

Taking inspiration from political science, we proceed in two steps.

1. Using an adequate model, \(K = \prod_i m_i\) net promoter scores are inferred: one for each cell, that is, for each combination of the values of the grouping variables.
2. The \(\prod_i m_i\) “cell-scores” are combined to produce \(\sum_i m_i\) “value-scores”: one for each value of each variable. This is done in such a way that the contribution of each cell to the score is equal to the relative size of that cell in the population given by Equation (4).

The two steps are discussed in the following two subsections.

Modeling

Step 1 can, in principle, be undertaken by any model of choice. A prominent candidate is multilevel multinomial regression, which is what we shall explore. Multilevel refers to having a hierarchical structure where parameters on a higher level give birth to parameters on a lower level, which, in particular, enables information exchange through a common ancestor. Multinomial refers to the distribution used for modeling the response variable. The family of multinomial distributions is appropriate, since we work with counts of events falling into one of several categories: detractors, neutrals, and promoters; see Equation (2). The response for each cell is then as follows:

\[y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i)\]

where \(n_i\) is given by Equation (1), and

\[\theta_i = \left\langle\theta^d_i, \theta^u_i, \theta^p_i\right\rangle\]

is a simplex (sums up to one) of probabilities of the three categories.

Multinomial regression belongs to the class of generalized linear models. This means that the inference takes place in a linear domain, and that \(\theta_i\) is obtained by applying a deterministic transformation to the corresponding linear model or models; the inverse of this transformation is known as the link function. In the case of multinomial regression, the aforementioned transformation is the softmax function, which is a generalization of the logistic function allowing more than two categories:

\[\theta_i = \text{Softmax}\left(\mu_i\right)\]

where

\[\mu_i = \left(0, \mu^u_i, \mu^p_i\right)\]

is the average log-odds of the three categories with respect to a reference category, which, by convention, is taken to be the first one, that is, detractors. The first entry is zero, since \(\ln(1) = 0\). Therefore, there are only two linear models: one is for neutrals (\(\mu^u_i\)), and one is for promoters (\(\mu^p_i\)).
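To make the role of the reference category concrete, here is a small Python sketch of the softmax transformation applied to a vector of log-odds whose first entry is fixed at zero. The log-odds are chosen arbitrarily so that the resulting probabilities are easy to verify by hand.

```python
import math


def softmax(mu):
    """Map log-odds to a simplex of probabilities."""
    shift = max(mu)  # Subtracting the maximum avoids overflow in exp
    exponents = [math.exp(x - shift) for x in mu]
    total = sum(exponents)
    return [e / total for e in exponents]


# Detractors are the reference category, so their entry is ln(1) = 0.
# Neutrals are twice as likely as detractors, and promoters three times.
mu = (0.0, math.log(2.0), math.log(3.0))
theta = softmax(mu)
```

The result is approximately (1/6, 2/6, 3/6), that is, the odds 1:2:3 normalized to sum to one.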

Now, there are many alternatives when it comes to the two linear parts. In this article, we use the following architecture. Both the model for neutrals and the one for promoters have the same structure, and for brevity, only the former is described. For the log-odds of neutrals, the model is

\[\mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}\]

where

\[\delta^{uj} = \left(\delta^{uj}_1, \dots, \delta^{uj}_{m_j}\right)\]

is a vector of deviations from intercept \(b^u\) specific to grouping variable \(j\) (one entry for each value of the variable), and \(I_j[i]\) yields the index of the value that cell \(i\) has, for \(i = 1, \dots, K\) and \(j = 1, \dots, M\).
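The indexing in the linear model can be spelled out in Python as follows. The intercept and the deviations are arbitrary numbers, and indices are zero-based here, unlike in the formula.

```python
def log_odds(intercept, deviations, indices):
    """Evaluate b + sum_j delta^j[I_j[i]] for a single cell i.

    `deviations[j]` holds the deviations of grouping variable j, and
    `indices[j]` is the (zero-based) value that the cell has for variable j.
    """
    return intercept + sum(
        deviations[j][index] for j, index in enumerate(indices)
    )


# Hypothetical parameters of the model for neutrals
b_u = -0.5
delta_u = [
    [0.25, -0.25, 0.0],  # Variable 1 with three values
    [0.5, -0.5],  # Variable 2 with two values
]

# A cell with the first value of variable 1 and the second of variable 2
mu_u = log_odds(b_u, delta_u, indices=(0, 1))
```

In this case, the log-odds are -0.5 + 0.25 - 0.5 = -0.75.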

Let us now turn to the multilevel aspect. For each grouping variable, the corresponding values, represented by the elements of \(\delta^{uj}\), are allowed to be different but assumed to have something in common and thus originate from a common distribution. To this end, they are assigned distributions with a shared parameter as follows:

\[\delta^{uj}_i | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right)\]

for \(i = 1, \dots, m_j\). The mean is zero, since \(\delta^{uj}_i\) represents a deviation.

Lastly, we have to decide on prior distributions of the intercept, \(b^u\), and the standard deviations, \(\sigma^{uj}\) for \(j = 1, \dots, M\). The intercept is given the following prior:

\[b^u \sim \text{Student’s t}(5, 0, 1).\]

The mean is zero in order to center at even odds. Regarding the standard deviations, they are given the following prior:

\[\sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1).\]

In order to understand the implications of these prior choices, let us take a look at the prior distribution assuming two grouping variables:

The left and right dashed lines demarcate tail regions that, for practical purposes, can be thought of as “never” and “always,” respectively. For instance, log-odds of five or higher are so extreme that detractors are rendered nearly non-existent when compared to neutrals. These regions are arguably unrealistic. The prior does not exclude these possibilities; however, it does not favor them either. The vast majority of the probability mass is still in the middle around zero.
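The implied prior on the log-odds can also be explored by simulation. The sketch below uses only the Python standard library and is not part of the original workflow: it draws the intercept, the standard deviations, and the deviations for two grouping variables and then counts how often the log-odds fall at or beyond the threshold of five mentioned above. The sample size and the seed are arbitrary.

```python
import math
import random


def student_t(rng, nu):
    """Draw from a standard Student's t distribution with nu degrees."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # Chi-squared with nu degrees
    return z / math.sqrt(chi2 / nu)


def prior_draw(rng, variables=2, nu=5):
    """Draw the log-odds of one cell from the prior."""
    mu = student_t(rng, nu)  # Intercept
    for _ in range(variables):
        sigma = abs(student_t(rng, nu))  # Half-Student's t
        mu += rng.gauss(0.0, sigma)  # Deviation of the cell's value
    return mu


rng = random.Random(42)
draws = [prior_draw(rng) for _ in range(2000)]
extreme = sum(abs(mu) >= 5 for mu in draws) / len(draws)
```

The share stored in extreme estimates the prior mass in the two tail regions combined.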

The overall model is then as follows:

\[\begin{align} & y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i), \text{ for } i = 1, \dots, K; \\ & \theta_i = \text{Softmax}\left(\mu_i\right), \text{ for } i = 1, \dots, K; \\ & \mu_i = (0, \mu^u_i, \mu^p_i), \text{ for } i = 1, \dots, K; \\ & \mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ & \mu^p_i = b^p + \sum_{j = 1}^M \delta^{pj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ & b^u \sim \text{Student’s t}(5, 0, 1); \\ & b^p \sim \text{Student’s t}(5, 0, 1); \\ & \delta^{uj}_k | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5a} \\ & \delta^{pj}_k | \sigma^{pj} \sim \text{Gaussian}\left(0, \sigma^{pj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5b} \\ & \sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M; \text{ and} \\ & \sigma^{pj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M. \end{align}\]

The model has \(2 \times (1 + \sum_i m_i + M)\) parameters in total. The structure that can be seen in Equations (5a) and (5b) is what makes the model multilevel. This is an important feature, since it allows for information sharing between the individual values of the grouping variables. In particular, this has a regularizing effect on the estimates, which is also known as shrinkage resulting from partial pooling.
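As a quick check of the parameter count, consider the example used later in the Implementation section, where age has six values and seniority has seven. The helper below is only for illustration.

```python
def parameter_count(cardinalities):
    """Count parameters given the number of values of each variable."""
    M = len(cardinalities)
    # One intercept, one deviation per value, and one standard deviation
    # per grouping variable
    per_linear_model = 1 + sum(cardinalities) + M
    # There are two linear models: one for neutrals and one for promoters
    return 2 * per_linear_model


count = parameter_count([6, 7])
```

This gives 2 × (1 + 13 + 2) = 32 parameters.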

Having defined the model, the posterior distribution can now be obtained by means of Markov chain Monte Carlo sampling. This procedure is standard and can be performed using, for instance, Stan or a higher-level package, such as brms, which is what is exemplified in the Implementation section. The result is a collection of draws of the parameters from the posterior distribution. For each draw of the parameters, a draw of the net promoter score can be computed using the following formula:

\[s_i = 100 \times (\theta^p_i - \theta^d_i) \tag{6}\]

for \(i = 1, \dots, K\). This means that we have obtained a (joint) posterior distribution of the net promoter score over the \(K\) cells. It is now time to combine the scores for the cells on the level of the values of the \(M\) grouping variables, which results in \(\sum_i m_i\) scores in total.

Poststratification

Step 2 is poststratification, whose purpose is to correct for potential deviations of the sample from the population; recall the discussion around Equation (4). The foundation laid in the previous subsection makes the work here straightforward. The idea is as follows. Each draw from the posterior distribution consists of \(K\) values for the net promoter score, one for each cell. All one has to do in order to correct for a mismatch in proportions is to take a weighted average of these scores where the weights are the counts observed in the population:

\[s^i_j = \frac{\sum_{k \in I^i_j} N_k \, s_k}{\sum_{k \in I^i_j} N_k}\]

where \(I^i_j\) is as in Equation (3), for \(i = 1, \dots, M\) and \(j = 1, \dots, m_i\). The above gives a poststratified draw from the posterior distribution of the net promoter score for variable \(i\) and value \(j\). In practice, depending on the tool used, one might perform the poststratification procedure differently, such as predicting counts of detractors, neutrals, and promoters in the cells given their in-population sizes and then aggregating those counts and following the definition of the net promoter score.
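For a single posterior draw, Equation (6) and the weighted average above can be sketched in Python as follows. The probabilities and the population counts are invented, and in practice the computation would be repeated for every draw.

```python
def poststratify(scores, population_sizes, variable, value):
    """Average cell scores weighted by the population sizes of the cells.

    Only cells whose key has `value` at position `variable` are used.
    """
    cells = [cell for cell in scores if cell[variable] == value]
    total = sum(population_sizes[cell] for cell in cells)
    weighted = sum(population_sizes[cell] * scores[cell] for cell in cells)
    return weighted / total


# One hypothetical posterior draw of (detractor, neutral, promoter)
# probabilities for cells indexed by (age, seniority)
thetas = {
    ("18-25", "1Y"): (0.25, 0.25, 0.5),
    ("18-25", "2Y"): (0.5, 0.25, 0.25),
    ("26-35", "1Y"): (0.25, 0.5, 0.25),
}
population_sizes = {
    ("18-25", "1Y"): 300,
    ("18-25", "2Y"): 100,
    ("26-35", "1Y"): 600,
}

# Equation (6): the net promoter score of each cell
scores = {cell: 100 * (p - d) for cell, (d, _, p) in thetas.items()}

young = poststratify(scores, population_sizes, variable=0, value="18-25")
```

The two cells of the youngest group score 25 and -25, and since the former is three times as large in the population, the poststratified score is 12.5.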

Implementation

In what follows, we consider a contrived example with the sole purpose of illustrating how the presented workflow can be implemented in practice. To this end, we generate some data with two grouping variables, age and seniority, and then perform inference using brms, which leverages Stan under the hood. For a convenient manipulation of posterior draws, tidybayes is used as well.

library(brms)
library(tidybayes)
library(tidyverse)

set.seed(42)
options(mc.cores = parallel::detectCores())

# Load data
data <- load_data()
# => list(
# =>   population = tibble(age, seniority, cell_size),
# =>   sample = tibble(age, seniority, cell_size,
# =>                   cell_counts = (detractor, neutral, promoter))
# => )

# Modeling
priors <- c(
  prior('student_t(5, 0, 1)', class = 'Intercept', dpar = 'muneutral'),
  prior('student_t(5, 0, 1)', class = 'Intercept', dpar = 'mupromoter'),
  prior('student_t(5, 0, 1)', class = 'sd', dpar = 'muneutral'),
  prior('student_t(5, 0, 1)', class = 'sd', dpar = 'mupromoter')
)
formula <- brmsformula(
  cell_counts | trials(cell_size) ~ (1 | age) + (1 | seniority))
model <- brm(formula, data$sample, multinomial(), priors,
             control = list(adapt_delta = 0.99), seed = 42)

# Poststratification
prediction <- data$population %>%
  add_predicted_draws(model) %>%
  spread(.category, .prediction) %>%
  group_by(age, .draw) %>%
  summarize(score = 100 * sum(promoter - detractor) / sum(cell_size)) %>%
  mean_hdi()

The final aggregation is given for age; it is similar for seniority. It can be seen in the above listing that modern tools allow for rather complex ideas to be expressed and explored in a very laconic way.

The curious reader is encouraged to run the above code. The appendix contains a function for generating synthetic data. It should be noted, however, that brms and tidybayes should be of versions greater than 2.11.1 and 2.0.1, respectively, which, at the time of writing, are available for installation only on GitHub. The appendix contains instructions for updating the packages.

Conclusion

In this article, we have discussed a multilevel multinomial model for inferring the net promoter score with respect to several grouping variables in accordance with the business needs. It has been argued that poststratification is an essential stage of the inference process, since it mitigates the deleterious consequences of participation bias on the subsequent decision-making.

There are still some aspects that could be improved. For instance, there is a natural ordering to the three categories of customers, detractors, neutrals, and promoters; however, it is currently ignored. Furthermore, there is some information thrown away when customer-level scores, which range from zero to ten, are aggregated on the category level. Lastly, the net promoter survey often happens in periodic waves, which calls for a single model capturing and learning from changes over time.

Acknowledgments

I would like to thank Andrew Gelman for the guidance on multilevel modeling and Paul-Christian Bürkner for the help with understanding the brms package.

References

Andrew Gelman et al., “Using multilevel regression and poststratification to estimate dynamic public opinion,” 2018.
Andrew Gelman and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, 2006.
Andrew Gelman and Thomas Little, “Poststratification into many categories using hierarchical logistic regression,” Survey Methodology, 1997.
Paul-Christian Bürkner, “brms: An R package for Bayesian multilevel models using Stan,” Journal of Statistical Software, 2017.

Appendix

The following listing defines a function that makes the illustrative example given in the Implementation section self-sufficient. By default, the population contains one million customers, and the sample contains one percent. There are two grouping variables: age with six values and seniority with seven values.

load_data <- function(N = 1000000, n = 10000) {
  softmax <- function(x) exp(x) / sum(exp(x))

  # Age
  age_values <- c('18–25', '26–35', '36–45', '46–55', '56–65', '66+')
  age_probabilities <- softmax(c(2, 3, 3, 2, 2, 1))

  # Seniority
  seniority_values <- c('6M', '1Y', '2Y', '3Y', '4Y', '5Y', '6Y+')
  seniority_probabilities <- softmax(c(3, 2, 2, 2, 1, 1, 1))

  # Score
  score_values <- seq(0, 10)
  score_probabilities <- softmax(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4))

  # Generate a population
  population <- tibble(age = sample(age_values, N,
                                    prob = age_probabilities,
                                    replace = TRUE),
                       seniority = sample(seniority_values, N,
                                          prob = seniority_probabilities,
                                          replace = TRUE))

  # Take a sample from the population
  sample <- population %>%
    sample_n(n) %>%
    mutate(score = sample(score_values, n,
                          prob = score_probabilities,
                          replace = TRUE)) %>%
    mutate(category = case_when(score < 7 ~ 'detractor',
                                score > 8 ~ 'promoter',
                                TRUE ~ 'neutral'))

  # Summarize the population
  population <- population %>%
    group_by(age, seniority) %>%
    count(name = 'cell_size')

  # Summarize the sample
  sample <- sample %>%
    group_by(age, seniority) %>%
    summarize(detractors = sum(category == 'detractor'),
              neutrals = sum(category == 'neutral'),
              promoters = sum(category == 'promoter')) %>%
    mutate(cell_size = detractors + neutrals + promoters)

  # Bind counts of detractors, neutrals, and promoters (needed for brms)
  sample$cell_counts <- with(sample, cbind(detractors, neutrals, promoters))
  colnames(sample$cell_counts) <- c('detractor', 'neutral', 'promoter')

  # Remove unused columns
  sample <- sample %>% select(-detractors, -neutrals, -promoters)

  list(population = population, sample = sample)
}

Lastly, the following snippet shows how to update brms and tidybayes from GitHub:

if (packageVersion('brms') < '2.11.2') {
  remotes::install_github('paul-buerkner/brms', upgrade = 'never')
}

if (packageVersion('tidybayes') < '2.0.1.9000') {
  remotes::install_github('mjskay/tidybayes', upgrade = 'never')
}

Ingestion of sequential data from BigQuery into TensorFlow

2019-11-08T07:00:00+00:00

How hard can it be to ingest sequential data into a TensorFlow model? As always, the answer is, “It depends.” Where are the sequences in question stored? Can they fit in main memory? Are they of the same length? In what follows, we shall build a flexible and scalable workflow for feeding sequential observations into a TensorFlow graph starting from BigQuery as the data warehouse.

To make the discussion tangible, consider the following problem. Suppose the goal is to predict the peak temperature at an arbitrary weather station present in the Global Historical Climatology Network for each day between June 1 and August 31. More concretely, given observations from June 1 up to an arbitrary day before August 31, the objective is to complete the sequence until August 31. For instance, if we find ourselves in Stockholm on June 12, we ask for the maximum temperatures from June 12 to August 31 given the temperature values from June 1 to June 11 at a weather station in Stockholm.

To set the expectations right, in this article, we are not going to build a predictive model but to cater for its development by making the data from the aforementioned database readily available in a TensorFlow graph. The final chain of states and operations is as follows:

1. Historical temperature measurements from the Global Historical Climatology Network are stored in a public data set in BigQuery. Each row corresponds to a weather station and a date. There are missing observations due to such reasons as measurements not passing quality checks.
2. Relevant measurements are grouped in BigQuery by the weather station and year. Therefore, each row corresponds to a weather station and a year, implying that all information about a particular example (a specific weather station on a specific year) is gathered in one place.
3. The sequences are read, analyzed, and transformed by Cloud Dataflow.
   - The data are split into a training, a validation, and a testing set of examples.
   - The training set is used to compute statistics needed for transforming the measurements to a form suitable for the subsequent modeling. Standardization is used as an example.
   - The training and validation sets are transformed using the statistics computed with respect to the training set in order to avoid performing these computations during the training-with-validation phase. The corresponding transform is available for the testing phase.
4. The processed training and validation examples and the raw testing examples are written by Dataflow to Cloud Storage in the TFRecord format, which is a format native to TensorFlow.
5. The files containing TFRecords are read by the tf.data API of TensorFlow and eventually transformed into a data set of appropriately padded batches of examples.

The above workflow is not as simple as reading data from a Pandas DataFrame comfortably resting in main memory; however, it is much more scalable. This pipeline can handle arbitrary amounts of data. Moreover, it operates on complete examples, not on individual measurements.

In the rest of the article, the aforementioned steps will be described in more detail. The corresponding source code can be found in the following repository on GitHub:

example-weather-forecast.

Data

It all starts with data. The data come from the Global Historical Climatology Network, which is available in BigQuery for public use. Steps 1 and 2 in the list above are covered by the following query:

WITH
-- Select relevant measurements
data_1 AS (
  SELECT
    id,
    date,
    -- Find the date of the previous observation
    LAG(date) OVER (station_year) AS date_last,
    latitude,
    longitude,
    -- Convert to degrees Celsius
    value / 10 AS temperature
  FROM
    `bigquery-public-data.ghcn_d.ghcnd_201*`
  INNER JOIN
    `bigquery-public-data.ghcn_d.ghcnd_stations` USING (id)
  WHERE
    -- Take years from 2010 to 2019
    CAST(_TABLE_SUFFIX AS INT64) BETWEEN 0 AND 9
    -- Take months from June to August
    AND EXTRACT(MONTH FROM date) BETWEEN 6 AND 8
    -- Take the maximum temperature
    AND element = 'TMAX'
    -- Take observations that passed spatio-temporal quality-control checks
    AND qflag IS NULL
  WINDOW
    station_year AS (
      PARTITION BY id, EXTRACT(YEAR FROM date)
      ORDER BY date
    )
),
-- Group into examples (a specific station and a specific year)
data_2 AS (
  SELECT
    id,
    MIN(date) AS date,
    latitude,
    longitude,
    -- Compute gaps between observations
    ARRAY_AGG(
      DATE_DIFF(date, IFNULL(date_last, date), DAY)
      ORDER BY date
    ) AS duration,
    ARRAY_AGG(temperature ORDER BY date) AS temperature
  FROM
    data_1
  GROUP BY
    id, latitude, longitude, EXTRACT(YEAR FROM date)
)
-- Partition into training, validation, and testing sets
SELECT
  *,
  CASE
    WHEN EXTRACT(YEAR FROM date) < 2019 THEN 'analysis,training'
    WHEN MOD(ABS(FARM_FINGERPRINT(id)), 100) < 50 THEN 'validation'
    ELSE 'testing'
  END AS mode
FROM
  data_2

The query fetches peak temperatures, denoted by temperature, for all available weather stations between June and August in 2010–2019. The crucial part is the usage of ARRAY_AGG, which is what makes it possible to gather all relevant data about a specific station and a specific year in the same row. The number of days since the previous measurement, which is denoted by duration, is also computed. Ideally, duration should always be one (except for the first day, which has no predecessor); however, this is not the case, which makes the resulting time series vary in length.
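For readers less familiar with ARRAY_AGG and window functions, the grouping can be mimicked in plain Python. This is only an in-memory analogy with made-up rows, not a replacement for the query, which is what makes the approach scale.

```python
from datetime import date
from itertools import groupby

# Hypothetical rows: (station, date, temperature), in arbitrary order
rows = [
    ("S1", date(2019, 6, 5), 19.0),
    ("S1", date(2019, 6, 1), 21.5),
    ("S2", date(2019, 6, 3), 30.1),
    ("S1", date(2019, 6, 2), 22.0),
]


def group_examples(rows):
    """Gather all measurements of a station in a year into one example."""
    # Sorting by station and date makes each station-year contiguous
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    examples = []
    for (station, year), group in groupby(
        ordered, key=lambda row: (row[0], row[1].year)
    ):
        group = list(group)
        dates = [row[1] for row in group]
        # Days since the previous observation; zero for the first one
        duration = [0] + [(b - a).days for a, b in zip(dates, dates[1:])]
        examples.append({
            "id": station,
            "year": year,
            "duration": duration,
            "temperature": [row[2] for row in group],
        })
    return examples


examples = group_examples(rows)
```

The first station has a gap of three days between June 2 and June 5, which shows up in its duration array as [0, 1, 3].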

In addition, in order to illustrate the generality of this approach, two contextual (that is, non-sequential) explanatory variables are added: latitude and longitude. They are scalars stored side by side with duration and temperature, which are arrays.

Another important point is the final SELECT statement, which defines a column called mode. This column indicates what each example is used for, allowing one to use the same query for different purposes and to avoid inconsistencies due to multiple queries. In this case, observations prior to 2019 are reserved for training, while the rest is split pseudo-randomly and reproducibly into two approximately equal parts: one is for validation, and one is for testing. This last operation is explained in detail in “Repeatable sampling of data sets in BigQuery for machine learning” by Lak Lakshmanan.
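The idea behind the reproducible split can be illustrated in Python. Note that the sketch substitutes SHA-256 from the standard library for FARM_FINGERPRINT, so the resulting assignment will differ from the one produced by BigQuery; the point is only that hashing the station identifier yields a stable pseudo-random bucket.

```python
import hashlib


def assign_mode(station_id, year):
    """Assign an example to a mode, mirroring the CASE expression."""
    if year < 2019:
        return "analysis,training"
    # SHA-256 is a stand-in for FARM_FINGERPRINT (an assumption of this
    # sketch); any stable hash of the identifier would serve the purpose
    digest = hashlib.sha256(station_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "validation" if bucket < 50 else "testing"


# Hypothetical station identifiers
modes = {
    station: assign_mode(station, 2019)
    for station in ["S1", "S2", "S3", "S4"]
}
```

Since the bucket depends only on the identifier, a station always lands in the same set no matter how many times or where the assignment is computed.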

Preprocessing

In this section, we cover Steps 3 and 4 in the list given at the beginning. This job is done by TensorFlow Extended, which is a library for building machine-learning pipelines. Internally, it relies on Apache Beam as a language for defining pipelines. Once an adequate pipeline is created, it can be executed using an executor, and the executor that we shall use is Cloud Dataflow.

Before we proceed to the pipeline itself, note that its construction is orchestrated by a configuration file, which will be referred to as config in the pipeline code (to be discussed shortly):

{
  "data": {
    "path": "configs/training/data.sql",
    "schema": [
      { "name": "latitude", "kind": "float32", "transform": "z" },
      { "name": "longitude", "kind": "float32", "transform": "z" },
      { "name": "duration", "kind": ["float32"], "transform": "z" },
      { "name": "temperature", "kind": ["float32"], "transform": "z" }
    ]
  },
  "modes": [
    { "name": "analysis" },
    { "name": "training", "transform": "analysis", "shuffle": true },
    { "name": "validation", "transform": "analysis" },
    { "name": "testing", "transform": "identity" }
  ]
}

It is worth noting that this way of working with a separate configuration file is not something standard that comes with TensorFlow or Beam. It is a convenience that we build for ourselves in order to keep the main logic reusable and extendable without touching the code.

The data block describes where the data can be found and provides a schema for the columns that are used. (Recall the SQL query given earlier and note that id, date, and mode are omitted.) For instance, latitude is a scalar of type FLOAT32, while temperature is a sequence of type FLOAT32. Both are standardized to have a zero mean and a unit standard deviation, which is indicated by "transform": "z" and is typically needed for training neural networks.
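Standardization itself is simple, and what matters is where the statistics come from. The following sketch uses plain Python rather than TensorFlow Extended and assumes the population standard deviation; it estimates the mean and standard deviation on training values only and then reuses them for other data.

```python
import statistics


def fit_z(values):
    """Compute the statistics needed for standardization."""
    mean = statistics.fmean(values)
    deviation = statistics.pstdev(values)  # Population standard deviation
    return mean, deviation


def apply_z(values, mean, deviation):
    """Standardize values using previously computed statistics."""
    return [(value - mean) / deviation for value in values]


# Hypothetical temperatures
training = [10.0, 20.0, 30.0]
validation = [20.0, 40.0]

# The statistics come from the training set only
mean, deviation = fit_z(training)
training_z = apply_z(training, mean, deviation)
validation_z = apply_z(validation, mean, deviation)
```

The validation values are not guaranteed to have a zero mean and a unit standard deviation after the transformation, which is expected, since they did not participate in the analysis.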

The modes block defines four passes over the data, corresponding to four operating modes. In each mode, a specific subset of examples is considered, which is given by the mode column returned by the query. There are two types of modes: analysis and transform; recall Step 3. Whenever the transform key is present, it is a transform mode; otherwise, it is an analysis mode. In this example, there is one analysis mode and three transform modes.

Below is an excerpt from a Python class responsible for building the pipeline:

# config = ...
# schema = ...

# Read the SQL code
with open(config['data']['path']) as file:
    query = file.read()
# Create a BigQuery source
source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
# Create metadata needed later
spec = schema.to_feature_spec()
meta = dataset_metadata.DatasetMetadata(
    schema=dataset_schema.from_feature_spec(spec))
# Read data from BigQuery
data = pipeline \
    | 'read' >> beam.io.Read(source)
# Loop over modes whose purpose is analysis
transform_functions = {}
for mode in config['modes']:
    if 'transform' in mode:
        continue
    name = mode['name']
    # Select examples that belong to the current mode
    data_ = data \
        | name + '-filter' >> beam.Filter(partial(_filter, mode))
    # Analyze the examples
    transform_functions[name] = (data_, meta) \
        | name + '-analyze' >> tt_beam.AnalyzeDataset(_analyze)
    path = _locate(config, name, 'transform')
    # Store the transform function
    transform_functions[name] \
        | name + '-write-transform' >> transform_fn_io.WriteTransformFn(path)
# Loop over modes whose purpose is transformation
for mode in config['modes']:
    if 'transform' not in mode:
        continue
    name = mode['name']
    # Select examples that belong to the current mode
    data_ = data \
        | name + '-filter' >> beam.Filter(partial(_filter, mode))
    # Shuffle examples if needed
    if mode.get('shuffle', False):
        data_ = data_ \
            | name + '-shuffle' >> beam.transforms.Reshuffle()
    # Transform the examples using an appropriate transform function
    if mode['transform'] == 'identity':
        coder = tft.coders.ExampleProtoCoder(meta.schema)
    else:
        data_, meta_ = ((data_, meta), transform_functions[mode['transform']]) \
            | name + '-transform' >> tt_beam.TransformDataset()
        coder = tft.coders.ExampleProtoCoder(meta_.schema)
    path = _locate(config, name, 'examples', 'part')
    # Store the transformed examples as TFRecords
    data_ \
        | name + '-encode' >> beam.Map(coder.encode) \
        | name + '-write-examples' >> beam.io.tfrecordio.WriteToTFRecord(path)

At the very beginning, a BigQuery source is created, which is then branched out according to the operating modes found in the configuration file. Specifically, the first for-loop corresponds to the analysis modes, and the second for-loop goes over the transform modes. The former ends with WriteTransformFn, which saves the resulting transform, and the latter ends with WriteToTFRecord, which writes the resulting examples as TFRecords.
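The branching logic can be tried out without Beam. In the sketch below, the modes are classified by the presence of the transform key, and rows are routed to modes by a helper called belongs, which is a hypothetical stand-in for _filter; the actual implementation lives in the repository and might differ. The helper assumes that the mode column is a comma-separated list, as suggested by the value 'analysis,training' in the query.

```python
config = {
    "modes": [
        {"name": "analysis"},
        {"name": "training", "transform": "analysis", "shuffle": True},
        {"name": "validation", "transform": "analysis"},
        {"name": "testing", "transform": "identity"},
    ]
}

# Hypothetical rows as they might come out of the query
rows = [
    {"id": "S1", "mode": "analysis,training"},
    {"id": "S2", "mode": "validation"},
    {"id": "S3", "mode": "testing"},
]


def belongs(mode, row):
    """Check if a row is to be used in a mode (a guess at _filter)."""
    return mode["name"] in row["mode"].split(",")


# Modes without the transform key are analysis modes
analysis_modes = [
    mode["name"] for mode in config["modes"] if "transform" not in mode
]
transform_modes = [
    mode["name"] for mode in config["modes"] if "transform" in mode
]

# Each mode sees its own subset of the rows
branches = {
    mode["name"]: [row["id"] for row in rows if belongs(mode, row)]
    for mode in config["modes"]
}
```

Note that the same row can be routed to several branches: the first station is used both for computing statistics and for training.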

The distinction between the contextual and sequential features is given by the schema object created based on the schema block in the configuration file. The call schema.to_feature_spec() shown above alternates between tf.io.FixedLenFeature and tf.io.VarLenFeature and produces a feature specification that is understood by TensorFlow and TensorFlow Extended.

The repository provides a wrapper for executing the pipeline on Cloud Dataflow. The following figure shows the flow of the data with respect to the four operating modes:

The outcome is a hierarchy of files on Cloud Storage:

.
└── data/
    └── training/
        └── 2019-11-01-12-00-00/
            ├── analysis/
            │   └── transform/
            │       ├── transform_fn/...
            │       └── transform_metadata/...
            ├── testing/
            │   └── examples/
            │       ├── part-000000-of-00004
            │       ├── ...
            │       └── part-000003-of-00004
            ├── training/
            │   └── examples/
            │       ├── part-000000-of-00006
            │       ├── ...
            │       └── part-000005-of-00006
            └── validation/
                └── examples/
                    ├── part-000000-of-00004
                    ├── ...
                    └── part-000003-of-00004

Here, data/training contains all data needed for the training phase, which collectively refers to training entwined with validation and followed by testing. Moving forward, this hierarchy is meant to accommodate the application phase as well by populating a data/application entry next to the data/training one. It can also accommodate trained models and the results of applying these models by having a model entry with a structure similar to the one of the data entry.

In the listing above, the files whose names start with part- are the ones containing TFRecords. It can be seen that, for each mode, the corresponding examples have been split into multiple files, which is done for more efficient access during the usage stage discussed in the next section.

Execution

At this point, the data have made it all the way to the execution phase, referring to training, validation, and testing; however, the data are yet to be injected into a TensorFlow graph, which is the topic of this section. As before, relevant parameters are kept in a separate configuration file:

{
  "data": {
    "schema": [
      { "name": "latitude", "kind": "float32" },
      { "name": "longitude", "kind": "float32" },
      { "name": "duration", "kind": ["float32"] },
      { "name": "temperature", "kind": ["float32"] }
    ],
    "modes": {
      "training": {
        "spec": "transformed",
        "shuffle_macro": { "buffer_size": 100 },
        "interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
        "shuffle_micro": { "buffer_size": 512 },
        "map": { "num_parallel_calls": -1 },
        "batch": { "batch_size": 128 },
        "prefetch": { "buffer_size": 1 },
        "repeat": {}
      },
      "validation": {
        "spec": "transformed",
        "shuffle_macro": { "buffer_size": 100 },
        "interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
        "map": { "num_parallel_calls": -1 },
        "batch": { "batch_size": 128 },
        "prefetch": { "buffer_size": 1 },
        "repeat": {}
      },
      "testing": {
        "spec": "original",
        "interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
        "map": { "num_parallel_calls": -1 },
        "batch": { "batch_size": 128 },
        "prefetch": { "buffer_size": 1 }
      }
    }
  }
}

It can be seen that the file contains only one block: data. This is sufficient for the purposes of this article; however, it is also meant to cover the construction of the model in mind, including its hyperparameters, and the execution process, including the optimizer and evaluation metrics.

The data block is similar to the one we saw before. In this case, modes describes various calls to the tf.data API related to shuffling, batching, and so on. Those who are familiar with the API will probably immediately recognize them. It is now instructive to go straight to the Python code.

Below is an excerpt from a Python class responsible for building the pipeline on the TensorFlow side:

# config = ...

# List all files matching a given pattern
pattern = [self.path, name, 'examples', 'part-*']
dataset = tf.data.Dataset.list_files(os.path.join(*pattern))
# Shuffle the files if needed
if 'shuffle_macro' in config:
    dataset = dataset.shuffle(**config['shuffle_macro'])
# Convert the files into datasets of examples stored as TFRecords and
# amalgamate these datasets into one dataset of examples
dataset = dataset \
    .interleave(tf.data.TFRecordDataset, **config['interleave'])
# Shuffle the examples if needed
if 'shuffle_micro' in config:
    dataset = dataset.shuffle(**config['shuffle_micro'])
# Preprocess the examples with respect to a given spec, pad the examples
# and form batches of different sizes, and postprocess the batches
dataset = dataset \
    .map(_preprocess, **config['map']) \
    .padded_batch(padded_shapes=_shape(), **config['batch']) \
    .map(_postprocess, **config['map'])
# Prefetch the batches if needed
if 'prefetch' in config:
    dataset = dataset.prefetch(**config['prefetch'])
# Repeat the data once the source is exhausted if needed
if 'repeat' in config:
    dataset = dataset.repeat(**config['repeat'])

The pipeline is self-explanatory. It is simply a chain of operations stacked on top of each other. It is, however, worth taking a closer look at the preprocessing and postprocessing mappings, which can be seen before and after the padding step, respectively:

def _preprocess(proto):
    spec = self.transforms[config['transform']] \
        .transformed_feature_spec()
    example = tf.io.parse_single_example(proto, spec)
    return (
        {name: example[name] for name in self.contextual_names},
        {
            # Convert the sequential columns from sparse to dense
            name: self.schema[name].to_dense(example[name])
            for name in self.sequential_names
        },
    )

def _postprocess(contextual, sequential):
    sequential = {
        # Convert the sequential columns from dense to sparse
        name: self.schema[name].to_sparse(sequential[name])
        for name in self.sequential_names
    }
    return {**contextual, **sequential}

Currently, tf.data does not support padding sparse tensors, which is the representation used for sequential features in TensorFlow. In the running example about forecasting weather, such features are duration and temperature. This is the reason such features are converted to their dense counterparts in _preprocess. However, the final representation has to be sparse still. Therefore, the sequential features are converted back to the sparse format in _postprocess. Hopefully, this back-and-forth conversion will be rendered obsolete in future versions.

Having executed the above steps, we have an instance of tf.data.Dataset, which is the ultimate goal, as it is the standard way of ingesting data into a TensorFlow graph. At this point, one might create a Keras model leveraging tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures for constructing the input layer and then pass the data set to the fit function of the model. A skeleton for this part can be found in the repository.

Conclusion

In this article, we have discussed a scalable approach to the ingestion of sequential observations from BigQuery into a TensorFlow graph. The key tools that have been used to this end are TensorFlow Extended in combination with Cloud Dataflow and the tf.data API of TensorFlow.

In addition, the provided code has been written to be general and easily customizable. It has been achieved by separating the configuration part from the implementation one.

References

Lak Lakshmanan, “Repeatable sampling of data sets in BigQuery for machine learning,” 2016.

Sample size determination using historical data and simulation

2019-09-25T06:00:00+00:00

In order to test a hypothesis, one has to design and execute an adequate experiment. Typically, it is neither feasible nor desirable to involve the whole population. Instead, a relatively small subset of the population is studied, and given the outcome for this small sample, relevant conclusions are drawn with respect to the population. An important question to answer is then, What is the minimal sample size needed for the experiment to succeed? In what follows, we answer this question using solely historical data and computer simulation, without invoking any classical statistical procedures.

Although, as we shall see, the ideas are straightforward, direct calculations were impossible to perform before computers. To be able to answer questions of this kind back then, statisticians developed mathematical theories in order to approximate the calculations for specific situations. Since nothing else was possible, these approximations and the various terms and conditions under which they operate made up a large part of traditional textbooks and courses in statistics. However, the advent of today’s computing power has enabled one to estimate required sample sizes in a more direct and intuitive way, with the only prerequisites being an understanding of statistical inference, the availability of historical data describing the status quo, and the ability to write a few lines of code in a programming language.

Problem

For concreteness, consider the following scenario. We run an online business and hypothesize that a specific change in promotion campaigns, such as making them personalized, will have a positive effect on a specific performance metric, such as the average deposit. In order to investigate if it is the case, we decide to perform a two-sample test. There are the following two competing hypotheses.

The null hypothesis postulates that the change has no effect on the metric.
The alternative hypothesis postulates that the change has a positive effect on the metric.

There will be two groups: a control group and a treatment group. The former will be exposed to the current promotion policy, while the latter to the new one. There are also certain requirements imposed on the test. First, we have a level of statistical significance \(\alpha\) and a level of practical significance \(\delta\) in mind. The former puts a limit on the false-positive rate, and the latter indicates the smallest effect that we still care about; anything smaller is as good as zero for any practical purpose. In addition, we require the test to have a prescribed false-negative rate \(\beta\), ensuring that the test has enough statistical power.

For our purposes, the test is considered well designed if it is capable of detecting a difference as small as \(\delta\) so that the false-positive and false-negative rates are controlled to levels \(\alpha\) and \(\beta\), respectively. Typically, parameters \(\alpha\) and \(\delta\) are held constant, and the desired false-negative rate \(\beta\) is attained by varying the number of participants in each group, which we denote by \(n\). Note that we do not want any of the parameters to be smaller than the prescribed values, as it would be wasteful.

So what should the sample size be for the test to be well designed?

Solution

Depending on the distribution of the data and on the chosen metric, one might or might not be able to find a suitable test among the standard ones, while ensuring that the test’s assumptions can safely be considered satisfied. More importantly, a textbook solution might not be the most intuitive one, which, in particular, might lead to misuse of the test. It is the understanding that matters.

Here we take a more pragmatic and rather general approach that circumvents the above concerns. It requires only historical data and basic programming skills. Despite its simplicity, the method below goes straight to the core of what the famed statistical tests are doing behind all the math. The approach belongs to the class of so-called bootstrap techniques and is as follows.

Suppose we have historical data on customers’ behavior under the current promotion policy, which is commonplace in practice. An important realization is that this data set represents what we expect to observe in the control group. It is also what is expected of the treatment group provided that the null hypothesis is true, that is, when the proposed change has no effect. This realization enables one to simulate what would happen if each group was limited to an arbitrary number of participants. Then, by varying this size parameter, it is possible to find the smallest value that makes the test well designed, that is, make the test satisfy the requirements on \(\alpha\), \(\beta\), and \(\delta\), as discussed in the previous section.

This is all. The rest is an elaboration of the above idea.

The simulation entails the following. To begin with, note that what we are interested in testing is the difference between the performance metric applied to the treatment group and the same metric applied to the control group, which is referred to as the test statistic:

Test statistic = Metric(Treatment sample) - Metric(Control sample).

Treatment sample and Control sample stand for sets of observations, and Metric(Sample) stands for computing the performance metric given such a sample. For instance, each observation could be the total deposit of a customer, and the metric could be the average value:

Metric(Sample) = Sum of observations / Number of observations.

Note, however, that it is an example; the metric can be arbitrary, and this is a huge advantage of this approach to sample size determination based on data and simulation.

Large positive values of the test statistic speak in favor of the treatment (that is, the new promotion policy in our example), while those that are close to zero suggest that the treatment is futile.

A sample of \(n\) observations corresponding to the status quo (that is, the current policy in our example) can be easily obtained by drawing \(n\) data points with replacement from the historical data:

Sample = Choose random with replacement(Data, N).

This expression is used for Control sample under both the null and alternative hypotheses. As alluded to earlier, this is also how Treatment sample is obtained under the null. Regarding the alternative hypothesis being true, one has to express the hypothesized outcome as a distribution for the case of the minimal detectable difference, \(\delta\). The simplest and reasonable solution is to sample the data again, apply the metric, and then adjust the result to reflect the alternative hypothesis:

Metric(Choose random with replacement(Data, N)) + Delta.

Here, again, one is free to change the logic under the alternative according to the situation at hand. For instance, instead of an additive effect, one could simulate a multiplicative one.

The above is a way to simulate a single instance of the experiment under either the null or alternative hypothesis; the result is a single value for the test statistic. The next step is to estimate how the test statistic would vary if the experiment was repeated many times in the two scenarios. This simply means that the procedure should be repeated multiple times:

Repeat many times {
  Sample 1 = Choose random with replacement(Data, N)
  Sample 2 = Choose random with replacement(Data, N)
  Metric 1 = Metric(Sample 1)
  Metric 2 = Metric(Sample 2)
  Test statistic under null = Metric 1 - Metric 2

  Sample 3 = Choose random with replacement(Data, N)
  Sample 4 = Choose random with replacement(Data, N)
  Metric 3 = Metric(Sample 3) + Delta
  Metric 4 = Metric(Sample 4)
  Test statistic under alternative = Metric 3 - Metric 4
}
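
The pseudocode above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions: the historical data are replaced by artificial log-normal observations, the metric is the mean, the effect is additive, and the function names draw_metrics and simulate are made up for this example:

```python
import numpy as np


def draw_metrics(data, n, replications, rng):
    """Compute the mean of each of several samples drawn with replacement."""
    samples = rng.choice(data, size=(replications, n), replace=True)
    return samples.mean(axis=1)


def simulate(data, n, delta, replications, rng):
    """Return the test statistics under the null and under the alternative."""
    # Under the null, both groups behave like the historical data
    null = draw_metrics(data, n, replications, rng) \
        - draw_metrics(data, n, replications, rng)
    # Under the alternative, the treatment metric is shifted by delta
    alternative = (draw_metrics(data, n, replications, rng) + delta) \
        - draw_metrics(data, n, replications, rng)
    return null, alternative


rng = np.random.default_rng(42)
# Artificial data standing in for historical observations
data = rng.lognormal(size=20_000)
# Minimal detectable difference: 10% of the current average (an assumption)
delta = 0.1 * data.mean()
null, alternative = simulate(data, n=500, delta=delta, replications=1000, rng=rng)
```

Each of the two resulting arrays holds one value of the test statistic per replication, that is, a set of realizations from the corresponding sampling distribution.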

This yields a collection of values for the test statistic under the null hypothesis and a collection of values for the test statistic under the alternative hypothesis. Each one contains realizations from the so-called sampling distribution in the corresponding scenario. The following figure gives an illustration:

The blue shape is the sampling distribution under the null hypothesis, and the red one is the sampling distribution under the alternative hypothesis. We shall come back to this figure shortly.

These two distributions of the test statistic are what we are after, as they allow one to control the false-positive rate, compute the false-negative rate, and eventually choose a sample size. First, given \(\alpha\), the sampling distribution under the null (the blue one) is used in order to find a value beyond which the probability mass is equal to \(\alpha\):

Critical value = Quantile([Test statistic under null], 1 - alpha).

Quantile computes the quantile specified by the second argument given a set of observations. This quantity is called the critical value of the test. In the figure above, it is denoted by a dashed line. When the test statistic falls to the right of the critical value, we reject the null hypothesis; otherwise, we fail to reject it. Second, the sampling distribution in the case of the alternative hypothesis being true (the red one) is used in order to compute the false-negative rate:

Attained beta = Mean([Test statistic under alternative < Critical value]).

It corresponds to the probability mass of the sampling distribution under the alternative to the left of the critical value. In the figure, it is the red area to the left of the dashed line.
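
As a small numerical illustration of these two quantities, the following Python sketch computes the critical value and the attained false-negative rate. The two arrays are normally distributed stand-ins chosen purely for the example; in practice, they would be the simulated test statistics under the null and the alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the simulated test statistics (an assumption for illustration)
null = rng.normal(loc=0.0, scale=1.0, size=10_000)
alternative = rng.normal(loc=2.5, scale=1.0, size=10_000)

alpha = 0.05
# Value beyond which the null distribution has probability mass alpha
critical_value = np.quantile(null, 1 - alpha)
# Share of the alternative distribution to the left of the critical value
attained_beta = np.mean(alternative < critical_value)
# Sanity check: share of the null distribution to the right of it
attained_alpha = np.mean(null > critical_value)
```

The comparison inside np.mean yields an array of Booleans, so the mean is the fraction of replications in which the test would fail to reject the null even though the alternative holds.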

The final step is to put the above procedure in an optimization loop that minimizes the distance between the target and attained \(\beta\)’s with respect to the sample size:

Optimize N until Attained beta is close to Target beta {
  Repeat many times {
    Test statistic under null = ...
    Test statistic under alternative = ...
  }
  Critical value = ...
  Attained beta = ...
}
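
One possible way to organize this loop is a bisection over the sample size, sketched below in Python. It is not the root finding used in the appendix; it is a simpler substitute that assumes the attained false-negative rate decreases as the sample size grows and that the upper bound of the search interval is large enough. The data are artificial, the metric is the mean, and the function names are made up for this example:

```python
import numpy as np


def attained_beta(data, n, delta, alpha, replications, rng):
    """Estimate the false-negative rate for groups of size n."""

    def metrics():
        # Mean of each of several samples drawn with replacement
        samples = rng.choice(data, size=(replications, n), replace=True)
        return samples.mean(axis=1)

    null = metrics() - metrics()
    alternative = (metrics() + delta) - metrics()
    critical_value = np.quantile(null, 1 - alpha)
    return np.mean(alternative < critical_value)


def find_sample_size(data, delta, alpha, beta, lower=1, upper=10_000,
                     replications=1000, seed=42):
    """Bisect for the smallest size whose attained beta is at most beta."""
    rng = np.random.default_rng(seed)
    while upper - lower > 1:
        middle = (lower + upper) // 2
        if attained_beta(data, middle, delta, alpha, replications, rng) > beta:
            lower = middle  # Too little power; more participants are needed
        else:
            upper = middle  # Enough power; try fewer participants
    return upper


# Artificial data standing in for historical observations
data = np.random.default_rng(0).lognormal(size=20_000)
sample_size = find_sample_size(data, delta=0.1 * data.mean(), alpha=0.05, beta=0.2)
```

Since the attained rate is itself a noisy estimate, the result fluctuates from run to run; increasing the number of replications narrows this fluctuation at the cost of computation time.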

This concludes the calculation of the size that the control and treatment groups should have in order for the upcoming test in promotion campaigns to be well designed in terms of the level of statistical significance \(\alpha\), the false-negative rate \(\beta\), and the level of practical significance \(\delta\).

An example of how this technique could be implemented in practice can be found in the appendix.

Conclusion

In this article, we have discussed an approach to sample size determination that is based on historical data and computer simulation rather than on mathematical formulae tailored for specific situations. It is general and straightforward to implement. More importantly, the technique is intuitive, since it directly follows the narrative of null hypothesis significance testing. It does require prior knowledge of the key concepts in statistical inference. However, this knowledge is arguably essential for those who are involved in scientific experimentation. It constitutes the core of statistical literacy.

Acknowledgments

This article was inspired by a blog post authored by Allen Downey and a talk given by John Rauser. I also would like to thank Aaron Rendahl for his feedback on the introduction to the method presented here and for his help with the implementation given in the appendix.

References

Allen Downey, “There is only one test!,” 2011.
John Rauser, “Statistics without the agonizing pain,” 2014.
Joseph Lee Rodgers, “The bootstrap, the jackknife, and the randomization test: A sampling taxonomy,” Multivariate Behavioral Research, 2010.

Appendix

The following listing shows an implementation of the bootstrap approach in R:

library(tidyverse)

set.seed(42)

# Artificial data for illustration
observation_count <- 20000
data <- tibble(value = rlnorm(observation_count))

# Performance metric
metric <- mean
# Statistical significance
alpha <- 0.05
# False-negative rate
beta <- 0.2
# Practical significance
delta <- 0.1 * metric(data$value)

simulate <- function(sample_size, replication_count) {
  # Function for drawing a single sample of size sample_size
  run_one <- function() sample(data$value, sample_size, replace = TRUE)
  # Function for drawing replication_count samples of size sample_size
  run_many <- function() replicate(replication_count, { metric(run_one()) })

  # Simulation under the null hypothesis
  control_null <- run_many()
  treatment_null <- run_many()
  difference_null <- treatment_null - control_null

  # Simulation under the alternative hypothesis
  control_alternative <- run_many()
  treatment_alternative <- run_many() + delta
  difference_alternative <- treatment_alternative - control_alternative

  # Computation of the critical value
  critical_value <- quantile(difference_null, 1 - alpha)
  # Computation of the false-negative rate
  beta <- mean(difference_alternative < critical_value)

  list(difference_null = difference_null,
       difference_alternative = difference_alternative,
       critical_value = critical_value,
       beta = beta)
}

# Number of replications
replication_count <- 1000
# Interval of possible values for the sample size
search_interval <- c(1, 10000)
# Root finding to attain the desired value by varying the sample size
target <- function(n) beta - simulate(as.integer(n), replication_count)$beta
sample_size <- as.integer(uniroot(target, interval = search_interval)$root)

The illustrative figure shown in the solution section displays the sampling distribution of the test statistic under the null and alternative for the sample size found by this code snippet.

A Bayesian approach to the inference of the net promoter score

2019-08-19T06:00:00+00:00

The net promoter score is a widely adopted metric for gauging customers’ satisfaction with a product. The popularity of the score is arguably attributed to the simplicity of measurement and the intuitiveness of interpretation. Moreover, it is claimed to be correlated with revenue growth, which, ignoring causality, makes it even more appealing. In this article, we leverage Bayesian statistics in order to infer the net promoter score for an arbitrary segmentation of a customer base. The outcome of the inference is a distribution over all possible values of the score weighted by probabilities, which provides exhaustive information for the subsequent decision-making.

A bare-bones net promoter survey is composed of only one question: “How likely are you to recommend us to a friend?” The answer is an integer ranging from 0 to 10 inclusive. If the grade is between 0 and 6 inclusive, the person in question is said to be a detractor. If it is 7 or 8, the person is said to be a neutral. Lastly, if it is 9 or 10, the person is deemed a promoter. The net promoter score itself is then the percentage of promoters minus the percentage of detractors. The minimum and maximum attainable values of the score are −100 and 100, respectively. In this case, the greater, the better.
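
To make the definition concrete, here is a short Python sketch that classifies raw grades and computes the score. The function names and the list of grades are made up for illustration:

```python
def classify(grade):
    """Map a grade from 0 to 10 to the type of customer."""
    if not 0 <= grade <= 10:
        raise ValueError('the grade should be between 0 and 10')
    if grade <= 6:
        return 'detractor'
    if grade <= 8:
        return 'neutral'
    return 'promoter'


def net_promoter_score(grades):
    """Compute the percentage of promoters minus that of detractors."""
    labels = [classify(grade) for grade in grades]
    promoters = labels.count('promoter')
    detractors = labels.count('detractor')
    return 100 * (promoters - detractors) / len(labels)


# Four promoters, two neutrals, and two detractors
grades = [10, 9, 9, 8, 7, 6, 3, 10]
score = net_promoter_score(grades)  # 100 * (4 - 2) / 8 = 25.0
```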

As is usually the case with surveys, a small but representative subset of customers is reached out to, and the collected responses are then used to draw conclusions about the target population of customers. Our objective is to facilitate this last step by estimating the net promoter score given a set of responses and, importantly, by quantifying the uncertainty in our estimates and putting it front and center.

Before we proceed, since a net promoter survey is an observational study, which is prone to such biases as participation and response biases, great care must be taken when analyzing the results. In this article, however, we focus on the inference of the net promoter score under the assumption that the given sample of responses is representative of the target population.

Problem

In practice, one is interested to know the net promoter score for different subpopulations of customers, such as countries of operation and age groups, which is the scenario that we shall target. To this end, suppose that there are \(m\) segments of interest, and each customer belongs to exactly one of them. The results of a net promoter survey can then be summarized using the following \(m \times 3\) matrix:

\[y = \left( \begin{matrix} d_1 & n_1 & p_1 \\ \vdots & \vdots & \vdots \\ d_i & n_i & p_i \\ \vdots & \vdots & \vdots \\ d_m & n_m & p_m \end{matrix} \right)\]

where \(d_i\), \(n_i\), and \(p_i\) denote the number of detractors, neutrals, and promoters in segment \(i\), respectively. For segment \(i\), the observed net promoter score can be computed as follows:

\[\hat{s}_i = 100 \times \frac{p_i - d_i}{d_i + n_i + p_i}.\]
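
For instance, with NumPy, the observed scores of all segments can be computed at once. The counts below are made up for illustration:

```python
import numpy as np

# Rows are segments; columns are detractors, neutrals, and promoters
y = np.array([
    [20, 30, 50],
    [5, 3, 2],
    [100, 150, 250],
])

detractors, neutrals, promoters = y.T
observed_scores = 100 * (promoters - detractors) / y.sum(axis=1)
# The scores are 30, -30, and 30 for the three segments
```

Note that the first and third segments have the same observed score even though the latter is based on five times as many responses.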

However, this observed score is a single scalar value calculated using \(d_i + n_i + p_i\) data points, which is only a subset of the corresponding subpopulation. It may or may not correspond well to the actual net promoter score of that subpopulation. We have no reason to trust it, since the above estimate alone does not tell us anything about the uncertainty associated with it. Uncertainty quantification is essential for sound decision-making, which is what we are after.

Ideally, for each segment, given the observed data, we would like to have a distribution of all possible values of the score with probabilities attached. Such a probability distribution would be exhaustive information, from which any other statistic could be easily derived. Here we tackle the problem by means of Bayesian inference, which we discuss next.

Solution

In order to perform Bayesian inference of the net promoter score, we need to decide on an adequate Bayesian model for the problem at hand. Recall first that we are interested in inferring scores for several segments. Even though there might be segment-specific variations in the product, such as special offers in certain countries, or in customers’ perception of the product, such as age-related preferences, it is conceptually the same product that the customers were asked to evaluate. It is then sensible to expect the scores in different segments to have something in common. With this in mind, we construct a hierarchical model with parameters shared by the segments.

First, let

\[\theta_i = (\theta_{id}, \theta_{in}, \theta_{ip}) \in \langle 0, 1 \rangle^3\]

be a triplet of parameters corresponding to the proportion of detractors, neutrals, and promoters in segment \(i\), respectively, with the constraint that they have to sum up to one. The constraint makes the triplet a simplex, which is what is emphasized by the angle brackets on the right-hand side. These are the main parameters we are interested in inferring. If the true value of \(\theta_i\) was known, the net promoter score would be computed as follows:

\[s_i = 100 \times (\theta_{ip} - \theta_{id}).\]

Parameter \(\theta_i\) can also be thought of as a vector of probabilities of observing one of the three types of customers in segment \(i\), that is, detractors, neutrals, and promoters. Then the natural model for the observed data is a multinomial distribution with \(d_i + n_i + p_i\) trials and probabilities \(\theta_i\):

\[y_i | \theta_i \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i)\]

where \(y_i\) refers to the \(i\)th row of matrix \(y\) introduced earlier. The family of multinomial distributions is a generalization of the family of binomial distributions to more than two outcomes.

The above gives a data distribution. In order to complete the modeling part, we need to decide on a prior probability distribution for \(\theta_i\). Each \(\theta_i\) is a simplex of probabilities. In such a case, a reasonable choice is a Dirichlet distribution:

\[\theta_i | \phi \sim \text{Dirichlet}(\phi)\]

where \(\phi = (\phi_d, \phi_n, \phi_p)\) is a vector of strictly positive parameters. This family of distributions is a generalization of the family of beta distributions to more than two categories. Note that \(\phi\) is the same for all segments, which is what enables information sharing. In particular, it means that the less reliable estimates for segments with fewer observations will be shrunk toward the more reliable estimates for segments with more observations. In other words, with this architecture, segments with fewer observations are able to draw strength from those with more observations.

How about \(\phi\)? This triplet is a characteristic of the product irrespective of the segment. Its individual components can be utilized in order to encode one’s prior knowledge about the net promoter score. Specifically, \(\phi_d\), \(\phi_n\), and \(\phi_p\) could be set to imaginary observations of detractors, neutrals, and promoters, respectively, reflecting one’s beliefs prior to conducting the survey. The higher these imaginary counts are, the more certain one claims to be about the true score. One could certainly set these hyperparameters to fixed values; however, a more comprehensive solution is to infer them from the data as well, giving the model more flexibility by making it hierarchical. In addition, an inspection of \(\phi\) afterward can provide insights into the overall satisfaction with the product.

We now need to specify a prior, or rather a hyperprior, for \(\phi\). We proceed under the assumption that we have little knowledge about the true score. Even if there were surveys in the past, it is still a valid choice, especially when the product evolves rapidly, rendering prior surveys marginally relevant.

Now, it is more convenient to think in terms of expected values and variances instead of imaginary counts, which is what \(\phi\) represents. Let us find an alternative parameterization of the Dirichlet distribution. The expected value of this distribution is as follows:

\[\mu = (\mu_d, \mu_n, \mu_p) = \frac{\phi}{\phi_d + \phi_n + \phi_p} \in \langle 0, 1 \rangle^3.\]

It can be seen that it is a simplex of proportions of detractors, neutrals, and promoters of the whole population, which is similar to \(\theta_i\) describing segment \(i\). Regarding the variance,

\[\sigma^2 = \frac{1}{\phi_d + \phi_n + \phi_p}\]

is considered to capture it sufficiently well. Solving the system of the last two equations for \(\phi\) yields the following result:

\[\phi = \frac{\mu}{\sigma^2}.\]

The prior for \(\theta_i\) can then be rewritten as follows:

\[\theta_i | \mu, \sigma \sim \text{Dirichlet}\left(\frac{\mu}{\sigma^2}\right).\]
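
The reparameterization can be checked numerically. The following Python sketch converts an arbitrarily chosen pair of \(\mu\) and \(\sigma\) into \(\phi\) and verifies that the two defining relations hold; the specific numbers are assumptions made for illustration:

```python
import numpy as np

# Arbitrary values for illustration
mu = np.array([0.2, 0.3, 0.5])
sigma = 0.1

# Imaginary counts of detractors, neutrals, and promoters: 20, 30, and 50
phi = mu / sigma**2

# The expected value of the Dirichlet distribution recovers mu
assert np.allclose(phi / phi.sum(), mu)
# The reciprocal of the total count recovers sigma squared
assert np.isclose(1 / phi.sum(), sigma**2)

# The same can be observed empirically by sampling
rng = np.random.default_rng(0)
draws = rng.dirichlet(phi, size=100_000)
empirical_mean = draws.mean(axis=0)
```

A smaller \(\sigma\) translates into larger imaginary counts and hence into draws that are more tightly concentrated around \(\mu\).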

This new parameterization requires two hyperpriors: one is for \(\mu\), and one is for \(\sigma\). For \(\mu\), a reasonable choice is a uniform distribution (over a simplex), and for \(\sigma\), a half-Cauchy distribution:

\[\begin{align} & \mu \sim \text{Uniform}(\langle 0, 1 \rangle^3) \text{ and} \\ & \sigma \sim \text{Half-Cauchy}(0, 1). \end{align}\]

The two distributions are relatively weak, which is intended in order to let the data speak for themselves. At this point, all parameters have been defined. Of course, one could go further if the problem at hand had a deeper structure; however, in this case, it is arguably not justifiable.

The final model is as follows:

\[\begin{align} y_i | \theta_i & \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i), \\ \theta_i | \mu, \sigma & \sim \text{Dirichlet}(\mu / \sigma^2), \\ \mu & \sim \text{Uniform}(\langle 0, 1 \rangle^3), \text{ and} \\ \sigma & \sim \text{Half-Cauchy}(0, 1). \end{align}\]
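
One way to get a feel for the model is to read it top to bottom as a recipe for generating synthetic surveys. The Python sketch below does so with NumPy; the function name and the numbers of responses per segment are assumptions made for illustration, and the uniform distribution over the simplex is drawn as a Dirichlet distribution with all parameters equal to one:

```python
import numpy as np


def simulate_survey(responses, rng):
    """Draw one synthetic data set from the model, from hyperpriors to data."""
    # Uniform over the simplex is Dirichlet with all parameters equal to one
    mu = rng.dirichlet(np.ones(3))
    # Half-Cauchy is the absolute value of a standard Cauchy variable
    sigma = abs(rng.standard_cauchy())
    phi = mu / sigma**2
    # One triplet of proportions per segment
    theta = rng.dirichlet(phi, size=len(responses))
    # One row of counts per segment
    y = np.array([
        rng.multinomial(count, proportions)
        for count, proportions in zip(responses, theta)
    ])
    return mu, sigma, theta, y


rng = np.random.default_rng(0)
# Hypothetical numbers of responses in three segments
responses = [100, 10, 500]
mu, sigma, theta, y = simulate_survey(responses, rng)
```

Inference goes in the opposite direction: given \(y\), we seek the values of \(\theta_i\), \(\mu\), and \(\sigma\) that could plausibly have produced it.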

The posterior distribution factorizes as follows:

\[p(\theta_1, \dots, \theta_m, \mu, \sigma | y) \propto p(y | \theta_1, \dots, \theta_m) \, p(\theta_1 | \mu, \sigma) \cdots p(\theta_m | \mu, \sigma) \, p(\mu) \, p(\sigma),\]

which relies on the usual assumption of independence given the parameters. One could make a few simplifications by, for instance, leveraging the conjugacy of the Dirichlet distribution with respect to the multinomial distribution; however, it is not needed in practice, as we shall see shortly.
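
For reference, the conjugacy mentioned above means that, conditionally on \(\mu\) and \(\sigma\), the posterior of each \(\theta_i\) is again a Dirichlet distribution, with the observed counts added to the imaginary ones:

\[\theta_i | y_i, \mu, \sigma \sim \text{Dirichlet}\left(\frac{\mu}{\sigma^2} + y_i\right).\]

This holds only for fixed hyperparameters; since \(\mu\) and \(\sigma\) are inferred as well, the joint posterior still has no standard form.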

The above posterior distribution is our ultimate goal. It is the one that gives us a complete picture of what the true net promoter score in each segment might be given the available evidence, that is, the responses from the survey. All that is left is to draw a large enough sample from this distribution and start to summarize and visualize the results.

Unfortunately, as one might suspect, drawing samples from the posterior is not an easy task. It does not correspond to any standard distribution and hence does not have a readily available random number generator. Fortunately, the topic is sufficiently mature, and techniques have been developed for sampling complex distributions, such as the family of Markov chain Monte Carlo methods. Unfortunately, the most effective and efficient of these techniques are notoriously complex themselves, and it might be extremely difficult and tedious to implement and apply them correctly in practice. Fortunately, the need for versatile tools for modeling and inference with the focus on the problem at hand and not on implementation details has been recognized and addressed. Nontrivial scenarios can be tackled with a surprisingly small amount of effort nowadays, which we illustrate next.

Implementation

In this section, we implement the model using the probabilistic programming language Stan. Stan is straightforward to integrate into one’s workflow, as it has interfaces for many general-purpose programming languages, including Python and R. Here we only highlight the main points of the implementation and leave it to the curious reader to discover Stan on their own.

The following listing is a complete implementation of the model:

data {
  int<lower = 0> m; // The number of segments
  int<lower = 0> n; // The number of categories, which is always three
  int y[m, n]; // The observed counts of detractors, neutrals, and promoters
}

parameters {
  simplex[n] mu;
  real<lower = 0> sigma;
  simplex[n] theta[m];
}

transformed parameters {
  vector<lower = 0>[n] phi;
  phi = mu / sigma^2;
}

model {
  mu ~ uniform(0, 1);
  sigma ~ cauchy(0, 1);
  for (i in 1:m) {
    theta[i] ~ dirichlet(phi);
    y[i] ~ multinomial(theta[i]);
  }
}

It can be seen that the code is very laconic and follows closely the development given in the previous section, including the notation. It is worth noting that, in the model block, we seemingly use unconstrained uniform and Cauchy distributions; however, the constraints are enforced by the definitions of the corresponding hyperparameters, mu and sigma.

This is practically all that is needed; the rest will be taken care of by Stan, which is actually a lot of work, including an adequate initialization, an efficient execution, and necessary diagnostics and quality checks. Under the hood, the sampling of the posterior in Stan is based on the Hamiltonian Monte Carlo algorithm and the no-U-turn sampler, which are considered to be the state of the art.

The output of the sampling procedure is a set of draws from the posterior distribution, which, again, is exhaustive information about the net promoter score in the segments of interest. In particular, one can quantify the uncertainty in and the probability of any statement one makes about the score. For instance, if a concise summary is needed, one could compute the mean of the score and accompany it with a highest-posterior-density credible interval, capturing the true value with the desired probability. However, if applicable, the full distribution should be integrated into the decision-making process.
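
As an illustration of such a summary, the Python sketch below turns draws of \(\theta_i\) for one segment into draws of the score and computes the mean, a highest-density interval, and the probability of the score being positive. The draws here are a stand-in generated from a Dirichlet distribution rather than actual output of Stan, and the function name is made up for this example:

```python
import numpy as np


def highest_density_interval(draws, probability=0.9):
    """Find the shortest interval containing the given share of the draws."""
    draws = np.sort(draws)
    count = int(np.ceil(probability * len(draws)))
    # Widths of all windows spanning count consecutive draws
    widths = draws[count - 1:] - draws[:len(draws) - count + 1]
    start = int(np.argmin(widths))
    return draws[start], draws[start + count - 1]


rng = np.random.default_rng(0)
# Stand-in for posterior draws of theta for one segment (an assumption);
# the columns are detractors, neutrals, and promoters
theta = rng.dirichlet([21, 31, 51], size=4000)

# Each draw of theta gives a draw of the score
score = 100 * (theta[:, 2] - theta[:, 0])

mean = score.mean()
lower, upper = highest_density_interval(score, probability=0.9)
# Probability of having more promoters than detractors
probability_positive = np.mean(score > 0)
```

Any other statement about the score can be assessed in the same way, by computing the fraction of draws for which the statement holds.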

Conclusion

In this article, we have constructed a hierarchical Bayesian model for inferring the net promoter score for an arbitrary segmentation of a customer base. The model features shared parameters, which enable information exchange between the segments. This allows for a more robust estimation of the score, especially in the case of segments with few observations. The final output of the inference is a probability distribution over all possible values of the score in each segment, which lays a solid foundation for the subsequent decision-making. We have also seen how seamlessly the model can be implemented in practice using modern tools for statistical inference, such as Stan.

Lastly, note that the presented model is only one alternative; there are many others. How would you model the net promoter score? What changes would you make? Make sure to leave a comment.

References

Andrew Gelman et al., Bayesian Data Analysis, Chapman and Hall/CRC, 2014.
Andrew Gelman, “Some practical questions about prior distributions,” 2009.

Interactive notebooks in tightly sealed disposable containers

2019-07-24T06:00:00+00:00

It is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.

Python and R (in alphabetic order) are arguably the primary languages used by data scientists nowadays. In the context of interactive computations, IPython and later on Project Jupyter have been of paramount importance for the Python community (the latter is actually language agnostic). In the R community, this role has been played by RStudio. Therefore, having at one’s disposal JupyterLab, which is Project Jupyter’s flagship, and RStudio should make one well equipped for a wide range of data challenges. As alluded to earlier, the objective is to have an environment that has a fixed initial state defined by us and is accessible to us on any machine we might happen to work on. This problem definition is a perfect fit for containerization. Specifically, we shall build custom-tailored Docker images for JupyterLab and RStudio and create a few convenient shortcuts for launching them.

The code discussed below can be found in the following two repositories:

JupyterLab and
RStudio.

JupyterLab

In order to build a Docker image for JupyterLab, we begin with a Dockerfile:

# Start with a minimal Python image
FROM python:3.7-slim

# Install the desired Python packages
COPY requirements.txt /tmp/requirements.txt
RUN pip install --upgrade pip
RUN pip install --upgrade --requirement /tmp/requirements.txt

# Configure JupyterLab to use a specific IP address and port
RUN mkdir -p ~/.jupyter
RUN echo "c.NotebookApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.port = 8888" >> ~/.jupyter/jupyter_notebook_config.py

# Set the working directory
WORKDIR /home/jupyterlab

# Start JupyterLab once the container is launched
ENTRYPOINT jupyter lab --allow-root --no-browser

In words, we take a minimalistic image with the desired version of Python preinstalled—in this case, it is the official Python image tagged 3.7-slim, which refers to Python 3.7 with any available bug fixes promptly applied—and add packages that we consider to be important for our work. These packages are gathered in the usual requirements.txt, which might look as follows:

jupyterlab
matplotlib
numpy
pandas
pylint
pytest
scikit-learn
scipy
seaborn
tensorflow
yapf

The first one, jupyterlab, is essential; the rest is up to the data scientist’s taste. An important aspect to note is that, in this example, the versions of the listed packages are not fixed; hence, the latest available versions will be taken each time a new image is built. Alternatively, one can pin them to specific numbers by changing requirements.txt. For instance, one might write tensorflow==1.14.0 instead of tensorflow.
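When deciding what to pin, it can help to see which entries are still floating. Below is a minimal sketch in shell. The helper name unpinned is hypothetical, and only exact pins of the form package==version are recognized.

```shell
# Hypothetical helper: list the entries of a requirements file that have no
# exact version pin (package==version). Comments and blank lines are skipped.
# $1 is the path to the requirements file.
unpinned() {
    grep -v -E '^[[:space:]]*(#|$)' "$1" | grep -v '=='
}
```

Calling unpinned requirements.txt on the file above would list every package, since none of them is pinned yet.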

Having defined an image, we need a tool for orchestration. We would like to have a convenient command for actually building the image and, more importantly, a convenient command for launching a container with that image from an arbitrary directory. The versatile make to the rescue!

# The name of the Docker image
name := jupyterlab
# The directory to be mounted to the container
root ?= ${PWD}

# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .

# Start a new container
start:
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8888:8888 \
		--volume "${root}:/home/jupyterlab" \
		${name}

In the above Makefile, we define two commands: build and start. The build command instructs Docker to build a new image according to the recipe in Dockerfile. The start command launches a new container and mounts the directory specified by the root variable to the file system inside the container using the --volume option. It also forwards port 8888 inside the container, which is the one specified in Dockerfile, to port 8888 on the host machine so that JupyterLab can be reached from the browser.
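To check how the root variable ends up in the --volume option without launching anything, make can be asked for a dry run. The sketch below assumes the Makefile shown above. The helper name preview_start is hypothetical.

```shell
# Hypothetical helper: print the docker command that the start target would
# run for a given project directory, without launching a container.
# $1 is the folder with the Makefile, and $2 is the directory to be mounted.
preview_start() {
    make -n -C "$1" start root="$2"
}
```

For instance, preview_start . /path/to/some/project should print the docker run command with /path/to/some/project substituted into the --volume option.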

Let us now go ahead and try the two commands:

make build
make start

JupyterLab should come back with usage instructions similar to the following:

...
[I 18:40:15.078 LabApp] The Jupyter Notebook is running at:
[I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=
[I 18:40:15.078 LabApp]  or http://127.0.0.1:8888/?token=
[I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:40:15.082 LabApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://e4edba021595:8888/?token=
     or http://127.0.0.1:8888/?token=
...

By clicking on the last link, we end up in a fully fledged JupyterLab. Congratulations! However, there is one step left. JupyterLab is currently running in the folder with our Dockerfile and Makefile, which is not particularly useful, as each project we might want to work on probably lives in its own folder elsewhere in the file system. Fortunately, it is easy to fix with an alias:

alias jupyterlab='make -C /path/to/the/folder/with/the/Makefile start root="${PWD}"'

This command should be placed in the start-up script of the shell being utilized. In the case of Bash, it can be done as follows:

echo "alias jupyterlab='make -C \"${PWD}\" start root=\"\${PWD}\"'" >> ~/.bashrc

Now, in a new terminal, one should be able to run JupyterLab from any directory as follows:

cd /path/to/some/project
jupyterlab

Note that the content of the current working directory (that is, /path/to/some/project) is readily available inside JupyterLab. All notebooks created or modified in the GUI will be stored directly in this folder, and they will remain there when the container is shut down.

RStudio

It is time to get to grips with an image for R notebooks. As before, we begin with a Dockerfile:

# Start with an RStudio image
FROM rocker/rstudio:latest

# Install the software that R packages require
RUN apt-get update && \
    apt-get install -y libxml2-dev texlive texlive-latex-extra zlib1g-dev

# Set the working directory
WORKDIR /home/rstudio

# Install the desired R packages
COPY requirements.txt /tmp/requirements.txt
RUN echo "install.packages(readLines('/tmp/requirements.txt'), \
                           repos = 'http://cran.us.r-project.org')" | R --no-save

Installing RStudio from scratch is not an easy task. Fortunately, we can start with a ready-made RStudio image maintained by the Rocker project, which is what is specified at the top of the file. If desired, the latest tag can be changed to a specific version. The second block of Docker instructions provides the programs and libraries needed by the R packages that one is planning to install. For instance, TeX Live is needed for rendering notebooks as PDF documents using LaTeX. The last block of instructions in the Dockerfile installs the R packages themselves. As with Python, all necessary packages are gathered in a single file called requirements.txt:

devtools
glmnet
plotly
rmarkdown
rstan
testthat
tidytext
tidyverse

The rmarkdown package is required for notebooks in Markdown. The rest is intended to be changed according to one’s preferences, although tidyverse is arguably a must in modern R.

All right, in order to build the image and launch containers, we create the following Makefile:

# The name of the Docker image
name := rstudio
# The directory to be mounted to the container
root ?= ${PWD}

# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .

# Start a new container
start:
	@echo "Address:  http://localhost:8787/"
	@echo "User:     rstudio"
	@echo "Password: rstud10"
	@echo
	@echo 'Press Control-C to terminate...'
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8787:8787 \
		--volume "${root}:/home/rstudio" \
		--env PASSWORD=rstud10 \
		${name} > /dev/null

It is similar to the one for JupyterLab; however, since the default prompt of RStudio is not as informative as the one of JupyterLab, we print our own usage instructions upon start.

The final piece is the shortcut for launching RStudio:

alias rstudio='make -C /path/to/the/folder/with/the/Makefile start root="${PWD}"'

In the case of Bash, it can be installed as follows:

echo "alias rstudio='make -C \"${PWD}\" start root=\"\${PWD}\"'" >> ~/.bashrc

Now it is time to build the image, go to an arbitrary directory, and test the alias:

make build
cd /path/to/some/project
rstudio

Compared with the JupyterLab image, this one is much slower to build, since many R packages compile a lot of C++ code upon installation.

Lastly, it might be particularly convenient to have one’s GUI preferences (such as the font size in the editor) and the like set up automatically upon each container launch. This is possible because RStudio stores user preferences in a local folder called .rstudio. The start command can then be adjusted to silently plant a preconfigured .rstudio into the current working directory, which can be seen in the repository accompanying this article.
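As a rough illustration of the idea, and not necessarily how the repository does it, the planting step could look as follows. The helper name plant_preferences and the assumption that the preconfigured .rstudio folder is kept next to the Makefile are hypothetical. The sketch also assumes that an existing .rstudio in the project should be left untouched.

```shell
# Hypothetical helper: copy a preconfigured .rstudio folder into a project
# directory unless the project already has one.
# $1 is the folder holding the template, and $2 is the project directory.
plant_preferences() {
    if [ ! -e "$2/.rstudio" ]; then
        cp -R "$1/.rstudio" "$2/.rstudio"
    fi
}
```

Such a step would run before docker run, so that RStudio finds the preferences in the mounted directory when it starts.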

Conclusion

Having completed the above steps, we have two Docker images: one is for Python notebooks via JupyterLab, and one is for R notebooks via RStudio. At the moment, the images are stored locally; however, they can be pushed to a public or private image repository, such as Docker Hub or Google Container Registry, and subsequently pulled on any machine with Docker installed. Alternatively, they can be built on each machine separately. Regardless of the installation, the crucial point is that our working environment will unshakably remain in a specific pristine state defined by us.
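Pushing an image could be sketched as follows. The helper name publish_image is hypothetical, the account or namespace is a placeholder to be replaced with one’s own, and it is assumed that docker login has already been performed for the registry in question.

```shell
# Hypothetical helper: tag a locally built image for a registry namespace
# and push it there.
# $1 is the local image name, and $2 is the account or registry namespace.
publish_image() {
    docker tag "$1" "$2/$1" && docker push "$2/$1"
}
```

On another machine, the image would then be fetched with docker pull followed by the same namespaced name.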

Lastly, it is worth noting that similar images can straightforwardly be built for more specific scenarios. For instance, the following repository provides a skeleton for building and using a custom Datalab, which is Google’s wrapper for Jupyter notebooks that run in the cloud: Datalab.