Long story short:
import tensorflow as tf


# Inherit from any optimizer of choice, such as Adam.
class Optimizer(tf.keras.optimizers.Adam):
    """Optimizer that implements gradient accumulation."""

    def __init__(self, accumulation: int = 1, **options) -> None:
        """Create an instance.

        Arguments:
          accumulation: The number of iterations to accumulate gradients over.
            If it is set to one, no accumulation is performed, and the gradients
            are applied as soon as they are computed. If it is set to a value
            greater than one, the gradients will be accumulated for the specified
            number of iterations and only then applied, starting a new cycle.

        All other arguments are passed to the base optimizer.
        """
        super().__init__(**options)
        self.accumulation = accumulation
        self._accumulation = None
        self._gradients = None

    def apply_gradients(
        self, gradients_variables: list[tuple[tf.Tensor, tf.Tensor]]
    ) -> tf.Tensor:
        """Apply the gradients according to the accumulation scheme."""
        # Split off the gradients from the trainable variables.
        gradients, variables = zip(*list(gradients_variables))
        # Perform the initialization if needed.
        with tf.init_scope():
            self.build(variables)
        first = self._accumulation % self.accumulation == 0
        last = (self._accumulation + 1) % self.accumulation == 0
        # Add the new gradients to the old ones with resetting if needed.
        for gradient, increment in zip(self._gradients, gradients):
            gradient.assign(tf.cast(~first, tf.float32) * gradient + increment)
        # Apply the average accumulated gradients to the trainable variables.
        gradients = [gradient / self.accumulation for gradient in self._gradients]
        super().apply_gradients(zip(gradients, variables))
        # Decrement the base counter incremented by the application if needed.
        self.iterations.assign_sub(tf.cast(~last, tf.int64))
        # Increment the accumulation counter.
        self._accumulation.assign_add(1)
        return self.iterations

    def update_step(self, gradient: tf.Tensor, variable: tf.Tensor) -> None:
        """Update the trainable variable with the gradient."""
        update_step = super().update_step
        last = (self._accumulation + 1) % self.accumulation == 0
        # Allow the update to happen only at the end of each cycle.
        tf.cond(last, lambda: update_step(gradient, variable), lambda: None)

    def build(self, variables: list[tf.Tensor]) -> None:
        """Initialize the internal state."""
        super().build(variables)
        if self._gradients is None:
            # Create a counter for tracking accumulation.
            self._accumulation = self.add_variable(shape=(), dtype=tf.int64)
            # Allocate memory for accumulation.
            self._gradients = [
                self.add_variable_from_reference(
                    model_variable=variable,
                    variable_name="gradient",
                )
                for variable in variables
            ]
It is important to note that the learning rate is not held constant during accumulation. However, since it is not expected to change much from one iteration to another, it is an adequate simplification.
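To make the schedule easier to follow, here is a minimal sketch that mimics the counters with plain numbers instead of tensors. It is not part of the optimizer; the function `simulate` and its variable names are hypothetical and only mirror the `first` and `last` flags used above.

```python
def simulate(accumulation: int, gradients: list[float]) -> tuple[list[float], int]:
    """Mimic the accumulation scheme with plain numbers instead of tensors."""
    buffer = 0.0  # plays the role of one accumulated gradient
    counter = 0  # plays the role of the accumulation counter
    iterations = 0  # plays the role of the base optimizer's counter
    applied = []
    for gradient in gradients:
        first = counter % accumulation == 0
        last = (counter + 1) % accumulation == 0
        # Reset the buffer at the start of a cycle, then add the new gradient.
        buffer = (0.0 if first else buffer) + gradient
        if last:
            # Only the last iteration of a cycle applies the average gradient
            # and leaves the base counter incremented.
            applied.append(buffer / accumulation)
            iterations += 1
        counter += 1
    return applied, iterations


applied, iterations = simulate(accumulation=2, gradients=[1.0, 3.0, 5.0, 7.0])
print(applied, iterations)
```

With an accumulation of two, four incoming gradients result in two updates, each using the average of its cycle.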
I would like to thank André Pedersen, Axel Roebel, and Tor-Arne Nordmo for their help with the implementation.
The classical attention is formalized as follows:
\[A = \text{softmax}\left( \frac{QK^{T}}{\sqrt{n_d}} \right) V\]where \(K\), \(V\), and \(Q\) are the keys, values, and queries, respectively. The keys and values are of shape \(n_s \times n_h \times n_{t_1} \times n_d\) where \(n_s\) is the batch size (s for space), \(n_h\) is the number of attention heads, \(n_{t_1}\) is the window size (t for time) of the input sequence, and \(n_d\) is the head size. The queries are of shape \(n_s \times n_h \times n_{t_2} \times n_d\) where \(n_{t_2}\) is the window size of the output sequence.
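As a point of reference, the formula can be sketched in plain Python for a single head and a single batch element, that is, with the first two dimensions dropped. The helper names `softmax` and `attention` are chosen for this illustration only, and nested lists stand in for tensors.

```python
import math


def softmax(row: list[float]) -> list[float]:
    """Normalize a row of scores into probabilities."""
    peak = max(row)  # subtracted for numerical stability
    exponents = [math.exp(value - peak) for value in row]
    total = sum(exponents)
    return [value / total for value in exponents]


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(n_d)) V for one head."""
    n_d = len(Q[0])
    scores = [
        [sum(q * k for q, k in zip(query, key)) / math.sqrt(n_d) for key in K]
        for query in Q
    ]
    weights = [softmax(row) for row in scores]
    return [
        [sum(w * V[j][c] for j, w in enumerate(row)) for c in range(len(V[0]))]
        for row in weights
    ]


# A zero query attends to all keys equally, so the output averages the values.
A = attention(Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[2.0, 0.0], [0.0, 4.0]])
print(A)
```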
The relative attention obtains one additional term in the numerator:
\[A = \text{softmax}\left( \frac{QK^T + S}{\sqrt{n_d}} \right) V. \tag{1}\]In the above, \(S\) is of shape \(n_s \times n_h \times n_{t_2} \times n_{t_1}\) and calculated based on \(Q\) and a matrix \(E\) of shape \(n_d \times n_{t_3}\) containing relative positional embeddings. The typical context is causal self-attention, in which \(n_{t_3}\) is thought of as the maximum allowed length of the input sequence and set to \(n_{t_1}\), with the interpretation that the embeddings are running from position \(-n_{t_1} + 1\) (the most distant past) up to \(0\) (the present moment). Then \(S\) is a specific arrangement of the inner products between the queries in \(Q\) and the embeddings in \(E\) so as to respect the arrangement in \(QK^T\).
The original and the more memory-efficient calculations of \(S\) in the case of causal attention are illustrated in the figure below, which is taken from Huang et al. (2018).
The rightmost matrix shows how \(S\) is arranged. Since the use case is causal attention, the upper triangle above the main diagonal (gray circles) is irrelevant and can contain arbitrary values, which it does in the algorithm proposed in Huang et al. (2018). The main diagonal (green circles) contains the inner products of the queries and the embedding corresponding to position \(0\). The first subdiagonal (pink circles) contains the inner products of the queries, except for the first one since it has no past, and the embedding corresponding to position \(-1\). It continues in this way down to \(-n_{t_1} + 1\), in which case only the last query is involved, since it comes last in the sequence and has the longest past.
The calculation given in Huang et al. (2018) reduces the intermediate memory requirement from \(\mathcal{O}(n_h \, n_d \, n_t^2)\) to \(\mathcal{O}(n_h \, n_d \, n_t)\) where \(n_t\) is a general sequence length. However, it is limited to self-attention with causal connectivity, which is what is found in decoder blocks. It is not suitable for other attention patterns. Therefore, it cannot be used in, for instance, encoder blocks and decoder blocks with cross-attention, which usually have non-causal attention. In what follows, the limitation is lifted.
Let us extend \(E\) to be of shape \(n_d \times (2 n_{t_3} - 1)\) so that it has an embedding for any relative position not only when looking back in the past but also forward into the future, with \(n_{t_3}\) being the maximum allowed length of the input sequence as before, that is, \(n_{t_1} \leq n_{t_3}\). Let us also interpret \(E\)’s columns as running from position \(n_{t_3} - 1\) (the most distant future) to position \(-n_{t_3} + 1\) (the most distant past). For instance, when the output sequence is of length \(n_{t_3}\) (the longest possible), the first query (position \(0\)) will be “interested” only in columns \(0\) through \(n_{t_3} - 1\) inclusive, while the last (position \(n_{t_3} - 1\)) only in columns \(n_{t_3} - 1\) through \(2 n_{t_3} - 2\) inclusive.
Similarly to Huang et al. (2018), we note that multiplying \(Q\) by \(E\) results in a matrix that contains all the inner products necessary for assembling \(S\) in the general case. For instance, for \(n_{t_3} = 4\) and dropping the batch and head dimensions for clearer visualization, the product is as follows:
\[QE = \left( \begin{matrix} s_{0 + 3} & s_{0 + 2} & s_{0 + 1} & s_{0 + 0} & & & \\ & s_{1 + 2} & s_{1 + 1} & s_{1 + 0} & s_{1 - 1} & & \\ & & s_{2 + 1} & s_{2 + 0} & s_{2 - 1} & s_{2 - 2} & \\ & & & s_{3 + 0} & s_{3 - 1} & s_{3 - 2} & s_{3 - 3} \\ \end{matrix} \right)\]where \(s_{i + t}\) denotes query \(i\) embedded to look at relative time \(t\), that is, the inner product between the query at position \(i\) and the embedding corresponding to a relative attention shift of \(t\), which is stored in column \(n_{t_3} - 1 - t\) of \(E\). For instance, for \(s_{2 - 1}\) with \(n_{t_3} = 4\) still, the inner product is between row \(2\) of \(Q\) and column \(4 - 1 - (-1) = 4\) of \(E\).
The target arrangement is then simply the one where we stack the “interesting” diagonals of \(QE\) on top of each other from diagonal \(0\) (the main diagonal) at the bottom to diagonal \(n_{t_3} - 1\) (the rightmost relevant superdiagonal) at the top
\[\bar{S} = \left( \begin{matrix} s_{0 + 0} & s_{1 - 1} & s_{2 - 2} & s_{3 - 3} \\ s_{0 + 1} & s_{1 + 0} & s_{2 - 1} & s_{3 - 2} \\ s_{0 + 2} & s_{1 + 1} & s_{2 + 0} & s_{3 - 1} \\ s_{0 + 3} & s_{1 + 2} & s_{2 + 1} & s_{3 + 0} \\ \end{matrix} \right)\]and then transpose the result
\[S = \left( \begin{matrix} s_{0 + 0} & s_{0 + 1} & s_{0 + 2} & s_{0 + 3} \\ s_{1 - 1} & s_{1 + 0} & s_{1 + 1} & s_{1 + 2} \\ s_{2 - 2} & s_{2 - 1} & s_{2 + 0} & s_{2 + 1} \\ s_{3 - 3} & s_{3 - 2} & s_{3 - 1} & s_{3 + 0} \\ \end{matrix} \right).\]More generally, the algorithm can be summarized as follows:
\[S = \text{transpose}\left( \text{diagonal}\left( QE, \, \text{lower}=0, \, \text{upper}=n_{t_3} - 1 \right) \right)\]where \(\text{diagonal}\) is a function taking a tensor and stacking its diagonals (specified by a range with two offsets relative to the main diagonal) from bottom up, and \(\text{transpose}\) is a function taking a tensor and transposing it. Both functions operate on the last two dimensions of the given tensor. The resulting matrix can then be plugged into Equation (1) to complete the calculation.
In case the keys and values are shorter than the maximum allowed relative position, that is, \(n_{t_1} < n_{t_3}\), \(S\) should be truncated to its intended shape, \(n_s \times n_h \times n_{t_2} \times n_{t_1}\):
\[S = \text{truncate}\left( \text{transpose}\left( \text{diagonal}\left( QE, \, \text{lower}=0, \, \text{upper}=n_{t_3} - 1 \right) \right), \text{keep} = n_{t_1} \right)\]where \(\text{truncate}\) is a function taking a tensor and keeping only the specified number of its first elements in the last dimension, discarding the rest.
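The three operations can be checked on a toy example. The following is a minimal sketch in plain Python with the batch and head dimensions dropped; the helpers `matmul`, `diagonal`, `transpose`, and `truncate` are written for this illustration and only imitate the tensor operations named above. The result is compared against a direct computation in which query \(i\) meets key \(j\) through the embedding stored in column \(n_{t_3} - 1 - (j - i)\) of \(E\).

```python
import random


def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)] for row in A]


def diagonal(M, lower, upper):
    """Stack diagonals lower..upper of M with the highest offset on top."""
    rows, columns = len(M), len(M[0])
    return [
        [M[i][i + k] for i in range(min(rows, columns - k))]
        for k in range(upper, lower - 1, -1)
    ]


def transpose(M):
    return [list(column) for column in zip(*M)]


def truncate(M, keep):
    """Keep the first `keep` elements of the last dimension."""
    return [row[:keep] for row in M]


random.seed(0)
n_d, n_t3, n_t1 = 3, 4, 3  # head size, maximum length, key length
n_t2 = n_t3  # query length
Q = [[random.gauss(0, 1) for _ in range(n_d)] for _ in range(n_t2)]
E = [[random.gauss(0, 1) for _ in range(2 * n_t3 - 1)] for _ in range(n_d)]

S = truncate(transpose(diagonal(matmul(Q, E), 0, n_t3 - 1)), n_t1)

# Direct computation: relative shift j - i lives in column n_t3 - 1 - (j - i).
reference = [
    [sum(Q[i][c] * E[c][n_t3 - 1 - (j - i)] for c in range(n_d)) for j in range(n_t1)]
    for i in range(n_t2)
]
print(len(S), len(S[0]))
```

The direct computation loops over every pair of positions, whereas the assembled version needs only one matrix product followed by index rearrangements.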
It can be seen that the algorithm has the same intermediate memory requirement as the one proposed in Huang et al. (2018), that is, \(\mathcal{O}(n_h \, n_d \, n_t)\); however, its application scope is larger.
In TensorFlow, the algorithm can be implemented as an embedding layer as follows:
import tensorflow as tf


class RelativePositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, head_size: int, sequence_length: int) -> None:
        super().__init__()
        # One embedding per relative position, from the most distant future
        # to the most distant past.
        self.projection = self.add_weight(
            shape=(head_size, 2 * sequence_length - 1),
            initializer="glorot_uniform",
            trainable=True,
        )
        self.sequence_length = sequence_length

    def call(self, Q: tf.Tensor) -> tf.Tensor:
        S = tf.matmul(Q, self.projection)
        # Stack the main diagonal and the relevant superdiagonals.
        S = tf.linalg.diag_part(S, k=(0, self.sequence_length - 1))
        S = tf.transpose(S, perm=[0, 1, 3, 2])
        return S
The above layer can be invoked as part of an attention layer as illustrated below:
class Attention(tf.keras.layers.Layer):
    def __init__(self, head_size: int, sequence_length: int) -> None:
        super().__init__()
        self.head_size = head_size
        self.positional_embedding = RelativePositionalEmbedding(
            head_size=head_size,
            sequence_length=sequence_length,
        )

    def call(self, K: tf.Tensor, V: tf.Tensor, Q: tf.Tensor) -> tf.Tensor:
        # TODO: Add permutation if needed.
        S = self.positional_embedding(Q)
        W = tf.matmul(Q, K, transpose_b=True)
        W = W + S[:, :, :, : K.shape[2]]
        W = W * self.head_size**-0.5
        # TODO: Add masking if needed.
        W = tf.nn.softmax(W, axis=-1)
        # TODO: Add dropout if needed.
        A = tf.matmul(W, V)
        # TODO: Add dropout if needed.
        return A
The approach we shall discuss is based on the family of Dirichlet processes. How specifically such processes are constructed will be described in the next section; here, we focus on the big picture.
A Dirichlet process is a stochastic process, that is, an indexed sequence of random variables. Each realization of this process is a discrete probability distribution, which makes the process a distribution over distributions, similarly to a Dirichlet distribution. The process has only one parameter: a measure \(\nu: \mathcal{B} \to [0, \infty]\) in a suitable finite measure space \((\mathcal{X}, \mathcal{B}, \nu)\) where \(\mathcal{X}\) is a set, and \(\mathcal{B}\) is a \(\sigma\)-algebra on \(\mathcal{X}\). We shall adopt the following notation:
\[P \sim \text{Dirichlet Process}(\nu)\]where \(P\) is a random probability distribution that is distributed according to the Dirichlet process. Note that measure \(\nu\) does not have to be a probability measure; that is, \(\nu(\mathcal{X}) = 1\) is not required. To obtain a probability measure, one can divide \(\nu\) by the total volume \(\lambda = \nu(\mathcal{X})\):
\[P_0(\cdot) = \frac{1}{\lambda} \nu(\cdot).\]Since this normalization is always possible, it is common and convenient to replace \(\nu\) with \(\lambda P_0\) and consider the process to be parametrized by two quantities instead of one:
\[P \sim \text{Dirichlet Process}(\lambda P_0).\]Parameter \(\lambda\) is referred to as the concentration parameter of the process.
There are two major ways of using the Dirichlet process for estimating distributions: as a direct prior for the data at hand and as a mixing prior. We begin with the former.
Given a data set of \(n\) observations \(\{ x_i \}_{i = 1}^n\), a Dirichlet process can be used as a prior:
\[\begin{align} x_i | P_x & \sim P_x, \text{ for } i = 1, \dots, n; \text{ and} \\ P_x & \sim \text{Dirichlet Process}(\lambda P_0). \tag{1} \end{align}\]It is important to realize that the \(x_i\)’s are assumed to be distributed not according to the Dirichlet process but according to a distribution drawn from the Dirichlet process. Parameter \(\lambda\) allows one to control the strength of the prior: the larger it is, the more shrinkage toward the prior is induced.
Due to the conjugacy property of the Dirichlet process in the above setting, the posterior is also a Dirichlet process and has the following simple form:
\[P_x | \{ x_i \}_{i = 1}^n \sim \text{Dirichlet Process}\left( \lambda P_0 + \sum_{i = 1}^n \delta_{x_i} \right). \tag{2}\]That is, the total volume and normalized measure are updated as follows:
\[\begin{align} \lambda & := \lambda + n \quad \text{and} \\ P_0 & := \frac{\lambda}{\lambda + n} P_0 + \frac{1}{\lambda + n} \sum_{i = 1}^n \delta_{x_i}. \end{align}\]Here, \(\delta_x(\cdot)\) is the Dirac measure, meaning that \(\delta_x(X) = 1\) if \(x \in X\) for any \(X \subseteq \mathcal{X}\), and otherwise, it is zero. It can be seen in Equation (2) that the base measure has simply been augmented with unit masses placed at the \(n\) observed data points.
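In practical terms, the updated normalized measure is a two-component mixture: with probability \(\lambda / (\lambda + n)\), an atom comes from the original \(P_0\), and otherwise, it is one of the observations picked uniformly at random. Below is a minimal sketch of this reading; the function names and the data values are made up for illustration.

```python
import random


def posterior_parameters(lam: float, data: list[float]) -> tuple[float, float]:
    """Return the updated concentration and the weight of the original P_0."""
    n = len(data)
    return lam + n, lam / (lam + n)


def draw_posterior_atom(lam, data, draw_prior, rng):
    """Draw one atom from the updated normalized base measure."""
    _, prior_weight = posterior_parameters(lam, data)
    if rng.random() < prior_weight:
        return draw_prior(rng)  # an atom from the original P_0
    return rng.choice(data)  # a unit mass at one of the observations


rng = random.Random(0)
data = [9.2, 19.5, 20.1, 22.9, 33.0]  # made-up observations
lam_new, prior_weight = posterior_parameters(lam=5.0, data=data)
atoms = [
    draw_posterior_atom(5.0, data, lambda r: r.gauss(20.0, 5.0), rng)
    for _ in range(10)
]
print(lam_new, prior_weight)
```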
The main question now is how to draw samples from a Dirichlet process given \(\lambda\) and \(P_0\).
As noted earlier, a draw from a Dirichlet process is a discrete probability distribution \(P_x\). The probability measure of this distribution admits the following representation:
\[P_x(\cdot) = \sum_{i = 1}^\infty p_i \delta_{x_i}(\cdot) \tag{3}\]where \(\{ p_i \}\) is a set of probabilities that sum up to one, and \(\{ x_i \}\) is a set of points in \(\mathcal{X}\). Such a draw can be obtained using the so-called stick-breaking construction, which prescribes \(\{ p_i \}\) and \(\{ x_i \}\). To begin with, for practical computations, the infinite summation is truncated to retain only the first \(m\) elements:
\[P_x(\cdot) = \sum_{i = 1}^m p_i \delta_{x_i}(\cdot).\]Atoms \(\{ x_i \}_{i = 1}^m\) are drawn independently from the normalized base measure \(P_0\). The calculation of probabilities \(\{ p_i \}\) is more elaborate, and this is where the construction and this article get their name, “stick breaking.” Imagine a stick of unit length, representing the total probability. The procedure is to keep breaking the stick into two parts where, for each iteration, the left part yields \(p_i\), and the right one, the remainder, is carried over to the next iteration. How much to break off is decided on by drawing \(m\) independent realizations from a carefully chosen beta distribution:
\[q_i \sim \text{Beta}(1, \lambda), \text{ for } i = 1, \dots, m. \tag{4}\]All of them lie in the unit interval and are the proportions to break off of the remainder. When \(\lambda = 1\), these proportions (of the remainder) are uniformly distributed. When \(\lambda < 1\), the probability mass is shifted to the right, which means that there are likely to be a small number of large pieces, covering virtually the entire stick. When \(\lambda > 1\), the probability mass is shifted to the left, which means that there are likely to be a large number of small pieces, struggling to reach the end of the stick.
Formally, the desired probabilities are given by the following expression:
\[p_i = q_i \prod_{j = 1}^{i - 1} (1 - q_j), \text{ for } i = 1, \dots, m,\]which, as noted earlier, are the left parts of the remainder of the stick during each iteration. For instance, \(p_1 = q_1\), \(p_2 = q_2 (1 - q_1)\), and so on. Due to the truncation, the probabilities \(\{ p_i \}_{i = 1}^m\) do not sum up to one, and it is common to set \(q_m := 1\) so that \(p_m\) takes up the remaining probability mass.
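The construction translates into a short loop. The sketch below uses Python's standard `random.betavariate` and applies the convention \(q_m := 1\); the function name `break_stick` is chosen for this illustration.

```python
import random


def break_stick(lam: float, m: int, rng: random.Random) -> list[float]:
    """Return m probabilities from the truncated stick-breaking construction."""
    remainder = 1.0  # the part of the stick that is still unbroken
    probabilities = []
    for i in range(m):
        # The last proportion is set to one to use up the remainder.
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        probabilities.append(q * remainder)
        remainder *= 1.0 - q
    return probabilities


rng = random.Random(0)
p = break_stick(lam=2.0, m=10, rng=rng)
print(sum(p))
```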
To recapitulate, a single draw from a Dirichlet process is obtained in two steps: prescribe atoms \(\{ x_i \}\) via draws from the normalized base measure and prescribe the corresponding probabilities \(\{ p_i \}\) via the stick-breaking construction. The two give a complete description of a discrete probability distribution. Recall that this distribution is still a single draw. By repeating this process many times, one obtains the distribution of this distribution, which can be used to, for instance, quantify uncertainty in the estimation.
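The two steps, and the repetition that yields uncertainty estimates, can be sketched as follows. This is an illustration under assumed settings, not the code behind the figures: the Gaussian base measure with mean 20 and standard deviation 5, the concentration of 10, the truncation at 50 atoms, and the 200 repetitions are all arbitrary choices.

```python
import random


def draw_process(lam, m, draw_base, rng):
    """Return the atoms and probabilities of one truncated draw."""
    atoms = [draw_base(rng) for _ in range(m)]
    probabilities, remainder = [], 1.0
    for i in range(m):
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        probabilities.append(q * remainder)
        remainder *= 1.0 - q
    return atoms, probabilities


def cdf(x, atoms, probabilities):
    """Evaluate the cumulative distribution function of a discrete draw."""
    return sum(p for atom, p in zip(atoms, probabilities) if atom <= x)


rng = random.Random(1)
draws = [
    draw_process(10.0, 50, lambda r: r.gauss(20.0, 5.0), rng) for _ in range(200)
]
# Each draw gives a different value of the distribution function at 20; the
# spread of these values is a rough measure of uncertainty at that point.
values = sorted(cdf(20.0, atoms, probabilities) for atoms, probabilities in draws)
low, high = values[4], values[194]  # roughly the 0.025 and 0.975 quantiles
print(low, high)
```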
It is time to demonstrate how the Dirichlet process behaves as a direct prior. To this end, we shall use a data set containing velocities of “82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region.” It was studied in Roeder (1990), which gives us a reference point.
For the curious reader, the source code of this notebook along with auxiliary scripts that are used for performing all the calculations presented below can be found on GitHub.
The empirical cumulative distribution function of the velocity is as follows:
Already here, it is apparent that the distribution is multimodal: there are two distinct regions, one to the left and one to the right, where the curve is flat, meaning there are no observations there. The proverbial histogram gives a confirmation:
It can be seen that there is a handful of galaxies moving relatively slowly and relatively fast compared to the majority located somewhere in the middle around twenty thousand kilometers per second. For completeness, kernel density estimation results in the following plot:
How many clusters of galaxies are there? What are their average velocities? How uncertain are these estimates? Our goal is to answer these questions by virtue of the Dirichlet process.
Now that the intention is to apply the presented theory in practice, we have to make all the choices we have conveniently glossed over. Specifically, \(P_0\) has to be chosen, and we shall use the following:
\[P_0(\cdot) = \text{Gaussian}(\, \cdot \, | \mu_0, \sigma_0^2). \tag{5}\]In the above, \(\text{Gaussian}(\cdot)\) refers to the probability measure of a Gaussian distribution with parameters \(\mu_0\) and \(\sigma_0\). In addition to these two, there is one more: \(\lambda\). We shall set \(\mu_0\) and \(\sigma_0\) to 20 and 5, respectively—which correspond roughly to the mean and standard deviation of the data—and present results for different \(\lambda\)’s to investigate how the prior volume affects shrinkage toward the prior.
First, we do not condition on the data to get a better understanding of the prior itself, which corresponds to Equation (1). The following figure shows a single draw from four Dirichlet processes with different \(\lambda\)’s (the gray curves show the cumulative distribution function of the data as a reference):
It can be seen that the larger the prior volume, the smoother the curve. This is because larger \(\lambda\)’s “break” the stick into more pieces, allowing the normalized base measure to be extensively sampled, which, in the limit, converges to this very measure; see Equation (5).
Now, conditioning on the observed velocities of galaxies, that is, sampling as shown in Equation (2), we obtain the following draws from the posterior Dirichlet processes with different \(\lambda\)’s:
When the prior volume is small, virtually no data points come from \(P_0\); instead, they are mostly uniform draws from the observed data set, leading to a curve that is nearly indistinguishable from the one of the data (the top curve). As \(\lambda\) gets larger, the prior gets stronger, and the estimate gets shrunk toward it, up to a point where the observations appear to be entirely ignored (the bottom curve).
The above model has a serious limitation: it assumes a discrete probability distribution for the data-generating process, which can be seen in the prior and posterior given in Equations (1) and (2), respectively, and it is also apparent in the decomposition given in Equation (3). In some cases, it might be appropriate; however, there are arguably more situations where it is inadequate, including the running example.
Instead of using a Dirichlet process as a direct prior for the given data, it can be used as a prior for mixing distributions from a given family. The resulting posterior will then naturally inherit the properties of the family, such as continuity. The general structure is as follows:
\[\begin{align} x_i | \theta_i & \sim P_x \left( \theta_i \right), \text{ for } i = 1, \dots, n; \tag{6} \\ \theta_i | P_\theta & \sim P_\theta, \text{ for } i = 1, \dots, n; \text{ and} \\ P_\theta & \sim \text{Dirichlet Process}(\lambda P_0). \\ \end{align}\]The \(i\)th data point, \(x_i\), is distributed according to distribution \(P_x\) with parameters \(\theta_i\). For instance, \(P_x\) could refer to the Gaussian family with \(\theta_i = (\mu_i, \sigma_i)\) identifying a particular member of the family by its mean and standard deviation. Parameters \(\{ \theta_i \}_{i = 1}^n\) are unknown and distributed according to distribution \(P_\theta\). Distribution \(P_\theta\) is not known either and gets a Dirichlet process prior with measure \(\lambda P_0\).
It can be seen in Equation (6) that each data point can potentially have its own unique set of parameters. However, this is not what usually happens in practice. If \(\lambda\) is reasonably small, the vast majority of the stick—the one we explained how to break in the previous section—tends to be consumed by a small number of pieces. This makes many data points share the same parameters, which is akin to clustering. In fact, clustering is a prominent use case for the Dirichlet process.
Unlike the previous model, there is no conjugacy in this case, and hence the posterior is not a Dirichlet process. There is, however, a simple Markov chain Monte Carlo sampling strategy based on the stick-breaking construction. It belongs to the class of Gibbs samplers and is as follows.
Similarly to Equation (3), we have the following decomposition:
\[P_m(\cdot) = \sum_{i = 1}^\infty p_i P_x(\cdot | \theta_i)\]where \(P_m\) is the probability measure of the mixture. As before, the infinite decomposition has to be made finite to be usable in practice:
\[P_m(\cdot) = \sum_{i = 1}^m p_i P_x(\cdot | \theta_i).\]Here, \(m\) represents an upper limit on the number of mixture components. Each data point \(x_i\), for \(i = 1, \dots, n\), is mapped to one of the \(m\) components, which we denote by \(k_i \in \{ 1, \dots, m \}\). In other words, \(k_i\) takes values from 1 to \(m\) and gives the index of the component of the \(i\)th observation.
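For a Gaussian family, the density of the truncated mixture is a weighted sum of Gaussian densities. Here is a minimal sketch with made-up weights and parameters; the function names are chosen for this illustration.

```python
import math


def gaussian_density(x: float, mean: float, deviation: float) -> float:
    z = (x - mean) / deviation
    return math.exp(-0.5 * z * z) / (deviation * math.sqrt(2.0 * math.pi))


def mixture_density(x, probabilities, parameters):
    """Evaluate the density of a finite Gaussian mixture at x."""
    return sum(
        p * gaussian_density(x, mean, deviation)
        for p, (mean, deviation) in zip(probabilities, parameters)
    )


probabilities = [0.7, 0.3]  # made-up mixture weights
parameters = [(20.0, 2.0), (33.0, 1.0)]  # made-up means and standard deviations
density = mixture_density(21.0, probabilities, parameters)
print(density)
```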
There are \(m + m \times |\theta| + n\) parameters to be inferred where \(|\theta|\) denotes the number of parameters of \(P_x\). These parameters are \(\{ p_i \}_{i = 1}^m\), \(\{ \theta_i \}_{i = 1}^m\), and \(\{ k_i \}_{i = 1}^n\). As usual in Gibbs sampling, the parameters assume arbitrary but compatible initial values. The sampler has the following three steps.
First, given \(\{ p_i \}\), \(\{ \theta_i \}\), and \(\{ x_i \}\), the mapping of the observations to the mixture components, \(\{ k_i \}\), is updated as follows:
\[k_i \sim \text{Categorical}\left( m, \left\{ \frac{p_j P_x(x_i | \theta_j)}{\sum_{l = 1}^m p_l P_x(x_i | \theta_l)} \right\}_{j = 1}^m \right), \text{ for } i = 1, \dots, n.\]That is, \(k_i\) is a draw from a categorical distribution with \(m\) categories whose unnormalized probabilities are given by \(p_j P_x(x_i | \theta_j)\), for \(j = 1, \dots, m\).
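The first step can be sketched with `random.choices`, which accepts unnormalized weights. The sketch assumes a Gaussian \(P_x\) and indexes components from zero rather than one; the function names and the toy inputs are illustrative.

```python
import math
import random


def gaussian_density(x: float, mean: float, deviation: float) -> float:
    z = (x - mean) / deviation
    return math.exp(-0.5 * z * z) / (deviation * math.sqrt(2.0 * math.pi))


def update_assignments(data, probabilities, parameters, rng):
    """Sample a component index for each observation."""
    assignments = []
    for x in data:
        # Unnormalized probabilities of the components for this observation.
        weights = [
            p * gaussian_density(x, mean, deviation)
            for p, (mean, deviation) in zip(probabilities, parameters)
        ]
        assignments.append(rng.choices(range(len(weights)), weights=weights)[0])
    return assignments


rng = random.Random(0)
# Two components that are so far apart that the allocation is unambiguous.
assignments = update_assignments(
    data=[0.0, 100.0],
    probabilities=[0.5, 0.5],
    parameters=[(0.0, 1.0), (100.0, 1.0)],
    rng=rng,
)
print(assignments)
```

Note that `random.choices` raises an error when all weights are zero, which can happen through numerical underflow for an observation far away from every component; working with logarithms would be a more robust choice in a real implementation.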
Second, given \(\{ k_i \}\), the probabilities of the mixture components, \(\{ p_i \}\), are updated using the stick-breaking construction described earlier. This time, however, the beta distribution for sampling \(\{ q_i \}\) in Equation (4) is replaced with the following:
\[q_i \sim \text{Beta}\left( 1 + n_i, \lambda + \sum_{j = i + 1}^m n_j \right), \text{ for } i = 1, \dots, m,\]where
\[n_i = \sum_{j = 1}^n I_{\{i\}}(k_j), \text{ for } i = 1, \dots, m,\]is the number of data points that are currently allocated to component \(i\). Here, \(I_A\) is the indicator function of a set \(A\). As before, in order for the \(p_i\)’s to sum up to one, it is common to set \(q_m := 1\).
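The second step differs from the earlier stick breaking only in the parameters of the beta distribution. Below is a minimal sketch with components indexed from zero; the function name and the toy assignments are illustrative.

```python
import random


def update_probabilities(assignments, m, lam, rng):
    """Return the component counts and the updated mixture probabilities."""
    counts = [0] * m
    for k in assignments:
        counts[k] += 1
    probabilities, remainder = [], 1.0
    tail = len(assignments)
    for i in range(m):
        tail -= counts[i]  # the number of points in the components after i
        q = 1.0 if i == m - 1 else rng.betavariate(1.0 + counts[i], lam + tail)
        probabilities.append(q * remainder)
        remainder *= 1.0 - q
    return counts, probabilities


rng = random.Random(0)
counts, probabilities = update_probabilities(
    assignments=[0, 0, 1, 2, 2, 2], m=4, lam=1.0, rng=rng
)
print(counts, probabilities)
```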
Third, given \(\{ k_i \}\) and \(\{ x_i \}\), the parameters of the mixture components, \(\{ \theta_i \}\), are updated. This is done by sampling from the posterior distribution of each component. In this case, the posterior is a prior of choice that is updated using the data points that are currently allocated to the corresponding component. To streamline this step, a conjugate prior for the data distribution, \(P_x\), is commonly utilized, which we shall illustrate shortly.
To recapitulate, a single draw from the posterior is obtained in a number of steps where parameters or groups of parameters are updated in turn, while treating the other parameters as known. This Gibbs procedure is very flexible. Other parameters can be inferred too, instead of setting them to fixed values. An important example is the concentration parameter, \(\lambda\). This parameter controls the formation of clusters, and one might let the data decide what the value should be, in which case a step similar to the third one is added to the procedure to update \(\lambda\). This will also be illustrated below.
We continue working with the galaxy data. For concreteness, consider the following choices:
\[\begin{align} \theta_i &= (\mu_i, \sigma_i), \text{ for } i = 1, \dots, n; \\ P_x (\theta_i) &= \text{Gaussian}(\mu_i, \sigma_i^2), \text{ for } i = 1, \dots, n; \text{ and} \\ P_0(\cdot) &= \text{Gaussian–Scaled-Inverse-}\chi^2(\, \cdot \, | \mu_0, \kappa_0, \nu_0, \sigma_0^2). \end{align} \tag{7}\]In the above, \(\text{Gaussian–Scaled-Inverse-}\chi^2(\cdot)\) refers to the probability measure of a bivariate distribution composed of a conditional Gaussian and an unconditional scaled inverse chi-squared distribution. Some intuition about this distribution can be built via the following decomposition:
\[\begin{align} \mu_i | \sigma_i^2 & \sim \text{Gaussian}\left(\mu_0, \frac{\sigma_i^2}{\kappa_0}\right) \text{ and} \\ \sigma_i^2 & \sim \text{Scaled-Inverse-}\chi^2(\nu_0, \sigma_0^2). \end{align} \tag{8}\]This prior is a conjugate prior for a Gaussian data distribution with unknown mean and variance, which we assume here. This means that the posterior is also a Gaussian–scaled-inverse-chi-squared distribution. Given a data set with \(n\) observations \(x_1, \dots, x_n\), the four parameters of the prior are updated simultaneously (not sequentially) as follows:
\[\begin{align} \mu_0 & := \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{n}{\kappa_0 + n} \mu_x, \\ \kappa_0 & := \kappa_0 + n, \\ \nu_0 & := \nu_0 + n, \text{ and} \\ \sigma_0^2 & := \frac{1}{\nu_0 + n} \left( \nu_0 \sigma_0^2 + ss_x + \frac{\kappa_0 n}{\kappa_0 + n}(\mu_x - \mu_0)^2 \right) \end{align}\]where \(\mu_x = \sum_{i = 1}^n x_i / n\) and \(ss_x = \sum_{i = 1}^n (x_i - \mu_x)^2\). It can be seen that \(\kappa_0\) and \(\nu_0\) act as counters of the number of observations; \(\mu_0\) is a weighted sum of two means; and \(\nu_0 \sigma_0^2\) is a sum of two sums of squares and a third term increasing the uncertainty due to the difference in the means. In the Gibbs sampler, each component (each cluster of galaxies) will have its own posterior based on the data points that are assigned to that component during each iteration of the process. Therefore, \(n\), \(\mu_x\), and \(ss_x\) will generally be different for different components and, moreover, will vary from iteration to iteration.
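The update rule is straightforward to express in code. In the following sketch, the function name `update` is illustrative, and returning the prior unchanged for an empty component is an assumption about how such components are handled.

```python
def update(mu0, kappa0, nu0, sigma0_sq, data):
    """Return the posterior parameters given the data of one component."""
    n = len(data)
    if n == 0:
        # Assumption: a component without data keeps the prior.
        return mu0, kappa0, nu0, sigma0_sq
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)
    # All right-hand sides use the prior values, so the update is simultaneous.
    mu_n = (kappa0 * mu0 + n * mean) / (kappa0 + n)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    sigma_n_sq = (
        nu0 * sigma0_sq + ss + kappa0 * n / (kappa0 + n) * (mean - mu0) ** 2
    ) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq


# Two made-up observations centered at the prior mean.
mu_n, kappa_n, nu_n, sigma_n_sq = update(20.0, 0.01, 3.0, 1.0, [19.0, 21.0])
print(mu_n, kappa_n, nu_n, sigma_n_sq)
```

In this example, the sample mean coincides with the prior mean, so the third term vanishes, and the updated scale is \((3 \times 1 + 2) / 5 = 1\).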
We set \(\mu_0\) to 20, which is roughly the mean of the data, and \(\nu_0\) to 3, which is the smallest integer that allows the scaled inverse chi-squared distribution to have a finite expectation. The choice of \(\kappa_0\) and \(\sigma_0\) is more subtle. Recall Equation (8). What we would like from the prior is to allow for free formation of clusters in a region generously covering the support of the data. To this end, the uncertainty in the mean, \(\mu_i\), has to be high; however, it should not come from \(\sigma_i\), since it would produce very diffuse clusters. We set \(\kappa_0\) to 0.01 to magnify the variance of \(\mu_i\) without affecting \(\sigma_i\), and \(\sigma_0\) to 1 to keep clusters compact.
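To get a feel for these choices, one can sample from the prior via the decomposition in Equation (8). The sketch below relies on the fact that a scaled inverse chi-squared variable equals \(\nu_0 \sigma_0^2\) divided by a chi-squared variable with \(\nu_0\) degrees of freedom, which in turn is a gamma variable with shape \(\nu_0 / 2\) and scale 2. The function name is illustrative.

```python
import random


def draw_component(mu0, kappa0, nu0, sigma0_sq, rng):
    """Draw the mean and standard deviation of one component from the prior."""
    # A chi-squared draw with nu0 degrees of freedom.
    chi_sq = rng.gammavariate(nu0 / 2.0, 2.0)
    # A scaled inverse chi-squared draw for the variance.
    variance = nu0 * sigma0_sq / chi_sq
    # A Gaussian draw for the mean given the variance.
    mean = rng.gauss(mu0, (variance / kappa0) ** 0.5)
    return mean, variance**0.5


rng = random.Random(0)
components = [draw_component(20.0, 0.01, 3.0, 1.0, rng) for _ in range(1000)]
print(components[:3])
```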
Now, let us take a look at what the above choices entail. The following figure illustrates the prior for the mean of a component:
The negative part is unrealistic for velocity; however, it is rarely a problem in practice. What is important is that there is a generous coverage of the plausible values. The following figure shows the prior for the standard deviation of a component:
The bulk is below the standard deviation of the data; however, this is by choice: we expect more than one cluster of galaxies with similar velocities.
As mentioned earlier, we intend to include \(\lambda\) in the inference. First, we put the following prior:
\[\lambda \sim \text{Gamma}(\alpha_0, \beta_0). \tag{9}\]Note this is the rate parameterization of the Gamma family. Conditionally, this is a conjugate prior with the following update rule for the two parameters:
\[\begin{align} \alpha_0 & := \alpha_0 + m - 1 \quad \text{and} \\ \beta_0 & := \beta_0 - \sum_{i = 1}^{m - 1} \ln(1 - q_i) \end{align}\]where \(\{ q_i \}\) come from the stick-breaking construction. This is a fourth step in the Gibbs sampler. We set \(\alpha_0\) and \(\beta_0\) to 2 and 0.1, respectively, which entails the following prior assumption about \(\lambda\):
The parameter is allowed to vary freely from small to large values, as desired.
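The corresponding step can be sketched as follows. The function names are illustrative, and the proportions passed in are assumed to be \(q_1, \dots, q_{m - 1}\), excluding the last one, which is fixed at one. Since `random.gammavariate` is parameterized by a scale, the rate has to be inverted.

```python
import math
import random


def concentration_parameters(alpha0, beta0, proportions):
    """Return the updated shape and rate of the gamma distribution."""
    alpha = alpha0 + len(proportions)
    beta = beta0 - sum(math.log1p(-q) for q in proportions)
    return alpha, beta


def draw_concentration(alpha0, beta0, proportions, rng):
    alpha, beta = concentration_parameters(alpha0, beta0, proportions)
    # The second argument is a scale, which is the reciprocal of the rate.
    return rng.gammavariate(alpha, 1.0 / beta)


rng = random.Random(0)
proportions = [0.5, 0.5, 0.5]  # made-up values of q for m = 4
alpha, beta = concentration_parameters(2.0, 0.1, proportions)
lam = draw_concentration(2.0, 0.1, proportions, rng)
print(alpha, beta, lam)
```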
Having chosen all priors and their hyperparameters, we are ready to investigate the behavior of the entire model; see Equations (6), (7), and (9). In what follows, we shall limit the number of mixture components to 25; that is, \(m = 25\). Furthermore, we shall perform 2000 Gibbs iterations and discard the first half as a warm-up period. As before, we start without conditioning on the data to observe draws from the prior itself. The following figure shows two sample draws:
It can be seen that clusters of galaxies can appear anywhere in the region of interest and can be of various sizes. We conclude that the prior is adequate. When taking the observed velocities into account, we obtain a full posterior distribution in the form of 1000 draws. The following shows two random draws:
Indeed, mixture components have started to appear in the regions where there are observations.
Before we proceed to the final summary of results, it is prudent to inspect sample chains for a few parameters in order to ensure there are no problems with convergence to the stationary distribution. The following shows the number of occupied components among the 25 permitted:
The chain fluctuates around a fixed level without any prominent pattern, as it should. One can plot the actual marginal posterior distribution for the number of components; however, it is already clear that the distribution of the number of clusters of galaxies is mostly between 5 and 10 with a median of 7.
As for the concentration parameter, \(\lambda\), the chain is as follows:
The behavior is uneventful, which is a good sign.
Let us now take a look at the posterior distributions of the first seven components highlighted earlier (note the different scales on the vertical axes):
The components clearly change roles, which can be seen by the multimodal nature of the distributions. Component 1 is most often at 10 (times \(10^6\) m/s); however, it also peaks between 24 and 25 and even above 30. Components 2 and 3 are the most certain ones, which is due to a relatively large number of samples present in the corresponding region. They seem to exchange roles and capture velocities of around 20 and 23. Components 4 and 5, on the other hand, appear to play the same role. Unlike Component 1, they are most likely to be found at around 33. Components 6 and 7 are similar too. They seem to be responsible for the small formation to the left and right next to the bulk in the middle (at 16); recall the histogram of the data. The small formation on the other side of the bulk at around 26 is captured as well, which is mostly done by Component 6.
Lastly, we summarize the inference using the following figure where the median distribution and a 95% uncertainty band—composed of distributions at the 0.025 and 0.975 quantiles—are plotted:
In this view, only five components are visible to the naked eye. The median curve matches well the findings in Roeder (1990). Judging by the width of the uncertainty band, there are many plausible alternatives, and it is important to communicate this uncertainty to those who base decisions on the inference. The ability to quantify uncertainty with such ease is a prominent advantage of Bayesian inference.
In this article, the family of Dirichlet processes has been presented in the context of Bayesian inference. More specifically, it has been shown how a Dirichlet process can be utilized as a prior for an unknown discrete distribution and as a prior for mixing distributions from a given family. In both cases, it has been illustrated how to perform inference via a finite approximation and the stick-breaking construction.
Clearly, the overall procedure is more complicated than counting observations falling in a number of fixed bins, which is what a histogram does, or placing kernels all over the place, which is what a kernel density estimator does. However, “anything in life worth having is worth working for.” The advantages of the Bayesian approach include the ability to incorporate prior knowledge, which is crucial in situations with little data, and the ability to propagate and quantify uncertainty, which is a must.
Recall that the source code of this notebook along with auxiliary scripts that were used for performing the calculations presented above can be found on GitHub. Any feedback is welcome!
I would like to thank Mattias Villani for the insightful and informative graduate course in Bayesian statistics titled “Advanced Bayesian learning,” which was the inspiration behind writing this article, and for his guidance regarding the implementation.
Consider the following example taken from Semiparametric Regression by Ruppert et al.:
The figure shows 221 observations collected in a light detection and ranging experiment. Each observation can be interpreted as the sum of the true underlying response at the corresponding distance and random noise. It can be clearly seen that the variance of the noise varies with the distance: the spread is substantially larger toward the right-hand side. This phenomenon is known as heteroscedasticity. Homoscedasticity (the absence of heteroscedasticity) is one of the key assumptions of linear regression. Applying linear regression to the above problem would yield suboptimal results. The estimates of the regression coefficients would still be unbiased; however, the standard errors of the coefficients would be incorrect and hence misleading. A different modeling technique is needed in this case.
The above data set will be our running example. More formally and slightly more generally, we assume that there is a data set of \(m\) observations:
\[\left\{ (\mathbf{x}_i, y_i): \, \mathbf{x}_i \in \mathbb{R}^d; \, y_i \in \mathbb{R}; \, i = 1, \dots, m \right\}\]where the independent variable, \(\mathbf{x}\), is \(d\)-dimensional, and the dependent variable, \(y\), is scalar. In the running example, \(d\) is 1, and \(m\) is 221. It is time for modeling.
To begin with, consider the following model with additive noise:
\[y_i = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m. \tag{1}\]In the above, \(f: \mathbb{R}^d \to \mathbb{R}\) represents the true but unknown underlying function, and \(\epsilon_i\) represents the perturbation of the \(i\)th observation by random noise. In the classical linear-regression setting, the unknown function is modeled as a linear combination of (arbitrary transformations of) the \(d\) covariates. Instead of assuming any particular functional form, we put a Gaussian process prior on the function:
\[f(\mathbf{x}) \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right).\]The above notation means that, before observing any data, the function is a draw from a Gaussian process with zero mean and a covariance function \(k\). The covariance function dictates the degree of correlation between two arbitrary locations \(\mathbf{x}\) and \(\mathbf{x}'\) in \(\mathbb{R}^d\). For instance, a frequent choice for \(k\) is the squared-exponential covariance function:
\[k(\mathbf{x}, \mathbf{x}') = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right)\]where \(\|\cdot\|_2\) stands for the Euclidean norm, \(\sigma_\text{process}^2\) is the variance (to see this, substitute \(\mathbf{x}\) for \(\mathbf{x}'\)), and \(\ell_\text{process}\) is known as the length scale. While the variance parameter is intuitive, the length-scale one requires an illustration. The parameter controls the speed with which the correlation fades with the distance. The following figure shows 10 random draws for \(\ell_\text{process} = 0.1\):
With \(\ell_\text{process} = 0.5\), the behavior changes to the following:
It can be seen that a function with a larger length scale (0.5) needs a greater distance to change to the same extent than a function with a smaller length scale (0.1).
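The effect of the length scale can be reproduced with a short NumPy sketch. The helper name `squared_exponential`, the grid of 50 locations, and the jitter of \(10^{-6}\) are assumptions made for illustration; they are not taken from the code behind the figures.

```python
import numpy as np


def squared_exponential(x_1, x_2, sigma=1.0, ell=1.0):
    """Evaluate the squared-exponential covariance between rows of x_1 and x_2."""
    # Squared Euclidean distances between all pairs of rows.
    distance = np.sum((x_1[:, None, :] - x_2[None, :, :]) ** 2, axis=-1)
    return sigma**2 * np.exp(-distance / (2.0 * ell**2))


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)[:, None]  # 50 locations with d = 1
draws = {}
for ell in (0.1, 0.5):
    K = squared_exponential(x, x, sigma=1.0, ell=ell)
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
    draws[ell] = L @ rng.standard_normal((len(x), 10))  # one draw per column
```

Each column of `draws[ell]` is one function evaluated on the grid; plotting the columns against `x` would give pictures in the spirit of the ones discussed above.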
Let us now return to Equation (1) and discuss the error terms, \(\epsilon_i\). In linear regression, they are modeled as independent identically distributed Gaussian random variables:
\[\epsilon_i \sim \text{Gaussian}\left( 0, \sigma_\text{noise}^2 \right), \text{ for } i = 1, \dots, m. \tag{2}\]This is also the approach one can take with Gaussian process regression; however, one does not have to. There are reasons to believe the problem at hand is heteroscedastic, and it should be reflected in the model. To this end, the magnitude of the noise is allowed to vary with the covariates:
\[\epsilon_i | \mathbf{x}_i \sim \text{Gaussian}\left(0, \sigma^2_{\text{noise}, i}\right), \text{ for } i = 1, \dots, m. \tag{3}\]The error terms are still independent (given the covariates) but not identically distributed. At this point, one has to make a choice about the dependence of \(\sigma_{\text{noise}, i}\) on \(\mathbf{x}_i\). This dependence could be modeled with another Gaussian process with an appropriate link function to ensure \(\sigma_{\text{noise}, i}\) is nonnegative. Another reasonable choice is a generalized linear model, which is what we shall use:
\[\ln \sigma^2_{\text{noise}, i} = \alpha_\text{noise} + \boldsymbol{\beta}^\intercal_\text{noise} \, \mathbf{x}_i, \text{ for } i = 1, \dots, m, \tag{4}\]where \(\alpha_\text{noise}\) is the intercept of the regression line, and \(\boldsymbol{\beta}_\text{noise} \in \mathbb{R}^d\) contains the slopes.
Thus far, a model for the unknown function \(f\) and a model for the noise have been prescribed. In total, there are \(d + 3\) parameters: \(\sigma_\text{process}\), \(\ell_\text{process}\), \(\alpha_\text{noise}\), and \(\beta_{\text{noise}, i}\) for \(i = 1, \dots, d\). The first two are positive, and the rest are arbitrary. The final piece is prior distributions for these parameters.
The variance of the covariance function, \(\sigma^2_\text{process}\), corresponds to the amount of variance in the data that is explained by the Gaussian process. It poses no particular problem and can be tackled with a half-Gaussian or a half-Student’s t distribution:
\[\sigma_\text{process} \sim \text{Half-Gaussian}\left( 0, 1 \right).\]The notation means that the standard Gaussian distribution is truncated at zero and renormalized. The nontrivial mass around zero implied by the prior is considered to be beneficial in this case.^{1}
A prior for the length scale of the covariance function, \(\ell_\text{process}\), should be chosen with care. Small values, especially those below the resolution of the data, give the Gaussian process extreme flexibility and easily lead to overfitting. Moreover, there are numerical ramifications of the length scale approaching zero: the quality of Hamiltonian Monte Carlo sampling degrades.^{2} The bottom line is that a prior penalizing values close to zero is needed. A reasonable choice is an inverse gamma distribution:
\[\ell_\text{process} \sim \text{Inverse Gamma}\left( 1, 1 \right).\]To understand the implications, let us perform a prior predictive check for this component in isolation:
It can be seen that the density is very low in the region close to zero, while being rather permissive to the right of that region, especially considering the scale of the distance in the data; recall the very first figure. Consequently, the choice is adequate.
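A check of this kind can be approximated with the standard library alone. The sketch below draws from the inverse gamma prior and measures how much mass falls below a threshold; the threshold of 0.1 is a hypothetical stand-in for "too short," not a value from the article.

```python
import math
import random

random.seed(0)
# If G ~ Gamma(shape = 1, rate = 1), then 1 / G ~ Inverse Gamma(1, 1).
draws = [1.0 / random.gammavariate(1.0, 1.0) for _ in range(100_000)]

threshold = 0.1  # hypothetical "too short" length scale
empirical = sum(draw < threshold for draw in draws) / len(draws)
# For shape 1 and scale 1, the cumulative distribution function is exp(-1 / x).
exact = math.exp(-1.0 / threshold)
median = sorted(draws)[len(draws) // 2]
```

Both `empirical` and `exact` are tiny, which is the penalization of values close to zero that the prior is supposed to provide, while the median stays near \(1 / \ln 2\).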
The choice of priors for the parameters of the noise is complicated by the nonlinear link function; see Equation (4). What is important to realize is that small amounts of noise correspond to negative values in the linear space, which is probably what one should be expecting given the scale of the response. Therefore, the priors should allow for large negative values. Let us make an educated assumption and perform a prior predictive check to understand the consequences. Consider the following:
\[\begin{align} \alpha_\text{noise} & \sim \text{Gaussian}\left( -1, 1 \right) \text{ and} \\ \beta_{\text{noise}, i} & \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align}\]The density of \(\sigma_\text{noise}\) without considering the regression slopes is depicted below (note the logarithmic scale on the horizontal axis):
The variability in the intercept, \(\alpha_\text{noise}\), allows the standard deviation, \(\sigma_\text{noise}\), to comfortably vary from small to large values, keeping in mind the scale of the response. Here are two draws from the prior distribution of the noise, including Equations (3) and (4):
The large ones are perhaps unrealistic and could be addressed by further shifting the distribution of the intercept. However, they should not cause problems for the inference.
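Draws of this kind can be sketched as follows for \(d = 1\). The function name `draw_noise_scale` and the grid of locations are made up for illustration, and the covariate is assumed to be on a standardized scale.

```python
import math
import random

random.seed(0)


def draw_noise_scale(x, rng=random):
    """Draw sigma_noise at the scalar locations in x from the prior (d = 1)."""
    alpha = rng.gauss(-1.0, 1.0)
    beta = rng.gauss(0.0, 1.0)
    # Equation (4) models the log-variance, so the scale is exp(. / 2).
    return [math.exp((alpha + beta * x_i) / 2.0) for x_i in x]


x = [i / 10.0 for i in range(11)]  # hypothetical standardized distances
sigma_noise = draw_noise_scale(x)
```

Because the linear model acts on the logarithm of the variance, every draw is strictly positive and changes by a constant factor between equally spaced locations.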
Putting everything together, the final model is as follows:
\[\begin{align} y_i & = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m; \\ f(\mathbf{x}) & \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right); \\ k(\mathbf{x}, \mathbf{x}') & = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right); \\ \epsilon_i | \mathbf{x}_i & \sim \text{Gaussian}\left( 0, \sigma^2_{\text{noise}, i} \right), \text{ for } i = 1, \dots, m; \\ \ln \sigma^2_{\text{noise}, i} & = \alpha_\text{noise} + \boldsymbol{\beta}_\text{noise}^\intercal \, \mathbf{x}_i, \text{ for } i = 1, \dots, m; \\ \sigma_\text{process} & \sim \text{Half-Gaussian}\left( 0, 1 \right); \\ \ell_\text{process} & \sim \text{Inverse Gamma}\left( 1, 1 \right); \\ \alpha_\text{noise} & \sim \text{Gaussian}\left( -1, 1 \right); \text{ and} \\ \beta_{\text{noise}, i} & \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align}\]This concludes the modeling part. The remaining two steps are to infer the parameters and to make predictions using the posterior predictive distribution.
The model is analytically intractable; one has to resort to sampling or variational methods for inferring the parameters. We shall use Hamiltonian Markov chain Monte Carlo sampling via Stan. The model can be seen in the following listing, where the notation closely follows the one used throughout the article:
data {
  int<lower = 1> d;
  int<lower = 1> m;
  vector[d] x[m];
  vector[m] y;
}
transformed data {
  vector[m] mu = rep_vector(0, m);
  matrix[m, d] X;
  for (i in 1:m) {
    X[i] = x[i]';
  }
}
parameters {
  real<lower = 0> sigma_process;
  real<lower = 0> ell_process;
  real alpha_noise;
  vector[d] beta_noise;
}
model {
  matrix[m, m] K = cov_exp_quad(x, sigma_process, ell_process);
  vector[m] sigma_noise_squared = exp(alpha_noise + X * beta_noise);
  matrix[m, m] L = cholesky_decompose(add_diag(K, sigma_noise_squared));
  y ~ multi_normal_cholesky(mu, L);
  sigma_process ~ normal(0, 1);
  ell_process ~ inv_gamma(1, 1);
  alpha_noise ~ normal(-1, 1);
  beta_noise ~ normal(0, 1);
}
In the parameters block, one can find the \(d + 3\) parameters identified earlier. With regard to the model block, it is worth noting that there is no Gaussian process distribution in Stan. Instead, a multivariate Gaussian distribution is utilized to model \(f\) at \(\mathbf{X} = (\mathbf{x}_i)_{i = 1}^m \in \mathbb{R}^{m \times d}\) and eventually \(\mathbf{y} = (y_i)_{i = 1}^m\), which is for a good reason. Even though a Gaussian process is an infinite-dimensional object, in practice, one always works with finite amounts of data. For instance, in the running example, there are only 221 data points. By definition, a Gaussian process is a stochastic process with the condition that any finite collection of points from this process has a multivariate Gaussian distribution. This fact combined with the conditional independence of the process and the noise given the covariates yields the following and explains the usage of a multivariate Gaussian distribution:
\[\mathbf{y} | \mathbf{X} \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \mathbf{K} + \mathbf{D} \right)\]where \(\mathbf{K} \in \mathbb{R}^{m \times m}\) is a covariance matrix computed by evaluating the covariance function \(k\) at all pairs of locations in the observed data, and \(\mathbf{D} = \text{diag}(\sigma^2_{\text{noise}, i})_{i = 1}^m \in \mathbb{R}^{m \times m}\) is a diagonal matrix of the variances of the noise at the corresponding locations.
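To make the role of the Cholesky factor in the model block more tangible, here is a NumPy sketch of the log-density of \(\mathbf{y}\) under a zero-mean multivariate Gaussian with covariance \(\mathbf{K} + \mathbf{D}\). It mirrors the Stan listing only in spirit; the function name `log_likelihood` is an assumption, and the priors are left out.

```python
import numpy as np


def log_likelihood(x, y, sigma_process, ell_process, alpha_noise, beta_noise):
    """Log-density of y under a zero-mean Gaussian with covariance K + D."""
    distance = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = sigma_process**2 * np.exp(-distance / (2.0 * ell_process**2))
    D = np.diag(np.exp(alpha_noise + x @ beta_noise))
    L = np.linalg.cholesky(K + D)
    # Solve L z = y so that z'z = y' (K + D)^{-1} y.
    z = np.linalg.solve(L, y)
    log_determinant = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (z @ z + log_determinant + len(y) * np.log(2.0 * np.pi))


# A single made-up observation: the covariance reduces to the scalar 1 + 1.
value = log_likelihood(
    np.array([[0.0]]), np.array([0.0]), 1.0, 1.0, 0.0, np.array([0.0])
)
```

The factorization is computed once and serves both the quadratic form and the determinant, which is the usual reason for working with the Cholesky factor.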
After running the inference, the following posterior distributions are obtained:
The intervals at the bottom of the densities are 66% and 95% equal-tailed probability intervals, and the dots indicate the medians. Let us also take a look at the 95% probability interval for the noise with respect to the distance:
As expected, the variance of the noise increases with the distance.
Suppose there are \(n\) locations \(\mathbf{X}_\text{new} = (\mathbf{x}_{\text{new}, i})_{i = 1}^n \in \mathbb{R}^{n \times d}\) where one wishes to make predictions. Let \(\mathbf{f}_\text{new} \in \mathbb{R}^n\) be the values of \(f\) at those locations. Assuming all the data and parameters given, the joint distribution of \(\mathbf{y}\) and \(\mathbf{f}_\text{new}\) is as follows:
\[\left[ \begin{matrix} \mathbf{y} \\ \mathbf{f}_\text{new} \end{matrix} \right] \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \left[ \begin{matrix} \mathbf{K} + \mathbf{D} & k(\mathbf{X}, \mathbf{X}_\text{new}) \\ k(\mathbf{X}_\text{new}, \mathbf{X}) & k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) \end{matrix} \right] \right)\]where, with a slight abuse of notation, \(k(\cdot, \cdot)\) stands for a covariance matrix computed by evaluating the covariance function \(k\) at the specified locations, which is analogous to \(\mathbf{K}\). It is well known (see Rasmussen et al. 2006, for instance) that the conditional distribution of \(\mathbf{f}_\text{new}\) given \(\mathbf{y}\) is a multivariate Gaussian with the following mean vector and covariance matrix, respectively:
\[\begin{align} E(\mathbf{f}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{f}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}). \end{align}\]The final component is the noise, as per Equation (1). The noise does not change the mean of the multivariate Gaussian distribution but does magnify the variance:
\[\begin{align} E(\mathbf{y}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{y}_\text{new}) & = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}) + \text{diag}(\sigma^2_\text{noise}(\mathbf{X}_\text{new})) \end{align}\]where \(\text{diag}(\sigma^2_\text{noise}(\cdot))\) stands for a diagonal matrix composed of the noise variance evaluated at the specified locations, which is analogous to \(\mathbf{D}\).
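For a single draw of the parameters, the mean vector and the covariance matrix of \(\mathbf{y}_\text{new}\) can be evaluated as in the following NumPy sketch. The function names, the synthetic data, and the parameter values are all made up for illustration; in practice, the parameters would come from the posterior draws.

```python
import numpy as np


def kernel(x_1, x_2, sigma_process, ell_process):
    """Squared-exponential covariance between rows of x_1 and x_2."""
    distance = np.sum((x_1[:, None, :] - x_2[None, :, :]) ** 2, axis=-1)
    return sigma_process**2 * np.exp(-distance / (2.0 * ell_process**2))


def noise_variance(x, alpha_noise, beta_noise):
    """Noise variance at the rows of x as per Equation (4)."""
    return np.exp(alpha_noise + x @ beta_noise)


def predict(x, y, x_new, sigma_process, ell_process, alpha_noise, beta_noise):
    """Mean and covariance of y_new for one draw of the parameters."""
    K = kernel(x, x, sigma_process, ell_process)
    D = np.diag(noise_variance(x, alpha_noise, beta_noise))
    K_new = kernel(x_new, x, sigma_process, ell_process)
    # Solve linear systems instead of forming the inverse explicitly.
    mean = K_new @ np.linalg.solve(K + D, y)
    covariance = (
        kernel(x_new, x_new, sigma_process, ell_process)
        - K_new @ np.linalg.solve(K + D, K_new.T)
        + np.diag(noise_variance(x_new, alpha_noise, beta_noise))
    )
    return mean, covariance


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)[:, None]  # made-up data with d = 1
y = np.sin(2.0 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(20)
x_new = np.linspace(0.0, 1.0, 5)[:, None]
# Made-up parameter values standing in for a single posterior draw.
mean, covariance = predict(x, y, x_new, 1.0, 0.3, -4.0, np.array([1.0]))
y_new = rng.multivariate_normal(mean, covariance)
```

Repeating the last two lines for every posterior draw of the parameters would produce a sample from the posterior predictive distribution.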
Given a set of draws from the joint posterior distribution of the parameters and the last two expressions, it is now straightforward to draw samples from the posterior predictive distribution of the response: for each draw of the parameters, one has to evaluate the mean vector and the covariance matrix and sample the corresponding multivariate Gaussian distribution. The result is given in the following figure:
The graph shows the mean value of the posterior predictive distribution given by the black line along with a 95% equal-tailed probability band about the mean. It can be seen that the uncertainty in the predictions is adequately captured along the entire support. Naturally, the full predictive posterior distribution is available at any location of interest.
Before we conclude, let us illustrate what would happen if the data were modeled as having homogeneous noise. To this end, the variance of the noise is assumed to be independent of the covariates, as in Equation (2). After repeating the inference and prediction processes, the following is obtained:
The inference is inadequate, which can be seen by the probability band: the variance is largely overestimated on the left-hand side and underestimated on the right-hand side. This justifies well the choice of heteroscedastic regression presented earlier.
In this article, it has been illustrated how a functional relationship can be modeled using a Gaussian process as a prior. Particular attention has been dedicated to adequately capturing error terms in the presence of heteroscedasticity. In addition, a practical implementation has been discussed, and the experimental results have demonstrated the appropriateness of this approach.
For the curious reader, the source code of this notebook along with a number of auxiliary scripts, such as the definition of the model in Stan, can be found on GitHub.
I would like to thank Mattias Villani for the insightful and informative graduate course in statistics titled “Advanced Bayesian learning,” which was the inspiration behind writing this article.
“Priors for marginal standard deviation,” Stan User’s Guide, 2020. ↩
“Priors for length-scale,” Stan User’s Guide, 2020. ↩
One solution is to start to iterate over the columns of the two tables, computing five-number summaries and plotting histograms or identifying distinct values and plotting bar charts, depending on the column’s type. However, this can quickly get out of hand and evolve into an endeavor for the rest of the day.
An alternative is to leverage the amazing tools that already exist in the data community.
The key takeaway is the following three lines of code, excluding the import:
import tensorflow_data_validation as dv
statistics_1 = dv.generate_statistics_from_dataframe(data_1)
statistics_2 = dv.generate_statistics_from_dataframe(data_2)
dv.visualize_statistics(lhs_statistics=statistics_1,
                        rhs_statistics=statistics_2)
This is all it takes to get a versatile dashboard embedded right into a cell of a Jupyter notebook. The visualization itself is based on Facets, and it is conveniently provided by TensorFlow Data Validation (which does not have much to do with TensorFlow and can be used stand-alone).
It is pointless to try to describe in words what the dashboard can do; instead, here is a demonstration taken from Facets where the tool is applied to the UCI Census Income data set:
Go ahead and give a try to all the different controls!
In this case, it is helpful to toggle the “percentages” checkbox, since the data sets are of different sizes. Then it becomes apparent that the two partitions are fairly balanced. The only problem is that Target, which represents income, happened to be encoded incorrectly in the partition for testing.
Lastly, an example in a Jupyter notebook can be found on GitHub.
It can be difficult to navigate and particularly challenging to compare wide data sets. A lot of effort can be put into this exercise. However, the landscape of open-source tools has a lot to offer too. Facets is one such example. The library and its straightforward availability via TensorFlow Data Validation are arguably less known. This short note can hopefully rectify this to some extent.
More specifically, the discussion here is a sequel to “A Bayesian approach to the inference of the net promoter score,” where we built a hierarchical model for inferring the net promoter score for an arbitrary segmentation of a customer base. The reader is encouraged to skim over that article to recall the mechanics of the score and the structure of the model that was constructed. In that article, there was an assumption made that the sample was representative of the population, which, as mentioned earlier, is often not the case. In what follows, we mitigate this problem using a technique called poststratification. The technique works by matching proportions observed in the sample with those observed in the population with respect to several dimensions, such as age, country, and gender. However, in order to be able to poststratify, the model has to have access to all these dimensions at once, which the model built earlier is not suited for. To enable this, we switch gears to multilevel multinomial regression.
Suppose the survey is to measure the net promoter score for a population that consists of \(N\) customers. The score is to be reported with respect to individual values of \(M\) grouping variables where variable \(i\) has \(m_i\) possible values, for \(i = 1, \dots, M\). For instance, it might be important to know the score for different age groups, in which case the variable would be the customer’s age with values such as 18–25, 26–35, and so on. This implies that, in total, \(\sum_i m_i\) scores have to be estimated.
Depending on the size of the business, one might or might not try to reach out to all customers, except for those who have opted out of communications. Regardless of the decision, the resulting sample size, which is denoted by \(n\), is likely to be substantially smaller than \(N\), as the response rate is typically low. Therefore, there is uncertainty about the opinion of those who abstained or were not targeted.
More importantly, a random sample is desired; however, certain subpopulations of customers might end up being significantly overrepresented due to participation bias, driving the score astray. Let us quantify this concern. We begin by taking the Cartesian product of the aforementioned \(M\) variables. This results in \(K = \prod_i m_i\) distinct combinations of the variables’ values, which are referred to as cells in what follows. For each cell, the numbers of detractors, neutrals, and promoters observed in the sample are computed and denoted by \(d_i\), \(u_i\), and \(p_i\), respectively. The number of respondents in cell \(i\) is then
\[n_i = d_i + u_i + p_i \tag{1}\]for \(i = 1, \dots, K\). For convenience, all counts are arranged in the following matrix:
\[y = \left( \begin{matrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_K \end{matrix} \right) = \left( \begin{matrix} d_1 & u_1 & p_1 \\ \vdots & \vdots & \vdots \\ d_i & u_i & p_i \\ \vdots & \vdots & \vdots \\ d_K & u_K & p_K \end{matrix} \right). \tag{2}\]Given \(y\), the observed net promoter score for value \(j\) of variable \(i\) can be evaluated as follows:
\[s^i_j = 100 \times \frac{\sum_{k \in I^i_j}(p_k - d_k)}{\sum_{k \in I^i_j} n_k} \tag{3}\]where \(I^i_j\) is an index set traversing cells with variable \(i\) set to value \(j\), which has the effect of marginalizing out other variables conditioned on the chosen value of variable \(i\), that is, on value \(j\).
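Equation (3) can be illustrated with a small Python sketch. The counts below are hypothetical and correspond to two binary grouping variables, that is, to four cells; the function name `observed_score` is an assumption as well.

```python
# Hypothetical counts for K = 4 cells formed by two binary variables.
# Each cell: (value of variable 1, value of variable 2, d, u, p).
cells = [
    (0, 0, 10, 20, 30),
    (0, 1, 5, 5, 10),
    (1, 0, 20, 10, 10),
    (1, 1, 8, 2, 10),
]


def observed_score(cells, variable, value):
    """Evaluate Equation (3) for one value of one grouping variable."""
    # The selection plays the role of the index set in Equation (3).
    selected = [cell for cell in cells if cell[variable] == value]
    difference = sum(p - d for _, _, d, u, p in selected)
    total = sum(d + u + p for _, _, d, u, p in selected)
    return 100.0 * difference / total


score = observed_score(cells, variable=0, value=0)  # 100 * 25 / 80 = 31.25
```

The other variable is marginalized out simply by summing over all cells that share the chosen value.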
We can now compare \(n_i\), computed according to Equation (1), with its counterpart in the population (the total number of customers who belong to cell \(i\)), which is denoted by \(N_i\), taking into consideration the sample size \(n\) and the population size \(N\). Problems occur when the ratios within one or more of the following tuples largely disagree:
\[\left(\frac{n_i}{n}, \frac{N_i}{N}\right) \tag{4}\]for \(i = 1, \dots, K\). When this happens, the scores given by Equation (3) or any analyses oblivious of this disagreement cannot be trusted, since they misrepresent the population. (It should be noted, however, that equality within each tuple does not guarantee the absence of participation bias, since there might be other, potentially unobserved, dimensions along which there are deviations.)
The survey has been conducted, and there are deviations. What do we do with all these responses that have come in? Should we discard them and run a new survey, hoping that, this time, it would be different?
The fact that the sample covers only a fraction of the population is, of course, no news, and the solution is standard: one has to infer the net promoter score for the population given the sample and domain knowledge. This is what was done in the previous article for one grouping variable. However, due to participation bias, additional measures are needed as follows.
Taking inspiration from political science, we proceed in two steps.
Using an adequate model, \(K = \prod_i m_i\) net promoter scores are inferred—one for each cell, that is, for each combination of the values of the grouping variables.
The \(\prod_i m_i\) “cell-scores” are combined to produce \(\sum_i m_i\) “value-scores”—one for each value of each variable. This is done in such a way that the contribution of each cell to the score is equal to the relative size of that cell in the population given by Equation (4).
The two steps are discussed in the following two subsections.
Step 1 can, in principle, be undertaken by any model of choice. A prominent candidate is multilevel multinomial regression, which is what we shall explore. Multilevel refers to having a hierarchical structure where parameters on a higher level give birth to parameters on a lower level, which, in particular, enables information exchange through a common ancestor. Multinomial refers to the distribution used for modeling the response variable. The family of multinomial distributions is appropriate, since we work with counts of events falling into one of several categories: detractors, neutrals, and promoters; see Equation (2). The response for each cell is then as follows:
\[y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i)\]where \(n_i\) is given by Equation (1), and
\[\theta_i = \left\langle\theta^d_i, \theta^u_i, \theta^p_i\right\rangle\]is a simplex (sums up to one) of probabilities of the three categories.
Multinomial regression belongs to the class of generalized linear models. This means that the inference takes place in a linear domain, and that \(\theta_i\) is obtained by applying a deterministic transformation to the corresponding linear model or models; the inverse of this transformation is known as the link function. In the case of multinomial regression, the aforementioned transformation is the softmax function, which is a generalization of the logistic function allowing more than two categories:
\[\theta_i = \text{Softmax}\left(\mu_i\right)\]where
\[\mu_i = \left(0, \mu^u_i, \mu^p_i\right)\]is the average log-odds of the three categories with respect to a reference category, which, by convention, is taken to be the first one, that is, detractors. The first entry is zero, since \(\ln(1) = 0\). Therefore, there are only two linear models: one is for neutrals (\(\mu^u_i\)), and one is for promoters (\(\mu^p_i\)).
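The transformation from log-odds to probabilities can be sketched as follows. The log-odds are hypothetical and chosen so that neutrals and promoters are twice and three times as likely as detractors, respectively.

```python
import math


def softmax(mu):
    """Map log-odds to a simplex of probabilities."""
    # Subtracting the maximum avoids overflow without changing the result.
    shift = max(mu)
    weights = [math.exp(value - shift) for value in mu]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical log-odds of neutrals and promoters relative to detractors.
theta = softmax([0.0, math.log(2.0), math.log(3.0)])
```

The resulting probabilities are one sixth, one third, and one half, so the ratios with respect to the reference category are preserved.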
Now, there are many alternatives when it comes to the two linear parts. In this article, we use the following architecture. Both the model for neutrals and the one for promoters have the same structure, and for brevity, only the former is described. For the log-odds of neutrals, the model is
\[\mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}\]where
\[\delta^{uj} = \left(\delta^{uj}_1, \dots, \delta^{uj}_{m_j}\right)\]is a vector of deviations from intercept \(b^u\) specific to grouping variable \(j\) (one entry for each value of the variable), and \(I_j[i]\) yields the index of the value that cell \(i\) has, for \(i = 1, \dots, K\) and \(j = 1, \dots, M\).
Let us now turn to the multilevel aspect. For each grouping variable, the corresponding values, represented by the elements of \(\delta^{uj}\), are allowed to be different but assumed to have something in common and thus originate from a common distribution. To this end, they are assigned distributions with a shared parameter as follows:
\[\delta^{uj}_i | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right)\]for \(i = 1, \dots, m_j\). The mean is zero, since \(\delta^{uj}_i\) represents a deviation.
Lastly, we have to decide on prior distributions of the intercept, \(b^u\), and the standard deviations, \(\sigma^{uj}\) for \(j = 1, \dots, M\). The intercept is given the following prior:
\[b^u \sim \text{Student’s t}(5, 0, 1).\]The mean is zero in order to center at even odds. Regarding the standard deviations, they are given the following prior:
\[\sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1).\]In order to understand the implications of these prior choices, let us take a look at the prior distribution assuming two grouping variables:
The left and right dashed lines demarcate tail regions that, for practical purposes, can be thought of as “never” and “always,” respectively. For instance, log-odds of five or higher are so extreme that detractors are rendered nearly non-existent when compared to neutrals. These regions are arguably unrealistic. The prior does not exclude these possibilities; however, it does not favor them either. The vast majority of the probability mass is still in the middle around zero.
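A prior check along these lines can be approximated with the standard library. The sketch below simulates the log-odds of a single cell under the stated priors with two grouping variables; the helper names and the number of draws are assumptions, and the threshold of five follows the example in the text.

```python
import math
import random

random.seed(0)


def student_t(nu):
    """Draw from a standard Student's t distribution with nu degrees of freedom."""
    chi_squared = random.gammavariate(nu / 2.0, 2.0)
    return random.gauss(0.0, 1.0) / math.sqrt(chi_squared / nu)


def draw_log_odds(variables=2):
    """Draw the log-odds of one cell from the prior."""
    mu = student_t(5)  # intercept
    for _ in range(variables):
        sigma = abs(student_t(5))  # half-Student's t via the absolute value
        mu += random.gauss(0.0, sigma)  # deviation for this variable
    return mu


draws = [draw_log_odds() for _ in range(20_000)]
extreme = sum(abs(draw) > 5.0 for draw in draws) / len(draws)
median = sorted(draws)[len(draws) // 2]
```

Here `extreme` estimates the prior mass in the two tail regions combined, and `median` should be close to zero, which corresponds to even odds.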
The overall model is then as follows:
\[\begin{align} & y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i), \text{ for } i = 1, \dots, K; \\ & \theta_i = \text{Softmax}\left(\mu_i\right), \text{ for } i = 1, \dots, K; \\ & \mu_i = (0, \mu^u_i, \mu^p_i), \text{ for } i = 1, \dots, K; \\ & \mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ & \mu^p_i = b^p + \sum_{j = 1}^M \delta^{pj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ & b^u \sim \text{Student’s t}(5, 0, 1); \\ & b^p \sim \text{Student’s t}(5, 0, 1); \\ & \delta^{uj}_k | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5a} \\ & \delta^{pj}_k | \sigma^{pj} \sim \text{Gaussian}\left(0, \sigma^{pj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5b} \\ & \sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M; \text{ and} \\ & \sigma^{pj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M. \end{align}\]The model has \(2 \times (1 + \sum_i m_i + M)\) parameters in total. The structure that can be seen in Equations (5a) and (5b) is what makes the model multilevel. This is an important feature, since it allows for information sharing between the individual values of the grouping variables. In particular, this has a regularizing effect on the estimates, which is also known as shrinkage resulting from partial pooling.
Having defined the model, the posterior distribution can now be obtained by means of Markov chain Monte Carlo sampling. This procedure is standard and can be performed using, for instance, Stan or a higher-level package, such as brms, which is what is exemplified in the Implementation section. The result is a collection of draws of the parameters from the posterior distribution. For each draw of the parameters, a draw of the net promoter score can be computed using the following formula:
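\[s_i = 100 \times (\theta^p_i - \theta^d_i)\]for \(i = 1, \dots, K\), where \(\theta^d_i\) and \(\theta^p_i\) are the probabilities of detractors and promoters in \(\theta_i\), respectively. This means that we have obtained a (joint) posterior distribution of the net promoter score over the \(K\) cells. It is now time to combine the scores for the cells on the level of the values of the \(M\) grouping variables, which results in \(\sum_i m_i\) scores in total.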
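To make the step from a posterior draw of the linear predictors to a draw of the score concrete, below is a minimal sketch in plain Python. It assumes that the detractor category is the reference one, with its log-odds fixed at zero, as in the model above; the function names and the numbers in `draws` are hypothetical and serve only as an illustration.

```python
import math


def softmax(values):
    """Map a vector of log-odds to probabilities that sum to one."""
    peak = max(values)  # Subtract the maximum for numerical stability
    exponents = [math.exp(value - peak) for value in values]
    total = sum(exponents)
    return [exponent / total for exponent in exponents]


def net_promoter_score(mu_neutral, mu_promoter):
    """Compute the score of one cell for one posterior draw."""
    # The first entry is the reference category (detractors)
    detractor, _, promoter = softmax([0.0, mu_neutral, mu_promoter])
    return 100 * (promoter - detractor)


# Hypothetical draws of (mu_neutral, mu_promoter) for a single cell
draws = [(0.1, 0.4), (0.0, 0.0), (-0.2, 0.9)]
scores = [net_promoter_score(*draw) for draw in draws]
```

When all three log-odds are equal, as in the second draw, the three categories are equally likely, and the score is zero.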
Step 2 is poststratification, whose purpose is to correct for potential deviations of the sample from the population; recall the discussion around Equation (4). The foundation laid in the previous subsection makes the work here straightforward. The idea is as follows. Each draw from the posterior distribution consists of \(K\) values for the net promoter score, one for each cell. All one has to do in order to correct for a mismatch in proportions is to take a weighted average of these scores where the weights are the counts observed in the population:
\[s^i_j = \frac{\sum_{k \in I^i_j} N_k \, s_k}{\sum_{k \in I^i_j} N_k}\]where \(I^i_j\) is as in Equation (3), for \(i = 1, \dots, M\) and \(j = 1, \dots, m_i\). The above gives a poststratified draw from the posterior distribution of the net promoter score for variable \(i\) and value \(j\). In practice, depending on the tool used, one might perform the poststratification procedure differently, such as predicting counts of detractors, neutrals, and promoters in the cells given their in-population sizes and then aggregating those counts and following the definition of the net promoter score.
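The weighted average above can be sketched in a few lines of plain Python. The cell labels, population counts, and scores below are made up for illustration, and `poststratify` is a hypothetical helper rather than a function of any package; in practice, the calculation is repeated for every posterior draw.

```python
def poststratify(scores, sizes, cells):
    """Average the scores of the given cells weighted by population counts."""
    total = sum(sizes[cell] for cell in cells)
    return sum(sizes[cell] * scores[cell] for cell in cells) / total


# Hypothetical population counts for cells given by (age, seniority)
sizes = {('18-25', '1Y'): 300, ('18-25', '2Y'): 100, ('26-35', '1Y'): 600}
# One hypothetical posterior draw of the score in each cell
draw = {('18-25', '1Y'): 20.0, ('18-25', '2Y'): -20.0, ('26-35', '1Y'): 40.0}

# Select the cells that belong to the value 18-25 of the variable age
cells = [cell for cell in sizes if cell[0] == '18-25']
score = poststratify(draw, sizes, cells)
```

Here the larger cell dominates: the score for the age group is (300 × 20 − 100 × 20) / 400 = 10, regardless of how many respondents each cell happened to have in the sample.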
In what follows, we consider a contrived example with the sole purpose of illustrating how the presented workflow can be implemented in practice. To this end, we generate some data with two grouping variables, age and seniority, and then perform inference using brms, which leverages Stan under the hood. For a convenient manipulation of posterior draws, tidybayes is used as well.
library(brms)
library(tidybayes)
library(tidyverse)

set.seed(42)
options(mc.cores = parallel::detectCores())

# Load data
data <- load_data()
# => list(
# =>   population = tibble(age, seniority, cell_size),
# =>   sample = tibble(age, seniority, cell_size,
# =>                   cell_counts = (detractors, neutrals, promoters))
# => )

# Modeling
priors <- c(
  prior('student_t(5, 0, 1)', class = 'Intercept', dpar = 'muneutral'),
  prior('student_t(5, 0, 1)', class = 'Intercept', dpar = 'mupromoter'),
  prior('student_t(5, 0, 1)', class = 'sd', dpar = 'muneutral'),
  prior('student_t(5, 0, 1)', class = 'sd', dpar = 'mupromoter')
)
formula <- brmsformula(
  cell_counts | trials(cell_size) ~ (1 | age) + (1 | seniority))
model <- brm(formula, data$sample, multinomial(), priors,
             control = list(adapt_delta = 0.99), seed = 42)

# Poststratification
prediction <- data$population %>%
  add_predicted_draws(model) %>%
  spread(.category, .prediction) %>%
  group_by(age, .draw) %>%
  summarize(score = 100 * sum(promoter - detractor) / sum(cell_size)) %>%
  mean_hdi()
The final aggregation is given for age; it is similar for seniority. It can be seen in the above listing that modern tools allow for rather complex ideas to be expressed and explored in a very laconic way.
The curious reader is encouraged to run the above code. The appendix contains a function for generating synthetic data. It should be noted, however, that brms and tidybayes should be of versions greater than 2.11.1 and 2.0.1, respectively, which, at the time of writing, are available for installation only on GitHub. The appendix contains instructions for updating the packages.
In this article, we have discussed a multilevel multinomial model for inferring the net promoter score with respect to several grouping variables in accordance with the business needs. It has been argued that poststratification is an essential stage of the inference process, since it mitigates the deleterious consequences of participation bias on the subsequent decision-making.
There are still some aspects that could be improved. For instance, there is a natural ordering to the three categories of customers, detractors, neutrals, and promoters; however, it is currently ignored. Furthermore, there is some information thrown away when customer-level scores, which range from zero to ten, are aggregated on the category level. Lastly, the net promoter survey often happens in periodic waves, which calls for a single model capturing and learning from changes over time.
I would like to thank Andrew Gelman for the guidance on multilevel modeling and Paul-Christian Bürkner for the help with understanding the brms package.
The following listing defines a function that makes the illustrative example given in the Implementation section self-sufficient. By default, the population contains one million customers, and the sample contains one percent. There are two grouping variables: age with six values and seniority with seven values.
load_data <- function(N = 1000000, n = 10000) {
softmax <- function(x) exp(x) / sum(exp(x))
# Age
age_values <- c('18–25', '26–35', '36–45', '46–55', '56–65', '66+')
age_probabilities <- softmax(c(2, 3, 3, 2, 2, 1))
# Seniority
seniority_values <- c('6M', '1Y', '2Y', '3Y', '4Y', '5Y', '6Y+')
seniority_probabilities <- softmax(c(3, 2, 2, 2, 1, 1, 1))
# Score
score_values <- seq(0, 10)
score_probabilities <- softmax(c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4))
# Generate a population
population <- tibble(age = sample(age_values, N,
prob = age_probabilities,
replace = TRUE),
seniority = sample(seniority_values, N,
prob = seniority_probabilities,
replace = TRUE))
# Take a sample from the population
sample <- population %>%
sample_n(n) %>%
mutate(score = sample(score_values, n,
prob = score_probabilities,
replace = TRUE)) %>%
mutate(category = case_when(score < 7 ~ 'detractor',
score > 8 ~ 'promoter',
TRUE ~ 'neutral'))
# Summarize the population
population <- population %>%
group_by(age, seniority) %>%
count(name = 'cell_size')
# Summarize the sample
sample <- sample %>%
group_by(age, seniority) %>%
summarize(detractors = sum(category == 'detractor'),
neutrals = sum(category == 'neutral'),
promoters = sum(category == 'promoter')) %>%
mutate(cell_size = detractors + neutrals + promoters)
# Bind counts of neutrals, detractors, and promoters (needed for brms)
sample$cell_counts <- with(sample, cbind(detractors, neutrals, promoters))
colnames(sample$cell_counts) <- c('detractor', 'neutral', 'promoter')
# Remove unused columns
sample <- sample %>% select(-detractors, -neutrals, -promoters)
list(population = population, sample = sample)
}
Lastly, the following snippet shows how to update brms and tidybayes from GitHub:
if (packageVersion('brms') < '2.11.2') {
remotes::install_github('paul-buerkner/brms', upgrade = 'never')
}
if (packageVersion('tidybayes') < '2.0.1.9000') {
remotes::install_github('mjskay/tidybayes', upgrade = 'never')
}
To make the discussion tangible, consider the following problem. Suppose the goal is to predict the peak temperature at an arbitrary weather station present in the Global Historical Climatology Network for each day between June 1 and August 31. More concretely, given observations from June 1 up to an arbitrary day before August 31, the objective is to complete the sequence until August 31. For instance, if we find ourselves in Stockholm on June 12, we ask for the maximum temperatures from June 12 to August 31 given the temperature values from June 1 to June 11 at a weather station in Stockholm.
To set the expectations right, in this article, we are not going to build a predictive model but to cater for its development by making the data from the aforementioned database readily available in a TensorFlow graph. The final chain of states and operations is as follows:
Historical temperature measurements from the Global Historical Climatology Network are stored in a public data set in BigQuery. Each row corresponds to a weather station and a date. There are missing observations due to such reasons as measurements not passing quality checks.
Relevant measurements are grouped in BigQuery by the weather station and year. Therefore, each row corresponds to a weather station and a year, implying that all information about a particular example (a specific weather station on a specific year) is gathered in one place.
The sequences are read, analyzed, and transformed by Cloud Dataflow.
The data are split into a training, a validation, and a testing set of examples.
The training set is used to compute statistics needed for transforming the measurements to a form suitable for the subsequent modeling. Standardization is used as an example.
The training and validation sets are transformed using the statistics computed with respect to the training set in order to avoid performing these computations during the training-with-validation phase. The corresponding transform is available for the testing phase.
The processed training and validation examples and the raw testing examples are written by Dataflow to Cloud Storage in the TFRecord format, which is a format native to TensorFlow.
The files containing TFRecords are read by the tf.data API of TensorFlow and eventually transformed into a data set of appropriately padded batches of examples.
The above workflow is not as simple as reading data from a Pandas DataFrame comfortably resting in main memory; however, it is much more scalable. This pipeline can handle arbitrary amounts of data. Moreover, it operates on complete examples, not on individual measurements.
In the rest of the article, the aforementioned steps will be described in more detail. The corresponding source code can be found in the following repository on GitHub:
It all starts with data. The data come from the Global Historical Climatology Network, which is available in BigQuery for public use. Steps 1 and 2 in the list above are covered by the following query:
WITH
-- Select relevant measurements
data_1 AS (
SELECT
id,
date,
-- Find the date of the previous observation
LAG(date) OVER (station_year) AS date_last,
latitude,
longitude,
-- Convert to degrees Celsius
value / 10 AS temperature
FROM
`bigquery-public-data.ghcn_d.ghcnd_201*`
INNER JOIN
`bigquery-public-data.ghcn_d.ghcnd_stations` USING (id)
WHERE
-- Take years from 2010 to 2019
CAST(_TABLE_SUFFIX AS INT64) BETWEEN 0 AND 9
-- Take months from June to August
AND EXTRACT(MONTH FROM date) BETWEEN 6 AND 8
-- Take the maximum temperature
AND element = 'TMAX'
-- Take observations that passed spatio-temporal quality-control checks
AND qflag IS NULL
WINDOW
station_year AS (
PARTITION BY id, EXTRACT(YEAR FROM date)
ORDER BY date
)
),
-- Group into examples (a specific station and a specific year)
data_2 AS (
SELECT
id,
MIN(date) AS date,
latitude,
longitude,
-- Compute gaps between observations
ARRAY_AGG(
DATE_DIFF(date, IFNULL(date_last, date), DAY)
ORDER BY date
) AS duration,
ARRAY_AGG(temperature ORDER BY date) AS temperature
FROM
data_1
GROUP BY
id, latitude, longitude, EXTRACT(YEAR FROM date)
)
-- Partition into training, validation, and testing sets
SELECT
*,
CASE
WHEN EXTRACT(YEAR FROM date) < 2019 THEN 'analysis,training'
WHEN MOD(ABS(FARM_FINGERPRINT(id)), 100) < 50 THEN 'validation'
ELSE 'testing'
END AS mode
FROM
data_2
The query fetches peak temperatures, denoted by temperature, for all available weather stations between June and August in 2010–2019. The crucial part is the usage of ARRAY_AGG, which is what makes it possible to gather all relevant data about a specific station and a specific year in the same row. The number of days since the previous measurement, which is denoted by duration, is also computed. Ideally, duration should always be one (except for the first day, which has no predecessor); however, this is not the case, which makes the resulting time series vary in length.
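To illustrate what the duration column contains, here is a small sketch in plain Python that mirrors the combination of LAG, IFNULL, and DATE_DIFF in the query for a single station and year. The function name and the dates are hypothetical.

```python
from datetime import date


def compute_durations(dates):
    """Return the number of days since the previous observation.

    The first observation has no predecessor and is compared with itself,
    mirroring IFNULL(date_last, date) in the query, which yields zero.
    """
    dates = sorted(dates)
    if not dates:
        return []
    previous = [dates[0]] + dates[:-1]
    return [(current - last).days for last, current in zip(previous, dates)]


# Hypothetical observations where June 3 and June 4 are missing
dates = [date(2019, 6, 1), date(2019, 6, 2), date(2019, 6, 5)]
durations = compute_durations(dates)
```

In this example, the gap of three days at the end signals that two observations are missing, and the resulting sequence has three elements instead of five.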
In addition, in order to illustrate the generality of this approach, two contextual (that is, non-sequential) explanatory variables are added: latitude and longitude. They are scalars stored side by side with duration and temperature, which are arrays.
Another important moment is the final SELECT statement, which defines a column called mode. This column indicates what each example is used for, allowing one to use the same query for different purposes and to avoid inconsistencies due to multiple queries. In this case, observations prior to 2019 are reserved for training, while the rest is split pseudo-randomly and reproducibly into two approximately equal parts: one is for validation, and one is for testing. This last operation is explained in detail in “Repeatable sampling of data sets in BigQuery for machine learning” by Lak Lakshmanan.
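The idea behind the reproducible split can be sketched outside BigQuery as well. The snippet below is only an illustration in plain Python: it uses SHA-256 from the standard library as a stand-in for FARM_FINGERPRINT, so the actual assignments differ from those produced by the query, and the station identifier is hypothetical.

```python
import hashlib


def assign_mode(identifier, year):
    """Assign an example to a set based on its year and station identifier."""
    if year < 2019:
        return 'analysis,training'
    # Hash the identifier and map it to one of 100 buckets; the same
    # identifier always ends up in the same bucket
    digest = hashlib.sha256(identifier.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % 100
    return 'validation' if bucket < 50 else 'testing'


# Hypothetical station identifier
mode_2018 = assign_mode('SW000000001', 2018)
mode_2019 = assign_mode('SW000000001', 2019)
```

Since the assignment depends only on the identifier and not on a random number generator, rerunning the procedure yields the same partition, which is what makes the split repeatable.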
In this section, we cover Steps 4 and 5 in the list given at the beginning. This job is done by TensorFlow Extended, which is a library for building machine-learning pipelines. Internally, it relies on Apache Beam as a language for defining pipelines. Once an adequate pipeline is created, it can be executed using an executor, and the executor that we shall use is Cloud Dataflow.
Before we proceed to the pipeline itself, it should be noted that the construction process is orchestrated by a configuration file, which will be referred to as config in the pipeline code (to be discussed shortly):
{
"data": {
"path": "configs/training/data.sql",
"schema": [
{ "name": "latitude", "kind": "float32", "transform": "z" },
{ "name": "longitude", "kind": "float32", "transform": "z" },
{ "name": "duration", "kind": ["float32"], "transform": "z" },
{ "name": "temperature", "kind": ["float32"], "transform": "z" }
]
},
"modes": [
{ "name": "analysis" },
{ "name": "training", "transform": "analysis", "shuffle": true },
{ "name": "validation", "transform": "analysis" },
{ "name": "testing", "transform": "identity" }
]
}
It is worth noting that this way of working with a separate configuration file is not something standard that comes with TensorFlow or Beam. It is a convenience that we build for ourselves in order to keep the main logic reusable and extendable without touching the code.
The data block describes where the data can be found and provides a schema for the columns that are used. (Recall the SQL query given earlier and note that id, date, and mode are omitted.) For instance, latitude is a scalar of type FLOAT32, while temperature is a sequence of type FLOAT32. Both are standardized to have a zero mean and a unit standard deviation, which is indicated by "transform": "z" and is typically needed for training neural networks.
The modes block defines four passes over the data, corresponding to four operating modes. In each mode, a specific subset of examples is considered, which is given by the mode column returned by the query. There are two types of modes: analysis and transform; recall Step 3. Whenever the transform key is present, it is a transform mode; otherwise, it is an analysis mode. In this example, there is one analysis mode and three transform modes.
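The interplay between the analysis and transform modes can be illustrated without Beam. The following sketch in plain Python walks over the same list of modes, computes the statistics in the analysis pass, and applies them in the transform passes. The temperatures are made up, and the use of the population standard deviation is an assumption made for the sake of the example; it is not meant to reproduce what TensorFlow Extended does internally.

```python
import statistics

modes = [
    {'name': 'analysis'},
    {'name': 'training', 'transform': 'analysis', 'shuffle': True},
    {'name': 'validation', 'transform': 'analysis'},
    {'name': 'testing', 'transform': 'identity'},
]

# Hypothetical temperatures; analysis and training share their examples
examples = {
    'analysis': [18.0, 21.0, 24.0, 27.0, 30.0],
    'training': [18.0, 21.0, 24.0, 27.0, 30.0],
    'validation': [20.0, 26.0],
    'testing': [22.0, 28.0],
}

# Analysis modes: compute the statistics needed for standardization
moments = {}
for mode in modes:
    if 'transform' in mode:
        continue
    values = examples[mode['name']]
    moments[mode['name']] = (statistics.mean(values), statistics.pstdev(values))

# Transform modes: apply the statistics or leave the examples intact
transformed = {}
for mode in modes:
    if 'transform' not in mode:
        continue
    values = examples[mode['name']]
    if mode['transform'] == 'identity':
        transformed[mode['name']] = list(values)
    else:
        mean, deviation = moments[mode['transform']]
        transformed[mode['name']] = [
            (value - mean) / deviation for value in values
        ]
```

Note that the validation examples are standardized with the statistics of the training examples, and the testing examples are left untouched, which corresponds to the identity transform in the configuration file.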
Below is an excerpt from a Python class responsible for building the pipeline:
# config = ...
# schema = ...

# Read the SQL code
query = open(config['data']['path']).read()
# Create a BigQuery source
source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
# Create metadata needed later
spec = schema.to_feature_spec()
meta = dataset_metadata.DatasetMetadata(
    schema=dataset_schema.from_feature_spec(spec))
# Read data from BigQuery
data = pipeline \
    | 'read' >> beam.io.Read(source)

# Loop over modes whose purpose is analysis
transform_functions = {}
for mode in config['modes']:
    if 'transform' in mode:
        continue
    name = mode['name']
    # Select examples that belong to the current mode
    data_ = data \
        | name + '-filter' >> beam.Filter(partial(_filter, mode))
    # Analyze the examples
    transform_functions[name] = (data_, meta) \
        | name + '-analyze' >> tt_beam.AnalyzeDataset(_analyze)
    path = _locate(config, name, 'transform')
    # Store the transform function
    transform_functions[name] \
        | name + '-write-transform' >> transform_fn_io.WriteTransformFn(path)

# Loop over modes whose purpose is transformation
for mode in config['modes']:
    if 'transform' not in mode:
        continue
    name = mode['name']
    # Select examples that belong to the current mode
    data_ = data \
        | name + '-filter' >> beam.Filter(partial(_filter, mode))
    # Shuffle examples if needed
    if mode.get('shuffle', False):
        data_ = data_ \
            | name + '-shuffle' >> beam.transforms.Reshuffle()
    # Transform the examples using an appropriate transform function
    if mode['transform'] == 'identity':
        coder = tft.coders.ExampleProtoCoder(meta.schema)
    else:
        data_, meta_ = ((data_, meta), transform_functions[mode['transform']]) \
            | name + '-transform' >> tt_beam.TransformDataset()
        coder = tft.coders.ExampleProtoCoder(meta_.schema)
    path = _locate(config, name, 'examples', 'part')
    # Store the transformed examples as TFRecords
    data_ \
        | name + '-encode' >> beam.Map(coder.encode) \
        | name + '-write-examples' >> beam.io.tfrecordio.WriteToTFRecord(path)
At the very beginning, a BigQuery source is created, which is then branched out according to the operating modes found in the configuration file. Specifically, the first for-loop corresponds to the analysis modes, and the second for-loop goes over the transform modes. The former ends with WriteTransformFn, which saves the resulting transform, and the latter ends with WriteToTFRecord, which writes the resulting examples as TFRecords.
The distinction between the contextual and sequential features is given by the schema object created based on the schema block in the configuration file. The call schema.to_feature_spec() shown above alternates between tf.io.FixedLenFeature and tf.io.VarLenFeature and produces a feature specification that is understood by TensorFlow and TensorFlow Extended.
The repository provides a wrapper for executing the pipeline on Cloud Dataflow. The following figure shows the flow of the data with respect to the four operating modes:
The outcome is a hierarchy of files on Cloud Storage:
.
└── data/
└── training/
└── 2019-11-01-12-00-00/
├── analysis/
│ └── transform/
│ ├── transform_fn/...
│ └── transform_metadata/...
├── testing/
│ └── examples/
│ ├── part-000000-of-00004
│ ├── ...
│ └── part-000003-of-00004
├── training/
│ └── examples/
│ ├── part-000000-of-00006
│ ├── ...
│ └── part-000005-of-00006
└── validation/
└── examples/
├── part-000000-of-00004
├── ...
└── part-000003-of-00004
Here, data/training contains all data needed for the training phase, which collectively refers to training entwined with validation and followed by testing. Moving forward, this hierarchy is meant to accommodate the application phase as well by populating a data/application entry next to the data/training one. It can also accommodate trained models and the results of applying these models by having a model entry with a structure similar to the one of the data entry.
In the listing above, the files whose name starts with part- are the ones containing TFRecords. It can be seen that, for each mode, the corresponding examples have been split into multiple files, which is done for more efficient access during the usage stage discussed in the next section.
At this point, the data have made it all the way to the execution phase, referring to training, validation, and testing; however, the data are yet to be injected into a TensorFlow graph, which is the topic of this section. As before, relevant parameters are kept in a separate configuration file:
{
"data": {
"schema": [
{ "name": "latitude", "kind": "float32" },
{ "name": "longitude", "kind": "float32" },
{ "name": "duration", "kind": ["float32"] },
{ "name": "temperature", "kind": ["float32"] }
],
"modes": {
"training": {
"spec": "transformed",
"shuffle_macro": { "buffer_size": 100 },
"interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
"shuffle_micro": { "buffer_size": 512 },
"map": { "num_parallel_calls": -1 },
"batch": { "batch_size": 128 },
"prefetch": { "buffer_size": 1 },
"repeat": {}
},
"validation": {
"spec": "transformed",
"shuffle_macro": { "buffer_size": 100 },
"interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
"map": { "num_parallel_calls": -1 },
"batch": { "batch_size": 128 },
"prefetch": { "buffer_size": 1 },
"repeat": {}
},
"testing": {
"spec": "original",
"interleave": { "cycle_length": 100, "num_parallel_calls": -1 },
"map": { "num_parallel_calls": -1 },
"batch": { "batch_size": 128 },
"prefetch": { "buffer_size": 1 }
}
}
}
}
It can be seen that the file contains only one block: data. This is sufficient for the purposes of this article; however, it is also meant to cover the construction of the model in mind, including its hyperparameters, and the execution process, including the optimizer and evaluation metrics.
The data block is similar to the one we saw before. In this case, modes describes various calls to the tf.data API related to shuffling, batching, and so on. Those who are familiar with the API will probably immediately recognize them. It is now instructive to go straight to the Python code.
Below is an excerpt from a Python class responsible for building the pipeline on the TensorFlow side:
# config = ...

# List all files matching a given pattern
pattern = [self.path, name, 'examples', 'part-*']
dataset = tf.data.Dataset.list_files(os.path.join(*pattern))
# Shuffle the files if needed
if 'shuffle_macro' in config:
    dataset = dataset.shuffle(**config['shuffle_macro'])
# Convert the files into datasets of examples stored as TFRecords and
# amalgamate these datasets into one dataset of examples
dataset = dataset \
    .interleave(tf.data.TFRecordDataset, **config['interleave'])
# Shuffle the examples if needed
if 'shuffle_micro' in config:
    dataset = dataset.shuffle(**config['shuffle_micro'])
# Preprocess the examples with respect to a given spec, pad the examples
# and form batches of different sizes, and postprocess the batches
dataset = dataset \
    .map(_preprocess, **config['map']) \
    .padded_batch(padded_shapes=_shape(), **config['batch']) \
    .map(_postprocess, **config['map'])
# Prefetch the batches if needed
if 'prefetch' in config:
    dataset = dataset.prefetch(**config['prefetch'])
# Repeat the data once the source is exhausted if needed
if 'repeat' in config:
    dataset = dataset.repeat(**config['repeat'])
The pipeline is self-explanatory. It is simply a chain of operations stacked on top of each other. It is, however, worth taking a closer look at the preprocessing and postprocessing mappings, which can be seen before and after the padding step, respectively:
def _preprocess(proto):
    spec = self.transforms[config['transform']] \
        .transformed_feature_spec()
    example = tf.io.parse_single_example(proto, spec)
    return (
        {name: example[name] for name in self.contextual_names},
        {
            # Convert the sequential columns from sparse to dense
            name: self.schema[name].to_dense(example[name])
            for name in self.sequential_names
        },
    )

def _postprocess(contextual, sequential):
    sequential = {
        # Convert the sequential columns from dense to sparse
        name: self.schema[name].to_sparse(sequential[name])
        for name in self.sequential_names
    }
    return {**contextual, **sequential}
Currently, tf.data does not support padding sparse tensors, which is the representation used for sequential features in TensorFlow. In the running example about forecasting weather, such features are duration and temperature. This is the reason such features are converted to their dense counterparts in _preprocess. However, the final representation has to be sparse still. Therefore, the sequential features are converted back to the sparse format in _postprocess. Hopefully, this back-and-forth conversion will be rendered obsolete in future versions.
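For readers less familiar with padding, the following sketch in plain Python shows what padding a batch of dense sequences of different lengths amounts to conceptually. It is not how tf.data implements it; `pad_batch` is a hypothetical helper, and the temperatures are made up.

```python
def pad_batch(sequences, value=0.0):
    """Pad the sequences to the length of the longest one in the batch."""
    length = max(len(sequence) for sequence in sequences)
    padded = [
        list(sequence) + [value] * (length - len(sequence))
        for sequence in sequences
    ]
    # Keep the original lengths so that the padding can be told apart
    lengths = [len(sequence) for sequence in sequences]
    return padded, lengths


# Hypothetical temperature sequences of three examples
batch = [[21.5, 23.0, 22.1], [19.0], [25.2, 24.8]]
padded, lengths = pad_batch(batch)
```

The result is a rectangular array, which is what makes batching possible; the recorded lengths play a role similar to the one of the sparse representation, which remembers where the actual values are.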
Having executed the above steps, we have an instance of tf.data.Dataset, which is the ultimate goal, as it is the standard way of ingesting data into a TensorFlow graph. At this point, one might create a Keras model leveraging tf.keras.layers.DenseFeatures and tf.keras.experimental.SequenceFeatures for constructing the input layer and then pass the data set to the fit function of the model. A skeleton for this part can be found in the repository.
In this article, we have discussed a scalable approach to the ingestion of sequential observations from BigQuery into a TensorFlow graph. The key tools that have been used to this end are TensorFlow Extended in combination with Cloud Dataflow and the tf.data API of TensorFlow.
In addition, the provided code has been written to be general and easily customizable. This has been achieved by separating the configuration part from the implementation one.
Although, as we shall see, the ideas are straightforward, direct calculations were impossible to perform before computers. To be able to answer this kind of question back then, statisticians developed mathematical theories in order to approximate the calculations for specific situations. Since nothing else was possible, these approximations and the various terms and conditions under which they operate made up a large part of traditional textbooks and courses in statistics. However, the advent of today’s computing power has enabled one to estimate required sample sizes in a more direct and intuitive way, with the only prerequisites being an understanding of statistical inference, the availability of historical data describing the status quo, and the ability to write a few lines of code in a programming language.
For concreteness, consider the following scenario. We run an online business and hypothesize that a specific change in promotion campaigns, such as making them personalized, will have a positive effect on a specific performance metric, such as the average deposit. In order to investigate if it is the case, we decide to perform a two-sample test. There are the following two competing hypotheses.
The null hypothesis postulates that the change has no effect on the metric.
The alternative hypothesis postulates that the change has a positive effect on the metric.
There will be two groups: a control group and a treatment group. The former will be exposed to the current promotion policy, while the latter to the new one. There are also certain requirements imposed on the test. First, we have a level of statistical significance \(\alpha\) and a level of practical significance \(\delta\) in mind. The former puts a limit on the false-positive rate, and the latter indicates the smallest effect that we still care about; anything smaller is as good as zero for any practical purpose. In addition, we require the test to have a prescribed false-negative rate \(\beta\), ensuring that the test has enough statistical power.
For our purposes, the test is considered well designed if it is capable of detecting a difference as small as \(\delta\) so that the false-positive and false-negative rates are controlled to levels \(\alpha\) and \(\beta\), respectively. Typically, parameters \(\alpha\) and \(\delta\) are held constant, and the desired false-negative rate \(\beta\) is attained by varying the number of participants in each group, which we denote by \(n\). Note that we do not want any of the parameters to be smaller than the prescribed values, as it would be wasteful.
So what should the sample size be for the test to be well designed?
Depending on the distribution of the data and on the chosen metric, one might or might not be able to find a suitable test among the standard ones, while ensuring that the test’s assumptions can safely be considered satisfied. More importantly, a textbook solution might not be the most intuitive one, which, in particular, might lead to misuse of the test. It is the understanding that matters.
Here we take a more pragmatic and rather general approach that circumvents the above concerns. It requires only historical data and basic programming skills. Despite its simplicity, the method below goes straight to the core of what the famed statistical tests are doing behind all the math. The approach belongs to the class of so-called bootstrap techniques and is as follows.
Suppose we have historical data on customers’ behavior under the current promotion policy, which is commonplace in practice. An important realization is that this data set represents what we expect to observe in the control group. It is also what is expected of the treatment group provided that the null hypothesis is true, that is, when the proposed change has no effect. This realization enables one to simulate what would happen if each group was limited to an arbitrary number of participants. Then, by varying this size parameter, it is possible to find the smallest value that makes the test well designed, that is, make the test satisfy the requirements on \(\alpha\), \(\beta\), and \(\delta\), as discussed in the previous section.
This is all. The rest is an elaboration of the above idea.
The simulation entails the following. To begin with, note that what we are interested in testing is the difference between the performance metric applied to the treatment group and the same metric applied to the control group, which is referred to as the test statistic:
Test statistic = Metric(Treatment sample) - Metric(Control sample).
Treatment sample and Control sample stand for sets of observations, and Metric(Sample) stands for computing the performance metric given such a sample. For instance, each observation could be the total deposit of a customer, and the metric could be the average value:
Metric(Sample) = Sum of observations / Number of observations.
Note, however, that it is an example; the metric can be arbitrary, and this is a huge advantage of this approach to sample size determination based on data and simulation.
Large positive values of the test statistic speak in favor of the treatment (that is, the new promotion policy in our example), while those that are close to zero suggest that the treatment is futile.
A sample of \(n\) observations corresponding to the status quo (that is, the current policy in our example) can be easily obtained by drawing \(n\) data points with replacement from the historical data:
Sample = Choose random with replacement(Data, N).
This expression is used for Control sample under both the null and alternative hypotheses. As alluded to earlier, this is also how Treatment sample is obtained under the null. Regarding the alternative hypothesis being true, one has to express the hypothesized outcome as a distribution for the case of the minimal detectable difference, \(\delta\). A simple and reasonable solution is to sample the data again, apply the metric, and then adjust the result to reflect the alternative hypothesis:
Metric(Choose random with replacement(Data, N)) + Delta.
Here, again, one is free to change the logic under the alternative according to the situation at hand. For instance, instead of an additive effect, one could simulate a multiplicative one.
The above is a way to simulate a single instance of the experiment under either the null or alternative hypothesis; the result is a single value for the test statistic. The next step is to estimate how the test statistic would vary if the experiment was repeated many times in the two scenarios. This simply means that the procedure should be repeated multiple times:
Repeat many times {
  Sample 1 = Choose random with replacement(Data, N)
  Sample 2 = Choose random with replacement(Data, N)
  Metric 1 = Metric(Sample 1)
  Metric 2 = Metric(Sample 2)
  Test statistic under null = Metric 1 - Metric 2

  Sample 3 = Choose random with replacement(Data, N)
  Sample 4 = Choose random with replacement(Data, N)
  Metric 3 = Metric(Sample 3) + Delta
  Metric 4 = Metric(Sample 4)
  Test statistic under alternative = Metric 3 - Metric 4
}
This yields a collection of values for the test statistic under the null hypothesis and a collection of values for the test statistic under the alternative hypothesis. Each one contains realizations from the so-called sampling distribution in the corresponding scenario. The following figure gives an illustration:
The blue shape is the sampling distribution under the null hypothesis, and the red one is the sampling distribution under the alternative hypothesis. We shall come back to this figure shortly.
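To make the repetition concrete, here is a minimal Python sketch of the loop described above. It is not the implementation from the appendix: the function name `simulate`, the use of the mean as the metric, the number of replications, and the synthetic historical data are all assumptions made for illustration.

```python
import numpy as np


def simulate(data, n, delta, metric=np.mean, replications=1000, seed=0):
    """Return test statistics under the null and under the alternative."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)

    def draw():
        # One bootstrap sample of size n, reduced to a single metric value.
        return metric(rng.choice(data, size=n, replace=True))

    null = np.empty(replications)
    alternative = np.empty(replications)
    for k in range(replications):
        # Under the null, both groups come from the same process.
        null[k] = draw() - draw()
        # Under the alternative, the treatment metric is shifted by delta.
        alternative[k] = (draw() + delta) - draw()
    return null, alternative


# Hypothetical historical data: 5,000 exponentially distributed values.
data = np.random.default_rng(42).exponential(scale=10.0, size=5000)
null, alternative = simulate(data, n=200, delta=2.0)
print(null.mean(), alternative.mean())
```

The two returned arrays play the role of the two collections of test statistics; plotting their histograms would give a picture of the same kind as the figure.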
These two distributions of the test statistic are what we are after, as they allow one to compute the false-negative rate and eventually choose a sample size. First, given \(\alpha\), the sampling distribution under the null (the blue one) is used in order to find a value beyond which the probability mass is equal to \(\alpha\):
Critical value = Quantile([Test statistic under null], 1 - alpha).
Here, Quantile computes the quantile specified by the second argument given a set of observations. The resulting quantity is called the critical value of the test. In the figure above, it is denoted by a dashed line. When the test statistic falls to the right of the critical value, we reject the null hypothesis; otherwise, we fail to reject it. Second, the sampling distribution in the case of the alternative hypothesis being true (the red one) is used in order to compute the false-negative rate:
Attained beta = Mean([Test statistic under alternative < Critical value]).
It corresponds to the probability mass of the sampling distribution under the alternative to the left of the critical value. In the figure, it is the red area to the left of the dashed line.
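The two computations can be sketched in Python as follows. The function names are hypothetical, and the two normal samples merely stand in for the simulated sampling distributions; in the actual procedure, they would come from resampling the historical data.

```python
import numpy as np


def critical_value(null, alpha):
    """Value of the test statistic with probability mass alpha to its right."""
    return np.quantile(null, 1 - alpha)


def attained_beta(alternative, critical):
    """Share of the alternative distribution to the left of the critical value."""
    return np.mean(np.asarray(alternative) < critical)


# Stand-in sampling distributions (assumed normal purely for illustration).
rng = np.random.default_rng(0)
null = rng.normal(0.0, 1.0, size=100_000)
alternative = rng.normal(2.0, 1.0, size=100_000)

critical = critical_value(null, alpha=0.05)
beta = attained_beta(alternative, critical)
print(critical, beta)
```

Comparing against the critical value produces a Boolean array, and its mean is the fraction of simulated experiments in which the null would not be rejected even though the alternative is true.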
The final step is to put the above procedure in an optimization loop that minimizes the distance between the target and attained \(\beta\)’s with respect to the sample size:
Optimize N until Attained beta is close to Target beta {
  Repeat many times {
    Test statistic under null = ...
    Test statistic under alternative = ...
  }
  Critical value = ...
  Attained beta = ...
}
This concludes the calculation of the size that the control and treatment groups should have in order for the upcoming test in promotion campaigns to be well designed in terms of the level of statistical significance \(\alpha\), the false-negative rate \(\beta\), and the level of practical significance \(\delta\).
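As a rough illustration of the whole loop, the sketch below searches for the sample size by bisection instead of a general-purpose optimizer. This is a substitution made here for brevity: it assumes that the attained \(\beta\) decreases as the sample size grows and that the required size lies below the hypothetical `upper` bound (if it does not, `upper` is returned unchanged). The metric is assumed to be the mean, and the data are synthetic.

```python
import numpy as np


def attained_beta(data, n, delta, alpha, replications, rng):
    """Simulate both scenarios for size n and return the false-negative rate."""
    # Four independent bootstrap samples of size n per replication.
    indices = rng.integers(0, len(data), size=(4, replications, n))
    means = data[indices].mean(axis=-1)
    null = means[0] - means[1]
    alternative = (means[2] + delta) - means[3]
    critical = np.quantile(null, 1 - alpha)
    return np.mean(alternative < critical)


def find_sample_size(data, delta, alpha=0.05, target_beta=0.2,
                     lower=1, upper=2000, replications=1000, seed=0):
    """Bisect on n, assuming the attained beta decreases as n grows."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    while upper - lower > 1:
        middle = (lower + upper) // 2
        beta = attained_beta(data, middle, delta, alpha, replications, rng)
        if beta > target_beta:
            lower = middle  # Too few observations: beta is still too high.
        else:
            upper = middle  # Enough observations: try a smaller size.
    return upper


# Hypothetical historical data: 5,000 exponentially distributed values.
data = np.random.default_rng(42).exponential(scale=10.0, size=5000)
size = find_sample_size(data, delta=2.0)
print(size)
```

Since each evaluation of the attained \(\beta\) is itself a noisy simulation, the result fluctuates from run to run; increasing the number of replications reduces this noise at the cost of computation time.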
An example of how this technique could be implemented in practice can be found in the appendix.
In this article, we have discussed an approach to sample size determination that is based on historical data and computer simulation rather than on mathematical formulae tailored for specific situations. It is general and straightforward to implement. More importantly, the technique is intuitive, since it directly follows the narrative of null hypothesis significance testing. It does require prior knowledge of the key concepts in statistical inference. However, this knowledge is arguably essential for those who are involved in scientific experimentation. It constitutes the core of statistical literacy.
This article was inspired by a blog post authored by Allen Downey and a talk given by John Rauser. I also would like to thank Aaron Rendahl for his feedback on the introduction to the method presented here and for his help with the implementation given in the appendix.
The following listing shows an implementation of the bootstrap approach in R:
The illustrative figure shown in the solution section displays the sampling distribution of the test statistic under the null and alternative for the sample size found by this code snippet.
A bare-bones net promoter survey is composed of only one question: “How likely are you to recommend us to a friend?” The answer is an integer ranging from 0 to 10 inclusively. If the grade is between 0 and 6 inclusively, the person in question is said to be a detractor. If it is 7 or 8, the person is said to be a neutral. Lastly, if it is 9 or 10, the person is deemed a promoter. The net promoter score itself is then the percentage of promoters minus the percentage of detractors. The minimum and maximum attainable values of the score are −100 and 100, respectively. In this case, the greater, the better.
As is usually the case with surveys, a small but representative subset of customers is reached out to, and the collected responses are then used to draw conclusions about the target population of customers. Our objective is to facilitate this last step by estimating the net promoter score given a set of responses and, necessarily, to quantify and put front and center the uncertainty in our estimates.
Before we proceed, since a net promoter survey is an observational study, which is prone to such biases as participation and response biases, great care must be taken when analyzing the results. In this article, however, we focus on the inference of the net promoter score under the assumption that the given sample of responses is representative of the target population.
In practice, one is interested in knowing the net promoter score for different subpopulations of customers, such as countries of operation and age groups, which is the scenario that we shall target. To this end, suppose that there are \(m\) segments of interest, and each customer belongs to exactly one of them. The results of a net promoter survey can then be summarized using the following \(m \times 3\) matrix:
\[y = \left( \begin{matrix} d_1 & n_1 & p_1 \\ \vdots & \vdots & \vdots \\ d_i & n_i & p_i \\ \vdots & \vdots & \vdots \\ d_m & n_m & p_m \end{matrix} \right)\]where \(d_i\), \(n_i\), and \(p_i\) denote the number of detractors, neutrals, and promoters in segment \(i\), respectively. For segment \(i\), the observed net promoter score can be computed as follows:
\[\hat{s}_i = 100 \times \frac{p_i - d_i}{d_i + n_i + p_i}.\]However, this observed score is a single scalar value calculated using \(d_i + n_i + p_i\) data points, which is only a subset of the corresponding subpopulation. It may or may not correspond well to the actual net promoter score of that subpopulation. We have no reason to trust it, since the above estimate alone does not tell us anything about the uncertainty associated with it. Uncertainty quantification is essential for sound decision-making, which is what we are after.
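As a small illustration, the observed score can be computed for all segments at once from the matrix \(y\). The function name and the counts below are made up for the example.

```python
import numpy as np


def observed_score(y):
    """Observed net promoter score for each row (segment) of y."""
    y = np.asarray(y, dtype=float)
    d, n, p = y[:, 0], y[:, 1], y[:, 2]
    return 100 * (p - d) / (d + n + p)


# Hypothetical counts of detractors, neutrals, and promoters in two segments.
y = [
    [20, 30, 50],
    [1, 0, 3],
]
scores = observed_score(y)
print(scores)
```

The second segment obtains a higher observed score than the first one while being based on only four responses, which is exactly the situation where a point estimate alone is misleading.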
Ideally, for each segment, given the observed data, we would like to have a distribution of all possible values of the score with probabilities attached. Such a probability distribution would be exhaustive information, from which any other statistic could be easily derived. Here we tackle the problem by means of Bayesian inference, which we discuss next.
In order to perform Bayesian inference of the net promoter score, we need to decide on an adequate Bayesian model for the problem at hand. Recall first that we are interested in inferring scores for several segments. Even though there might be segment-specific variations in the product, such as special offers in certain countries, or in customers’ perception of the product, such as age-related preferences, it is conceptually the same product that the customers were asked to evaluate. It is then sensible to expect the scores in different segments to have something in common. With this in mind, we construct a hierarchical model with parameters shared by the segments.
First, let
\[\theta_i = (\theta_{id}, \theta_{in}, \theta_{ip}) \in \langle 0, 1 \rangle^3\]be a triplet of parameters corresponding to the proportions of detractors, neutrals, and promoters in segment \(i\), respectively, with the constraint that they have to sum up to one. The constraint makes the triplet a simplex, which is what is emphasized by the angle brackets on the right-hand side. These are the main parameters we are interested in inferring. If the true value of \(\theta_i\) were known, the true net promoter score would be computed as follows:
\[s_i = 100 \times (\theta_{ip} - \theta_{id}).\]Parameter \(\theta_i\) can also be thought of as a vector of probabilities of observing one of the three types of customers in segment \(i\), that is, detractors, neutrals, and promoters. Then the natural model for the observed data is a multinomial distribution with \(d_i + n_i + p_i\) trials and probabilities \(\theta_i\):
\[y_i | \theta_i \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i)\]where \(y_i\) refers to the \(i\)th row of matrix \(y\) introduced earlier. The family of multinomial distributions is a generalization of the family of binomial distributions to more than two outcomes.
The above gives a data distribution. In order to complete the modeling part, we need to decide on a prior probability distribution for \(\theta_i\). Each \(\theta_i\) is a simplex of probabilities. In such a case, a reasonable choice is a Dirichlet distribution:
\[\theta_i | \phi \sim \text{Dirichlet}(\phi)\]where \(\phi = (\phi_d, \phi_n, \phi_p)\) is a vector of strictly positive parameters. This family of distributions is a generalization of the family of beta distributions to more than two categories. Note that \(\phi\) is the same for all segments, which is what enables information sharing. In particular, it means that the less reliable estimates for segments with fewer observations will be shrunk toward the more reliable estimates for segments with more observations. In other words, with this architecture, segments with fewer observations are able to draw strength from those with more observations.
How about \(\phi\)? This triplet is a characteristic of the product irrespective of the segment. Its individual components can be utilized in order to encode one’s prior knowledge about the net promoter score. Specifically, \(\phi_d\), \(\phi_n\), and \(\phi_p\) could be set to imaginary observations of detractors, neutrals, and promoters, respectively, reflecting one’s beliefs prior to conducting the survey. The higher these imaginary counts are, the more certain one claims to be about the true score. One could certainly set these hyperparameters to fixed values; however, a more comprehensive solution is to infer them from the data as well, giving the model more flexibility by making it hierarchical. In addition, an inspection of \(\phi\) afterward can provide insights into the overall satisfaction with the product.
We now need to specify a prior, or rather a hyperprior, for \(\phi\). We proceed under the assumption that we have little knowledge about the true score. Even if there were surveys in the past, it is still a valid choice, especially when the product evolves rapidly, rendering prior surveys marginally relevant.
Now, it is more convenient to think in terms of expected values and variances instead of imaginary counts, which is what \(\phi\) represents. Let us find an alternative parameterization of the Dirichlet distribution. The expected value of this distribution is as follows:
\[\mu = (\mu_d, \mu_n, \mu_p) = \frac{\phi}{\phi_d + \phi_n + \phi_p} \in \langle 0, 1 \rangle^3.\]It can be seen that it is a simplex of proportions of detractors, neutrals, and promoters of the whole population, which is similar to \(\theta_i\) describing segment \(i\). Regarding the variance,
\[\sigma^2 = \frac{1}{\phi_d + \phi_n + \phi_p}\]is considered to capture it sufficiently well. Solving the system of the last two equations for \(\phi\) yields the following result:
\[\phi = \frac{\mu}{\sigma^2}.\]The prior for \(\theta_i\) can then be rewritten as follows:
\[\theta_i | \mu, \sigma \sim \text{Dirichlet}\left(\frac{\mu}{\sigma^2}\right).\]This new parameterization requires two hyperpriors: one is for \(\mu\), and one is for \(\sigma\). For \(\mu\), a reasonable choice is a uniform distribution (over a simplex), and for \(\sigma\), a half-Cauchy distribution:
\[\begin{align} & \mu \sim \text{Uniform}(\langle 0, 1 \rangle^3) \text{ and} \\ & \sigma \sim \text{Half-Cauchy}(0, 1). \end{align}\]The two distributions are relatively weak, which is intended in order to let the data speak for themselves. At this point, all parameters have been defined. Of course, one could go further if the problem at hand had a deeper structure; however, in this case, it is arguably not justifiable.
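The reparameterization of the Dirichlet distribution in terms of \(\mu\) and \(\sigma\) can be checked numerically. The sketch below converts between the two parameterizations and confirms by simulation that the mean of the distribution is \(\mu\); the function names and the specific values are chosen purely for illustration.

```python
import numpy as np


def to_phi(mu, sigma):
    """Map the mean and scale to the Dirichlet parameters: phi = mu / sigma^2."""
    return np.asarray(mu, dtype=float) / sigma**2


def to_mu_sigma(phi):
    """Inverse map: mu = phi / sum(phi) and sigma^2 = 1 / sum(phi)."""
    phi = np.asarray(phi, dtype=float)
    total = phi.sum()
    return phi / total, 1 / np.sqrt(total)


mu = np.array([0.2, 0.3, 0.5])  # Illustrative proportions
sigma = 0.25  # Illustrative scale
phi = to_phi(mu, sigma)
mu_back, sigma_back = to_mu_sigma(phi)

# The average of many Dirichlet draws should be close to mu.
draws = np.random.default_rng(0).dirichlet(phi, size=100_000)
print(phi, mu_back, sigma_back, draws.mean(axis=0))
```

A smaller \(\sigma\) yields larger components of \(\phi\), that is, more imaginary observations and hence segment-level proportions that are concentrated more tightly around \(\mu\).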
The final model is as follows:
\[\begin{align} y_i | \theta_i & \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i), \\ \theta_i | \mu, \sigma & \sim \text{Dirichlet}(\mu / \sigma^2), \\ \mu & \sim \text{Uniform}(\langle 0, 1 \rangle^3), \text{ and} \\ \sigma & \sim \text{Half-Cauchy}(0, 1). \end{align}\]The posterior distribution factorizes as follows:
\[p(\theta_1, \dots, \theta_m, \mu, \sigma | y) \propto p(y | \theta_1, \dots, \theta_m) \, p(\theta_1 | \mu, \sigma) \cdots p(\theta_m | \mu, \sigma) \, p(\mu) \, p(\sigma),\]which relies on the usual assumption of independence given the parameters. One could make a few simplifications by, for instance, leveraging the conjugacy of the Dirichlet distribution with respect to the multinomial distribution; however, it is not needed in practice, as we shall see shortly.
The above posterior distribution is our ultimate goal. It is the one that gives us a complete picture of what the true net promoter score in each segment might be given the available evidence, that is, the responses from the survey. All that is left is to draw a large enough sample from this distribution and start to summarize and visualize the results.
Unfortunately, as one might suspect, drawing samples from the posterior is not an easy task. It does not correspond to any standard distribution and hence does not have a readily available random number generator. Fortunately, the topic is sufficiently mature, and techniques have been developed for sampling complex distributions, such as the family of Markov chain Monte Carlo methods. Unfortunately, the most effective and efficient of these techniques are notoriously complex themselves, and it might be extremely difficult and tedious to implement and apply them correctly in practice. Fortunately, the need for versatile tools for modeling and inference with the focus on the problem at hand and not on implementation details has been recognized and addressed. Nontrivial scenarios can be tackled with a surprisingly small amount of effort nowadays, which we illustrate next.
In this section, we implement the model using the probabilistic programming language Stan. Stan is straightforward to integrate into one’s workflow, as it has interfaces for many general-purpose programming languages, including Python and R. Here we only highlight the main points of the implementation and leave it to the curious reader to discover Stan on their own.
The following listing is a complete implementation of the model:
data {
  int<lower = 0> m; // The number of segments
  int<lower = 0> n; // The number of categories, which is always three
  // The observed counts of detractors, neutrals, and promoters
  // (array syntax of Stan 2.26 and later; older releases used int y[m, n])
  array[m, n] int y;
}
parameters {
  simplex[n] mu;
  real<lower = 0> sigma;
  array[m] simplex[n] theta;
}
transformed parameters {
  vector<lower = 0>[n] phi;
  phi = mu / sigma^2;
}
model {
  mu ~ uniform(0, 1);
  sigma ~ cauchy(0, 1);
  for (i in 1:m) {
    theta[i] ~ dirichlet(phi);
    y[i] ~ multinomial(theta[i]);
  }
}
It can be seen that the code is very laconic and follows closely the development given in the previous section, including the notation. It is worth noting that, in the model block, we seemingly use unconstrained uniform and Cauchy distributions; however, the constraints are enforced by the definitions of the corresponding hyperparameters, mu and sigma.
This is practically all that is needed; the rest will be taken care of by Stan, which is actually a lot of work, including an adequate initialization, an efficient execution, and necessary diagnostics and quality checks. Under the hood, the sampling of the posterior in Stan is based on the Hamiltonian Monte Carlo algorithm and the no-U-turn sampler, which are considered to be the state of the art.
The output of the sampling procedure is a set of draws from the posterior distribution, which, again, is exhaustive information about the net promoter score in the segments of interest. In particular, one can quantify the uncertainty in and the probability of any statement one makes about the score. For instance, if a concise summary is needed, one could compute the mean of the score and accompany it with a high-posterior-density credible interval, capturing the true value with the desired probability. However, if applicable, the full distribution should be integrated into the decision-making process.
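The following Python sketch shows how such a summary could be computed from a set of draws. Since running Stan is outside the scope of this snippet, the draws are generated from a stand-in: with \(\phi\) held fixed at a hypothetical value instead of being inferred, conjugacy makes the posterior of \(\theta_i\) a Dirichlet distribution with parameters \(\phi + y_i\). The function `hpd_interval` is an illustrative helper, not part of any library.

```python
import numpy as np


def hpd_interval(draws, probability=0.9):
    """Narrowest interval containing the given share of the draws."""
    draws = np.sort(np.asarray(draws, dtype=float))
    count = int(np.ceil(probability * len(draws)))
    # Widths of all windows spanning `count` consecutive sorted draws.
    widths = draws[count - 1:] - draws[: len(draws) - count + 1]
    start = int(np.argmin(widths))
    return draws[start], draws[start + count - 1]


rng = np.random.default_rng(0)
phi = np.array([1.0, 1.0, 1.0])  # Fixed here for illustration only
y_i = np.array([1, 0, 3])  # Hypothetical counts for one segment

# Stand-in posterior draws of theta_i (columns: d, n, and p).
theta = rng.dirichlet(phi + y_i, size=10_000)
score = 100 * (theta[:, 2] - theta[:, 0])

mean = score.mean()
lower, upper = hpd_interval(score, probability=0.9)
print(mean, lower, upper)
```

With draws obtained from Stan, only the line generating `theta` would change; the score and its summaries are computed from the draws in the same way.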
In this article, we have constructed a hierarchical Bayesian model for inferring the net promoter score for an arbitrary segmentation of a customer base. The model features shared parameters, which enable information exchange between the segments. This allows for a more robust estimation of the score, especially in the case of segments with few observations. The final output of the inference is a probability distribution over all possible values of the score in each segment, which lays a solid foundation for the subsequent decision-making. We have also seen how seamlessly the model can be implemented in practice using modern tools for statistical inference, such as Stan.
Lastly, note that the presented model is only one alternative; there are many others. How would you model the net promoter score? What changes would you make? Make sure to leave a comment.
Python and R (in alphabetic order) are arguably the primary languages used by data scientists nowadays. In the context of interactive computations, IPython and later on Project Jupyter have been of paramount importance for the Python community (the latter is actually language agnostic). In the R community, this role has been played by RStudio. Therefore, having at one’s disposal JupyterLab, which is Project Jupyter’s flagship, and RStudio should make one well equipped for a wide range of data challenges. As alluded to earlier, the objective is to have an environment that has a fixed initial state defined by us and is accessible to us on any machine we might happen to work on. This problem definition is a perfect fit for containerization. Specifically, we shall build custom-tailored Docker images for JupyterLab and RStudio and create a few convenient shortcuts for launching them.
The code discussed below can be found in the following two repositories:
In order to build a Docker image for JupyterLab, we begin with a Dockerfile:
# Start with a minimal Python image
FROM python:3.7-slim
# Install the desired Python packages
COPY requirements.txt /tmp/requirements.txt
RUN pip install --upgrade pip
RUN pip install --upgrade --requirement /tmp/requirements.txt
# Configure JupyterLab to use a specific IP address and port
RUN mkdir -p ~/.jupyter
RUN echo "c.NotebookApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.port = 8888" >> ~/.jupyter/jupyter_notebook_config.py
# Set the working directory
WORKDIR /home/jupyterlab
# Start JupyterLab once the container is launched
ENTRYPOINT jupyter lab --allow-root --no-browser
In words, we take a minimalistic image with the desired version of Python preinstalled (in this case, the official Python image tagged 3.7-slim, which refers to Python 3.7 with any available bug fixes promptly applied) and add packages that we consider to be important for our work. These packages are gathered in the usual requirements.txt, which might look as follows:
jupyterlab
matplotlib
numpy
pandas
pylint
pytest
scikit-learn
scipy
seaborn
tensorflow
yapf
The first one, jupyterlab, is essential; the rest is up to the data scientist’s taste. An important aspect to note is that, in this example, the versions of the listed packages are not fixed; hence, the latest available versions will be taken each time a new image is built. Alternatively, one can pin them to specific numbers by changing requirements.txt. For instance, one might write tensorflow==1.14.0 instead of tensorflow.
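If pinning is preferred, the version numbers do not have to be looked up by hand. The sketch below is one possible helper, not part of the setup described in this article: it turns a list of package names into pinned requirements using whatever versions a lookup function reports. By default, the lookup is `importlib.metadata.version`, which reports the versions installed in the current environment; the stand-in lookup and its version table are hypothetical and only serve the demonstration.

```python
from importlib import metadata


def pin(names, lookup=metadata.version):
    """Append ==<version> to each package name whose version is known."""
    pinned = []
    for name in names:
        try:
            pinned.append(f"{name}=={lookup(name)}")
        except metadata.PackageNotFoundError:
            # Leave the package unpinned if its version cannot be found.
            pinned.append(name)
    return pinned


# A stand-in lookup with a made-up version table for demonstration purposes.
versions = {"tensorflow": "1.14.0"}


def lookup(name):
    if name not in versions:
        raise metadata.PackageNotFoundError(name)
    return versions[name]


pinned = pin(["jupyterlab", "tensorflow"], lookup)
print("\n".join(pinned))
```

Calling `pin` without the second argument inside a running container would produce the lines for a requirements.txt that reproduces the versions currently installed in that container.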
Having defined an image, we need a tool for orchestration. We would like to have a convenient command for actually building the image and, more importantly, a convenient command for launching a container with that image from an arbitrary directory. The versatile make to the rescue!
# The name of the Docker image
name := jupyterlab
# The directory to be mounted to the container
root ?= ${PWD}
# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .
# Start a new container
start:
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8888:8888 \
		--volume "${root}:/home/jupyterlab" \
		${name}
In the above Makefile, we define two commands: build and start. The build command instructs Docker to build a new image according to the recipe in Dockerfile. The start command launches a new container and mounts the directory specified by the root variable to the file system inside the container using the --volume option. It also forwards port 8888 inside the container, which is the one specified in Dockerfile, to port 8888 on the host machine so that JupyterLab can be reached from the browser.
Let us now go ahead and try the two commands:
make build
make start
JupyterLab should come back with usage instructions similar to the following:
...
[I 18:40:15.078 LabApp] The Jupyter Notebook is running at:
[I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=<token>
[I 18:40:15.078 LabApp] or http://127.0.0.1:8888/?token=<token>
[I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:40:15.082 LabApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://e4edba021595:8888/?token=<token>
or http://127.0.0.1:8888/?token=<token>
...
By clicking on the last link, we end up in a fully fledged JupyterLab. Congratulations! However, there is one step left. JupyterLab is currently running in the folder with our Dockerfile and Makefile, which is not particularly useful, as each project we might want to work on probably lives in its own folder elsewhere in the file system. Fortunately, it is easy to fix with an alias:
alias jupyterlab='make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'
This command should be placed in the start-up script of the shell being utilized. In the case of Bash, it can be done as follows:
echo "alias jupyterlab='make -C \"${PWD}\" root=\"\${PWD}\"'" >> ~/.bashrc
Now, in a new terminal, one should be able to run JupyterLab from any directory as follows:
cd /path/to/some/project
jupyterlab
Note that the content of the current working directory (that is, /path/to/some/project) is readily available inside JupyterLab. All notebooks created and modified in the GUI there will be stored directly in this folder, and they will remain there when the container is shut down.
It is time to get to grips with an image for R notebooks. As before, we begin with a Dockerfile:
# Start with an RStudio image
FROM rocker/rstudio:latest
# Install the software that R packages require
RUN apt-get update
RUN apt-get install -y libxml2-dev texlive texlive-latex-extra zlib1g-dev
# Set the working directory
WORKDIR /home/rstudio
# Install the desired R packages
COPY requirements.txt /tmp/requirements.txt
RUN echo "install.packages(readLines('/tmp/requirements.txt'), \
                           repos = 'http://cran.us.r-project.org')" | R --no-save
Installing RStudio from scratch is not an easy task. Fortunately, we can start with the official RStudio image, which is what is specified at the top of the file. If desired, the latest tag can be changed to a specific version. The second block of Docker instructions provides programs and libraries that are needed by the R packages that one is planning to install. For instance, TeX Live is needed for rendering notebooks as PDF documents using LaTeX. The last block of instructions in Dockerfile is for installing the R packages themselves. As with Python, all necessary packages are gathered in a single file called requirements.txt:
devtools
glmnet
plotly
rmarkdown
rstan
testthat
tidytext
tidyverse
The rmarkdown package is required for notebooks in Markdown. The rest is intended to be changed according to one’s preferences, although tidyverse is arguably a must in modern R.
All right, in order to build the image and launch containers, we create the following Makefile:
# The name of the Docker image
name := rstudio
# The directory to be mounted to the container
root ?= ${PWD}
# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .
# Start a new container
start:
	@echo "Address: http://localhost:8787/"
	@echo "User: rstudio"
	@echo "Password: rstud10"
	@echo
	@echo 'Press Control-C to terminate...'
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8787:8787 \
		--volume "${root}:/home/rstudio" \
		--env PASSWORD=rstud10 \
		${name} > /dev/null
It is similar to the one for JupyterLab; however, since the default prompt of RStudio is not as informative as the one of JupyterLab, we print our own usage instructions upon start.
The final piece is the shortcut for launching RStudio:
alias rstudio='make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'
In the case of Bash, it can be installed as follows:
echo "alias rstudio='make -C \"${PWD}\" root=\"\${PWD}\"'" >> ~/.bashrc
Now it is time to build the image, go to an arbitrary directory, and test the alias:
make build
cd /path/to/some/project
rstudio
Unlike the JupyterLab image, this one is much slower to build due to R packages traditionally compiling a lot of C++ code upon installation.
Lastly, it might be particularly convenient to have one’s GUI preferences (such as the font size in the editor) and the like automatically set up upon each container launch. This can be achieved by realizing that RStudio stores user preferences in a local folder called .rstudio. Then the start command can be adjusted to silently plant a preconfigured .rstudio into the current working directory, which can be seen in the repository accompanying this article.
Having completed the above steps, we have two Docker images: one is for Python notebooks via JupyterLab, and one is for R notebooks via RStudio. At the moment, the images are stored locally; however, they can be pushed to a public or private image repository, such as Docker Hub and Google Container Registry, and subsequently pulled on an arbitrary machine having Docker installed. Alternatively, they can be built on each machine separately. Regardless of the installation, the crucial point is that our working environment will unshakably remain in a specific pristine state defined by us.
Lastly, it is worth noting that similar images can straightforwardly be built for more specific scenarios. For instance, the following repository provides a skeleton for building and using a custom Datalab, which is Google’s wrapper for Jupyter notebooks that run in the cloud: Datalab.