<h1>Breaking sticks, or estimation of probability distributions using the Dirichlet process</h1>
<p>Ivan Ukhov, 2021-01-25, https://blog.ivanukhov.com/2021/01/25/dirichlet-process</p>
<p>Recall the last time you wanted to understand the distribution of given data.
One alternative was to plot a histogram. However, it resulted in frustration due
to the choice of the number of bins to use, which led to drastically different
outcomes. Another alternative was kernel density estimation. Despite having a
similar choice to make, it has the advantage of producing smooth estimates,
which are more realistic for continuous quantities with regularities. However,
kernel density estimation was unsatisfactory too: it did not aid in
understanding the underlying structure of the data and, moreover, provided no
means of quantifying the uncertainty associated with the results. In this
article, we discuss a Bayesian approach to the estimation of data-generating
distributions that addresses the aforementioned concerns.</p>
<p>The approach we shall discuss is based on the family of Dirichlet processes. How
specifically such processes are constructed will be described in the next
section; here, we focus on the big picture.</p>
<p>A Dirichlet process is a stochastic process, that is, an indexed sequence of
random variables. Each realization of this process is a discrete probability
distribution, which makes the process a distribution over distributions,
similarly to a Dirichlet distribution. The process has only one parameter: a
measure \(\nu: \mathcal{B} \to [0, \infty]\) in a suitable finite measure space
\((\mathcal{X}, \mathcal{B}, \nu)\) where \(\mathcal{X}\) is a set, and
\(\mathcal{B}\) is a \(\sigma\)-algebra on \(\mathcal{X}\). We shall adopt the
following notation:</p>
\[P \sim \text{Dirichlet Process}(\nu)\]
<p>where \(P\) is a <em>random</em> probability distribution that is distributed according
to the Dirichlet process. Note that measure \(\nu\) does not have to be a
probability measure; that is, \(\nu(\mathcal{X}) = 1\) is not required. To
obtain a probability measure, one can divide \(\nu\) by the total volume
\(\lambda = \nu(\mathcal{X})\):</p>
\[P_0(\cdot) = \frac{1}{\lambda} \nu(\cdot).\]
<p>Since this normalization is always possible, it is common and convenient to
replace \(\nu\) with \(\lambda P_0\) and consider the process to be
parametrized by two quantities instead of one:</p>
\[P \sim \text{Dirichlet Process}(\lambda P_0).\]
<p>Parameter \(\lambda\) is referred to as the concentration parameter of the
process.</p>
<p>There are two major ways of using the Dirichlet process for estimating
distributions: as a direct prior for the data at hand and as a mixing prior. We
begin with the former.</p>
<h1 id="direct-prior">Direct prior</h1>
<p>Given a data set of \(n\) observations \(\{ x_i \}_{i = 1}^n\), a Dirichlet
process can be used as a prior:</p>
\[\begin{align}
x_i | P_x & \sim P_x, \text{ for } i = 1, \dots, n; \text{ and} \\
P_x & \sim \text{Dirichlet Process}(\lambda P_0). \tag{1}
\end{align}\]
<p>It is important to realize that the \(x_i\)’s are assumed to be distributed
<em>not</em> according to the Dirichlet process but according to a distribution drawn
from the Dirichlet process. Parameter \(\lambda\) allows one to control the
strength of the prior: the larger it is, the more shrinkage toward the prior is
induced.</p>
<h2 id="inference">Inference</h2>
<p>Due to the conjugacy property of the Dirichlet process in the above setting, the
posterior is also a Dirichlet process and has the following simple form:</p>
\[P_x | \{ x_i \}_{i = 1}^n
\sim \text{Dirichlet Process}\left( \lambda P_0 + \sum_{i = 1}^n \delta_{x_i} \right). \tag{2}\]
<p>That is, the total volume and normalized measure are updated as follows:</p>
\[\begin{align}
\lambda & := \lambda + n \quad \text{and} \\
P_0 & := \frac{\lambda}{\lambda + n} P_0 + \frac{1}{\lambda + n} \sum_{i = 1}^n \delta_{x_i}.
\end{align}\]
<p>Here, \(\delta_x(\cdot)\) is the Dirac measure, meaning that \(\delta_x(X) = 1\)
if \(x \in X\) for any \(X \subseteq \mathcal{X}\), and otherwise, it is zero.
It can be seen in Equation (2) that the base measure has simply been augmented
with unit masses placed at the \(n\) observed data points.</p>
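<p>As a minimal sketch of the update in Equation (2), the following Python snippet (standard library only) computes the updated total volume and draws one atom from the updated normalized base measure: with probability \(\lambda / (\lambda + n)\) from the original \(P_0\) and otherwise uniformly from the observations. The function names, the toy data, and the Gaussian stand-in for \(P_0\) are illustrative assumptions and are not taken from the scripts accompanying this article.</p>

```python
import random


def posterior_parameters(lam, data):
    """Return the updated total volume and the weight left on the original P_0."""
    n = len(data)
    lam_post = lam + n
    prior_weight = lam / lam_post  # the remaining mass is spread over the data
    return lam_post, prior_weight


def sample_posterior_base(rng, lam, data, sample_prior):
    """Draw one atom from the updated normalized base measure."""
    _, prior_weight = posterior_parameters(lam, data)
    if rng.random() < prior_weight:
        return sample_prior(rng)  # a fresh draw from the original P_0
    return rng.choice(data)  # otherwise, one of the observations, uniformly


rng = random.Random(0)
data = [9.2, 19.5, 20.1, 22.9, 33.0]  # made-up numbers, not the galaxy data
lam_post, prior_weight = posterior_parameters(1.0, data)
# Hypothetical choice of P_0: a Gaussian with mean 20 and standard deviation 5.
atom = sample_posterior_base(rng, 1.0, data, lambda r: r.gauss(20.0, 5.0))
```

<p>With \(\lambda = 1\) and five observations, the original \(P_0\) keeps only one sixth of the mass, which illustrates how quickly the data start to dominate when the prior volume is small.</p>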
<p>The main question now is how to draw samples from a Dirichlet process given
\(\lambda\) and \(P_0\).</p>
<p>As noted earlier, a draw from a Dirichlet process is a discrete probability
distribution \(P_x\). The probability measure of this distribution admits the
following representation:</p>
\[P_x(\cdot) = \sum_{i = 1}^\infty p_i \delta_{x_i}(\cdot) \tag{3}\]
<p>where \(\{ p_i \}\) is a set of probabilities that sum up to one, and \(\{ x_i
\}\) is a set of points in \(\mathcal{X}\). Such a draw can be obtained using
the so-called stick-breaking construction, which prescribes \(\{ p_i \}\) and
\(\{ x_i \}\). To begin with, for practical computations, the infinite summation
is truncated to retain only the first \(m\) elements:</p>
\[P_x(\cdot) = \sum_{i = 1}^m p_i \delta_{x_i}(\cdot).\]
<p>Atoms \(\{ x_i \}_{i = 1}^m\) are drawn independently from the normalized base
measure \(P_0\). The calculation of probabilities \(\{ p_i \}\) is more
elaborate, and this is where the construction and this article get their name,
“stick breaking.” Imagine a stick of unit length, representing the total
probability. The procedure is to keep breaking the stick into two parts where,
for each iteration, the left part yields \(p_i\), and the right one, the
remainder, is carried over to the next iteration. How much to break off is
decided on by drawing \(m\) independent realizations from a carefully chosen
beta distribution:</p>
\[q_i \sim \text{Beta}(1, \lambda), \text{ for } i = 1, \dots, m. \tag{4}\]
<p>All of them lie in the unit interval and are the proportions to break off of the
remainder. When \(\lambda = 1\), these proportions (of the remainder) are
uniformly distributed. When \(\lambda < 1\), the probability mass is shifted to
the right, which means that there are likely to be a small number of large
pieces, covering virtually the entire stick. When \(\lambda > 1\), the
probability mass is shifted to the left, which means that there are likely to be
a large number of small pieces, struggling to reach the end of the stick.</p>
<p>Formally, the desired probabilities are given by the following expression:</p>
\[p_i = q_i \prod_{j = 1}^{i - 1} (1 - q_j), \text{ for } i = 1, \dots, m,\]
<p>which, as noted earlier, are the left parts of the remainder of the stick during
each iteration. For instance, \(p_1 = q_1\), \(p_2 = q_2 (1 - q_1)\), and so on.
Due to the truncation, the probabilities \(\{ p_i \}_{i = 1}^m\) do not sum up
to one, and it is common to set \(q_m := 1\) so that \(p_m\) takes up the
remaining probability mass.</p>
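<p>The truncated stick-breaking construction can be sketched in a few lines of Python using only the standard library. This is an illustration under the assumptions stated in the comments, not the implementation behind the figures in this article; the function name and the values of \(\lambda\) and \(m\) are arbitrary.</p>

```python
import random


def stick_breaking_weights(rng, lam, m):
    """Truncated stick breaking: return m probabilities that sum up to one."""
    weights = []
    remainder = 1.0  # length of the stick that has not been broken off yet
    for i in range(m):
        # Proportion of the remainder to break off; the last piece takes it all.
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        weights.append(q * remainder)  # the left part yields p_i
        remainder *= 1.0 - q  # the right part is carried over
    return weights


rng = random.Random(42)
p = stick_breaking_weights(rng, lam=5.0, m=20)
```

<p>Keeping track of the remainder avoids recomputing the product of \(1 - q_j\) for each \(i\), and setting the last proportion to one makes the weights sum up to one by construction.</p>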
<p>To recapitulate, a single draw from a Dirichlet process is obtained in two
steps: prescribe atoms \(\{ x_i \}\) via draws from the normalized base measure
and prescribe the corresponding probabilities \(\{ p_i \}\) via the
stick-breaking construction. The two give a complete description of a discrete
probability distribution. Recall that this distribution is still a single draw.
By repeating this process many times, one obtains the distribution of this
distribution, which can be used to, for instance, quantify uncertainty in the
estimation.</p>
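<p>To make the recapitulation concrete, here is a hedged Python sketch (standard library only) that produces complete draws, atoms together with probabilities, and repeats the procedure to obtain a rough 95% band for the cumulative distribution function at a single point. The Gaussian base measure, the value of \(\lambda\), the truncation level, and the evaluation point are assumptions made purely for illustration.</p>

```python
import random


def draw_dirichlet_process(rng, lam, sample_base, m=100):
    """One truncated draw: a list of (atom, probability) pairs."""
    draw, remainder = [], 1.0
    for i in range(m):
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        draw.append((sample_base(rng), q * remainder))
        remainder *= 1.0 - q
    return draw


def cdf(draw, x):
    """Cumulative distribution function of a discrete draw evaluated at x."""
    return sum(p for atom, p in draw if atom <= x)


rng = random.Random(1)
base = lambda r: r.gauss(20.0, 5.0)  # hypothetical normalized base measure
# Each repetition is one random distribution; together they describe the
# uncertainty about the value of the distribution function at x = 20.
values = sorted(
    cdf(draw_dirichlet_process(rng, 10.0, base), 20.0) for _ in range(200)
)
lower, upper = values[4], values[194]  # rough 2.5% and 97.5% quantiles
```

<p>The same repetition applied to a grid of points, instead of a single one, would give a band around the entire curve.</p>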
<h2 id="illustration">Illustration</h2>
<p>It is time to demonstrate how the Dirichlet process behaves as a direct prior.
To this end, we shall use a <a href="https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/galaxies.html">data set</a> containing velocities of “82
galaxies from 6 well-separated conic sections of an unfilled survey of the
Corona Borealis region.” It was studied in <a href="https://doi.org/10.2307/2289993">Roeder (1990)</a>, which gives us a
reference point.</p>
<blockquote>
<p>For the curious reader, the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2021-01-25-dirichlet-process.Rmd">notebook</a> along with
auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2021-01-25-dirichlet-process">scripts</a> that are used for performing all the calculations
presented below can be found on GitHub.</p>
</blockquote>
<p>The empirical cumulative distribution function of the velocity is as follows:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/data-cdf-1.svg" alt="" /></p>
<p>Already here, it is apparent that the distribution is multimodal: there are two
distinct regions, one to the left and one to the right, where the curve is flat,
meaning there are no observations there. The proverbial histogram gives a
confirmation:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/data-histogram-1.svg" alt="" /></p>
<p>It can be seen that there is a handful of galaxies moving relatively slowly and
relatively fast compared to the majority located somewhere in the middle around
twenty thousand kilometers per second. For completeness, kernel density
estimation results in the following plot:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/data-kde-1.svg" alt="" /></p>
<p>How many clusters of galaxies are there? What are their average velocities? How
uncertain are these estimates? Our goal is to answer these questions by virtue
of the Dirichlet process.</p>
<p>Now that the intention is to apply the presented theory in practice, we have to
make all choices we have conveniently glossed over. Specifically, \(P_0\) has to
be chosen, and we shall use the following:</p>
\[P_0(\cdot) = \text{Gaussian}(\, \cdot \, | \mu_0, \sigma_0^2). \tag{5}\]
<p>In the above, \(\text{Gaussian}(\cdot)\) refers to the probability measure of a
Gaussian distribution with parameters \(\mu_0\) and \(\sigma_0\). In addition to
these two, there is one more: \(\lambda\). We shall set \(\mu_0\) and
\(\sigma_0\) to 20 and 5, respectively—which correspond roughly to the mean and
standard deviation of the data—and present results for different \(\lambda\)’s
to investigate how the prior volume affects shrinkage toward the prior.</p>
<p>First, we do not condition on the data to get a better understanding of the
prior itself, which corresponds to Equation (1). The following figure shows a
single draw from four Dirichlet processes with different \(\lambda\)’s (the gray
curves show the cumulative distribution function of the data as a reference):</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/direct-prior-1.svg" alt="" /></p>
<p>It can be seen that the larger the prior volume, the smoother the curve. This is
because larger \(\lambda\)’s “break” the stick into more pieces, allowing the
normalized base measure to be extensively sampled, which, in the limit,
converges to this very measure; see Equation (5).</p>
<p>Now, conditioning on the observed velocities of galaxies—that is, sampling as
shown in Equation (2)—we obtain the following draws from the posterior Dirichlet
processes with different \(\lambda\)’s:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/direct-posterior-1.svg" alt="" /></p>
<p>When the prior volume is small, virtually no data points come from \(P_0\);
instead, they are mostly uniform draws from the observed data set, leading to a
curve that is nearly indistinguishable from the one of the data (the top curve).
As \(\lambda\) gets larger, the prior gets stronger, and the estimate gets
shrunk toward it, up to a point where the observations appear to be entirely
ignored (the bottom curve).</p>
<p>The above model has a serious limitation: it assumes a discrete probability
distribution for the data-generating process, which can be seen in the prior and
posterior given in Equations (1) and (2), respectively, and it is also apparent
in the decomposition given in Equation (3). In some cases, it might be
appropriate; however, there are arguably more situations where it is inadequate,
including the running example.</p>
<h1 id="mixing-prior">Mixing prior</h1>
<p>Instead of using a Dirichlet process as a direct prior for the given data, it
can be used as a prior for mixing distributions from a given family. The
resulting posterior will then naturally inherit the properties of the family,
such as continuity. The general structure is as follows:</p>
\[\begin{align}
x_i | \theta_i & \sim P_x \left( \theta_i \right), \text{ for } i = 1, \dots, n; \tag{6} \\
\theta_i | P_\theta & \sim P_\theta, \text{ for } i = 1, \dots, n; \text{ and} \\
P_\theta & \sim \text{Dirichlet Process}(\lambda P_0). \\
\end{align}\]
<p>The \(i\)th data point, \(x_i\), is distributed according to distribution
\(P_x\) with parameters \(\theta_i\). For instance, \(P_x\) could refer to the
Gaussian family with \(\theta_i = (\mu_i, \sigma_i)\) identifying a particular
member of the family by its mean and standard deviation. Parameters \(\{
\theta_i \}_{i = 1}^n\) are unknown and distributed according to distribution
\(P_\theta\). Distribution \(P_\theta\) is not known either and gets a Dirichlet
process prior with measure \(\lambda P_0\).</p>
<p>It can be seen in Equation (6) that each data point can potentially have its own
unique set of parameters. However, this is not what usually happens in practice.
If \(\lambda\) is reasonably small, the vast majority of the stick—the one we
explained how to break in the previous section—tends to be consumed by a small
number of pieces. This makes many data points share the same parameters, which
is akin to clustering. In fact, clustering is a prominent use case for the
Dirichlet process.</p>
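<p>The clustering effect can be observed in a small simulation. The following Python sketch (standard library only) takes one truncated draw of the mixing distribution and lets \(n\) data points pick their parameters from it; points that pick the same atom share parameters. The function name, the truncation level, and the two values of \(\lambda\) are illustrative assumptions.</p>

```python
import random


def count_clusters(rng, lam, n, m=200):
    """Draw n parameters from one truncated draw and count the distinct ones."""
    weights, remainder = [], 1.0
    for i in range(m):
        q = 1.0 if i == m - 1 else rng.betavariate(1.0, lam)
        weights.append(q * remainder)
        remainder *= 1.0 - q
    # Each data point picks the index of an atom; equal indices mean that the
    # corresponding data points share the same set of parameters.
    indices = rng.choices(range(m), weights=weights, k=n)
    return len(set(indices))


rng = random.Random(7)
few = count_clusters(rng, lam=1.0, n=82)  # a small concentration parameter
many = count_clusters(rng, lam=50.0, n=82)  # a large concentration parameter
```

<p>One would expect the first count to be much smaller than the second, since a small \(\lambda\) lets a handful of pieces consume most of the stick.</p>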
<h2 id="inference-1">Inference</h2>
<p>Unlike the previous model, there is no conjugacy in this case, and hence the
posterior is not a Dirichlet process. There is, however, a simple Markov chain
Monte Carlo sampling strategy based on the stick-breaking construction. It
belongs to the class of Gibbs samplers and is as follows.</p>
<p>Similarly to Equation (3), we have the following decomposition:</p>
\[P_m(\cdot) = \sum_{i = 1}^\infty p_i P_x(\cdot | \theta_i)\]
<p>where \(P_m\) is the probability measure of the mixture. As before, the infinite
decomposition has to be made finite to be usable in practice:</p>
\[P_m(\cdot) = \sum_{i = 1}^m p_i P_x(\cdot | \theta_i).\]
<p>Here, \(m\) represents an upper limit on the number of mixture components. Each
data point \(x_i\), for \(i = 1, \dots, n\), is mapped to one of the \(m\)
components, which we denote by \(k_i \in \{ 1, \dots, m \}\). In other words,
\(k_i\) takes values from 1 to \(m\) and gives the index of the component of
the \(i\)th observation.</p>
<p>There are \(m + m \times |\theta| + n\) parameters to be inferred where
\(|\theta|\) denotes the number of parameters of \(P_x\). These parameters are
\(\{ p_i \}_{i = 1}^m\), \(\{ \theta_i \}_{i = 1}^m\), and \(\{ k_i \}_{i =
1}^n\). As usual in Gibbs sampling, the parameters assume arbitrary but
compatible initial values. The sampler has the following three steps.</p>
<p>First, given \(\{ p_i \}\), \(\{ \theta_i \}\), and \(\{ x_i \}\), the mapping
of the observations to the mixture components, \(\{ k_i \}\), is updated as
follows:</p>
\[k_i \sim \text{Categorical}\left(
m,
\left\{ \frac{p_j P_x(x_i | \theta_j)}{\sum_{l = 1}^m p_l P_x(x_i | \theta_l)} \right\}_{j = 1}^m
\right), \text{ for } i = 1, \dots, n.\]
<p>That is, \(k_i\) is a draw from a categorical distribution with \(m\) categories
whose unnormalized probabilities are given by \(p_j P_x(x_i | \theta_j)\), for
\(j = 1, \dots, m\).</p>
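<p>The first step can be sketched in Python as follows, assuming a Gaussian \(P_x\) with \(\theta_j = (\mu_j, \sigma_j)\). The function names and the toy inputs are hypothetical, and components are indexed from zero, as is customary in Python, rather than from one.</p>

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian distribution with mean mu and deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def update_allocations(rng, data, p, theta):
    """Sample a component index for each observation given p and theta."""
    components = range(len(p))
    allocations = []
    for x in data:
        # Unnormalized probabilities p_j * P_x(x | theta_j); the normalization
        # is performed internally by choices().
        scores = [p[j] * gaussian_pdf(x, *theta[j]) for j in components]
        allocations.append(rng.choices(components, weights=scores)[0])
    return allocations


rng = random.Random(0)
data = [10.0, 20.0, 33.0]  # made-up observations
p = [0.5, 0.3, 0.2]
theta = [(10.0, 1.0), (20.0, 1.0), (33.0, 1.0)]  # (mean, standard deviation)
k = update_allocations(rng, data, p, theta)
```

<p>In this toy setup, each observation sits at the mean of exactly one narrow component, so the allocation is practically deterministic. With real data, it is safer to work with logarithms of the densities, since all scores of a remote observation may underflow to zero.</p>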
<p>Second, given \(\{ k_i \}\), the probabilities of the mixture components, \(\{ p_i
\}\), are updated using the stick-breaking construction described earlier. This
time, however, the beta distribution for sampling \(\{ q_i \}\) in Equation (4)
is replaced with the following:</p>
\[q_i \sim \text{Beta}\left( 1 + n_i, \lambda + \sum_{j = i + 1}^m n_j \right), \text{ for } i = 1, \dots, m,\]
<p>where</p>
\[n_i = \sum_{j = 1}^n I_{\{i\}}(k_j), \text{ for } i = 1, \dots, m,\]
<p>is the number of data points that are currently allocated to component \(i\).
Here, \(I_A\) is the indicator function of a set \(A\). As before, in order for
the \(p_i\)’s to sum up to one, it is common to set \(q_m := 1\).</p>
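<p>A possible Python sketch of the second step is given below (standard library only). It counts the data points per component and then breaks the stick using the updated beta distributions. The function name and the toy allocations are assumptions, and components are indexed from zero.</p>

```python
import random


def update_weights(rng, allocations, lam, m):
    """Sample the mixture weights given zero-based component allocations."""
    counts = [0] * m
    for k in allocations:
        counts[k] += 1
    weights, remainder = [], 1.0
    after = len(allocations)
    for i in range(m):
        after -= counts[i]  # number of points in the components following i
        if i == m - 1:
            q = 1.0  # the last piece takes up the remaining mass
        else:
            q = rng.betavariate(1.0 + counts[i], lam + after)
        weights.append(q * remainder)
        remainder *= 1.0 - q
    return weights, counts


rng = random.Random(5)
allocations = [0, 0, 0, 1, 1, 4]  # made-up allocations of six data points
weights, counts = update_weights(rng, allocations, lam=1.0, m=5)
```

<p>Components with many data points tend to receive large proportions, whereas empty components fall back to the prior, \(\text{Beta}(1, \lambda)\), adjusted for the number of points further down the stick.</p>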
<p>Third, given \(\{ k_i \}\) and \(\{ x_i \}\), the parameters of the mixture
components, \(\{ \theta_i \}\), are updated. This is done by sampling from the
posterior distribution of each component. In this case, the posterior is a prior
of choice that is updated using the data points that are currently allocated to
the corresponding component. To streamline this step, a conjugate prior for the
data distribution, \(P_x\), is commonly utilized, which we shall illustrate
shortly.</p>
<p>To recapitulate, a single draw from the posterior is obtained in a number of
steps where parameters or groups of parameters are updated in turn, while
treating the other parameters as known. This Gibbs procedure is very flexible.
Other parameters can be inferred too, instead of setting them to fixed values.
An important example is the concentration parameter, \(\lambda\). This parameter
controls the formation of clusters, and one might let the data decide what the
value should be, in which case a step similar to the third one is added to the
procedure to update \(\lambda\). This will also be illustrated below.</p>
<h2 id="illustration-1">Illustration</h2>
<p>We continue working with the galaxy data. For concreteness, consider the
following choices:</p>
\[\begin{align}
\theta_i &= (\mu_i, \sigma_i), \text{ for } i = 1, \dots, n; \\
P_x (\theta_i) &= \text{Gaussian}(\mu_i, \sigma_i^2), \text{ for } i = 1, \dots, n; \text{ and} \\
P_0(\cdot) &= \text{Gaussian–Scaled-Inverse-}\chi^2(\, \cdot \, | \mu_0, \kappa_0, \nu_0, \sigma_0^2).
\end{align} \tag{7}\]
<p>In the above, \(\text{Gaussian–Scaled-Inverse-}\chi^2(\cdot)\) refers to the
probability measure of a bivariate distribution composed of a conditional
Gaussian and an unconditional scaled inverse chi-squared distribution. Some
intuition about this distribution can be built via the following decomposition:</p>
\[\begin{align}
\mu_i | \sigma_i^2 & \sim \text{Gaussian}\left(\mu_0, \frac{\sigma_i^2}{\kappa_0}\right) \text{ and} \\
\sigma_i^2 & \sim \text{Scaled-Inverse-}\chi^2(\nu_0, \sigma_0^2).
\end{align} \tag{8}\]
<p>This prior is a conjugate prior for a Gaussian data distribution with unknown
mean and variance, which we assume here. This means that the posterior is also a
Gaussian–scaled-inverse-chi-squared distribution. Given a data set with
\(n\) observations \(x_1, \dots, x_n\), the four parameters of the prior
are updated simultaneously (not sequentially) as follows:</p>
\[\begin{align}
\mu_0 & := \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{n}{\kappa_0 + n} \mu_x, \\
\kappa_0 & := \kappa_0 + n, \\
\nu_0 & := \nu_0 + n, \text{ and} \\
\sigma_0^2 & := \frac{1}{\nu_0 + n} \left( \nu_0 \sigma_0^2 + ss_x + \frac{\kappa_0 n}{\kappa_0 + n}(\mu_x - \mu_0)^2 \right)
\end{align}\]
<p>where \(\mu_x = \sum_{i = 1}^n x_i / n\) and \(ss_x = \sum_{i = 1}^n (x_i -
\mu_x)^2\). It can be seen that \(\kappa_0\) and \(\nu_0\) act as counters of
the number of observations; \(\mu_0\) is a weighted sum of two means; and
\(\nu_0 \sigma_0^2\) is a sum of two sums of squares and a third term increasing
the uncertainty due to the difference in the means. In the Gibbs sampler, each
component (each cluster of galaxies) will have its own posterior based on the
data points that are assigned to that component during each iteration of the
process. Therefore, \(n\), \(\mu_x\), and \(ss_x\) will generally be different
for different components and, moreover, will vary from iteration to iteration.</p>
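<p>The update rule and a draw from the resulting distribution, following the decomposition in Equation (8), might look as follows in Python (standard library only). This is a sketch rather than the code used for the results in this article; the function names and the toy observations are assumptions. A scaled inverse chi-squared variate is obtained as \(\nu_0 \sigma_0^2 / X\) where \(X\) is chi-squared distributed with \(\nu_0\) degrees of freedom.</p>

```python
import math
import random


def update_hyperparameters(mu0, kappa0, nu0, sigma0_sq, xs):
    """Posterior hyperparameters given the points allocated to a component."""
    n = len(xs)
    if n == 0:
        return mu0, kappa0, nu0, sigma0_sq  # empty component: prior unchanged
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    shift = kappa0 * n / kappa_n * (mean - mu0) ** 2
    sigma_n_sq = (nu0 * sigma0_sq + ss + shift) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq


def sample_component(rng, mu0, kappa0, nu0, sigma0_sq):
    """Draw (mu, sigma) from a Gaussian-scaled-inverse-chi-squared distribution."""
    # A chi-squared variate is a gamma variate with shape nu0 / 2 and scale 2.
    chi_sq = rng.gammavariate(nu0 / 2.0, 2.0)
    sigma_sq = nu0 * sigma0_sq / chi_sq
    mu = rng.gauss(mu0, math.sqrt(sigma_sq / kappa0))
    return mu, math.sqrt(sigma_sq)


rng = random.Random(11)
xs = [19.0, 20.0, 21.0]  # made-up points allocated to one component
posterior = update_hyperparameters(20.0, 0.01, 3.0, 1.0, xs)
mu, sigma = sample_component(rng, *posterior)
```

<p>All four hyperparameters are computed from the prior values before any of them is overwritten, which is what the simultaneous update requires.</p>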
<p>We set \(\mu_0\) to 20, which is roughly the mean of the data, and \(\nu_0\) to
3, which is the smallest integer that allows the scaled inverse chi-squared distribution
to have a finite expectation. The choice of \(\kappa_0\) and \(\sigma_0\) is
more subtle. Recall Equation (8). What we would like from the prior is to allow
for free formation of clusters in a region generously covering the support of
the data. To this end, the uncertainty in the mean, \(\mu_i\), has to be high;
however, it should not come from \(\sigma_i\), since it would produce very
diffuse clusters. We set \(\kappa_0\) to 0.01 to magnify the variance of
\(\mu_i\) without affecting \(\sigma_i\), and \(\sigma_0\) to 1 to keep clusters
compact.</p>
<p>Now, let us take a look at what the above choices entail. The following figure
illustrates the prior for the mean of a component:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-mu-1.svg" alt="" /></p>
<p>The negative part is unrealistic for velocity; however, it is rarely a problem
in practice. What is important is that there is a generous coverage of the
plausible values. The following figure shows the prior for the standard
deviation of a component:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-sigma-1.svg" alt="" /></p>
<p>The bulk is below the standard deviation of the data; however, this is by
choice: we expect more than one cluster of galaxies with similar velocities.</p>
<p>As mentioned earlier, we intend to include \(\lambda\) in the inference. First,
we put the following prior:</p>
\[\lambda \sim \text{Gamma}(\alpha_0, \beta_0). \tag{9}\]
<p>Note this is the rate parameterization of the Gamma family. Conditionally, this
is a conjugate prior with the following update rule for the two parameters:</p>
\[\begin{align}
\alpha_0 & := \alpha_0 + m - 1 \quad \text{and} \\
\beta_0 & := \beta_0 - \sum_{i = 1}^{m - 1} \ln(1 - q_i)
\end{align}\]
<p>where \(\{ q_i \}\) come from the stick-breaking construction. This is the fourth
step in the Gibbs sampler. We set \(\alpha_0\) and \(\beta_0\) to 2 and 0.1,
respectively, which entails the following prior assumption about \(\lambda\):</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-lambda-1.svg" alt="" /></p>
<p>The parameter is allowed to vary freely from small to large values, as desired.</p>
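<p>The conditional update of \(\lambda\) could be sketched in Python as follows (standard library only). The function names and the toy proportions are hypothetical. Note that <code>random.gammavariate</code> expects a scale, whereas \(\beta_0\) is a rate, so the rate is inverted before sampling.</p>

```python
import math
import random


def concentration_posterior(alpha0, beta0, q):
    """Updated shape and rate given the proportions q_1, ..., q_m."""
    m = len(q)
    alpha = alpha0 + m - 1
    # The last proportion is fixed at one and is excluded from the sum.
    beta = beta0 - sum(math.log(1.0 - qi) for qi in q[:-1])
    return alpha, beta


def update_concentration(rng, alpha0, beta0, q):
    """Sample the concentration parameter from its conditional posterior."""
    alpha, beta = concentration_posterior(alpha0, beta0, q)
    return rng.gammavariate(alpha, 1.0 / beta)  # the scale is the inverse rate


rng = random.Random(3)
q = [0.5, 0.5, 1.0]  # made-up proportions with the last one set to one
alpha, beta = concentration_posterior(2.0, 0.1, q)
lam = update_concentration(rng, 2.0, 0.1, q)
```

<p>Small proportions make the sum of logarithms close to zero and hence push \(\lambda\) toward larger values, which is consistent with a stick broken into many small pieces.</p>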
<p>Having chosen all priors and their hyperparameters, we are ready to investigate
the behavior of the entire model; see Equations (6), (7), and (9). In what
follows, we shall limit the number of mixture components to 25; that is, \(m =
25\). Furthermore, we shall perform 2000 Gibbs iterations and discard the first
half as a warm-up period. As before, we start without conditioning on the data
to observe draws from the prior itself. The following figure shows two sample
draws:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-check-1.svg" alt="" /></p>
<p>It can be seen that clusters of galaxies can appear anywhere in the region of
interest and can be of various sizes. We conclude that the prior is adequate.
When taking the observed velocities into account, we obtain a full posterior
distribution in the form of 1000 draws. The following shows two random draws:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-check-1.svg" alt="" /></p>
<p>Indeed, mixture components have started to appear in the regions where there are
observations.</p>
<p>Before we proceed to the final summary of results, it is prudent to inspect
sample chains for a few parameters in order to ensure there are no problems
with convergence to the stationary distribution. The following shows the number
of occupied components among the 25 permitted:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-k-1.svg" alt="" /></p>
<p>The chain fluctuates around a fixed level without any prominent pattern, as it
should. One can plot the actual marginal posterior distribution for the number
of components; however, it is already clear that the distribution of the number
of clusters of galaxies is mostly between 5 and 10 with a median of 7.</p>
<p>As for the concentration parameter, \(\lambda\), the chain is as follows:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-lambda-1.svg" alt="" /></p>
<p>The behavior is uneventful, which is a good sign.</p>
<p>Let us now take a look at the posterior distributions of the first seven
components highlighted earlier (note the different scales on the vertical axes):</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-mu-1.svg" alt="" /></p>
<p>The components clearly change roles, which can be seen by the multimodal nature
of the distributions. Component 1 is most often at 10 (times \(10^6\) m/s);
however, it also peaks between 24 and 25 and even above 30. Components 2 and 3
are the most certain ones, which is due to a relatively large number of samples
present in the corresponding region. They seem to exchange roles and capture
velocities of around 20 and 23. Components 4 and 5, on the other hand, appear to
play the same role. Unlike Component 1, they are most likely to be found at
around 33. Components 6 and 7 are similar too. They seem to be responsible for
the small formation on the left, right next to the bulk in the middle (at
16); recall the histogram of the data. The small formation on the other side of
the bulk at around 26 is captured as well, which is mostly done by Component 6.</p>
<p>Lastly, we summarize the inference using the following figure where the median
distribution and a 95% uncertainty band—composed of distributions at the 0.025
and 0.975 quantiles—are plotted:</p>
<p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-summary-1.svg" alt="" /></p>
<p>In this view, only five components are visible to the naked eye. The median
curve agrees well with the findings in <a href="https://doi.org/10.2307/2289993">Roeder (1990)</a>. Judging by the width of the
uncertainty band, there are a lot of plausible alternatives, and it is important
to communicate this uncertainty to those who base decisions on the inference.
The ability to quantify uncertainty with such ease is a prominent advantage of
Bayesian inference.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this article, the family of Dirichlet processes has been presented in the
context of Bayesian inference. More specifically, it has been shown how a
Dirichlet process can be utilized as a prior for an unknown discrete
distribution and as a prior for mixing distributions from a given family. In
both cases, it has been illustrated how to perform inference via a finite
approximation and the stick-breaking construction.</p>
<p>Clearly, the overall procedure is more complicated than counting observations
falling in a number of fixed bins, which is what a histogram does, or placing
kernels all over the place, which is what a kernel density estimator does.
However, “anything in life worth having is worth working for.” The advantages of
the Bayesian approach include the ability to incorporate prior knowledge, which
is crucial in situations with little data, and the ability to propagate and
quantify uncertainty, which is a must.</p>
<blockquote>
<p>Recall that the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2021-01-25-dirichlet-process.Rmd">notebook</a> along with auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2021-01-25-dirichlet-process">scripts</a>
that were used for performing the calculations presented above can be found on
GitHub. Any feedback is welcome!</p>
</blockquote>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>I would like to thank <a href="https://www.mattiasvillani.com/">Mattias Villani</a> for the insightful and informative
graduate course in Bayesian statistics titled “<a href="https://github.com/mattiasvillani/AdvBayesLearnCourse">Advanced Bayesian
learning</a>,” which was the inspiration behind writing this
article, and for his guidance regarding the implementation.</p>
<h1 id="references">References</h1>
<ul>
<li>Andrew Gelman et al., <em><a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a></em>, Chapman and
Hall/CRC, 2014.</li>
<li>Kathryn Roeder, “<a href="https://doi.org/10.2307/2289993">Density estimation with confidence sets exemplified by
superclusters and voids in galaxies</a>,” Journal of the American
Statistical Association, 1990.</li>
<li>Rick Durrett, <em><a href="https://services.math.duke.edu/~rtd/PTE/pte.html">Probability: Theory and Examples</a></em>, Cambridge
University Press, 2010.</li>
</ul>
<h1>Heteroscedastic Gaussian process regression</h1>
<p>Ivan Ukhov, 2020-06-22, https://blog.ivanukhov.com/2020/06/22/gaussian-process</p>
<p>Gaussian process regression is a nonparametric Bayesian technique for modeling
relationships between variables of interest. The vast flexibility and rigorous
mathematical foundation of this approach make it the default choice in many
problems involving small- to medium-sized data sets. In this article, we
illustrate how Gaussian process regression can be utilized in practice. To make
the case more compelling, we consider a setting where linear regression would be
inadequate. The focus will be <em>not</em> on getting the job done as fast as possible
but on learning the technique and understanding the choices being made.</p>
<h1 id="data">Data</h1>
<p>Consider the following example taken from <a href="http://www.stat.tamu.edu/~carroll/semiregbook"><em>Semiparametric
Regression</em></a> by Ruppert <em>et al.</em>:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/data-1.svg" alt="" /></p>
<p>The figure shows 221 observations collected in a <a href="https://en.wikipedia.org/wiki/Lidar">light detection and
ranging</a> experiment. Each observation can be interpreted as the sum of
the true underlying response at the corresponding distance and random noise. It
can be clearly seen that the variance of the noise varies with the distance: the
spread is substantially larger toward the right-hand side. This phenomenon is
known as heteroscedasticity. Homoscedasticity (the absence of
heteroscedasticity) is one of the key assumptions of linear regression. Applying
linear regression to the above problem would yield suboptimal results. The
estimates of the regression coefficients would still be unbiased; however, the
standard errors of the coefficients would be incorrect and hence misleading. A
different modeling technique is needed in this case.</p>
<p>The above data set will be our running example. More formally and slightly more
generally, we assume that there is a data set of \(m\) observations:</p>
\[\left\{
(\mathbf{x}_i, y_i): \,
\mathbf{x}_i \in \mathbb{R}^d; \,
y_i \in \mathbb{R}; \,
i = 1, \dots, m
\right\}\]
<p>where the independent variable, \(\mathbf{x}\), is \(d\)-dimensional, and the
dependent variable, \(y\), is scalar. In the running example, \(d\) is 1, and
\(m\) is 221. It is time for modeling.</p>
<h1 id="model">Model</h1>
<p>To begin with, consider the following model with additive noise:</p>
\[y_i = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m. \tag{1}\]
<p>In the above, \(f: \mathbb{R}^d \to \mathbb{R}\) represents the true but unknown
underlying function, and \(\epsilon_i\) represents the perturbation of the
\(i\)th observation by random noise. In the classical linear-regression setting,
the unknown function is modeled as a linear combination of (arbitrary
transformations of) the \(d\) covariates. Instead of assuming any particular
functional form, we put a Gaussian process prior on the function:</p>
\[f(\mathbf{x}) \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right).\]
<p>The above notation means that, before observing any data, the function is a draw
from a Gaussian process with zero mean and a covariance function \(k\). The
covariance function dictates the degree of correlation between two arbitrary
locations \(\mathbf{x}\) and \(\mathbf{x}'\) in \(\mathbb{R}^d\). For instance,
a frequent choice for \(k\) is the squared-exponential covariance function:</p>
\[k(\mathbf{x}, \mathbf{x}')
= \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right)\]
<p>where \(\|\cdot\|_2\) stands for the Euclidean norm, \(\sigma_\text{process}^2\)
is the variance (to see this, substitute \(\mathbf{x}\) for \(\mathbf{x}'\)),
and \(\ell_\text{process}\) is known as the length scale. While the variance
parameter is intuitive, the length-scale one requires an illustration. The
parameter controls the speed with which the correlation fades with the distance.
The following figure shows 10 random draws for \(\ell_\text{process} = 0.1\):</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-short-1.svg" alt="" /></p>
<p>With \(\ell_\text{process} = 0.5\), the behavior changes to the following:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-long-1.svg" alt="" /></p>
<p>It can be seen that it takes a greater distance for a function with a larger
length scale (<em>bottom</em>) to change to the same extent compared to a function with a
smaller length scale (<em>top</em>).</p>
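<p>The effect of the length scale can also be checked numerically. The following is a minimal sketch in plain Python, not part of the original analysis; the function names <code>squared_exponential</code> and <code>correlation</code> are chosen here for illustration only. It evaluates the squared-exponential covariance function and prints the correlation between two one-dimensional locations at several distances for both length scales.</p>

```python
import math


def squared_exponential(x, x_prime, sigma_process=1.0, ell_process=0.1):
    """Squared-exponential covariance between two points given as sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return sigma_process ** 2 * math.exp(-squared_distance / (2 * ell_process ** 2))


def correlation(distance, ell_process):
    """Correlation between two one-dimensional locations a given distance apart."""
    # With unit variance, the covariance coincides with the correlation.
    return squared_exponential([0.0], [distance], 1.0, ell_process)


for ell in (0.1, 0.5):
    values = [round(correlation(distance, ell), 3) for distance in (0.0, 0.1, 0.3, 0.5)]
    print("length scale", ell, "correlations", values)
```

<p>At a distance equal to the length scale, the correlation is \(\exp(-1/2) \approx 0.61\) regardless of the variance, which gives a convenient rule of thumb for reading the parameter.</p>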
<p>Let us now return to Equation (1) and discuss the error terms, \(\epsilon_i\).
In linear regression, they are modeled as independent identically distributed
Gaussian random variables:</p>
\[\epsilon_i \sim \text{Gaussian}\left( 0, \sigma_\text{noise}^2 \right),
\text{ for } i = 1, \dots, m. \tag{2}\]
<p>This is also the approach one can take with Gaussian process regression;
however, one does not have to. There are reasons to believe the problem at hand
is heteroscedastic, and it should be reflected in the model. To this end, the
magnitude of the noise is allowed to vary with the covariates:</p>
\[\epsilon_i | \mathbf{x}_i \sim \text{Gaussian}\left(0, \sigma^2_{\text{noise}, i}\right),
\text{ for } i = 1, \dots, m. \tag{3}\]
<p>The error terms are still independent (given the covariates) but not identically
distributed. At this point, one has to make a choice about the dependence of
\(\sigma_{\text{noise}, i}\) on \(\mathbf{x}_i\). This dependence could be
modeled with another Gaussian process with an appropriate link function to
ensure \(\sigma_{\text{noise}, i}\) is nonnegative. Another reasonable choice is
a generalized linear model, which is what we shall use:</p>
\[\ln \sigma^2_{\text{noise}, i} = \alpha_\text{noise} + \boldsymbol{\beta}^\intercal_\text{noise} \, \mathbf{x}_i,
\text{ for } i = 1, \dots, m, \tag{4}\]
<p>where \(\alpha_\text{noise}\) is the intercept of the regression line, and
\(\boldsymbol{\beta}_\text{noise} \in \mathbb{R}^d\) contains the slopes.</p>
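<p>To make the role of the logarithmic link in Equation (4) concrete, here is a small sketch in plain Python. The function names and the parameter values used in the loop are hypothetical and serve only as an illustration; they are not estimates from the data.</p>

```python
import math


def noise_variance(x, alpha_noise, beta_noise):
    """Variance of the noise at location x according to the log-linear model."""
    linear_predictor = alpha_noise + sum(b * v for b, v in zip(beta_noise, x))
    # Exponentiation maps any real value to a positive variance.
    return math.exp(linear_predictor)


def noise_standard_deviation(x, alpha_noise, beta_noise):
    """Standard deviation of the noise at location x."""
    return math.sqrt(noise_variance(x, alpha_noise, beta_noise))


# Hypothetical values: a strongly negative intercept and a positive slope.
for location in ([0.0], [0.5], [1.0]):
    print(location, noise_standard_deviation(location, -4.0, [2.0]))
```

<p>Whatever real values the intercept and the slopes take, the resulting variance is positive, and a unit increase in a covariate multiplies the variance by the exponential of the corresponding slope.</p>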
<p>Thus far, a model for the unknown function \(f\) and a model for the noise have
been prescribed. In total, there are \(d + 3\) parameters:
\(\sigma_\text{process}\), \(\ell_\text{process}\), \(\alpha_\text{noise}\), and
\(\beta_{\text{noise}, i}\) for \(i = 1, \dots, d\). The first two are
positive, and the rest are arbitrary. The final piece is prior distributions for
these parameters.</p>
<p>The variance of the covariance function, \(\sigma^2_\text{process}\),
corresponds to the amount of variance in the data that is explained by the
Gaussian process. It poses no particular problem and can be tackled with a
half-Gaussian or a half-Student’s t distribution:</p>
\[\sigma_\text{process} \sim \text{Half-Gaussian}\left( 0, 1 \right).\]
<p>The notation means that the standard Gaussian distribution is truncated at zero
and renormalized. The nontrivial mass around zero implied by the prior is
considered to be beneficial in this case.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>A prior for the length scale of the covariance function,
\(\ell_\text{process}\), should be chosen with care. Small values, especially
those below the resolution of the data, give the Gaussian process extreme
flexibility and easily lead to overfitting. Moreover, there are numerical
ramifications of the length scale approaching zero as well: the quality of
Hamiltonian Monte Carlo sampling degrades.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup> The bottom line is that a prior
penalizing values close to zero is needed. A reasonable choice is an inverse
gamma distribution:</p>
\[\ell_\text{process} \sim \text{Inverse Gamma}\left( 1, 1 \right).\]
<p>To understand the implications, let us perform a prior predictive check for this
component in isolation:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-length-scale-1.svg" alt="" /></p>
<p>It can be seen that the density is very low in the region close to zero, while
being rather permissive to the right of that region, especially considering the
scale of the distance in the data; recall the very first figure. Consequently,
the choice is adequate.</p>
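<p>The amount of prior mass near zero can also be quantified directly. For an inverse gamma distribution with unit shape and unit scale, the cumulative distribution function has the closed form \(P(\ell \leq x) = \exp(-1 / x)\). The sketch below, with function names chosen here for illustration, evaluates it at a few thresholds.</p>

```python
import math


def inverse_gamma_density(x, shape=1.0, scale=1.0):
    """Density of the inverse gamma distribution (shape and scale parameterization)."""
    return scale ** shape / math.gamma(shape) * x ** (-shape - 1) * math.exp(-scale / x)


def inverse_gamma_11_cdf(x):
    """P(ell <= x) for shape = scale = 1, which reduces to exp(-1 / x)."""
    return math.exp(-1.0 / x)


for threshold in (0.01, 0.05, 0.1, 0.5, 1.0):
    print(threshold, inverse_gamma_11_cdf(threshold))
```

<p>In particular, the probability of a length scale below 0.1 is \(\exp(-10)\), that is, less than one in ten thousand, while the median is \(1 / \ln 2 \approx 1.44\).</p>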
<p>The choice of priors for the parameters of the noise is complicated by the
nonlinear link function; see Equation (4). What is important to realize is that
small amounts of noise correspond to negative values in the linear space, which
is probably what one should be expecting given the scale of the response.
Therefore, the priors should allow for large negative values. Let us make an
educated assumption and perform a prior predictive check to understand the
consequences. Consider the following:</p>
\[\begin{align}
\alpha_\text{noise} & \sim \text{Gaussian}\left( -1, 1 \right) \text{ and} \\
\beta_{\text{noise}, i} & \sim \text{Gaussian}\left( 0, 1 \right),
\text{ for } i = 1, \dots, d.\\
\end{align}\]
<p>The density of \(\sigma_\text{noise}\) without considering the regression slopes
is depicted below (note the logarithmic scale on the horizontal axis):</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/prior-noise-sigma-1.svg" alt="" /></p>
<p>The variability in the intercept, \(\alpha_\text{noise}\), allows the standard
deviation, \(\sigma_\text{noise}\), to comfortably vary from small to large
values, keeping in mind the scale of the response. Here are two draws from the
prior distribution of the noise, including Equations (3) and (4):</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/prior-noise-1.svg" alt="" /></p>
<p>The large ones are perhaps unrealistic and could be addressed by further
shifting the distribution of the intercept. However, they should not cause
problems for the inference.</p>
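<p>The range of noise levels implied by the prior on the intercept can be computed without sampling. With the slopes ignored, \(\sigma_\text{noise} = \exp(\alpha_\text{noise} / 2)\), which is a monotone transformation, so quantiles of the intercept map directly to quantiles of the standard deviation. The following sketch relies only on the Python standard library; the name <code>sigma_noise_quantile</code> is introduced here for illustration.</p>

```python
import math
from statistics import NormalDist

# Prior on the intercept of the noise model: Gaussian(-1, 1).
alpha_prior = NormalDist(mu=-1.0, sigma=1.0)


def sigma_noise_quantile(probability):
    """Quantile of the noise standard deviation with the slopes set to zero."""
    # ln(sigma^2) = alpha implies sigma = exp(alpha / 2).
    return math.exp(0.5 * alpha_prior.inv_cdf(probability))


for probability in (0.025, 0.5, 0.975):
    print(probability, sigma_noise_quantile(probability))
```

<p>The median is \(\exp(-1/2) \approx 0.61\), and the central 95% of the prior mass spans roughly one order of magnitude around it.</p>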
<p>Putting everything together, the final model is as follows:</p>
\[\begin{align}
y_i
& = f(\mathbf{x}_i) + \epsilon_i,
\text{ for } i = 1, \dots, m; \\
f(\mathbf{x})
& \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right); \\
k(\mathbf{x}, \mathbf{x}')
& = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right); \\
\epsilon_i | \mathbf{x}_i
& \sim \text{Gaussian}\left( 0, \sigma^2_{\text{noise}, i} \right),
\text{ for } i = 1, \dots, m; \\
\ln \sigma^2_{\text{noise}, i}
& = \alpha_\text{noise} + \boldsymbol{\beta}_\text{noise}^\intercal \, \mathbf{x}_i,
\text{ for } i = 1, \dots, m; \\
\sigma_\text{process}
& \sim \text{Half-Gaussian}\left( 0, 1 \right); \\
\ell_\text{process}
& \sim \text{Inverse Gamma}\left( 1, 1 \right); \\
\alpha_\text{noise}
& \sim \text{Gaussian}\left( -1, 1 \right); \text{ and} \\
\beta_{\text{noise}, i}
& \sim \text{Gaussian}\left( 0, 1 \right),
\text{ for } i = 1, \dots, d.\\
\end{align}\]
<p>This concludes the modeling part. The remaining two steps are to infer the
parameters and to make predictions using the posterior predictive distribution.</p>
<h1 id="inference">Inference</h1>
<p>The model is analytically intractable; one has to resort to sampling or
variational methods for inferring the parameters. We shall use Hamiltonian
Markov chain Monte Carlo sampling via <a href="https://mc-stan.org/">Stan</a>. The model can be seen in the
following listing, where the notation closely follows the one used throughout
the article:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="kt">int</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">1</span><span class="o">></span> <span class="n">d</span><span class="p">;</span>
<span class="kt">int</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">1</span><span class="o">></span> <span class="n">m</span><span class="p">;</span>
<span class="n">vector</span><span class="p">[</span><span class="n">d</span><span class="p">]</span> <span class="n">x</span><span class="p">[</span><span class="n">m</span><span class="p">];</span>
<span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">y</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">transformed</span> <span class="n">data</span> <span class="p">{</span>
<span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">mu</span> <span class="o">=</span> <span class="n">rep_vector</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">d</span><span class="p">]</span> <span class="n">X</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span>
<span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="err">'</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">parameters</span> <span class="p">{</span>
<span class="n">real</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span> <span class="n">sigma_process</span><span class="p">;</span>
<span class="n">real</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span> <span class="n">ell_process</span><span class="p">;</span>
<span class="n">real</span> <span class="n">alpha_noise</span><span class="p">;</span>
<span class="n">vector</span><span class="p">[</span><span class="n">d</span><span class="p">]</span> <span class="n">beta_noise</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">model</span> <span class="p">{</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span> <span class="n">K</span> <span class="o">=</span> <span class="n">cov_exp_quad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">sigma_process</span><span class="p">,</span> <span class="n">ell_process</span><span class="p">);</span>
<span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">sigma_noise_squared</span> <span class="o">=</span> <span class="n">exp</span><span class="p">(</span><span class="n">alpha_noise</span> <span class="o">+</span> <span class="n">X</span> <span class="o">*</span> <span class="n">beta_noise</span><span class="p">);</span>
<span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span> <span class="n">L</span> <span class="o">=</span> <span class="n">cholesky_decompose</span><span class="p">(</span><span class="n">add_diag</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">sigma_noise_squared</span><span class="p">));</span>
<span class="n">y</span> <span class="o">~</span> <span class="n">multi_normal_cholesky</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">L</span><span class="p">);</span>
<span class="n">sigma_process</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">ell_process</span> <span class="o">~</span> <span class="n">inv_gamma</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">alpha_noise</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">beta_noise</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the <code class="language-plaintext highlighter-rouge">parameters</code> block, one can find the \(d + 3\) parameters identified
earlier. With regard to the <code class="language-plaintext highlighter-rouge">model</code> block, it is worth noting that there is no
Gaussian process distribution in Stan. Instead, a multivariate Gaussian
distribution is utilized to model \(f\) at \(\mathbf{X} = (\mathbf{x}_i)_{i =
1}^m \in \mathbb{R}^{m \times d}\) and eventually \(\mathbf{y} = (y_i)_{i =
1}^m\), which is for a good reason. Even though a Gaussian process is an
infinite-dimensional object, in practice, one always works with finite amounts
of data. For instance, in the running example, there are only 221 data points.
By definition, a Gaussian process is a stochastic process with the condition
that any finite collection of points from this process has a multivariate
Gaussian distribution. This fact combined with the conditional independence of
the process and the noise given the covariates yields the following and explains
the usage of a multivariate Gaussian distribution:</p>
\[\mathbf{y} | \mathbf{X}, \sigma_\text{process}, \ell_\text{process}, \alpha_\text{noise}, \boldsymbol{\beta}_\text{noise}
\sim \text{Multivariate Gaussian}\left( \mathbf{0}, \mathbf{K} + \mathbf{D} \right)\]
<p>where \(\mathbf{K} \in \mathbb{R}^{m \times m}\) is a covariance matrix computed
by evaluating the covariance function \(k\) at all pairs of locations in the
observed data, and \(\mathbf{D} = \text{diag}(\sigma^2_{\text{noise}, i})_{i =
1}^m \in \mathbb{R}^{m \times m}\) is a diagonal matrix of the variances of the
noise at the corresponding locations.</p>
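<p>For readers who prefer to see the likelihood spelled out, below is a plain-Python sketch of the log-density of \(\mathbf{y}\) under the multivariate Gaussian distribution with covariance \(\mathbf{K} + \mathbf{D}\). It mirrors the structure of the <code>model</code> block but is not the code used for the inference; the helper names are chosen here for illustration, and the implementation favors readability over speed.</p>

```python
import math


def kernel(x, x_prime, sigma_process, ell_process):
    """Squared-exponential covariance function."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return sigma_process ** 2 * math.exp(-squared_distance / (2 * ell_process ** 2))


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    m = len(A)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_marginal_likelihood(X, y, sigma_process, ell_process, alpha_noise, beta_noise):
    """Log-density of y under Multivariate Gaussian(0, K + D)."""
    m = len(X)
    # K: the covariance function evaluated at all pairs of locations.
    C = [[kernel(X[i], X[j], sigma_process, ell_process) for j in range(m)] for i in range(m)]
    # D: the location-dependent noise variance added to the diagonal.
    for i in range(m):
        C[i][i] += math.exp(alpha_noise + sum(b * v for b, v in zip(beta_noise, X[i])))
    L = cholesky(C)
    # Forward substitution solves L z = y, so that z'z = y' (K + D)^{-1} y.
    z = []
    for i in range(m):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_determinant = 2 * sum(math.log(L[i][i]) for i in range(m))
    return -0.5 * (sum(v * v for v in z) + log_determinant + m * math.log(2 * math.pi))
```

<p>The Cholesky factor serves two purposes here: it yields the quadratic form through a triangular solve and the log-determinant through the sum of the logarithms of its diagonal.</p>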
<p>After running the inference, the following posterior distributions are obtained:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/posterior-parameters-1.svg" alt="" /></p>
<p>The intervals at the bottom of the densities are 66% and 95% equal-tailed
probability intervals, and the dots indicate the medians. Let us also take a
look at the 95% probability interval for the noise with respect to the distance:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-noise-1.svg" alt="" /></p>
<p>As expected, the variance of the noise increases with the distance.</p>
<h1 id="prediction">Prediction</h1>
<p>Suppose there are \(n\) locations \(\mathbf{X}_\text{new} =
(\mathbf{x}_{\text{new}, i})_{i = 1}^n \in \mathbb{R}^{n \times d}\) where one
wishes to make predictions. Let \(\mathbf{f}_\text{new} \in \mathbb{R}^n\) be
the values of \(f\) at those locations. Assuming all the data and parameters
given, the joint distribution of \(\mathbf{y}\) and \(\mathbf{f}_\text{new}\) is
as follows:</p>
\[\left[
\begin{matrix}
\mathbf{y} \\
\mathbf{f}_\text{new}
\end{matrix}
\right]
\sim \text{Multivariate Gaussian}\left(
\mathbf{0},
\left[
\begin{matrix}
\mathbf{K} + \mathbf{D} & k(\mathbf{X}, \mathbf{X}_\text{new}) \\
k(\mathbf{X}_\text{new}, \mathbf{X}) & k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new})
\end{matrix}
\right]
\right)\]
<p>where, with a slight abuse of notation, \(k(\cdot, \cdot)\) stands for a
covariance matrix computed by evaluating the covariance function \(k\) at the
specified locations, which is analogous to \(\mathbf{K}\). It is well known (see
<a href="http://www.gaussianprocess.org/gpml">Rasmussen et al. 2006</a>, for instance) that the conditional
distribution of \(\mathbf{f}_\text{new}\) given \(\mathbf{y}\) is a multivariate Gaussian with the
following mean vector and covariance matrix, respectively:</p>
\[\begin{align}
E(\mathbf{f}_\text{new})
& = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\
\text{cov}(\mathbf{f}_\text{new})
& = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new})
- k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}).
\end{align}\]
<p>The final component is the noise, as per Equation (1). The noise does not change
the mean of the multivariate Gaussian distribution but does magnify the
variance:</p>
\[\begin{align}
E(\mathbf{y}_\text{new})
& = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\
\text{cov}(\mathbf{y}_\text{new})
& = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new})
- k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new})
+ \text{diag}(\sigma^2_\text{noise}(\mathbf{X}_\text{new}))
\end{align}\]
<p>where \(\text{diag}(\sigma^2_\text{noise}(\cdot))\) stands for a diagonal matrix
composed of the noise variance evaluated at the specified locations, which is
analogous to \(\mathbf{D}\).</p>
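<p>The two expressions for the response can be sketched in plain Python for one fixed set of parameter values. This is an illustrative implementation under the notation above, not the code behind the figures; the helper names <code>kernel</code>, <code>solve</code>, and <code>predict</code> are introduced here, and a simple Gauss-Jordan elimination stands in for a proper linear-algebra library.</p>

```python
import math


def kernel(x, x_prime, sigma_process, ell_process):
    """Squared-exponential covariance function."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return sigma_process ** 2 * math.exp(-squared_distance / (2 * ell_process ** 2))


def solve(A, B):
    """Solve A Z = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def predict(X, y, X_new, sigma_process, ell_process, alpha_noise, beta_noise):
    """Mean vector and covariance matrix of y_new for one draw of the parameters."""
    def k(a, b):
        return kernel(a, b, sigma_process, ell_process)

    def noise_variance(x):
        return math.exp(alpha_noise + sum(b * v for b, v in zip(beta_noise, x)))

    m, n = len(X), len(X_new)
    # K + D at the observed locations.
    C = [[k(X[i], X[j]) + (noise_variance(X[i]) if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    # k(X, X_new), an m-by-n matrix.
    K_cross = [[k(X[i], X_new[j]) for j in range(n)] for i in range(m)]
    # One solve yields (K + D)^{-1} y in column 0 and (K + D)^{-1} k(X, X_new) after it.
    Z = solve(C, [[y[i]] + K_cross[i] for i in range(m)])
    mean = [sum(K_cross[i][a] * Z[i][0] for i in range(m)) for a in range(n)]
    covariance = [[k(X_new[a], X_new[b])
                   - sum(K_cross[i][a] * Z[i][1 + b] for i in range(m))
                   + (noise_variance(X_new[a]) if a == b else 0.0)
                   for b in range(n)] for a in range(n)]
    return mean, covariance
```

<p>As a sanity check, consider a single observation \(y = 1\) at \(x = 0\) with unit process variance and unit noise variance: predicting at the same location gives the mean \(1 / (1 + 1) = 0.5\) and the variance \(1 - 0.5 + 1 = 1.5\).</p>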
<p>Given a set of draws from the joint posterior distribution of the parameters and
the last two expressions, it is now straightforward to draw samples from the
posterior predictive distribution of the response: for each draw of the
parameters, one has to evaluate the mean vector and the covariance matrix and
sample the corresponding multivariate Gaussian distribution. The result is given
in the following figure:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-heteroscedastic-1.svg" alt="" /></p>
<p>The graph shows the mean value of the posterior predictive distribution given by
the black line along with a 95% equal-tailed probability band about the mean. It
can be seen that the uncertainty in the predictions is adequately captured along
the entire support. Naturally, the full predictive posterior distribution is
available at any location of interest.</p>
<p>Before we conclude, let us illustrate what would happen if the data were modeled
as having homogeneous noise. To this end, the variance of the noise is assumed
to be independent of the covariates, as in Equation (2). After repeating the
inference and prediction processes, the following is obtained:</p>
<p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-homoscedastic-1.svg" alt="" /></p>
<p>The inference is inadequate, which can be seen by the probability band: the
variance is largely overestimated on the left-hand side and underestimated on
the right-hand side. This justifies well the choice of heteroscedastic
regression presented earlier.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this article, it has been illustrated how a functional relationship can be
modeled using a Gaussian process as a prior. Particular attention has been
dedicated to adequately capturing error terms in the presence of
heteroscedasticity. In addition, a practical implementation has been discussed,
and the experimental results have demonstrated the appropriateness of this
approach.</p>
<p>For the curious reader, the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2020-06-22-gaussian-process.Rmd">notebook</a> along with a number
of auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2020-06-22-gaussian-process">scripts</a>, such as the definition of the model in Stan, can be
found on GitHub.</p>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>I would like to thank <a href="https://www.mattiasvillani.com/">Mattias Villani</a> for the insightful and informative
graduate course in statistics titled “<a href="https://github.com/mattiasvillani/AdvBayesLearnCourse">Advanced Bayesian learning</a>,” which was the inspiration behind writing this article.</p>
<h1 id="references">References</h1>
<ul>
<li>Carl Rasmussen <em>et al.</em>, <a href="http://www.gaussianprocess.org/gpml"><em>Gaussian Processes for Machine
Learning</em></a>, the MIT Press, 2006.</li>
<li>David Ruppert <em>et al.</em>, <a href="http://www.stat.tamu.edu/~carroll/semiregbook"><em>Semiparametric Regression</em></a>, Cambridge
University Press, 2003.</li>
</ul>
<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>“<a href="https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html#priors-for-marginal-standard-deviation">Priors for marginal standard deviation</a>,”
Stan User’s Guide, 2020. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>“<a href="https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html#priors-for-length-scale">Priors for length-scale</a>,” Stan User’s Guide,
2020. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p><em>What is the easiest way to compare two data sets?</em> Ivan Ukhov, 2020-04-10T06:00:00+00:00, https://blog.ivanukhov.com/2020/04/10/comparison</p>
<p>One has probably come across this problem numerous times. There are two versions
of a tabular data set with a lot of columns of different types, and one wants to
quickly identify any differences between the two. For example, the pipeline
providing data to a predictive model might have been updated, and the goal is to
understand if there have been any side effects of this update for the training
data.</p>
<p>One solution is to start to iterate over the columns of the two tables,
computing five-number summaries and plotting histograms or identifying distinct
values and plotting bar charts, depending on the column’s type. However, this
can quickly get out of hand and evolve into an endeavor for the rest of the day.</p>
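<p>For a sense of what the manual route involves, here is a sketch for a single numeric column using only the Python standard library. The function names are hypothetical, and the columns are assumed to be plain lists of numbers; every additional column and column type would need similar handling, which is where the effort accumulates.</p>

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, and maximum."""
    lower, median, upper = quantiles(values, n=4, method="inclusive")
    return (min(values), lower, median, upper, max(values))


def compare_column(values_1, values_2):
    """Element-wise differences between the summaries of two versions of a column."""
    summary_1 = five_number_summary(values_1)
    summary_2 = five_number_summary(values_2)
    return [b - a for a, b in zip(summary_1, summary_2)]


print(compare_column([1, 2, 3, 4, 5], [1, 2, 3, 4, 50]))
```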
<p>An alternative is to leverage the amazing tools that already exist in the data
community.</p>
<h1 id="solution">Solution</h1>
<p>The key takeaway is the following three lines of code, excluding the import:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow_data_validation</span> <span class="k">as</span> <span class="n">dv</span>
<span class="n">statistics_1</span> <span class="o">=</span> <span class="n">dv</span><span class="p">.</span><span class="n">generate_statistics_from_dataframe</span><span class="p">(</span><span class="n">data_1</span><span class="p">)</span>
<span class="n">statistics_2</span> <span class="o">=</span> <span class="n">dv</span><span class="p">.</span><span class="n">generate_statistics_from_dataframe</span><span class="p">(</span><span class="n">data_2</span><span class="p">)</span>
<span class="n">dv</span><span class="p">.</span><span class="n">visualize_statistics</span><span class="p">(</span><span class="n">lhs_statistics</span><span class="o">=</span><span class="n">statistics_1</span><span class="p">,</span>
<span class="n">rhs_statistics</span><span class="o">=</span><span class="n">statistics_2</span><span class="p">)</span>
</code></pre></div></div>
<p>This is all it takes to get a versatile dashboard embedded right into a cell of
a Jupyter notebook. The visualization itself is based on <a href="https://pair-code.github.io/facets">Facets</a>, and it is
conveniently provided by <a href="https://www.tensorflow.org/tfx/data_validation/get_started">TensorFlow Data Validation</a> (which does not have much
to do with TensorFlow and can be used stand-alone).</p>
<p>It is pointless to try to describe in words what the dashboard can do; instead,
here is a demonstration taken from <a href="https://pair-code.github.io/facets">Facets</a> where the tool is applied to the <a href="http://archive.ics.uci.edu/ml/datasets/Census+Income">UCI
Census Income</a> data set:</p>
<div id="facets-overview-container"></div>
<p>Go ahead and try out all the different controls!</p>
<p>In this case, it is helpful to toggle the “percentages” checkbox, since the data
sets are of different sizes. Then it becomes apparent that the two partitions
are fairly balanced. The only problem is that <code class="language-plaintext highlighter-rouge">Target</code>, which represents income,
happened to be encoded incorrectly in the partition for testing.</p>
<p>Lastly, an example in a Jupyter notebook can be found on <a href="https://github.com/chain-rule/example-comparison/blob/master/census.ipynb">GitHub</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>It can be difficult to navigate and particularly challenging to compare wide
data sets. A lot of effort can be put into this exercise. However, the landscape
of open-source tools has a lot to offer too. Facets is one such example. The
library and its straightforward availability via TensorFlow Data Validation are
arguably less known. This short note can hopefully rectify this to some extent.</p>
<p><em>Bayesian inference of the net promoter score via multilevel regression with poststratification</em> Ivan Ukhov, 2020-02-03T07:00:00+00:00, https://blog.ivanukhov.com/2020/02/03/net-promoter</p>
<p>Customer surveys are naturally prone to biases. One prominent example is
participation bias, which arises when individuals decide not to respond to the
survey, and this pattern is not random. For instance, new customers might reply
less eagerly than those who are senior. This renders the obtained responses
unrepresentative of the target population. In this article, we tackle
participation bias for the case of the net promoter survey by means of
multilevel regression and poststratification.</p>
<p>More specifically, the discussion here is a sequel to “<a href="/2019/08/19/net-promoter.html">A Bayesian approach to
the inference of the net promoter score</a>,” where we built a
hierarchical model for inferring the net promoter score for an arbitrary
segmentation of a customer base. The reader is encouraged to skim over that
article to recall the mechanics of the score and the structure of the model that
was constructed. In that article, there was an assumption made that the sample
was representative of the population, which, as mentioned earlier, is often not
the case. In what follows, we mitigate this problem using a technique called
poststratification. The technique works by matching proportions observed in the
sample with those observed in the population with respect to several dimensions,
such as age, country, and gender. However, in order to be able to poststratify,
the model has to have access to all these dimensions at once, which the model
built earlier is not suited for. To enable this, we switch gears to multilevel
multinomial regression.</p>
<h1 id="problem">Problem</h1>
<p>Suppose the survey is to measure the net promoter score for a population that
consists of \(N\) customers. The score is to be reported with respect to
individual values of \(M\) grouping variables where variable \(i\) has \(m_i\)
possible values, for \(i = 1, \dots, M\). For instance, it might be important to
know the score for different age groups, in which case the variable would be the
customer’s age with values such as 18–25, 26–35, and so on. This implies that,
in total, \(\sum_i m_i\) scores have to be estimated.</p>
<p>Depending on the size of the business, one might or might not try to reach out
to all customers, except for those who have opted out of communications.
Regardless of the decision, the resulting sample size, which is denoted by
\(n\), is likely to be substantially smaller than \(N\), as the response rate is
typically low. Therefore, there is uncertainty about the opinion of those who
abstained or were not targeted.</p>
<p>More importantly, a random sample is desired; however, certain subpopulations of
customers might end up being significantly overrepresented due to participation
bias, driving the score astray. Let us quantify this concern. We begin by taking
the Cartesian product of the aforementioned \(M\) variables. This results in \(K
= \prod_i m_i\) distinct combinations of the variables’ values, which are
referred to as cells in what follows. For each cell, the numbers of detractors,
neutrals, and promoters observed in the sample are computed and denoted by
\(d_i\), \(u_i\), and \(p_i\), respectively. The number of respondents in cell
\(i\) is then</p>
\[n_i = d_i + u_i + p_i \tag{1}\]
<p>for \(i = 1, \dots, K\). For convenience, all counts are arranged in the
following matrix:</p>
\[y = \left(
\begin{matrix}
y_1 \\
\vdots \\
y_i \\
\vdots \\
y_K
\end{matrix}
\right)
= \left(
\begin{matrix}
d_1 & u_1 & p_1 \\
\vdots & \vdots & \vdots \\
d_i & u_i & p_i \\
\vdots & \vdots & \vdots \\
d_K & u_K & p_K
\end{matrix}
\right). \tag{2}\]
<p>Given \(y\), the observed net promoter score for value \(j\) of variable \(i\)
can be evaluated as follows:</p>
\[s^i_j = 100 \times \frac{\sum_{k \in I^i_j}(p_k - d_k)}{\sum_{k \in I^i_j} n_k} \tag{3}\]
<p>where \(I^i_j\) is an index set traversing cells with variable \(i\) set to
value \(j\), which has the effect of marginalizing out other variables
conditioned on the chosen value of variable \(i\), that is, on value \(j\).</p>
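<p>Equation (3) can be sketched in a few lines of Python. The representation of cells as a dictionary from tuples of values to counts, the function name, and the counts themselves are assumptions made for this illustration only.</p>

```python
def observed_score(cells, variable, value):
    """Observed net promoter score for one value of one grouping variable.

    cells maps a tuple of values (one per grouping variable) to a tuple of
    counts (detractors, neutrals, promoters) observed in that cell.
    """
    # The index set: all cells where the chosen variable takes the chosen value.
    selected = [counts for key, counts in cells.items() if key[variable] == value]
    promoters_minus_detractors = sum(p - d for d, u, p in selected)
    respondents = sum(d + u + p for d, u, p in selected)
    return 100 * promoters_minus_detractors / respondents


# Hypothetical counts for two grouping variables: age group and country.
cells = {
    ("18-25", "SE"): (2, 3, 5),
    ("18-25", "NO"): (1, 1, 2),
    ("26-35", "SE"): (4, 4, 2),
    ("26-35", "NO"): (3, 2, 5),
}
print(observed_score(cells, 0, "18-25"))
print(observed_score(cells, 1, "SE"))
```

<p>For the second call, the promoters outnumber the detractors by \((5 - 2) + (2 - 4) = 1\) among 20 respondents, giving a score of 5.</p>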
<p>We can now compare \(n_i\), computed according to Equation (1), with its
counterpart in the population (the total number of customers who belong to cell
\(i\)), which is denoted by \(N_i\), taking into consideration the sample size
\(n\) and the population size \(N\). Problems occur when the ratios within one
or more of the following tuples largely disagree:</p>
\[\left(\frac{n_i}{n}, \frac{N_i}{N}\right) \tag{4}\]
<p>for \(i = 1, \dots, K\). When this happens, the scores given by Equation (3) or
any analyses oblivious of this disagreement cannot be trusted, since they
misrepresent the population. (It should be noted, however, that equality within
each tuple does not guarantee the absence of participation bias, since there
might be other, potentially unobserved, dimensions along which there are
deviations.)</p>
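<p>The comparison in Equation (4) can be sketched as follows. The cell sizes are
hypothetical, and how far a ratio may be from one before it is deemed
problematic is a judgment call rather than something prescribed here.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical cell sizes in the sample (n_i) and in the population (N_i)
sample_sizes &lt;- c(a = 60, b = 40, c = 40, d = 60)
population_sizes &lt;- c(a = 3000, b = 3000, c = 2000, d = 2000)

# The two entries of each tuple in Equation (4)
shares &lt;- data.frame(
  sample = sample_sizes / sum(sample_sizes),
  population = population_sizes / sum(population_sizes)
)

# Values above one indicate overrepresentation; values below one indicate
# underrepresentation
shares$ratio &lt;- shares$sample / shares$population
</code></pre></div></div>
<p>In this toy example, cell <code class="language-plaintext highlighter-rouge">b</code> is underrepresented with a ratio of about 0.67,
and cell <code class="language-plaintext highlighter-rouge">d</code> is overrepresented with a ratio of 1.5.</p>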
<p>The survey has been conducted, and there are deviations. What do we do with all
these responses that have come in? Should we discard them and run a new survey,
hoping that, this time, it would be different?</p>
<h1 id="solution">Solution</h1>
<p>The fact that the sample covers only a fraction of the population is, of course,
no news, and the solution is standard: one has to infer the net promoter score
for the population given the sample and domain knowledge. This is what was done
in the <a href="/2019/08/19/net-promoter.html">previous article</a> for one grouping variable. However, due to
participation bias, additional measures are needed as follows.</p>
<p>Taking inspiration from political science, we proceed in two steps.</p>
<ol>
<li>
<p>Using an adequate model, \(K = \prod_i m_i\) net promoter scores are
inferred—one for each cell, that is, for each combination of the values of
the grouping variables.</p>
</li>
<li>
<p>The \(\prod_i m_i\) “cell-scores” are combined to produce \(\sum_i m_i\)
“value-scores”—one for each value of each variable. This is done in such a
way that the contribution of each cell to the score is equal to the relative
size of that cell in the population given by Equation (4).</p>
</li>
</ol>
<p>The two steps are discussed in the following two subsections.</p>
<h2 id="modeling">Modeling</h2>
<p>Step 1 can, in principle, be undertaken by any model of choice. A prominent
candidate is multilevel multinomial regression, which is what we shall explore.
<em>Multilevel</em> refers to having a hierarchical structure where parameters on a
higher level give birth to parameters on a lower level, which, in particular,
enables information exchange through a common ancestor. <em>Multinomial</em> refers to
the distribution used for modeling the response variable. The family of
multinomial distributions is appropriate, since we work with counts of events
falling into one of several categories: detractors, neutrals, and promoters; see
Equation (2). The response for each cell is then as follows:</p>
\[y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i)\]
<p>where \(n_i\) is given by Equation (1), and</p>
\[\theta_i = \left\langle\theta^d_i, \theta^u_i, \theta^p_i\right\rangle\]
<p>is a simplex (sums up to one) of probabilities of the three categories.</p>
<p>Multinomial regression belongs to the class of generalized linear models. This
means that the inference takes place in a linear domain, and that \(\theta_i\)
is obtained by applying a deterministic transformation to the corresponding
linear model or models; the inverse of this transformation is known as the link
function. In the case of multinomial regression, the aforementioned
transformation is the softmax function, which is a generalization of the
logistic function allowing more than two categories:</p>
\[\theta_i = \text{Softmax}\left(\mu_i\right)\]
<p>where</p>
\[\mu_i = \left(0, \mu^u_i, \mu^p_i\right)\]
<p>is the average log-odds of the three categories with respect to a reference
category, which, by convention, is taken to be the first one, that is,
detractors. The first entry is zero, since the odds of the reference category
relative to itself are one, and \(\ln(1) = 0\). Therefore, there are only two
linear models: one is for neutrals (\(\mu^u_i\)), and one is for promoters
(\(\mu^p_i\)).</p>
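<p>To make the transformation concrete, here is a small sketch that maps a vector
of log-odds to probabilities and back. The numeric values are arbitrary; the
<code class="language-plaintext highlighter-rouge">softmax</code> helper is the same one-liner as in the appendix.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>softmax &lt;- function(x) exp(x) / sum(exp(x))

# Log-odds with respect to detractors; the reference entry is fixed at zero
mu &lt;- c(detractor = 0, neutral = 0.5, promoter = 1)

# Probabilities of the three categories, which sum up to one
theta &lt;- softmax(mu)

# Taking log-odds with respect to the reference category recovers mu
recovered &lt;- log(theta / theta[['detractor']])
</code></pre></div></div>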
<p>Now, there are many alternatives when it comes to the two linear parts. In this
article, we use the following architecture. Both the model for neutrals and the
one for promoters have the same structure, and for brevity, only the former is
described. For the log-odds of neutrals, the model is</p>
\[\mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}\]
<p>where</p>
\[\delta^{uj} = \left(\delta^{uj}_1, \dots, \delta^{uj}_{m_j}\right)\]
<p>is a vector of deviations from intercept \(b^u\) specific to grouping variable
\(j\) (one entry for each value of the variable), and \(I_j[i]\) yields the
index of the value that cell \(i\) has, for \(i = 1, \dots, K\) and \(j = 1,
\dots, M\).</p>
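<p>As an illustration, assume two grouping variables, \(M = 2\), with age being
the first one and seniority the second (this ordering is an assumption made only
for this example). The model for neutrals then reads</p>
\[\mu^u_i = b^u + \delta^{u1}_{I_1[i]} + \delta^{u2}_{I_2[i]},\]
<p>and for a cell \(i\) that has the second value of age and the fifth value of
seniority, it becomes \(\mu^u_i = b^u + \delta^{u1}_2 + \delta^{u2}_5\).</p>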
<p>Let us now turn to the multilevel aspect. For each grouping variable, the
corresponding values, represented by the elements of \(\delta^{uj}\), are
allowed to be different but assumed to have something in common and thus
originate from a common distribution. To this end, they are assigned
distributions with a shared parameter as follows:</p>
\[\delta^{uj}_i | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right)\]
<p>for \(i = 1, \dots, m_j\). The mean is zero, since \(\delta^{uj}_i\) represents
a deviation.</p>
<p>Lastly, we have to decide on prior distributions of the intercept, \(b^u\), and
the standard deviations, \(\sigma^{uj}\) for \(j = 1, \dots, M\). The intercept
is given the following prior:</p>
\[b^u \sim \text{Student’s t}(5, 0, 1).\]
<p>The mean is zero in order to center at even odds. Regarding the standard
deviations, they are given the following prior:</p>
\[\sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1).\]
<p>In order to understand the implications of these prior choices, let us take a
look at the prior distribution assuming two grouping variables:</p>
<p><img src="/assets/images/2020-02-03-net-promoter/prior-distribution-1.svg" alt="" /></p>
<p>The left and right dashed lines demarcate tail regions that, for practical
purposes, can be thought of as “never” and “always,” respectively. For instance,
log-odds of five or higher are so extreme that detractors are rendered nearly
non-existent when compared to neutrals. These regions are arguably unrealistic.
The prior does not exclude these possibilities; however, it does not favor them
either. The vast majority of the probability mass is still in the middle around
zero.</p>
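<p>A prior check of this kind can be approximated by simulation. The following is
a rough sketch, assuming two grouping variables and using base R only; it is not
necessarily how the figure above was produced.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(42)
draws &lt;- 10000

# Intercept: Student's t with five degrees of freedom, location 0, and scale 1
b &lt;- rt(draws, df = 5)

# Standard deviations: half-Student's t, obtained by taking absolute values
sigma_1 &lt;- abs(rt(draws, df = 5))
sigma_2 &lt;- abs(rt(draws, df = 5))

# Deviations for the two grouping variables given the standard deviations
delta_1 &lt;- rnorm(draws, mean = 0, sd = sigma_1)
delta_2 &lt;- rnorm(draws, mean = 0, sd = sigma_2)

# Prior draws of the log-odds of one category in one cell
mu &lt;- b + delta_1 + delta_2

# Share of the prior mass between the "never" and "always" regions
inside &lt;- mean(abs(mu) &lt; 5)
</code></pre></div></div>
<p>A histogram of <code class="language-plaintext highlighter-rouge">mu</code>, for instance via <code class="language-plaintext highlighter-rouge">hist(mu)</code>, gives a visual impression
of where the prior mass lies.</p>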
<p>The overall model is then as follows:</p>
\[\begin{align}
& y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i),
\text{ for } i = 1, \dots, K; \\
& \theta_i = \text{Softmax}\left(\mu_i\right),
\text{ for } i = 1, \dots, K; \\
& \mu_i = (0, \mu^u_i, \mu^p_i),
\text{ for } i = 1, \dots, K; \\
& \mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]},
\text{ for } i = 1, \dots, K; \\
& \mu^p_i = b^p + \sum_{j = 1}^M \delta^{pj}_{I_j[i]},
\text{ for } i = 1, \dots, K; \\
& b^u \sim \text{Student’s t}(5, 0, 1); \\
& b^p \sim \text{Student’s t}(5, 0, 1); \\
& \delta^{uj}_k | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right),
\text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5a} \\
& \delta^{pj}_k | \sigma^{pj} \sim \text{Gaussian}\left(0, \sigma^{pj}\right),
\text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5b} \\
& \sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1),
\text{ for } j = 1, \dots, M; \text{ and} \\
& \sigma^{pj} \sim \text{Half-Student’s t}(5, 0, 1),
\text{ for } j = 1, \dots, M.
\end{align}\]
<p>The model has \(2 \times (1 + \sum_i m_i + M)\) parameters in total. The
structure that can be seen in Equations (5a) and (5b) is what makes the model
multilevel. This is an important feature, since it allows for information
sharing between the individual values of the grouping variables. In particular,
this has a regularizing effect on the estimates, which is also known as
shrinkage resulting from partial pooling.</p>
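<p>As a quick check of this count, consider the example from the Implementation
section, where age has six values and seniority has seven, so that \(M = 2\) and
\(\sum_i m_i = 13\). Each of the two linear models has one intercept, 13
deviations, and two standard deviations, which gives</p>
\[2 \times \left(1 + (6 + 7) + 2\right) = 32\]
<p>parameters in total.</p>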
<p>Having defined the model, the posterior distribution can now be obtained by
means of Markov chain Monte Carlo sampling. This procedure is standard and can
be performed using, for instance, Stan or a higher-level package, such as
<a href="https://github.com/paul-buerkner/brms"><code class="language-plaintext highlighter-rouge">brms</code></a>, which is what is exemplified in the Implementation section. The result
is a collection of draws of the parameters from the posterior distribution. For
each draw of the parameters, a draw of the net promoter score can be computed
using the following formula:</p>
\[s_i = 100 \times (\theta^p_i - \theta^d_i) \tag{6}\]
<p>for \(i = 1, \dots, K\). This means that we have obtained a (joint) posterior
distribution of the net promoter score over the \(K\) cells. It is now time to
combine the scores for the cells on the level of the values of the \(M\)
grouping variables, which results in \(\sum_i m_i\) scores in total.</p>
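<p>For a single cell, Equation (6) can be sketched as follows. The draws of the
log-odds are invented for illustration; in practice, they come from the fitted
model.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>softmax &lt;- function(x) exp(x) / sum(exp(x))

# Hypothetical posterior draws of the log-odds for one cell: one row per draw;
# the columns are detractors (reference), neutrals, and promoters
mu &lt;- rbind(c(0, 0.2, 0.9),
            c(0, 0.1, 1.1),
            c(0, 0.3, 1.0))

# Apply the softmax function to each draw to obtain the probabilities
theta &lt;- t(apply(mu, 1, softmax))

# Equation (6): one draw of the net promoter score per draw of the parameters
score &lt;- 100 * (theta[, 3] - theta[, 1])
</code></pre></div></div>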
<h2 id="poststratification">Poststratification</h2>
<p>Step 2 is poststratification, whose purpose is to correct for potential
deviations of the sample from the population; recall the discussion around
Equation (4). The foundation laid in the previous subsection makes the work here
straightforward. The idea is as follows. Each draw from the posterior
distribution consists of \(K\) values for the net promoter score, one for each
cell. All one has to do in order to correct for a mismatch in proportions is to
take a weighted average of these scores where the weights are the counts
observed in the population:</p>
\[s^i_j = \frac{\sum_{k \in I^i_j} N_k \, s_k}{\sum_{k \in I^i_j} N_k}\]
<p>where \(I^i_j\) is as in Equation (3), for \(i = 1, \dots, M\) and \(j = 1,
\dots, m_i\). The above gives a poststratified draw from the posterior
distribution of the net promoter score for variable \(i\) and value \(j\). In
practice, depending on the tool used, one might perform the poststratification
procedure differently, such as predicting counts of detractors, neutrals, and
promoters in the cells given their in-population sizes and then aggregating
those counts and following the definition of the net promoter score.</p>
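<p>The weighted average above can be sketched in a few lines. The cell sizes, the
draws of the cell scores, and the helper <code class="language-plaintext highlighter-rouge">poststratify</code> are hypothetical and
serve only to illustrate the formula.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical population: four cells formed by two values of each variable
cells &lt;- data.frame(
  age = c('young', 'young', 'old', 'old'),
  seniority = c('new', 'loyal', 'new', 'loyal'),
  N = c(3000, 3000, 2000, 2000)
)

# Hypothetical posterior draws of the cell scores: one row per draw and one
# column per cell, in the same order as the rows of cells
scores &lt;- rbind(c(30, 50, -20, 40),
                c(34, 46, -10, 36))

# For each value of the variable, average the scores of the matching cells
# using the population counts as weights; this is done separately for each draw
poststratify &lt;- function(scores, cells, variable) {
  index_sets &lt;- split(seq_len(nrow(cells)), cells[[variable]])
  sapply(index_sets, function(k) {
    as.vector(scores[, k, drop = FALSE] %*% cells$N[k]) / sum(cells$N[k])
  })
}

result &lt;- poststratify(scores, cells, 'age')
</code></pre></div></div>
<p>The result has one row per draw and one column per value of the variable. Here,
the two draws are 40 and 40 for the young and 10 and 13 for the old.</p>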
<h1 id="implementation">Implementation</h1>
<p>In what follows, we consider a contrived example with the sole purpose of
illustrating how the presented workflow can be implemented in practice. To this
end, we generate some data with two grouping variables, age and seniority, and
then perform inference using <a href="https://github.com/paul-buerkner/brms"><code class="language-plaintext highlighter-rouge">brms</code></a>, which leverages Stan under the hood. For
a convenient manipulation of posterior draws, <a href="https://github.com/mjskay/tidybayes"><code class="language-plaintext highlighter-rouge">tidybayes</code></a> is used as well.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">brms</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidybayes</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="n">options</span><span class="p">(</span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parallel</span><span class="o">::</span><span class="n">detectCores</span><span class="p">())</span><span class="w">
</span><span class="c1"># Load data</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">load_data</span><span class="p">()</span><span class="w">
</span><span class="c1"># => list(</span><span class="w">
</span><span class="c1"># => population = tibble(age, seniority, cell_size),</span><span class="w">
</span><span class="c1"># => sample = tibble(age, seniority, cell_size,</span><span class="w">
</span><span class="c1"># => cell_counts = (detractors, neutrals, promoters))</span><span class="w">
</span><span class="c1"># => )</span><span class="w">
</span><span class="c1"># Modeling</span><span class="w">
</span><span class="n">priors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Intercept'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'muneutral'</span><span class="p">),</span><span class="w">
</span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Intercept'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mupromoter'</span><span class="p">),</span><span class="w">
</span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'sd'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'muneutral'</span><span class="p">),</span><span class="w">
</span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'sd'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mupromoter'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">formula</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brmsformula</span><span class="p">(</span><span class="w">
</span><span class="n">cell_counts</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">trials</span><span class="p">(</span><span class="n">cell_size</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">seniority</span><span class="p">))</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brm</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="n">multinomial</span><span class="p">(),</span><span class="w"> </span><span class="n">priors</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.99</span><span class="p">),</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="c1"># Poststratification</span><span class="w">
</span><span class="n">prediction</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">population</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">add_predicted_draws</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">spread</span><span class="p">(</span><span class="n">.category</span><span class="p">,</span><span class="w"> </span><span class="n">.prediction</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">.draw</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">promoter</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">detractor</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">cell_size</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mean_hdi</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>The final aggregation is given for age; it is similar for seniority. It can be
seen in the above listing that modern tools allow for rather complex ideas to be
expressed and explored in a very laconic way.</p>
<p>The curious reader is encouraged to run the above code. The appendix contains a
function for generating synthetic data. It should be noted, however, that the
versions of <code class="language-plaintext highlighter-rouge">brms</code> and <code class="language-plaintext highlighter-rouge">tidybayes</code> must be newer than 2.11.1 and 2.0.1,
respectively, which, at the time of writing, are available for installation only
on GitHub. The appendix contains instructions for updating the packages.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this article, we have discussed a multilevel multinomial model for inferring
the net promoter score with respect to several grouping variables in accordance
with the business needs. It has been argued that poststratification is an
essential stage of the inference process, since it mitigates the deleterious
consequences of participation bias on the subsequent decision-making.</p>
<p>There are still some aspects that could be improved. For instance, there is a
natural ordering to the three categories of customers (detractors, neutrals, and
promoters); however, it is currently ignored. Furthermore, there is some
information thrown away when customer-level scores, which range from zero to
ten, are aggregated on the category level. Lastly, the net promoter survey often
happens in periodic waves, which calls for a single model capturing and learning
from changes over time.</p>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>I would like to thank <a href="http://www.stat.columbia.edu/~gelman/">Andrew Gelman</a> for the guidance on multilevel modeling
and <a href="https://paul-buerkner.github.io/">Paul-Christian Bürkner</a> for the help with understanding the <code class="language-plaintext highlighter-rouge">brms</code> package.</p>
<h1 id="references">References</h1>
<ul>
<li>Andrew Gelman et al., “<a href="http://www.stat.columbia.edu/~gelman/research/unpublished/MRT(1).pdf">Using multilevel regression and poststratification to
estimate dynamic public opinion</a>,” 2018.</li>
<li>Andrew Gelman and Jennifer Hill, <em><a href="https://doi.org/10.1017/CBO9780511790942">Data Analysis Using Regression and
Multilevel/Hierarchical Models</a></em>, Cambridge University Press, 2006.</li>
<li>Andrew Gelman and Thomas Little, “<a href="http://www.stat.columbia.edu/~gelman/research/published/poststrat3.pdf">Poststratification into many categories
using hierarchical logistic regression</a>,” Survey Methodology, 1997.</li>
<li>Paul-Christian Bürkner, “<a href="http://dx.doi.org/10.18637/jss.v080.i01">brms: An R package for Bayesian multilevel models
using Stan</a>,” Journal of Statistical Software, 2017.</li>
</ul>
<h1 id="appendix">Appendix</h1>
<p>The following listing defines a function that makes the illustrative example
given in the Implementation section self-sufficient. By default, the population
contains one million customers, and the sample contains one percent. There are
two grouping variables: age with six values and seniority with seven values.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">load_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000000</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">softmax</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
</span><span class="c1"># Age</span><span class="w">
</span><span class="n">age_values</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'18–25'</span><span class="p">,</span><span class="w"> </span><span class="s1">'26–35'</span><span class="p">,</span><span class="w"> </span><span class="s1">'36–45'</span><span class="p">,</span><span class="w"> </span><span class="s1">'46–55'</span><span class="p">,</span><span class="w"> </span><span class="s1">'56–65'</span><span class="p">,</span><span class="w"> </span><span class="s1">'66+'</span><span class="p">)</span><span class="w">
</span><span class="n">age_probabilities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="c1"># Seniority</span><span class="w">
</span><span class="n">seniority_values</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'6M'</span><span class="p">,</span><span class="w"> </span><span class="s1">'1Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'2Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'3Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'4Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'5Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'6Y+'</span><span class="p">)</span><span class="w">
</span><span class="n">seniority_probabilities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="c1"># Score</span><span class="w">
</span><span class="n">score_values</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">score_probabilities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w">
</span><span class="c1"># Generate a population</span><span class="w">
</span><span class="n">population</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">age_values</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age_probabilities</span><span class="p">,</span><span class="w">
</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">seniority</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">seniority_values</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seniority_probabilities</span><span class="p">,</span><span class="w">
</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="c1"># Take a sample from the population</span><span class="w">
</span><span class="n">sample</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">sample_n</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">score_values</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w">
</span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">score_probabilities</span><span class="p">,</span><span class="w">
</span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'detractor'</span><span class="p">,</span><span class="w">
</span><span class="n">score</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">,</span><span class="w">
</span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">))</span><span class="w">
</span><span class="c1"># Summarize the population</span><span class="w">
</span><span class="n">population</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">seniority</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">count</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'cell_size'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Summarize the sample</span><span class="w">
</span><span class="n">sample</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">seniority</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="n">detractors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'detractor'</span><span class="p">),</span><span class="w">
</span><span class="n">neutrals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">),</span><span class="w">
</span><span class="n">promoters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">cell_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">detractors</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">neutrals</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">promoters</span><span class="p">)</span><span class="w">
</span><span class="c1"># Bind counts of neutrals, detractors, and promoters (needed for brms)</span><span class="w">
</span><span class="n">sample</span><span class="o">$</span><span class="n">cell_counts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">detractors</span><span class="p">,</span><span class="w"> </span><span class="n">neutrals</span><span class="p">,</span><span class="w"> </span><span class="n">promoters</span><span class="p">))</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">sample</span><span class="o">$</span><span class="n">cell_counts</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'detractor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">,</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Remove unused columns</span><span class="w">
</span><span class="n">sample</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">detractors</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">neutrals</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">promoters</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">population</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">population</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Lastly, the following snippet shows how to update <code class="language-plaintext highlighter-rouge">brms</code> and <code class="language-plaintext highlighter-rouge">tidybayes</code> from
GitHub:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">packageVersion</span><span class="p">(</span><span class="s1">'brms'</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="s1">'2.11.2'</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">remotes</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'paul-buerkner/brms'</span><span class="p">,</span><span class="w"> </span><span class="n">upgrade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'never'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">packageVersion</span><span class="p">(</span><span class="s1">'tidybayes'</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="s1">'2.0.1.9000'</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">remotes</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'mjskay/tidybayes'</span><span class="p">,</span><span class="w"> </span><span class="n">upgrade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'never'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>Ivan UkhovCustomer surveys are naturally prone to biases. One prominent example is participation bias, which arises when individuals decide not to respond to the survey, and this pattern is not random. For instance, new customers might reply less eagerly than those who are senior. This renders the obtained responses unrepresentative of the target population. In this article, we tackle participation bias for the case of the net promoter survey by means of multilevel regression and poststratification.Ingestion of sequential data from BigQuery into TensorFlow2019-11-08T07:00:00+00:002019-11-08T07:00:00+00:00https://blog.ivanukhov.com/2019/11/08/sequential-data<p>How hard can it be to ingest sequential data into a <a href="https://www.tensorflow.org">TensorFlow</a> model? As
always, the answer is, “It depends.” Where are the sequences in question stored?
Can they fit in main memory? Are they of the same length? In what follows, we
shall build a flexible and scalable workflow for feeding sequential observations
into a TensorFlow graph starting from <a href="https://cloud.google.com/bigquery/">BigQuery</a> as the data warehouse.</p>
<p>To make the discussion tangible, consider the following problem. Suppose the
goal is to predict the peak temperature at an arbitrary weather station present
in the <a href="https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn">Global Historical Climatology Network</a> for each day between June 1 and
August 31. More concretely, given observations from June 1 up to an arbitrary
day before August 31, the objective is to complete the sequence until August 31.
For instance, if we find ourselves in Stockholm on June 12, we ask for the
maximum temperatures from June 12 to August 31 given the temperature values
between June 1 and June 11 at a weather station in Stockholm.</p>
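<p>The task can be sketched in a few lines of plain Python. This is only an
illustration under assumptions: the function <code class="language-plaintext highlighter-rouge">split_sequence</code> and the temperature
values are hypothetical, and the season is taken to be the 92 days from June 1
to August 31 without missing observations.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datetime import date


def split_sequence(temperatures, year, today):
    """Split a season into the observed part and the part to be predicted.

    `temperatures` holds one value per day starting from June 1 of `year`,
    and `today` is the first day whose temperature is not yet known.
    """
    start = date(year, 6, 1)
    # Number of days observed before `today`
    observed = (today - start).days
    if observed not in range(len(temperatures) + 1):
        raise ValueError('the day lies outside of the season')
    return temperatures[:observed], temperatures[observed:]


# Hypothetical season of 92 daily peak temperatures in degrees Celsius
season = [20.0 + day % 5 for day in range(92)]
known, unknown = split_sequence(season, 2019, date(2019, 6, 12))
print(len(known), len(unknown))
</code></pre></div></div>
<p>For June 12, the first part covers June 1 to June 11, and the second part is
what a model would be asked to complete.</p>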
<p>To set expectations right, we are not going to build a predictive model in
this article; rather, we shall facilitate its development by making the data from the
aforementioned database readily available in a TensorFlow graph. The final chain
of states and operations is as follows:</p>
<ol>
<li>
<p>Historical temperature measurements from the Global Historical Climatology
Network are stored in a <a href="https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d">public data set</a> in BigQuery. Each row
corresponds to a weather station and a date. There are missing observations
due to such reasons as measurements not passing quality checks.</p>
</li>
<li>
<p>Relevant measurements are grouped in BigQuery by the weather station and
year. Therefore, each row corresponds to a weather station and a year,
implying that all information about a particular example (a specific weather
station in a specific year) is gathered in one place.</p>
</li>
<li>
<p>The sequences are read, analyzed, and transformed by <a href="https://cloud.google.com/dataflow/">Cloud Dataflow</a>.</p>
<ul>
<li>
<p>The data are split into a training, a validation, and a testing set of
examples.</p>
</li>
<li>
<p>The training set is used to compute statistics needed for transforming the
measurements to a form suitable for the subsequent modeling.
Standardization is used as an example.</p>
</li>
<li>
<p>The training and validation sets are transformed using the statistics
computed with respect to the training set in order to avoid performing
these computations during the training-with-validation phase. The
corresponding transform is available for the testing phase.</p>
</li>
</ul>
</li>
<li>
<p>The processed training and validation examples and the raw testing examples
are written by Dataflow to <a href="https://cloud.google.com/storage/">Cloud Storage</a> in the <a href="https://www.tensorflow.org/tutorials/load_data/tfrecord">TFRecord</a> format, which is
a format native to TensorFlow.</p>
</li>
<li>
<p>The files containing TFRecords are read by the <a href="https://www.tensorflow.org/guide/data"><code class="language-plaintext highlighter-rouge">tf.data</code></a> API of TensorFlow
and eventually transformed into a data set of appropriately padded batches of
examples.</p>
</li>
</ol>
<p>The above workflow is not as simple as reading data from a Pandas DataFrame
comfortably resting in main memory; however, it is much more scalable. This
pipeline can handle arbitrary amounts of data. Moreover, it operates on
complete examples, not on individual measurements.</p>
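<p>The padding mentioned in the last step of the workflow can be illustrated
without TensorFlow. The sketch below is a plain-Python stand-in with a
hypothetical helper, <code class="language-plaintext highlighter-rouge">pad_batch</code>, and made-up values; in TensorFlow, a similar
effect is obtained with the <code class="language-plaintext highlighter-rouge">padded_batch</code> method of <code class="language-plaintext highlighter-rouge">tf.data.Dataset</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def pad_batch(sequences, padding=0.0):
    """Pad variable-length sequences to the length of the longest one."""
    length = max(len(sequence) for sequence in sequences)
    return [
        list(sequence) + [padding] * (length - len(sequence))
        for sequence in sequences
    ]


# Hypothetical batch of three temperature sequences of different lengths
batch = pad_batch([[21.5, 23.0, 19.5], [18.0], [25.0, 24.5]])
print(batch)
</code></pre></div></div>
<p>Padding is done per batch rather than per data set, so that short sequences
are not inflated to the length of the longest sequence overall.</p>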
<p>In the rest of the article, the aforementioned steps will be described in more
detail. The corresponding source code can be found in the following repository
on GitHub:</p>
<ul>
<li><a href="https://github.com/chain-rule/example-weather-forecast">example-weather-forecast</a>.</li>
</ul>
<h1 id="data">Data</h1>
<p>It all starts with data. The data come from the Global Historical Climatology
Network, which is <a href="https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d">available in BigQuery</a> for public use. Steps 1 and 2
in the list above are covered by the <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/data.sql">following query</a>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span>
<span class="c1">-- Select relevant measurements</span>
<span class="n">data_1</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="nb">date</span><span class="p">,</span>
<span class="c1">-- Find the date of the previous observation</span>
<span class="n">LAG</span><span class="p">(</span><span class="nb">date</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">station_year</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_last</span><span class="p">,</span>
<span class="n">latitude</span><span class="p">,</span>
<span class="n">longitude</span><span class="p">,</span>
<span class="c1">-- Convert to degrees Celsius</span>
<span class="n">value</span> <span class="o">/</span> <span class="mi">10</span> <span class="k">AS</span> <span class="n">temperature</span>
<span class="k">FROM</span>
<span class="nv">`bigquery-public-data.ghcn_d.ghcnd_201*`</span>
<span class="k">INNER</span> <span class="k">JOIN</span>
<span class="nv">`bigquery-public-data.ghcn_d.ghcnd_stations`</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="k">WHERE</span>
<span class="c1">-- Take years from 2010 to 2019</span>
<span class="k">CAST</span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span> <span class="k">AS</span> <span class="n">INT64</span><span class="p">)</span> <span class="k">BETWEEN</span> <span class="mi">0</span> <span class="k">AND</span> <span class="mi">9</span>
<span class="c1">-- Take months from June to August</span>
<span class="k">AND</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="k">MONTH</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="k">BETWEEN</span> <span class="mi">6</span> <span class="k">AND</span> <span class="mi">8</span>
<span class="c1">-- Take the maximum temperature</span>
<span class="k">AND</span> <span class="n">element</span> <span class="o">=</span> <span class="s1">'TMAX'</span>
<span class="c1">-- Take observations that passed spatio-temporal quality-control checks</span>
<span class="k">AND</span> <span class="n">qflag</span> <span class="k">IS</span> <span class="k">NULL</span>
<span class="n">WINDOW</span>
<span class="n">station_year</span> <span class="k">AS</span> <span class="p">(</span>
<span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">id</span><span class="p">,</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span>
<span class="p">)</span>
<span class="p">),</span>
<span class="c1">-- Group into examples (a specific station and a specific year)</span>
<span class="n">data_2</span> <span class="k">AS</span> <span class="p">(</span>
<span class="k">SELECT</span>
<span class="n">id</span><span class="p">,</span>
<span class="k">MIN</span><span class="p">(</span><span class="nb">date</span><span class="p">)</span> <span class="k">AS</span> <span class="nb">date</span><span class="p">,</span>
<span class="n">latitude</span><span class="p">,</span>
<span class="n">longitude</span><span class="p">,</span>
<span class="c1">-- Compute gaps between observations</span>
<span class="n">ARRAY_AGG</span><span class="p">(</span>
<span class="n">DATE_DIFF</span><span class="p">(</span><span class="nb">date</span><span class="p">,</span> <span class="n">IFNULL</span><span class="p">(</span><span class="n">date_last</span><span class="p">,</span> <span class="nb">date</span><span class="p">),</span> <span class="k">DAY</span><span class="p">)</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span>
<span class="p">)</span> <span class="k">AS</span> <span class="n">duration</span><span class="p">,</span>
<span class="n">ARRAY_AGG</span><span class="p">(</span><span class="n">temperature</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span><span class="p">)</span> <span class="k">AS</span> <span class="n">temperature</span>
<span class="k">FROM</span>
<span class="n">data_1</span>
<span class="k">GROUP</span> <span class="k">BY</span>
<span class="n">id</span><span class="p">,</span> <span class="n">latitude</span><span class="p">,</span> <span class="n">longitude</span><span class="p">,</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1">-- Partition into training, validation, and testing sets</span>
<span class="k">SELECT</span>
<span class="o">*</span><span class="p">,</span>
<span class="k">CASE</span>
<span class="k">WHEN</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="o"><</span> <span class="mi">2019</span> <span class="k">THEN</span> <span class="s1">'analysis,training'</span>
<span class="k">WHEN</span> <span class="k">MOD</span><span class="p">(</span><span class="k">ABS</span><span class="p">(</span><span class="n">FARM_FINGERPRINT</span><span class="p">(</span><span class="n">id</span><span class="p">)),</span> <span class="mi">100</span><span class="p">)</span> <span class="o"><</span> <span class="mi">50</span> <span class="k">THEN</span> <span class="s1">'validation'</span>
<span class="k">ELSE</span> <span class="s1">'testing'</span>
<span class="k">END</span> <span class="k">AS</span> <span class="k">mode</span>
<span class="k">FROM</span>
<span class="n">data_2</span>
</code></pre></div></div>
<p>The query fetches peak temperatures, denoted by <code class="language-plaintext highlighter-rouge">temperature</code>, for all available
weather stations between June and August in 2010–2019. The crucial part is the
usage of <code class="language-plaintext highlighter-rouge">ARRAY_AGG</code>, which is what makes it possible to gather all relevant
data about a specific station and a specific year in the same row. The number of
days since the previous measurement, which is denoted by <code class="language-plaintext highlighter-rouge">duration</code>, is also
computed. Ideally, <code class="language-plaintext highlighter-rouge">duration</code> should always be one (except for the first day,
which has no predecessor); however, this is not the case, which makes the
resulting time series vary in length.</p>
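<p>To make the grouping more concrete, here is a rough plain-Python analogue of
what the query does with <code class="language-plaintext highlighter-rouge">LAG</code> and <code class="language-plaintext highlighter-rouge">ARRAY_AGG</code>. The function
<code class="language-plaintext highlighter-rouge">group_into_examples</code>, the station identifier, and the rows are hypothetical
and serve only as an illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict
from datetime import date


def group_into_examples(rows):
    """Group (id, date, temperature) rows by station and year.

    Each resulting example holds the temperatures ordered by date and the
    gaps in days between consecutive observations.
    """
    groups = defaultdict(list)
    for station, day, temperature in rows:
        groups[(station, day.year)].append((day, temperature))
    examples = []
    for (station, year), group in sorted(groups.items()):
        group.sort()
        days = [day for day, _ in group]
        # The first observation has no predecessor; its gap is set to zero
        duration = [0] + [
            (current - previous).days
            for previous, current in zip(days, days[1:])
        ]
        examples.append({
            'id': station,
            'year': year,
            'duration': duration,
            'temperature': [temperature for _, temperature in group],
        })
    return examples


# Hypothetical rows given out of order; June 3 is missing
rows = [
    ('station-a', date(2019, 6, 2), 22.5),
    ('station-a', date(2019, 6, 1), 21.0),
    ('station-a', date(2019, 6, 4), 19.0),
]
examples = group_into_examples(rows)
print(examples)
</code></pre></div></div>
<p>The missing day shows up as a gap of two in <code class="language-plaintext highlighter-rouge">duration</code>, and the sequence is one
element shorter than it would be otherwise.</p>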
<p>In addition, in order to illustrate the generality of this approach, two
contextual (that is, non-sequential) explanatory variables are added: <code class="language-plaintext highlighter-rouge">latitude</code>
and <code class="language-plaintext highlighter-rouge">longitude</code>. They are scalars stored side by side with <code class="language-plaintext highlighter-rouge">duration</code> and
<code class="language-plaintext highlighter-rouge">temperature</code>, which are arrays.</p>
<p>Another important aspect is the final <code class="language-plaintext highlighter-rouge">SELECT</code> statement, which defines a column
called <code class="language-plaintext highlighter-rouge">mode</code>. This column indicates what each example is used for, allowing one
to use the same query for different purposes and to avoid inconsistencies due to
multiple queries. In this case, observations prior to 2019 are reserved for
training, while the rest is split pseudo-randomly and reproducibly into two
approximately equal parts: one is for validation, and one is for testing. This
last operation is explained in detail in “<a href="https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning">Repeatable sampling of data sets in
BigQuery for machine learning</a>” by Lak Lakshmanan.</p>
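<p>The idea behind the hash-based split can be sketched in Python as follows.
Note the assumptions: SHA-256 from <code class="language-plaintext highlighter-rouge">hashlib</code> is used as a stand-in, so the
buckets do not coincide with those produced by <code class="language-plaintext highlighter-rouge">FARM_FINGERPRINT</code> in BigQuery,
and the functions <code class="language-plaintext highlighter-rouge">fingerprint</code> and <code class="language-plaintext highlighter-rouge">assign_mode</code> are hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib


def fingerprint(text):
    """Return a stable 64-bit integer hash of a string.

    SHA-256 is a stand-in; it does not reproduce FARM_FINGERPRINT values.
    """
    digest = hashlib.sha256(text.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')


def assign_mode(station, year):
    """Mimic the CASE expression of the query for a single example."""
    # Years from 2010 to 2018 are used for analysis and training
    if year in range(2010, 2019):
        return 'analysis,training'
    # The rest is split in two halves based on the hash of the station
    if fingerprint(station) % 100 in range(50):
        return 'validation'
    return 'testing'


print(assign_mode('station-a', 2015))
print(assign_mode('station-a', 2019) == assign_mode('station-a', 2019))
</code></pre></div></div>
<p>Since the outcome depends only on the identifier and not on a random number
generator, rerunning the split yields the same assignment.</p>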
<h1 id="preprocessing">Preprocessing</h1>
<p>In this section, we cover Steps 3 and 4 in the list given at the beginning. This
job is done by <a href="https://www.tensorflow.org/tfx">TensorFlow Extended</a>, which is a library for building
machine-learning pipelines. Internally, it relies on <a href="https://beam.apache.org/">Apache Beam</a> as a language
for defining pipelines. Once an adequate pipeline is created, it can be executed
using an executor, and the executor that we shall use is <a href="https://cloud.google.com/dataflow/">Cloud Dataflow</a>.</p>
<p>Before we proceed to the pipeline itself, note that its construction is
orchestrated by a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/preprocessing.json">configuration file</a>, which will
be referred to as <code class="language-plaintext highlighter-rouge">config</code> in the pipeline code (to be discussed shortly):</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"configs/training/data.sql"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"latitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"longitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"duration"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">],</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">],</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"modes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"training"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="p">,</span><span class="w"> </span><span class="nl">"shuffle"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"validation"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"testing"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"identity"</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>It is worth noting that this way of working with a separate configuration file
is not something standard that comes with TensorFlow or Beam. It is a
convenience that we build for ourselves in order to keep the main logic reusable
and extendable without touching the code.</p>
<p>The <code class="language-plaintext highlighter-rouge">data</code> block describes where the data can be found and provides a schema for
the columns that are used. (Recall the SQL query given earlier and note that
<code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">date</code>, and <code class="language-plaintext highlighter-rouge">mode</code> are omitted.) For instance, <code class="language-plaintext highlighter-rouge">latitude</code> is a scalar
of type <code class="language-plaintext highlighter-rouge">FLOAT32</code>, while <code class="language-plaintext highlighter-rouge">temperature</code> is a sequence of type <code class="language-plaintext highlighter-rouge">FLOAT32</code>. Both are
standardized to have a zero mean and a unit standard deviation, which is
indicated by <code class="language-plaintext highlighter-rouge">"transform": "z"</code> and is typically needed for training neural
networks.</p>
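<p>As a reminder of what standardization amounts to, below is a minimal sketch
in plain Python. It is not the code used by the pipeline: the functions
<code class="language-plaintext highlighter-rouge">fit_z</code> and <code class="language-plaintext highlighter-rouge">apply_z</code> and the numbers are hypothetical, and the use of the
population standard deviation is an assumption.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from statistics import mean, pstdev


def fit_z(values):
    """Compute the statistics of the z transform from analysis examples."""
    return mean(values), pstdev(values)


def apply_z(values, stats):
    """Standardize values using previously computed statistics."""
    center, scale = stats
    return [(value - center) / scale for value in values]


# Hypothetical temperatures in degrees Celsius
analysis = [18.0, 20.0, 22.0, 24.0]
validation = [21.0, 27.0]
# The statistics come from the analysis examples only
stats = fit_z(analysis)
print(apply_z(validation, stats))
</code></pre></div></div>
<p>The important detail is that the statistics are computed once and then reused
for other sets of examples, which is what the modes discussed next are about.</p>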
<p>The <code class="language-plaintext highlighter-rouge">modes</code> block defines four passes over the data, corresponding to four
operating modes. In each mode, a specific subset of examples is considered,
which is given by the <code class="language-plaintext highlighter-rouge">mode</code> column returned by the query. There are two types
of modes: analysis and transform; recall Step 3. Whenever the <code class="language-plaintext highlighter-rouge">transform</code> key is
present, it is a transform mode; otherwise, it is an analysis mode. In this
example, there are one analysis and three transform modes.</p>
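<p>The distinction between the two types of modes can be expressed in a few
lines of Python. The helpers <code class="language-plaintext highlighter-rouge">split_modes</code> and <code class="language-plaintext highlighter-rouge">belongs_to</code> are hypothetical;
in particular, the assumption that the <code class="language-plaintext highlighter-rouge">mode</code> column is matched by splitting on
commas is ours.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# The modes block of the configuration file shown earlier
CONFIG = json.loads('''
{
  "modes": [
    { "name": "analysis" },
    { "name": "training", "transform": "analysis", "shuffle": true },
    { "name": "validation", "transform": "analysis" },
    { "name": "testing", "transform": "identity" }
  ]
}
''')


def split_modes(config):
    """Separate the names of analysis modes from those of transform modes."""
    analysis = [m['name'] for m in config['modes'] if 'transform' not in m]
    transform = [m['name'] for m in config['modes'] if 'transform' in m]
    return analysis, transform


def belongs_to(example_mode, name):
    """Check if an example with the given `mode` value is used in a mode."""
    return name in example_mode.split(',')


print(split_modes(CONFIG))
print(belongs_to('analysis,training', 'training'))
</code></pre></div></div>
<p>An example labeled <code class="language-plaintext highlighter-rouge">analysis,training</code> thereby participates in two passes:
first in the analysis and then in the transformation for training.</p>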
<p>Below is an excerpt from a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/pipeline.py">Python class</a> responsible for building
the pipeline:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config = ...
# schema = ...
</span>
<span class="c1"># Read the SQL code
</span><span class="n">query</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">config</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'path'</span><span class="p">]).</span><span class="n">read</span><span class="p">()</span>
<span class="c1"># Create a BigQuery source
</span><span class="n">source</span> <span class="o">=</span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">BigQuerySource</span><span class="p">(</span><span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">,</span> <span class="n">use_standard_sql</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Create metadata needed later
</span><span class="n">spec</span> <span class="o">=</span> <span class="n">schema</span><span class="p">.</span><span class="n">to_feature_spec</span><span class="p">()</span>
<span class="n">meta</span> <span class="o">=</span> <span class="n">dataset_metadata</span><span class="p">.</span><span class="n">DatasetMetadata</span><span class="p">(</span>
<span class="n">schema</span><span class="o">=</span><span class="n">dataset_schema</span><span class="p">.</span><span class="n">from_feature_spec</span><span class="p">(</span><span class="n">spec</span><span class="p">))</span>
<span class="c1"># Read data from BigQuery
</span><span class="n">data</span> <span class="o">=</span> <span class="n">pipeline</span> \
<span class="o">|</span> <span class="s">'read'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">Read</span><span class="p">(</span><span class="n">source</span><span class="p">)</span>
<span class="c1"># Loop over modes whose purpose is analysis
</span><span class="n">transform_functions</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">mode</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'modes'</span><span class="p">]:</span>
<span class="k">if</span> <span class="s">'transform'</span> <span class="ow">in</span> <span class="n">mode</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">mode</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span>
<span class="c1"># Select examples that belong to the current mode
</span> <span class="n">data_</span> <span class="o">=</span> <span class="n">data</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-filter'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">Filter</span><span class="p">(</span><span class="n">partial</span><span class="p">(</span><span class="n">_filter</span><span class="p">,</span> <span class="n">mode</span><span class="p">))</span>
<span class="c1"># Analyze the examples
</span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">data_</span><span class="p">,</span> <span class="n">meta</span><span class="p">)</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-analyze'</span> <span class="o">>></span> <span class="n">tt_beam</span><span class="p">.</span><span class="n">AnalyzeDataset</span><span class="p">(</span><span class="n">_analyze</span><span class="p">)</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">_locate</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="s">'transform'</span><span class="p">)</span>
<span class="c1"># Store the transform function
</span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-write-transform'</span> <span class="o">>></span> <span class="n">transform_fn_io</span><span class="p">.</span><span class="n">WriteTransformFn</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
<span class="c1"># Loop over modes whose purpose is transformation
</span><span class="k">for</span> <span class="n">mode</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'modes'</span><span class="p">]:</span>
    <span class="k">if</span> <span class="s">'transform'</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">mode</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">mode</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span>
<span class="c1"># Select examples that belong to the current mode
</span> <span class="n">data_</span> <span class="o">=</span> <span class="n">data</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-filter'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">Filter</span><span class="p">(</span><span class="n">partial</span><span class="p">(</span><span class="n">_filter</span><span class="p">,</span> <span class="n">mode</span><span class="p">))</span>
<span class="c1"># Shuffle examples if needed
</span> <span class="k">if</span> <span class="n">mode</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'shuffle'</span><span class="p">,</span> <span class="bp">False</span><span class="p">):</span>
<span class="n">data_</span> <span class="o">=</span> <span class="n">data_</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-shuffle'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">transforms</span><span class="p">.</span><span class="n">Reshuffle</span><span class="p">()</span>
<span class="c1"># Transform the examples using an appropriate transform function
</span> <span class="k">if</span> <span class="n">mode</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'identity'</span><span class="p">:</span>
<span class="n">coder</span> <span class="o">=</span> <span class="n">tft</span><span class="p">.</span><span class="n">coders</span><span class="p">.</span><span class="n">ExampleProtoCoder</span><span class="p">(</span><span class="n">meta</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">data_</span><span class="p">,</span> <span class="n">meta_</span> <span class="o">=</span> <span class="p">((</span><span class="n">data_</span><span class="p">,</span> <span class="n">meta</span><span class="p">),</span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">mode</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]])</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-transform'</span> <span class="o">>></span> <span class="n">tt_beam</span><span class="p">.</span><span class="n">TransformDataset</span><span class="p">()</span>
<span class="n">coder</span> <span class="o">=</span> <span class="n">tft</span><span class="p">.</span><span class="n">coders</span><span class="p">.</span><span class="n">ExampleProtoCoder</span><span class="p">(</span><span class="n">meta_</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">_locate</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="s">'examples'</span><span class="p">,</span> <span class="s">'part'</span><span class="p">)</span>
<span class="c1"># Store the transformed examples as TFRecords
</span> <span class="n">data_</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-encode'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">Map</span><span class="p">(</span><span class="n">coder</span><span class="p">.</span><span class="n">encode</span><span class="p">)</span> \
<span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-write-examples'</span> <span class="o">>></span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">tfrecordio</span><span class="p">.</span><span class="n">WriteToTFRecord</span><span class="p">(</span><span class="n">path</span><span class="p">)</span>
</code></pre></div></div>
<p>At the very beginning, a BigQuery source is created, which is then branched out
according to the operating modes found in the configuration file. Specifically,
the first for-loop corresponds to the analysis modes, and the second for-loop
goes over the transform modes. The former ends with <code class="language-plaintext highlighter-rouge">WriteTransformFn</code>, which
saves the resulting transform, and the latter ends with <code class="language-plaintext highlighter-rouge">WriteToTFRecord</code>, which
writes the resulting examples as TFRecords.</p>
<p>The distinction between the contextual and sequential features is given by the
<a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/schema.py"><code class="language-plaintext highlighter-rouge">schema</code></a> object created based on the <code class="language-plaintext highlighter-rouge">schema</code> block in the
configuration file. The call <code class="language-plaintext highlighter-rouge">schema.to_feature_spec()</code> shown above chooses
<a href="https://www.tensorflow.org/api_docs/python/tf/io/FixedLenFeature"><code class="language-plaintext highlighter-rouge">tf.io.FixedLenFeature</code></a> for contextual features and <a href="https://www.tensorflow.org/api_docs/python/tf/io/VarLenFeature"><code class="language-plaintext highlighter-rouge">tf.io.VarLenFeature</code></a> for sequential ones and produces a
feature specification that is understood by TensorFlow and TensorFlow Extended.</p>
<p>The <a href="https://github.com/chain-rule/example-weather-forecast">repository</a> provides a wrapper for executing the
pipeline on Cloud Dataflow. The following figure shows the flow of the data with
respect to the four operating modes:</p>
<p><img src="/assets/images/2019-11-08-sequential-data/dataflow.svg" alt="" /></p>
<p>The outcome is a hierarchy of files on Cloud Storage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
└── data/
    └── training/
        └── 2019-11-01-12-00-00/
            ├── analysis/
            │   └── transform/
            │       ├── transform_fn/...
            │       └── transform_metadata/...
            ├── testing/
            │   └── examples/
            │       ├── part-000000-of-00004
            │       ├── ...
            │       └── part-000003-of-00004
            ├── training/
            │   └── examples/
            │       ├── part-000000-of-00006
            │       ├── ...
            │       └── part-000005-of-00006
            └── validation/
                └── examples/
                    ├── part-000000-of-00004
                    ├── ...
                    └── part-000003-of-00004
</code></pre></div></div>
<p>Here, <code class="language-plaintext highlighter-rouge">data/training</code> contains all data needed for the training phase, which
collectively refers to training entwined with validation and followed by
testing. Moving forward, this hierarchy is meant to accommodate the application
phase as well by populating a <code class="language-plaintext highlighter-rouge">data/application</code> entry next to the
<code class="language-plaintext highlighter-rouge">data/training</code> one. It can also accommodate trained models and the results of
applying these models by having a <code class="language-plaintext highlighter-rouge">model</code> entry with a structure similar to the
one of the <code class="language-plaintext highlighter-rouge">data</code> entry.</p>
<p>In the listing above, the files whose names start with <code class="language-plaintext highlighter-rouge">part-</code> are the ones
containing TFRecords. It can be seen that, for each mode, the corresponding
examples have been split into multiple files, which is done for more efficient
access during the execution phase discussed in the next section.</p>
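<p>As a minimal sketch of how such a sharded layout can be consumed, the snippet
below builds a small local replica of the hierarchy and lists the shards of each
mode with a glob pattern. The helper <code class="language-plaintext highlighter-rouge">list_shards</code>, the temporary directory, and
the empty placeholder files are assumptions made for illustration only; on Cloud
Storage, one would rely on a file-system layer that understands such paths
instead of the standard <code class="language-plaintext highlighter-rouge">glob</code> module.</p>

```python
import glob
import os
import tempfile


def list_shards(root, mode):
    """Return the sorted shard files holding the examples of a mode."""
    pattern = os.path.join(root, mode, 'examples', 'part-*')
    return sorted(glob.glob(pattern))


# Build a local replica of the hierarchy with empty placeholder shards
root = tempfile.mkdtemp()
for mode, count in [('training', 6), ('validation', 4), ('testing', 4)]:
    os.makedirs(os.path.join(root, mode, 'examples'))
    for index in range(count):
        name = 'part-{:06d}-of-{:05d}'.format(index, count)
        open(os.path.join(root, mode, 'examples', name), 'w').close()

# Collect the shards of each mode
shards = {
    mode: list_shards(root, mode)
    for mode in ['training', 'validation', 'testing']
}
for mode, paths in shards.items():
    print(mode, len(paths))
```

<p>Since each mode is spread over several files, a reader can open many of them
at once, which is what the pattern ending with <code class="language-plaintext highlighter-rouge">part-*</code> is meant to enable.</p>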
<h1 id="execution">Execution</h1>
<p>At this point, the data have made it all the way to the execution phase,
referring to training, validation, and testing; however, the data are yet to be
injected into a TensorFlow graph, which is the topic of this section. As before,
relevant parameters are kept in a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/execution.json">separate configuration file</a>:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span>  <span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span>    <span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span>      <span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"latitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>      <span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"longitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>      <span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"duration"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">]</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>      <span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w">
</span>    <span class="p">],</span><span class="w">
</span>    <span class="nl">"modes"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span>      <span class="nl">"training"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span>        <span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"transformed"</span><span class="p">,</span><span class="w">
</span>        <span class="nl">"shuffle_macro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"shuffle_micro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">512</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"repeat"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
</span>      <span class="p">},</span><span class="w">
</span>      <span class="nl">"validation"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span>        <span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"transformed"</span><span class="p">,</span><span class="w">
</span>        <span class="nl">"shuffle_macro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"repeat"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w">
</span>      <span class="p">},</span><span class="w">
</span>      <span class="nl">"testing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span>        <span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"original"</span><span class="p">,</span><span class="w">
</span>        <span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w">
</span>        <span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w">
</span>      <span class="p">}</span><span class="w">
</span>    <span class="p">}</span><span class="w">
</span>  <span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>It can be seen that the file contains only one block: <code class="language-plaintext highlighter-rouge">data</code>. This is sufficient
for the purposes of this article; however, the file is also meant to cover the
construction of the model, including its hyperparameters, and the
execution process, including the optimizer and evaluation metrics.</p>
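<p>The <code class="language-plaintext highlighter-rouge">data</code> block is similar to the one we saw before. In this case, <code class="language-plaintext highlighter-rouge">modes</code>
describes various calls to the <a href="https://www.tensorflow.org/guide/data"><code class="language-plaintext highlighter-rouge">tf.data</code></a> API related to shuffling, batching,
and so on. Those who are familiar with the API will probably recognize them
immediately. Note that setting <code class="language-plaintext highlighter-rouge">num_parallel_calls</code> to <code class="language-plaintext highlighter-rouge">-1</code> corresponds to
<code class="language-plaintext highlighter-rouge">tf.data.experimental.AUTOTUNE</code>, which delegates the choice of the level of
parallelism to TensorFlow. It is now instructive to go straight to the Python code.</p>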
<p>Below is an excerpt from a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/data.py">Python class</a> responsible for building the
pipeline on the TensorFlow side:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config = ...
</span>
<span class="c1"># List all files matching a given pattern
</span><span class="n">pattern</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="s">'examples'</span><span class="p">,</span> <span class="s">'part-*'</span><span class="p">]</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">list_files</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="o">*</span><span class="n">pattern</span><span class="p">))</span>
<span class="c1"># Shuffle the files if needed
</span><span class="k">if</span> <span class="s">'shuffle_macro'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'shuffle_macro'</span><span class="p">])</span>
<span class="c1"># Convert the files into datasets of examples stored as TFRecords and
# amalgamate these datasets into one dataset of examples
</span><span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span> \
    <span class="p">.</span><span class="n">interleave</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">TFRecordDataset</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'interleave'</span><span class="p">])</span>
<span class="c1"># Shuffle the examples if needed
</span><span class="k">if</span> <span class="s">'shuffle_micro'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'shuffle_micro'</span><span class="p">])</span>
<span class="c1"># Preprocess the examples with respect to a given spec, pad the examples
# and form batches of different sizes, and postprocess the batches
</span><span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span> \
    <span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">_preprocess</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'map'</span><span class="p">])</span> \
    <span class="p">.</span><span class="n">padded_batch</span><span class="p">(</span><span class="n">padded_shapes</span><span class="o">=</span><span class="n">_shape</span><span class="p">(),</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'batch'</span><span class="p">])</span> \
    <span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">_postprocess</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'map'</span><span class="p">])</span>
<span class="c1"># Prefetch the batches if needed
</span><span class="k">if</span> <span class="s">'prefetch'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">prefetch</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'prefetch'</span><span class="p">])</span>
<span class="c1"># Repeat the data once the source is exhausted if needed
</span><span class="k">if</span> <span class="s">'repeat'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'repeat'</span><span class="p">])</span>
</code></pre></div></div>
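<p>To make the interleaving step more tangible, here is a pure-Python sketch of
the round-robin behavior that amalgamates several files into one stream of
examples. It is an emulation written for illustration and not TensorFlow's
implementation: the function <code class="language-plaintext highlighter-rouge">interleave</code> is hypothetical, lists of strings
stand in for files and their records, and any parallelism is ignored.</p>

```python
from itertools import islice


def interleave(sources, cycle_length, block_length=1):
    """Emulate a deterministic round-robin interleave over several sources."""
    pending = iter(sources)
    # Open at most cycle_length sources at a time
    active = [iter(source) for source in islice(pending, cycle_length)]
    while active:
        index = 0
        while index < len(active):
            # Take up to block_length consecutive elements from one source
            block = list(islice(active[index], block_length))
            yield from block
            if len(block) == block_length:
                index += 1
                continue
            # The source is exhausted; open the next pending one in its place
            replacement = next(pending, None)
            if replacement is None:
                del active[index]
            else:
                active[index] = iter(replacement)
                index += 1


# Three hypothetical files with different numbers of records
files = [
    ['a1', 'a2', 'a3'],
    ['b1'],
    ['c1', 'c2'],
]
records = list(interleave(files, cycle_length=2))
print(records)  # ['a1', 'b1', 'a2', 'a3', 'c1', 'c2']
```

<p>With a cycle length of two, only two files are open at any moment, and the
third one is picked up as soon as one of the first two runs out of records.</p>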
<p>The pipeline is self-explanatory. It is simply a chain of operations stacked on
top of each other. It is, however, worth taking a closer look at the
preprocessing and postprocessing mappings, which can be seen before and after
the padding step, respectively:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_preprocess</span><span class="p">(</span><span class="n">proto</span><span class="p">):</span>
    <span class="n">spec</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">transforms</span><span class="p">[</span><span class="n">config</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]]</span> \
        <span class="p">.</span><span class="n">transformed_feature_spec</span><span class="p">()</span>
    <span class="n">example</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">parse_single_example</span><span class="p">(</span><span class="n">proto</span><span class="p">,</span> <span class="n">spec</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span>
        <span class="p">{</span><span class="n">name</span><span class="p">:</span> <span class="n">example</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">contextual_names</span><span class="p">},</span>
        <span class="p">{</span>
            <span class="c1"># Convert the sequential columns from sparse to dense
</span>            <span class="n">name</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">schema</span><span class="p">[</span><span class="n">name</span><span class="p">].</span><span class="n">to_dense</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="n">name</span><span class="p">])</span>
            <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequential_names</span>
        <span class="p">},</span>
    <span class="p">)</span>

<span class="k">def</span> <span class="nf">_postprocess</span><span class="p">(</span><span class="n">contextual</span><span class="p">,</span> <span class="n">sequential</span><span class="p">):</span>
    <span class="n">sequential</span> <span class="o">=</span> <span class="p">{</span>
        <span class="c1"># Convert the sequential columns from dense to sparse
</span>        <span class="n">name</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">schema</span><span class="p">[</span><span class="n">name</span><span class="p">].</span><span class="n">to_sparse</span><span class="p">(</span><span class="n">sequential</span><span class="p">[</span><span class="n">name</span><span class="p">])</span>
        <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequential_names</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="p">{</span><span class="o">**</span><span class="n">contextual</span><span class="p">,</span> <span class="o">**</span><span class="n">sequential</span><span class="p">}</span>
</code></pre></div></div>
<p>Currently, <code class="language-plaintext highlighter-rouge">tf.data</code> does not support padding sparse tensors, which is the
representation used for sequential features in TensorFlow. In the running
example about forecasting weather, such features are <code class="language-plaintext highlighter-rouge">duration</code> and
<code class="language-plaintext highlighter-rouge">temperature</code>. This is the reason such features are converted to their dense
counterparts in <code class="language-plaintext highlighter-rouge">_preprocess</code>. However, the final representation still has to
be sparse. Therefore, the sequential features are converted back to the
sparse format in <code class="language-plaintext highlighter-rouge">_postprocess</code>. Hopefully, this back-and-forth conversion will
be rendered obsolete in future versions.</p>
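<p>The round trip can be illustrated without TensorFlow. The sketch below pads
variable-length sequences into a dense batch and then recovers a sparse triple
of indices, values, and shape. The helpers <code class="language-plaintext highlighter-rouge">pad_batch</code> and <code class="language-plaintext highlighter-rouge">to_sparse</code> are
hypothetical and are not the <code class="language-plaintext highlighter-rouge">to_dense</code> and <code class="language-plaintext highlighter-rouge">to_sparse</code> methods of the
repository; in particular, keeping track of the original lengths is an
assumption made here so that genuine zeros are not mistaken for padding.</p>

```python
def pad_batch(sequences, padding=0.0):
    """Pad variable-length sequences to the length of the longest one."""
    width = max(len(sequence) for sequence in sequences)
    dense = [
        list(sequence) + [padding] * (width - len(sequence))
        for sequence in sequences
    ]
    lengths = [len(sequence) for sequence in sequences]
    return dense, lengths


def to_sparse(dense, lengths):
    """Convert a padded batch into a triple of indices, values, and shape."""
    indices, values = [], []
    for i, (row, length) in enumerate(zip(dense, lengths)):
        # Keep only the first length entries; the rest is padding
        for j in range(length):
            indices.append((i, j))
            values.append(row[j])
    shape = (len(dense), len(dense[0]) if dense else 0)
    return indices, values, shape


# Two hypothetical temperature sequences of different lengths
temperature = [[3.5, 4.0, 0.0], [7.25]]
dense, lengths = pad_batch(temperature)
indices, values, shape = to_sparse(dense, lengths)
print(dense)
print(indices, values, shape)
```

<p>Note that the zero at the end of the first sequence is an actual observation
and survives the conversion, whereas the zeros appended to the second sequence
do not.</p>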
<p>Having executed the above steps, we have an instance of <a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset"><code class="language-plaintext highlighter-rouge">tf.data.Dataset</code></a>,
which is the ultimate goal, as it is the standard way of ingesting data into a
TensorFlow graph. At this point, one might create a Keras model leveraging
<a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/DenseFeatures"><code class="language-plaintext highlighter-rouge">tf.keras.layers.DenseFeatures</code></a> and <a href="https://www.tensorflow.org/api_docs/python/tf/keras/experimental/SequenceFeatures"><code class="language-plaintext highlighter-rouge">tf.keras.experimental.SequenceFeatures</code></a>
for constructing the input layer and then pass the data set to the <code class="language-plaintext highlighter-rouge">fit</code>
function of the model. A <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/model.py">skeleton</a> for this part can be found in the
repository.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this article, we have discussed a scalable approach to the ingestion of
sequential observations from BigQuery into a TensorFlow graph. The key tools
that have been used to this end are TensorFlow Extended in combination with
Cloud Dataflow and the <code class="language-plaintext highlighter-rouge">tf.data</code> API of TensorFlow.</p>
<p>In addition, the provided code has been written to be general and easily
customizable. This has been achieved by separating the configuration from the
implementation.</p>
<h1 id="references">References</h1>
<ul>
<li>Lak Lakshmanan, “<a href="https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning">Repeatable sampling of data sets in BigQuery for machine
learning</a>,” 2016.</li>
</ul>
Ivan Ukhov
How hard can it be to ingest sequential data into a TensorFlow model? As always, the answer is, “It depends.” Where are the sequences in question stored? Can they fit in main memory? Are they of the same length? In what follows, we shall build a flexible and scalable workflow for feeding sequential observations into a TensorFlow graph starting from BigQuery as the data warehouse.
Sample size determination using historical data and simulation
2019-09-25T06:00:00+00:00
https://blog.ivanukhov.com/2019/09/25/bootstrap
<p>In order to test a hypothesis, one has to design and execute an adequate
experiment. Typically, it is neither feasible nor desirable to involve the whole
population. Instead, a relatively small subset of the population is studied, and
given the outcome for this small sample, relevant conclusions are drawn with
respect to the population. An important question to answer is then, What is the
minimal sample size needed for the experiment to succeed? In what follows, we
answer this question using solely historical data and computer simulation,
without invoking any classical statistical procedures.</p>
<p>Although, as we shall see, the ideas are straightforward, direct calculations
were impossible to perform before computers. To be able to answer questions of
this kind back then, statisticians developed mathematical theories in order to
approximate the calculations for specific situations. Since nothing else was
possible, these approximations and the various terms and conditions under which
they operate made up a large part of traditional textbooks and courses in
statistics. However, the advent of today’s computing power has enabled one to
estimate required sample sizes in a more direct and intuitive way, with the only
prerequisites being an understanding of statistical inference, the availability
of historical data describing the status quo, and the ability to write a few
lines of code in a programming language.</p>
<h1 id="problem">Problem</h1>
<p>For concreteness, consider the following scenario. We run an online business and
hypothesize that a specific change in promotion campaigns, such as making them
personalized, will have a positive effect on a specific performance metric, such
as the average deposit. In order to investigate whether this is the case, we decide to
perform a two-sample test. There are the following two competing hypotheses.</p>
<ul>
<li>
<p>The null hypothesis postulates that the change has no effect on the metric.</p>
</li>
<li>
<p>The alternative hypothesis postulates that the change has a positive effect on
the metric.</p>
</li>
</ul>
<p>There will be two groups: a control group and a treatment group. The former will
be exposed to the current promotion policy, while the latter to the new one.
There are also certain requirements imposed on the test. First, we have a level
of statistical significance \(\alpha\) and a level of practical significance
\(\delta\) in mind. The former puts a limit on the false-positive rate, and the
latter indicates the smallest effect that we still care about; anything smaller
is as good as zero for any practical purpose. In addition, we require the test
to have a prescribed false-negative rate \(\beta\), ensuring that the test has
enough statistical power.</p>
<p>For our purposes, the test is considered well designed if it is capable of
detecting a difference as small as \(\delta\) so that the false-positive and
false-negative rates are controlled to levels \(\alpha\) and \(\beta\),
respectively. Typically, parameters \(\alpha\) and \(\delta\) are held constant,
and the desired false-negative rate \(\beta\) is attained by varying the number
of participants in each group, which we denote by \(n\). Note that we do not
want any of the parameters to be smaller than the prescribed values, as it would
be wasteful.</p>
<p>So what should the sample size be for the test to be well designed?</p>
<h1 id="solution">Solution</h1>
<p>Depending on the distribution of the data and on the chosen metric, one might or
might not be able to find a suitable test among the standard ones, while
ensuring that the test’s assumptions can safely be considered satisfied. More
importantly, a textbook solution might not be the most intuitive one, which, in
particular, might lead to misuse of the test. It is the understanding that
matters.</p>
<p>Here we take a more pragmatic and rather general approach that circumvents the
above concerns. It requires only historical data and basic programming skills.
Despite its simplicity, the method below goes straight to the core of what the
famed statistical tests are doing behind all the math. The approach belongs to
the class of so-called bootstrap techniques and is as follows.</p>
<p>Suppose we have historical data on customers’ behavior under the current
promotion policy, which is commonplace in practice. An important realization is
that this data set represents what we expect to observe in the control group. It
is also what is expected of the treatment group provided that the null
hypothesis is true, that is, when the proposed change has no effect. This
realization enables one to simulate what would happen if each group were limited
to an arbitrary number of participants. Then, by varying this size parameter, it
is possible to find the smallest value that makes the test well designed, that
is, makes the test satisfy the requirements on \(\alpha\), \(\beta\), and
\(\delta\), as discussed in the previous section.</p>
<p>This is all. The rest is an elaboration of the above idea.</p>
<p>The simulation entails the following. To begin with, note that what we are
interested in testing is the difference between the performance metric applied
to the treatment group and the same metric applied to the control group, which
is referred to as the test statistic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Test statistic = Metric(Treatment sample) - Metric(Control sample).
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Treatment sample</code> and <code class="language-plaintext highlighter-rouge">Control sample</code> stand for sets of observations, and
<code class="language-plaintext highlighter-rouge">Metric(Sample)</code> stands for computing the performance metric given such a
sample. For instance, each observation could be the total deposit of a customer,
and the metric could be the average value:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metric(Sample) = Sum of observations / Number of observations.
</code></pre></div></div>
<p>Note, however, that it is an example; the metric can be arbitrary, and this is a
huge advantage of this approach to sample size determination based on data and
simulation.</p>
<p>Large positive values of the test statistic speak in favor of the treatment
(that is, the new promotion policy in our example), while those that are close
to zero suggest that the treatment is futile.</p>
<p>A sample of \(n\) observations corresponding to the status quo (that is, the
current policy in our example) can be easily obtained by drawing \(n\) data
points with replacement from the historical data:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sample = Choose random with replacement(Data, N).
</code></pre></div></div>
<p>This expression is used for <code class="language-plaintext highlighter-rouge">Control sample</code> under both the null and alternative
hypotheses. As alluded to earlier, this is also how <code class="language-plaintext highlighter-rouge">Treatment sample</code> is
obtained under the null. Regarding the alternative hypothesis being true, one
has to express the hypothesized outcome as a distribution for the case of the
minimal detectable difference, \(\delta\). A simple and reasonable solution
is to sample the data again, apply the metric, and then adjust the result to
reflect the alternative hypothesis:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metric(Choose random with replacement(Data, N)) + Delta.
</code></pre></div></div>
<p>Here, again, one is free to change the logic under the alternative according to
the situation at hand. For instance, instead of an additive effect, one could
simulate a multiplicative one.</p>
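<p>As a minimal sketch, a single simulated experiment could look as follows in R.
The data are artificial, and the sample size of 500 and the effect of 10% are
assumptions made purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Artificial stand-in for historical data (an assumption for illustration)
data <- rlnorm(20000)
# Hypothetical group size and minimal detectable difference
n <- 500
delta <- 0.1 * mean(data)

# Draw n observations with replacement from the historical data
draw <- function() sample(data, n, replace = TRUE)

# One experiment under the null hypothesis: both groups follow the status quo
statistic_null <- mean(draw()) - mean(draw())

# One experiment under the alternative with an additive effect of delta
statistic_additive <- (mean(draw()) + delta) - mean(draw())

# The same with a multiplicative effect of 10% instead
statistic_multiplicative <- 1.1 * mean(draw()) - mean(draw())
</code></pre></figure>
<p>Each of the three quantities is a single realization of the test statistic;
rerunning the snippet without the seed gives different values every time.</p>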
<p>The above is a way to simulate a single instance of the experiment under either
the null or alternative hypothesis; the result is a single value for the test
statistic. The next step is to estimate how the test statistic would vary if the
experiment were repeated many times in the two scenarios. This simply means that
the procedure should be repeated multiple times:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repeat many times {
Sample 1 = Choose random with replacement(Data, N)
Sample 2 = Choose random with replacement(Data, N)
Metric 1 = Metric(Sample 1)
Metric 2 = Metric(Sample 2)
Test statistic under null = Metric 1 - Metric 2
Sample 3 = Choose random with replacement(Data, N)
Sample 4 = Choose random with replacement(Data, N)
Metric 3 = Metric(Sample 3) + Delta
Metric 4 = Metric(Sample 4)
Test statistic under alternative = Metric 3 - Metric 4
}
</code></pre></div></div>
<p>This yields a collection of values for the test statistic under the null
hypothesis and a collection of values for the test statistic under the
alternative hypothesis. Each one contains realizations from the so-called
sampling distribution in the corresponding scenario. The following figure gives
an illustration:</p>
<p><img src="/assets/images/2019-09-25-bootstrap/sampling-distribution-1.svg" alt="" /></p>
<p>The blue shape is the sampling distribution under the null hypothesis, and the
red one is the sampling distribution under the alternative hypothesis. We shall
come back to this figure shortly.</p>
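<p>The two collections can be produced with a few lines of R. The sketch below
relies on artificial data and on an assumed sample size of 500 and 1000
replications; it only checks that the null distribution is centered near zero
and the alternative one near \(\delta\).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Artificial stand-in for historical data (an assumption for illustration)
data <- rlnorm(20000)
n <- 500
delta <- 0.1 * mean(data)
replication_count <- 1000

# Metric of one bootstrap sample of size n
metric_once <- function() mean(sample(data, n, replace = TRUE))

# Realizations of the test statistic in the two scenarios
statistic_null <- replicate(replication_count, metric_once() - metric_once())
statistic_alternative <- replicate(replication_count,
                                   (metric_once() + delta) - metric_once())

# Centers of the two sampling distributions next to delta
c(null = mean(statistic_null),
  alternative = mean(statistic_alternative),
  delta = delta)
</code></pre></figure>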
<p>These two distributions of the test statistic are what we are after, as they
allow one to compute the false-negative rate and eventually choose a sample
size. First, given \(\alpha\), the sampling distribution under the null (the
blue one) is used in order to find a value beyond which the probability mass is
equal to \(\alpha\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Critical value = Quantile([Test statistic under null], 1 - alpha).
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Quantile</code> computes the quantile specified by the second argument given a set of
observations. This quantity is called the critical value of the test. In the
figure above, it is denoted by a dashed line. When the test statistic falls to
the right of the critical value, we reject the null hypothesis; otherwise, we
fail to reject it. Second, the sampling distribution in the case of the
alternative hypothesis being true (the red one) is used in order to compute the
false-negative rate:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Attained beta = Mean([Test statistic under alternative < Critical value]).
</code></pre></div></div>
<p>It corresponds to the probability mass of the sampling distribution under the
alternative to the left of the critical value. In the figure, it is the red area
to the left of the dashed line.</p>
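<p>The two computations can be sketched in R as follows, again with artificial
data and an assumed sample size of 500. The last two lines are additions for
illustration: the share of null replications beyond the critical value should
be close to \(\alpha\) by construction, and the power is the complement of the
attained false-negative rate.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Artificial stand-in for historical data (an assumption for illustration)
data <- rlnorm(20000)
n <- 500
alpha <- 0.05
delta <- 0.1 * mean(data)

metric_once <- function() mean(sample(data, n, replace = TRUE))
statistic_null <- replicate(1000, metric_once() - metric_once())
statistic_alternative <- replicate(1000, (metric_once() + delta) - metric_once())

# Value beyond which the null distribution has a probability mass of alpha
critical_value <- quantile(statistic_null, 1 - alpha)

# Share of the alternative distribution to the left of the critical value
attained_beta <- mean(statistic_alternative < critical_value)

# Sanity check and statistical power
false_positive_rate <- mean(statistic_null > critical_value)
power <- 1 - attained_beta
</code></pre></figure>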
<p>The final step is to put the above procedure in an optimization loop that
minimizes the distance between the target and attained \(\beta\)’s with respect
to the sample size:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimize N until Attained beta is close to Target beta {
Repeat many times {
Test statistic under null = ...
Test statistic under alternative = ...
}
Critical value = ...
Attained beta = ...
}
</code></pre></div></div>
<p>This concludes the calculation of the size that the control and treatment groups
should have in order for the upcoming test in promotion campaigns to be well
designed in terms of the level of statistical significance \(\alpha\), the
false-negative rate \(\beta\), and the level of practical significance
\(\delta\).</p>
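<p>The outer loop does not have to be a sophisticated optimizer. As a sketch under
stated assumptions, the snippet below replaces it with a plain search over a
hypothetical grid of candidate sizes and picks the smallest one whose attained
\(\beta\) does not exceed the target. The data, the grid, and the number of
replications are illustrative choices.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Artificial stand-in for historical data (an assumption for illustration)
data <- rlnorm(20000)
alpha <- 0.05
target_beta <- 0.2
delta <- 0.1 * mean(data)

# Attained false-negative rate for groups of size n
attained_beta <- function(n, replication_count = 500) {
  metric_once <- function() mean(sample(data, n, replace = TRUE))
  statistic_null <- replicate(replication_count,
                              metric_once() - metric_once())
  statistic_alternative <- replicate(replication_count,
                                     (metric_once() + delta) - metric_once())
  mean(statistic_alternative < quantile(statistic_null, 1 - alpha))
}

# Hypothetical grid of candidate sample sizes
candidates <- c(500, 1000, 2000, 4000)
betas <- sapply(candidates, attained_beta)

# Smallest candidate meeting the target (NA if none of them does)
sample_size <- candidates[which(betas <= target_beta)[1]]
</code></pre></figure>
<p>A coarse grid tolerates simulation noise at the cost of resolution; the grid
can be refined around the selected value if a more precise answer is needed.</p>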
<p>An example of how this technique could be implemented in practice can be found
in the appendix.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this article, we have discussed an approach to sample size determination that
is based on historical data and computer simulation rather than on mathematical
formulae tailored for specific situations. It is general and straightforward to
implement. More importantly, the technique is intuitive, since it directly
follows the narrative of null hypothesis significance testing. It does require
prior knowledge of the key concepts in statistical inference. However, this
knowledge is arguably essential for those who are involved in scientific
experimentation. It constitutes the core of statistical literacy.</p>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>This article was inspired by a blog post authored by <a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">Allen Downey</a> and a talk
given by <a href="https://www.youtube.com/watch?v=5Dnw46eC-0o">John Rauser</a>. I also would like to thank <a href="http://users.stat.umn.edu/~rend0020/">Aaron Rendahl</a> for his
feedback on the introduction to the method presented here and for his help with
the implementation given in the appendix.</p>
<h1 id="references">References</h1>
<ul>
<li>Allen Downey, “<a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">There is only one test!</a>,” 2011.</li>
<li>John Rauser, “<a href="https://www.youtube.com/watch?v=5Dnw46eC-0o">Statistics without the agonizing pain</a>,” 2014.</li>
<li>Joseph Lee Rodgers, “<a href="https://doi.org/10.1207/S15327906MBR3404_2">The bootstrap, the jackknife, and the randomization
test: A sampling taxonomy</a>,” Multivariate Behavioral Research,
2010.</li>
</ul>
<h1 id="appendix">Appendix</h1>
<p>The following listing shows an implementation of the bootstrap approach in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="c1"># Artificial data for illustration</span><span class="w">
</span><span class="n">observation_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20000</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="n">observation_count</span><span class="p">))</span><span class="w">
</span><span class="c1"># Performance metric</span><span class="w">
</span><span class="n">metric</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="w">
</span><span class="c1"># Statistical significance</span><span class="w">
</span><span class="n">alpha</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.05</span><span class="w">
</span><span class="c1"># False-negative rate</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.2</span><span class="w">
</span><span class="c1"># Practical significance</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">metric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">value</span><span class="p">)</span><span class="w">
</span><span class="n">simulate</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span><span class="w"> </span><span class="n">replication_count</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Function for drawing a single sample of size sample_size</span><span class="w">
</span><span class="n">run_one</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">sample_size</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># Function for drawing replication_count samples of size sample_size</span><span class="w">
</span><span class="n">run_many</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">replication_count</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">metric</span><span class="p">(</span><span class="n">run_one</span><span class="p">())</span><span class="w"> </span><span class="p">})</span><span class="w">
</span><span class="c1"># Simulation under the null hypothesis</span><span class="w">
</span><span class="n">control_null</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w">
</span><span class="n">treatment_null</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w">
</span><span class="n">difference_null</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">treatment_null</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">control_null</span><span class="w">
</span><span class="c1"># Simulation under the alternative hypothesis</span><span class="w">
</span><span class="n">control_alternative</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w">
</span><span class="n">treatment_alternative</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta</span><span class="w">
</span><span class="n">difference_alternative</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">treatment_alternative</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">control_alternative</span><span class="w">
</span><span class="c1"># Computation of the critical value</span><span class="w">
</span><span class="n">critical_value</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">difference_null</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">alpha</span><span class="p">)</span><span class="w">
</span><span class="c1"># Computation of the false-negative rate</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">difference_alternative</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">critical_value</span><span class="p">)</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="n">difference_null</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">difference_null</span><span class="p">,</span><span class="w">
</span><span class="n">difference_alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">difference_alternative</span><span class="p">,</span><span class="w">
</span><span class="n">critical_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">critical_value</span><span class="p">,</span><span class="w">
</span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Number of replications</span><span class="w">
</span><span class="n">replication_count</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="c1"># Interval of possible values for the sample size</span><span class="w">
</span><span class="n">search_interval</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w">
</span><span class="c1"># Root finding to attain the desired value by varying the sample size</span><span class="w">
</span><span class="n">target</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">simulate</span><span class="p">(</span><span class="nf">as.integer</span><span class="p">(</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">replication_count</span><span class="p">)</span><span class="o">$</span><span class="n">beta</span><span class="w">
</span><span class="n">sample_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">uniroot</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">search_interval</span><span class="p">)</span><span class="o">$</span><span class="n">root</span><span class="p">)</span></code></pre></figure>
<p>The illustrative figure shown in the solution section displays the sampling
distribution of the test statistic under the null and alternative for the sample
size found by this code snippet.</p>
Ivan Ukhov
A Bayesian approach to the inference of the net promoter score
2019-08-19T06:00:00+00:00
https://blog.ivanukhov.com/2019/08/19/net-promoter
<p>The net promoter score is a widely adopted metric for gauging customers’
satisfaction with a product. The popularity of the score is arguably attributed
to the simplicity of measurement and the intuitiveness of interpretation.
Moreover, it is claimed to be correlated with revenue growth, which, ignoring
causality, makes it even more appealing. In this article, we leverage Bayesian
statistics in order to infer the net promoter score for an arbitrary
segmentation of a customer base. The outcome of the inference is a distribution
over all possible values of the score weighted by probabilities, which provides
exhaustive information for the subsequent decision-making.</p>
<p>A bare-bones net promoter survey is composed of only one question: “How likely
are you to recommend us to a friend?” The answer is an integer ranging from 0 to
10 inclusively. If the grade is between 0 and 6 inclusively, the person in
question is said to be a detractor. If it is 7 or 8, the person is said to be a
neutral. Lastly, if it is 9 or 10, the person is deemed a promoter. The net
promoter score itself is then the percentage of promoters minus the percentage
of detractors. The minimum and maximum attainable values of the score are −100
and 100, respectively. In this case, the greater, the better.</p>
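<p>The classification and the score can be illustrated with a short R snippet. The
ten responses below are made up for the sake of the example.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical survey responses on the scale from 0 to 10
grades <- c(10, 9, 9, 8, 7, 6, 3, 10, 5, 9)

# 0-6 are detractors, 7-8 are neutrals, and 9-10 are promoters
category <- cut(grades,
                breaks = c(-Inf, 6, 8, 10),
                labels = c("detractor", "neutral", "promoter"))

promoters <- sum(category == "promoter")
detractors <- sum(category == "detractor")

# Percentage of promoters minus percentage of detractors
score <- 100 * (promoters - detractors) / length(grades)
score  # 20, since there are 5 promoters and 3 detractors out of 10
</code></pre></figure>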
<p>As is usually the case with surveys, a small but representative subset of
customers is reached out to, and the collected responses are then used to draw
conclusions about the target population of customers. Our objective is to
facilitate this last step by estimating the net promoter score given a set of
responses while quantifying, and putting front and center, the uncertainty in
our estimates.</p>
<p>Before we proceed, since a net promoter survey is an observational study, which
is prone to such biases as participation and response biases, great care must be
taken when analyzing the results. In this article, however, we focus on the
inference of the net promoter score under the assumption that the given sample
of responses is representative of the target population.</p>
<h1 id="problem">Problem</h1>
<p>In practice, one is interested in knowing the net promoter score for different
subpopulations of customers, such as countries of operation and age groups,
which is the scenario that we shall target. To this end, suppose that there are
\(m\) segments of interest, and each customer belongs to exactly one of them.
The results of a net promoter survey can then be summarized using the following
\(m \times 3\) matrix:</p>
\[y = \left(
\begin{matrix}
d_1 & n_1 & p_1 \\
\vdots & \vdots & \vdots \\
d_i & n_i & p_i \\
\vdots & \vdots & \vdots \\
d_m & n_m & p_m
\end{matrix}
\right)\]
<p>where \(d_i\), \(n_i\), and \(p_i\) denote the number of detractors, neutrals,
and promoters in segment \(i\), respectively. For segment \(i\), the <em>observed</em>
net promoter score can be computed as follows:</p>
\[\hat{s}_i = 100 \times \frac{p_i - d_i}{d_i + n_i + p_i}.\]
<p>However, this observed score is a single scalar value calculated using \(d_i +
n_i + p_i\) data points, which is only a subset of the corresponding
subpopulation. It may or may not correspond well to the actual net promoter
score of that subpopulation. We have no reason to trust it, since the above
estimate alone does not tell us anything about the uncertainty associated with
it. Uncertainty quantification is essential for sound decision-making, which is
what we are after.</p>
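<p>To see why the observed score alone is not enough, consider the following
sketch in R with two hypothetical segments. The counts are invented so that the
second segment has exactly 100 times more responses than the first one.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical counts of detractors, neutrals, and promoters in two segments
y <- rbind(small = c(d = 2, n = 3, p = 5),
           large = c(d = 200, n = 300, p = 500))

# Observed net promoter score in each segment
observed_score <- 100 * (y[, "p"] - y[, "d"]) / rowSums(y)
observed_score  # 30 for both segments
</code></pre></figure>
<p>Both segments have the same observed score, even though the first one rests on
10 responses and the second one on 1000; the scalar says nothing about how much
more reliable the latter is.</p>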
<p>Ideally, for each segment, given the observed data, we would like to have a
distribution of all possible values of the score with probabilities attached.
Such a probability distribution would be exhaustive information, from which any
other statistic could be easily derived. Here we tackle the problem by means of
Bayesian inference, which we discuss next.</p>
<h1 id="solution">Solution</h1>
<p>In order to perform Bayesian inference of the net promoter score, we need to
decide on an adequate Bayesian model for the problem at hand. Recall first that
we are interested in inferring scores for several segments. Even though there
might be segment-specific variations in the product, such as special offers in
certain countries, or in customers’ perception of the product, such as
age-related preferences, it is conceptually the same product that the customers
were asked to evaluate. It is then sensible to expect the scores in different
segments to have something in common. With this in mind, we construct a
hierarchical model with parameters shared by the segments.</p>
<p>First, let</p>
\[\theta_i = (\theta_{id}, \theta_{in}, \theta_{ip}) \in \langle 0, 1 \rangle^3\]
<p>be a triplet of parameters corresponding to the proportion of detractors,
neutrals, and promoters in segment \(i\), respectively, with the constraint that
they have to sum up to one. The constraint makes the triplet a simplex, which is
what is emphasized by the angle brackets on the right-hand side. These are the
main parameters we are interested in inferring. If the true value of
\(\theta_i\) were known, the net promoter score would be computed as follows:</p>
\[s_i = 100 \times (\theta_{ip} - \theta_{id}).\]
<p>Parameter \(\theta_i\) can also be thought of as a vector of probabilities of
observing one of the three types of customers in segment \(i\), that is,
detractors, neutrals, and promoters. Then the natural model for the observed
data is a multinomial distribution with \(d_i + n_i + p_i\) trials and
probabilities \(\theta_i\):</p>
\[y_i | \theta_i \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i)\]
<p>where \(y_i\) refers to the \(i\)th row of matrix \(y\) introduced earlier. The
family of multinomial distributions is a generalization of the family of
binomial distributions to more than two outcomes.</p>
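<p>The data model can be explored by simulation. The R sketch below assumes
hypothetical true proportions, for which the true score is 30, and repeats a
survey with 100 responses many times in order to show how much the observed
score varies from one survey to another.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Hypothetical proportions of detractors, neutrals, and promoters
theta <- c(0.2, 0.3, 0.5)
response_count <- 100

# Observed scores in 1000 simulated surveys
scores <- replicate(1000, {
  y <- rmultinom(1, size = response_count, prob = theta)[, 1]
  100 * (y[3] - y[1]) / sum(y)
})

# Spread of the observed score around the true value of 30
quantile(scores, c(0.05, 0.5, 0.95))
</code></pre></figure>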
<p>The above gives a data distribution. In order to complete the modeling part, we
need to decide on a prior probability distribution for \(\theta_i\). Each
\(\theta_i\) is a simplex of probabilities. In such a case, a reasonable choice
is a Dirichlet distribution:</p>
\[\theta_i | \phi \sim \text{Dirichlet}(\phi)\]
<p>where \(\phi = (\phi_d, \phi_n, \phi_p)\) is a vector of strictly positive
parameters. This family of distributions is a generalization of the family of
beta distributions to more than two categories. Note that \(\phi\) is the same
for all segments, which is what enables information sharing. In particular, it
means that the less reliable estimates for segments with fewer observations will
be shrunk toward the more reliable estimates for segments with more
observations. In other words, with this architecture, segments with fewer
observations are able to draw strength from those with more observations.</p>
<p>How about \(\phi\)? This triplet is a characteristic of the product irrespective
of the segment. Its individual components can be utilized in order to encode
one’s prior knowledge about the net promoter score. Specifically, \(\phi_d\),
\(\phi_n\), and \(\phi_p\) could be set to imaginary observations of detractors,
neutrals, and promoters, respectively, reflecting one’s beliefs prior to
conducting the survey. The higher these imaginary counts are, the more certain
one claims to be about the true score. One could certainly set these
hyperparameters to fixed values; however, a more comprehensive solution is to
infer them from the data as well, giving the model more flexibility by making it
hierarchical. In addition, an inspection of \(\phi\) afterward can provide
insights into the overall satisfaction with the product.</p>
<p>We now need to specify a prior, or rather a hyperprior, for \(\phi\). We proceed
under the assumption that we have little knowledge about the true score. Even if
there were surveys in the past, it is still a valid choice, especially when the
product evolves rapidly, rendering prior surveys marginally relevant.</p>
<p>Now, it is more convenient to think in terms of expected values and variances
instead of imaginary counts, which is what \(\phi\) represents. Let us find an
alternative parameterization of the Dirichlet distribution. The expected value
of this distribution is as follows:</p>
\[\mu = (\mu_d, \mu_n, \mu_p) = \frac{\phi}{\phi_d + \phi_n + \phi_p} \in \langle 0, 1 \rangle^3.\]
<p>It can be seen that it is a simplex of proportions of detractors, neutrals, and
promoters of the whole population, which is similar to \(\theta_i\) describing
segment \(i\). Regarding the variance,</p>
\[\sigma^2 = \frac{1}{\phi_d + \phi_n + \phi_p}\]
<p>is considered to capture it sufficiently well. Solving the system of the last
two equations for \(\phi\) yields the following result:</p>
\[\phi = \frac{\mu}{\sigma^2}.\]
<p>The prior for \(\theta_i\) can then be rewritten as follows:</p>
\[\theta_i | \mu, \sigma \sim \text{Dirichlet}\left(\frac{\mu}{\sigma^2}\right).\]
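<p>The role of \(\sigma\) can be illustrated by sampling. Base R has no generator
for Dirichlet distributions, so the sketch below defines a hypothetical helper,
<code class="language-plaintext highlighter-rouge">rdirichlet</code>, based on the standard construction via normalized gamma variates.
The values of \(\mu\) and \(\sigma\) are assumptions chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
# Draw from a Dirichlet distribution by normalizing independent gamma variates
rdirichlet <- function(count, phi) {
  draws <- matrix(rgamma(count * length(phi), shape = phi),
                  ncol = length(phi), byrow = TRUE)
  draws / rowSums(draws)
}

# Hypothetical proportions of detractors, neutrals, and promoters
mu <- c(0.2, 0.3, 0.5)

for (sigma in c(0.1, 0.5)) {
  phi <- mu / sigma^2
  theta <- rdirichlet(10000, phi)
  score <- 100 * (theta[, 3] - theta[, 1])
  cat("sigma =", sigma, "phi =", phi,
      "mean score =", round(mean(score), 1),
      "sd of score =", round(sd(score), 1), "\n")
}
</code></pre></figure>
<p>In both cases, the scores are centered at the value implied by \(\mu\), while a
larger \(\sigma\) spreads the segments further apart.</p>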
<p>This new parameterization requires two hyperpriors: one is for \(\mu\), and one
is for \(\sigma\). For \(\mu\), a reasonable choice is a uniform distribution
(over a simplex), and for \(\sigma\), a half-Cauchy distribution:</p>
\[\begin{align}
& \mu \sim \text{Uniform}(\langle 0, 1 \rangle^3) \text{ and} \\
& \sigma \sim \text{Half-Cauchy}(0, 1).
\end{align}\]
<p>The two distributions are relatively weak, which is intended in order to let the
data speak for themselves. At this point, all parameters have been defined. Of
course, one could go further if the problem at hand had a deeper structure;
however, in this case, it is arguably not justifiable.</p>
<p>The final model is as follows:</p>
\[\begin{align}
y_i | \theta_i & \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i), \\
\theta_i | \mu, \sigma & \sim \text{Dirichlet}(\mu / \sigma^2), \\
\mu & \sim \text{Uniform}(\langle 0, 1 \rangle^3), \text{ and} \\
\sigma & \sim \text{Half-Cauchy}(0, 1).
\end{align}\]
<p>The posterior distribution factorizes as follows:</p>
\[p(\theta_1, \dots, \theta_m, \mu, \sigma | y) \propto
p(y | \theta_1, \dots, \theta_m) \,
p(\theta_1 | \mu, \sigma) \cdots
p(\theta_m | \mu, \sigma) \,
p(\mu) \,
p(\sigma),\]
<p>which relies on the usual assumption of independence given the parameters. One
could make a few simplifications by, for instance, leveraging the conjugacy of
the Dirichlet distribution with respect to the multinomial distribution;
however, it is not needed in practice, as we shall see shortly.</p>
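<p>For reference, the simplification alluded to above is the standard conjugate
update. Conditionally on the hyperparameters, the posterior of each
\(\theta_i\) is again a Dirichlet distribution whose parameters are the prior
ones plus the observed counts:</p>
\[\theta_i | y_i, \mu, \sigma \sim \text{Dirichlet}\left(
\frac{\mu_d}{\sigma^2} + d_i, \,
\frac{\mu_n}{\sigma^2} + n_i, \,
\frac{\mu_p}{\sigma^2} + p_i
\right).\]
<p>This holds only for fixed \(\mu\) and \(\sigma\); the hyperparameters
themselves do not have a closed-form posterior and still have to be sampled.</p>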
<p>The above posterior distribution is our ultimate goal. It is the one that gives
us a complete picture of what the true net promoter score in each segment might
be given the available evidence, that is, the responses from the survey. All
that is left is to draw a large enough sample from this distribution and start
to summarize and visualize the results.</p>
<p>Unfortunately, as one might suspect, drawing samples from the posterior
is not an easy task. It does not correspond to any standard distribution and
hence does not have a readily available random number generator. Fortunately,
the topic is sufficiently mature, and techniques have been developed for
sampling from complex distributions, such as the family of Markov chain Monte Carlo
methods. Unfortunately, the most effective and efficient of these techniques are
notoriously complex themselves, and it might be extremely difficult and tedious
to implement and apply them correctly in practice. Fortunately, the need for
versatile tools for modeling and inference with the focus on the problem at hand
and not on implementation details has been recognized and addressed. Nontrivial
scenarios can be tackled with a surprisingly small amount of effort nowadays,
which we illustrate next.</p>
<h1 id="implementation">Implementation</h1>
<p>In this section, we implement the model using the probabilistic programming
language <a href="https://mc-stan.org/">Stan</a>. Stan is straightforward to integrate into one’s workflow, as it
has interfaces for many general-purpose programming languages, including Python
and R. Here we only highlight the main points of the implementation and leave it
to the curious reader to discover Stan on their own.</p>
<p>The following listing is a complete implementation of the model:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span>
<span class="kt">int</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span> <span class="n">m</span><span class="p">;</span> <span class="c1">// The number of segments</span>
<span class="kt">int</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span> <span class="n">n</span><span class="p">;</span> <span class="c1">// The number of categories, which is always three</span>
<span class="kt">int</span> <span class="n">y</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">n</span><span class="p">];</span> <span class="c1">// The observed counts of detractors, neutrals, and promoters</span>
<span class="p">}</span>
<span class="n">parameters</span> <span class="p">{</span>
<span class="n">simplex</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">mu</span><span class="p">;</span>
<span class="n">real</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span> <span class="n">sigma</span><span class="p">;</span>
<span class="n">simplex</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">theta</span><span class="p">[</span><span class="n">m</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">transformed</span> <span class="n">parameters</span> <span class="p">{</span>
<span class="n">vector</span><span class="o"><</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">></span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">phi</span><span class="p">;</span>
<span class="n">phi</span> <span class="o">=</span> <span class="n">mu</span> <span class="o">/</span> <span class="n">sigma</span><span class="o">^</span><span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">model</span> <span class="p">{</span>
<span class="n">mu</span> <span class="o">~</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">sigma</span> <span class="o">~</span> <span class="n">cauchy</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span>
<span class="n">theta</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">~</span> <span class="n">dirichlet</span><span class="p">(</span><span class="n">phi</span><span class="p">);</span>
<span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">~</span> <span class="n">multinomial</span><span class="p">(</span><span class="n">theta</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It can be seen that the code is very laconic and follows closely the development
given in the previous section, including the notation. It is worth noting that,
in the model block, we seemingly use unconstrained uniform and Cauchy
distributions; however, the constraints are enforced by the definitions of the
corresponding hyperparameters, <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code>.</p>
<p>This is practically all that is needed; the rest will be taken care of by Stan,
which is actually a lot of work, including an adequate initialization, an
efficient execution, and necessary diagnostics and quality checks. Under the
hood, the sampling of the posterior in Stan is based on the Hamiltonian Monte
Carlo algorithm and the no-U-turn sampler, which are considered to be the
state of the art.</p>
<p>The output of the sampling procedure is a set of draws from the posterior
distribution, which, again, is exhaustive information about the net promoter
score in the segments of interest. In particular, one can quantify the
uncertainty in and the probability of any statement one makes about the score.
For instance, if a concise summary is needed, one could compute the mean of the
score and accompany it with a highest-posterior-density credible interval,
capturing the true value with the desired probability. However, if applicable,
the full distribution should be integrated into the decision-making process.</p>
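<p>The summaries mentioned above can be sketched as follows. The sketch takes the score of a segment to be the share of promoters minus the share of detractors. Two things are assumptions made for illustration: the draws are simulated from a fixed Dirichlet distribution as a stand-in for the actual output of Stan, and the interval is an equal-tailed one, which is a simpler substitute for the highest-posterior-density interval. All names are hypothetical.</p>

```python
import random


def summarize(scores, level=0.9):
    """Return the mean and an equal-tailed credible interval of the draws."""
    ordered = sorted(scores)
    count = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * count)]
    upper = ordered[int((1.0 - tail) * count) - 1]
    return sum(ordered) / count, (lower, upper)


def sample_dirichlet(alpha, generator):
    """Draw one point from Dirichlet(alpha) via normalized gamma variates."""
    gammas = [generator.gammavariate(value, 1.0) for value in alpha]
    total = sum(gammas)
    return [value / total for value in gammas]


generator = random.Random(42)
# Stand-in for posterior draws of theta in one segment; in practice, these
# would be read from the output of Stan rather than simulated.
draws = [sample_dirichlet([4.0, 7.0, 14.0], generator) for _ in range(4000)]
# The score is the share of promoters minus the share of detractors.
scores = [theta[2] - theta[0] for theta in draws]
mean, (lower, upper) = summarize(scores)
# The probability of a statement is the fraction of draws satisfying it.
probability = sum(score > 0.3 for score in scores) / len(scores)
print(mean, lower, upper, probability)
```

<p>The same pattern applies to any other statement about the score: evaluate it on each draw and average.</p>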
<h1 id="conclusion">Conclusion</h1>
<p>In this article, we have constructed a hierarchical Bayesian model for inferring
the net promoter score for an arbitrary segmentation of a customer base. The
model features shared parameters, which enable information exchange between the
segments. This allows for a more robust estimation of the score, especially in
the case of segments with few observations. The final output of the inference is
a probability distribution over all possible values of the score in each
segment, which lays a solid foundation for the subsequent decision-making. We
have also seen how seamlessly the model can be implemented in practice using
modern tools for statistical inference, such as Stan.</p>
<p>Lastly, note that the presented model is only one alternative; there are many
others. How would <em>you</em> model the net promoter score? What changes would you
make? Make sure to leave a comment.</p>
<h1 id="references">References</h1>
<ul>
<li>Andrew Gelman et al., <em><a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a></em>, Chapman and Hall/CRC,
2014.</li>
<li>Andrew Gelman, “<a href="https://statmodeling.stat.columbia.edu/2009/10/21/some_practical/">Some practical questions about prior distributions</a>,”
2009.</li>
</ul>Ivan UkhovThe net promoter score is a widely adopted metric for gauging customers’ satisfaction with a product. The popularity of the score is arguably attributed to the simplicity of measurement and the intuitiveness of interpretation. Moreover, it is claimed to be correlated with revenue growth, which, ignoring causality, makes it even more appealing. In this article, we leverage Bayesian statistics in order to infer the net promoter score for an arbitrary segmentation of a customer base. The outcome of the inference is a distribution over all possible values of the score weighted by probabilities, which provides exhaustive information for the subsequent decision-making.Interactive notebooks in tightly sealed disposable containers2019-07-24T06:00:00+00:002019-07-24T06:00:00+00:00https://blog.ivanukhov.com/2019/07/24/notebook<p>It is truly amazing how interactive notebooks—where a narrative in a spoken
language is entwined with executable chunks of code in a programming
language—have revolutionized the way we work with data and document our thought
processes and findings for others and, equally importantly, for our future
selves. They are ubiquitous and taken for granted. It is hard to imagine where
data enthusiasts would be without them. Most likely, we would be spending too
much time staring at a terminal window, anxiously re-running scripts from start
to finish, printing variables, and saving lots of files with tables and graphs
on disk for further inspection. Interactive notebooks are an essential tool in
the data scientist’s toolbox, and in this article, we are going to make them
readily available for our use with our favorite packages installed and
preferences set up, no matter where we find ourselves working and regardless of
the mess we might have left behind during the previous session.</p>
<p>Python and R (in alphabetical order) are arguably the primary languages used by
data scientists nowadays. In the context of interactive computations, <a href="https://ipython.org/">IPython</a>
and later on <a href="https://jupyter.org/">Project Jupyter</a> have been of paramount importance for the Python
community (the latter is actually language agnostic). In the R community, this
role has been played by <a href="https://www.rstudio.com/">RStudio</a>. Therefore, having at one’s disposal
<a href="https://jupyter.org/">JupyterLab</a>, which is Project Jupyter’s flagship, and RStudio should make one
well equipped for a wide range of data challenges. As alluded to earlier, the
objective is to have an environment that has a fixed initial state defined by us
and is accessible to us on any machine we might happen to work on. This problem
definition is a perfect fit for containerization. Specifically, we shall build
custom-tailored <a href="https://www.docker.com/">Docker</a> images for JupyterLab and RStudio and create a few
convenient shortcuts for launching them.</p>
<p>The code discussed below can be found in the following two repositories:</p>
<ul>
<li><a href="https://github.com/chain-rule/JupyterLab/tree/article">JupyterLab</a> and</li>
<li><a href="https://github.com/chain-rule/RStudio/tree/article">RStudio</a>.</li>
</ul>
<h1 id="jupyterlab">JupyterLab</h1>
<p>In order to build a Docker image for JupyterLab, we begin with a
<a href="https://github.com/chain-rule/JupyterLab/blob/article/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>:</p>
<div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start with a minimal Python image</span>
<span class="k">FROM</span><span class="s"> python:3.7-slim</span>
<span class="c"># Install the desired Python packages</span>
<span class="k">COPY</span><span class="s"> requirements.txt /tmp/requirements.txt</span>
<span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> pip
<span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--requirement</span> /tmp/requirements.txt
<span class="c"># Configure JupyterLab to use a specific IP address and port</span>
<span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/.jupyter
<span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"c.NotebookApp.ip = '0.0.0.0'"</span> <span class="o">>></span> ~/.jupyter/jupyter_notebook_config.py
<span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"c.NotebookApp.port = 8888"</span> <span class="o">>></span> ~/.jupyter/jupyter_notebook_config.py
<span class="c"># Set the working directory</span>
<span class="k">WORKDIR</span><span class="s"> /home/jupyterlab</span>
<span class="c"># Start JupyterLab once the container is launched</span>
<span class="k">ENTRYPOINT</span><span class="s"> jupyter lab --allow-root --no-browser</span>
</code></pre></div></div>
<p>In words, we take a minimalistic image with the desired version of Python
preinstalled—in this case, it is the <a href="https://hub.docker.com/_/python">official Python image</a>
tagged <code class="language-plaintext highlighter-rouge">3.7-slim</code>, which refers to Python 3.7 with any available bug fixes
promptly applied—and add packages that we consider to be important for our work.
These packages are gathered in the usual
<a href="https://github.com/chain-rule/JupyterLab/blob/article/requirements.txt"><code class="language-plaintext highlighter-rouge">requirements.txt</code></a>, which might look as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyterlab
matplotlib
numpy
pandas
pylint
pytest
scikit-learn
scipy
seaborn
tensorflow
yapf
</code></pre></div></div>
<p>The first one, <code class="language-plaintext highlighter-rouge">jupyterlab</code>, is essential; the rest is up to the data
scientist’s taste. An important aspect to note is that, in this example, the
versions of the listed packages are not fixed; hence, the latest available
versions will be taken each time a new image is built. Alternatively, one can
pin them to specific numbers by changing <code class="language-plaintext highlighter-rouge">requirements.txt</code>. For instance, one
might write <code class="language-plaintext highlighter-rouge">tensorflow==1.14.0</code> instead of <code class="language-plaintext highlighter-rouge">tensorflow</code>.</p>
<p>Having defined an image, we need a tool for orchestration. We would like to have
a convenient command for actually building the image and, more importantly, a
convenient command for launching a container with that image from an arbitrary
directory. The versatile <code class="language-plaintext highlighter-rouge">make</code> to the rescue!</p>
<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the Docker image
</span><span class="nv">name</span> <span class="o">:=</span> jupyterlab
<span class="c"># The directory to be mounted to the container
</span><span class="nv">root</span> <span class="o">?=</span> <span class="nv">${PWD}</span>
<span class="c"># Build a new image
</span><span class="nl">build</span><span class="o">:</span>
docker rmi <span class="nv">${name}</span> <span class="o">||</span> <span class="nb">true</span>
docker build <span class="nt">--tag</span> <span class="nv">${name}</span> .
<span class="c"># Start a new container
</span><span class="nl">start</span><span class="o">:</span>
<span class="p">@</span>docker run <span class="nt">--interactive</span> <span class="nt">--tty</span> <span class="nt">--rm</span> <span class="se">\</span>
<span class="nt">--name</span> <span class="nv">${name}</span> <span class="se">\</span>
<span class="nt">--publish</span> 8888:8888 <span class="se">\</span>
<span class="nt">--volume</span> <span class="s2">"</span><span class="nv">${root}</span><span class="s2">:/home/jupyterlab"</span> <span class="se">\</span>
<span class="nv">${name}</span>
</code></pre></div></div>
<p>In the above <a href="https://github.com/chain-rule/JupyterLab/blob/article/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>, we define two commands: <code class="language-plaintext highlighter-rouge">build</code>
and <code class="language-plaintext highlighter-rouge">start</code>. The <code class="language-plaintext highlighter-rouge">build</code> command instructs Docker to build a new image according
to the recipe in <code class="language-plaintext highlighter-rouge">Dockerfile</code>. The <code class="language-plaintext highlighter-rouge">start</code> command launches a new container and
mounts the directory specified by the <code class="language-plaintext highlighter-rouge">root</code> variable to the file system inside
the container using the <code class="language-plaintext highlighter-rouge">--volume</code> option. It also forwards port 8888 inside the
container, which is the one specified in <code class="language-plaintext highlighter-rouge">Dockerfile</code>, to port 8888 on the host
machine so that JupyterLab can be reached from the browser.</p>
<p>Let us now go ahead and try the two commands:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build
make start
</code></pre></div></div>
<p>JupyterLab should come back with usage instructions similar to the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
[I 18:40:15.078 LabApp] The Jupyter Notebook is running at:
[I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=<token>
[I 18:40:15.078 LabApp] or http://127.0.0.1:8888/?token=<token>
[I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:40:15.082 LabApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://e4edba021595:8888/?token=<token>
or http://127.0.0.1:8888/?token=<token>
...
</code></pre></div></div>
<p>By clicking on the last link, we end up in a fully fledged JupyterLab.
Congratulations! However, there is one step left. JupyterLab is currently
running in the folder with our <code class="language-plaintext highlighter-rouge">Dockerfile</code> and <code class="language-plaintext highlighter-rouge">Makefile</code>, which is not
particularly useful, as each project we might want to work on probably lives in
its own folder elsewhere in the file system. Fortunately, it is easy to fix with
an alias:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">jupyterlab</span><span class="o">=</span><span class="s1">'make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'</span>
</code></pre></div></div>
<p>This command should be placed in the start-up script of the shell being
utilized. In the case of Bash, it can be done as follows:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"alias jupyterlab='make -C </span><span class="se">\"</span><span class="k">${</span><span class="nv">PWD</span><span class="k">}</span><span class="se">\"</span><span class="s2"> root=</span><span class="se">\"\$</span><span class="s2">{PWD}</span><span class="se">\"</span><span class="s2">'"</span> <span class="o">>></span> ~/.bashrc
</code></pre></div></div>
<p>Now, in a new terminal, one should be able to run JupyterLab from any directory
as follows:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /path/to/some/project
jupyterlab
</code></pre></div></div>
<p>Note that the content of the current working directory (that is,
<code class="language-plaintext highlighter-rouge">/path/to/some/project</code>) is readily available inside JupyterLab. All notebooks
created and modified in the GUI there will be stored directly in this folder,
and they will remain here when the container is shut down.</p>
<h1 id="rstudio">RStudio</h1>
<p>It is time to get to grips with an image for R notebooks. As before, we begin
with a <a href="https://github.com/chain-rule/RStudio/blob/article/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>:</p>
<div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start with an RStudio image</span>
<span class="k">FROM</span><span class="s"> rocker/rstudio:latest</span>
<span class="c"># Install the software that R packages require</span>
<span class="k">RUN </span>apt-get update
<span class="k">RUN </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> libxml2-dev texlive texlive-latex-extra zlib1g-dev
<span class="c"># Set the working directory</span>
<span class="k">WORKDIR</span><span class="s"> /home/rstudio</span>
<span class="c"># Install the desired R packages</span>
<span class="k">COPY</span><span class="s"> requirements.txt /tmp/requirements.txt</span>
<span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"install.packages(readLines('/tmp/requirements.txt'), </span><span class="se">\
</span><span class="s2"> repos = 'http://cran.us.r-project.org')"</span> | R
</code></pre></div></div>
<p>Installing RStudio from scratch is not an easy task. Fortunately, we can start
with the <a href="https://hub.docker.com/r/rocker/rstudio/">official RStudio image</a>, which is what is specified at
the top of the file. If desired, the <code class="language-plaintext highlighter-rouge">latest</code> tag can be changed to a specific
version. The second block of Docker instructions is to provide programs and
libraries that are needed by the R packages that one is planning to install. For
instance, TeX Live is needed for rendering notebooks as PDF documents using
LaTeX. The last block of instructions in <code class="language-plaintext highlighter-rouge">Dockerfile</code> is for installing the R
packages themselves. As with Python, all necessary packages are gathered in a
single file called <a href="https://github.com/chain-rule/RStudio/blob/article/requirements.txt"><code class="language-plaintext highlighter-rouge">requirements.txt</code></a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>devtools
glmnet
plotly
rmarkdown
rstan
testthat
tidytext
tidyverse
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">rmarkdown</code> package is required for notebooks in Markdown. The rest is
intended to be changed according to one’s preferences, although <code class="language-plaintext highlighter-rouge">tidyverse</code> is
arguably a must in modern R.</p>
<p>All right, in order to build the image and launch containers, we create the
following <a href="https://github.com/chain-rule/RStudio/blob/article/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>:</p>
<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the Docker image
</span><span class="nv">name</span> <span class="o">:=</span> rstudio
<span class="c"># The directory to be mounted to the container
</span><span class="nv">root</span> <span class="o">?=</span> <span class="nv">${PWD}</span>
<span class="c"># Build a new image
</span><span class="nl">build</span><span class="o">:</span>
docker rmi <span class="nv">${name}</span> <span class="o">||</span> <span class="nb">true</span>
docker build <span class="nt">--tag</span> <span class="nv">${name}</span> .
<span class="c"># Start a new container
</span><span class="nl">start</span><span class="o">:</span>
<span class="p">@</span><span class="nb">echo</span> <span class="s2">"Address: http://localhost:8787/"</span>
<span class="p">@</span><span class="nb">echo</span> <span class="s2">"User: rstudio"</span>
<span class="p">@</span><span class="nb">echo</span> <span class="s2">"Password: rstud10"</span>
<span class="p">@</span><span class="nb">echo</span>
<span class="p">@</span><span class="nb">echo</span> <span class="s1">'Press Control-C to terminate...'</span>
<span class="p">@</span>docker run <span class="nt">--interactive</span> <span class="nt">--tty</span> <span class="nt">--rm</span> <span class="se">\</span>
<span class="nt">--name</span> <span class="nv">${name}</span> <span class="se">\</span>
<span class="nt">--publish</span> 8787:8787 <span class="se">\</span>
<span class="nt">--volume</span> <span class="s2">"</span><span class="nv">${root}</span><span class="s2">:/home/rstudio"</span> <span class="se">\</span>
<span class="nt">--env</span> <span class="nv">PASSWORD</span><span class="o">=</span>rstud10 <span class="se">\</span>
<span class="nv">${name}</span> <span class="o">></span> /dev/null
</code></pre></div></div>
<p>It is similar to the one for JupyterLab; however, since the default prompt of
RStudio is not as informative as that of JupyterLab, we print our own usage
instructions upon <code class="language-plaintext highlighter-rouge">start</code>.</p>
<p>The final piece is the shortcut for launching RStudio:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">rstudio</span><span class="o">=</span><span class="s1">'make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'</span>
</code></pre></div></div>
<p>In the case of Bash, it can be installed as follows:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"alias rstudio='make -C </span><span class="se">\"</span><span class="k">${</span><span class="nv">PWD</span><span class="k">}</span><span class="se">\"</span><span class="s2"> root=</span><span class="se">\"\$</span><span class="s2">{PWD}</span><span class="se">\"</span><span class="s2">'"</span> <span class="o">>></span> ~/.bashrc
</code></pre></div></div>
<p>Now it is time to build the image, go to an arbitrary directory, and test the
alias:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build
<span class="nb">cd</span> /path/to/some/project
rstudio
</code></pre></div></div>
<p>Unlike the JupyterLab image, this one is much slower to build due to R packages
traditionally compiling a lot of C++ code upon installation.</p>
<p>Lastly, it might be particularly convenient to have one’s GUI preferences (such
as the font size in the editor) and the like set up automatically upon each
container launch. This can be achieved by noting that RStudio stores user
preferences in a local folder called <code class="language-plaintext highlighter-rouge">.rstudio</code>. Then the <code class="language-plaintext highlighter-rouge">start</code> command can be
adjusted to silently plant a preconfigured <code class="language-plaintext highlighter-rouge">.rstudio</code> into the current working
directory, which can be seen in the <a href="https://github.com/chain-rule/RStudio/tree/article">repository</a> accompanying this
article.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Having completed the above steps, we have two Docker images: one is for Python
notebooks via JupyterLab, and one is for R notebooks via RStudio. At the moment,
the images are stored locally; however, they can be pushed to a public or
private image repository, such as <a href="https://hub.docker.com/">Docker Hub</a> or <a href="https://cloud.google.com/container-registry/">Google Container Registry</a>,
and subsequently pulled on an arbitrary machine having Docker installed.
Alternatively, they can be built on each machine separately. Regardless of the
installation, the crucial point is that our working environment will unshakably
remain in a specific pristine state defined by us.</p>
<p>Lastly, it is worth noting that similar images can straightforwardly be built
for more specific scenarios. For instance, the following repository provides a
skeleton for building and using a custom <a href="https://cloud.google.com/datalab/">Datalab</a>, which is Google’s wrapper
for Jupyter notebooks that run in the cloud: <a href="https://github.com/chain-rule/Datalab">Datalab</a>.</p>Ivan UkhovIt is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.On the expected utility in conversion rate optimization2019-07-08T06:00:00+00:002019-07-08T06:00:00+00:00https://blog.ivanukhov.com/2019/07/08/conversion<p>It can be not only extremely useful but also deeply satisfying to occasionally
dust off one’s math skills. In this article, we approach the classical problem
of conversion rate optimization—which is frequently faced by companies operating
online—and derive the expected utility of switching from variant A to variant B
under some modeling assumptions. This information can subsequently be utilized
in order to support the corresponding decision-making process.</p>
<p>An R implementation of the math below and more can be found in the following
repository:</p>
<ul>
<li><a href="https://github.com/chain-rule/conversion-rate">conversion-rate</a>.</li>
</ul>
<p>However, it was written for personal exploratory purposes and has no
documentation at the moment. If you decide to dive in, you will be on your own.</p>
<h1 id="problem">Problem</h1>
<p>Suppose, as a business, you send communications to your customers in order to
increase their engagement with the product. Furthermore, suppose you suspect
that a certain change to the usual way of working might increase the uplift. In
order to test your hypothesis, you set up an A/B test. The only decision you
care about is whether or not you should switch from variant A to variant B where
variant A is the baseline (the usual way of working). The twist is that, from
the perspective of the business, variant B comes with its own gain if it is the
winner, and its own loss if it is the loser. The goal is to incorporate this
information in the final decision, making necessary assumptions along the way.</p>
<h1 id="solution">Solution</h1>
<p>Let \(A\) and \(B\) be two random variables modeling the conversion rates of the
two variants, variant A and variant B. Furthermore, let \(p\) be the probability
density function of the joint distribution of \(A\) and \(B\). In what follows,
concrete values assumed by the variables are denoted by \(a\) and \(b\),
respectively.</p>
<p>Define the utility function as</p>
\[U(a, b) = G(a, b) I(a < b) + L(a, b) I(a > b)\]
<p>where \(G\) and \(L\) are referred to as the gain and loss functions,
respectively. The gain function takes effect when variant B has a higher
conversion rate than variant A, and the loss function takes effect
when variant A is better than variant B, which is what is enforced by the two
indicator functions (the case of equality is not essential, as it has
probability zero for continuous distributions). The expected utility is
then as follows:</p>
\[\begin{align}
E(U(A, B))
&= \int_0^1 \int_0^1 U(a, b) p(a, b) \, db \, da \\
&=
\int_0^1 \int_a^1 G(a, b) p(a, b) \, db \, da +
\int_0^1 \int_0^a L(a, b) p(a, b) \, db \, da.
\end{align}\]
<p>We further assume that the gain and loss are linear:</p>
\[\begin{align}
& G(a, b) = w_g (b - a) \text{ and} \\
& L(a, b) = w_l (b - a).
\end{align}\]
<p>In the above, \(w_g\) and \(w_l\) are two non-negative scaling factors, which
can be used to encode business preferences. Then we have that</p>
\[\begin{align}
E(U(A, B)) =
&
w_g \int_0^1 \int_a^1 b \, p(a, b) \, db \, da -
w_g \int_0^1 \int_a^1 a \, p(a, b) \, db \, da + {} \\
&
w_l \int_0^1 \int_0^a b \, p(a, b) \, db \, da -
w_l \int_0^1 \int_0^a a \, p(a, b) \, db \, da.
\end{align}\]
<p>For convenience, denote the four integrals by \(G_1\), \(G_2\), \(L_1\), and
\(L_2\), respectively, in which case we have that</p>
\[E(U(A, B)) = w_g \, G_1 - w_g \, G_2 + w_l \, L_1 - w_l \, L_2.\]
<p>Now, suppose the distributions of \(A\) and \(B\) are estimated using Bayesian
inference. In this approach, the prior knowledge of the decision-maker about the
conversion rates of the two variants is combined with the evidence in the form
of data continuously streaming from the A/B test. It is natural to use a
binomial distribution for the data and a beta distribution for the prior
knowledge, which results in a posterior distribution that is also a beta
distribution due to conjugacy.</p>
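<p>The conjugate update mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names <code class="language-plaintext highlighter-rouge">Beta</code> and <code class="language-plaintext highlighter-rouge">update</code> as well as the numbers are hypothetical and are not taken from the repository.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Shape parameters of a beta distribution (a hypothetical helper)."""

    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def update(prior: Beta, conversions: int, trials: int) -> Beta:
    """Combine a beta prior with binomial data into a beta posterior."""
    if not 0 <= conversions <= trials:
        raise ValueError("conversions must lie between 0 and trials")
    # Conversions are added to alpha, and non-conversions are added to beta
    return Beta(prior.alpha + conversions, prior.beta + trials - conversions)


# Hypothetical data: a uniform prior and 30 conversions out of 200 trials
posterior = update(Beta(1, 1), conversions=30, trials=200)
print(posterior, posterior.mean)
```

<p>Since the posterior has the same form as the prior, the update can be repeated as new data arrive, with the current posterior playing the role of the prior.</p>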
<p><em>A posteriori</em>, we have the following marginal distributions:</p>
\[\begin{align}
& A \sim \text{Beta}(\alpha_a, \beta_a) \text{ and} \\
& B \sim \text{Beta}(\alpha_b, \beta_b)
\end{align}\]
<p>where \(\alpha_a\) and \(\beta_a\) are the shape parameters of \(A\), and
\(\alpha_b\) and \(\beta_b\) are the shape parameters of \(B\). Assuming that the
two random variables are independent given the parameters,</p>
\[p(a, b) =
p(a) \, p(b) =
\frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}
\frac{b^{\alpha_b - 1} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)}.\]
<p>We can now compute the expected utility. The first integral is as follows:</p>
\[\begin{align}
G_1
&=
\int_0^1 \int_a^1
\frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}
\frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)} \, db \, da \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
\int_0^1 \int_a^1
\frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}
\frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b + 1, \beta_b)} \, db \, da \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b)
\end{align}\]
<p>where, with a slight abuse of notation, \(B\) is the beta function and</p>
\[h(\alpha_1, \beta_1, \alpha_2, \beta_2) = P(X_1 < X_2)\]
<p>for any</p>
\[\begin{align}
& X_1 \sim \text{Beta}(\alpha_1, \beta_1) \text{ and} \\
& X_2 \sim \text{Beta}(\alpha_2, \beta_2).
\end{align}\]
<p>The function \(h\) can be computed analytically, as shown in the blog posts
mentioned above. Specifically, for a positive integer \(\alpha_2\),</p>
\[h(\alpha_1, \beta_1, \alpha_2, \beta_2) =
\sum_{i = 0}^{\alpha_2 - 1} \frac{B(\alpha_1 + i, \beta_1 + \beta_2)}{(\beta_2 + i) B(1 + i, \beta_2) B(\alpha_1, \beta_1)}.\]
<p>Similarly,</p>
\[G_2 =
\frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b).\]
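<p>The function \(h\) defined above can be evaluated numerically as sketched below. The sketch is not the implementation from the repository; the helper <code class="language-plaintext highlighter-rouge">log_beta</code> is an assumption introduced here in order to work with logarithms and avoid overflow for large shape parameters. The sum requires \(\alpha_2\) to be a positive integer.</p>

```python
from math import exp, lgamma, log


def log_beta(x: float, y: float) -> float:
    """Logarithm of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def h(alpha_1: float, beta_1: float, alpha_2: int, beta_2: float) -> float:
    """P(X_1 < X_2) for independent X_1 ~ Beta(alpha_1, beta_1) and
    X_2 ~ Beta(alpha_2, beta_2) where alpha_2 is a positive integer."""
    if alpha_2 != int(alpha_2) or alpha_2 < 1:
        raise ValueError("alpha_2 must be a positive integer")
    total = 0.0
    for i in range(int(alpha_2)):
        # Each term of the sum is assembled on the logarithmic scale
        total += exp(
            log_beta(alpha_1 + i, beta_1 + beta_2)
            - log(beta_2 + i)
            - log_beta(1 + i, beta_2)
            - log_beta(alpha_1, beta_1)
        )
    return total


# Two uniform variables: either one is smaller with equal probability
print(h(1, 1, 1, 1))
```

<p>A useful sanity check is that \(h(\alpha_1, \beta_1, \alpha_2, \beta_2)\) and \(h(\alpha_2, \beta_2, \alpha_1, \beta_1)\) should add up to one.</p>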
<p>Regarding the last two integrals in the expression of the expected utility,</p>
\[\begin{align}
L_1
&=
\int_0^1 \int_0^a
\frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}
\frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)} \, db \, da \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
\int_0^1 \int_0^a
\frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)}
\frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b + 1, \beta_b)} \, db \, da \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a).
\end{align}\]
<p>Also,</p>
\[L_2 =
\frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a).\]
<p>Assembling the integrals together, we obtain</p>
\[\begin{align}
E(U(A, B)) =
& w_g \, \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b) - {} \\
& w_g \, \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b) + {} \\
& w_l \, \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a) - {} \\
& w_l \, \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a).
\end{align}\]
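<p>At this point, we could call it a day, but there is some room for
simplification. Since \(X_1\) and \(X_2\) are continuous, \(h(\alpha_1, \beta_1,
\alpha_2, \beta_2) = 1 - h(\alpha_2, \beta_2, \alpha_1, \beta_1)\). Therefore, in
the case of the assumed linear model, we have the following relationship between
the gain integrals and the loss integrals:</p>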
\[\begin{align}
G_1 - G_2
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b) -
\frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b) \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)}
(1 - h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a)) -
\frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)}
(1 - h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a)) \\
&=
\frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} -
\frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} -
(L_1 - L_2) \\
&=
\Delta - (L_1 - L_2)
\end{align}\]
<p>where \(\Delta\) is the difference between the above two ratios of beta
functions. Therefore,</p>
\[\begin{align}
E(U(A, B))
&= w_g (G_1 - G_2) + w_l (L_1 - L_2) \\
&= w_g (G_1 - G_2) + w_l (\Delta - (G_1 - G_2)) \\
&= (w_g - w_l) (G_1 - G_2) + w_l \, \Delta.
\end{align}\]
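<p>The final expression can be turned into code as follows. This is a minimal sketch in Python rather than the R implementation from the repository, and all function names are assumptions. It relies on the identity \(B(\alpha + 1, \beta) / B(\alpha, \beta) = \alpha / (\alpha + \beta)\) and requires integer \(\alpha_a\) and \(\alpha_b\), since \(h\) is evaluated via a finite sum.</p>

```python
from math import exp, lgamma, log


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def h(alpha_1, beta_1, alpha_2, beta_2):
    """P(X_1 < X_2) for independent beta variables with an integer alpha_2."""
    return sum(
        exp(
            log_beta(alpha_1 + i, beta_1 + beta_2)
            - log(beta_2 + i)
            - log_beta(1 + i, beta_2)
            - log_beta(alpha_1, beta_1)
        )
        for i in range(alpha_2)
    )


def ratio(alpha, beta):
    """B(alpha + 1, beta) / B(alpha, beta), which equals the mean."""
    return alpha / (alpha + beta)


def expected_utility(alpha_a, beta_a, alpha_b, beta_b, w_g, w_l):
    """Evaluate (w_g - w_l) (G_1 - G_2) + w_l Delta."""
    g_1 = ratio(alpha_b, beta_b) * h(alpha_a, beta_a, alpha_b + 1, beta_b)
    g_2 = ratio(alpha_a, beta_a) * h(alpha_a + 1, beta_a, alpha_b, beta_b)
    delta = ratio(alpha_b, beta_b) - ratio(alpha_a, beta_a)
    return (w_g - w_l) * (g_1 - g_2) + w_l * delta


# Hypothetical posteriors: A ~ Beta(21, 81) and B ~ Beta(26, 76), with losses
# weighted twice as much as gains
value = expected_utility(21, 81, 26, 76, w_g=1.0, w_l=2.0)
print(value)
```

<p>When the two weights are equal, the first term vanishes, and the expected utility reduces to the weighted difference between the posterior means, which is a convenient special case for testing.</p>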
<h1 id="conclusion">Conclusion</h1>
<p>The decision-maker is now better equipped to take action. Having obtained the
posterior distributions of the conversion rates of the two variants, the derived
formula allows one to assess whether variant B is worth switching to,
considering its utility to the business at hand.</p>
<p>The reason the expected utility \(E(U(A, B))\) can be evaluated in closed form
in this case is the linearity of the utility function \(U(a, b)\). More nuanced
preferences require a different approach. The most flexible candidate is
simulation, which is straightforward and should arguably be the go-to tool
regardless of the availability of a closed-form solution, as it is less
error-prone.</p>
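<p>To illustrate the simulation route, below is a minimal Monte Carlo sketch in Python that uses only the standard library. The function names, the sample size, and the seed are assumptions made for illustration. Any utility function of \(a\) and \(b\) can be plugged in; the linear one from this article is used as an example.</p>

```python
import random


def linear_utility(w_g, w_l):
    """Build the linear utility with the given gain and loss weights."""

    def utility(a, b):
        # The gain weight applies when B wins and the loss weight otherwise
        weight = w_g if a < b else w_l
        return weight * (b - a)

    return utility


def simulate_expected_utility(
    alpha_a, beta_a, alpha_b, beta_b, utility, count=100_000, seed=0
):
    """Estimate E(U(A, B)) by averaging over independent posterior draws."""
    generator = random.Random(seed)
    total = 0.0
    for _ in range(count):
        a = generator.betavariate(alpha_a, beta_a)
        b = generator.betavariate(alpha_b, beta_b)
        total += utility(a, b)
    return total / count


# Hypothetical posteriors: A ~ Beta(21, 81) and B ~ Beta(26, 76)
estimate = simulate_expected_utility(21, 81, 26, 76, linear_utility(1.0, 2.0))
print(estimate)
```

<p>The estimate is subject to sampling noise, which shrinks as the number of draws grows, so comparing it with the closed-form value is a simple way to cross-check both.</p>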
<p>Please feel free to reach out if you have any thoughts or suggestions.</p>
<h1 id="acknowledgments">Acknowledgments</h1>
<p>This article is largely inspired by a series of excellent blog posts by <a href="http://www.evanmiller.org/bayesian-ab-testing.html">Evan
Miller</a>, <a href="https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html">Chris Stucchio</a>, and <a href="http://varianceexplained.org/r/bayesian-ab-testing/">David Robinson</a>, which are strongly recommended.</p>
<h1 id="references">References</h1>
<ul>
<li>Chris Stucchio, “<a href="https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html">Easy evaluation of decision rules in Bayesian A/B
testing</a>,” 2014.</li>
<li>David Robinson, “<a href="http://varianceexplained.org/r/bayesian-ab-testing/">Is Bayesian A/B testing immune to peeking? Not
exactly</a>,” 2015.</li>
<li>Evan Miller, “<a href="http://www.evanmiller.org/bayesian-ab-testing.html">Formulas for Bayesian A/B testing</a>,” 2014.</li>
</ul>
<p>Ivan Ukhov</p>
<p>A poor man’s orchestration of predictive models, or do it yourself</p>
<p>2019-07-01T06:00:00+00:00, https://blog.ivanukhov.com/2019/07/01/orchestration</p>
<p>As a data scientist focusing on developing data products, you naturally want
your work to reach its target audience. Suppose, however, that your company does
not have a dedicated engineering team for productizing data-science code. One
solution is to seek help in other teams, which are surely busy with their own
endeavors, and spend months waiting. Alternatively, you could take the
initiative and do it yourself. In this article, we take the initiative and
schedule the training and application phases of a predictive model using Apache
<a href="https://airflow.apache.org/">Airflow</a>, Google <a href="https://cloud.google.com/compute/">Compute Engine</a>, and <a href="https://www.docker.com/">Docker</a>.</p>
<p>Let us first set expectations for what is assumed to be given and what will be
attained by the end of this article. It is assumed that a predictive model for
supporting business decisions—such as a model for identifying potential churners
or a model for estimating the lifetime value of customers—has already been
developed. This means that a business problem has already been identified and
translated into a concrete question, the data needed for answering the question
have already been collected and transformed into a target variable and a set of
explanatory variables, and a modeling technique has already been selected and
calibrated in order to answer the question by predicting the target variable
given the explanatory variables. For the sake of concreteness, the model is
assumed to be written in Python. We also assume that the company at hand has
chosen Google Cloud Platform as its primary platform, which makes a certain
suite of tools readily available.</p>
<p>Our goal is then to schedule the model to run in the cloud via Airflow, Compute
Engine, and Docker so that it is periodically retrained (in order to take into
account potential fluctuations in the data distribution) and periodically
applied (in order to actually make predictions), delivering predictions to the
data warehouse in the form of <a href="https://cloud.google.com/bigquery/">BigQuery</a> for further consumption by other
parties.</p>
<p>It is important to note that this article is not a tutorial on any of the
aforementioned technologies. The reader is assumed to be familiar with Google
Cloud Platform and to have an understanding of Airflow and Docker, as well as to
be comfortable with finding out missing details on their own.</p>
<p>Lastly, the following two repositories contain the code discussed below:</p>
<ul>
<li><a href="https://github.com/chain-rule/example-prediction">example-prediction</a> and</li>
<li><a href="https://github.com/chain-rule/example-prediction-service">example-prediction-service</a>.</li>
</ul>
<h1 id="preparing-the-model">Preparing the model</h1>
<p>The suggested structure of the repository hosting the model is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── configs/
│ ├── application.json
│ └── training.json
├── prediction/
│ ├── __init__.py
│ ├── main.py
│ ├── model.py
│ └── task.py
├── README.md
└── requirements.txt
</code></pre></div></div>
<p>Here <a href="https://github.com/chain-rule/example-prediction/tree/master/prediction"><code class="language-plaintext highlighter-rouge">prediction/</code></a> is a Python package, and it is likely to contain many more
files than the ones listed. The <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/main.py"><code class="language-plaintext highlighter-rouge">main</code></a> file is the entry point for
command-line invocation, the <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/task.py"><code class="language-plaintext highlighter-rouge">task</code></a> module defines the actions that the
package is capable of performing, and the <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/model.py"><code class="language-plaintext highlighter-rouge">model</code></a> module defines the model.</p>
<p>As alluded to above, the primary job of the <code class="language-plaintext highlighter-rouge">main</code> file is to parse command-line
arguments, read a configuration file, potentially set up logging and the like, and
delegate the rest to the <code class="language-plaintext highlighter-rouge">task</code> module. At a later stage, an invocation of an
action might look as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> prediction.main <span class="nt">--action</span> training <span class="nt">--config</span> configs/training.json
</code></pre></div></div>
<p>Here we are passing two arguments: <code class="language-plaintext highlighter-rouge">--action</code> and <code class="language-plaintext highlighter-rouge">--config</code>. The former is to
specify the desired action, and the latter is to supply additional configuration
parameters, such as the location of the training data and the values of the
model’s hyperparameters. Keeping all parameters in a separate file, as opposed
to hard-coding them, makes the code reusable, and passing them all at once as a
single file scales much better than passing each parameter as a separate
argument.</p>
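<p>A minimal sketch of such an entry point is given below. It is not the exact code from the repository: the <code class="language-plaintext highlighter-rouge">Task</code> class is a stand-in, and the function names <code class="language-plaintext highlighter-rouge">parse_arguments</code>, <code class="language-plaintext highlighter-rouge">run</code>, and <code class="language-plaintext highlighter-rouge">main</code> are assumptions.</p>

```python
import argparse
import json


class Task:
    """A stand-in for the task module; each method is one action."""

    def __init__(self, config: dict):
        self.config = config

    def training(self) -> str:
        return "training"

    def application(self) -> str:
        return "application"


def parse_arguments(arguments=None) -> argparse.Namespace:
    """Parse the two command-line arguments discussed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--action", choices=["training", "application"], required=True
    )
    parser.add_argument("--config", required=True)
    return parser.parse_args(arguments)


def run(action: str, config: dict) -> str:
    """Delegate to the method of Task named after the action."""
    return getattr(Task(config), action)()


def main(arguments=None) -> str:
    """Read the configuration file and run the requested action."""
    options = parse_arguments(arguments)
    with open(options.config) as file:
        config = json.load(file)
    return run(options.action, config)
```

<p>In an actual <code class="language-plaintext highlighter-rouge">main</code> file, one would call <code class="language-plaintext highlighter-rouge">main()</code> at the bottom so that the arguments are taken from the command line. Restricting <code class="language-plaintext highlighter-rouge">--action</code> to a fixed set of choices makes a mistyped action fail early with a clear message.</p>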
<p>The <code class="language-plaintext highlighter-rouge">task</code> module is conceptually as follows (see the repository for the exact
implementation):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Task</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">training</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># Read the training data
</span> <span class="c1"># Train the model
</span> <span class="c1"># Save the trained model
</span>
<span class="k">def</span> <span class="nf">application</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># Read the application data
</span> <span class="c1"># Load the trained model
</span> <span class="c1"># Make predictions
</span> <span class="c1"># Save the predictions
</span></code></pre></div></div>
<p>In this example, there are two tasks: training and application. The training
task is responsible for fetching the training data, training the model, and
saving the result in a predefined location for future usage by the application
task. The application task is responsible for fetching the application data
(that is, the data the model is supposed to be applied to), loading the trained
model produced by the training task, making predictions, and saving them in a
predefined location to be picked up for the subsequent delivery to the data
warehouse.</p>
<p>Likewise, the <code class="language-plaintext highlighter-rouge">model</code> module can be simplified as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Model</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c1"># Estimate the model’s parameters
</span>
<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span>
<span class="c1"># Make predictions using the estimated parameters
</span></code></pre></div></div>
<p>It can be seen that the structure presented above makes very few assumptions
about the model, which makes it generally applicable. It can also be easily
extended to accommodate other actions. For instance, one could have a separate
action for testing the model on unseen data.</p>
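<p>To make the structure more tangible, here is a runnable toy version of the two modules. Everything in it is an assumption made for illustration: the model simply predicts the most frequent label, the data are passed in the configuration, and the trained model is stored as a JSON file in a local directory.</p>

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


class Model:
    """A toy model predicting the most frequent label seen during training."""

    def fit(self, data):
        # Estimate the only parameter: the majority label
        self.label = Counter(label for _, label in data).most_common(1)[0][0]
        return self

    def predict(self, data):
        # Assign the majority label to every identifier
        return [(identifier, self.label) for identifier in data]


class Task:
    def __init__(self, config):
        self.config = config
        self.path = Path(config["output"]) / "model.json"

    def training(self):
        # Read the training data, train the model, and save the trained model
        model = Model().fit(self.config["data"])
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"label": model.label}))

    def application(self):
        # Load the trained model and make predictions for the application data
        model = Model()
        model.label = json.loads(self.path.read_text())["label"]
        return model.predict(self.config["data"])


# Train on hypothetical labeled data and apply to new identifiers
with tempfile.TemporaryDirectory() as directory:
    training_data = [("a", True), ("b", True), ("c", False)]
    Task({"output": directory, "data": training_data}).training()
    predictions = Task({"output": directory, "data": ["d", "e"]}).application()
print(predictions)
```

<p>The two actions communicate only through the saved file, which is what allows them to run at different times and on different machines.</p>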
<p>With the model structured as shown above, it can now be productized, which we
discuss next.</p>
<h1 id="wrapping-the-model-into-a-service">Wrapping the model into a service</h1>
<p>Now it is time to turn the model into a service. In the scope of this article, a
service is a self-sufficient piece of code that can be executed in the cloud
upon request. To this end, another repository is created, adhering to the
separation-of-concerns design principle. Specifically, by doing so, we avoid
mixing the modeling code with the code specific to a particular environment
where the model happens to be deployed. The suggested structure of the
repository is as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.
├── container/
│ ├── Dockerfile
│ ├── run.sh
│ └── wait.sh
├── service/
│ ├── configs/
│ │ ├── application.json
│ │ └── training.json
│ ├── source/ # the first repository as a submodule
│ └── requirements.txt
├── scheduler/
│ ├── configs/
│ │ ├── application.json
│ │ └── training.json
│ ├── application.py # a symbolic link to graph.py
│ ├── graph.py
│ └── training.py # a symbolic link to graph.py
├── Makefile
└── README.md
</code></pre></div></div>
<p>The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container"><code class="language-plaintext highlighter-rouge">container/</code></a> folder contains files for building a Docker image for the
service. The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/service"><code class="language-plaintext highlighter-rouge">service/</code></a> folder is the service itself, meaning that these files
will be present in the container and eventually executed. Lastly, the
<a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler"><code class="language-plaintext highlighter-rouge">scheduler/</code></a> folder contains files for scheduling the service using Airflow.
The last one will be covered in the next section; here we focus on the first
two.</p>
<p>Let us start with <code class="language-plaintext highlighter-rouge">service/</code>. The first repository (the one discussed in the
previous section) is added to this second repository as a Git submodule living
in <code class="language-plaintext highlighter-rouge">service/source/</code>. This means that the model will essentially be embedded in
the service but will conveniently remain an independent entity. At all times,
the service contains a reference to a particular state (a particular commit,
potentially on a dedicated release branch) of the model, guaranteeing that the
desired version of the model is in production. However, when invoking the model
from the service, we might want to use a different set of configuration files
than the ones present in the first repository. To this end, a service-specific
version of the configuration files is created in <code class="language-plaintext highlighter-rouge">service/configs/</code>. We might
also want to install additional Python dependencies; hence, there is a separate
file with requirements.</p>
<p>Now it is time to containerize the service code by building a Docker image. The
relevant files are gathered in <code class="language-plaintext highlighter-rouge">container/</code>. The image is defined in
<a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/Dockerfile"><code class="language-plaintext highlighter-rouge">container/Dockerfile</code></a> and is as follows:</p>
<div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use a minimal Python image</span>
<span class="k">FROM</span><span class="s"> python:3.7-slim</span>
<span class="c"># Install Google Cloud SDK as described in</span>
<span class="c"># https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu</span>
<span class="c"># Copy the service directory to the image</span>
<span class="k">COPY</span><span class="s"> service /service</span>
<span class="c"># Copy the run script to the image</span>
<span class="k">COPY</span><span class="s"> container/run.sh /run.sh</span>
<span class="c"># Install Python dependencies specific to the predictive model</span>
<span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--requirement</span> /service/source/requirements.txt
<span class="c"># Install Python dependencies specific to the service</span>
<span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--requirement</span> /service/requirements.txt
<span class="c"># Set the working directory to be the service directory</span>
<span class="k">WORKDIR</span><span class="s"> /service</span>
<span class="c"># Set the entry point to be the run script</span>
<span class="k">ENTRYPOINT</span><span class="s"> /run.sh</span>
</code></pre></div></div>
<p>As mentioned earlier, <code class="language-plaintext highlighter-rouge">service/</code> gets copied as is (including <code class="language-plaintext highlighter-rouge">service/source</code>
with the model), and it will be the working directory inside the container. We
also copy <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/run.sh"><code class="language-plaintext highlighter-rouge">container/run.sh</code></a>, which becomes the entry point of the container;
this script is executed whenever a container is launched. Let us take a look at
the content of the script (as before, some parts are omitted for clarity):</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="k">function </span>process_training<span class="o">()</span> <span class="o">{</span>
<span class="c"># Invoke training</span>
python <span class="nt">-m</span> prediction.main <span class="se">\</span>
<span class="nt">--action</span> <span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">--config</span> configs/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>.json
<span class="c"># Set the output location in Cloud Storage</span>
<span class="nb">local </span><span class="nv">output</span><span class="o">=</span>gs://<span class="k">${</span><span class="nv">NAME</span><span class="k">}</span>/<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">timestamp</span><span class="k">}</span>
<span class="c"># Copy the trained model from the output directory to Cloud Storage</span>
save output <span class="k">${</span><span class="nv">output</span><span class="k">}</span>
<span class="o">}</span>
<span class="k">function </span>process_application<span class="o">()</span> <span class="o">{</span>
<span class="c"># Find the latest trained model in Cloud Storage</span>
<span class="c"># Copy the trained model from Cloud Storage to the output directory</span>
load <span class="k">${</span><span class="nv">input</span><span class="k">}</span> output
<span class="c"># Invoke application</span>
python <span class="nt">-m</span> prediction.main <span class="se">\</span>
<span class="nt">--action</span> <span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span> <span class="se">\</span>
<span class="nt">--config</span> configs/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>.json
<span class="c"># Set the output location in Cloud Storage</span>
<span class="nb">local </span><span class="nv">output</span><span class="o">=</span>gs://<span class="k">${</span><span class="nv">NAME</span><span class="k">}</span>/<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">timestamp</span><span class="k">}</span>
<span class="c"># Copy the predictions from the output directory to Cloud Storage</span>
save output <span class="k">${</span><span class="nv">output</span><span class="k">}</span>
<span class="c"># Set the input file in Cloud Storage</span>
<span class="c"># Set the output data set and table in BigQuery</span>
<span class="c"># Ingest the predictions from Cloud Storage into BigQuery</span>
ingest <span class="k">${</span><span class="nv">input</span><span class="k">}</span> <span class="k">${</span><span class="nv">output</span><span class="k">}</span> player_id:STRING,label:BOOL
<span class="o">}</span>
<span class="k">function </span>delete<span class="o">()</span> <span class="o">{</span>
<span class="c"># Delete a Compute Engine instance called "${NAME}-${VERSION}-${ACTION}"</span>
<span class="o">}</span>
<span class="k">function </span>ingest<span class="o">()</span> <span class="o">{</span>
<span class="c"># Ingest a file from Cloud Storage into a table in BigQuery</span>
<span class="o">}</span>
<span class="k">function </span>load<span class="o">()</span> <span class="o">{</span>
<span class="c"># Sync the content of a location in Cloud Storage with a local directory</span>
<span class="o">}</span>
<span class="k">function </span>save<span class="o">()</span> <span class="o">{</span>
<span class="c"># Sync the content of a local directory with a location in Cloud Storage</span>
<span class="o">}</span>
<span class="k">function </span>send<span class="o">()</span> <span class="o">{</span>
<span class="c"># Write into a Stackdriver log called "${NAME}-${VERSION}-${ACTION}"</span>
<span class="o">}</span>
<span class="c"># Invoke the delete function when the script exits regardless of the reason</span>
<span class="nb">trap </span>delete EXIT
<span class="c"># Report a successful start to Stackdriver</span>
send <span class="s1">'Running the action...'</span>
<span class="c"># Invoke the function specified by the ACTION environment variable</span>
process_<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>
<span class="c"># Report a successful completion to Stackdriver</span>
send <span class="s1">'Well done.'</span>
</code></pre></div></div>
<p>The script expects a number of environment variables to be set upon each
container launch, which will be discussed shortly. The primary ones are <code class="language-plaintext highlighter-rouge">NAME</code>,
<code class="language-plaintext highlighter-rouge">VERSION</code>, and <code class="language-plaintext highlighter-rouge">ACTION</code>, indicating the name of the service, version of the
service, and action to be executed by the service, respectively.</p>
<p>As we shall see below, the above script interacts with several different
products on Google Cloud Platform. It might then be surprising that there is
only a handful of variables passed to the script. The explanation is that the
convention-over-configuration design paradigm is followed to a great extent
here, meaning that other necessary variables can be derived (save sensible
default values) from the ones given, since there are certain naming conventions
used throughout the project.</p>
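<p>The naming conventions visible in the script can be summarized by the following sketch. It is written in Python purely for illustration (the service itself uses Bash), the function name <code class="language-plaintext highlighter-rouge">derive</code> is an assumption, and the timestamp format in the example is hypothetical, as it is not shown in the excerpt above.</p>

```python
def derive(name: str, version: str, action: str, timestamp: str) -> dict:
    """Derive resource names from the primary variables by convention."""
    return {
        # The Compute Engine instance hosting the container
        "instance": f"{name}-{version}-{action}",
        # The Stackdriver log dedicated to the service run
        "log": f"{name}-{version}-{action}",
        # The location of the artifacts in Cloud Storage
        "output": f"gs://{name}/{version}/{action}/{timestamp}",
    }


# The name and version are the defaults from the Makefile
names = derive(
    name="example-prediction-service",
    version="2019-00-00",
    action="training",
    timestamp="2019-07-01-06-00-00",
)
print(names)
```

<p>Because every name is a function of the same few variables, the scheduler and the container can agree on where things live without exchanging any additional configuration.</p>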
<p>The <code class="language-plaintext highlighter-rouge">process_training</code> and <code class="language-plaintext highlighter-rouge">process_application</code> functions are responsible for training
and application, respectively. It can be seen that they leverage the
command-line interface by invoking the <code class="language-plaintext highlighter-rouge">main</code> file, which was discussed in the
previous section. Since containers are stateless, all artifacts are stored in an
external storage, which is a bucket in <a href="https://cloud.google.com/storage/">Cloud Storage</a> in our case, and this job
is delegated to the <code class="language-plaintext highlighter-rouge">load</code> and <code class="language-plaintext highlighter-rouge">save</code> functions used in both <code class="language-plaintext highlighter-rouge">process_training</code>
and <code class="language-plaintext highlighter-rouge">process_application</code>. In addition, the result of the application action
(that is, the predictions) is ingested into a table in BigQuery using <a href="https://cloud.google.com/sdk/">Cloud
SDK</a>, which can be seen in the <code class="language-plaintext highlighter-rouge">ingest</code> function in <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/run.sh"><code class="language-plaintext highlighter-rouge">container/run.sh</code></a>.</p>
<p>The container communicates with the outside world using <a href="https://cloud.google.com/stackdriver/">Stackdriver</a> via the
<code class="language-plaintext highlighter-rouge">send</code> function, which writes messages to a log dedicated to the current service
run. The most important message is the one indicating a successful completion,
which is sent at the very end; we use “Well done” for this purpose. This is the
message that will be looked for in order to determine the overall outcome of a
service run.</p>
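<p>The check itself boils down to searching the log of a service run for the completion message. Below is a sketch of that logic in Python; how the messages are fetched from Stackdriver is left out, and the function name <code class="language-plaintext highlighter-rouge">succeeded</code> and the sample messages are assumptions.</p>

```python
def succeeded(messages, marker="Well done"):
    """Tell whether any log message signals a successful completion."""
    return any(marker in message for message in messages)


# Hypothetical log contents of a completed run and of an interrupted run
print(succeeded(["Running the action...", "Well done."]))
print(succeeded(["Running the action..."]))
```

<p>Note that the absence of the message does not distinguish a failed run from one that is still in progress, which is why the check is meaningful only after the run has finished.</p>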
<p>Note also that, upon successful or unsuccessful completion, the container
deletes its hosting virtual machine, which is achieved by setting a handler
(<code class="language-plaintext highlighter-rouge">delete</code>) for the <code class="language-plaintext highlighter-rouge">EXIT</code> event.</p>
<p>Lastly, let us discuss the commands used for building the image and launching
the actions. This entails a few lengthy invocations of Cloud SDK, which can be
neatly organized in a <a href="https://github.com/chain-rule/example-prediction-service/tree/master/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>:</p>
<div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the service
</span><span class="nv">name</span> <span class="o">?=</span> example-prediction-service
<span class="c"># The version of the service
</span><span class="nv">version</span> <span class="o">?=</span> 2019-00-00
<span class="c"># The name of the project on Google Cloud Platform
</span><span class="nv">project</span> <span class="o">?=</span> example-cloud-project
<span class="c"># The zone for operations in Compute Engine
</span><span class="nv">zone</span> <span class="o">?=</span> europe-west1-b
<span class="c"># The address of Container Registry
</span><span class="nv">registry</span> <span class="o">?=</span> eu.gcr.io
<span class="c"># The name of the Docker image
</span><span class="nv">image</span> <span class="o">:=</span> <span class="nv">${name}</span>
<span class="c"># The name of the instance excluding the action
</span><span class="nv">instance</span> <span class="o">:=</span> <span class="nv">${name}</span>-<span class="nv">${version}</span>
<span class="nl">build</span><span class="o">:</span>
docker rmi <span class="nv">${image}</span> 2> /dev/null <span class="o">||</span> <span class="nb">true</span>
docker build <span class="nt">--file</span> container/Dockerfile <span class="nt">--tag</span> <span class="nv">${image}</span> .
docker tag <span class="nv">${image}</span> <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span>
docker push <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span>
<span class="nl">training-start</span><span class="o">:</span>
gcloud compute instances create-with-container <span class="nv">${instance}</span><span class="nt">-training</span> <span class="se">\</span>
<span class="nt">--container-image</span> <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span> <span class="se">\</span>
<span class="nt">--container-env</span> <span class="nv">NAME</span><span class="o">=</span><span class="nv">${name}</span> <span class="se">\</span>
<span class="nt">--container-env</span> <span class="nv">VERSION</span><span class="o">=</span><span class="nv">${version}</span> <span class="se">\</span>
<span class="nt">--container-env</span> <span class="nv">ACTION</span><span class="o">=</span>training <span class="se">\</span>
<span class="nt">--container-env</span> <span class="nv">ZONE</span><span class="o">=</span><span class="nv">${zone}</span> <span class="se">\</span>
<span class="nt">--container-restart-policy</span> never <span class="se">\</span>
<span class="nt">--no-restart-on-failure</span> <span class="se">\</span>
<span class="nt">--machine-type</span> n1-standard-1 <span class="se">\</span>
<span class="nt">--scopes</span> default,bigquery,compute-rw,storage-rw <span class="se">\</span>
<span class="nt">--zone</span> <span class="nv">${zone}</span>
<span class="nl">training-wait</span><span class="o">:</span>
container/wait.sh instance <span class="nv">${instance}</span><span class="nt">-training</span> <span class="nv">${zone}</span>
<span class="nl">training-check</span><span class="o">:</span>
container/wait.sh success <span class="nv">${instance}</span><span class="nt">-training</span>
<span class="c"># Similar for application
</span></code></pre></div></div>
<p>Here we define one command for building images, namely <code class="language-plaintext highlighter-rouge">build</code>, and three
commands per action, namely <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code>. In this section, we
discuss <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">start</code> and leave the last two for the next section, as they
are needed specifically for scheduling.</p>
<p>The <code class="language-plaintext highlighter-rouge">build</code> command is invoked as follows:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build
</code></pre></div></div>
<p>It has to be used each time a new version of the service is to be deployed. The
command creates a local Docker image according to the recipe in
<code class="language-plaintext highlighter-rouge">container/Dockerfile</code> and uploads it to <a href="https://cloud.google.com/container-registry/">Container Registry</a>, which is Google’s
storage for Docker images. For the last operation to succeed, your local Docker
has to be configured appropriately, which boils down to the following lines:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud auth login <span class="c"># General authentication for Cloud SDK</span>
gcloud auth configure-docker
</code></pre></div></div>
<p>Once <code class="language-plaintext highlighter-rouge">build</code> has finished successfully, one should be able to see the newly
created image in <a href="https://console.cloud.google.com">Cloud Console</a> by navigating to Container Registry in the menu
to the left. All future versions of the service will be neatly grouped in a
separate folder in the registry.</p>
<p>Given that the image is in the cloud, we can start to create virtual machines
running containers with this particular image, which is what the <code class="language-plaintext highlighter-rouge">start</code> command
is for:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make training-start <span class="c"># Similar for application</span>
</code></pre></div></div>
<p>Internally, it relies on <code class="language-plaintext highlighter-rouge">gcloud compute instances create-with-container</code>, which
can be seen in <code class="language-plaintext highlighter-rouge">Makefile</code> listed above. There are a few aspects to note about
this command. Apart from selecting the right image and version
(<code class="language-plaintext highlighter-rouge">--container-image</code>), one has to make sure to set the environment variables
mentioned earlier, as they control what the container will be doing once
launched. This is achieved by passing a number of <code class="language-plaintext highlighter-rouge">--container-env</code> options to
<code class="language-plaintext highlighter-rouge">create-with-container</code>. The host virtual machine can also be scaled up or
down via the <code class="language-plaintext highlighter-rouge">--machine-type</code> option. Lastly, it is important to set
the <code class="language-plaintext highlighter-rouge">--scopes</code> option correctly so that the container is allowed to work with
BigQuery, Compute Engine, and Cloud Storage.</p>
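<p>To see which parts of the command vary between actions, the same argument list
can be assembled programmatically. The following is a minimal sketch and not
part of the repository: the <code class="language-plaintext highlighter-rouge">build_start_command</code> helper and its parameter
names are hypothetical, while the flags and default values are the ones used in
<code class="language-plaintext highlighter-rouge">Makefile</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def build_start_command(name, version, action, project, zone,
                        registry='eu.gcr.io', machine_type='n1-standard-1'):
    """Return the argument list for starting one action of the service.

    Hypothetical helper mirroring the start target of the Makefile.
    """
    instance = '{}-{}-{}'.format(name, version, action)
    image = '{}/{}/{}:{}'.format(registry, project, name, version)
    command = [
        'gcloud', 'compute', 'instances', 'create-with-container', instance,
        '--container-image', image,
    ]
    # The environment variables tell the container what to do once launched
    environment = {'NAME': name, 'VERSION': version, 'ACTION': action, 'ZONE': zone}
    for key, value in environment.items():
        command += ['--container-env', '{}={}'.format(key, value)]
    command += [
        '--container-restart-policy', 'never',
        '--no-restart-on-failure',
        '--machine-type', machine_type,
        '--scopes', 'default,bigquery,compute-rw,storage-rw',
        '--zone', zone,
    ]
    return command


command = build_start_command(
    name='example-prediction-service',
    version='2019-00-00',
    action='training',
    project='example-cloud-project',
    zone='europe-west1-b',
)
print(' '.join(command))
</code></pre></div></div>
<p>The resulting list can be passed to <code class="language-plaintext highlighter-rouge">subprocess.run(command, check=True)</code>,
which avoids any shell quoting issues.</p>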
<p>At this point, we have a few handy commands for working with the service. It is
time for scheduling.</p>
<h1 id="scheduling-the-service">Scheduling the service</h1>
<p>The goal now is to have both training and application executed periodically,
promptly delivering predictions to the data warehouse. Technically, one could
just keep invoking <code class="language-plaintext highlighter-rouge">make training-start</code> and <code class="language-plaintext highlighter-rouge">make application-start</code> on
one's local machine, but this is neither convenient nor reliable. Instead,
we would like to have an autonomous scheduler running in the cloud that would,
apart from its primary task of dispatching jobs, manage temporal dependencies
between jobs, keep a record of all past and upcoming jobs, and preferably provide
a web-based dashboard for monitoring. One such tool is Airflow, and it is the
one used in this article.</p>
<p>In Airflow, the work to be performed is expressed as a directed acyclic graph
defined using Python. Our job is to create two such graphs. One is for training,
and one is for application, each with its own periodicity. At this point, it
might seem that each graph should contain only one node calling the <code class="language-plaintext highlighter-rouge">start</code>
command, which was introduced earlier. However, a more comprehensive solution is
to not only start the service but also wait for its termination and check that
it successfully executed the corresponding logic. This gives us good
visibility into the life cycle of the service in terms of various statistics (for
instance, the duration and outcome of all runs) directly in Airflow.</p>
<p>The above is the reason we have defined two additional commands in <code class="language-plaintext highlighter-rouge">Makefile</code>:
<code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">check</code>. The <code class="language-plaintext highlighter-rouge">wait</code> command ensures that the virtual machine has reached
a terminal state (regardless of the outcome), and the <code class="language-plaintext highlighter-rouge">check</code> command ensures
that the terminal state was the one expected. This functionality can be
implemented in different ways. The approach that we use can be seen in
<a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/wait.sh"><code class="language-plaintext highlighter-rouge">container/wait.sh</code></a>, which is invoked by both operations from <code class="language-plaintext highlighter-rouge">Makefile</code>:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="k">function </span>process_instance<span class="o">()</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s1">'Waiting for the instance to finish...'</span>
<span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do</span>
<span class="c"># Try to read some information about the instance</span>
<span class="c"># Exit successfully when there is no such instance</span>
<span class="nb">wait
</span><span class="k">done</span>
<span class="o">}</span>
<span class="k">function </span>process_success<span class="o">()</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s1">'Waiting for the success to be reported...'</span>
<span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do</span>
<span class="c"># Check if the last entry in Stackdriver contains “Well done”</span>
<span class="c"># Exit successfully if the phrase is present</span>
<span class="nb">wait
</span><span class="k">done</span>
<span class="o">}</span>
<span class="k">function </span><span class="nb">wait</span><span class="o">()</span> <span class="o">{</span>
<span class="nb">echo</span> <span class="s1">'Waiting...'</span>
<span class="nb">sleep </span>10
<span class="o">}</span>
<span class="c"># Invoke the function specified by the first command-line argument and forward</span>
<span class="c"># the rest of the arguments to this function</span>
process_<span class="k">${</span><span class="nv">1</span><span class="k">}</span> <span class="k">${</span><span class="p">@</span>:2:10<span class="k">}</span>
</code></pre></div></div>
<p>The script has two main functions. The <code class="language-plaintext highlighter-rouge">process_instance</code> function waits for the
virtual machine to finish, and it is currently based on trying to fetch some
information about the machine in question using Cloud SDK. Whenever this
fetching fails, it is an indication of the machine being shut down and
destroyed, which is exactly what is needed in this case. The <code class="language-plaintext highlighter-rouge">process_success</code>
function waits for the key phrase “Well done” to appear in Stackdriver. However,
this message might never appear, and this is the reason <code class="language-plaintext highlighter-rouge">process_success</code> has a
timeout, unlike <code class="language-plaintext highlighter-rouge">process_instance</code>.</p>
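<p>The difference between the two functions, polling without a limit versus polling
with a timeout, can be sketched in Python as follows. This is an illustration
under stated assumptions rather than a port of the script: <code class="language-plaintext highlighter-rouge">wait_until</code> and its
parameters are hypothetical, the timeout is expressed as a maximum number of
attempts, and the actual checks via Cloud SDK and Stackdriver are replaced by
arbitrary callables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time


def wait_until(predicate, interval=10, attempts=None, sleep=time.sleep):
    """Poll the predicate until it returns a true value.

    Return True on success and False once the given positive number of
    attempts is exhausted. With attempts=None, poll indefinitely.
    """
    count = 0
    while True:
        if predicate():
            return True
        count += 1
        if attempts is not None and count == attempts:
            return False
        sleep(interval)


# A simulated instance that disappears on the third poll
responses = iter([False, False, True])
finished = wait_until(lambda: next(responses), interval=0, attempts=5)

# A simulated success message that never arrives
reported = wait_until(lambda: False, interval=0, attempts=3)

print(finished, reported)
</code></pre></div></div>
<p>Leaving <code class="language-plaintext highlighter-rouge">attempts</code> unset corresponds to the behavior of <code class="language-plaintext highlighter-rouge">process_instance</code>, while
a finite value corresponds to that of <code class="language-plaintext highlighter-rouge">process_success</code>.</p>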
<p>All right, there are now three commands to schedule in sequence: <code class="language-plaintext highlighter-rouge">start</code>,
<code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code>. For instance, for training, the exact command sequence is
the following:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make training-start
make training-wait
make training-check
</code></pre></div></div>
<p>We need to create two separate Python files defining two separate Airflow
graphs; however, the graphs will be almost identical except for the triggering
interval and the prefix of the <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code> commands. It then
makes sense to keep the varying parts in separate configuration files and use
the exact same code for constructing the graphs, adhering to the
do-not-repeat-yourself design principle. The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/configs"><code class="language-plaintext highlighter-rouge">scheduler/configs/</code></a> folder
contains the configuration files suggested, and <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/graph.py"><code class="language-plaintext highlighter-rouge">scheduler/graph.py</code></a> is the
Python script creating a graph:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span>
<span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="kn">import</span> <span class="n">BashOperator</span>


<span class="k">def</span> <span class="nf">configure</span><span class="p">():</span>
    <span class="c1"># Extract the directory containing the current file
</span>    <span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">__file__</span><span class="p">)</span>
    <span class="c1"># Extract the name of the current file without its extension
</span>    <span class="n">name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">__file__</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
    <span class="c1"># Load the configuration file corresponding to the extracted name
</span>    <span class="n">config</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">'configs'</span><span class="p">,</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'.json'</span><span class="p">)</span>
    <span class="n">config</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">config</span><span class="p">).</span><span class="n">read</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">config</span>


<span class="k">def</span> <span class="nf">construct</span><span class="p">(</span><span class="n">config</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">_construct_graph</span><span class="p">(</span><span class="n">default_args</span><span class="p">,</span> <span class="n">start_date</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">):</span>
        <span class="n">start_date</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">DAG</span><span class="p">(</span><span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">start_date</span><span class="o">=</span><span class="n">start_date</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_construct_task</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">code</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">BashOperator</span><span class="p">(</span><span class="n">task_id</span><span class="o">=</span><span class="n">name</span><span class="p">,</span> <span class="n">bash_command</span><span class="o">=</span><span class="n">code</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">graph</span><span class="p">)</span>

    <span class="c1"># Construct an empty graph
</span>    <span class="n">graph</span> <span class="o">=</span> <span class="n">_construct_graph</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'graph'</span><span class="p">])</span>
    <span class="c1"># Construct the specified tasks
</span>    <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">_construct_task</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="o">**</span><span class="n">task</span><span class="p">)</span> <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'tasks'</span><span class="p">]]</span>
    <span class="n">tasks</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">([(</span><span class="n">task</span><span class="p">.</span><span class="n">task_id</span><span class="p">,</span> <span class="n">task</span><span class="p">)</span> <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">tasks</span><span class="p">])</span>
    <span class="c1"># Enforce the specified dependencies between the tasks
</span>    <span class="k">for</span> <span class="n">child</span><span class="p">,</span> <span class="n">parent</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'dependencies'</span><span class="p">]:</span>
        <span class="n">tasks</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">set_downstream</span><span class="p">(</span><span class="n">tasks</span><span class="p">[</span><span class="n">child</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">graph</span>


<span class="k">try</span><span class="p">:</span>
    <span class="c1"># Load an appropriate configuration file and construct a graph accordingly
</span>    <span class="n">graph</span> <span class="o">=</span> <span class="n">construct</span><span class="p">(</span><span class="n">configure</span><span class="p">())</span>
<span class="k">except</span> <span class="nb">FileNotFoundError</span><span class="p">:</span>
    <span class="c1"># Exit without errors in case the current file has no configuration file
</span>    <span class="k">pass</span>
</code></pre></div></div>
<p>The script receives no arguments and instead tries to find a suitable
configuration file based on its own name, which can be seen in the <code class="language-plaintext highlighter-rouge">configure</code>
function. Then <code class="language-plaintext highlighter-rouge">scheduler/training.py</code> and <code class="language-plaintext highlighter-rouge">scheduler/application.py</code> can simply
be symbolic links to <code class="language-plaintext highlighter-rouge">scheduler/graph.py</code>, avoiding any code repetition. When
they are read by Airflow, each one will have its own name, and it will load its
own configuration if there is one in <code class="language-plaintext highlighter-rouge">scheduler/configs/</code>.</p>
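<p>The name-based lookup can be illustrated in isolation. The sketch below repeats
the path manipulation from <code class="language-plaintext highlighter-rouge">configure</code> for an arbitrary script path; the
<code class="language-plaintext highlighter-rouge">config_path_for</code> function is a hypothetical name introduced only for this
example. Note that the path is deliberately not passed through
<code class="language-plaintext highlighter-rouge">os.path.realpath</code>, since that would resolve a symbolic link back to
<code class="language-plaintext highlighter-rouge">graph.py</code> and defeat the purpose.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os


def config_path_for(script_path):
    """Return the configuration path that corresponds to the given script."""
    directory = os.path.dirname(script_path)
    # Drop the extension so that training.py maps to training.json
    name = os.path.splitext(os.path.basename(script_path))[0]
    return os.path.join(directory, 'configs', name + '.json')


for script in ['training.py', 'application.py', 'graph.py']:
    script_path = os.path.join('scheduler', script)
    print(script_path, 'uses', config_path_for(script_path))
</code></pre></div></div>
<p>Since there is no <code class="language-plaintext highlighter-rouge">graph.json</code> among the configuration files, loading
<code class="language-plaintext highlighter-rouge">graph.py</code> itself ends up in the <code class="language-plaintext highlighter-rouge">FileNotFoundError</code> branch and defines no
graph.</p>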
<p>For instance, for training, <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/configs/training.json"><code class="language-plaintext highlighter-rouge">scheduler/configs/training.json</code></a> is as follows:</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"graph"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"dag_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"example-prediction-service-training"</span><span class="p">,</span><span class="w">
</span><span class="nl">"schedule_interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0 0 1 * *"</span><span class="p">,</span><span class="w">
</span><span class="nl">"start_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2019-07-01"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"start"</span><span class="p">,</span><span class="w">
</span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' training-start"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wait"</span><span class="p">,</span><span class="w">
</span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' training-wait"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="p">{</span><span class="w">
</span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"check"</span><span class="p">,</span><span class="w">
</span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' training-check"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"dependencies"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="p">[</span><span class="s2">"wait"</span><span class="p">,</span><span class="w"> </span><span class="s2">"start"</span><span class="p">],</span><span class="w">
</span><span class="p">[</span><span class="s2">"check"</span><span class="p">,</span><span class="w"> </span><span class="s2">"wait"</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Each configuration file contains three main sections: <code class="language-plaintext highlighter-rouge">graph</code>, <code class="language-plaintext highlighter-rouge">tasks</code>, and
<code class="language-plaintext highlighter-rouge">dependencies</code>. The first section prescribes the desired start date,
periodicity, and other parameters specific to the graph itself. In this example,
the graph is triggered on the first day of every month at midnight (<code class="language-plaintext highlighter-rouge">0 0 1 *
*</code>), which might be a reasonable frequency for retraining the model. The second
section defines what commands should be executed. It can be seen that there is
one task for each of the three actions. The <code class="language-plaintext highlighter-rouge">-C '${ROOT}/..'</code> part is needed in
order to ensure that the right <code class="language-plaintext highlighter-rouge">Makefile</code> is used, which is taken care of in
<code class="language-plaintext highlighter-rouge">scheduler/graph.py</code>. Lastly, the third section dictates the order of execution
by enforcing dependencies. In this case, we are saying that <code class="language-plaintext highlighter-rouge">wait</code> depends on
(should be executed after) <code class="language-plaintext highlighter-rouge">start</code>, and that <code class="language-plaintext highlighter-rouge">check</code> depends on <code class="language-plaintext highlighter-rouge">wait</code>, forming
a chain of tasks.</p>
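<p>To make the effect of the <code class="language-plaintext highlighter-rouge">dependencies</code> section concrete, the execution order
implied by the pairs can be derived with a topological sort. The sketch below
is only an illustration: Airflow performs its own ordering, and the snippet
assumes Python 3.9 or later for the standard <code class="language-plaintext highlighter-rouge">graphlib</code> module.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from graphlib import TopologicalSorter

# Each pair reads as [child, parent], as in the configuration file
dependencies = [['wait', 'start'], ['check', 'wait']]

# Map each task to the set of tasks that must finish before it
predecessors = {}
for child, parent in dependencies:
    predecessors.setdefault(child, set()).add(parent)

order = list(TopologicalSorter(predecessors).static_order())
print(order)
</code></pre></div></div>
<p>For the chain above, the only valid order is <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and then
<code class="language-plaintext highlighter-rouge">check</code>.</p>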
<p>At this point, the graphs are considered to be complete. In order to make
Airflow aware of them, the repository can simply be cloned into the <code class="language-plaintext highlighter-rouge">dags</code>
directory of Airflow.</p>
<p>Lastly, Airflow itself can live on a separate instance in Compute Engine.
Alternatively, <a href="https://cloud.google.com/composer/">Cloud Composer</a> provided by Google Cloud Platform can be
utilized for this purpose.</p>
<h1 id="conclusion">Conclusion</h1>
<p>At this point, our predictive model is up and running in the cloud
in an autonomous fashion, delivering predictions to the data warehouse to act
upon. The data warehouse is certainly not the end of the journey, but we stop
here and save the discussion for another time.</p>
<p>Although the presented workflow gets the job done, it has its own limitations
and weaknesses, which one has to be aware of. The most prominent one is the
communication between a Docker container running inside a virtual machine and
the scheduler, Airflow. Busy waiting for a virtual machine in Compute Engine to
shut down and for Stackdriver to deliver a certain message is arguably not the
most reliable solution. There is also a certain overhead associated with
starting a virtual machine in Compute Engine, downloading an image from
Container Registry, and launching a container. Furthermore, this approach is not
suitable for online prediction, as the service does not expose any API for other
services to integrate with; its job is to make batch predictions periodically.</p>
<p>If you have any suggestions regarding improving the workflow or simply would
like to share your thoughts, please leave a comment below or send an e-mail.
Also feel free to <a href="https://github.com/chain-rule/example-prediction-service/issues">create an issue</a> or <a href="https://github.com/chain-rule/example-prediction-service/pulls">open a pull request</a> on GitHub. Any
feedback is very much appreciated!</p>
<h1 id="follow-up">Follow-up</h1>
<p>Since its publication, the workflow presented in this article has been
significantly simplified. More specifically, on July 16, 2019, it became
possible to execute arbitrary Docker images on Google <a href="https://cloud.google.com/ai-platform/">AI Platform</a>. The
platform takes care of the whole life cycle of the container, obviating the need
for any wait scripts and ad-hoc communication mechanisms via Stackdriver. Refer
to “<a href="https://medium.com/google-cloud/how-to-run-serverless-batch-jobs-on-google-cloud-ca45a4e33cb1">How to run serverless batch jobs on Google Cloud</a>” by Lak
Lakshmanan for further details.</p>
<h1 id="references">References</h1>
<ul>
<li>Lak Lakshmanan, “<a href="https://medium.com/google-cloud/how-to-run-serverless-batch-jobs-on-google-cloud-ca45a4e33cb1">How to run serverless batch jobs on Google Cloud</a>,” 2019.</li>
</ul>
Ivan Ukhov
As a data scientist focusing on developing data products, you naturally want your work to reach its target audience. Suppose, however, that your company does not have a dedicated engineering team for productizing data-science code. One solution is to seek help in other teams, which are surely busy with their own endeavors, and spend months waiting. Alternatively, you could take the initiative and do it yourself. In this article, we take the initiative and schedule the training and application phases of a predictive model using Apache Airflow, Google Compute Engine, and Docker.