Breaking sticks, or estimation of probability distributions using the Dirichlet process
2021-01-25, Ivan Ukhov, https://blog.ivanukhov.com/2021/01/25/dirichlet-process

<p>Recall the last time you wanted to understand the distribution of given data. One alternative was to plot a histogram. However, it resulted in frustration due to the choice of the number of bins to use, which led to drastically different outcomes. Another alternative was kernel density estimation. Despite having a similar choice to make, it has the advantage of producing smooth estimates, which are more realistic for continuous quantities with regularities. However, kernel density estimation was unsatisfactory too: it did not aid in understanding the underlying structure of the data and, moreover, provided no means of quantifying the uncertainty associated with the results. In this article, we discuss a Bayesian approach to the estimation of data-generating distributions that addresses the aforementioned concerns.</p> <p>The approach we shall discuss is based on the family of Dirichlet processes. How specifically such processes are constructed will be described in the next section; here, we focus on the big picture.</p> <p>A Dirichlet process is a stochastic process, that is, an indexed sequence of random variables. Each realization of this process is a discrete probability distribution, which makes the process a distribution over distributions, similarly to a Dirichlet distribution. The process has only one parameter: a measure $$\nu: \mathcal{B} \to [0, \infty]$$ in a suitable finite measure space $$(\mathcal{X}, \mathcal{B}, \nu)$$ where $$\mathcal{X}$$ is a set, and $$\mathcal{B}$$ is a $$\sigma$$-algebra on $$\mathcal{X}$$.
We shall adopt the following notation:</p> $P \sim \text{Dirichlet Process}(\nu)$ <p>where $$P$$ is a <em>random</em> probability distribution that is distributed according to the Dirichlet process. Note that measure $$\nu$$ does not have to be a probability measure; that is, $$\nu(\mathcal{X}) = 1$$ is not required. To obtain a probability measure, one can divide $$\nu$$ by the total volume $$\lambda = \nu(\mathcal{X})$$:</p> $P_0(\cdot) = \frac{1}{\lambda} \nu(\cdot).$ <p>Since this normalization is always possible, it is common and convenient to replace $$\nu$$ with $$\lambda P_0$$ and consider the process to be parametrized by two quantities instead of one:</p> $P \sim \text{Dirichlet Process}(\lambda P_0).$ <p>Parameter $$\lambda$$ is referred to as the concentration parameter of the process.</p> <p>There are two major ways of using the Dirichlet process for estimating distributions: as a direct prior for the data at hand and as a mixing prior. We begin with the former.</p> <h1 id="direct-prior">Direct prior</h1> <p>Given a data set of $$n$$ observations $$\{ x_i \}_{i = 1}^n$$, a Dirichlet process can be used as a prior:</p> \begin{align} x_i | P_x &amp; \sim P_x, \text{ for } i = 1, \dots, n; \text{ and} \\ P_x &amp; \sim \text{Dirichlet Process}(\lambda P_0). \tag{1} \end{align} <p>It is important to realize that the $$x_i$$’s are assumed to be distributed <em>not</em> according to the Dirichlet process but according to a distribution drawn from the Dirichlet process. Parameter $$\lambda$$ allows one to control the strength of the prior: the larger it is, the more shrinkage toward the prior is induced.</p> <h2 id="inference">Inference</h2> <p>Due to the conjugacy property of the Dirichlet process in the above setting, the posterior is also a Dirichlet process and has the following simple form:</p> $P_x | \{ x_i \}_{i = 1}^n \sim \text{Dirichlet Process}\left( \lambda P_0 + \sum_{i = 1}^n \delta_{x_i} \right).
\tag{2}$ <p>That is, the total volume and normalized measure are updated as follows:</p> \begin{align} \lambda &amp; := \lambda + n \quad \text{and} \\ P_0 &amp; := \frac{\lambda}{\lambda + n} P_0 + \frac{1}{\lambda + n} \sum_{i = 1}^n \delta_{x_i}. \end{align} <p>Here, $$\delta_x(\cdot)$$ is the Dirac measure, meaning that $$\delta_x(X) = 1$$ if $$x \in X$$ for any $$X \subseteq \mathcal{X}$$, and otherwise, it is zero. It can be seen in Equation (2) that the base measure has simply been augmented with unit masses placed at the $$n$$ observed data points.</p> <p>The main question now is, How to draw samples from a Dirichlet process given $$\lambda$$ and $$P_0$$?</p> <p>As noted earlier, a draw from a Dirichlet process is a discrete probability distribution $$P_x$$. The probability measure of this distribution admits the following representation:</p> $P_x(\cdot) = \sum_{i = 1}^\infty p_i \delta_{x_i}(\cdot) \tag{3}$ <p>where $$\{ p_i \}$$ is a set of probabilities that sum up to one, and $$\{ x_i \}$$ is a set of points in $$\mathcal{X}$$. Such a draw can be obtained using the so-called stick-breaking construction, which prescribes $$\{ p_i \}$$ and $$\{ x_i \}$$. To begin with, for practical computations, the infinite summation is truncated to retain only the first $$m$$ elements:</p> $P_x(\cdot) = \sum_{i = 1}^m p_i \delta_{x_i}(\cdot).$ <p>Atoms $$\{ x_i \}_{i = 1}^m$$ are drawn independently from the normalized base measure $$P_0$$. The calculation of probabilities $$\{ p_i \}$$ is more elaborate, and this is where the construction and this article get their name, “stick breaking.” Imagine a stick of unit length, representing the total probability. The procedure is to keep breaking the stick into two parts where, for each iteration, the left part yields $$p_i$$, and the right one, the remainder, is carried over to the next iteration.
How much to break off is decided on by drawing $$m$$ independent realizations from a carefully chosen beta distribution:</p> $q_i \sim \text{Beta}(1, \lambda), \text{ for } i = 1, \dots, m. \tag{4}$ <p>All of them lie in the unit interval and are the proportions to break off of the remainder. When $$\lambda = 1$$, these proportions (of the remainder) are uniformly distributed. When $$\lambda &lt; 1$$, the probability mass is shifted to the right, which means that there are likely to be a small number of large pieces, covering virtually the entire stick. When $$\lambda &gt; 1$$, the probability mass is shifted to the left, which means that there are likely to be a large number of small pieces, struggling to reach the end of the stick.</p> <p>Formally, the desired probabilities are given by the following expression:</p> $p_i = q_i \prod_{j = 1}^{i - 1} (1 - q_j), \text{ for } i = 1, \dots, m,$ <p>which, as noted earlier, are the left parts of the remainder of the stick during each iteration. For instance, $$p_1 = q_1$$, $$p_2 = q_2 (1 - q_1)$$, and so on. Due to the truncation, the probabilities $$\{ p_i \}_{i = 1}^m$$ do not sum up to one, and it is common to set $$q_m := 1$$ so that $$p_m$$ takes up the remaining probability mass.</p> <p>To recapitulate, a single draw from a Dirichlet process is obtained in two steps: prescribe atoms $$\{ x_i \}$$ via draws from the normalized base measure and prescribe the corresponding probabilities $$\{ p_i \}$$ via the stick-breaking construction. The two give a complete description of a discrete probability distribution. Recall that this distribution is still a single draw. By repeating this process many times, one obtains the distribution of this distribution, which can be used to, for instance, quantify uncertainty in the estimation.</p> <h2 id="illustration">Illustration</h2> <p>It is time to demonstrate how the Dirichlet process behaves as a direct prior.
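The two-step recipe above can be sketched in a few lines of code. The calculations in this article are done in R; the following is a minimal Python sketch with an illustrative Gaussian base measure (the function name and parameter values are not from the article):

```python
import numpy as np

def draw_dirichlet_process(base_sampler, lam, m, rng):
    # Atoms: m independent draws from the normalized base measure P0.
    x = np.array([base_sampler(rng) for _ in range(m)])
    # Proportions to break off: q_i ~ Beta(1, lambda), with q_m := 1
    # so that the probabilities sum to one despite the truncation.
    q = rng.beta(1.0, lam, size=m)
    q[-1] = 1.0
    # p_i = q_i * prod_{j < i} (1 - q_j): the piece broken off at step i.
    p = q * np.concatenate(([1.0], np.cumprod(1.0 - q[:-1])))
    return x, p

rng = np.random.default_rng(42)
# One truncated draw from a Dirichlet process with a Gaussian base measure.
atoms, probabilities = draw_dirichlet_process(
    lambda r: r.normal(20.0, 5.0), lam=5.0, m=100, rng=rng)
```

Repeating the call yields independent draws from the process, whose variability reflects the uncertainty encoded by the prior.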
To this end, we shall use a <a href="https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/galaxies.html">data set</a> containing velocities of “82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region.” It was studied in <a href="https://doi.org/10.2307/2289993">Roeder (1990)</a>, which gives us a reference point.</p> <blockquote> <p>For the curious reader, the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2021-01-25-dirichlet-process.Rmd">notebook</a> along with auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2021-01-25-dirichlet-process">scripts</a> that are used for performing all the calculations presented below can be found on GitHub.</p> </blockquote> <p>The empirical cumulative distribution function of the velocity is as follows:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/data-cdf-1.svg" alt="" /></p> <p>Already here, it is apparent that the distribution is multimodal: there are two distinct regions, one to the left and one to the right, where the curve is flat, meaning there are no observations there. The proverbial histogram gives a confirmation:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/data-histogram-1.svg" alt="" /></p> <p>It can be seen that there are a handful of galaxies moving relatively slowly or relatively fast compared to the majority located somewhere in the middle around twenty thousand kilometers per second. For completeness, kernel density estimation results in the following plot:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/data-kde-1.svg" alt="" /></p> <p>How many clusters of galaxies are there? What are their average velocities? How uncertain are these estimates? Our goal is to answer these questions by virtue of the Dirichlet process.</p> <p>Now that the intention is to apply the presented theory in practice, we have to make all choices we have conveniently glossed over.
Specifically, $$P_0$$ has to be chosen, and we shall use the following:</p> $P_0(\cdot) = \text{Gaussian}(\, \cdot \, | \mu_0, \sigma_0^2). \tag{5}$ <p>In the above, $$\text{Gaussian}(\cdot)$$ refers to the probability measure of a Gaussian distribution with parameters $$\mu_0$$ and $$\sigma_0$$. In addition to these two, there is one more: $$\lambda$$. We shall set $$\mu_0$$ and $$\sigma_0$$ to 20 and 5, respectively—which correspond roughly to the mean and standard deviation of the data—and present results for different $$\lambda$$’s to investigate how the prior volume affects shrinkage toward the prior.</p> <p>First, we do not condition on the data to get a better understanding of the prior itself, which corresponds to Equation (1). The following figure shows a single draw from four Dirichlet processes with different $$\lambda$$’s (the gray curves show the cumulative distribution function of the data as a reference):</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/direct-prior-1.svg" alt="" /></p> <p>It can be seen that the larger the prior volume, the smoother the curve. This is because larger $$\lambda$$’s “break” the stick into more pieces, allowing the normalized base measure to be extensively sampled, which, in the limit, converges to this very measure; see Equation (5).</p> <p>Now, conditioning on the observed velocities of galaxies—that is, sampling as shown in Equation (2)—we obtain the following draws from the posterior Dirichlet distributions with different $$\lambda$$’s:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/direct-posterior-1.svg" alt="" /></p> <p>When the prior volume is small, virtually no data points come from $$P_0$$; instead, they are mostly uniform draws from the observed data set, leading to a curve that is nearly indistinguishable from the one of the data (the top curve). 
As $$\lambda$$ gets larger, the prior gets stronger, and the estimate gets shrunk toward it, up to a point where the observations appear to be entirely ignored (the bottom curve).</p> <p>The above model has a serious limitation: it assumes a discrete probability distribution for the data-generating process, which can be seen in the prior and posterior given in Equations (1) and (2), respectively, and it is also apparent in the decomposition given in Equation (3). In some cases, it might be appropriate; however, there are arguably more situations where it is inadequate, including the running example.</p> <h1 id="mixing-prior">Mixing prior</h1> <p>Instead of using a Dirichlet process as a direct prior for the given data, it can be used as a prior for mixing distributions from a given family. The resulting posterior will then naturally inherit the properties of the family, such as continuity. The general structure is as follows:</p> \begin{align} x_i | \theta_i &amp; \sim P_x \left( \theta_i \right), \text{ for } i = 1, \dots, n; \tag{6} \\ \theta_i | P_\theta &amp; \sim P_\theta, \text{ for } i = 1, \dots, n; \text{ and} \\ P_\theta &amp; \sim \text{Dirichlet Process}(\lambda P_0). \\ \end{align} <p>The $$i$$th data point, $$x_i$$, is distributed according to distribution $$P_x$$ with parameters $$\theta_i$$. For instance, $$P_x$$ could refer to the Gaussian family with $$\theta_i = (\mu_i, \sigma_i)$$ identifying a particular member of the family by its mean and standard deviation. Parameters $$\{ \theta_i \}_{i = 1}^n$$ are unknown and distributed according to distribution $$P_\theta$$. Distribution $$P_\theta$$ is not known either and gets a Dirichlet process prior with measure $$\lambda P_0$$.</p> <p>It can be seen in Equation (6) that each data point can potentially have its own unique set of parameters. However, this is not what usually happens in practice.
If $$\lambda$$ is reasonably small, the vast majority of the stick—the one we explained how to break in the previous section—tends to be consumed by a small number of pieces. This makes many data points share the same parameters, which is akin to clustering. In fact, clustering is a prominent use case for the Dirichlet process.</p> <h2 id="inference-1">Inference</h2> <p>Unlike the previous model, there is no conjugacy in this case, and hence the posterior is not a Dirichlet process. There is, however, a simple Markov chain Monte Carlo sampling strategy based on the stick-breaking construction. It belongs to the class of Gibbs samplers and is as follows.</p> <p>Similarly to Equation (3), we have the following decomposition:</p> $P_m(\cdot) = \sum_{i = 1}^\infty p_i P_x(\cdot | \theta_i)$ <p>where $$P_m$$ is the probability measure of the mixture. As before, the infinite decomposition has to be made finite to be usable in practice:</p> $P_m(\cdot) = \sum_{i = 1}^m p_i P_x(\cdot | \theta_i).$ <p>Here, $$m$$ represents an upper limit on the number of mixture components. Each data point $$x_i$$, for $$i = 1, \dots, n$$, is mapped to one of the $$m$$ components, which we denote by $$k_i \in \{ 1, \dots, m \}$$. In other words, $$k_i$$ takes values from 1 to $$m$$ and gives the index of the component of the $$i$$th observation.</p> <p>There are $$m + m \times |\theta| + n$$ parameters to be inferred where $$|\theta|$$ denotes the number of parameters of $$P_x$$. These parameters are $$\{ p_i \}_{i = 1}^m$$, $$\{ \theta_i \}_{i = 1}^m$$, and $$\{ k_i \}_{i = 1}^n$$. As usual in Gibbs sampling, the parameters assume arbitrary but compatible initial values. 
The sampler has the following three steps.</p> <p>First, given $$\{ p_i \}$$, $$\{ \theta_i \}$$, and $$\{ x_i \}$$, the mapping of the observations to the mixture components, $$\{ k_i \}$$, is updated as follows:</p> $k_i \sim \text{Categorical}\left( m, \left\{ \frac{p_j P_x(x_i | \theta_j)}{\sum_{l = 1}^m p_l P_x(x_i | \theta_l)} \right\}_{j = 1}^m \right), \text{ for } i = 1, \dots, n.$ <p>That is, $$k_i$$ is a draw from a categorical distribution with $$m$$ categories whose unnormalized probabilities are given by $$p_j P_x(x_i | \theta_j)$$, for $$j = 1, \dots, m$$.</p> <p>Second, given $$\{ k_i \}$$, the probabilities of the mixture components, $$\{ p_i \}$$, are updated using the stick-breaking construction described earlier. This time, however, the beta distribution for sampling $$\{ q_i \}$$ in Equation (4) is replaced with the following:</p> $q_i \sim \text{Beta}\left( 1 + n_i, \lambda + \sum_{j = i + 1}^m n_j \right), \text{ for } i = 1, \dots, m,$ <p>where</p> $n_i = \sum_{j = 1}^n I_{\{i\}}(k_j), \text{ for } i = 1, \dots, m,$ <p>is the number of data points that are currently allocated to component $$i$$. Here, $$I_A$$ is the indicator function of a set $$A$$. As before, in order for the $$p_i$$’s to sum up to one, it is common to set $$q_m := 1$$.</p> <p>Third, given $$\{ k_i \}$$ and $$\{ x_i \}$$, the parameters of the mixture components, $$\{ \theta_i \}$$, are updated. This is done by sampling from the posterior distribution of each component. In this case, the posterior is obtained by updating a prior of choice using the data points that are currently allocated to the corresponding component. To streamline this step, a conjugate prior for the data distribution, $$P_x$$, is commonly utilized, which we shall illustrate shortly.</p> <p>To recapitulate, a single draw from the posterior is obtained in a number of steps where parameters or groups of parameters are updated in turn, while treating the other parameters as known.
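The first two steps lend themselves to a compact implementation. As a minimal Python sketch with Gaussian components (the article's implementation is in R, and the helper names are illustrative):

```python
import numpy as np

def update_assignments(x, p, theta, rng):
    # Step 1: draw each k_i from a categorical distribution whose
    # unnormalized probabilities are p_j * P_x(x_i | theta_j),
    # here with Gaussian components theta = (means, standard deviations).
    mu, sigma = theta
    like = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma[None, :])**2) / sigma[None, :]
    w = p[None, :] * like
    w /= w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=row) for row in w])

def update_probabilities(k, m, lam, rng):
    # Step 2: stick-breaking with Beta(1 + n_i, lambda + sum_{j > i} n_j),
    # where n_i counts the points currently allocated to component i.
    n = np.bincount(k, minlength=m)
    tail = np.concatenate((np.cumsum(n[::-1])[::-1][1:], [0]))
    q = rng.beta(1.0 + n, lam + tail)
    q[-1] = 1.0  # so that the probabilities sum to one
    return q * np.concatenate(([1.0], np.cumprod(1.0 - q[:-1])))
```

Alternating these two functions with a component-specific posterior update (the third step) yields one full Gibbs iteration.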
This Gibbs procedure is very flexible. Other parameters can be inferred too, instead of setting them to fixed values. An important example is the concentration parameter, $$\lambda$$. This parameter controls the formation of clusters, and one might let the data decide what the value should be, in which case a step similar to the third one is added to the procedure to update $$\lambda$$. This will also be illustrated below.</p> <h2 id="illustration-1">Illustration</h2> <p>We continue working with the galaxy data. For concreteness, consider the following choices:</p> \begin{align} \theta_i &amp;= (\mu_i, \sigma_i), \text{ for } i = 1, \dots, n; \\ P_x (\theta_i) &amp;= \text{Gaussian}(\mu_i, \sigma_i^2), \text{ for } i = 1, \dots, n; \text{ and} \\ P_0(\cdot) &amp;= \text{Gaussian–Scaled-Inverse-}\chi^2(\, \cdot \, | \mu_0, \kappa_0, \nu_0, \sigma_0^2). \end{align} \tag{7} <p>In the above, $$\text{Gaussian–Scaled-Inverse-}\chi^2(\cdot)$$ refers to the probability measure of a bivariate distribution composed of a conditional Gaussian and an unconditional scaled inverse chi-squared distribution. Some intuition about this distribution can be built via the following decomposition:</p> \begin{align} \mu_i | \sigma_i^2 &amp; \sim \text{Gaussian}\left(\mu_0, \frac{\sigma_i^2}{\kappa_0}\right) \text{ and} \\ \sigma_i^2 &amp; \sim \text{Scaled-Inverse-}\chi^2(\nu_0, \sigma_0^2). \end{align} \tag{8} <p>This prior is a conjugate prior for a Gaussian data distribution with unknown mean and variance, which we assume here. This means that the posterior is also a Gaussian–scaled-inverse-chi-squared distribution.
Given a data set with $$n$$ observations $$x_1, \dots, x_n$$, the four parameters of the prior are updated simultaneously (not sequentially) as follows:</p> \begin{align} \mu_0 &amp; := \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{n}{\kappa_0 + n} \mu_x, \\ \kappa_0 &amp; := \kappa_0 + n, \\ \nu_0 &amp; := \nu_0 + n, \text{ and} \\ \sigma_0^2 &amp; := \frac{1}{\nu_0 + n} \left( \nu_0 \sigma_0^2 + ss_x + \frac{\kappa_0 n}{\kappa_0 + n}(\mu_x - \mu_0)^2 \right) \end{align} <p>where $$\mu_x = \sum_{i = 1}^n x_i / n$$ and $$ss_x = \sum_{i = 1}^n (x_i - \mu_x)^2$$. It can be seen that $$\kappa_0$$ and $$\nu_0$$ act as counters of the number of observations; $$\mu_0$$ is a weighted sum of two means; and $$\nu_0 \sigma_0^2$$ is a sum of two sums of squares and a third term increasing the uncertainty due to the difference in the means. In the Gibbs sampler, each component (each cluster of galaxies) will have its own posterior based on the data points that are assigned to that component during each iteration of the process. Therefore, $$n$$, $$\mu_x$$, and $$ss_x$$ will generally be different for different components and, moreover, will vary from iteration to iteration.</p> <p>We set $$\mu_0$$ to 20, which is roughly the mean of the data, and $$\nu_0$$ to 3, which is the smallest integer that allows the scaled chi-squared distribution to have a finite expectation. The choice of $$\kappa_0$$ and $$\sigma_0$$ is more subtle. Recall Equation (8). What we would like from the prior is to allow for free formation of clusters in a region generously covering the support of the data. To this end, the uncertainty in the mean, $$\mu_i$$, has to be high; however, it should not come from $$\sigma_i$$, since it would produce very diffuse clusters. We set $$\kappa_0$$ to 0.01 to magnify the variance of $$\mu_i$$ without affecting $$\sigma_i$$, and $$\sigma_0$$ to 1 to keep clusters compact.</p> <p>Now, let us take a look at what the above choices entail. 
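As an aside, the simultaneous update above is straightforward to express as a function (a minimal Python sketch, not the R code of the notebook; the name is illustrative):

```python
import numpy as np

def update_gaussian_scaled_inv_chi2(x, mu0, kappa0, nu0, sigma0_sq):
    # Posterior hyperparameters after observing the data points in x,
    # following the simultaneous update rule above.
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_x = x.mean()
    ss_x = np.sum((x - mu_x)**2)
    mu_n = kappa0 / (kappa0 + n) * mu0 + n / (kappa0 + n) * mu_x
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    sigma_n_sq = (nu0 * sigma0_sq + ss_x
                  + kappa0 * n / (kappa0 + n) * (mu_x - mu0)**2) / (nu0 + n)
    return mu_n, kappa_n, nu_n, sigma_n_sq
```

In the Gibbs sampler, such a function would be applied separately to the data points allocated to each component.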
The following figure illustrates the prior for the mean of a component:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-mu-1.svg" alt="" /></p> <p>The negative part is unrealistic for velocity; however, it is rarely a problem in practice. What is important is that there is a generous coverage of the plausible values. The following figure shows the prior for the standard deviation of a component:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-sigma-1.svg" alt="" /></p> <p>The bulk is below the standard deviation of the data; however, this is by choice: we expect more than one cluster of galaxies with similar velocities.</p> <p>As mentioned earlier, we intend to include $$\lambda$$ in the inference. First, we put the following prior:</p> $\lambda \sim \text{Gamma}(\alpha_0, \beta_0). \tag{9}$ <p>Note this is the rate parameterization of the Gamma family. Conditionally, this is a conjugate prior with the following update rule for the two parameters:</p> \begin{align} \alpha_0 &amp; := \alpha_0 + m - 1 \quad \text{and} \\ \beta_0 &amp; := \beta_0 - \sum_{i = 1}^{m - 1} \ln(1 - q_i) \end{align} <p>where $$\{ q_i \}$$ come from the stick-breaking construction. This is a fourth step in the Gibbs sampler. We set $$\alpha_0$$ and $$\beta_0$$ to 2 and 0.1, respectively, which entails the following prior assumption about $$\lambda$$:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-lambda-1.svg" alt="" /></p> <p>The parameter is allowed to vary freely from small to large values, as desired.</p> <p>Having chosen all priors and their hyperparameters, we are ready to investigate the behavior of the entire model; see Equations (6), (7), and (9). In what follows, we shall limit the number of mixture components to 25; that is, $$m = 25$$. Furthermore, we shall perform 2000 Gibbs iterations and discard the first half as a warm-up period. 
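For reference, the fourth Gibbs step, which redraws $$\lambda$$ from its conditional posterior, amounts to the following (a minimal Python sketch; the name is illustrative):

```python
import numpy as np

def update_concentration(q, alpha0, beta0, rng):
    # Conditional posterior of lambda given the stick-breaking
    # proportions q (with q_m fixed to one, hence q[:-1] below):
    # alpha := alpha0 + m - 1 and beta := beta0 - sum log(1 - q_i).
    m = len(q)
    alpha = alpha0 + m - 1
    beta = beta0 - np.sum(np.log1p(-q[:-1]))
    # NumPy parameterizes the gamma by shape and scale = 1 / rate.
    return rng.gamma(alpha, 1.0 / beta)
```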
As before, we start without conditioning on the data to observe draws from the prior itself. The following figure shows two sample draws:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-prior-check-1.svg" alt="" /></p> <p>It can be seen that clusters of galaxies can appear anywhere in the region of interest and can be of various sizes. We conclude that the prior is adequate. When taking the observed velocities into account, we obtain a full posterior distribution in the form of 1000 draws. The following shows two random draws:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-check-1.svg" alt="" /></p> <p>Indeed, mixture components have started to appear in the regions where there are observations.</p> <p>Before we proceed to the final summary of results, it is prudent to inspect sample chains for a few parameters in order to ensure there are no problems with convergence to the stationary distribution. The following shows the number of occupied components among the 25 permitted:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-k-1.svg" alt="" /></p> <p>The chain fluctuates around a fixed level without any prominent pattern, as it should.
One can plot the actual marginal posterior distribution for the number of components; however, it is already clear that the distribution of the number of clusters of galaxies is mostly between 5 and 10 with a median of 7.</p> <p>As for the concentration parameter, $$\lambda$$, the chain is as follows:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-lambda-1.svg" alt="" /></p> <p>The behavior is uneventful, which is a good sign.</p> <p>Let us now take a look at the posterior distributions of the first seven components highlighted earlier (note the different scales on the vertical axes):</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-mu-1.svg" alt="" /></p> <p>The components clearly change roles, which can be seen from the multimodal nature of the distributions. Component 1 is most often at 10 (times $$10^6$$ m/s); however, it also peaks between 24 and 25 and even above 30. Components 2 and 3 are the most certain ones, which is due to a relatively large number of samples present in the corresponding region. They seem to exchange roles and capture velocities of around 20 and 23. Components 4 and 5, on the other hand, appear to play the same role. Unlike Component 1, they are most likely to be found at around 33. Components 6 and 7 are similar too. They seem to be responsible for the small formation to the left and right next to the bulk in the middle (at 16); recall the histogram of the data. The small formation on the other side of the bulk at around 26 is captured as well, which is mostly done by Component 6.</p> <p>Lastly, we summarize the inference using the following figure where the median distribution and a 95% uncertainty band—composed of distributions at the 0.025 and 0.975 quantiles—are plotted:</p> <p><img src="/assets/images/2021-01-25-dirichlet-process/mixture-posterior-summary-1.svg" alt="" /></p> <p>In this view, only five components are visible to the naked eye.
The median curve matches the findings in <a href="https://doi.org/10.2307/2289993">Roeder (1990)</a> well. Judging by the width of the uncertainty band, there are a lot of plausible alternatives, and it is important to communicate this uncertainty to those who base decisions on the inference. The ability to quantify uncertainty with such ease is a prominent advantage of Bayesian inference.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, the family of Dirichlet processes has been presented in the context of Bayesian inference. More specifically, it has been shown how a Dirichlet process can be utilized as a prior for an unknown discrete distribution and as a prior for mixing distributions from a given family. In both cases, it has been illustrated how to perform inference via a finite approximation and the stick-breaking construction.</p> <p>Clearly, the overall procedure is more complicated than counting observations falling in a number of fixed bins, which is what a histogram does, or placing kernels all over the place, which is what a kernel density estimator does. However, “anything in life worth having is worth working for.” The advantages of the Bayesian approach include the ability to incorporate prior knowledge, which is crucial in situations with little data, and the ability to propagate and quantify uncertainty, which is a must.</p> <blockquote> <p>Recall that the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2021-01-25-dirichlet-process.Rmd">notebook</a> along with auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2021-01-25-dirichlet-process">scripts</a> that were used for performing the calculations presented above can be found on GitHub.
Any feedback is welcome!</p> </blockquote> <h1 id="acknowledgments">Acknowledgments</h1> <p>I would like to thank <a href="https://www.mattiasvillani.com/">Mattias Villani</a> for the insightful and informative graduate course in Bayesian statistics titled “<a href="https://github.com/mattiasvillani/AdvBayesLearnCourse">Advanced Bayesian learning</a>,” which was the inspiration behind writing this article, and for his guidance regarding the implementation.</p> <h1 id="references">References</h1> <ul> <li>Andrew Gelman et al., <em><a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a></em>, Chapman and Hall/CRC, 2014.</li> <li>Kathryn Roeder, “<a href="https://doi.org/10.2307/2289993">Density estimation with confidence sets exemplified by superclusters and voids in galaxies</a>,” Journal of the American Statistical Association, 1990.</li> <li>Rick Durrett, <em><a href="https://services.math.duke.edu/~rtd/PTE/pte.html">Probability: Theory and Examples</a></em>, Cambridge University Press, 2010.</li> </ul>

Heteroscedastic Gaussian process regression
2020-06-22, Ivan Ukhov, https://blog.ivanukhov.com/2020/06/22/gaussian-process

<p>Gaussian process regression is a nonparametric Bayesian technique for modeling relationships between variables of interest. The vast flexibility and rigorous mathematical foundation of this approach make it the default choice in many problems involving small- to medium-sized data sets. In this article, we illustrate how Gaussian process regression can be utilized in practice. To make the case more compelling, we consider a setting where linear regression would be inadequate. The focus will be <em>not</em> on getting the job done as fast as possible but on learning the technique and understanding the choices being made.</p> <h1 id="data">Data</h1> <p>Consider the following example taken from <a href="http://www.stat.tamu.edu/~carroll/semiregbook"><em>Semiparametric Regression</em></a> by Ruppert <em>et al.</em>:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/data-1.svg" alt="" /></p> <p>The figure shows 221 observations collected in a <a href="https://en.wikipedia.org/wiki/Lidar">light detection and ranging</a> experiment. Each observation can be interpreted as the sum of the true underlying response at the corresponding distance and random noise. It can be clearly seen that the variance of the noise varies with the distance: the spread is substantially larger toward the right-hand side. This phenomenon is known as heteroscedasticity. Homoscedasticity (the absence of heteroscedasticity) is one of the key assumptions of linear regression. Applying linear regression to the above problem would yield suboptimal results.
The estimates of the regression coefficients would still be unbiased; however, the standard errors of the coefficients would be incorrect and hence misleading. A different modeling technique is needed in this case.</p> <p>The above data set will be our running example. More formally and slightly more generally, we assume that there is a data set of $$m$$ observations:</p> $\left\{ (\mathbf{x}_i, y_i): \, \mathbf{x}_i \in \mathbb{R}^d; \, y_i \in \mathbb{R}; \, i = 1, \dots, m \right\}$ <p>where the independent variable, $$\mathbf{x}$$, is $$d$$-dimensional, and the dependent variable, $$y$$, is scalar. In the running example, $$d$$ is 1, and $$m$$ is 221. It is time for modeling.</p> <h1 id="model">Model</h1> <p>To begin with, consider the following model with additive noise:</p> $y_i = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m. \tag{1}$ <p>In the above, $$f: \mathbb{R}^d \to \mathbb{R}$$ represents the true but unknown underlying function, and $$\epsilon_i$$ represents the perturbation of the $$i$$th observation by random noise. In the classical linear-regression setting, the unknown function is modeled as a linear combination of (arbitrary transformations of) the $$d$$ covariates. Instead of assuming any particular functional form, we put a Gaussian process prior on the function:</p> $f(\mathbf{x}) \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right).$ <p>The above notation means that, before observing any data, the function is a draw from a Gaussian process with zero mean and a covariance function $$k$$. The covariance function dictates the degree of correlation between two arbitrary locations $$\mathbf{x}$$ and $$\mathbf{x}'$$ in $$\mathbb{R}^d$$.
For instance, a frequent choice for $$k$$ is the squared-exponential covariance function:</p> $k(\mathbf{x}, \mathbf{x}') = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right)$ <p>where $$\|\cdot\|_2$$ stands for the Euclidean norm, $$\sigma_\text{process}^2$$ is the variance (to see this, substitute $$\mathbf{x}$$ for $$\mathbf{x}'$$), and $$\ell_\text{process}$$ is known as the length scale. While the variance parameter is intuitive, the length-scale one requires an illustration. The parameter controls the speed with which the correlation fades with the distance. The following figure shows 10 random draws for $$\ell_\text{process} = 0.1$$:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-short-1.svg" alt="" /></p> <p>With $$\ell_\text{process} = 0.5$$, the behavior changes to the following:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-long-1.svg" alt="" /></p> <p>It can be seen that it takes a greater distance for a function with a larger length scale (<em>bottom</em>) to change to the same extent compared to a function with a smaller length scale (<em>top</em>).</p> <p>Let us now return to Equation (1) and discuss the error terms, $$\epsilon_i$$. In linear regression, they are modeled as independent identically distributed Gaussian random variables:</p> $\epsilon_i \sim \text{Gaussian}\left( 0, \sigma_\text{noise}^2 \right), \text{ for } i = 1, \dots, m. \tag{2}$ <p>This is also the approach one can take with Gaussian process regression; however, one does not have to. There are reasons to believe the problem at hand is heteroscedastic, and it should be reflected in the model. To this end, the magnitude of the noise is allowed to vary with the covariates:</p> $\epsilon_i | \mathbf{x}_i \sim \text{Gaussian}\left(0, \sigma^2_{\text{noise}, i}\right), \text{ for } i = 1, \dots, m.
\tag{3}$ <p>The error terms are still independent (given the covariates) but not identically distributed. At this point, one has to make a choice about the dependence of $$\sigma_{\text{noise}, i}$$ on $$\mathbf{x}_i$$. This dependence could be modeled with another Gaussian process with an appropriate link function to ensure $$\sigma_{\text{noise}, i}$$ is nonnegative. Another reasonable choice is a generalized linear model, which is what we shall use:</p> $\ln \sigma^2_{\text{noise}, i} = \alpha_\text{noise} + \boldsymbol{\beta}^\intercal_\text{noise} \, \mathbf{x}_i, \text{ for } i = 1, \dots, m, \tag{4}$ <p>where $$\alpha_\text{noise}$$ is the intercept of the regression line, and $$\boldsymbol{\beta}_\text{noise} \in \mathbb{R}^d$$ contains the slopes.</p> <p>Thus far, a model for the unknown function $$f$$ and a model for the noise have been prescribed. In total, there are $$d + 3$$ parameters: $$\sigma_\text{process}$$, $$\ell_\text{process}$$, $$\alpha_\text{noise}$$, and $$\beta_{\text{noise}, i}$$ for $$i = 1, \dots, d$$. The first two are positive, and the rest are arbitrary. The final piece is prior distributions for these parameters.</p> <p>The variance of the covariance function, $$\sigma^2_\text{process}$$, corresponds to the amount of variance in the data that is explained by the Gaussian process. It poses no particular problem and can be tackled with a half-Gaussian or a half-Student’s t distribution:</p> $\sigma_\text{process} \sim \text{Half-Gaussian}\left( 0, 1 \right).$ <p>The notation means that the standard Gaussian distribution is truncated at zero and renormalized. The nontrivial mass around zero implied by the prior is considered to be beneficial in this case.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p> <p>A prior for the length scale of the covariance function, $$\ell_\text{process}$$, should be chosen with care.
Small values—especially those below the resolution of the data—give the Gaussian process extreme flexibility and easily lead to overfitting. Moreover, there are numerical ramifications of the length scale approaching zero: the quality of Hamiltonian Monte Carlo sampling degrades.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup> The bottom line is that a prior penalizing values close to zero is needed. A reasonable choice is an inverse gamma distribution:</p> $\ell_\text{process} \sim \text{Inverse Gamma}\left( 1, 1 \right).$ <p>To understand the implications, let us perform a prior predictive check for this component in isolation:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/prior-process-length-scale-1.svg" alt="" /></p> <p>It can be seen that the density is very low in the region close to zero, while being rather permissive to the right of that region, especially considering the scale of the distance in the data; recall the very first figure. Consequently, the choice is adequate.</p> <p>The choice of priors for the parameters of the noise is complicated by the nonlinear link function; see Equation (4). What is important to realize is that small amounts of noise correspond to negative values in the linear space, which is probably what one should be expecting given the scale of the response. Therefore, the priors should allow for large negative values. Let us make an educated assumption and perform a prior predictive check to understand the consequences.
Consider the following:</p> \begin{align} \alpha_\text{noise} &amp; \sim \text{Gaussian}\left( -1, 1 \right) \text{ and} \\ \beta_{\text{noise}, i} &amp; \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align} <p>The density of $$\sigma_\text{noise}$$ without considering the regression slopes is depicted below (note the logarithmic scale on the horizontal axis):</p> <p><img src="/assets/images/2020-06-22-gaussian-process/prior-noise-sigma-1.svg" alt="" /></p> <p>The variability in the intercept, $$\alpha_\text{noise}$$, allows the standard deviation, $$\sigma_\text{noise}$$, to comfortably vary from small to large values, keeping in mind the scale of the response. Here are two draws from the prior distribution of the noise, including Equations (3) and (4):</p> <p><img src="/assets/images/2020-06-22-gaussian-process/prior-noise-1.svg" alt="" /></p> <p>The large ones are perhaps unrealistic and could be addressed by further shifting the distribution of the intercept. 
However, they should not cause problems for the inference.</p> <p>Putting everything together, the final model is as follows:</p> \begin{align} y_i &amp; = f(\mathbf{x}_i) + \epsilon_i, \text{ for } i = 1, \dots, m; \\ f(\mathbf{x}) &amp; \sim \text{Gaussian Process}\left( 0, k(\mathbf{x}, \mathbf{x}') \right); \\ k(\mathbf{x}, \mathbf{x}') &amp; = \sigma_\text{process}^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2 \, \ell_\text{process}^2} \right); \\ \epsilon_i | \mathbf{x}_i &amp; \sim \text{Gaussian}\left( 0, \sigma^2_{\text{noise}, i} \right), \text{ for } i = 1, \dots, m; \\ \ln \sigma^2_{\text{noise}, i} &amp; = \alpha_\text{noise} + \boldsymbol{\beta}_\text{noise}^\intercal \, \mathbf{x}_i, \text{ for } i = 1, \dots, m; \\ \sigma_\text{process} &amp; \sim \text{Half-Gaussian}\left( 0, 1 \right); \\ \ell_\text{process} &amp; \sim \text{Inverse Gamma}\left( 1, 1 \right); \\ \alpha_\text{noise} &amp; \sim \text{Gaussian}\left( -1, 1 \right); \text{ and} \\ \beta_{\text{noise}, i} &amp; \sim \text{Gaussian}\left( 0, 1 \right), \text{ for } i = 1, \dots, d.\\ \end{align} <p>This concludes the modeling part. The remaining two steps are to infer the parameters and to make predictions using the posterior predictive distribution.</p> <h1 id="inference">Inference</h1> <p>The model is analytically intractable; one has to resort to sampling or variational methods for inferring the parameters. We shall use Hamiltonian Markov chain Monte Carlo sampling via <a href="https://mc-stan.org/">Stan</a>. 
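</p>

<p>Before turning to the Stan program, the covariance structure just assembled can be sketched in plain NumPy (an illustrative reimplementation with arbitrary parameter values, not the code behind this article):</p>

```python
import numpy as np

def covariance(x, sigma_process, ell_process):
    # Squared-exponential covariance matrix over all pairs of rows of x
    difference = x[:, None, :] - x[None, :, :]
    distance_squared = np.sum(difference**2, axis=-1)
    return sigma_process**2 * np.exp(-distance_squared / (2 * ell_process**2))

def noise_variance(x, alpha_noise, beta_noise):
    # Log-linear model for the heteroscedastic noise, as in Equation (4)
    return np.exp(alpha_noise + x @ beta_noise)

# A one-dimensional example with five locations
x = np.linspace(0, 1, 5)[:, None]
K = covariance(x, sigma_process=1.0, ell_process=0.5)
D = np.diag(noise_variance(x, alpha_noise=-1.0, beta_noise=np.array([1.0])))
# The marginal covariance of the response y is then K + D
```

<p>The sum <code class="language-plaintext highlighter-rouge">K + D</code> is what enters the multivariate Gaussian likelihood of the model.</p>

<p>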
The model can be seen in the following listing, where the notation closely follows the one used throughout the article:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span> <span class="kt">int</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">1</span><span class="o">&gt;</span> <span class="n">d</span><span class="p">;</span> <span class="kt">int</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">1</span><span class="o">&gt;</span> <span class="n">m</span><span class="p">;</span> <span class="n">vector</span><span class="p">[</span><span class="n">d</span><span class="p">]</span> <span class="n">x</span><span class="p">[</span><span class="n">m</span><span class="p">];</span> <span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">y</span><span class="p">;</span> <span class="p">}</span> <span class="n">transformed</span> <span class="n">data</span> <span class="p">{</span> <span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">mu</span> <span class="o">=</span> <span class="n">rep_vector</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">m</span><span class="p">);</span> <span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">d</span><span class="p">]</span> <span class="n">X</span><span class="p">;</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span 
class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="err">'</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="n">parameters</span> <span class="p">{</span> <span class="n">real</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span> <span class="n">sigma_process</span><span class="p">;</span> <span class="n">real</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span> <span class="n">ell_process</span><span class="p">;</span> <span class="n">real</span> <span class="n">alpha_noise</span><span class="p">;</span> <span class="n">vector</span><span class="p">[</span><span class="n">d</span><span class="p">]</span> <span class="n">beta_noise</span><span class="p">;</span> <span class="p">}</span> <span class="n">model</span> <span class="p">{</span> <span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span> <span class="n">K</span> <span class="o">=</span> <span class="n">cov_exp_quad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">sigma_process</span><span class="p">,</span> <span class="n">ell_process</span><span class="p">);</span> <span class="n">vector</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">sigma_noise_squared</span> <span class="o">=</span> <span class="n">exp</span><span class="p">(</span><span class="n">alpha_noise</span> <span class="o">+</span> <span class="n">X</span> <span class="o">*</span> <span class="n">beta_noise</span><span class="p">);</span> <span class="n">matrix</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">m</span><span 
class="p">]</span> <span class="n">L</span> <span class="o">=</span> <span class="n">cholesky_decompose</span><span class="p">(</span><span class="n">add_diag</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">sigma_noise_squared</span><span class="p">));</span> <span class="n">y</span> <span class="o">~</span> <span class="n">multi_normal_cholesky</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">L</span><span class="p">);</span> <span class="n">sigma_process</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="n">ell_process</span> <span class="o">~</span> <span class="n">inv_gamma</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="n">alpha_noise</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="n">beta_noise</span> <span class="o">~</span> <span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="p">}</span> </code></pre></div></div> <p>In the <code class="language-plaintext highlighter-rouge">parameters</code> block, one can find the $$d + 3$$ parameters identified earlier. As regards the <code class="language-plaintext highlighter-rouge">model</code> block, it is worth noting that there is no Gaussian process distribution in Stan. Instead, a multivariate Gaussian distribution is utilized to model $$f$$ at $$\mathbf{X} = (\mathbf{x}_i)_{i = 1}^m \in \mathbb{R}^{m \times d}$$ and eventually $$\mathbf{y} = (y_i)_{i = 1}^m$$, which is for a good reason.
Even though a Gaussian process is an infinite-dimensional object, in practice, one always works with finite amounts of data. For instance, in the running example, there are only 221 data points. By definition, a Gaussian process is a stochastic process with the condition that any finite collection of points from this process has a multivariate Gaussian distribution. This fact, combined with the conditional independence of the process and the noise given the covariates, yields the following and explains the usage of a multivariate Gaussian distribution:</p> $\mathbf{y} | \mathbf{X}, \sigma_\text{process}, \ell_\text{process}, \alpha_\text{noise}, \boldsymbol{\beta}_\text{noise} \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \mathbf{K} + \mathbf{D} \right)$ <p>where $$\mathbf{K} \in \mathbb{R}^{m \times m}$$ is a covariance matrix computed by evaluating the covariance function $$k$$ at all pairs of locations in the observed data, and $$\mathbf{D} = \text{diag}(\sigma^2_{\text{noise}, i})_{i = 1}^m \in \mathbb{R}^{m \times m}$$ is a diagonal matrix of the variances of the noise at the corresponding locations.</p> <p>After running the inference, the following posterior distributions are obtained:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/posterior-parameters-1.svg" alt="" /></p> <p>The intervals at the bottom of the densities are 66% and 95% equal-tailed probability intervals, and the dots indicate the medians. Let us also take a look at the 95% probability interval for the noise with respect to the distance:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-noise-1.svg" alt="" /></p> <p>As expected, the variance of the noise increases with the distance.</p> <h1 id="prediction">Prediction</h1> <p>Suppose there are $$n$$ locations $$\mathbf{X}_\text{new} = (\mathbf{x}_{\text{new}, i})_{i = 1}^n \in \mathbb{R}^{n \times d}$$ where one wishes to make predictions.
Let $$\mathbf{f}_\text{new} \in \mathbb{R}^n$$ be the values of $$f$$ at those locations. Given all the data and the parameters, the joint distribution of $$\mathbf{y}$$ and $$\mathbf{f}_\text{new}$$ is as follows:</p> $\left[ \begin{matrix} \mathbf{y} \\ \mathbf{f}_\text{new} \end{matrix} \right] \sim \text{Multivariate Gaussian}\left( \mathbf{0}, \left[ \begin{matrix} \mathbf{K} + \mathbf{D} &amp; k(\mathbf{X}, \mathbf{X}_\text{new}) \\ k(\mathbf{X}_\text{new}, \mathbf{X}) &amp; k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) \end{matrix} \right] \right)$ <p>where, with a slight abuse of notation, $$k(\cdot, \cdot)$$ stands for a covariance matrix computed by evaluating the covariance function $$k$$ at the specified locations, which is analogous to $$\mathbf{K}$$. It is well known (see <a href="http://www.gaussianprocess.org/gpml">Rasmussen et al. 2006</a>, for instance) that the conditional distribution of $$\mathbf{f}_\text{new}$$ given $$\mathbf{y}$$ is a multivariate Gaussian with the following mean vector and covariance matrix, respectively:</p> \begin{align} E(\mathbf{f}_\text{new}) &amp; = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{f}_\text{new}) &amp; = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}). \end{align} <p>The final component is the noise, as per Equation (1).
The noise does not change the mean of the multivariate Gaussian distribution but does magnify the variance:</p> \begin{align} E(\mathbf{y}_\text{new}) &amp; = k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} \, \mathbf{y} \quad \text{and} \\ \text{cov}(\mathbf{y}_\text{new}) &amp; = k(\mathbf{X}_\text{new}, \mathbf{X}_\text{new}) - k(\mathbf{X}_\text{new}, \mathbf{X})(\mathbf{K} + \mathbf{D})^{-1} k(\mathbf{X}, \mathbf{X}_\text{new}) + \text{diag}(\sigma^2_\text{noise}(\mathbf{X}_\text{new})) \end{align} <p>where $$\text{diag}(\sigma^2_\text{noise}(\cdot))$$ stands for a diagonal matrix composed of the noise variance evaluated at the specified locations, which is analogous to $$\mathbf{D}$$.</p> <p>Given a set of draws from the joint posterior distribution of the parameters and the last two expressions, it is now straightforward to draw samples from the posterior predictive distribution of the response: for each draw of the parameters, one has to evaluate the mean vector and the covariance matrix and sample the corresponding multivariate Gaussian distribution. The result is given in the following figure:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-heteroscedastic-1.svg" alt="" /></p> <p>The graph shows the mean value of the posterior predictive distribution given by the black line along with a 95% equal-tailed probability band about the mean. It can be seen that the uncertainty in the predictions is adequately captured along the entire support. Naturally, the full predictive posterior distribution is available at any location of interest.</p> <p>Before we conclude, let us illustrate what would happen if the data were modeled as having homogeneous noise. To this end, the variance of the noise is assumed to be independent of the covariates, as in Equation (2). 
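</p>

<p>As an aside, the two predictive expressions above can be sketched for a single posterior draw in NumPy (the kernel, data, and parameter values below are stand-ins for illustration, not those of the article):</p>

```python
import numpy as np

def kernel(a, b, sigma_process=1.0, ell_process=0.5):
    # Squared-exponential covariance between two sets of scalar locations
    return sigma_process**2 * np.exp(
        -(a[:, None] - b[None, :])**2 / (2 * ell_process**2))

def predict(x, y, x_new, noise, noise_new):
    # Mean and covariance of y_new for one draw of the parameters,
    # following the two expressions above; noise holds the variances
    middle = kernel(x, x) + np.diag(noise)
    cross = kernel(x_new, x)
    mean = cross @ np.linalg.solve(middle, y)
    cov = (kernel(x_new, x_new)
           - cross @ np.linalg.solve(middle, cross.T)
           + np.diag(noise_new))
    return mean, cov

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x)  # stand-in observations
x_new = np.array([0.25, 0.75])
mean, cov = predict(x, y, x_new, np.full(20, 0.01), np.full(2, 0.01))
sample = np.random.default_rng(42).multivariate_normal(mean, cov)
```

<p>For each posterior draw, one would recompute the noise variances from the drawn parameters and repeat the sampling step, accumulating draws from the posterior predictive distribution.</p>

<p>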
After repeating the inference and prediction processes, the following is obtained:</p> <p><img src="/assets/images/2020-06-22-gaussian-process/posterior-predictive-homoscedastic-1.svg" alt="" /></p> <p>The inference is inadequate, as can be seen from the probability band: the variance is largely overestimated on the left-hand side and underestimated on the right-hand side. This justifies the choice of heteroscedastic regression made earlier.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, it has been illustrated how a functional relationship can be modeled using a Gaussian process as a prior. Particular attention has been dedicated to adequately capturing error terms in the presence of heteroscedasticity. In addition, a practical implementation has been discussed, and the experimental results have demonstrated the appropriateness of this approach.</p> <p>For the curious reader, the source code of this <a href="https://github.com/IvanUkhov/blog/blob/master/_posts/2020-06-22-gaussian-process.Rmd">notebook</a> along with a number of auxiliary <a href="https://github.com/IvanUkhov/blog/tree/master/_scripts/2020-06-22-gaussian-process">scripts</a>, such as the definition of the model in Stan, can be found on GitHub.</p> <h1 id="acknowledgments">Acknowledgments</h1> <p>I would like to thank <a href="https://www.mattiasvillani.com/">Mattias Villani</a> for the insightful and informative graduate course in statistics titled “<a href="https://github.com/mattiasvillani/AdvBayesLearnCourse">Advanced Bayesian learning</a>,” which was the inspiration behind writing this article.</p> <h1 id="references">References</h1> <ul> <li>Carl Rasmussen <em>et al.</em>, <a href="http://www.gaussianprocess.org/gpml"><em>Gaussian Processes for Machine Learning</em></a>, the MIT Press, 2006.</li> <li>David Ruppert <em>et al.</em>, <a href="http://www.stat.tamu.edu/~carroll/semiregbook"><em>Semiparametric Regression</em></a>, Cambridge University Press, 2003.</li> </ul>
<h1 id="footnotes">Footnotes</h1> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1" role="doc-endnote"> <p>“<a href="https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html#priors-for-marginal-standard-deviation">Priors for marginal standard deviation</a>,” Stan User’s Guide, 2020. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2" role="doc-endnote"> <p>“<a href="https://mc-stan.org/docs/2_19/stan-users-guide/fit-gp-section.html#priors-for-length-scale">Priors for length-scale</a>,” Stan User’s Guide, 2020. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>Ivan UkhovGaussian process regression is a nonparametric Bayesian technique for modeling relationships between variables of interest. The vast flexibility and rigor mathematical foundation of this approach make it the default choice in many problems involving small- to medium-sized data sets. In this article, we illustrate how Gaussian process regression can be utilized in practice. To make the case more compelling, we consider a setting where linear regression would be inadequate. The focus will be not on getting the job done as fast as possible but on learning the technique and understanding the choices being made.What is the easiest way to compare two data sets?2020-04-10T06:00:00+00:002020-04-10T06:00:00+00:00https://blog.ivanukhov.com/2020/04/10/comparison<p>One has probably come across this problem numerous times. There are two versions of a tabular data set with a lot of columns of different types, and one wants to quickly identify any differences between the two. 
For example, the pipeline providing data to a predictive model might have been updated, and the goal is to understand if there have been any side effects of this update for the training data.</p> <p>One solution is to start to iterate over the columns of the two tables, computing five-number summaries and plotting histograms or identifying distinct values and plotting bar charts, depending on the column’s type. However, this can quickly get out of hand and evolve into an endeavor for the rest of the day.</p> <p>An alternative is to leverage the amazing tools that already exist in the data community.</p> <h1 id="solution">Solution</h1> <p>The key takeaway is the following three lines of code, excluding the import:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tensorflow_data_validation</span> <span class="k">as</span> <span class="n">dv</span> <span class="n">statistics_1</span> <span class="o">=</span> <span class="n">dv</span><span class="p">.</span><span class="n">generate_statistics_from_dataframe</span><span class="p">(</span><span class="n">data_1</span><span class="p">)</span> <span class="n">statistics_2</span> <span class="o">=</span> <span class="n">dv</span><span class="p">.</span><span class="n">generate_statistics_from_dataframe</span><span class="p">(</span><span class="n">data_2</span><span class="p">)</span> <span class="n">dv</span><span class="p">.</span><span class="n">visualize_statistics</span><span class="p">(</span><span class="n">lhs_statistics</span><span class="o">=</span><span class="n">statistics_1</span><span class="p">,</span> <span class="n">rhs_statistics</span><span class="o">=</span><span class="n">statistics_2</span><span class="p">)</span> </code></pre></div></div> <p>This is all it takes to get a versatile dashboard embedded right into a cell of a Jupyter notebook. 
The visualization itself is based on <a href="https://pair-code.github.io/facets">Facets</a>, and it is conveniently provided by <a href="https://www.tensorflow.org/tfx/data_validation/get_started">TensorFlow Data Validation</a> (which does not have much to do with TensorFlow and can be used stand-alone).</p> <p>It is pointless to try to describe in words what the dashboard can do; instead, here is a demonstration taken from <a href="https://pair-code.github.io/facets">Facets</a> where the tool is applied to the <a href="http://archive.ics.uci.edu/ml/datasets/Census+Income">UCI Census Income</a> data set:</p> <div id="facets-overview-container"></div> <p>Go ahead and give all the different controls a try!</p> <p>In this case, it is helpful to toggle the “percentages” checkbox, since the data sets are of different sizes. Then it becomes apparent that the two partitions are fairly balanced. The only problem is that <code class="language-plaintext highlighter-rouge">Target</code>, which represents income, happened to be encoded incorrectly in the partition for testing.</p> <p>Lastly, an example in a Jupyter notebook can be found on <a href="https://github.com/chain-rule/example-comparison/blob/master/census.ipynb">GitHub</a>.</p> <h1 id="conclusion">Conclusion</h1> <p>It can be difficult to navigate and particularly challenging to compare wide data sets. A lot of effort can be put into this exercise. However, the landscape of open-source tools has a lot to offer too. Facets is one such example. The library and its straightforward availability via TensorFlow Data Validation are arguably less known. This short note can hopefully rectify this to some extent.</p>Ivan UkhovOne has probably come across this problem numerous times. There are two versions of a tabular data set with a lot of columns of different types, and one wants to quickly identify any differences between the two.
For example, the pipeline providing data to a predictive model might have been updated, and the goal is to understand if there have been any side effects of this update for the training data.Bayesian inference of the net promoter score via multilevel regression with poststratification2020-02-03T07:00:00+00:002020-02-03T07:00:00+00:00https://blog.ivanukhov.com/2020/02/03/net-promoter<p>Customer surveys are naturally prone to biases. One prominent example is participation bias, which arises when individuals decide not to respond to the survey, and this pattern is not random. For instance, new customers might reply less eagerly than those who are senior. This renders the obtained responses unrepresentative of the target population. In this article, we tackle participation bias for the case of the net promoter survey by means of multilevel regression and poststratification.</p> <p>More specifically, the discussion here is a sequel to “<a href="/2019/08/19/net-promoter.html">A Bayesian approach to the inference of the net promoter score</a>,” where we built a hierarchical model for inferring the net promoter score for an arbitrary segmentation of a customer base. The reader is encouraged to skim over that article to recall the mechanics of the score and the structure of the model that was constructed. In that article, there was an assumption made that the sample was representative of the population, which, as mentioned earlier, is often not the case. In what follows, we mitigate this problem using a technique called poststratification. The technique works by matching proportions observed in the sample with those observed in the population with respect to several dimensions, such as age, country, and gender. However, in order to be able to poststratify, the model has to have access to all these dimensions at once, which the model built earlier is not suited for. 
To enable this, we switch gears to multilevel multinomial regression.</p> <h1 id="problem">Problem</h1> <p>Suppose the survey is to measure the net promoter score for a population that consists of $$N$$ customers. The score is to be reported with respect to individual values of $$M$$ grouping variables where variable $$i$$ has $$m_i$$ possible values, for $$i = 1, \dots, M$$. For instance, it might be important to know the score for different age groups, in which case the variable would be the customer’s age with values such as 18–25, 26–35, and so on. This implies that, in total, $$\sum_i m_i$$ scores have to be estimated.</p> <p>Depending on the size of the business, one might or might not try to reach out to all customers, except for those who have opted out of communications. Regardless of the decision, the resulting sample size, which is denoted by $$n$$, is likely to be substantially smaller than $$N$$, as the response rate is typically low. Therefore, there is uncertainty about the opinion of those who abstained or were not targeted.</p> <p>More importantly, a random sample is desired; however, certain subpopulations of customers might end up being significantly overrepresented due to participation bias, driving the score astray. Let us quantify this concern. We begin by taking the Cartesian product of the aforementioned $$M$$ variables. This results in $$K = \prod_i m_i$$ distinct combinations of the variables’ values, which are referred to as cells in what follows. For each cell, the numbers of detractors, neutrals, and promoters observed in the sample are computed and denoted by $$d_i$$, $$u_i$$, and $$p_i$$, respectively. The number of respondents in cell $$i$$ is then</p> $n_i = d_i + u_i + p_i \tag{1}$ <p>for $$i = 1, \dots, K$$.
For convenience, all counts are arranged in the following matrix:</p> $y = \left( \begin{matrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_K \end{matrix} \right) = \left( \begin{matrix} d_1 &amp; u_1 &amp; p_1 \\ \vdots &amp; \vdots &amp; \vdots \\ d_i &amp; u_i &amp; p_i \\ \vdots &amp; \vdots &amp; \vdots \\ d_K &amp; u_K &amp; p_K \end{matrix} \right). \tag{2}$ <p>Given $$y$$, the observed net promoter score for value $$j$$ of variable $$i$$ can be evaluated as follows:</p> $s^i_j = 100 \times \frac{\sum_{k \in I^i_j}(p_k - d_k)}{\sum_{k \in I^i_j} n_k} \tag{3}$ <p>where $$I^i_j$$ is an index set traversing cells with variable $$i$$ set to value $$j$$, which has the effect of marginalizing out other variables conditioned on the chosen value of variable $$i$$, that is, on value $$j$$.</p> <p>We can now compare $$n_i$$, computed according to Equation (1), with its counterpart in the population (the total number of customers who belong to cell $$i$$), which is denoted by $$N_i$$, taking into consideration the sample size $$n$$ and the population size $$N$$. Problems occur when the ratios within one or more of the following tuples largely disagree:</p> $\left(\frac{n_i}{n}, \frac{N_i}{N}\right) \tag{4}$ <p>for $$i = 1, \dots, K$$. When this happens, the scores given by Equation (3) or any analyses oblivious of this disagreement cannot be trusted, since they misrepresent the population. (It should be noted, however, that equality within each tuple does not guarantee the absence of participation bias, since there might be other, potentially unobserved, dimensions along which there are deviations.)</p> <p>The survey has been conducted, and there are deviations. What do we do with all these responses that have come in? 
Should we discard them and run a new survey, hoping that, this time, it would be different?</p> <h1 id="solution">Solution</h1> <p>The fact that the sample covers only a fraction of the population is, of course, no news, and the solution is standard: one has to infer the net promoter score for the population given the sample and domain knowledge. This is what was done in the <a href="/2019/08/19/net-promoter.html">previous article</a> for one grouping variable. However, due to participation bias, additional measures are needed as follows.</p> <p>Taking inspiration from political science, we proceed in two steps.</p> <ol> <li> <p>Using an adequate model, $$K = \prod_i m_i$$ net promoter scores are inferred—one for each cell, that is, for each combination of the values of the grouping variables.</p> </li> <li> <p>The $$\prod_i m_i$$ “cell-scores” are combined to produce $$\sum_i m_i$$ “value-scores”—one for each value of each variable. This is done in such a way that the contribution of each cell to the score is equal to the relative size of that cell in the population given by Equation (4).</p> </li> </ol> <p>The two steps are discussed in the following two subsections.</p> <h2 id="modeling">Modeling</h2> <p>Step 1 can, in principle, be undertaken by any model of choice. A prominent candidate is multilevel multinomial regression, which is what we shall explore. <em>Multilevel</em> refers to having a hierarchical structure where parameters on a higher level give birth to parameters on a lower level, which, in particular, enables information exchange through a common ancestor. <em>Multinomial</em> refers to the distribution used for modeling the response variable. The family of multinomial distributions is appropriate, since we work with counts of events falling into one of several categories: detractors, neutrals, and promoters; see Equation (2). 
The response for each cell is then as follows:</p> $y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i)$ <p>where $$n_i$$ is given by Equation (1), and</p> $\theta_i = \left\langle\theta^d_i, \theta^u_i, \theta^p_i\right\rangle$ <p>is a simplex (sums up to one) of probabilities of the three categories.</p> <p>Multinomial regression belongs to the class of generalized linear models. This means that the inference takes place in a linear domain, and that $$\theta_i$$ is obtained by applying a deterministic transformation to the corresponding linear model or models; the inverse of this transformation is known as the link function. In the case of multinomial regression, the aforementioned transformation is the softmax function, which is a generalization of the logistic function allowing more than two categories:</p> $\theta_i = \text{Softmax}\left(\mu_i\right)$ <p>where</p> $\mu_i = \left(0, \mu^u_i, \mu^p_i\right)$ <p>is the average log-odds of the three categories with respect to a reference category, which, by convention, is taken to be the first one, that is, detractors. The first entry is zero, since $$\ln(1) = 0$$. Therefore, there are only two linear models: one is for neutrals ($$\mu^u_i$$), and one is for promoters ($$\mu^p_i$$).</p> <p>Now, there are many alternatives when it comes to the two linear parts. In this article, we use the following architecture. Both the model for neutrals and the one for promoters have the same structure, and for brevity, only the former is described. For the log-odds of neutrals, the model is</p> $\mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}$ <p>where</p> $\delta^{uj} = \left(\delta^{uj}_1, \dots, \delta^{uj}_{m_j}\right)$ <p>is a vector of deviations from intercept $$b^u$$ specific to grouping variable $$j$$ (one entry for each value of the variable), and $$I_j[i]$$ yields the index of the value that cell $$i$$ has, for $$i = 1, \dots, K$$ and $$j = 1, \dots, M$$.</p> <p>Let us now turn to the multilevel aspect. 
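</p>

<p>Numerically, the softmax link can be sketched as follows; the nonzero log-odds below are invented for illustration, while the zero entry for the reference category (detractors) is fixed by the model:</p>

```python
import math

def softmax(x):
    # Subtracting the maximum improves numerical stability without changing the result.
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

# Log-odds relative to detractors: (0, mu_u, mu_p), with invented values.
mu = (0.0, 0.5, 1.0)
theta = softmax(mu)  # a simplex: nonnegative entries summing to one
```

<p>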
For each grouping variable, the corresponding values, represented by the elements of $$\delta^{uj}$$, are allowed to be different but assumed to have something in common and thus originate from a common distribution. To this end, they are assigned distributions with a shared parameter as follows:</p> $\delta^{uj}_i | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right)$ <p>for $$i = 1, \dots, m_j$$. The mean is zero, since $$\delta^{uj}_i$$ represents a deviation.</p> <p>Lastly, we have to decide on prior distributions of the intercept, $$b^u$$, and the standard deviations, $$\sigma^{uj}$$ for $$j = 1, \dots, M$$. The intercept is given the following prior:</p> $b^u \sim \text{Student’s t}(5, 0, 1).$ <p>The mean is zero in order to center at even odds. Regarding the standard deviations, they are given the following prior:</p> $\sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1).$ <p>In order to understand the implications of these prior choices, let us take a look at the prior distribution assuming two grouping variables:</p> <p><img src="/assets/images/2020-02-03-net-promoter/prior-distribution-1.svg" alt="" /></p> <p>The left and right dashed lines demarcate tail regions that, for practical purposes, can be thought of as “never” and “always,” respectively. For instance, log-odds of five or higher are so extreme that detractors are rendered nearly non-existent when compared to neutrals. These regions are arguably unrealistic. The prior does not exclude these possibilities; however, it does not favor them either. 
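</p>

<p>The figure above comes from the full model; as a rough stand-in, the following Python sketch draws the implied log-odds of one cell under the priors stated so far, assuming two grouping variables, and checks how much mass falls between the dashed lines at ±5 (only the distributions and the unit scale are taken from the text; everything else is a simplification):</p>

```python
import random

random.seed(42)

def student_t(df):
    # A Student's t draw via a standard normal over a scaled chi-square.
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / (chi2 / df) ** 0.5

def prior_log_odds(M=2):
    b = student_t(5)                                # intercept ~ Student's t(5, 0, 1)
    sigmas = [abs(student_t(5)) for _ in range(M)]  # ~ Half-Student's t(5, 0, 1)
    deltas = [random.gauss(0, s) for s in sigmas]   # deviations ~ Gaussian(0, sigma)
    return b + sum(deltas)

draws = [prior_log_odds() for _ in range(10000)]
inside = sum(-5 < d < 5 for d in draws) / len(draws)
```

<p>Most of the draws land between the dashed lines, while the tails remain reachable, which is the behavior described above.</p>

<p>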
The vast majority of the probability mass is still in the middle around zero.</p> <p>The overall model is then as follows:</p> \begin{align} &amp; y_i | \theta_i \sim \text{Multinomial}(n_i, \theta_i), \text{ for } i = 1, \dots, K; \\ &amp; \theta_i = \text{Softmax}\left(\mu_i\right), \text{ for } i = 1, \dots, K; \\ &amp; \mu_i = (0, \mu^u_i, \mu^p_i), \text{ for } i = 1, \dots, K; \\ &amp; \mu^u_i = b^u + \sum_{j = 1}^M \delta^{uj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ &amp; \mu^p_i = b^p + \sum_{j = 1}^M \delta^{pj}_{I_j[i]}, \text{ for } i = 1, \dots, K; \\ &amp; b^u \sim \text{Student’s t}(5, 0, 1); \\ &amp; b^p \sim \text{Student’s t}(5, 0, 1); \\ &amp; \delta^{uj}_k | \sigma^{uj} \sim \text{Gaussian}\left(0, \sigma^{uj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5a} \\ &amp; \delta^{pj}_k | \sigma^{pj} \sim \text{Gaussian}\left(0, \sigma^{pj}\right), \text{ for } j = 1, \dots, M \text{ and } k = 1, \dots, m_j; \tag{5b} \\ &amp; \sigma^{uj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M; \text{ and} \\ &amp; \sigma^{pj} \sim \text{Half-Student’s t}(5, 0, 1), \text{ for } j = 1, \dots, M. \end{align} <p>The model has $$2 \times (1 + \sum_i m_i + M)$$ parameters in total. The structure that can be seen in Equations (5a) and (5b) is what makes the model multilevel. This is an important feature, since it allows for information sharing between the individual values of the grouping variables. In particular, this has a regularizing effect on the estimates, which is also known as shrinkage resulting from partial pooling.</p> <p>Having defined the model, the posterior distribution can now be obtained by means of Markov chain Monte Carlo sampling. 
This procedure is standard and can be performed using, for instance, Stan or a higher-level package, such as <a href="https://github.com/paul-buerkner/brms"><code class="language-plaintext highlighter-rouge">brms</code></a>, which is what is exemplified in the Implementation section. The result is a collection of draws of the parameters from the posterior distribution. For each draw of the parameters, a draw of the net promoter score can be computed using the following formula:</p> $s_i = 100 \times (\theta^p_i - \theta^d_i) \tag{6}$ <p>for $$i = 1, \dots, K$$. This means that we have obtained a (joint) posterior distribution of the net promoter score over the $$K$$ cells. It is now time to combine the scores for the cells on the level of the values of the $$M$$ grouping variables, which results in $$\sum_i m_i$$ scores in total.</p> <h2 id="poststratification">Poststratification</h2> <p>Step 2 is poststratification, whose purpose is to correct for potential deviations of the sample from the population; recall the discussion around Equation (4). The foundation laid in the previous subsection makes the work here straightforward. The idea is as follows. Each draw from the posterior distribution consists of $$K$$ values for the net promoter score, one for each cell. All one has to do in order to correct for a mismatch in proportions is to take a weighted average of these scores where the weights are the counts observed in the population:</p> $s^i_j = \frac{\sum_{k \in I^i_j} N_k \, s_k}{\sum_{k \in I^i_j} N_k}$ <p>where $$I^i_j$$ is as in Equation (3), for $$i = 1, \dots, M$$ and $$j = 1, \dots, m_i$$. The above gives a poststratified draw from the posterior distribution of the net promoter score for variable $$i$$ and value $$j$$. 
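</p>

<p>As a minimal Python sketch of this step, suppose we have two posterior draws of the category probabilities for two cells (the numbers are invented; in practice, they come from the fitted model), together with the population cell sizes:</p>

```python
# Posterior draws of (theta_d, theta_u, theta_p) per cell: two draws, two cells.
draws = [
    [(0.25, 0.25, 0.50), (0.50, 0.25, 0.25)],
    [(0.125, 0.375, 0.50), (0.25, 0.50, 0.25)],
]
N = [400, 600]  # population cell sizes

def poststratify(draw, N):
    # Equation (6) per cell, then a population-weighted average over the cells.
    scores = [100 * (p - d) for (d, u, p) in draw]
    return sum(Nk * s for Nk, s in zip(N, scores)) / sum(N)

posterior_scores = [poststratify(draw, N) for draw in draws]
```

<p>Each entry is one poststratified draw of the population-level score, from which point estimates and credible intervals follow.</p>

<p>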
In practice, depending on the tool used, one might perform the poststratification procedure differently, such as predicting counts of detractors, neutrals, and promoters in the cells given their in-population sizes and then aggregating those counts and following the definition of the net promoter score.</p> <h1 id="implementation">Implementation</h1> <p>In what follows, we consider a contrived example with the sole purpose of illustrating how the presented workflow can be implemented in practice. To this end, we generate some data with two grouping variables, age and seniority, and then perform inference using <a href="https://github.com/paul-buerkner/brms"><code class="language-plaintext highlighter-rouge">brms</code></a>, which leverages Stan under the hood. For a convenient manipulation of posterior draws, <a href="https://github.com/mjskay/tidybayes"><code class="language-plaintext highlighter-rouge">tidybayes</code></a> is used as well.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">brms</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">tidybayes</span><span class="p">)</span><span class="w"> </span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="n">options</span><span class="p">(</span><span class="n">mc.cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">parallel</span><span class="o">::</span><span class="n">detectCores</span><span class="p">())</span><span class="w"> </span><span class="c1"># Load data</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span 
class="o">&lt;-</span><span class="w"> </span><span class="n">load_data</span><span class="p">()</span><span class="w"> </span><span class="c1"># =&gt; list(</span><span class="w"> </span><span class="c1"># =&gt; population = tibble(age, seniority, cell_size),</span><span class="w"> </span><span class="c1"># =&gt; sample = tibble(age, seniority, cell_size,</span><span class="w"> </span><span class="c1"># =&gt; cell_counts = (detractors, neutrals, promoters))</span><span class="w"> </span><span class="c1"># =&gt; )</span><span class="w"> </span><span class="c1"># Modeling</span><span class="w"> </span><span class="n">priors</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w"> </span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Intercept'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'muneutral'</span><span class="p">),</span><span class="w"> </span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Intercept'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mupromoter'</span><span class="p">),</span><span class="w"> </span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> 
</span><span class="o">=</span><span class="w"> </span><span class="s1">'sd'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'muneutral'</span><span class="p">),</span><span class="w"> </span><span class="n">prior</span><span class="p">(</span><span class="s1">'student_t(5, 0, 1)'</span><span class="p">,</span><span class="w"> </span><span class="n">class</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'sd'</span><span class="p">,</span><span class="w"> </span><span class="n">dpar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'mupromoter'</span><span class="p">)</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="n">formula</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">brmsformula</span><span class="p">(</span><span class="w"> </span><span class="n">cell_counts</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">trials</span><span class="p">(</span><span class="n">cell_size</span><span class="p">)</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">age</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">seniority</span><span class="p">))</span><span class="w"> </span><span class="n">model</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">brm</span><span class="p">(</span><span class="n">formula</span><span class="p">,</span><span 
class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="n">multinomial</span><span class="p">(),</span><span class="w"> </span><span class="n">priors</span><span class="p">,</span><span class="w"> </span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.99</span><span class="p">),</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="c1"># Poststratification</span><span class="w"> </span><span class="n">prediction</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">data</span><span class="o">$</span><span class="n">population</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">add_predicted_draws</span><span class="p">(</span><span class="n">model</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">spread</span><span class="p">(</span><span class="n">.category</span><span class="p">,</span><span class="w"> </span><span class="n">.prediction</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">.draw</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarize</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span 
class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">promoter</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">detractor</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">cell_size</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mean_hdi</span><span class="p">()</span><span class="w"> </span></code></pre></div></div> <p>The final aggregation is given for age; it is similar for seniority. It can be seen in the above listing that modern tools allow for rather complex ideas to be expressed and explored in a very laconic way.</p> <p>The curious reader is encouraged to run the above code. The appendix contains a function for generating synthetic data. It should be noted, however, that <code class="language-plaintext highlighter-rouge">brms</code> and <code class="language-plaintext highlighter-rouge">tidybayes</code> should be of versions greater than 2.11.1 and 2.0.1, respectively, which, at the time of writing, are available for installation only on GitHub. The appendix contains instructions for updating the packages.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, we have discussed a multilevel multinomial model for inferring the net promoter score with respect to several grouping variables in accordance with the business needs. It has been argued that poststratification is an essential stage of the inference process, since it mitigates the deleterious consequences of participation bias on the subsequent decision-making.</p> <p>There are still some aspects that could be improved. 
For instance, there is a natural ordering to the three categories of customers, detractors, neutrals, and promoters; however, it is currently ignored. Furthermore, there is some information thrown away when customer-level scores, which range from zero to ten, are aggregated on the category level. Lastly, the net promoter survey often happens in periodic waves, which calls for a single model capturing and learning from changes over time.</p> <h1 id="acknowledgments">Acknowledgments</h1> <p>I would like to thank <a href="http://www.stat.columbia.edu/~gelman/">Andrew Gelman</a> for the guidance on multilevel modeling and <a href="https://paul-buerkner.github.io/">Paul-Christian Bürkner</a> for the help with understanding the <code class="language-plaintext highlighter-rouge">brms</code> package.</p> <h1 id="references">References</h1> <ul> <li>Andrew Gelman et al., “<a href="http://www.stat.columbia.edu/~gelman/research/unpublished/MRT(1).pdf">Using multilevel regression and poststratification to estimate dynamic public opinion</a>,” 2018.</li> <li>Andrew Gelman and Jennifer Hill, <em><a href="https://doi.org/10.1017/CBO9780511790942">Data Analysis Using Regression and Multilevel/Hierarchical Models</a></em>, Cambridge University Press, 2006.</li> <li>Andrew Gelman and Thomas Little, “<a href="http://www.stat.columbia.edu/~gelman/research/published/poststrat3.pdf">Poststratification into many categories using hierarchical logistic regression</a>,” Survey Methodology, 1997.</li> <li>Paul-Christian Bürkner, “<a href="http://dx.doi.org/10.18637/jss.v080.i01">brms: An R package for Bayesian multilevel models using Stan</a>,” Journal of Statistical Software, 2017.</li> </ul> <h1 id="appendix">Appendix</h1> <p>The following listing defines a function that makes the illustrative example given in the Implementation section self-sufficient. By default, the population contains one million customers, and the sample contains one percent. 
There are two grouping variables: age with six values and seniority with seven values.</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">load_data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000000</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">softmax</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="c1"># Age</span><span class="w"> </span><span class="n">age_values</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'18–25'</span><span class="p">,</span><span class="w"> </span><span class="s1">'26–35'</span><span class="p">,</span><span class="w"> </span><span class="s1">'36–45'</span><span class="p">,</span><span class="w"> </span><span class="s1">'46–55'</span><span class="p">,</span><span class="w"> </span><span class="s1">'56–65'</span><span class="p">,</span><span class="w"> </span><span class="s1">'66+'</span><span class="p">)</span><span class="w"> 
</span><span class="n">age_probabilities</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="c1"># Seniority</span><span class="w"> </span><span class="n">seniority_values</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'6M'</span><span class="p">,</span><span class="w"> </span><span class="s1">'1Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'2Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'3Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'4Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'5Y'</span><span class="p">,</span><span class="w"> </span><span class="s1">'6Y+'</span><span class="p">)</span><span class="w"> </span><span class="n">seniority_probabilities</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> 
</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="c1"># Score</span><span class="w"> </span><span class="n">score_values</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="n">score_probabilities</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">softmax</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">))</span><span class="w"> </span><span class="c1"># Generate a population</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">age</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">age_values</span><span class="p">,</span><span class="w"> </span><span 
class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">age_probabilities</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w"> </span><span class="n">seniority</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">seniority_values</span><span class="p">,</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seniority_probabilities</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="c1"># Take a sample from the population</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">sample_n</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">score_values</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span 
class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">score_probabilities</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">case_when</span><span class="p">(</span><span class="n">score</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="m">7</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'detractor'</span><span class="p">,</span><span class="w"> </span><span class="n">score</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="m">8</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">,</span><span class="w"> </span><span class="kc">TRUE</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">))</span><span class="w"> </span><span class="c1"># Summarize the population</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">population</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">seniority</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">count</span><span class="p">(</span><span class="n">name</span><span class="w"> 
</span><span class="o">=</span><span class="w"> </span><span class="s1">'cell_size'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Summarize the sample</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">age</span><span class="p">,</span><span class="w"> </span><span class="n">seniority</span><span class="p">)</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">summarize</span><span class="p">(</span><span class="n">detractors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'detractor'</span><span class="p">),</span><span class="w"> </span><span class="n">neutrals</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">),</span><span class="w"> </span><span class="n">promoters</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">category</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">))</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">cell_size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span 
class="n">detractors</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">neutrals</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">promoters</span><span class="p">)</span><span class="w"> </span><span class="c1"># Bind counts of neutrals, detractors, and promoters (needed for brms)</span><span class="w"> </span><span class="n">sample</span><span class="o">$</span><span class="n">cell_counts</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">with</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">detractors</span><span class="p">,</span><span class="w"> </span><span class="n">neutrals</span><span class="p">,</span><span class="w"> </span><span class="n">promoters</span><span class="p">))</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">sample</span><span class="o">$</span><span class="n">cell_counts</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'detractor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'neutral'</span><span class="p">,</span><span class="w"> </span><span class="s1">'promoter'</span><span class="p">)</span><span class="w"> </span><span class="c1"># Remove unused columns</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">%&gt;%</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">detractors</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span 
class="n">neutrals</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">promoters</span><span class="p">)</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">population</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">population</span><span class="p">,</span><span class="w"> </span><span class="n">sample</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">sample</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>Lastly, the following snippet shows how to update <code class="language-plaintext highlighter-rouge">brms</code> and <code class="language-plaintext highlighter-rouge">tidybayes</code> from GitHub:</p> <div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">packageVersion</span><span class="p">(</span><span class="s1">'brms'</span><span class="p">)</span><span class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">'2.11.2'</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">remotes</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'paul-buerkner/brms'</span><span class="p">,</span><span class="w"> </span><span class="n">upgrade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'never'</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">packageVersion</span><span class="p">(</span><span class="s1">'tidybayes'</span><span class="p">)</span><span 
class="w"> </span><span class="o">&lt;</span><span class="w"> </span><span class="s1">'2.0.1.9000'</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">remotes</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s1">'mjskay/tidybayes'</span><span class="p">,</span><span class="w"> </span><span class="n">upgrade</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'never'</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div>Ivan UkhovCustomer surveys are naturally prone to biases. One prominent example is participation bias, which arises when individuals decide not to respond to the survey, and this pattern is not random. For instance, new customers might reply less eagerly than those who are senior. This renders the obtained responses unrepresentative of the target population. In this article, we tackle participation bias for the case of the net promoter survey by means of multilevel regression and poststratification.Ingestion of sequential data from BigQuery into TensorFlow2019-11-08T07:00:00+00:002019-11-08T07:00:00+00:00https://blog.ivanukhov.com/2019/11/08/sequential-data<p>How hard can it be to ingest sequential data into a <a href="https://www.tensorflow.org">TensorFlow</a> model? As always, the answer is, “It depends.” Where are the sequences in question stored? Can they fit in main memory? Are they of the same length? In what follows, we shall build a flexible and scalable workflow for feeding sequential observations into a TensorFlow graph starting from <a href="https://cloud.google.com/bigquery/">BigQuery</a> as the data warehouse.</p> <p>To make the discussion tangible, consider the following problem. 
Suppose the goal is to predict the peak temperature at an arbitrary weather station present in the <a href="https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn">Global Historical Climatology Network</a> for each day between June 1 and August 31. More concretely, given observations from June 1 up to an arbitrary day before August 31, the objective is to complete the sequence until August 31. For instance, if we find ourselves in Stockholm on June 12, we ask for the maximum temperatures from June 12 to August 31 given the temperature values between June 1 and June 11 at a weather station in Stockholm.</p> <p>To set the expectations right, in this article, we are not going to build a predictive model but to cater for its development by making the data from the aforementioned database readily available in a TensorFlow graph. The final chain of states and operations is as follows:</p> <ol> <li> <p>Historical temperature measurements from the Global Historical Climatology Network are stored in a <a href="https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d">public data set</a> in BigQuery. Each row corresponds to a weather station and a date. There are missing observations for reasons such as measurements failing quality checks.</p> </li> <li> <p>Relevant measurements are grouped in BigQuery by the weather station and year.
Therefore, each row corresponds to a weather station and a year, implying that all information about a particular example (a specific weather station in a specific year) is gathered in one place.</p> </li> <li> <p>The sequences are read, analyzed, and transformed by <a href="https://cloud.google.com/dataflow/">Cloud Dataflow</a>.</p> <ul> <li> <p>The data are split into a training, a validation, and a testing set of examples.</p> </li> <li> <p>The training set is used to compute statistics needed for transforming the measurements to a form suitable for the subsequent modeling. Standardization is used as an example.</p> </li> <li> <p>The training and validation sets are transformed using the statistics computed with respect to the training set in order to avoid performing these computations during the training-with-validation phase. The corresponding transform is available for the testing phase.</p> </li> </ul> </li> <li> <p>The processed training and validation examples and the raw testing examples are written by Dataflow to <a href="https://cloud.google.com/storage/">Cloud Storage</a> in the <a href="https://www.tensorflow.org/tutorials/load_data/tfrecord">TFRecord</a> format, which is native to TensorFlow.</p> </li> <li> <p>The files containing TFRecords are read by the <a href="https://www.tensorflow.org/guide/data"><code class="language-plaintext highlighter-rouge">tf.data</code></a> API of TensorFlow and eventually transformed into a data set of appropriately padded batches of examples.</p> </li> </ol> <p>The above workflow is not as simple as reading data from a Pandas DataFrame comfortably resting in main memory; however, it is much more scalable. This pipeline can handle arbitrary amounts of data. Moreover, it operates on complete examples, not on individual measurements.</p> <p>In the rest of the article, the aforementioned steps will be described in more detail.
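</p> <p>To make the final step concrete, below is a minimal, framework-free sketch of what padded batching amounts to; in the actual workflow, this is handled by the <code class="language-plaintext highlighter-rouge">tf.data</code> API via its <code class="language-plaintext highlighter-rouge">padded_batch</code> transformation, and the function names and the padding value here are illustrative assumptions.</p>

```python
# Illustrative sketch of padded batching; in practice, tf.data performs this
# on TFRecord-backed data sets. Names and the padding value are assumptions.


def pad(sequences, value=0.0):
    """Pad variable-length sequences to the length of the longest one."""
    width = max(len(sequence) for sequence in sequences)
    return [sequence + [value] * (width - len(sequence)) for sequence in sequences]


def batch(sequences, size):
    """Group sequences into batches of at most `size`, padding each batch."""
    return [pad(sequences[i:(i + size)]) for i in range(0, len(sequences), size)]
```

<p>Note that each batch is padded only to the length of its own longest sequence, which is also the default behavior of <code class="language-plaintext highlighter-rouge">padded_batch</code>; this avoids padding the entire data set to the global maximum length.</p> <p>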
The corresponding source code can be found in the following repository on GitHub:</p> <ul> <li><a href="https://github.com/chain-rule/example-weather-forecast">example-weather-forecast</a>.</li> </ul> <h1 id="data">Data</h1> <p>It all starts with data. The data come from the Global Historical Climatology Network, which is <a href="https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d">available in BigQuery</a> for public use. Steps 1 and 2 in the list above are covered by the <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/data.sql">following query</a>:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">WITH</span> <span class="c1">-- Select relevant measurements</span> <span class="n">data_1</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="nb">date</span><span class="p">,</span> <span class="c1">-- Find the date of the previous observation</span> <span class="n">LAG</span><span class="p">(</span><span class="nb">date</span><span class="p">)</span> <span class="n">OVER</span> <span class="p">(</span><span class="n">station_year</span><span class="p">)</span> <span class="k">AS</span> <span class="n">date_last</span><span class="p">,</span> <span class="n">latitude</span><span class="p">,</span> <span class="n">longitude</span><span class="p">,</span> <span class="c1">-- Convert to degrees Celsius</span> <span class="n">value</span> <span class="o">/</span> <span class="mi">10</span> <span class="k">AS</span> <span class="n">temperature</span> <span class="k">FROM</span> <span class="nv">bigquery-public-data.ghcn_d.ghcnd_201*</span> <span class="k">INNER</span> <span class="k">JOIN</span> <span class="nv">bigquery-public-data.ghcn_d.ghcnd_stations</span> <span class="k">USING</span> <span class="p">(</span><span class="n">id</span><span 
class="p">)</span> <span class="k">WHERE</span> <span class="c1">-- Take years from 2010 to 2019</span> <span class="k">CAST</span><span class="p">(</span><span class="n">_TABLE_SUFFIX</span> <span class="k">AS</span> <span class="n">INT64</span><span class="p">)</span> <span class="k">BETWEEN</span> <span class="mi">0</span> <span class="k">AND</span> <span class="mi">9</span> <span class="c1">-- Take months from June to August</span> <span class="k">AND</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="k">MONTH</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="k">BETWEEN</span> <span class="mi">6</span> <span class="k">AND</span> <span class="mi">8</span> <span class="c1">-- Take the maximum temperature</span> <span class="k">AND</span> <span class="n">element</span> <span class="o">=</span> <span class="s1">'TMAX'</span> <span class="c1">-- Take observations passed spatio-temporal quality-control checks</span> <span class="k">AND</span> <span class="n">qflag</span> <span class="k">IS</span> <span class="k">NULL</span> <span class="n">WINDOW</span> <span class="n">station_year</span> <span class="k">AS</span> <span class="p">(</span> <span class="n">PARTITION</span> <span class="k">BY</span> <span class="n">id</span><span class="p">,</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span> <span class="p">)</span> <span class="p">),</span> <span class="c1">-- Group into examples (a specific station and a specific year)</span> <span class="n">data_2</span> <span class="k">AS</span> <span class="p">(</span> <span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="k">MIN</span><span class="p">(</span><span class="nb">date</span><span class="p">)</span> <span 
class="k">AS</span> <span class="nb">date</span><span class="p">,</span> <span class="n">latitude</span><span class="p">,</span> <span class="n">longitude</span><span class="p">,</span> <span class="c1">-- Compute gaps between observations</span> <span class="n">ARRAY_AGG</span><span class="p">(</span> <span class="n">DATE_DIFF</span><span class="p">(</span><span class="nb">date</span><span class="p">,</span> <span class="n">IFNULL</span><span class="p">(</span><span class="n">date_last</span><span class="p">,</span> <span class="nb">date</span><span class="p">),</span> <span class="k">DAY</span><span class="p">)</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span> <span class="p">)</span> <span class="k">AS</span> <span class="n">duration</span><span class="p">,</span> <span class="n">ARRAY_AGG</span><span class="p">(</span><span class="n">temperature</span> <span class="k">ORDER</span> <span class="k">BY</span> <span class="nb">date</span><span class="p">)</span> <span class="k">AS</span> <span class="n">temperature</span> <span class="k">FROM</span> <span class="n">data_1</span> <span class="k">GROUP</span> <span class="k">BY</span> <span class="n">id</span><span class="p">,</span> <span class="n">latitude</span><span class="p">,</span> <span class="n">longitude</span><span class="p">,</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="p">)</span> <span class="c1">-- Partition into training, validation, and testing sets</span> <span class="k">SELECT</span> <span class="o">*</span><span class="p">,</span> <span class="k">CASE</span> <span class="k">WHEN</span> <span class="k">EXTRACT</span><span class="p">(</span><span class="nb">YEAR</span> <span class="k">FROM</span> <span class="nb">date</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2019</span> <span 
class="k">THEN</span> <span class="s1">'analysis,training'</span> <span class="k">WHEN</span> <span class="k">MOD</span><span class="p">(</span><span class="k">ABS</span><span class="p">(</span><span class="n">FARM_FINGERPRINT</span><span class="p">(</span><span class="n">id</span><span class="p">)),</span> <span class="mi">100</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">50</span> <span class="k">THEN</span> <span class="s1">'validation'</span> <span class="k">ELSE</span> <span class="s1">'testing'</span> <span class="k">END</span> <span class="k">AS</span> <span class="k">mode</span> <span class="k">FROM</span> <span class="n">data_2</span> </code></pre></div></div> <p>The query fetches peak temperatures, denoted by <code class="language-plaintext highlighter-rouge">temperature</code>, for all available weather stations between June and August in 2010–2019. The crucial part is the usage of <code class="language-plaintext highlighter-rouge">ARRAY_AGG</code>, which is what makes it possible to gather all relevant data about a specific station and a specific year in the same row. The number of days since the previous measurement, which is denoted by <code class="language-plaintext highlighter-rouge">duration</code>, is also computed. Ideally, <code class="language-plaintext highlighter-rouge">duration</code> should always be one (except for the first day, which has no predecessor); however, this is not the case, which makes the resulting time series vary in length.</p> <p>In addition, in order to illustrate the generality of this approach, two contextual (that is, non-sequential) explanatory variables are added: <code class="language-plaintext highlighter-rouge">latitude</code> and <code class="language-plaintext highlighter-rouge">longitude</code>. 
They are scalars stored side by side with <code class="language-plaintext highlighter-rouge">duration</code> and <code class="language-plaintext highlighter-rouge">temperature</code>, which are arrays.</p> <p>Another important detail is the final <code class="language-plaintext highlighter-rouge">SELECT</code> statement, which defines a column called <code class="language-plaintext highlighter-rouge">mode</code>. This column indicates what each example is used for, allowing one to use the same query for different purposes and to avoid inconsistencies due to multiple queries. In this case, observations prior to 2019 are reserved for training, while the rest is split pseudo-randomly and reproducibly into two approximately equal parts: one is for validation, and one is for testing. This last operation is explained in detail in “<a href="https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning">Repeatable sampling of data sets in BigQuery for machine learning</a>” by Lak Lakshmanan.</p> <h1 id="preprocessing">Preprocessing</h1> <p>In this section, we cover Steps 3 and 4 in the list given at the beginning. This job is done by <a href="https://www.tensorflow.org/tfx">TensorFlow Extended</a>, which is a library for building machine-learning pipelines. Internally, it relies on <a href="https://beam.apache.org/">Apache Beam</a> as a language for defining pipelines.
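</p> <p>Before turning to the pipeline, the partitioning performed in the query's final <code class="language-plaintext highlighter-rouge">SELECT</code> can be illustrated in plain Python. The sketch below mimics the logic with a stand-in hash: BigQuery's <code class="language-plaintext highlighter-rouge">FARM_FINGERPRINT</code> is replaced by MD5, so the bucket assignments will differ from BigQuery's, but the mechanism is the same: hash the station identifier, reduce it modulo 100, and compare against a threshold.</p>

```python
# Illustrative analogue of the query's reproducible split. MD5 is a stand-in
# for BigQuery's FARM_FINGERPRINT, so the buckets differ from BigQuery's, but
# the assignment is equally deterministic and reproducible.
import hashlib


def assign_mode(station_id, year):
    """Assign an example to a mode, mirroring the query's CASE expression."""
    if year < 2019:
        return 'analysis,training'
    bucket = int(hashlib.md5(station_id.encode()).hexdigest(), 16) % 100
    return 'validation' if bucket < 50 else 'testing'
```

<p>Since the assignment depends only on the station identifier and the year, rerunning the query, or the function above, always reproduces the same partitioning.</p> <p>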
Once an adequate pipeline is created, it can be executed using an executor, and the executor that we shall use is <a href="https://cloud.google.com/dataflow/">Cloud Dataflow</a>.</p> <p>Before we proceed to the pipeline itself, the construction process is orchestrated by a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/preprocessing.json">configuration file</a>, which will be referred to as <code class="language-plaintext highlighter-rouge">config</code> in the pipeline code (to be discussed shortly):</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"path"</span><span class="p">:</span><span class="w"> </span><span class="s2">"configs/training/data.sql"</span><span class="p">,</span><span class="w"> </span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"latitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"longitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span 
class="s2">"float32"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"duration"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">],</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">],</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"z"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"modes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span 
class="s2">"training"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="p">,</span><span class="w"> </span><span class="nl">"shuffle"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"validation"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"analysis"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"testing"</span><span class="p">,</span><span class="w"> </span><span class="nl">"transform"</span><span class="p">:</span><span class="w"> </span><span class="s2">"identity"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>It is worth noting that this way of working with a separate configuration file is not something standard that comes with TensorFlow or Beam. It is a convenience that we build for ourselves in order to keep the main logic reusable and extendable without touching the code.</p> <p>The <code class="language-plaintext highlighter-rouge">data</code> block describes where the data can be found and provides a schema for the columns that are used. 
(Recall the SQL query given earlier and note that <code class="language-plaintext highlighter-rouge">id</code>, <code class="language-plaintext highlighter-rouge">date</code>, and <code class="language-plaintext highlighter-rouge">mode</code> are omitted.) For instance, <code class="language-plaintext highlighter-rouge">latitude</code> is a scalar of type <code class="language-plaintext highlighter-rouge">FLOAT32</code>, while <code class="language-plaintext highlighter-rouge">temperature</code> is a sequence of type <code class="language-plaintext highlighter-rouge">FLOAT32</code>. Both are standardized to have a zero mean and a unit standard deviation, which is indicated by <code class="language-plaintext highlighter-rouge">"transform": "z"</code> and is typically needed for training neural networks.</p> <p>The <code class="language-plaintext highlighter-rouge">modes</code> block defines four passes over the data, corresponding to four operating modes. In each mode, a specific subset of examples is considered, which is given by the <code class="language-plaintext highlighter-rouge">mode</code> column returned by the query. There are two types of modes: analysis and transform; recall Step 3. Whenever the <code class="language-plaintext highlighter-rouge">transform</code> key is present, it is a transform mode; otherwise, it is an analysis mode. In this example, there is one analysis mode and three transform modes.</p> <p>Below is an excerpt from a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/pipeline.py">Python class</a> responsible for building the pipeline:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config = ... # schema = ...
</span> <span class="c1"># Read the SQL code </span><span class="n">query</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">config</span><span class="p">[</span><span class="s">'data'</span><span class="p">][</span><span class="s">'path'</span><span class="p">]).</span><span class="n">read</span><span class="p">()</span> <span class="c1"># Create a BigQuery source </span><span class="n">source</span> <span class="o">=</span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">BigQuerySource</span><span class="p">(</span><span class="n">query</span><span class="o">=</span><span class="n">query</span><span class="p">,</span> <span class="n">use_standard_sql</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># Create metadata needed later </span><span class="n">spec</span> <span class="o">=</span> <span class="n">schema</span><span class="p">.</span><span class="n">to_feature_spec</span><span class="p">()</span> <span class="n">meta</span> <span class="o">=</span> <span class="n">dataset_metadata</span><span class="p">.</span><span class="n">DatasetMetadata</span><span class="p">(</span> <span class="n">schema</span><span class="o">=</span><span class="n">dataset_schema</span><span class="p">.</span><span class="n">from_feature_spec</span><span class="p">(</span><span class="n">spec</span><span class="p">))</span> <span class="c1"># Read data from BigQuery </span><span class="n">data</span> <span class="o">=</span> <span class="n">pipeline</span> \ <span class="o">|</span> <span class="s">'read'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">Read</span><span class="p">(</span><span class="n">source</span><span class="p">)</span> <span class="c1"># Loop over modes whose purpose is analysis </span><span 
class="n">transform_functions</span> <span class="o">=</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">mode</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'modes'</span><span class="p">]:</span> <span class="k">if</span> <span class="s">'transform'</span> <span class="ow">in</span> <span class="n">mode</span><span class="p">:</span> <span class="k">continue</span> <span class="n">name</span> <span class="o">=</span> <span class="n">mode</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="c1"># Select examples that belong to the current mode </span> <span class="n">data_</span> <span class="o">=</span> <span class="n">data</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-filter'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">Filter</span><span class="p">(</span><span class="n">partial</span><span class="p">(</span><span class="n">_filter</span><span class="p">,</span> <span class="n">mode</span><span class="p">))</span> <span class="c1"># Analyze the examples </span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">data_</span><span class="p">,</span> <span class="n">meta</span><span class="p">)</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-analyze'</span> <span class="o">&gt;&gt;</span> <span class="n">tt_beam</span><span class="p">.</span><span class="n">AnalyzeDataset</span><span class="p">(</span><span class="n">_analyze</span><span class="p">)</span> <span class="n">path</span> <span class="o">=</span> <span class="n">_locate</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span 
class="n">name</span><span class="p">,</span> <span class="s">'transform'</span><span class="p">)</span> <span class="c1"># Store the transform function </span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-write-transform'</span> <span class="o">&gt;&gt;</span> <span class="n">transform_fn_io</span><span class="p">.</span><span class="n">WriteTransformFn</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="c1"># Loop over modes whose purpose is transformation </span><span class="k">for</span> <span class="n">mode</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'modes'</span><span class="p">]:</span> <span class="k">if</span> <span class="ow">not</span> <span class="s">'transform'</span> <span class="ow">in</span> <span class="n">mode</span><span class="p">:</span> <span class="k">continue</span> <span class="n">name</span> <span class="o">=</span> <span class="n">mode</span><span class="p">[</span><span class="s">'name'</span><span class="p">]</span> <span class="c1"># Select examples that belong to the current mode </span> <span class="n">data_</span> <span class="o">=</span> <span class="n">data</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-filter'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">Filter</span><span class="p">(</span><span class="n">partial</span><span class="p">(</span><span class="n">_filter</span><span class="p">,</span> <span class="n">mode</span><span class="p">))</span> <span class="c1"># Shuffle examples if needed </span> <span class="k">if</span> <span class="n">mode</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span 
class="s">'shuffle'</span><span class="p">,</span> <span class="bp">False</span><span class="p">):</span> <span class="n">data_</span> <span class="o">=</span> <span class="n">data_</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-shuffle'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">transforms</span><span class="p">.</span><span class="n">Reshuffle</span><span class="p">()</span> <span class="c1"># Transform the examples using an appropriate transform function </span> <span class="k">if</span> <span class="n">mode</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]</span> <span class="o">==</span> <span class="s">'identity'</span><span class="p">:</span> <span class="n">coder</span> <span class="o">=</span> <span class="n">tft</span><span class="p">.</span><span class="n">coders</span><span class="p">.</span><span class="n">ExampleProtoCoder</span><span class="p">(</span><span class="n">meta</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="n">data_</span><span class="p">,</span> <span class="n">meta_</span> <span class="o">=</span> <span class="p">((</span><span class="n">data_</span><span class="p">,</span> <span class="n">meta</span><span class="p">),</span> <span class="n">transform_functions</span><span class="p">[</span><span class="n">mode</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]])</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-transform'</span> <span class="o">&gt;&gt;</span> <span class="n">tt_beam</span><span class="p">.</span><span class="n">TransformDataset</span><span class="p">()</span> <span class="n">coder</span> <span class="o">=</span> <span class="n">tft</span><span class="p">.</span><span 
class="n">coders</span><span class="p">.</span><span class="n">ExampleProtoCoder</span><span class="p">(</span><span class="n">meta_</span><span class="p">.</span><span class="n">schema</span><span class="p">)</span> <span class="n">path</span> <span class="o">=</span> <span class="n">_locate</span><span class="p">(</span><span class="n">config</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="s">'examples'</span><span class="p">,</span> <span class="s">'part'</span><span class="p">)</span> <span class="c1"># Store the transformed examples as TFRecords </span> <span class="n">data_</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-encode'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">Map</span><span class="p">(</span><span class="n">coder</span><span class="p">.</span><span class="n">encode</span><span class="p">)</span> \ <span class="o">|</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'-write-examples'</span> <span class="o">&gt;&gt;</span> <span class="n">beam</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">tfrecordio</span><span class="p">.</span><span class="n">WriteToTFRecord</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> </code></pre></div></div> <p>At the very beginning, a BigQuery source is created, which is then branched out according to the operating modes found in the configuration file. Specifically, the first for-loop corresponds to the analysis modes, and the second for-loop goes over the transform modes. 
The former ends with <code class="language-plaintext highlighter-rouge">WriteTransformFn</code>, which saves the resulting transform, and the latter ends with <code class="language-plaintext highlighter-rouge">WriteToTFRecord</code>, which writes the resulting examples as TFRecords.</p> <p>The distinction between the contextual and sequential features is given by the <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/schema.py"><code class="language-plaintext highlighter-rouge">schema</code></a> object created based on the <code class="language-plaintext highlighter-rouge">schema</code> block in the configuration file. The call <code class="language-plaintext highlighter-rouge">schema.to_feature_spec()</code> shown above alternates between <a href="https://www.tensorflow.org/api_docs/python/tf/io/FixedLenFeature"><code class="language-plaintext highlighter-rouge">tf.io.FixedLenFeature</code></a> and <a href="https://www.tensorflow.org/api_docs/python/tf/io/VarLenFeature"><code class="language-plaintext highlighter-rouge">tf.io.VarLenFeature</code></a> and produces a feature specification that is understood by TensorFlow and TensorFlow Extended.</p> <p>The <a href="https://github.com/chain-rule/example-weather-forecast">repository</a> provides a wrapper for executing the pipeline on Cloud Dataflow. The following figure shows the flow of the data with respect to the four operating modes:</p> <p><img src="/assets/images/2019-11-08-sequential-data/dataflow.svg" alt="" /></p> <p>The outcome is a hierarchy of files on Cloud Storage:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>. └── data/ └── training/ └── 2019-11-01-12-00-00/ ├── analysis/ │ └── transform/ │ ├── transform_fn/... │ └── transform_metadata/... ├── testing/ │ └── examples/ │ ├── part-000000-of-00004 │ ├── ... │ └── part-000003-of-00004 ├── training/ │ └── examples/ │ ├── part-000000-of-00006 │ ├── ... 
│ └── part-000005-of-00006 └── validation/ └── examples/ ├── part-000000-of-00004 ├── ... └── part-000003-of-00004 </code></pre></div></div> <p>Here, <code class="language-plaintext highlighter-rouge">data/training</code> contains all data needed for the training phase, which collectively refers to training entwined with validation and followed by testing. Moving forward, this hierarchy is meant to accommodate the application phase as well by populating a <code class="language-plaintext highlighter-rouge">data/application</code> entry next to the <code class="language-plaintext highlighter-rouge">data/training</code> one. It can also accommodate trained models and the results of applying these models by having a <code class="language-plaintext highlighter-rouge">model</code> entry with a structure similar to the one of the <code class="language-plaintext highlighter-rouge">data</code> entry.</p> <p>In the listing above, the files whose name starts with <code class="language-plaintext highlighter-rouge">part-</code> are the ones containing TFRecords. It can be seen that, for each mode, the corresponding examples have been split into multiple files, which is done for more efficient access during the usage stage discussed in the next section.</p> <h1 id="execution">Execution</h1> <p>At this point, the data have made it all the way to the execution phase, referring to training, validation, and testing; however, the data are yet to be injected into a TensorFlow graph, which is the topic of this section. 
As before, relevant parameters are kept in a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/configs/training/execution.json">separate configuration file</a>:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"data"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"schema"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"latitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"longitude"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"float32"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"duration"</span><span class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">]</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"temperature"</span><span 
class="p">,</span><span class="w"> </span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"float32"</span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="nl">"modes"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"training"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"transformed"</span><span class="p">,</span><span class="w"> </span><span class="nl">"shuffle_macro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"shuffle_micro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">512</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span 
class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"repeat"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"validation"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"transformed"</span><span class="p">,</span><span class="w"> </span><span class="nl">"shuffle_macro"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span 
class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"repeat"</span><span class="p">:</span><span class="w"> </span><span class="p">{}</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"testing"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"spec"</span><span class="p">:</span><span class="w"> </span><span class="s2">"original"</span><span class="p">,</span><span class="w"> </span><span class="nl">"interleave"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"cycle_length"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span 
class="nl">"map"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"num_parallel_calls"</span><span class="p">:</span><span class="w"> </span><span class="mi">-1</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">128</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"prefetch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"buffer_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>It can be seen that the file contains only one block: <code class="language-plaintext highlighter-rouge">data</code>. This is sufficient for the purposes of this article; however, it is also meant to cover the construction of the model in mind, including its hyperparameters, and the execution process, including the optimizer and evaluation metrics.</p> <p>The <code class="language-plaintext highlighter-rouge">data</code> block is similar to the one we saw before. In this case, <code class="language-plaintext highlighter-rouge">modes</code> describes various calls to the <a href="https://www.tensorflow.org/guide/data"><code class="language-plaintext highlighter-rouge">tf.data</code></a> API related to shuffling, batching, and so on. Those who are familiar with the API will probably immediately recognize them. 
It is now instructive to go straight to the Python code.</p> <p>Below is an excerpt from a <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/data.py">Python class</a> responsible for building the pipeline on the TensorFlow side:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config = ... </span> <span class="c1"># List all files matching a given pattern </span><span class="n">pattern</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">path</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="s">'examples'</span><span class="p">,</span> <span class="s">'part-*'</span><span class="p">]</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">Dataset</span><span class="p">.</span><span class="n">list_files</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="o">*</span><span class="n">pattern</span><span class="p">))</span> <span class="c1"># Shuffle the files if needed </span><span class="k">if</span> <span class="s">'shuffle_macro'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'shuffle_macro'</span><span class="p">])</span> <span class="c1"># Convert the files into datasets of examples stored as TFRecords and # amalgamate these datasets into one dataset of examples </span><span 
class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span> \ <span class="p">.</span><span class="n">interleave</span><span class="p">(</span><span class="n">tf</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">TFRecordDataset</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'interleave'</span><span class="p">])</span> <span class="c1"># Shuffle the examples if needed </span><span class="k">if</span> <span class="s">'shuffle_micro'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'shuffle_micro'</span><span class="p">])</span> <span class="c1"># Preprocess the examples with respect to a given spec, pad the examples # and form batches of different sizes, and postprocess the batches </span><span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span> \ <span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">_preprocess</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'map'</span><span class="p">])</span> \ <span class="p">.</span><span class="n">padded_batch</span><span class="p">(</span><span class="n">padded_shapes</span><span class="o">=</span><span class="n">_shape</span><span class="p">(),</span> <span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'batch'</span><span class="p">])</span> \ <span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">_postprocess</span><span class="p">,</span> <span class="o">**</span><span 
class="n">config</span><span class="p">[</span><span class="s">'map'</span><span class="p">])</span> <span class="c1"># Prefetch the batches if needed </span><span class="k">if</span> <span class="s">'prefetch'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">prefetch</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'prefetch'</span><span class="p">])</span> <span class="c1"># Repeat the data once the source is exhausted if needed </span><span class="k">if</span> <span class="s">'repeat'</span> <span class="ow">in</span> <span class="n">config</span><span class="p">:</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'repeat'</span><span class="p">])</span> </code></pre></div></div> <p>The pipeline is self-explanatory. It is simply a chain of operations stacked on top of each other. 
It is, however, worth taking a closer look at the preprocessing and postprocessing mappings, which can be seen before and after the padding step, respectively:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_preprocess</span><span class="p">(</span><span class="n">proto</span><span class="p">):</span> <span class="n">spec</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">transforms</span><span class="p">[</span><span class="n">config</span><span class="p">[</span><span class="s">'transform'</span><span class="p">]]</span> \ <span class="p">.</span><span class="n">transformed_feature_spec</span><span class="p">()</span> <span class="n">example</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">parse_single_example</span><span class="p">(</span><span class="n">proto</span><span class="p">,</span> <span class="n">spec</span><span class="p">)</span> <span class="k">return</span> <span class="p">(</span> <span class="p">{</span><span class="n">name</span><span class="p">:</span> <span class="n">example</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">contextual_names</span><span class="p">},</span> <span class="p">{</span> <span class="c1"># Convert the sequential columns from sparse to dense </span> <span class="n">name</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">schema</span><span class="p">[</span><span class="n">name</span><span class="p">].</span><span class="n">to_dense</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="n">name</span><span 
class="p">])</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequential_names</span> <span class="p">},</span> <span class="p">)</span> <span class="k">def</span> <span class="nf">_postprocess</span><span class="p">(</span><span class="n">contextual</span><span class="p">,</span> <span class="n">sequential</span><span class="p">):</span> <span class="n">sequential</span> <span class="o">=</span> <span class="p">{</span> <span class="c1"># Convert the sequential columns from dense to sparse </span> <span class="n">name</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">schema</span><span class="p">[</span><span class="n">name</span><span class="p">].</span><span class="n">to_sparse</span><span class="p">(</span><span class="n">sequential</span><span class="p">[</span><span class="n">name</span><span class="p">])</span> <span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">sequential_names</span> <span class="p">}</span> <span class="k">return</span> <span class="p">{</span><span class="o">**</span><span class="n">contextual</span><span class="p">,</span> <span class="o">**</span><span class="n">sequential</span><span class="p">}</span> </code></pre></div></div> <p>Currently, <code class="language-plaintext highlighter-rouge">tf.data</code> does not support padding sparse tensors, which is the representation used for sequential features in TensorFlow. In the running example about forecasting weather, such features are <code class="language-plaintext highlighter-rouge">duration</code> and <code class="language-plaintext highlighter-rouge">temperature</code>. This is the reason such features are converted to their dense counterparts in <code class="language-plaintext highlighter-rouge">_preprocess</code>. 
However, the final representation still has to be sparse. Therefore, the sequential features are converted back to the sparse format in <code class="language-plaintext highlighter-rouge">_postprocess</code>. Hopefully, this back-and-forth conversion will be rendered obsolete in future versions.</p> <p>Having executed the above steps, we have an instance of <a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset"><code class="language-plaintext highlighter-rouge">tf.data.Dataset</code></a>, which is the ultimate goal, as it is the standard way of ingesting data into a TensorFlow graph. At this point, one might create a Keras model leveraging <a href="https://www.tensorflow.org/api_docs/python/tf/keras/layers/DenseFeatures"><code class="language-plaintext highlighter-rouge">tf.keras.layers.DenseFeatures</code></a> and <a href="https://www.tensorflow.org/api_docs/python/tf/keras/experimental/SequenceFeatures"><code class="language-plaintext highlighter-rouge">tf.keras.experimental.SequenceFeatures</code></a> for constructing the input layer and then pass the data set to the <code class="language-plaintext highlighter-rouge">fit</code> function of the model. A <a href="https://github.com/chain-rule/example-weather-forecast/blob/master/forecast/model.py">skeleton</a> for this part can be found in the repository.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, we have discussed a scalable approach to the ingestion of sequential observations from BigQuery into a TensorFlow graph. The key tools that have been used to this end are TensorFlow Extended in combination with Cloud Dataflow and the <code class="language-plaintext highlighter-rouge">tf.data</code> API of TensorFlow.</p> <p>In addition, the provided code has been written to be general and easily customizable.
This has been achieved by separating the configuration from the implementation.</p> <h1 id="references">References</h1> <ul> <li>Lak Lakshmanan, “<a href="https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning">Repeatable sampling of data sets in BigQuery for machine learning</a>,” 2016.</li> </ul>Ivan Ukhov. How hard can it be to ingest sequential data into a TensorFlow model? As always, the answer is, “It depends.” Where are the sequences in question stored? Can they fit in main memory? Are they of the same length? In what follows, we shall build a flexible and scalable workflow for feeding sequential observations into a TensorFlow graph starting from BigQuery as the data warehouse. Sample size determination using historical data and simulation. 2019-09-25T06:00:00+00:00. https://blog.ivanukhov.com/2019/09/25/bootstrap<p>In order to test a hypothesis, one has to design and execute an adequate experiment. Typically, it is neither feasible nor desirable to involve the whole population. Instead, a relatively small subset of the population is studied, and given the outcome for this small sample, relevant conclusions are drawn with respect to the population. An important question to answer is then, What is the minimal sample size needed for the experiment to succeed? In what follows, we answer this question using solely historical data and computer simulation, without invoking any classical statistical procedures.</p> <p>Although, as we shall see, the ideas are straightforward, direct calculations were impossible to perform before computers. To be able to answer this kind of question back then, statisticians developed mathematical theories in order to approximate the calculations for specific situations. Since nothing else was possible, these approximations and the various terms and conditions under which they operate made up a large part of traditional textbooks and courses in statistics.
However, the advent of today’s computing power has enabled one to estimate required sample sizes in a more direct and intuitive way, with the only prerequisites being an understanding of statistical inference, the availability of historical data describing the status quo, and the ability to write a few lines of code in a programming language.</p> <h1 id="problem">Problem</h1> <p>For concreteness, consider the following scenario. We run an online business and hypothesize that a specific change in promotion campaigns, such as making them personalized, will have a positive effect on a specific performance metric, such as the average deposit. In order to investigate if it is the case, we decide to perform a two-sample test. There are the following two competing hypotheses.</p> <ul> <li> <p>The null hypothesis postulates that the change has no effect on the metric.</p> </li> <li> <p>The alternative hypothesis postulates that the change has a positive effect on the metric.</p> </li> </ul> <p>There will be two groups: a control group and a treatment group. The former will be exposed to the current promotion policy, while the latter to the new one. There are also certain requirements imposed on the test. First, we have a level of statistical significance $$\alpha$$ and a level of practical significance $$\delta$$ in mind. The former puts a limit on the false-positive rate, and the latter indicates the smallest effect that we still care about; anything smaller is as good as zero for any practical purpose. In addition, we require the test to have a prescribed false-negative rate $$\beta$$, ensuring that the test has enough statistical power.</p> <p>For our purposes, the test is considered well designed if it is capable of detecting a difference as small as $$\delta$$ so that the false-positive and false-negative rates are controlled to levels $$\alpha$$ and $$\beta$$, respectively. 
Typically, parameters $$\alpha$$ and $$\delta$$ are held constant, and the desired false-negative rate $$\beta$$ is attained by varying the number of participants in each group, which we denote by $$n$$. Note that we do not want any of the parameters to be smaller than the prescribed values, as it would be wasteful.</p> <p>So what should the sample size be for the test to be well designed?</p> <h1 id="solution">Solution</h1> <p>Depending on the distribution of the data and on the chosen metric, one might or might not be able to find a suitable test among the standard ones, while ensuring that the test’s assumptions can safely be considered satisfied. More importantly, a textbook solution might not be the most intuitive one, which, in particular, might lead to misuse of the test. It is the understanding that matters.</p> <p>Here we take a more pragmatic and rather general approach that circumvents the above concerns. It requires only historical data and basic programming skills. Despite its simplicity, the method below goes straight to the core of what the famed statistical tests are doing behind all the math. The approach belongs to the class of so-called bootstrap techniques and is as follows.</p> <p>Suppose we have historical data on customers’ behavior under the current promotion policy, which is commonplace in practice. An important realization is that this data set represents what we expect to observe in the control group. It is also what is expected of the treatment group provided that the null hypothesis is true, that is, when the proposed change has no effect. This realization enables one to simulate what would happen if each group were limited to an arbitrary number of participants. Then, by varying this size parameter, it is possible to find the smallest value that makes the test well designed, that is, makes the test satisfy the requirements on $$\alpha$$, $$\beta$$, and $$\delta$$, as discussed in the previous section.</p> <p>This is all.
The rest is an elaboration of the above idea.</p> <p>The simulation entails the following. To begin with, note that what we are interested in testing is the difference between the performance metric applied to the treatment group and the same metric applied to the control group, which is referred to as the test statistic:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Test statistic = Metric(Treatment sample) - Metric(Control sample). </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">Treatment sample</code> and <code class="language-plaintext highlighter-rouge">Control sample</code> stand for sets of observations, and <code class="language-plaintext highlighter-rouge">Metric(Sample)</code> stands for computing the performance metric given such a sample. For instance, each observation could be the total deposit of a customer, and the metric could be the average value:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metric(Sample) = Sum of observations / Number of observations. </code></pre></div></div> <p>Note, however, that it is an example; the metric can be arbitrary, and this is a huge advantage of this approach to sample size determination based on data and simulation.</p> <p>Large positive values of the test statistic speak in favor of the treatment (that is, the new promotion policy in our example), while those that are close to zero suggest that the treatment is futile.</p> <p>A sample of $$n$$ observations corresponding to the status quo (that is, the current policy in our example) can be easily obtained by drawing $$n$$ data points with replacement from the historical data:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sample = Choose random with replacement(Data, N). 
</code></pre></div></div> <p>This expression is used for <code class="language-plaintext highlighter-rouge">Control sample</code> under both the null and alternative hypotheses. As alluded to earlier, this is also how <code class="language-plaintext highlighter-rouge">Treatment sample</code> is obtained under the null. Under the alternative hypothesis, one has to express the hypothesized outcome as a distribution reflecting the minimal detectable difference, $$\delta$$. The simplest reasonable solution is to sample the data again, apply the metric, and then adjust the result to reflect the alternative hypothesis:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Metric(Choose random with replacement(Data, N)) + Delta. </code></pre></div></div> <p>Here, again, one is free to change the logic under the alternative according to the situation at hand. For instance, instead of an additive effect, one could simulate a multiplicative one.</p> <p>The above is a way to simulate a single instance of the experiment under either the null or alternative hypothesis; the result is a single value for the test statistic. The next step is to estimate how the test statistic would vary if the experiment were repeated many times in the two scenarios.
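As an aside, the single-experiment simulation just described can be sketched in ordinary code. The following is a minimal Python version in which the historical data are made up (log-normal draws standing in for deposits), and the sample size and delta are purely illustrative:

```python
import random
import statistics

def metric(sample):
    # The performance metric; here the mean, as in the running example.
    return statistics.mean(sample)

def draw(data, n):
    # Resample n observations with replacement from the historical data.
    return random.choices(data, k=n)

def simulate_once(data, n, delta):
    # One simulated experiment under each hypothesis; under the
    # alternative, an additive effect of delta is injected.
    null = metric(draw(data, n)) - metric(draw(data, n))
    alternative = metric(draw(data, n)) + delta - metric(draw(data, n))
    return null, alternative

# Made-up historical data: positively skewed, as deposits often are.
random.seed(42)
data = [random.lognormvariate(0, 1) for _ in range(10_000)]
null, alternative = simulate_once(data, n=500, delta=0.1)
```

Repeating `simulate_once` many times yields realizations of the test statistic under the two scenarios.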
This simply means that the procedure should be repeated multiple times:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Repeat many times {
    Sample 1 = Choose random with replacement(Data, N)
    Sample 2 = Choose random with replacement(Data, N)
    Metric 1 = Metric(Sample 1)
    Metric 2 = Metric(Sample 2)
    Test statistic under null = Metric 1 - Metric 2

    Sample 3 = Choose random with replacement(Data, N)
    Sample 4 = Choose random with replacement(Data, N)
    Metric 3 = Metric(Sample 3) + Delta
    Metric 4 = Metric(Sample 4)
    Test statistic under alternative = Metric 3 - Metric 4
}
</code></pre></div></div> <p>This yields a collection of values for the test statistic under the null hypothesis and a collection of values for the test statistic under the alternative hypothesis. Each one contains realizations from the so-called sampling distribution in the corresponding scenario. The following figure gives an illustration:</p> <p><img src="/assets/images/2019-09-25-bootstrap/sampling-distribution-1.svg" alt="" /></p> <p>The blue shape is the sampling distribution under the null hypothesis, and the red one is the sampling distribution under the alternative hypothesis. We shall come back to this figure shortly.</p> <p>These two distributions of the test statistic are what we are after, as they allow one to compute the false-negative rate and eventually choose a sample size. First, given $$\alpha$$, the sampling distribution under the null (the blue one) is used in order to find a value beyond which the probability mass is equal to $$\alpha$$:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Critical value = Quantile([Test statistic under null], 1 - alpha). </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">Quantile</code> computes the quantile specified by the second argument given a set of observations.
This quantity is called the critical value of the test. In the figure above, it is denoted by a dashed line. When the test statistic falls to the right of the critical value, we reject the null hypothesis; otherwise, we fail to reject it. Second, the sampling distribution in the case of the alternative hypothesis being true (the red one) is used in order to compute the false-negative rate:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Attained beta = Mean([Test statistic under alternative &lt; Critical value]). </code></pre></div></div> <p>It corresponds to the probability mass of the sampling distribution under the alternative to the left of the critical value. In the figure, it is the red area to the left of the dashed line.</p> <p>The final step is to put the above procedure in an optimization loop that minimizes the distance between the target and attained $$\beta$$’s with respect to the sample size:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Optimize N until Attained beta is close to Target beta {
    Repeat many times {
        Test statistic under null = ...
        Test statistic under alternative = ...
    }
    Critical value = ...
    Attained beta = ...
}
</code></pre></div></div> <p>This concludes the calculation of the size that the control and treatment groups should have in order for the upcoming test in promotion campaigns to be well designed in terms of the level of statistical significance $$\alpha$$, the false-negative rate $$\beta$$, and the level of practical significance $$\delta$$.</p> <p>An example of how this technique could be implemented in practice can be found in the appendix.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, we have discussed an approach to sample size determination that is based on historical data and computer simulation rather than on mathematical formulae tailored for specific situations.
It is general and straightforward to implement. More importantly, the technique is intuitive, since it directly follows the narrative of null hypothesis significance testing. It does require prior knowledge of the key concepts in statistical inference. However, this knowledge is arguably essential for those who are involved in scientific experimentation. It constitutes the core of statistical literacy.</p> <h1 id="acknowledgments">Acknowledgments</h1> <p>This article was inspired by a blog post authored by <a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">Allen Downey</a> and a talk given by <a href="https://www.youtube.com/watch?v=5Dnw46eC-0o">John Rauser</a>. I also would like to thank <a href="http://users.stat.umn.edu/~rend0020/">Aaron Rendahl</a> for his feedback on the introduction to the method presented here and for his help with the implementation given in the appendix.</p> <h1 id="references">References</h1> <ul> <li>Allen Downey, “<a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">There is only one test!</a>,” 2011.</li> <li>John Rauser, “<a href="https://www.youtube.com/watch?v=5Dnw46eC-0o">Statistics without the agonizing pain</a>,” 2014.</li> <li>Joseph Lee Rodgers, “<a href="https://doi.org/10.1207/S15327906MBR3404_2">The bootstrap, the jackknife, and the randomization test: A sampling taxonomy</a>,” Multivariate Behavioral Research, 2010.</li> </ul> <h1 id="appendix">Appendix</h1> <p>The following listing shows an implementation of the bootstrap approach in R:</p> <figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w"> </span><span class="c1"># Artificial data for illustration</span><span class="w"> </span><span 
class="n">observation_count</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">20000</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">tibble</span><span class="p">(</span><span class="n">value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="n">observation_count</span><span class="p">))</span><span class="w"> </span><span class="c1"># Performance metric</span><span class="w"> </span><span class="n">metric</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="c1"># Statistical significance</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.05</span><span class="w"> </span><span class="c1"># False-negative rate</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.2</span><span class="w"> </span><span class="c1"># Practical significance</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">0.1</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">metric</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">value</span><span class="p">)</span><span class="w"> </span><span class="n">simulate</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">sample_size</span><span class="p">,</span><span class="w"> </span><span class="n">replication_count</span><span 
class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="c1"># Function for drawing a single sample of size sample_size</span><span class="w"> </span><span class="n">run_one</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">data</span><span class="o">$</span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">sample_size</span><span class="p">,</span><span class="w"> </span><span class="n">replace</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w"> </span><span class="c1"># Function for drawing replication_count samples of size sample_size</span><span class="w"> </span><span class="n">run_many</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">replication_count</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">metric</span><span class="p">(</span><span class="n">run_one</span><span class="p">())</span><span class="w"> </span><span class="p">})</span><span class="w"> </span><span class="c1"># Simulation under the null hypothesis</span><span class="w"> </span><span class="n">control_null</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w"> </span><span class="n">treatment_null</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w"> </span><span 
class="n">difference_null</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">treatment_null</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">control_null</span><span class="w"> </span><span class="c1"># Simulation under the alternative hypothesis</span><span class="w"> </span><span class="n">control_alternative</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w"> </span><span class="n">treatment_alternative</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">run_many</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="n">difference_alternative</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">treatment_alternative</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">control_alternative</span><span class="w"> </span><span class="c1"># Computation of the critical value</span><span class="w"> </span><span class="n">critical_value</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">quantile</span><span class="p">(</span><span class="n">difference_null</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">alpha</span><span class="p">)</span><span class="w"> </span><span class="c1"># Computation of the false-negative rate</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">difference_alternative</span><span class="w"> </span><span 
class="o">&lt;</span><span class="w"> </span><span class="n">critical_value</span><span class="p">)</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">difference_null</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">difference_null</span><span class="p">,</span><span class="w"> </span><span class="n">difference_alternative</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">difference_alternative</span><span class="p">,</span><span class="w"> </span><span class="n">critical_value</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">critical_value</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="c1"># Number of replications</span><span class="w"> </span><span class="n">replication_count</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="m">1000</span><span class="w"> </span><span class="c1"># Interval of possible values for the sample size</span><span class="w"> </span><span class="n">search_interval</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w"> </span><span class="c1"># Root finding to attain the desired value by varying the sample size</span><span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> 
</span><span class="n">beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">simulate</span><span class="p">(</span><span class="nf">as.integer</span><span class="p">(</span><span class="n">n</span><span class="p">),</span><span class="w"> </span><span class="n">replication_count</span><span class="p">)</span><span class="o">$</span><span class="n">beta</span><span class="w"> </span><span class="n">sample_size</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nf">as.integer</span><span class="p">(</span><span class="n">uniroot</span><span class="p">(</span><span class="n">target</span><span class="p">,</span><span class="w"> </span><span class="n">interval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">search_interval</span><span class="p">)</span><span class="o">$</span><span class="n">root</span><span class="p">)</span></code></pre></figure> <p>The illustrative figure shown in the solution section displays the sampling distribution of the test statistic under the null and alternative for the sample size found by this code snippet.</p>Ivan UkhovIn order to test a hypothesis, one has to design and execute an adequate experiment. Typically, it is neither feasible nor desirable to involve the whole population. Instead, a relatively small subset of the population is studied, and given the outcome for this small sample, relevant conclusions are drawn with respect to the population. An important question to answer is then, What is the minimal sample size needed for the experiment to succeed? 
In what follows, we answer this question using solely historical data and computer simulation, without invoking any classical statistical procedures.A Bayesian approach to the inference of the net promoter score2019-08-19T06:00:00+00:002019-08-19T06:00:00+00:00https://blog.ivanukhov.com/2019/08/19/net-promoter<p>The net promoter score is a widely adopted metric for gauging customers’ satisfaction with a product. The popularity of the score is arguably attributed to the simplicity of measurement and the intuitiveness of interpretation. Moreover, it is claimed to be correlated with revenue growth, which, ignoring causality, makes it even more appealing. In this article, we leverage Bayesian statistics in order to infer the net promoter score for an arbitrary segmentation of a customer base. The outcome of the inference is a distribution over all possible values of the score weighted by probabilities, which provides exhaustive information for the subsequent decision-making.</p> <p>A bare-bones net promoter survey is composed of only one question: “How likely are you to recommend us to a friend?” The answer is an integer ranging from 0 to 10 inclusively. If the grade is between 0 and 6 inclusively, the person in question is said to be a detractor. If it is 7 or 8, the person is said to be a neutral. Lastly, if it is 9 or 10, the person is deemed a promoter. The net promoter score itself is then the percentage of promoters minus the percentage of detractors. The minimum and maximum attainable values of the score are −100 and 100, respectively. In this case, the greater, the better.</p> <p>As it is usually the case with surveys, a small but representative subset of customers is reached out to, and the collected responses are then used to draw conclusions about the target population of customers. 
Our objective is to facilitate this last step by estimating the net promoter score given a set of responses and, necessarily, to quantify and put front and center the uncertainty in our estimates.</p> <p>Before we proceed, since a net promoter survey is an observational study, which is prone to such biases as participation and response biases, great care must be taken when analyzing the results. In this article, however, we focus on the inference of the net promoter score under the assumption that the given sample of responses is representative of the target population.</p> <h1 id="problem">Problem</h1> <p>In practice, one is interested in knowing the net promoter score for different subpopulations of customers, such as countries of operation and age groups, which is the scenario that we shall target. To this end, suppose that there are $$m$$ segments of interest, and each customer belongs to strictly one of them. The results of a net promoter survey can then be summarized using the following $$m \times 3$$ matrix:</p> $y = \left( \begin{matrix} d_1 &amp; n_1 &amp; p_1 \\ \vdots &amp; \vdots &amp; \vdots \\ d_i &amp; n_i &amp; p_i \\ \vdots &amp; \vdots &amp; \vdots \\ d_m &amp; n_m &amp; p_m \end{matrix} \right)$ <p>where $$d_i$$, $$n_i$$, and $$p_i$$ denote the number of detractors, neutrals, and promoters in segment $$i$$, respectively. For segment $$i$$, the <em>observed</em> net promoter score can be computed as follows:</p> $\hat{s}_i = 100 \times \frac{p_i - d_i}{d_i + n_i + p_i}.$ <p>However, this observed score is a single scalar value calculated using $$d_i + n_i + p_i$$ data points, which are only a subset of the corresponding subpopulation. It may or may not correspond well to the actual net promoter score of that subpopulation. We have no reason to trust it, since the above estimate alone does not tell us anything about the uncertainty associated with it.
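For reference, the observed score defined above is a one-line computation per segment; in the following Python sketch, the counts of detractors, neutrals, and promoters are made up:

```python
def observed_score(d, n, p):
    # Observed net promoter score of a segment:
    # 100 * (promoters - detractors) / (total responses).
    return 100 * (p - d) / (d + n + p)

# Hypothetical survey counts (detractors, neutrals, promoters) per segment.
y = [(30, 40, 50), (5, 10, 25), (60, 20, 20)]
scores = [observed_score(*row) for row in y]
```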
Uncertainty quantification is essential for sound decision-making, which is what we are after.</p> <p>Ideally, for each segment, given the observed data, we would like to have a distribution of all possible values of the score with probabilities attached. Such a probability distribution would be exhaustive information, from which any other statistic could be easily derived. Here we tackle the problem by means of Bayesian inference, which we discuss next.</p> <h1 id="solution">Solution</h1> <p>In order to perform Bayesian inference of the net promoter score, we need to decide on an adequate Bayesian model for the problem at hand. Recall first that we are interested in inferring scores for several segments. Even though there might be segment-specific variations in the product, such as special offers in certain countries, or in customers’ perception of the product, such as age-related preferences, it is conceptually the same product that the customers were asked to evaluate. It is then sensible to expect the scores in different segments to have something in common. With this in mind, we construct a hierarchical model with parameters shared by the segments.</p> <p>First, let</p> $\theta_i = (\theta_{id}, \theta_{in}, \theta_{ip}) \in \langle 0, 1 \rangle^3$ <p>be a triplet of parameters corresponding to the proportions of detractors, neutrals, and promoters in segment $$i$$, respectively, with the constraint that they have to sum up to one. The constraint makes the triplet a simplex, which is what is emphasized by the angle brackets on the right-hand side. These are the main parameters we are interested in inferring. If the true value of $$\theta_i$$ were known, the net promoter score would be computed as follows:</p> $s_i = 100 \times (\theta_{ip} - \theta_{id}).$ <p>Parameter $$\theta_i$$ can also be thought of as a vector of probabilities of observing one of the three types of customers in segment $$i$$, that is, detractors, neutrals, and promoters.
Then the natural model for the observed data is a multinomial distribution with $$d_i + n_i + p_i$$ trials and probabilities $$\theta_i$$:</p> $y_i | \theta_i \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i)$ <p>where $$y_i$$ refers to the $$i$$th row of matrix $$y$$ introduced earlier. The family of multinomial distributions is a generalization of the family of binomial distributions to more than two outcomes.</p> <p>The above gives a data distribution. In order to complete the modeling part, we need to decide on a prior probability distribution for $$\theta_i$$. Each $$\theta_i$$ is a simplex of probabilities. In such a case, a reasonable choice is a Dirichlet distribution:</p> $\theta_i | \phi \sim \text{Dirichlet}(\phi)$ <p>where $$\phi = (\phi_d, \phi_n, \phi_p)$$ is a vector of strictly positive parameters. This family of distributions is a generalization of the family of beta distributions to more than two categories. Note that $$\phi$$ is the same for all segments, which is what enables information sharing. In particular, it means that the less reliable estimates for segments with fewer observations will be shrunk toward the more reliable estimates for segments with more observations. In other words, with this architecture, segments with fewer observations are able to draw strength from those with more observations.</p> <p>How about $$\phi$$? This triplet is a characteristic of the product irrespective of the segment. Its individual components can be utilized in order to encode one’s prior knowledge about the net promoter score. Specifically, $$\phi_d$$, $$\phi_n$$, and $$\phi_p$$ could be set to imaginary observations of detractors, neutrals, and promoters, respectively, reflecting one’s beliefs prior to conducting the survey. The higher these imaginary counts are, the more certain one claims to be about the true score. 
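This interpretation of $$\phi$$ as imaginary counts is easy to verify numerically. The following Python sketch (with made-up values of $$\phi$$) draws from two Dirichlet priors that share the same expected proportions but differ in the total number of pseudo-observations and compares the spread of the implied score; a Dirichlet draw is constructed by normalizing independent gamma variates:

```python
import random
import statistics

def dirichlet(alpha, rng):
    # One Dirichlet draw via normalized independent Gamma(alpha_k, 1) variates.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def score_spread(phi, draws=2000, seed=42):
    # Standard deviation of the implied net promoter score
    # 100 * (theta_p - theta_d) under a Dirichlet(phi) prior.
    rng = random.Random(seed)
    scores = []
    for _ in range(draws):
        d, n, p = dirichlet(phi, rng)
        scores.append(100 * (p - d))
    return statistics.stdev(scores)

# The same proportions of imaginary detractors, neutrals, and promoters,
# but with 10 versus 1000 total pseudo-observations.
vague = score_spread([3, 3, 4])
confident = score_spread([300, 300, 400])
```

The spread of the score under the second prior is much smaller, matching the intuition that more imaginary observations mean more prior certainty.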
One could certainly set these hyperparameters to fixed values; however, a more comprehensive solution is to infer them from the data as well, giving the model more flexibility by making it hierarchical. In addition, an inspection of $$\phi$$ afterward can provide insights into the overall satisfaction with the product.</p> <p>We now need to specify a prior, or rather a hyperprior, for $$\phi$$. We proceed under the assumption that we have little knowledge about the true score. Even if there were surveys in the past, it is still a valid choice, especially when the product evolves rapidly, rendering prior surveys marginally relevant.</p> <p>Now, it is more convenient to think in terms of expected values and variances instead of imaginary counts, which is what $$\phi$$ represents. Let us find an alternative parameterization of the Dirichlet distribution. The expected value of this distribution is as follows:</p> $\mu = (\mu_d, \mu_n, \mu_p) = \frac{\phi}{\phi_d + \phi_n + \phi_p} \in \langle 0, 1 \rangle^3.$ <p>It can be seen that it is a simplex of proportions of detractors, neutrals, and promoters of the whole population, which is similar to $$\theta_i$$ describing segment $$i$$. Regarding the variance,</p> $\sigma^2 = \frac{1}{\phi_d + \phi_n + \phi_p}$ <p>is considered to capture it sufficiently well. Solving the system of the last two equations for $$\phi$$ yields the following result:</p> $\phi = \frac{\mu}{\sigma^2}.$ <p>The prior for $$\theta_i$$ can then be rewritten as follows:</p> $\theta_i | \mu, \sigma \sim \text{Dirichlet}\left(\frac{\mu}{\sigma^2}\right).$ <p>This new parameterization requires two hyperpriors: one is for $$\mu$$, and one is for $$\sigma$$. For $$\mu$$, a reasonable choice is a uniform distribution (over a simplex), and for $$\sigma$$, a half-Cauchy distribution:</p> \begin{align} &amp; \mu \sim \text{Uniform}(\langle 0, 1 \rangle^3) \text{ and} \\ &amp; \sigma \sim \text{Half-Cauchy}(0, 1). 
\end{align} <p>The two distributions are relatively weak, which is intended to let the data speak for themselves. At this point, all parameters have been defined. Of course, one could go further if the problem at hand had a deeper structure; however, in this case, it is arguably not justifiable.</p> <p>The final model is as follows:</p> \begin{align} y_i | \theta_i &amp; \sim \text{Multinomial}(d_i + n_i + p_i, \theta_i), \\ \theta_i | \mu, \sigma &amp; \sim \text{Dirichlet}(\mu / \sigma^2), \\ \mu &amp; \sim \text{Uniform}(\langle 0, 1 \rangle^3), \text{ and} \\ \sigma &amp; \sim \text{Half-Cauchy}(0, 1). \end{align} <p>The posterior distribution factorizes as follows:</p> $p(\theta_1, \dots, \theta_m, \mu, \sigma | y) \propto p(y | \theta_1, \dots, \theta_m) \, p(\theta_1 | \mu, \sigma) \cdots p(\theta_m | \mu, \sigma) \, p(\mu) \, p(\sigma),$ <p>which relies on the usual assumption of independence given the parameters. One could make a few simplifications by, for instance, leveraging the conjugacy of the Dirichlet distribution with respect to the multinomial distribution; however, it is not needed in practice, as we shall see shortly.</p> <p>The above posterior distribution is our ultimate goal. It is the one that gives us a complete picture of what the true net promoter score in each segment might be given the available evidence, that is, the responses from the survey. All that is left is to draw a large enough sample from this distribution and start to summarize and visualize the results.</p> <p>Unfortunately, as one might suspect, drawing samples from the posterior is not an easy task. It does not correspond to any standard distribution and hence does not have a readily available random number generator. Fortunately, the topic is sufficiently mature, and techniques have been developed for sampling complex distributions, such as the family of Markov chain Monte Carlo methods.
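To make the conjugacy remark concrete: if $$\phi$$ were fixed rather than inferred, the posterior of each $$\theta_i$$ would be available in closed form, namely $$\theta_i | y_i \sim \text{Dirichlet}(\phi + y_i)$$, and could be sampled directly without any machinery. The Python sketch below illustrates this special case with made-up counts; the hierarchical model above, where $$\phi$$ is itself random, does not admit this shortcut:

```python
import random

def dirichlet(alpha, rng):
    # One Dirichlet draw via normalized independent Gamma variates.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def posterior_scores(counts, phi, draws=4000, seed=42):
    # With phi fixed, conjugacy gives theta | y ~ Dirichlet(phi + y);
    # each draw of theta yields one draw of the score 100 * (p - d).
    rng = random.Random(seed)
    alpha = [a + c for a, c in zip(phi, counts)]
    scores = []
    for _ in range(draws):
        d, n, p = dirichlet(alpha, rng)
        scores.append(100 * (p - d))
    return scores

# Hypothetical segment with 30 detractors, 40 neutrals, and 50 promoters
# and a weak symmetric prior of one pseudo-observation per category.
scores = posterior_scores([30, 40, 50], phi=[1, 1, 1])
```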
Unfortunately, the most effective and efficient of these techniques are notoriously complex themselves, and it might be extremely difficult and tedious to implement and apply them correctly in practice. Fortunately, the need for versatile tools for modeling and inference with the focus on the problem at hand and not on implementation details has been recognized and addressed. Nontrivial scenarios can be tackled with a surprisingly small amount of effort nowadays, which we illustrate next.</p> <h1 id="implementation">Implementation</h1> <p>In this section, we implement the model using the probabilistic programming language <a href="https://mc-stan.org/">Stan</a>. Stan is straightforward to integrate into one’s workflow, as it has interfaces for many general-purpose programming languages, including Python and R. Here we only highlight the main points of the implementation and leave it to the curious reader to discover Stan on their own.</p> <p>The following listing is a complete implementation of the model:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">data</span> <span class="p">{</span> <span class="kt">int</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span> <span class="n">m</span><span class="p">;</span> <span class="c1">// The number of segments</span> <span class="kt">int</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span> <span class="n">n</span><span class="p">;</span> <span class="c1">// The number of categories, which is always three</span> <span class="kt">int</span> <span class="n">y</span><span class="p">[</span><span class="n">m</span><span class="p">,</span> <span class="n">n</span><span class="p">];</span> <span class="c1">// The observed counts of detractors, neutrals, and promoters</span> <span 
class="p">}</span> <span class="n">parameters</span> <span class="p">{</span> <span class="n">simplex</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">mu</span><span class="p">;</span> <span class="n">real</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span> <span class="n">sigma</span><span class="p">;</span> <span class="n">simplex</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">theta</span><span class="p">[</span><span class="n">m</span><span class="p">];</span> <span class="p">}</span> <span class="n">transformed</span> <span class="n">parameters</span> <span class="p">{</span> <span class="n">vector</span><span class="o">&lt;</span><span class="n">lower</span> <span class="o">=</span> <span class="mi">0</span><span class="o">&gt;</span><span class="p">[</span><span class="n">n</span><span class="p">]</span> <span class="n">phi</span><span class="p">;</span> <span class="n">phi</span> <span class="o">=</span> <span class="n">mu</span> <span class="o">/</span> <span class="n">sigma</span><span class="o">^</span><span class="mi">2</span><span class="p">;</span> <span class="p">}</span> <span class="n">model</span> <span class="p">{</span> <span class="n">mu</span> <span class="o">~</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="n">sigma</span> <span class="o">~</span> <span class="n">cauchy</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span> <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="n">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="p">)</span> <span class="p">{</span> <span 
class="n">theta</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">~</span> <span class="n">dirichlet</span><span class="p">(</span><span class="n">phi</span><span class="p">);</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">~</span> <span class="n">multinomial</span><span class="p">(</span><span class="n">theta</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>It can be seen that the code is very laconic and follows closely the development given in the previous section, including the notation. It is worth noting that, in the model block, we seemingly use unconstrained uniform and Cauchy distributions; however, the constraints are enforced by the definitions of the corresponding hyperparameters, <code class="language-plaintext highlighter-rouge">mu</code> and <code class="language-plaintext highlighter-rouge">sigma</code>.</p> <p>This is practically all that is needed; the rest will be taken care of by Stan, which is actually a lot of work, including an adequate initialization, an efficient execution, and necessary diagnostics and quality checks. Under the hood, the sampling of the posterior in Stan is based on the Hamiltonian Monte Carlo algorithm and the no-U-turn sampler, which are considered to be the state-of-the-art.</p> <p>The output of the sampling procedure is a set of draws from the posterior distribution, which, again, is exhaustive information about the net promoter score in the segments of interest. In particular, one can quantify the uncertainty in and the probability of any statement one makes about the score. For instance, if a concise summary is needed, one could compute the mean of the score and accompany it with a high-posterior-density credible interval, capturing the true value with the desired probability. 
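As a sketch of this summarization step, the snippet below fabricates posterior draws for a single segment (a stand-in for actual output from Stan; the Dirichlet parameters are purely illustrative) and computes the mean of the score together with a central 95% interval, a percentile interval being used here as a simple surrogate for a highest-posterior-density one:

```python
import random
import statistics

random.seed(0)

def dirichlet(alpha):
    # One draw from a Dirichlet distribution via normalized gamma variates.
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Fabricated posterior draws for one segment: each draw is a vector of
# (detractor, neutral, promoter) proportions; in a real workflow, these
# would be the draws of theta produced by Stan.
draws = [dirichlet([20.0, 30.0, 50.0]) for _ in range(4000)]

# The net promoter score is the share of promoters minus the share of
# detractors, conventionally reported on a scale from -100 to 100.
scores = sorted(100 * (d[2] - d[0]) for d in draws)

mean = statistics.fmean(scores)
lower = scores[int(0.025 * len(scores))]
upper = scores[int(0.975 * len(scores))]
```

Any probability statement about the score, such as the probability of it exceeding a given threshold, can be estimated from the same draws by simple counting.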
However, if applicable, the full distribution should be integrated into the decision-making process.</p> <h1 id="conclusion">Conclusion</h1> <p>In this article, we have constructed a hierarchical Bayesian model for inferring the net promoter score for an arbitrary segmentation of a customer base. The model features shared parameters, which enable information exchange between the segments. This allows for a more robust estimation of the score, especially in the case of segments with few observations. The final output of the inference is a probability distribution over all possible values of the score in each segment, which lays a solid foundation for the subsequent decision-making. We have also seen how seamlessly the model can be implemented in practice using modern tools for statistical inference, such as Stan.</p> <p>Lastly, note that the presented model is only one alternative; there are many others. How would <em>you</em> model the net promoter score? What changes would you make? Make sure to leave a comment.</p> <h1 id="references">References</h1> <ul> <li>Andrew Gelman et al., <em><a href="http://www.stat.columbia.edu/~gelman/book/">Bayesian Data Analysis</a></em>, Chapman and Hall/CRC, 2014.</li> <li>Andrew Gelman, “<a href="https://statmodeling.stat.columbia.edu/2009/10/21/some_practical/">Some practical questions about prior distributions</a>,” 2009.</li> </ul>Ivan UkhovThe net promoter score is a widely adopted metric for gauging customers’ satisfaction with a product. The popularity of the score is arguably attributed to the simplicity of measurement and the intuitiveness of interpretation. Moreover, it is claimed to be correlated with revenue growth, which, ignoring causality, makes it even more appealing. In this article, we leverage Bayesian statistics in order to infer the net promoter score for an arbitrary segmentation of a customer base.
The outcome of the inference is a distribution over all possible values of the score weighted by probabilities, which provides exhaustive information for the subsequent decision-making.Interactive notebooks in tightly sealed disposable containers2019-07-24T06:00:00+00:002019-07-24T06:00:00+00:00https://blog.ivanukhov.com/2019/07/24/notebook<p>It is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.</p> <p>Python and R (in alphabetic order) are arguably the primary languages used by data scientists nowadays. In the context of interactive computations, <a href="https://ipython.org/">IPython</a> and later on <a href="https://jupyter.org/">Project Jupyter</a> have been of paramount importance for the Python community (the latter is actually language agnostic). In the R community, this role has been played by <a href="https://www.rstudio.com/">RStudio</a>. Therefore, having at one’s disposal <a href="https://jupyter.org/">JupyterLab</a>, which is Project Jupyter’s flagship, and RStudio should make one well equipped for a wide range of data challenges. 
As alluded to earlier, the objective is to have an environment that has a fixed initial state defined by us and is accessible to us on any machine we might happen to work on. This problem definition is a perfect fit for containerization. Specifically, we shall build custom-tailored <a href="https://www.docker.com/">Docker</a> images for JupyterLab and RStudio and create a few convenient shortcuts for launching them.</p> <p>The code discussed below can be found in the following two repositories:</p> <ul> <li><a href="https://github.com/chain-rule/JupyterLab/tree/article">JupyterLab</a> and</li> <li><a href="https://github.com/chain-rule/RStudio/tree/article">RStudio</a>.</li> </ul> <h1 id="jupyterlab">JupyterLab</h1> <p>In order to build a Docker image for JupyterLab, we begin with a <a href="https://github.com/chain-rule/JupyterLab/blob/article/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>:</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start with a minimal Python image</span> <span class="k">FROM</span><span class="s"> python:3.7-slim</span> <span class="c"># Install the desired Python packages</span> <span class="k">COPY</span><span class="s"> requirements.txt /tmp/requirements.txt</span> <span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> pip <span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--requirement</span> /tmp/requirements.txt <span class="c"># Configure JupyterLab to use a specific IP address and port</span> <span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> ~/.jupyter <span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"c.NotebookApp.ip = '0.0.0.0'"</span> <span class="o">&gt;&gt;</span> ~/.jupyter/jupyter_notebook_config.py <span class="k">RUN </span><span class="nb">echo</span> 
<span class="s2">"c.NotebookApp.port = 8888"</span> <span class="o">&gt;&gt;</span> ~/.jupyter/jupyter_notebook_config.py <span class="c"># Set the working directory</span> <span class="k">WORKDIR</span><span class="s"> /home/jupyterlab</span> <span class="c"># Start JupyterLab once the container is launched</span> <span class="k">ENTRYPOINT</span><span class="s"> jupyter lab --allow-root --no-browser</span> </code></pre></div></div> <p>In words, we take a minimalistic image with the desired version of Python preinstalled—in this case, it is the <a href="https://hub.docker.com/_/python">official Python image</a> tagged <code class="language-plaintext highlighter-rouge">3.7-slim</code>, which refers to Python 3.7 with any available bug fixes promptly applied—and add packages that we consider to be important for our work. These packages are gathered in the usual <a href="https://github.com/chain-rule/JupyterLab/blob/article/requirements.txt"><code class="language-plaintext highlighter-rouge">requirements.txt</code></a>, which might look as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>jupyterlab matplotlib numpy pandas pylint pytest scikit-learn scipy seaborn tensorflow yapf </code></pre></div></div> <p>The first one, <code class="language-plaintext highlighter-rouge">jupyterlab</code>, is essential; the rest is up to the data scientist’s taste. An important aspect to note is that, in this example, the versions of the listed packages are not fixed; hence, the latest available versions will be taken each time a new image is built. Alternatively, one can pin them to specific numbers by changing <code class="language-plaintext highlighter-rouge">requirements.txt</code>.
For instance, one might write <code class="language-plaintext highlighter-rouge">tensorflow==1.14.0</code> instead of <code class="language-plaintext highlighter-rouge">tensorflow</code>.</p> <p>Having defined an image, we need a tool for orchestration. We would like to have a convenient command for actually building the image and, more importantly, a convenient command for launching a container with that image from an arbitrary directory. The versatile <code class="language-plaintext highlighter-rouge">make</code> to the rescue!</p> <div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the Docker image </span><span class="nv">name</span> <span class="o">:=</span> jupyterlab <span class="c"># The directory to be mounted to the container </span><span class="nv">root</span> <span class="o">?=</span> <span class="nv">${PWD}</span> <span class="c"># Build a new image </span><span class="nl">build</span><span class="o">:</span> docker rmi <span class="nv">${name}</span> <span class="o">||</span> <span class="nb">true</span> docker build <span class="nt">--tag</span> <span class="nv">${name}</span> . 
<span class="c"># Start a new container </span><span class="nl">start</span><span class="o">:</span> <span class="p">@</span>docker run <span class="nt">--interactive</span> <span class="nt">--tty</span> <span class="nt">--rm</span> <span class="se">\</span> <span class="nt">--name</span> <span class="nv">${name}</span> <span class="se">\</span> <span class="nt">--publish</span> 8888:8888 <span class="se">\</span> <span class="nt">--volume</span> <span class="s2">"</span><span class="nv">${root}</span><span class="s2">:/home/jupyterlab"</span> <span class="se">\</span> <span class="nv">${name}</span> </code></pre></div></div> <p>In the above <a href="https://github.com/chain-rule/JupyterLab/blob/article/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>, we define two commands: <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">start</code>. The <code class="language-plaintext highlighter-rouge">build</code> command instructs Docker to build a new image according to the recipe in <code class="language-plaintext highlighter-rouge">Dockerfile</code>. The <code class="language-plaintext highlighter-rouge">start</code> command launches a new container and mounts the directory specified by the <code class="language-plaintext highlighter-rouge">root</code> variable to the file system inside the container using the <code class="language-plaintext highlighter-rouge">--volume</code> option. 
It also forwards port 8888 inside the container, which is the one specified in <code class="language-plaintext highlighter-rouge">Dockerfile</code>, to port 8888 on the host machine so that JupyterLab can be reached from the browser.</p> <p>Let us now go ahead and try the two commands:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build make start </code></pre></div></div> <p>JupyterLab should come back with usage instructions similar to the following:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>... [I 18:40:15.078 LabApp] The Jupyter Notebook is running at: [I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=&lt;token&gt; [I 18:40:15.078 LabApp] or http://127.0.0.1:8888/?token=&lt;token&gt; [I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [C 18:40:15.082 LabApp] To access the notebook, open this file in a browser: file:///root/.local/share/jupyter/runtime/nbserver-6-open.html Or copy and paste one of these URLs: http://e4edba021595:8888/?token=&lt;token&gt; or http://127.0.0.1:8888/?token=&lt;token&gt; ... </code></pre></div></div> <p>By clicking on the last link, we end up in a fully fledged JupyterLab. Congratulations! However, there is one step left. JupyterLab is currently running in the folder with our <code class="language-plaintext highlighter-rouge">Dockerfile</code> and <code class="language-plaintext highlighter-rouge">Makefile</code>, which is not particularly useful, as each project we might want to work on probably lives in its own folder elsewhere in the file system. 
Fortunately, it is easy to fix with an alias:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">jupyterlab</span><span class="o">=</span><span class="s1">'make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'</span> </code></pre></div></div> <p>This command should be placed in the start-up script of the shell being utilized. In the case of Bash, it can be done as follows:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"alias jupyterlab='make -C </span><span class="se">\"</span><span class="k">${</span><span class="nv">PWD</span><span class="k">}</span><span class="se">\"</span><span class="s2"> root=</span><span class="se">\"\$</span><span class="s2">{PWD}</span><span class="se">\"</span><span class="s2">'"</span> <span class="o">&gt;&gt;</span> ~/.bashrc </code></pre></div></div> <p>Now, in a new terminal, one should be able to run JupyterLab from any directory as follows:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cd</span> /path/to/some/project jupyterlab </code></pre></div></div> <p>Note that the content of the current working directory (that is, <code class="language-plaintext highlighter-rouge">/path/to/some/project</code>) is readily available inside JupyterLab. All notebooks created and modified in the GUI there will be stored directly in this folder, and they will remain here when the container is shut down.</p> <h1 id="rstudio">RStudio</h1> <p>It is time to get to grips with an image for R notebooks. 
As before, we begin with a <a href="https://github.com/chain-rule/RStudio/blob/article/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>:</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Start with an RStudio image</span> <span class="k">FROM</span><span class="s"> rocker/rstudio:latest</span> <span class="c"># Install the software that R packages require</span> <span class="k">RUN </span>apt-get update <span class="k">RUN </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> libxml2-dev texlive texlive-latex-extra zlib1g-dev <span class="c"># Set the working directory</span> <span class="k">WORKDIR</span><span class="s"> /home/rstudio</span> <span class="c"># Install the desired R packages</span> <span class="k">COPY</span><span class="s"> requirements.txt /tmp/requirements.txt</span> <span class="k">RUN </span><span class="nb">echo</span> <span class="s2">"install.packages(readLines('/tmp/requirements.txt'), </span><span class="se">\ </span><span class="s2"> repos = 'http://cran.us.r-project.org')"</span> | R </code></pre></div></div> <p>Installing RStudio from scratch is not an easy task. Fortunately, we can start with the <a href="https://hub.docker.com/r/rocker/rstudio/">official RStudio image</a>, which is what is specified at the top of the file. If desired, the <code class="language-plaintext highlighter-rouge">latest</code> tag can be changed to a specific version. The second block of Docker instructions is to provide programs and libraries that are needed by the R packages that one is planning to install. For instance, TeX Live is needed for rendering notebooks as PDF documents using LaTeX. The last block of instructions in <code class="language-plaintext highlighter-rouge">Dockerfile</code> is for installing the R packages themselves. 
As with Python, all necessary packages are gathered in a single file called <a href="https://github.com/chain-rule/RStudio/blob/article/requirements.txt"><code class="language-plaintext highlighter-rouge">requirements.txt</code></a>:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>devtools glmnet plotly rmarkdown rstan testthat tidytext tidyverse </code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">rmarkdown</code> package is required for notebooks in Markdown. The rest is intended to be changed according to one’s preferences, although <code class="language-plaintext highlighter-rouge">tidyverse</code> is arguably a must in modern R.</p> <p>All right, in order to build the image and launch containers, we create the following <a href="https://github.com/chain-rule/RStudio/blob/article/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>:</p> <div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the Docker image </span><span class="nv">name</span> <span class="o">:=</span> rstudio <span class="c"># The directory to be mounted to the container </span><span class="nv">root</span> <span class="o">?=</span> <span class="nv">${PWD}</span> <span class="c"># Build a new image </span><span class="nl">build</span><span class="o">:</span> docker rmi <span class="nv">${name}</span> <span class="o">||</span> <span class="nb">true</span> docker build <span class="nt">--tag</span> <span class="nv">${name}</span> .
<span class="c"># Start a new container </span><span class="nl">start</span><span class="o">:</span> <span class="p">@</span><span class="nb">echo</span> <span class="s2">"Address: http://localhost:8787/"</span> <span class="p">@</span><span class="nb">echo</span> <span class="s2">"User: rstudio"</span> <span class="p">@</span><span class="nb">echo</span> <span class="s2">"Password: rstud10"</span> <span class="p">@</span><span class="nb">echo</span> <span class="p">@</span><span class="nb">echo</span> <span class="s1">'Press Control-C to terminate...'</span> <span class="p">@</span>docker run <span class="nt">--interactive</span> <span class="nt">--tty</span> <span class="nt">--rm</span> <span class="se">\</span> <span class="nt">--name</span> <span class="nv">${name}</span> <span class="se">\</span> <span class="nt">--publish</span> 8787:8787 <span class="se">\</span> <span class="nt">--volume</span> <span class="s2">"</span><span class="nv">${root}</span><span class="s2">:/home/rstudio"</span> <span class="se">\</span> <span class="nt">--env</span> <span class="nv">PASSWORD</span><span class="o">=</span>rstud10 <span class="se">\</span> <span class="nv">${name}</span> <span class="o">&gt;</span> /dev/null </code></pre></div></div> <p>It is similar to the one for JupyterLab; however, since the default prompt of RStudio is not as informative as the one of JupyterLab, we print our own usage instructions upon <code class="language-plaintext highlighter-rouge">start</code>.</p> <p>The final piece is the shortcut for launching RStudio:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">alias </span><span class="nv">rstudio</span><span class="o">=</span><span class="s1">'make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'</span> </code></pre></div></div> <p>In the case of Bash, it can be installed as follows:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="nb">echo</span> <span class="s2">"alias rstudio='make -C </span><span class="se">\"</span><span class="k">${</span><span class="nv">PWD</span><span class="k">}</span><span class="se">\"</span><span class="s2"> root=</span><span class="se">\"\$</span><span class="s2">{PWD}</span><span class="se">\"</span><span class="s2">'"</span> <span class="o">&gt;&gt;</span> ~/.bashrc </code></pre></div></div> <p>Now it is time to build the image, go to an arbitrary directory, and test the alias:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build <span class="nb">cd</span> /path/to/some/project rstudio </code></pre></div></div> <p>Unlike the JupyterLab image, this one is much slower to build due to R packages traditionally compiling a lot of C++ code upon installation.</p> <p>Lastly, it might be particularly convenient to have one’s GUI preferences (such as the font size in the editor) and alike be automatically set up upon each container launch. This can be achieved by realizing that RStudio stores user preferences in a local folder called <code class="language-plaintext highlighter-rouge">.rstudio</code>. Then the <code class="language-plaintext highlighter-rouge">start</code> command can be adjusted to silently plant a preconfigured <code class="language-plaintext highlighter-rouge">.rstudio</code> into the current working directory, which can be seen in the <a href="https://github.com/chain-rule/RStudio/tree/article">repository</a> accompanying this article.</p> <h1 id="conclusion">Conclusion</h1> <p>Having completed the above steps, we have two Docker images: one is for Python notebooks via JupyterLab, and one is for R notebooks via RStudio. 
At the moment, the images are stored locally; however, they can be pushed to a public or private image repository, such as <a href="https://hub.docker.com/">Docker Hub</a> and <a href="https://cloud.google.com/container-registry/">Google Container Registry</a>, and subsequently pulled on an arbitrary machine having Docker installed. Alternatively, they can be built on each machine separately. Regardless of the installation, the crucial point is that our working environment will unshakably remain in a specific pristine state defined by us.</p> <p>Lastly, it is worth noting that similar images can straightforwardly be built for more specific scenarios. For instance, the following repository provides a skeleton for building and using a custom <a href="https://cloud.google.com/datalab/">Datalab</a>, which is Google’s wrapper for Jupyter notebooks that run in the cloud: <a href="https://github.com/chain-rule/Datalab">Datalab</a>.</p>Ivan UkhovIt is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. 
Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.On the expected utility in conversion rate optimization2019-07-08T06:00:00+00:002019-07-08T06:00:00+00:00https://blog.ivanukhov.com/2019/07/08/conversion<p>It can be not only extremely useful but also deeply satisfying to occasionally dust off one’s math skills. In this article, we approach the classical problem of conversion rate optimization—which is frequently faced by companies operating online—and derive the expected utility of switching from variant A to variant B under some modeling assumptions. This information can subsequently be utilized in order to support the corresponding decision-making process.</p> <p>An R implementation of the math below and more can be found in the following repository:</p> <ul> <li><a href="https://github.com/chain-rule/conversion-rate">conversion-rate</a>.</li> </ul> <p>However, it was written for personal exploratory purposes and has no documentation at the moment. If you decide to dive in, you will be on your own.</p> <h1 id="problem">Problem</h1> <p>Suppose, as a business, you send communications to your customers in order to increase their engagement with the product. Furthermore, suppose you suspect that a certain change to the usual way of working might increase the uplift. In order to test your hypothesis, you set up an A/B test. The only decision you care about is whether or not you should switch from variant A to variant B where variant A is the baseline (the usual way of working). The twist is that, from the perspective of the business, variant B comes with its own gain if it is the winner, and its own loss if it is the loser. 
The goal is to incorporate this information in the final decision, making necessary assumptions along the way.</p> <h1 id="solution">Solution</h1> <p>Let $$A$$ and $$B$$ be two random variables modeling the conversion rates of the two variants, variant A and variant B. Furthermore, let $$p$$ be the probability density function of the joint distribution of $$A$$ and $$B$$. In what follows, concrete values assumed by the variables are denoted by $$a$$ and $$b$$, respectively.</p> <p>Define the utility function as</p> $U(a, b) = G(a, b) I(a &lt; b) + L(a, b) I(a &gt; b)$ <p>where $$G$$ and $$L$$ are referred to as the gain and loss functions, respectively. The gain function takes effect when variant B has a higher conversion rate than the one of variant A, and the loss function takes effect when variant A is better than variant B, which is what is enforced by the two indicator functions (the equality is not essential). The expected utility is then as follows:</p> \begin{align} E(U(A, B)) &amp;= \int_0^1 \int_0^1 U(a, b) p(a, b) \, db \, da \\ &amp;= \int_0^1 \int_a^1 G(a, b) p(a, b) \, db \, da + \int_0^1 \int_0^a L(a, b) p(a, b) \, db \, da. \end{align} <p>We assume further the gain and loss are linear:</p> \begin{align} &amp; G(a, b) = w_g (b - a) \text{ and} \\ &amp; L(a, b) = w_l (b - a). \end{align} <p>In the above, $$w_g$$ and $$w_l$$ are two non-negative scaling factors, which can be used to encode business preferences. Then we have that</p> \begin{align} E(U(A, B)) = &amp; w_g \int_0^1 \int_a^1 b \, p(a, b) \, db \, da - w_g \int_0^1 \int_a^1 a \, p(a, b) \, db \, da + {} \\ &amp; w_l \int_0^1 \int_0^a b \, p(a, b) \, db \, da - w_l \int_0^1 \int_0^a a \, p(a, b) \, db \, da. 
\end{align} <p>For convenience, denote the four integrals by $$G_1$$, $$G_2$$, $$L_1$$, and $$L_2$$, respectively, in which case we have that</p> $E(U(A, B)) = w_g \, G_1 - w_g \, G_2 + w_l \, L_1 - w_l \, L_2.$ <p>Now, suppose the distributions of $$A$$ and $$B$$ are estimated using Bayesian inference. In this approach, the prior knowledge of the decision-maker about the conversion rates of the two variants is combined with the evidence in the form of data continuously streaming from the A/B test. It is natural to use a binomial distribution for the data and a beta distribution for the prior knowledge, which results in a posterior distribution that is also a beta distribution due to conjugacy.</p> <p><em>A posteriori</em>, we have the following marginal distributions:</p> \begin{align} &amp; A \sim \text{Beta}(\alpha_a, \beta_a) \text{ and} \\ &amp; B \sim \text{Beta}(\alpha_b, \beta_b) \end{align} <p>where $$\alpha_a$$ and $$\beta_a$$ are the shape parameters of $$A$$, and $$\alpha_b$$ and $$\beta_b$$ are those of $$B$$. Assuming that the two random variables are independent given the parameters,</p> $p(a, b) = p(a) \, p(b) = \frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)} \frac{b^{\alpha_b - 1} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)}.$ <p>We can now compute the expected utility.
The first integral is as follows:</p> \begin{align} G_1 &amp;= \int_0^1 \int_a^1 \frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)} \frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)} \, db \, da \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} \int_0^1 \int_a^1 \frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)} \frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b + 1, \beta_b)} \, db \, da \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b) \end{align} <p>where, with a slight abuse of notation, $$B$$ is the beta function and</p> $h(\alpha_1, \beta_1, \alpha_2, \beta_2) = P(X_1 &lt; X_2)$ <p>for any</p> \begin{align} &amp; X_1 \sim \text{Beta}(\alpha_1, \beta_1) \text{ and} \\ &amp; X_2 \sim \text{Beta}(\alpha_2, \beta_2). \end{align} <p>The function $$h$$ can be computed analytically, as shown in the blog posts listed in the acknowledgments below. Specifically, provided that $$\alpha_2$$ is a positive integer,</p> $h(\alpha_1, \beta_1, \alpha_2, \beta_2) = \sum_{i = 0}^{\alpha_2 - 1} \frac{B(\alpha_1 + i, \beta_1 + \beta_2)}{(\beta_2 + i) B(1 + i, \beta_2) B(\alpha_1, \beta_1)}.$ <p>Similarly,</p> $G_2 = \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b).$ <p>Regarding the last two integrals in the expression of the utility function,</p> \begin{align} L_1 &amp;= \int_0^1 \int_0^a \frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)} \frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b, \beta_b)} \, db \, da \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} \int_0^1 \int_0^a \frac{a^{\alpha_a - 1} (1 - a)^{\beta_a - 1}}{B(\alpha_a, \beta_a)} \frac{b^{\alpha_b} (1 - b)^{\beta_b - 1}}{B(\alpha_b + 1, \beta_b)} \, db \, da \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a).
\end{align} <p>Also,</p> $L_2 = \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a).$ <p>Assembling the integrals, we obtain</p> \begin{align} E(U(A, B)) = &amp; w_g \, \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b) - {} \\ &amp; w_g \, \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b) + {} \\ &amp; w_l \, \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a) - {} \\ &amp; w_l \, \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a). \end{align} <p>At this point, we could call it a day, but there is some room for simplification. Note that, in the case of the assumed linear model, we have the following relationship between $$G$$ and $$L$$:</p> \begin{align} G_1 - G_2 &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} h(\alpha_a, \beta_a, \alpha_b + 1, \beta_b) - \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} h(\alpha_a + 1, \beta_a, \alpha_b, \beta_b) \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} (1 - h(\alpha_b + 1, \beta_b, \alpha_a, \beta_a)) - \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} (1 - h(\alpha_b, \beta_b, \alpha_a + 1, \beta_a)) \\ &amp;= \frac{B(\alpha_b + 1, \beta_b)}{B(\alpha_b, \beta_b)} - \frac{B(\alpha_a + 1, \beta_a)}{B(\alpha_a, \beta_a)} - (L_1 - L_2) \\ &amp;= \Delta - (L_1 - L_2) \end{align} <p>where $$\Delta$$ is the difference between the above two ratios of beta functions. Therefore,</p> \begin{align} E(U(A, B)) &amp;= w_g (G_1 - G_2) + w_l (L_1 - L_2) \\ &amp;= w_g (G_1 - G_2) + w_l (\Delta - (G_1 - G_2)) \\ &amp;= (w_g - w_l) (G_1 - G_2) + w_l \, \Delta. \end{align} <h1 id="conclusion">Conclusion</h1> <p>The decision-maker is now better equipped to take action.
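<p>In practice, the closed-form result is straightforward to implement. The following sketch (the function names are hypothetical, and recall that the summation formula for $$h$$ holds only when $$\alpha_2$$ is a positive integer) relies on the identity $$B(\alpha + 1, \beta) / B(\alpha, \beta) = \alpha / (\alpha + \beta)$$, which is the mean of a beta distribution, and uses only the standard library:</p>

```python
from math import exp, lgamma


def log_beta(a, b):
    # The logarithm of the beta function via the log-gamma function
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def h(alpha_1, beta_1, alpha_2, beta_2):
    # P(X_1 < X_2) with X_i ~ Beta(alpha_i, beta_i);
    # alpha_2 is assumed to be a positive integer
    log_b = log_beta(alpha_1, beta_1)
    return sum(
        exp(log_beta(alpha_1 + i, beta_1 + beta_2)
            - log_beta(1 + i, beta_2) - log_b) / (beta_2 + i)
        for i in range(int(alpha_2))
    )


def expected_utility(alpha_a, beta_a, alpha_b, beta_b, w_g, w_l):
    # E(U(A, B)) = (w_g - w_l) (G_1 - G_2) + w_l Delta
    mean_a = alpha_a / (alpha_a + beta_a)  # B(alpha_a + 1, beta_a) / B(alpha_a, beta_a)
    mean_b = alpha_b / (alpha_b + beta_b)  # B(alpha_b + 1, beta_b) / B(alpha_b, beta_b)
    g = mean_b * h(alpha_a, beta_a, alpha_b + 1, beta_b) \
        - mean_a * h(alpha_a + 1, beta_a, alpha_b, beta_b)  # G_1 - G_2
    return (w_g - w_l) * g + w_l * (mean_b - mean_a)  # Delta = E(B) - E(A)
```

<p>As a quick check, setting $$w_g = w_l = 1$$ reduces the utility to $$U(a, b) = b - a$$, in which case the function returns $$\Delta = E(B) - E(A)$$, as expected.</p>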
Having obtained the posterior distributions of the conversion rates of the two variants, the derived formula allows one to assess whether variant B is worth switching to, considering its utility to the business at hand.</p> <p>The reason the expected utility $$E(U(A, B))$$ can be evaluated in closed form in this case is the linearity of the utility function $$U(a, b)$$. More nuanced preferences require a different approach. The most flexible candidate is simulation, which is straightforward and should arguably be the go-to tool regardless of the availability of a closed-form solution, as it is less error-prone.</p> <p>Please feel free to reach out if you have any thoughts or suggestions.</p> <h1 id="acknowledgments">Acknowledgments</h1> <p>This article is largely inspired by a series of excellent blog posts by <a href="http://www.evanmiller.org/bayesian-ab-testing.html">Evan Miller</a>, <a href="https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html">Chris Stucchio</a>, and <a href="http://varianceexplained.org/r/bayesian-ab-testing/">David Robinson</a>, which are strongly recommended.</p> <h1 id="references">References</h1> <ul> <li>Chris Stucchio, “<a href="https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html">Easy evaluation of decision rules in Bayesian A/B testing</a>,” 2014.</li> <li>David Robinson, “<a href="http://varianceexplained.org/r/bayesian-ab-testing/">Is Bayesian A/B testing immune to peeking? Not exactly</a>,” 2015.</li> <li>Evan Miller, “<a href="http://www.evanmiller.org/bayesian-ab-testing.html">Formulas for Bayesian A/B testing</a>,” 2014.</li> </ul>Ivan UkhovIt can be not only extremely useful but also deeply satisfying to occasionally dust off one’s math skills. In this article, we approach the classical problem of conversion rate optimization—which is frequently faced by companies operating online—and derive the expected utility of switching from variant A to variant B under some modeling assumptions. 
This information can subsequently be utilized in order to support the corresponding decision-making process.A poor man’s orchestration of predictive models, or do it yourself2019-07-01T06:00:00+00:002019-07-01T06:00:00+00:00https://blog.ivanukhov.com/2019/07/01/orchestration<p>As a data scientist focusing on developing data products, you naturally want your work to reach its target audience. Suppose, however, that your company does not have a dedicated engineering team for productizing data-science code. One solution is to seek help from other teams, which are surely busy with their own endeavors, and spend months waiting. Alternatively, you could take the initiative and do it yourself. In this article, we take the initiative and schedule the training and application phases of a predictive model using Apache <a href="https://airflow.apache.org/">Airflow</a>, Google <a href="https://cloud.google.com/compute/">Compute Engine</a>, and <a href="https://www.docker.com/">Docker</a>.</p> <p>Let us first set expectations for what is assumed to be given and what will be attained by the end of this article. It is assumed that a predictive model for supporting business decisions—such as a model for identifying potential churners or a model for estimating the lifetime value of customers—has already been developed. This means that a business problem has already been identified and translated into a concrete question, the data needed for answering the question have already been collected and transformed into a target variable and a set of explanatory variables, and a modeling technique has already been selected and calibrated in order to answer the question by predicting the target variable given the explanatory variables. For the sake of concreteness, the model is assumed to be written in Python.
We also assume that the company at hand has chosen Google Cloud Platform as its primary platform, which makes a certain suite of tools readily available.</p> <p>Our goal is then to schedule the model to run in the cloud via Airflow, Compute Engine, and Docker so that it is periodically retrained (in order to take into account potential fluctuations in the data distribution) and periodically applied (in order to actually make predictions), delivering predictions to the data warehouse in the form of <a href="https://cloud.google.com/bigquery/">BigQuery</a> for further consumption by other parties.</p> <p>It is important to note that this article is not a tutorial on any of the aforementioned technologies. The reader is assumed to be familiar with Google Cloud Platform and to have an understanding of Airflow and Docker, as well as to be comfortable with finding out missing details on their own.</p> <p>Lastly, the following two repositories contain the code discussed below:</p> <ul> <li><a href="https://github.com/chain-rule/example-prediction">example-prediction</a> and</li> <li><a href="https://github.com/chain-rule/example-prediction-service">example-prediction-service</a>.</li> </ul> <h1 id="preparing-the-model">Preparing the model</h1> <p>The suggested structure of the repository hosting the model is as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>. ├── configs/ │ ├── application.json │ └── training.json ├── prediction/ │ ├── __init__.py │ ├── main.py │ ├── model.py │ └── task.py ├── README.md └── requirements.txt </code></pre></div></div> <p>Here <a href="https://github.com/chain-rule/example-prediction/tree/master/prediction"><code class="language-plaintext highlighter-rouge">prediction/</code></a> is a Python package, and it is likely to contain many more files than the ones listed. 
The <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/main.py"><code class="language-plaintext highlighter-rouge">main</code></a> file is the entry point for command-line invocation, the <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/task.py"><code class="language-plaintext highlighter-rouge">task</code></a> module defines the actions that the package is capable of performing, and the <a href="https://github.com/chain-rule/example-prediction/blob/master/prediction/model.py"><code class="language-plaintext highlighter-rouge">model</code></a> module defines the model.</p> <p>As alluded to above, the primary job of the <code class="language-plaintext highlighter-rouge">main</code> file is to parse command-line arguments, read a configuration file, potentially set up logging and the like, and delegate the rest to the <code class="language-plaintext highlighter-rouge">task</code> module. At a later stage, an invocation of an action might look as follows:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python <span class="nt">-m</span> prediction.main <span class="nt">--action</span> training <span class="nt">--config</span> configs/training.json </code></pre></div></div> <p>Here we are passing two arguments: <code class="language-plaintext highlighter-rouge">--action</code> and <code class="language-plaintext highlighter-rouge">--config</code>. The former is to specify the desired action, and the latter is to supply additional configuration parameters, such as the location of the training data and the values of the model’s hyperparameters.
Keeping all parameters in a separate file, as opposed to hard-coding them, makes the code reusable, and passing them all at once as a single file scales much better than passing each parameter as a separate argument.</p> <p>The <code class="language-plaintext highlighter-rouge">task</code> module is conceptually as follows (see the repository for the exact implementation):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Task</span><span class="p">:</span> <span class="k">def</span> <span class="nf">training</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="c1"># Read the training data </span> <span class="c1"># Train the model </span> <span class="c1"># Save the trained model </span> <span class="k">def</span> <span class="nf">application</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="c1"># Read the application data </span> <span class="c1"># Load the trained model </span> <span class="c1"># Make predictions </span> <span class="c1"># Save the predictions </span></code></pre></div></div> <p>In this example, there are two tasks: training and application. The training task is responsible for fetching the training data, training the model, and saving the result in a predefined location for future usage by the application task. 
The application task is responsible for fetching the application data (that is, the data the model is supposed to be applied to), loading the trained model produced by the training task, making predictions, and saving them in a predefined location to be picked up for the subsequent delivery to the data warehouse.</p> <p>Likewise, the <code class="language-plaintext highlighter-rouge">model</code> module can be simplified as follows:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Model</span><span class="p">:</span> <span class="k">def</span> <span class="nf">fit</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span> <span class="c1"># Estimate the model’s parameters </span> <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">data</span><span class="p">):</span> <span class="c1"># Make predictions using the estimated parameters </span></code></pre></div></div> <p>It can be seen that the structure presented above makes very few assumptions about the model, which makes it generally applicable. It can also be easily extended to accommodate other actions. For instance, one could have a separate action for testing the model on unseen data.</p> <p>Having structured the model as shown above, it can now be productized, which we discuss next.</p> <h1 id="wrapping-the-model-into-a-service">Wrapping the model into a service</h1> <p>Now it is time to turn the model into a service. In the scope of this article, a service is a self-sufficient piece of code that can be executed in the cloud upon request. To this end, another repository is created, adhering to the separation-of-concerns design principle. 
Specifically, by doing so, we avoid mixing the modeling code with the code specific to a particular environment where the model happens to be deployed. The suggested structure of the repository is as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>. ├── container/ │ ├── Dockerfile │ ├── run.sh │ └── wait.sh ├── service/ │ ├── configs/ │ │ ├── application.json │ │ └── training.json │ ├── source/ # the first repository as a submodule │ └── requirements.txt ├── scheduler/ │ ├── configs/ │ │ ├── application.json │ │ └── training.json │ ├── application.py # a symbolic link to graph.py │ ├── graph.py │ └── training.py # a symbolic link to graph.py ├── Makefile └── README.md </code></pre></div></div> <p>The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container"><code class="language-plaintext highlighter-rouge">container/</code></a> folder contains files for building a Docker image for the service. The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/service"><code class="language-plaintext highlighter-rouge">service/</code></a> folder is the service itself, meaning that these files will be present in the container and eventually executed. Lastly, the <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler"><code class="language-plaintext highlighter-rouge">scheduler/</code></a> folder contains files for scheduling the service using Airflow. The last one will be covered in the next section; here we focus on the first two.</p> <p>Let us start with <code class="language-plaintext highlighter-rouge">service/</code>. The first repository (the one discussed in the previous section) is added to this second repository as a Git submodule living in <code class="language-plaintext highlighter-rouge">service/source/</code>. 
This means that the model will essentially be embedded in the service but will conveniently remain an independent entity. At all times, the service contains a reference to a particular state (a particular commit, potentially on a dedicated release branch) of the model, guaranteeing that the desired version of the model is in production. However, when invoking the model from the service, we might want to use a different set of configuration files than the ones present in the first repository. To this end, a service-specific version of the configuration files is created in <code class="language-plaintext highlighter-rouge">service/configs/</code>. We might also want to install additional Python dependencies; hence, there is a separate file with requirements.</p> <p>Now it is time to containerize the service code by building a Docker image. The relevant files are gathered in <code class="language-plaintext highlighter-rouge">container/</code>. The image is defined in <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/Dockerfile"><code class="language-plaintext highlighter-rouge">container/Dockerfile</code></a> and is as follows:</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Use a minimal Python image</span> <span class="k">FROM</span><span class="s"> python:3.7-slim</span> <span class="c"># Install Google Cloud SDK as described in</span> <span class="c"># https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu</span> <span class="c"># Copy the service directory to the image</span> <span class="k">COPY</span><span class="s"> service /service</span> <span class="c"># Copy the run script to the image</span> <span class="k">COPY</span><span class="s"> container/run.sh /run.sh</span> <span class="c"># Install Python dependencies specific to the predictive model</span> <span class="k">RUN </span>pip <span class="nb">install</span> <span 
class="nt">--upgrade</span> <span class="nt">--requirement</span> /service/source/requirements.txt <span class="c"># Install Python dependencies specific to the service</span> <span class="k">RUN </span>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--requirement</span> /service/requirements.txt <span class="c"># Set the working directory to be the service directory</span> <span class="k">WORKDIR</span><span class="s"> /service</span> <span class="c"># Set the entry point to be the run script</span> <span class="k">ENTRYPOINT</span><span class="s"> /run.sh</span> </code></pre></div></div> <p>As mentioned earlier, <code class="language-plaintext highlighter-rouge">service/</code> gets copied as is (including <code class="language-plaintext highlighter-rouge">service/source</code> with the model), and it will be the working directory inside the container. We also copy <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/run.sh"><code class="language-plaintext highlighter-rouge">container/run.sh</code></a>, which becomes the entry point of the container; this script is executed whenever a container is launched. 
Let us take a look at the content of the script (as before, some parts omitted for clarity):</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span> <span class="k">function </span>process_training<span class="o">()</span> <span class="o">{</span> <span class="c"># Invoke training</span> python <span class="nt">-m</span> prediction.main <span class="se">\</span> <span class="nt">--action</span> <span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span> <span class="se">\</span> <span class="nt">--config</span> configs/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>.json <span class="c"># Set the output location in Cloud Storage</span> <span class="nb">local </span><span class="nv">output</span><span class="o">=</span>gs://<span class="k">${</span><span class="nv">NAME</span><span class="k">}</span>/<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">timestamp</span><span class="k">}</span> <span class="c"># Copy the trained model from the output directory to Cloud Storage</span> save output <span class="k">${</span><span class="nv">output</span><span class="k">}</span> <span class="o">}</span> <span class="k">function </span>process_application<span class="o">()</span> <span class="o">{</span> <span class="c"># Find the latest trained model in Cloud Storage</span> <span class="c"># Copy the trained model from Cloud Storage to the output directory</span> load <span class="k">${</span><span class="nv">input</span><span class="k">}</span> output <span class="c"># Invoke application</span> python <span class="nt">-m</span> prediction.main <span class="se">\</span> <span class="nt">--action</span> <span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span> <span 
class="se">\</span> <span class="nt">--config</span> configs/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>.json <span class="c"># Set the output location in Cloud Storage</span> <span class="nb">local </span><span class="nv">output</span><span class="o">=</span>gs://<span class="k">${</span><span class="nv">NAME</span><span class="k">}</span>/<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span>/<span class="k">${</span><span class="nv">timestamp</span><span class="k">}</span> <span class="c"># Copy the predictions from the output directory to Cloud Storage</span> save output <span class="k">${</span><span class="nv">output</span><span class="k">}</span> <span class="c"># Set the input file in Cloud Storage</span> <span class="c"># Set the output data set and table in BigQuery</span> <span class="c"># Ingest the predictions from Cloud Storage into BigQuery</span> ingest <span class="k">${</span><span class="nv">input</span><span class="k">}</span> <span class="k">${</span><span class="nv">output</span><span class="k">}</span> player_id:STRING,label:BOOL <span class="o">}</span> <span class="k">function </span>delete<span class="o">()</span> <span class="o">{</span> <span class="c"># Delete a Compute Engine instance called "${NAME}-${VERSION}-${ACTION}"</span> <span class="o">}</span> <span class="k">function </span>ingest<span class="o">()</span> <span class="o">{</span> <span class="c"># Ingest a file from Cloud Storage into a table in BigQuery</span> <span class="o">}</span> <span class="k">function </span>load<span class="o">()</span> <span class="o">{</span> <span class="c"># Sync the content of a location in Cloud Storage with a local directory</span> <span class="o">}</span> <span class="k">function </span>save<span class="o">()</span> <span class="o">{</span> <span class="c"># Sync the content of a local directory with 
a location in Cloud Storage</span> <span class="o">}</span> <span class="k">function </span>send<span class="o">()</span> <span class="o">{</span> <span class="c"># Write into a Stackdriver log called "${NAME}-${VERSION}-${ACTION}"</span> <span class="o">}</span> <span class="c"># Invoke the delete function when the script exits regardless of the reason</span> <span class="nb">trap </span>delete EXIT <span class="c"># Report a successful start to Stackdriver</span> send <span class="s1">'Running the action...'</span> <span class="c"># Invoke the function specified by the ACTION environment variable</span> process_<span class="k">${</span><span class="nv">ACTION</span><span class="k">}</span> <span class="c"># Report a successful completion to Stackdriver</span> send <span class="s1">'Well done.'</span> </code></pre></div></div> <p>The script expects a number of environment variables to be set upon each container launch, which will be discussed shortly. The primary ones are <code class="language-plaintext highlighter-rouge">NAME</code>, <code class="language-plaintext highlighter-rouge">VERSION</code>, and <code class="language-plaintext highlighter-rouge">ACTION</code>, indicating the name of the service, version of the service, and action to be executed by the service, respectively.</p> <p>As we shall see below, the above script interacts with several different products on Google Cloud Platform. It might then be surprising that there is only a handful of variables passed to the script. 
The explanation is that the convention-over-configuration design paradigm is followed to a great extent here, meaning that, save for sensible default values, other necessary variables can be derived from the ones given, since there are certain naming conventions used throughout the project.</p> <p>The <code class="language-plaintext highlighter-rouge">process_training</code> and <code class="language-plaintext highlighter-rouge">process_application</code> functions are responsible for training and application, respectively. It can be seen that they leverage the command-line interface by invoking the <code class="language-plaintext highlighter-rouge">main</code> file, which was discussed in the previous section. Since containers are stateless, all artifacts are stored in external storage, which is a bucket in <a href="https://cloud.google.com/storage/">Cloud Storage</a> in our case, and this job is delegated to the <code class="language-plaintext highlighter-rouge">load</code> and <code class="language-plaintext highlighter-rouge">save</code> functions used in both <code class="language-plaintext highlighter-rouge">process_training</code> and <code class="language-plaintext highlighter-rouge">process_application</code>. In addition, the result of the application action (that is, the predictions) is ingested into a table in BigQuery using <a href="https://cloud.google.com/sdk/">Cloud SDK</a>, which can be seen in the <code class="language-plaintext highlighter-rouge">ingest</code> function in <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/run.sh"><code class="language-plaintext highlighter-rouge">container/run.sh</code></a>.</p> <p>The container communicates with the outside world using <a href="https://cloud.google.com/stackdriver/">Stackdriver</a> via the <code class="language-plaintext highlighter-rouge">send</code> function, which writes messages to a log dedicated to the current service run.
The most important message is the one indicating a successful completion, which is sent at the very end; we use “Well done” for this purpose. This is the message that will be looked for in order to determine the overall outcome of a service run.</p> <p>Note also that, upon successful or unsuccessful completion, the container deletes its hosting virtual machine, which is achieved by setting a handler (<code class="language-plaintext highlighter-rouge">delete</code>) for the <code class="language-plaintext highlighter-rouge">EXIT</code> event.</p> <p>Lastly, let us discuss the commands used for building the image and launching the actions. This entails a few lengthy invocations of Cloud SDK, which can be neatly organized in a <a href="https://github.com/chain-rule/example-prediction-service/tree/master/Makefile"><code class="language-plaintext highlighter-rouge">Makefile</code></a>:</p> <div class="language-make highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># The name of the service </span><span class="nv">name</span> <span class="o">?=</span> example-prediction-service <span class="c"># The version of the service </span><span class="nv">version</span> <span class="o">?=</span> 2019-00-00 <span class="c"># The name of the project on Google Cloud Platform </span><span class="nv">project</span> <span class="o">?=</span> example-cloud-project <span class="c"># The zone for operations in Compute Engine </span><span class="nv">zone</span> <span class="o">?=</span> europe-west1-b <span class="c"># The address of Container Registry </span><span class="nv">registry</span> <span class="o">?=</span> eu.gcr.io <span class="c"># The name of the Docker image </span><span class="nv">image</span> <span class="o">:=</span> <span class="nv">${name}</span> <span class="c"># The name of the instance excluding the action </span><span class="nv">instance</span> <span class="o">:=</span> <span class="nv">${name}</span>-<span 
class="nv">${version}</span> <span class="nl">build</span><span class="o">:</span> docker rmi <span class="nv">${image}</span> 2&gt; /dev/null <span class="o">||</span> <span class="nb">true</span> docker build <span class="nt">--file</span> container/Dockerfile <span class="nt">--tag</span> <span class="nv">${image}</span> . docker tag <span class="nv">${image}</span> <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span> docker push <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span> <span class="nl">training-start</span><span class="o">:</span> gcloud compute instances create-with-container <span class="nv">${instance}</span><span class="nt">-training</span> <span class="se">\</span> <span class="nt">--container-image</span> <span class="nv">${registry}</span>/<span class="nv">${project}</span>/<span class="nv">${image}</span>:<span class="nv">${version}</span> <span class="se">\</span> <span class="nt">--container-env</span> <span class="nv">NAME</span><span class="o">=</span><span class="nv">${name}</span> <span class="se">\</span> <span class="nt">--container-env</span> <span class="nv">VERSION</span><span class="o">=</span><span class="nv">${version}</span> <span class="se">\</span> <span class="nt">--container-env</span> <span class="nv">ACTION</span><span class="o">=</span>training <span class="se">\</span> <span class="nt">--container-env</span> <span class="nv">ZONE</span><span class="o">=</span><span class="nv">${zone}</span> <span class="se">\</span> <span class="nt">--container-restart-policy</span> never <span class="se">\</span> <span class="nt">--no-restart-on-failure</span> <span class="se">\</span> <span class="nt">--machine-type</span> n1-standard-1 <span class="se">\</span> <span class="nt">--scopes</span> default,bigquery,compute-rw,storage-rw <span class="p">-</span><span 
class="nt">-zone</span> <span class="nv">${zone}</span> <span class="nl">training-wait</span><span class="o">:</span> container/wait.sh instance <span class="nv">${instance}</span><span class="nt">-training</span> <span class="nv">${zone}</span> <span class="nl">training-check</span><span class="o">:</span> container/wait.sh success <span class="nv">${instance}</span><span class="nt">-training</span> <span class="c"># Similar for application </span></code></pre></div></div> <p>Here we define one command for building images, namely <code class="language-plaintext highlighter-rouge">build</code>, and three commands per action, namely <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code>. In this section, we discuss <code class="language-plaintext highlighter-rouge">build</code> and <code class="language-plaintext highlighter-rouge">start</code> and leave the last two for the next section, as they are needed specifically for scheduling.</p> <p>The <code class="language-plaintext highlighter-rouge">build</code> command is invoked as follows:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make build </code></pre></div></div> <p>It has to be used each time a new version of the service is to be deployed. The command creates a local Docker image according to the recipe in <code class="language-plaintext highlighter-rouge">container/Dockerfile</code> and uploads it to <a href="https://cloud.google.com/container-registry/">Container Registry</a>, which is Google’s storage for Docker images. 
For the last operation to succeed, your local Docker has to be configured appropriately, which boils down to the following lines:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gcloud auth login <span class="c"># General authentication for Cloud SDK</span> gcloud auth configure-docker </code></pre></div></div> <p>Once <code class="language-plaintext highlighter-rouge">build</code> has finished successfully, one should be able to see the newly created image in <a href="https://console.cloud.google.com">Cloud Console</a> by navigating to Container Registry in the menu to the left. All future versions of the service will be neatly grouped in a separate folder in the registry.</p> <p>Given that the image is in the cloud, we can start to create virtual machines running containers with this particular image, which is what the <code class="language-plaintext highlighter-rouge">start</code> command is for:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make training-start <span class="c"># Similar for application</span> </code></pre></div></div> <p>Internally, it relies on <code class="language-plaintext highlighter-rouge">gcloud compute instances create-with-container</code>, which can be seen in <code class="language-plaintext highlighter-rouge">Makefile</code> listed above. There are a few aspects to note about this command. Apart from selecting the right image and version (<code class="language-plaintext highlighter-rouge">--container-image</code>), one has to make sure to set the environment variables mentioned earlier, as they control what the container will be doing once launched. This is achieved by passing a number of <code class="language-plaintext highlighter-rouge">--container-env</code> options to <code class="language-plaintext highlighter-rouge">create-with-container</code>. 
Here one can also easily scale the host virtual machine up and down via the <code class="language-plaintext highlighter-rouge">--machine-type</code> option. Lastly, it is important to set the <code class="language-plaintext highlighter-rouge">--scopes</code> option correctly to allow the container to work with BigQuery, Compute Engine, and Cloud Storage.</p> <p>At this point, we have a few handy commands for working with the service. It is time for scheduling.</p> <h1 id="scheduling-the-service">Scheduling the service</h1> <p>The goal now is to have both training and application executed periodically, promptly delivering predictions to the data warehouse. Technically, one could just keep invoking <code class="language-plaintext highlighter-rouge">make training-start</code> and <code class="language-plaintext highlighter-rouge">make application-start</code> on their local machine, but of course, this is neither convenient nor reliable. Instead, we would like to have an autonomous scheduler running in the cloud that would, apart from its primary task of dispatching jobs, manage temporal dependencies between jobs, keep a record of all past and upcoming jobs, and preferably provide a web-based dashboard for monitoring. One such tool is Airflow, and it is the one used in this article.</p> <p>In Airflow, the work to be performed is expressed as a directed acyclic graph defined using Python. Our job is to create two such graphs: one for training and one for application, each with its own periodicity. At this point, it might seem that each graph should contain only one node calling the <code class="language-plaintext highlighter-rouge">start</code> command, which was introduced earlier. However, a more comprehensive solution is to not only start the service but also wait for its termination and check that it successfully executed the corresponding logic. 
It will give us great visibility on the life cycle of the service in terms of various statistics (for instance, the duration and outcome of all runs) directly in Airflow.</p> <p>The above is the reason we have defined two additional commands in <code class="language-plaintext highlighter-rouge">Makefile</code>: <code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">check</code>. The <code class="language-plaintext highlighter-rouge">wait</code> command ensures that the virtual machine reached a terminal state (regardless of the outcome), and the <code class="language-plaintext highlighter-rouge">check</code> command ensures that the terminal state was the one expected. This functionality can be implemented in different ways. The approach that we use can be seen in <a href="https://github.com/chain-rule/example-prediction-service/tree/master/container/wait.sh"><code class="language-plaintext highlighter-rouge">container/wait.sh</code></a>, which is invoked by both operations from <code class="language-plaintext highlighter-rouge">Makefile</code>:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span> <span class="k">function </span>process_instance<span class="o">()</span> <span class="o">{</span> <span class="nb">echo</span> <span class="s1">'Waiting for the instance to finish...'</span> <span class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do</span> <span class="c"># Try to read some information about the instance</span> <span class="c"># Exit successfully when there is no such instance</span> <span class="nb">wait </span><span class="k">done</span> <span class="o">}</span> <span class="k">function </span>process_success<span class="o">()</span> <span class="o">{</span> <span class="nb">echo</span> <span class="s1">'Waiting for the success to be reported...'</span> <span 
class="k">while </span><span class="nb">true</span><span class="p">;</span> <span class="k">do</span> <span class="c"># Check if the last entry in Stackdriver contains “Well done”</span> <span class="c"># Exit successfully if the phrase is present</span> <span class="nb">wait </span><span class="k">done</span> <span class="o">}</span> <span class="k">function </span><span class="nb">wait</span><span class="o">()</span> <span class="o">{</span> <span class="nb">echo</span> <span class="s1">'Waiting...'</span> <span class="nb">sleep </span>10 <span class="o">}</span> <span class="c"># Invoke the function specified by the first command-line argument and forward</span> <span class="c"># the rest of the arguments to this function</span> process_<span class="k">${</span><span class="nv">1</span><span class="k">}</span> <span class="k">${</span><span class="p">@</span>:2:10<span class="k">}</span> </code></pre></div></div> <p>The script has two main functions. The <code class="language-plaintext highlighter-rouge">process_instance</code> function waits for the virtual machine to finish, and it is currently based on trying to fetch some information about the machine in question using Cloud SDK. Whenever this fetching fails, it is an indication of the machine being shut down and destroyed, which is exactly what is needed in this case. The <code class="language-plaintext highlighter-rouge">process_success</code> function waits for the key phrase “Well done” to appear in Stackdriver. 
However, this message might never appear, and this is the reason <code class="language-plaintext highlighter-rouge">process_success</code> has a timeout, unlike <code class="language-plaintext highlighter-rouge">process_instance</code>.</p> <p>All right, there are now three commands to schedule in sequence: <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code>. For instance, for training, the exact command sequence is the following:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>make training-start make training-wait make training-check </code></pre></div></div> <p>We need to create two separate Python files defining two separate Airflow graphs; however, the graphs will be almost identical except for the triggering interval and the prefix of the <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">wait</code>, and <code class="language-plaintext highlighter-rouge">check</code> commands. It then makes sense to keep the varying parts in separate configuration files and use the exact same code for constructing the graphs, adhering to the do-not-repeat-yourself design principle. 
The <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/configs"><code class="language-plaintext highlighter-rouge">scheduler/configs/</code></a> folder contains the configuration files suggested, and <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/graph.py"><code class="language-plaintext highlighter-rouge">scheduler/graph.py</code></a> is the Python script creating a graph:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span> <span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="kn">import</span> <span class="n">BashOperator</span> <span class="k">def</span> <span class="nf">configure</span><span class="p">():</span> <span class="c1"># Extract the directory containing the current file </span> <span class="n">path</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">__file__</span><span class="p">)</span> <span class="c1"># Extract the name of the current file without its extension </span> <span class="n">name</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">basename</span><span class="p">(</span><span class="n">__file__</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># Load the configuration file corresponding to the extracted name </span> <span class="n">config</span> <span class="o">=</span> <span class="n">os</span><span 
class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="s">'configs'</span><span class="p">,</span> <span class="n">name</span> <span class="o">+</span> <span class="s">'.json'</span><span class="p">)</span> <span class="n">config</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="n">config</span><span class="p">).</span><span class="n">read</span><span class="p">())</span> <span class="k">return</span> <span class="n">config</span> <span class="k">def</span> <span class="nf">construct</span><span class="p">(</span><span class="n">config</span><span class="p">):</span> <span class="k">def</span> <span class="nf">_construct_graph</span><span class="p">(</span><span class="n">default_args</span><span class="p">,</span> <span class="n">start_date</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">):</span> <span class="n">start_date</span> <span class="o">=</span> <span class="n">datetime</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span> <span class="s">'%Y-%m-%d'</span><span class="p">)</span> <span class="k">return</span> <span class="n">DAG</span><span class="p">(</span><span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">start_date</span><span class="o">=</span><span class="n">start_date</span><span class="p">,</span> <span class="o">**</span><span class="n">options</span><span class="p">)</span> <span class="k">def</span> <span class="nf">_construct_task</span><span class="p">(</span><span 
class="n">graph</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">code</span><span class="p">):</span> <span class="k">return</span> <span class="n">BashOperator</span><span class="p">(</span><span class="n">task_id</span><span class="o">=</span><span class="n">name</span><span class="p">,</span> <span class="n">bash_command</span><span class="o">=</span><span class="n">code</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">graph</span><span class="p">)</span> <span class="c1"># Construct an empty graph </span> <span class="n">graph</span> <span class="o">=</span> <span class="n">_construct_graph</span><span class="p">(</span><span class="o">**</span><span class="n">config</span><span class="p">[</span><span class="s">'graph'</span><span class="p">])</span> <span class="c1"># Construct the specified tasks </span> <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="n">_construct_task</span><span class="p">(</span><span class="n">graph</span><span class="p">,</span> <span class="o">**</span><span class="n">task</span><span class="p">)</span> <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'tasks'</span><span class="p">]]</span> <span class="n">tasks</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">([(</span><span class="n">task</span><span class="p">.</span><span class="n">task_id</span><span class="p">,</span> <span class="n">task</span><span class="p">)</span> <span class="k">for</span> <span class="n">task</span> <span class="ow">in</span> <span class="n">tasks</span><span class="p">])</span> <span class="c1"># Enforce the specified dependencies between the tasks </span> <span class="k">for</span> <span class="n">child</span><span class="p">,</span> <span class="n">parent</span> <span 
class="ow">in</span> <span class="n">config</span><span class="p">[</span><span class="s">'dependencies'</span><span class="p">]:</span> <span class="n">tasks</span><span class="p">[</span><span class="n">parent</span><span class="p">].</span><span class="n">set_downstream</span><span class="p">(</span><span class="n">tasks</span><span class="p">[</span><span class="n">child</span><span class="p">])</span> <span class="k">return</span> <span class="n">graph</span> <span class="k">try</span><span class="p">:</span> <span class="c1"># Load an appropriate configuration file and construct a graph accordingly </span> <span class="n">graph</span> <span class="o">=</span> <span class="n">construct</span><span class="p">(</span><span class="n">configure</span><span class="p">())</span> <span class="k">except</span> <span class="nb">FileNotFoundError</span><span class="p">:</span> <span class="c1"># Exit without errors in case the current file has no configuration file </span> <span class="k">pass</span> </code></pre></div></div> <p>The script receives no arguments and instead tries to find a suitable configuration file based on its own name, which can be seen in the <code class="language-plaintext highlighter-rouge">configure</code> function. Then <code class="language-plaintext highlighter-rouge">scheduler/training.py</code> and <code class="language-plaintext highlighter-rouge">scheduler/application.py</code> can simply be symbolic links to <code class="language-plaintext highlighter-rouge">scheduler/graph.py</code>, avoiding any code repetition. 
When they are read by Airflow, each one will have its own name, and it will load its own configuration if there is one in <code class="language-plaintext highlighter-rouge">scheduler/configs/</code>.</p> <p>For instance, for training, <a href="https://github.com/chain-rule/example-prediction-service/tree/master/scheduler/configs/training.json"><code class="language-plaintext highlighter-rouge">scheduler/configs/training.json</code></a> is as follows:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w"> </span><span class="nl">"graph"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"dag_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"example-prediction-service-training"</span><span class="p">,</span><span class="w"> </span><span class="nl">"schedule_interval"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0 0 1 * *"</span><span class="p">,</span><span class="w"> </span><span class="nl">"start_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2019-07-01"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="nl">"tasks"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"start"</span><span class="p">,</span><span class="w"> </span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' 
training-start"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"wait"</span><span class="p">,</span><span class="w"> </span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' training-wait"</span><span class="w"> </span><span class="p">},</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"check"</span><span class="p">,</span><span class="w"> </span><span class="nl">"code"</span><span class="p">:</span><span class="w"> </span><span class="s2">"make -C '${ROOT}/..' training-check"</span><span class="w"> </span><span class="p">}</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="nl">"dependencies"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="p">[</span><span class="s2">"wait"</span><span class="p">,</span><span class="w"> </span><span class="s2">"start"</span><span class="p">],</span><span class="w"> </span><span class="p">[</span><span class="s2">"check"</span><span class="p">,</span><span class="w"> </span><span class="s2">"wait"</span><span class="p">]</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>Each configuration file contains three main sections: <code class="language-plaintext highlighter-rouge">graph</code>, <code class="language-plaintext highlighter-rouge">tasks</code>, and <code class="language-plaintext highlighter-rouge">dependencies</code>. The first section prescribes the desired start date, periodicity, and other parameters specific to the graph itself. 
In this example, the graph is triggered on the first day of every month at midnight (<code class="language-plaintext highlighter-rouge">0 0 1 * *</code>), which might be a reasonable frequency for retraining the model. The second section defines what commands should be executed; there is one task for each of the three actions. The <code class="language-plaintext highlighter-rouge">-C '${ROOT}/..'</code> part is needed to ensure that the right <code class="language-plaintext highlighter-rouge">Makefile</code> is used, which is taken care of in <code class="language-plaintext highlighter-rouge">scheduler/graph.py</code>. Lastly, the third section dictates the order of execution by enforcing dependencies. In this case, we are saying that <code class="language-plaintext highlighter-rouge">wait</code> depends on (should be executed after) <code class="language-plaintext highlighter-rouge">start</code>, and that <code class="language-plaintext highlighter-rouge">check</code> depends on <code class="language-plaintext highlighter-rouge">wait</code>, forming a chain of tasks.</p> <p>At this point, the graphs are complete. To make Airflow aware of them, the repository can simply be cloned into the <code class="language-plaintext highlighter-rouge">dags</code> directory of Airflow.</p> <p>Finally, Airflow itself can live on a separate instance in Compute Engine. Alternatively, <a href="https://cloud.google.com/composer/">Cloud Composer</a>, provided by Google Cloud Platform, can be used for this purpose.</p> <h1 id="conclusion">Conclusion</h1> <p>Having reached this point, our predictive model is up and running in the cloud in an autonomous fashion, delivering predictions to the data warehouse to act upon. 
The data warehouse is certainly not the end of the journey, but we stop here and save the discussion for another time.</p> <p>Although the presented workflow gets the job done, it has limitations and weaknesses one has to be aware of. The most prominent one is the communication between a Docker container running inside a virtual machine and the scheduler, Airflow. Busy waiting for a virtual machine in Compute Engine to shut down and for Stackdriver to deliver a certain message is arguably not the most reliable solution. There is also a certain overhead associated with starting a virtual machine in Compute Engine, downloading an image from Container Registry, and launching a container. Furthermore, this approach is not suitable for online prediction, as the service does not expose any API for other services to integrate with; its job is to make batch predictions periodically.</p> <p>If you have any suggestions for improving the workflow or simply would like to share your thoughts, please leave a comment below or send an e-mail. Also, feel free to <a href="https://github.com/chain-rule/example-prediction-service/issues">create an issue</a> or <a href="https://github.com/chain-rule/example-prediction-service/pulls">open a pull request</a> on GitHub. Any feedback is very much appreciated!</p> <h1 id="follow-up">Follow-up</h1> <p>Since its publication, the workflow presented in this article has been significantly simplified. More specifically, on July 16, 2019, it became possible to execute arbitrary Docker images on Google <a href="https://cloud.google.com/ai-platform/">AI Platform</a>. The platform takes care of the whole life cycle of the container, obviating the need for any wait scripts and ad-hoc communication mechanisms via Stackdriver. 
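<p>Under that simplified workflow, the same container could be submitted as a training job directly; a hedged sketch (the job name, region, and image are hypothetical, and the command is echoed rather than executed) might be:</p>

```sh
# Illustrative only; submitting a custom container as an AI Platform job.
job="training_$(date +%Y%m%d_%H%M%S)"
echo gcloud ai-platform jobs submit training "${job}" \
  --region europe-west1 \
  --master-image-uri gcr.io/example-project/example-prediction-service:latest
```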
Refer to “<a href="https://medium.com/google-cloud/how-to-run-serverless-batch-jobs-on-google-cloud-ca45a4e33cb1">How to run serverless batch jobs on Google Cloud</a>” by Lak Lakshmanan for further details.</p> <h1 id="references">References</h1> <ul> <li>Lak Lakshmanan, “<a href="https://medium.com/google-cloud/how-to-run-serverless-batch-jobs-on-google-cloud-ca45a4e33cb1">How to run serverless batch jobs on Google Cloud</a>,” 2019.</li> </ul>