“Good news, everyone!” is a collection of articles about solving problems that a software engineer might encounter in practice—or invent to sharpen their skills in leisure time.

## Articles

### Out of memory, or gradient accumulation for larger models

When the model grows large and does not fit on a single device, and there are no more devices to spare, the common mitigation strategy is to reduce the batch size, thereby allowing more space for the model at the expense of the data. However, smaller batches lead to noisier weight updates, which is undesirable. One solution is gradient accumulation where the weights are updated after evaluating the gradients for several batches at a time. In this article, we show how it can be implemented in practice.

### Relative positional embedding for any attention mechanism

In Shaw et al. (2018), the authors introduce relative positional embedding for self-attention in transformer models, and in Huang et al. (2018), the authors present a memory efficient approach to calculating this embedding in decoder blocks, in which the self-attention is causal. In this article, the approach is generalized to any attention mechanism, should it be self or cross or full or causal.

### Breaking sticks, or estimation of probability distributions using the Dirichlet process

Recall the last time you wanted to understand the distribution of given data. One alternative was to plot a histogram. However, it resulted in frustration due to the choice of the number of bins to use, which led to drastically different outcomes. Another alternative was kernel density estimation. Despite having a similar choice to make, it has the advantage of producing smooth estimates, which are more realistic for continuous quantities with regularities. However, kernel density estimation was unsatisfactory too: it did not aid in understanding the underlying structure of the data and, moreover, provided no means of quantifying the uncertainty associated with the results. In this article, we discuss a Bayesian approach to the estimation of data-generating distributions that addresses the aforementioned concerns.

### Heteroscedastic Gaussian process regression

Gaussian process regression is a nonparametric Bayesian technique for modeling relationships between variables of interest. The vast flexibility and rigor mathematical foundation of this approach make it the default choice in many problems involving small- to medium-sized data sets. In this article, we illustrate how Gaussian process regression can be utilized in practice. To make the case more compelling, we consider a setting where linear regression would be inadequate. The focus will be

*not*on getting the job done as fast as possible but on learning the technique and understanding the choices being made.### What is the easiest way to compare two data sets?

One has probably come across this problem numerous times. There are two versions of a tabular data set with a lot of columns of different types, and one wants to quickly identify any differences between the two. For example, the pipeline providing data to a predictive model might have been updated, and the goal is to understand if there have been any side effects of this update for the training data.

### Bayesian inference of the net promoter score via multilevel regression with poststratification

Customer surveys are naturally prone to biases. One prominent example is participation bias, which arises when individuals decide not to respond to the survey, and this pattern is not random. For instance, new customers might reply less eagerly than those who are senior. This renders the obtained responses unrepresentative of the target population. In this article, we tackle participation bias for the case of the net promoter survey by means of multilevel regression and poststratification.

### Ingestion of sequential data from BigQuery into TensorFlow

How hard can it be to ingest sequential data into a TensorFlow model? As always, the answer is, “It depends.” Where are the sequences in question stored? Can they fit in main memory? Are they of the same length? In what follows, we shall build a flexible and scalable workflow for feeding sequential observations into a TensorFlow graph starting from BigQuery as the data warehouse.

### Sample size determination using historical data and simulation

In order to test a hypothesis, one has to design and execute an adequate experiment. Typically, it is neither feasible nor desirable to involve the whole population. Instead, a relatively small subset of the population is studied, and given the outcome for this small sample, relevant conclusions are drawn with respect to the population. An important question to answer is then, What is the minimal sample size needed for the experiment to succeed? In what follows, we answer this question using solely historical data and computer simulation, without invoking any classical statistical procedures.

### A Bayesian approach to the inference of the net promoter score

The net promoter score is a widely adopted metric for gauging customers’ satisfaction with a product. The popularity of the score is arguably attributed to the simplicity of measurement and the intuitiveness of interpretation. Moreover, it is claimed to be correlated with revenue growth, which, ignoring causality, makes it even more appealing. In this article, we leverage Bayesian statistics in order to infer the net promoter score for an arbitrary segmentation of a customer base. The outcome of the inference is a distribution over all possible values of the score weighted by probabilities, which provides exhaustive information for the subsequent decision-making.

### Interactive notebooks in tightly sealed disposable containers

It is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.

### On the expected utility in conversion rate optimization

It can be not only extremely useful but also deeply satisfying to occasionally dust off one’s math skills. In this article, we approach the classical problem of conversion rate optimization—which is frequently faced by companies operating online—and derive the expected utility of switching from variant A to variant B under some modeling assumptions. This information can subsequently be utilized in order to support the corresponding decision-making process.

### A poor man’s orchestration of predictive models, or do it yourself

As a data scientist focusing on developing data products, you naturally want your work to reach its target audience. Suppose, however, that your company does not have a dedicated engineering team for productizing data-science code. One solution is to seek help in other teams, which are surely busy with their own endeavors, and spend months waiting. Alternatively, you could take the initiative and do it yourself. In this article, we take the initiative and schedule the training and application phases of a predictive model using Apache Airflow, Google Compute Engine, and Docker.