It is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.

Python and R (in alphabetic order) are arguably the primary languages used by data scientists nowadays. In the context of interactive computations, IPython and later on Project Jupyter have been of paramount importance for the Python community (the latter is actually language agnostic). In the R community, this role has been played by RStudio. Therefore, having at one’s disposal JupyterLab, which is Project Jupyter’s flagship, and RStudio should make one well equipped for a wide range of data challenges. As alluded to earlier, the objective is to have an environment that has a fixed initial state defined by us and is accessible to us on any machine we might happen to work on. This problem definition is a perfect fit for containerization. Specifically, we shall build custom-tailored Docker images for JupyterLab and RStudio and create a few convenient shortcuts for launching them.

The code discussed below can be found in the following two repositories:

JupyterLab and
RStudio.

JupyterLab

In order to build a Docker image for JupyterLab, we begin with a Dockerfile:

# Start with a minimal Python image
FROM python:3.7-slim

# Install the desired Python packages
COPY requirements.txt /tmp/requirements.txt
RUN pip install --upgrade pip
RUN pip install --upgrade --requirement /tmp/requirements.txt

# Configure JupyterLab to use a specific IP address and port
RUN mkdir -p ~/.jupyter
RUN echo "c.NotebookApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.port = 8888" >> ~/.jupyter/jupyter_notebook_config.py

# Set the working directory
WORKDIR /home/jupyterlab

# Stort JupyterLab once the container is launched
ENTRYPOINT jupyter lab --allow-root --no-browser

In words, we take a minimalistic image with the desired version of Python preinstalled—in this case, it is the official Python image tagged 3.7-slim, which refers to Python 3.7 with any available bug fixes promptly applied—and add packages that we consider to be important for our work. These packages are gathered in the usual requirements.txt, which might look as follows:

jupyterlab
matplotlib
numpy
pandas
pylint
pytest
scikit-learn
scipy
seaborn
tensorflow
yapf

The first one, jupyterlab, is essential; the rest is up to the data scientist’s taste. An important aspect to note is that, in this example, the versions of the listed packages are not fixed; hence, the latest available versions will be taken each time a new image is built. Alternatively, one can pin them to specific numbers by changing requirements.txt. For instance, one might write tensorflow==1.14.0 instead of tensorflow.

Having defined an image, we need a tool for orchestration. We would like to have a convenient command for actually building the image and, more importantly, a convenient command for launching a container with that image from an arbitrary directory. The versatile make to the rescue!

# The name of the Docker image
name := jupyterlab
# The directory to be mounted to the container
root ?= ${PWD}

# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .

# Start a new container
start:
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8888:8888 \
		--volume "${root}:/home/jupyterlab" \
		${name}

In the above Makefile, we define two commands: build and start. The build command instructs Docker to build a new image according to the recipe in Dockerfile. The start command launches a new container and mounts the directory specified by the root variable to the file system inside the container using the --volume option. It also forwards port 8888 inside the container, which is the one specified in Dockerfile, to port 8888 on the host machine so that JupyterLab can be reached from the browser.

Let us now go ahead and try the two commands:

make build
make start

JupyterLab should come back with usage instructions similar to the following:

...
[I 18:40:15.078 LabApp] The Jupyter Notebook is running at:
[I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=<token>
[I 18:40:15.078 LabApp]  or http://127.0.0.1:8888/?token=<token>
[I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:40:15.082 LabApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
        http://e4edba021595:8888/?token=<token>
     or http://127.0.0.1:8888/?token=<token>
...

By clicking on the last link, we end up in a fully fledged JupyterLab. Congratulations! However, there is one step left. JupyterLab is currently running in the folder with our Dockerfile and Makefile, which is not particularly useful, as each project we might want to work on probably lives in its own folder elsewhere in the file system. Fortunately, it is easy to fix with an alias:

alias jupyterlab='make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'

This command should be placed in the start-up script of the shell being utilized. In the case of Bash, it can be done as follows:

echo "alias jupyterlab='make -C \"${PWD}\" root=\"\${PWD}\"'" >> ~/.bashrc

Now, in a new terminal, one should be able to run JupyterLab from any directory as follows:

cd /path/to/some/project
jupyterlab

Note that the content of the current working directory (that is, /path/to/some/project) is readily available inside JupyterLab. All notebooks created and modified in the GUI there will be stored directly in this folder, and they will remain here when the container is shut down.

RStudio

It is time to get to grips with an image for R notebooks. As before, we begin with a Dockerfile:

# Start with an RStudio image
FROM rocker/rstudio:latest

# Install the software that R packages require
RUN apt-get update
RUN apt-get install -y libxml2-dev texlive texlive-latex-extra zlib1g-dev

# Set the working directory
WORKDIR /home/rstudio

# Install the desired R packages
COPY requirements.txt /tmp/requirements.txt
RUN echo "install.packages(readLines('/tmp/requirements.txt'), \
                           repos = 'http://cran.us.r-project.org')" | R

Installing RStudio from scratch is not an easy task. Fortunately, we can start with the official RStudio image, which is what is specified at the top of the file. If desired, the latest tag can be changed to a specific version. The second block of Docker instructions is to provide programs and libraries that are needed by the R packages that one is planning to install. For instance, TeX Live is needed for rendering notebooks as PDF documents using LaTeX. The last block of instructions in Dockerfile is for installing the R packages themselves. As with Python, all necessary packages are gathered in a single file called requirements.txt:

devtools
glmnet
plotly
rmarkdown
rstan
testthat
tidytext
tidyverse

The rmarkdown package is required for notebooks in Markdown. The rest is intended to be changed according to one’s preferences; although, tidyverse is arguably a must in modern R.

All right, in order to build the image and launch containers, we create the following Makefile:

# The name of the Docker image
name := rstudio
# The directory to be mounted to the container
root ?= ${PWD}

# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .

# Start a new container
start:
	@echo "Address:  http://localhost:8787/"
	@echo "User:     rstudio"
	@echo "Password: rstud10"
	@echo
	@echo 'Press Control-C to terminate...'
	@docker run --interactive --tty --rm \
		--name ${name} \
		--publish 8787:8787 \
		--volume "${root}:/home/rstudio" \
		--env PASSWORD=rstud10 \
		${name} > /dev/null

It is similar to the one for JupyterLab; however, since the default prompt of RStudio is not as informative as the one of JupyterLab, we print our own usage instructions upon start.

The final piece is the shortcut for launching RStudio:

alias rstudio='make -C /path/to/the/folder/with/the/Makefile root="${PWD}"'

In the case of Bash, it can be installed as follows:

echo "alias rstudio='make -C \"${PWD}\" root=\"\${PWD}\"'" >> ~/.bashrc

Now it is time to build the image, go to an arbitrary directory, and test the alias:

make build
cd /path/to/some/project
rstudio

Unlike the JupyterLab image, this one is much slower to build due to R packages traditionally compiling a lot of C++ code upon installation.

Lastly, it might be particularly convenient to have one’s GUI preferences (such as the font size in the editor) and alike be automatically set up upon each container launch. This can be achieved by realizing that RStudio stores user preferences in a local folder called .rstudio. Then the start command can be adjusted to silently plant a preconfigured .rstudio into the current working directory, which can be seen in the repository accompanying this article.

Conclusion

Having completed the above steps, we have two Docker images: one is for Python notebooks via JupyterLab, and one is for R notebooks via RStudio. At the moment, the images are stored locally; however, they can be pushed to a public or private image repository, such as Docker Hub and Google Container Registry, and subsequently pulled on an arbitrary machine having Docker installed. Alternatively, they can be built on each machine separately. Regardless of the installation, the crucial point is that our working environment will unshakably remain in a specific pristine state defined by us.

Lastly, it is worth noting that similar images can straightforwardly be built for more specific scenarios. For instance, the following repository provides a skeleton for building and using a custom Datalab, which is Google’s wrapper for Jupyter notebooks that run in the cloud: Datalab.