Interactive notebooks in tightly sealed disposable containers
It is truly amazing how interactive notebooks—where a narrative in a spoken language is entwined with executable chunks of code in a programming language—have revolutionized the way we work with data and document our thought processes and findings for others and, equally importantly, for our future selves. They are ubiquitous and taken for granted. It is hard to imagine where data enthusiasts would be without them. Most likely, we would be spending too much time staring at a terminal window, anxiously re-running scripts from start to finish, printing variables, and saving lots of files with tables and graphs on disk for further inspection. Interactive notebooks are an essential tool in the data scientist’s toolbox, and in this article, we are going to make them readily available for our use with our favorite packages installed and preferences set up, no matter where we find ourselves working and regardless of the mess we might have left behind during the previous session.
Python and R (in alphabetic order) are arguably the primary languages used by data scientists nowadays. In the context of interactive computations, IPython and later on Project Jupyter have been of paramount importance for the Python community (the latter is actually language agnostic). In the R community, this role has been played by RStudio. Therefore, having at one’s disposal JupyterLab, which is Project Jupyter’s flagship, and RStudio should make one well equipped for a wide range of data challenges. As alluded to earlier, the objective is to have an environment that has a fixed initial state defined by us and is accessible to us on any machine we might happen to work on. This problem definition is a perfect fit for containerization. Specifically, we shall build custom-tailored Docker images for JupyterLab and RStudio and create a few convenient shortcuts for launching them.
The code discussed below can be found in the following two repositories:
- JupyterLab and
- RStudio.
JupyterLab
In order to build a Docker image for JupyterLab, we begin with a Dockerfile:
# Start with a minimal Python image
FROM python:3.7-slim
# Install the desired Python packages
COPY requirements.txt /tmp/requirements.txt
RUN pip install --upgrade pip
RUN pip install --upgrade --requirement /tmp/requirements.txt
# Configure JupyterLab to use a specific IP address and port
RUN mkdir -p ~/.jupyter
RUN echo "c.NotebookApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_notebook_config.py
RUN echo "c.NotebookApp.port = 8888" >> ~/.jupyter/jupyter_notebook_config.py
# Set the working directory
WORKDIR /home/jupyterlab
# Start JupyterLab once the container is launched
ENTRYPOINT jupyter lab --allow-root --no-browser
In words, we take a minimalistic image with the desired version of Python
preinstalled (in this case, the official Python image tagged 3.7-slim, which
tracks the latest patch release of Python 3.7) and add packages that we consider
important for our work. These packages are gathered in the usual
requirements.txt, which might look as follows:
jupyterlab
matplotlib
numpy
pandas
pylint
pytest
scikit-learn
scipy
seaborn
tensorflow
yapf
The first one, jupyterlab, is essential; the rest is up to the data scientist’s
taste. An important aspect to note is that, in this example, the versions of the
listed packages are not fixed; hence, the latest available versions will be
taken each time a new image is built. Alternatively, one can pin them to
specific numbers by changing requirements.txt. For instance, one might write
tensorflow==1.14.0 instead of tensorflow.
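One way to find out which numbers to pin is to ask an image that has already been built. The following is only a sketch: image_versions is a hypothetical helper, not part of the accompanying repository, and it assumes that an image called jupyterlab (the name used for the image in this article) exists locally. It overrides the entry point so that pip runs instead of JupyterLab and prints every installed package with its exact version.

```shell
# Hypothetical helper: list the exact package versions inside a built image.
# --entrypoint replaces the image's JupyterLab entry point with pip, and
# --rm removes the temporary container as soon as pip exits.
image_versions() {
  docker run --rm --entrypoint pip "$1" freeze
}
```

Running image_versions jupyterlab prints lines of the form package==version, from which the relevant ones can be copied into requirements.txt.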
Having defined an image, we need a tool for orchestration. We would like to have
a convenient command for actually building the image and, more importantly, a
convenient command for launching a container with that image from an arbitrary
directory. The versatile make to the rescue!
# The name of the Docker image
name := jupyterlab
# The directory to be mounted to the container
root ?= ${PWD}
# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .
# Start a new container
start:
	@docker run --interactive --tty --rm \
	  --name ${name} \
	  --publish 8888:8888 \
	  --volume "${root}:/home/jupyterlab" \
	  ${name}
In the above Makefile, we define two commands: build and start. The build
command instructs Docker to build a new image according to the recipe in
Dockerfile. The start command launches a new container and mounts the directory
specified by the root variable to the file system inside the container using the
--volume option. It also forwards port 8888 inside the container, which is the
one specified in Dockerfile, to port 8888 on the host machine so that JupyterLab
can be reached from the browser.
Let us now go ahead and try the two commands:
make build
make start
JupyterLab should come back with usage instructions similar to the following:
...
[I 18:40:15.078 LabApp] The Jupyter Notebook is running at:
[I 18:40:15.078 LabApp] http://e4edba021595:8888/?token=<token>
[I 18:40:15.078 LabApp] or http://127.0.0.1:8888/?token=<token>
[I 18:40:15.078 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 18:40:15.082 LabApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
Or copy and paste one of these URLs:
http://e4edba021595:8888/?token=<token>
or http://127.0.0.1:8888/?token=<token>
...
By clicking on the last link, we end up in a fully fledged JupyterLab.
Congratulations! However, there is one step left. JupyterLab is currently
running in the folder with our Dockerfile and Makefile, which is not
particularly useful, as each project we might want to work on probably lives in
its own folder elsewhere in the file system. Fortunately, it is easy to fix with
an alias:
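alias jupyterlab='make -C /path/to/the/folder/with/the/Makefile start root="${PWD}"'
The start target is named explicitly because make would otherwise run the first target in the Makefile, which is build. This command should be placed in the start-up script of one’s shell. In the case of Bash, it can be done by running the following from the folder containing the Makefile:
echo "alias jupyterlab='make -C \"${PWD}\" start root=\"\${PWD}\"'" >> ~/.bashrc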
Now, in a new terminal, one should be able to run JupyterLab from any directory as follows:
cd /path/to/some/project
jupyterlab
Note that the content of the current working directory (that is,
/path/to/some/project) is readily available inside JupyterLab. All notebooks
created and modified in the GUI will be stored directly in this folder, and they
will remain there when the container is shut down.
RStudio
It is time to get to grips with an image for R notebooks. As before, we begin
with a Dockerfile:
# Start with an RStudio image
FROM rocker/rstudio:latest
# Install the software that R packages require
RUN apt-get update && \
  apt-get install -y libxml2-dev texlive texlive-latex-extra zlib1g-dev
# Set the working directory
WORKDIR /home/rstudio
# Install the desired R packages
COPY requirements.txt /tmp/requirements.txt
# R refuses to read commands from a pipe unless told how to treat the workspace
RUN echo "install.packages(readLines('/tmp/requirements.txt'), \
  repos = 'http://cran.us.r-project.org')" | R --no-save
Installing RStudio from scratch is not an easy task. Fortunately, we can start
with a ready-made RStudio image maintained by the Rocker project, which is what
is specified at the top of the file. If desired, the latest tag can be changed
to a specific version. The second block of Docker instructions is to provide
programs and libraries that are needed by the R packages that one is planning to
install. For instance, TeX Live is needed for rendering notebooks as PDF
documents using LaTeX. The last block of instructions in Dockerfile is for
installing the R packages themselves. As with Python, all necessary packages are
gathered in a single file called requirements.txt:
devtools
glmnet
plotly
rmarkdown
rstan
testthat
tidytext
tidyverse
The rmarkdown package is required for notebooks in Markdown. The rest is
intended to be changed according to one’s preferences, although tidyverse is
arguably a must in modern R.
All right, in order to build the image and launch containers, we create the
following Makefile:
# The name of the Docker image
name := rstudio
# The directory to be mounted to the container
root ?= ${PWD}
# Build a new image
build:
	docker rmi ${name} || true
	docker build --tag ${name} .
# Start a new container
start:
	@echo "Address: http://localhost:8787/"
	@echo "User: rstudio"
	@echo "Password: rstud10"
	@echo
	@echo 'Press Control-C to terminate...'
	@docker run --interactive --tty --rm \
	  --name ${name} \
	  --publish 8787:8787 \
	  --volume "${root}:/home/rstudio" \
	  --env PASSWORD=rstud10 \
	  ${name} > /dev/null
It is similar to the one for JupyterLab; however, since the default output of
RStudio is not as informative as that of JupyterLab, we silence it and print our
own usage instructions upon start.
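The final piece is the shortcut for launching RStudio. It names the start target explicitly, since make would otherwise run the first target in the Makefile, which is build:
alias rstudio='make -C /path/to/the/folder/with/the/Makefile start root="${PWD}"'
In the case of Bash, it can be installed by running the following from the folder containing the Makefile:
echo "alias rstudio='make -C \"${PWD}\" start root=\"\${PWD}\"'" >> ~/.bashrc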
Now it is time to build the image, go to an arbitrary directory, and test the alias:
make build
cd /path/to/some/project
rstudio
Unlike the JupyterLab image, this one is much slower to build due to R packages traditionally compiling a lot of C++ code upon installation.
Lastly, it might be particularly convenient to have one’s GUI preferences (such
as the font size in the editor) and the like set up automatically upon each
container launch. This is possible because RStudio stores user preferences in a
local folder called .rstudio. The start command can therefore be adjusted to
silently plant a preconfigured .rstudio into the current working directory, as
can be seen in the repository accompanying this article.
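The idea can be sketched as a small shell function. This is an illustration under stated assumptions and not necessarily how the repository does it: plant_preferences and its two arguments are hypothetical, the RStudio version in the image is assumed to read its preferences from .rstudio in the mounted directory, and skipping the copy when the folder already exists is a design choice made here so that settings changed during a previous session in the same project are not overwritten.

```shell
# Hypothetical helper: copy a preconfigured .rstudio folder into a project
# directory before the container is started.
plant_preferences() {
  template="$1"  # folder that holds the preconfigured .rstudio
  project="$2"   # directory that will be mounted into the container
  # Assumption: an existing .rstudio is left untouched.
  [ -e "${project}/.rstudio" ] || cp -R "${template}/.rstudio" "${project}/"
}
```

It could be called as plant_preferences /path/to/the/folder/with/the/Makefile "${PWD}" right before the container is launched.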
Conclusion
Having completed the above steps, we have two Docker images: one is for Python notebooks via JupyterLab, and one is for R notebooks via RStudio. At the moment, the images are stored locally; however, they can be pushed to a public or private image repository, such as Docker Hub or Google Container Registry, and subsequently pulled on any machine that has Docker installed. Alternatively, they can be built on each machine separately. Regardless of the installation, the crucial point is that our working environment will unshakably remain in a specific pristine state defined by us.
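Pushing an image to a registry could look roughly as follows. This is a sketch only: publish_image is a hypothetical helper, example-user/jupyterlab is a placeholder repository name, and the machine is assumed to be logged in to the registry already (for instance, via docker login).

```shell
# Hypothetical helper: give a local image a remote name and upload it.
publish_image() {
  image="$1"       # local image name, such as jupyterlab
  repository="$2"  # placeholder remote name, such as example-user/jupyterlab
  docker tag "${image}" "${repository}"
  docker push "${repository}"
}
```

After publish_image jupyterlab example-user/jupyterlab, another machine would fetch the image with docker pull example-user/jupyterlab. The name variable in the Makefile would then have to match the pulled name.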
Lastly, it is worth noting that similar images can straightforwardly be built for more specific scenarios. For instance, the following repository provides a skeleton for building and using a custom Datalab, which is Google’s wrapper for Jupyter notebooks that run in the cloud: Datalab.