Hands-on: Set up your data environment with Docker


Whenever you start a new data project or have a great idea involving data, an initial proof of concept may be necessary to kick things off. Of course, you don’t want to spend hours setting up a completely new data environment before you have even looked at the data itself, and you probably don’t have the time either. In this article you will learn how Docker can help you set up a replicable data environment without wasting your time over and over again.

What is Docker and why should you give it a try?

Docker is one of the simplest and most flexible ways to create, deploy and run applications in a specified environment, a so-called container. So what are containers?

Non-technical explanation: Just like in the image above, imagine that your local machine is an island where you already produce things. To improve production you need additional tools, and these arrive (just as in the Docker logo) in small containers. As soon as you set them up and run them, they are good to go.

Technical explanation: A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. Other important terms:

  • Images: A read-only template from which containers are started. A container is a running instance of an image.
  • Dockerfile: A plain-text file (not YAML) with the instructions used to build your image. The YAML template you will have at the end of this session is a Docker Compose file, which you can adapt to your own container specifications.
  • Docker Hub: Here you can push and pull Docker images and use them for your own needs. Basically GitHub, just for Docker.
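
To make the Dockerfile term more concrete, here is a minimal sketch that writes a small Dockerfile into a fresh folder. It starts from the Jupyter image used later in this article; the folder name dataenv-demo, the file requirements.txt and its single package are hypothetical placeholders, not part of the original setup.

```shell
# Create a demo folder holding a Dockerfile and a (hypothetical) package list.
mkdir -p dataenv-demo

cat > dataenv-demo/Dockerfile <<'EOF'
# Start from an existing image on Docker Hub
FROM jupyter/scipy-notebook:2c80cf3537ca
# Copy the list of extra Python packages into the image
COPY requirements.txt /tmp/requirements.txt
# Install them at build time so every container starts with them
RUN pip install --no-cache-dir -r /tmp/requirements.txt
EOF

# Placeholder package list; replace it with whatever your project needs
printf 'requests\n' > dataenv-demo/requirements.txt
```

Running `docker build -t my-dataenv dataenv-demo` afterwards should turn this folder into an image, where my-dataenv is again just an example tag.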

Why use Docker?

Let me outline the main reasons why I love using Docker:

  • For you as a data scientist or data analyst, Docker means that you can focus on exploring, transforming and modelling data without first thinking about the system your data environment is running on. By using one of the thousands of applications ready to run in Docker containers, you don’t have to worry about installing and connecting them all individually. Docker allows you to deploy your chosen working environment in seconds, whenever you need it.
  • Let’s assume you are not the only one working on the project, and your team members need to get their hands on the code as well. One option is that each teammate runs the code in their own environment, with different architectures, different libraries and different versions of applications. The Docker option is that each member has access to the same container image, starts it with Docker and is ready to go. Docker provides repeatable data environments to everybody on your team so you can start collaborating right away.

There are several other benefits that come with Docker, especially if you are working with the Enterprise version. It is worth exploring and benefits more than just data scientists.

Installing & Running Docker

You can install Docker Desktop, which is all you need to get started in no time: visit Docker Hub, select the Docker version for your Mac or Windows machine and install it. As soon as Docker starts on your local machine, you can see this lovely little whale in your top navigation bar. Well done.

By clicking on the Docker logo you can see whether Docker is running. Alternatively, open the command line and enter “docker info” to see what is running. Here are some basic Docker commands:

docker login                     # log in to a Docker registry
docker run                       # create a new container and start it
docker start                     # start an existing container
docker stop                      # stop a running container
docker ps                        # show running containers (add -a to include stopped ones)
docker rm                        # remove a container by name or ID
docker rmi $(docker images -q)   # remove all images
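
Before trying any of these commands, it can help to confirm that the Docker daemon is actually reachable. The following is a small sketch built on the exit status of docker info; the function name docker_status is made up for this example.

```shell
# Report whether the Docker daemon is reachable.
# `docker info` exits non-zero when the CLI is missing or the daemon is down.
docker_status() {
  if docker info >/dev/null 2>&1; then
    echo "Docker is running"
  else
    echo "Docker is not running"
  fi
}

docker_status
```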

You can start with a simple example and give Docker a try with Jupyter Notebook. All you have to do is look for an image on Docker Hub, open your terminal and run it with Docker. In the example below you can then find Jupyter running on localhost:8888. Easy!

docker run -p 8888:8888 jupyter/scipy-notebook:2c80cf3537ca
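
The basic commands listed earlier can be combined into one life cycle for this Jupyter container. The sketch below is one possible sequence, not the only one; the container name notebook is a hypothetical label, and the function wrapper is only there so that nothing runs until you call it.

```shell
# Walk one container through its life cycle.
# Assumption: "notebook" is an arbitrary container name chosen for this sketch.
container_lifecycle() {
  name="${1:-notebook}"
  # Create and start the container in the background (-d) under a fixed name
  docker run -d --name "$name" -p 8888:8888 jupyter/scipy-notebook:2c80cf3537ca
  # Print the startup log, which contains the login URL once Jupyter is up
  # (it may take a few seconds to appear)
  docker logs "$name"
  docker stop "$name"    # stop the container; its filesystem is kept
  docker ps -a           # stopped containers are only listed with -a
  docker start "$name"   # start the same container again
  docker stop "$name"
  docker rm "$name"      # remove the stopped container for good
}
```

Calling `container_lifecycle` uses the default name, while `container_lifecycle my_notebook` picks another one.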

While we can now play around with our application in our container, it is not exactly the complete data environment an advanced data scientist is looking for. You probably want to use more advanced tools like NiFi for data ingestion and processing, Kafka for data streaming and a SQL or NoSQL database to store some tables in between. Can we still use Docker for all of that? The answer: yes, of course. Docker Compose is here to manage all that for you.

Docker Compose: Pulling it all together

To set up your desired data environment, you probably want several containers running on your local machine. That’s why we use Docker Compose. Compose is a tool for defining and running multi-container Docker applications. While connecting each container individually can be time-consuming, Docker Compose allows a collection of containers to interact via their own network in a very straightforward way. With Compose, you first use a YAML file to configure your application’s services, then with a single command (docker compose up) you create and start all the services from your previously defined configuration.*

In the following you can find the main steps to get started:

  1. Define your app’s environment with a Dockerfile for easy reproduction
  2. Specify all your services that make up your data environment in docker-compose.yml
  3. Open your terminal in the folder where you saved the yaml file and run docker-compose up
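
The steps above can be sketched as a pair of small helper functions, to be called from the folder that holds docker-compose.yml. The function names are made up for this example. The sketch uses the newer `docker compose` spelling; with the older standalone binary, replace it with `docker-compose`.

```shell
# Validate the Compose file, then start everything it defines.
compose_up() {
  # -q only validates: it prints nothing when the file is valid
  docker compose config -q || return 1
  # Create and start every service in the background
  docker compose up -d
  # List the services and their current state
  docker compose ps
}

# Stop and remove the containers and the network that Compose created.
compose_down() {
  docker compose down
}
```

Validating first means a typo in the YAML file is reported before any container is started.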

A docker-compose.yml may look similar to the one below. Although you can use the following as a template, you should configure it once for yourself:

# Service names (zookeeper, nifi, jupyter, ...) are placeholders; rename them as you like.
version: '3'
services:
  zookeeper:
    hostname: zookeeper
    container_name: zookeeper_dataenv
    image: 'bitnami/zookeeper:latest'
  nifi:
    image: mkobit/nifi
    container_name: nifi_dataenv
    ports:
      - 8080:8080
      - 8081:8081
    environment:
      - NIFI_ZK_CONNECT_STRING=zookeeper:2181
  jupyter:
    image: jupyter/minimal-notebook:latest
    ports:
      - 8888:8888
  mongodb:
    image: mongo:latest
    container_name: mongodb_dataenv
    environment:
      - MONGO_DATA_DIR=/data/db
      - MONGO_LOG_DIR=/dev/null
    ports:
      - 27017:27017
  grafana:
    image: bitnami/grafana:latest
    container_name: grafana_dataenv
    ports:
      - 3000:3000
  postgres:
    image: 'postgres:9.6.3-alpine'
    container_name: psql_dataenv
    ports:
      - 5432:5432
    environment:
      POSTGRES_DB: psql_data_environment
      POSTGRES_USER: psql_user
      PGDATA: /opt/psql_data
    restart: "no"
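
Once the services are up, you may want to look inside individual containers. The sketch below relies on the container names, database name and user from the file above; the two function names are hypothetical helpers for this example.

```shell
# Open an interactive psql prompt inside the Postgres container.
psql_shell() {
  docker exec -it psql_dataenv psql -U psql_user -d psql_data_environment
}

# Follow the log output of one container while it starts up
# (defaults to the NiFi container).
follow_logs() {
  docker logs -f "${1:-nifi_dataenv}"
}
```

For example, `follow_logs mongodb_dataenv` streams the MongoDB log until you press Ctrl+C.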

That’s it! You just learned the basics of how to deploy your very own data environment anywhere and in seconds, which means wasting less time on setting things up and having more time to be productive.



Please note that there are plenty of other container software options. I simply liked working with Docker and wanted to share my experience with you.

Resource for Docker Compose:


Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a Compose file to…

