Demystify Kubernetes with a Hands-on Exercise!
As a practicing data scientist, I see a trend of moving the traditional routine of data processing, model training, and inference into an integrated pipeline with Continuous Integration/Continuous Deployment (CI/CD), a concept borrowed from DevOps. There are two primary reasons, at least from my perspective. One is that modeling is stepping out of the prototyping phase and into large-scale adoption in production, either as an add-on or as an application. The other is the rising need to bring the whole modeling experience to the cloud, with better orchestration of model development.
Driven by this need, I started my upskilling journey with tools that fit this transformation of modeling, like Kubernetes. It is a hit right now and has earned a reputation for easy scaling, portability, and extensibility. This blog here specifically lays out the components that it offers. Yet when I started my research and got my hands on it, I felt overwhelmed by all the concepts and “official guides” because I am not a trained software developer. I chewed through the materials bit by bit and started to get a fuller picture, even without a background in DevOps. My notes and understanding are all in this tutorial, and I hope it can be an easy yet rewarding start for those who also need to upskill in Kubernetes.
Kubernetes in a nutshell
Kubernetes defines itself as a production-grade, open-source platform that orchestrates the execution of application containers within and across computer clusters. Simply put, Kubernetes is a manager for a group of computers assembled for you to run applications on. It is the command center that executes the tasks you assign to it: scheduling applications, regular maintenance, scaling up capacity, and rolling out updates.
Cluster Diagram from Kubernetes
The most basic components of a Kubernetes cluster are the master and the nodes. The master is the manager, the center of the cluster. A node is a virtual machine or a physical computer. Each node runs a kubelet, an agent that manages the node and communicates with the master. Each node also needs a container runtime such as Docker to run our applications.
Now that we know the basics of Kubernetes, we may wonder what it can do from a data scientist’s perspective. In my experience, the concepts of Kubernetes are very appealing, especially when you consider that a model should eventually be productized as an application or service. Issues that trouble many product designers, such as scalability and extensibility, also become part of the picture when we want to make models accessible to end users.
Google offers a list of applications that they have for on-premise solutions. But what we want to achieve here is to use and customize Kubernetes for our own needs and purposes. So this tutorial works from the bottom up: we start with the basics of setting up GCP and go all the way to deploying an application onto the Kubernetes cluster.
In this tutorial, we will use Google Kubernetes Engine to set up a Kubernetes cluster. But before you start, make sure the following prerequisites are met:
1. Create an account at GCP and log in to the console.
2. In the console, enable the Kubernetes Engine API. You can find APIs in the API Library; just search for the name.
API library (Image by Author)
3. Install gcloud. You can use the web-based terminal or the terminal on your own computer. To find the web-based terminal in the GCP console:
Click the icon in the red box (Image by Author)
It is recommended to follow the official guide to install gcloud, but you can also use the following command for your convenience:
4. Install kubectl. This step is simple: after setting up gcloud, enter the command:
gcloud components install kubectl
Once all the preceding steps are complete, we can use the Google Cloud SDK to create a managed Kubernetes cluster.
1. To create a container cluster with gcloud:
gcloud container clusters create
- machine-type: The type of machine to use for nodes.
- num-nodes: The number of nodes to be created in each of the cluster’s zones. You can set this to 1 to create a single-node cluster, though several nodes (gcloud defaults to 3) are more typical so the cluster can tolerate losing a node. The cluster can be resized later with a gcloud command.
- zone: Compute zone (e.g. us-central1-a) for the cluster.
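Putting the three flags together, a full invocation might look like the sketch below. The cluster name, machine type, node count, and zone are placeholder values chosen for illustration, not values from the original setup. The `DRY_RUN` prefix makes the script only print the command by default; run it with `DRY_RUN=` (empty) to actually create the cluster.

```shell
# Placeholder values (assumptions): change them to fit your project.
CLUSTER_NAME="demo-cluster"
MACHINE_TYPE="n1-standard-2"
NUM_NODES=2
ZONE="us-central1-a"

# Default to printing the command; set DRY_RUN to empty to execute it.
DRY_RUN="${DRY_RUN-1}"

${DRY_RUN:+echo} gcloud container clusters create "$CLUSTER_NAME" \
  --machine-type "$MACHINE_TYPE" \
  --num-nodes "$NUM_NODES" \
  --zone "$ZONE"
```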
If the cluster is created properly, it will display information like the following:
2. To test the newly set up cluster, we can use kubectl:
kubectl get node
3. Next we will grant permission to the user to perform admin actions.
kubectl create clusterrolebinding cluster-admin-binding
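The binding needs a role and a user to be complete. A sketch of the full form, assuming you want to grant the built-in `cluster-admin` role to the Google account you are logged in with; `EMAIL` is a placeholder to replace with your own address. As a safety net, the command is only printed unless you run the script with `DRY_RUN=` (empty).

```shell
# Placeholder (assumption): the Google account you use with GCP.
EMAIL="you@example.com"

# Default to printing the command; set DRY_RUN to empty to execute it.
DRY_RUN="${DRY_RUN-1}"

${DRY_RUN:+echo} kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole=cluster-admin \
  --user="$EMAIL"
```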
4. Install Helm. Helm is a package manager for Kubernetes applications. You can search Helm charts for the applications/packages to deploy to Kubernetes. Helm charts are preset abstractions that describe how packages will be installed on Kubernetes.
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
With Helm installed, we can check what is installed in the cluster by calling:
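Assuming Helm 3, a minimal check could look like this: `helm version` confirms the client works, and `helm list` shows the releases installed in the cluster (on a fresh cluster the list is expected to be empty). The commands are only printed unless you run the script with `DRY_RUN=` (empty).

```shell
# Default to printing the commands; set DRY_RUN to empty to execute them.
DRY_RUN="${DRY_RUN-1}"

# Confirm the Helm client is installed and reachable.
${DRY_RUN:+echo} helm version

# List releases across every namespace in the cluster.
${DRY_RUN:+echo} helm list --all-namespaces
```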
Now your Kubernetes is ready to go! Congrats 🤓
With a JupyterHub Example
Before starting the installation process, it is always helpful to look at it from a higher level and understand what it involves. In the previous steps, we finished setting up the Kubernetes cluster with Helm installed. Now we want to deploy an application to it and start using it. For the example here I will use JupyterHub, which can serve multiple users. The following is the structure of this exercise for easier reference.
Installation Structure (Image by Author Using LucidChart)
1. Security Setup
- Generate a proxy security token. For security purposes, it is strongly recommended not to run JupyterHub without SSL encryption. Simply generate one with openssl rand -hex 32. Save the generated token somewhere safe for later use.
- Add the token to your proxy. It can be set either in the configuration file or stored as an environment variable. However, since we have Helm installed, we can use helm to template the yaml file and add the token. The following is the format you provide to the yaml file with the token generated in the previous step. We will add that later to the config file.
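As a sketch of these two steps, the snippet below generates the token and writes it into `config.yaml`. It assumes the `proxy.secretToken` key used by the Zero to JupyterHub Helm chart; check that key against the chart version you install, since newer chart versions may generate the token for you.

```shell
# Generate a random 32-byte token, hex-encoded (64 characters).
TOKEN="$(openssl rand -hex 32)"

# Write the token into config.yaml (assumed key: proxy.secretToken).
cat > config.yaml <<EOF
proxy:
  secretToken: "$TOKEN"
EOF

# Keep the file private, since it now contains a secret.
chmod 600 config.yaml
```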
2. Adding Configuration
JupyterHub provides a very comprehensive guide on how to set up the configuration. I am adding some useful customizations for data science purposes. You can also refer to the guide to set up your own customizations or use the reference here to navigate your needs.
- image: where we can add a data-science notebook image from a comprehensive list provided here. Also, remember to use a specific tag instead of latest. Image tags are listed on DockerHub, and the tag info for the data-science notebook can be found here.
- memory: where we specify the memory limit for each user. Helm passes this to the Kubernetes API as a resource request.
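A sketch of how these two settings might look in `config.yaml`, assuming the `singleuser.image` and `singleuser.memory` keys of the Zero to JupyterHub Helm chart. The image tag and the memory sizes are placeholders, not values from the original setup: pick a real tag from DockerHub before installing.

```shell
# Placeholders (assumptions): set IMAGE_TAG to a real tag from DockerHub.
IMAGE_NAME="jupyter/datascience-notebook"
IMAGE_TAG="REPLACE_WITH_PINNED_TAG"
MEMORY="1G"

# Append the single-user settings to config.yaml (created if missing).
# Run this once; appending twice would duplicate the section.
cat >> config.yaml <<EOF
singleuser:
  image:
    name: $IMAGE_NAME
    tag: $IMAGE_TAG
  memory:
    limit: $MEMORY
    guarantee: $MEMORY
EOF
```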
If you only want to follow the example, just copy and paste the above and replace the necessary credentials. Use nano config.yaml or vim config.yaml to create the yaml file and replace the content.
3. Install with Helm. The following script will install JupyterHub with helm upgrade. For details on the command, check here for reference.
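A sketch of the install step, modeled on the Zero to JupyterHub guide. `RELEASE`, `NAMESPACE`, and `CHART_VERSION` are placeholders: `jhub` matches the namespace used in the kubectl commands later on, and the chart version must be replaced with a real one. The chart repository URL is assumed to be the one published in that guide, so verify it there. The commands are only printed unless you run the script with `DRY_RUN=` (empty).

```shell
# Placeholders (assumptions): adjust to your setup.
RELEASE="jhub"
NAMESPACE="jhub"
CHART_VERSION="REPLACE_WITH_CHART_VERSION"

# Default to printing the commands; set DRY_RUN to empty to execute them.
DRY_RUN="${DRY_RUN-1}"

# Register the JupyterHub chart repository and refresh the local index.
${DRY_RUN:+echo} helm repo add jupyterhub https://hub.jupyter.org/helm-chart/
${DRY_RUN:+echo} helm repo update

# Install (or upgrade) the release using the values in config.yaml.
${DRY_RUN:+echo} helm upgrade --cleanup-on-fail --install "$RELEASE" jupyterhub/jupyterhub \
  --namespace "$NAMESPACE" \
  --create-namespace \
  --version "$CHART_VERSION" \
  --values config.yaml
```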
For future upgrades with helm, we don’t need to create the namespace.
This part is very thoroughly explained by the guide from JupyterHub, and personally I find it super helpful to read. RELEASE specifies the name you will use to refer to this installation in helm commands, and NAMESPACE is the Kubernetes namespace you will use with kubectl.
Take a break while helm kicks off the installation. Once it is properly set up, you will see the finishing messages from JupyterHub. If you need to troubleshoot any issues with this step, refer to the guide or use kubectl get events to retrieve the event log and find out what is going on.
4. Check status and test your JupyterHub. Once everything is set up correctly, we can confirm the hub is ready by calling:
kubectl --namespace=jhub get pod
To access the Hub with your cluster, call the following to acquire the IP address:
kubectl --namespace=jhub get svc proxy-public
Copy and paste the external IP into a new browser tab, and it should take you to the JupyterHub sign-in page. You can now start using your own JupyterHub on a Kubernetes cluster.
JupyterHub Interface (Image by Author)
Notice that we haven’t set up any authentication for users, so we can enter any username and password to get started. However, if you are looking for more security to protect your cluster, check this link to set up authentication with JupyterHub.
Clean-up! & Final Thoughts
The cloud is not cheap if you leave resources idle. So if you no longer need your cluster, it is recommended to delete it.
# Retrieve the cluster
gcloud container clusters list
# Delete the cluster
gcloud container clusters delete
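The delete command needs the cluster's name and, for a zonal cluster, its zone. A sketch with placeholder values; substitute the name and location shown by the list command. Deletion is irreversible, so the commands are only printed unless you run the script with `DRY_RUN=` (empty).

```shell
# Placeholder values (assumptions): use the name and zone of your own cluster.
CLUSTER_NAME="demo-cluster"
ZONE="us-central1-a"

# Default to printing the commands; set DRY_RUN to empty to execute them.
DRY_RUN="${DRY_RUN-1}"

# Look up the cluster, then delete it by name and zone.
${DRY_RUN:+echo} gcloud container clusters list
${DRY_RUN:+echo} gcloud container clusters delete "$CLUSTER_NAME" --zone "$ZONE"
```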
I learnt a ton from implementing this example, and I hope you feel the same. Kubernetes is not all mystery: a data scientist, even without prior training, can understand the fundamentals and get hands-on with practical applications. I will continue writing about my research, thoughts, and applications with Kubernetes, and keep demystifying it in plain and simple language. Peace! 🤗