Mastering Docker requires knowing not only its commands but also its low-level architecture and internal details. This section gives a very short overview of Docker containers from a Linux system perspective.
Docker doesn’t really have many internals of its own. It’s essentially a Go binary wrapped around a bunch of tooling that already exists in the kernel, such as:
- cgroups to limit an application’s available resources.
- namespaces to provide isolation from other containers.
- Union Filesystems to provide fast, light access to storage.
It is worth reading about how these three technologies work before trying to understand what else Docker does, as Docker mainly provides a more accessible API and command-line tooling for them.
The whole architecture can be divided into two parts:
- Kernel Stuff
- Docker Engine
Yes, that’s it! These two make up the whole architecture of Docker. But it’s not that simple, so let’s dive into both.
A kernel is the part of the operating system that mediates access to system resources. It’s responsible for enabling multiple applications to effectively share the hardware by controlling access to CPU, memory, disk I/O, and networking.
An operating system, by contrast, is the kernel plus the applications that enable users to get something done (e.g. a compiler, text editor, window manager, etc.).
The Linux kernel provides the cgroups functionality, which allows limitation and prioritization of resources (CPU, memory, block I/O, network, etc.) without the need to start any virtual machines, and the namespace isolation functionality, which allows complete isolation of an application’s view of the operating environment, including process trees, networking, user IDs and mounted file systems.
That being said, LXC (Linux Containers) is an operating-system-level virtualization method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel.
LXC combines the kernel’s cgroups and support for isolated namespaces to provide an isolated environment for applications. Docker originally used Linux Containers (LXC), but later switched to its own libcontainer library (now part of runC), which likewise runs containers on the same kernel as the host.
Oops, pardon! Now the question arises:
What even is a container?
At their most basic level, Linux containers are aptly named for the metal shipping containers to which they’re so often equated. Whether it’s on a freight ship, a cargo train, or on the back of a big rig truck, the container itself is the same uniform vessel for transporting goods.
The word “container” doesn’t mean anything super precise. Basically there are a few new Linux kernel features (“namespaces” and “cgroups”) that let you isolate processes from each other. When you use those features, you call it “containers”.
Basically these features let you pretend you have something like a virtual machine, except it’s not a virtual machine at all, it’s just processes running in the same Linux kernel.
Containers take the operating system and slice it into two pieces. On one hand, you get the work unit for the application, which contains application code and dependencies in a way that can be optimized by the DevOps teams, and gives them autonomy and control to make decisions when they want to. They no longer have to wait for other teams.
The other piece is the operating system kernel. The OS kernel and container payload provide support for the resources and primitives you want available like storage, networking, and security. Because containers are an OS technology, you can run them anywhere, be it virtual hosts or a public cloud. That hybrid quality lets you manage any application in any environment using the same technology while still empowering DevOps teams.
Lars Herrmann, General Manager of the Integrated Solutions Business Unit at Red Hat
A kernel namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.
Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.
Linux provides the following namespaces:
- PID namespace provides isolation for the allocation of process identifiers (PIDs), lists of processes and their details. While the new namespace is isolated from other siblings, processes in its “parent” namespace still see all processes in child namespaces—albeit with different PID numbers.
- Network namespace isolates the network interface controllers (physical or virtual), iptables firewall rules, routing tables etc. Network namespaces can be connected with each other using the “veth” virtual Ethernet device.
- UTS namespace isolates the hostname and domain name, so a container can change its hostname without affecting the host.
- Mount namespace allows creating a different file system layout, or making certain mount points read-only.
- IPC namespace isolates the System V inter-process communication between namespaces.
- User namespace isolates the user IDs between namespaces.
Namespaces are created with the “unshare” command or syscall, or as new flags in a “clone” syscall.
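On a Linux host, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns. The following is a small Python sketch (the function name is made up for illustration) that lists them for the current process. Two processes that share a namespace show the same identifier, while a containerized process shows different ones.

```python
import os

def current_namespaces(pid="self"):
    """Map namespace type (e.g. 'pid', 'net') to its identifier string.

    Returns an empty dict where /proc/<pid>/ns is unavailable
    (non-Linux systems, or a restricted sandbox).
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for entry in entries:
        try:
            # Each entry is a symlink whose target looks like 'pid:[4026531836]'.
            namespaces[entry] = os.readlink(os.path.join(ns_dir, entry))
        except OSError:
            continue  # skip entries we are not allowed to read
    return namespaces

if __name__ == "__main__":
    for name, ident in current_namespaces().items():
        print(f"{name:20} {ident}")
```

Running this once on the host and once inside a container should show different identifiers for the namespaces the container does not share with the host.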
Cgroups (control groups) are a method to put processes into groups, allowing the following:
- Resource limitation: Groups can be set to not exceed a configured memory limit, which also includes the file system cache.
- Prioritization: Some groups may get a larger share of CPU utilization or disk I/O throughput.
- Accounting: Measures how many resources certain systems use, which may be used, for example, for billing purposes.
- Control: Freezing groups of processes, checkpointing them and restarting them.
Each cgroup is represented by a directory in the cgroup file system. In cgroup v1, that directory contains the following files describing the cgroup:
- tasks: list of tasks (by PID) attached to that cgroup.
- cgroup.procs: list of thread group IDs in the cgroup.
- notify_on_release: flag that controls whether the release agent runs when the cgroup becomes empty.
- release_agent: the path to use for release notifications (this file exists in the top cgroup only).
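Besides the cgroup file system itself, every process records its cgroup membership in /proc/<pid>/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. As a rough illustration (the helper name and the sample text are invented here, not taken from a real host), this Python sketch parses that format:

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    memberships = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        memberships.append({
            "hierarchy": int(hierarchy_id),
            # cgroup v2 leaves the controller field empty ('0::/path').
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return memberships

# Invented sample in the cgroup v1 layout; real output differs per host.
SAMPLE = """\
4:memory:/docker/abc123
3:cpu,cpuacct:/docker/abc123
"""

if __name__ == "__main__":
    for entry in parse_proc_cgroup(SAMPLE):
        print(entry)
```

To inspect a real process, pass the text of /proc/self/cgroup instead of SAMPLE.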
Union File Systems
Docker uses Union File Systems to build up an image. You can think of a Union File System as a stackable file system, meaning files and directories of separate file systems (known as branches) can be transparently overlaid to form a single file system.
The contents of directories which have the same path within the overlaid branches are seen as a single merged directory, which avoids the need to create separate copies of each layer. Instead, they can all be given pointers to the same resource; when a file in a lower layer needs to be modified, a copy is created and the change is applied to that local copy, leaving the original unchanged. That’s how file systems can *appear* writable without actually allowing writes. (In other words, a “copy-on-write” system.)
Source – Docker Deep Dive by Nigel Poulton
Copy On Write
When a container is started, it appears to have its own Linux file system. In fact it doesn’t: the container links to the files in the read-only image layers, which all containers started from that image share. Changes are handled with the copy-on-write model: the first time a file is modified, it is copied into the container’s own writable layer and the change is applied to that copy.
This has two main effects:
- The container has a small footprint, i.e. it is lightweight, because unmodified files are never duplicated.
- Containers start quickly, because nothing has to be copied up front.
This works in conjunction with the Union File System: each image layer is read-only and never changes. When a container is created, Docker builds from the stack of image layers and then adds a read-write layer on top. That layer, combined with the knowledge of the image layers below it and some configuration data, forms the container.
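The layering idea can be mimicked in a few lines of Python. This is only a toy model, not how a real union file system is implemented: dictionaries stand in for layers, and the paths and contents are made up. It relies on collections.ChainMap, which looks keys up from the top mapping downwards and sends every write to the top mapping only.

```python
from collections import ChainMap
from types import MappingProxyType

# Read-only image layers, lowest first (paths and contents are made up).
base_layer = MappingProxyType({"/etc/os-release": "alpine", "/app/app.py": "v1"})
app_layer = MappingProxyType({"/app/app.py": "v2", "/app/requirements.txt": "flask"})

def start_container(*image_layers):
    """Stack a fresh writable dict on top of the shared read-only layers."""
    writable = {}
    # ChainMap searches left to right, so the top layer wins on lookups,
    # and all writes land in the first mapping only.
    return ChainMap(writable, *reversed(image_layers))

c1 = start_container(base_layer, app_layer)
c2 = start_container(base_layer, app_layer)

# Modifying a file in one container touches only that container's top layer.
c1["/app/app.py"] = "patched"

if __name__ == "__main__":
    print("c1 sees:", c1["/app/app.py"])
    print("c2 sees:", c2["/app/app.py"])
    print("image layer still holds:", app_layer["/app/app.py"])
```

Both containers share the same two layer objects; only the small writable dictionary is private to each, which is the property that keeps containers lightweight.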
Now that we have seen what the kernel provides, we will look at how Docker makes this low-level machinery as easy as a docker run command.
Docker Engine is the layer on which Docker runs. It’s a lightweight runtime and tooling that manages containers, images, builds, and more. It runs natively on Linux systems and is made up of:
- A Docker Daemon that runs on the host computer.
- A Docker Client that then communicates with the Docker Daemon to execute commands.
- A REST API for interacting with the Docker Daemon remotely.
The Docker Client is what you, as the end-user of Docker, communicate with. Think of it as the UI for Docker. For example, when you do…
$ docker run hello-world
you are communicating with the Docker Client, which then passes your instructions to the Docker Daemon.
The Docker Daemon is what actually executes the commands sent to the Docker Client, such as building, running, and distributing your containers. The Docker Daemon runs on the host machine, but as a user, you never communicate directly with the Daemon. The Docker Client can run on the host machine as well, but it’s not required to. It can run on a different machine and communicate with the Docker Daemon that’s running on the host machine.
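The client and the daemon talk over the REST API listed above, usually through a Unix socket on the local machine. As a rough sketch, the Python snippet below plays the role of a client and asks the daemon for its version. The socket path is the common Linux default and may differ on your system, and the class and function names are invented for this example.

```python
import http.client
import json
import socket

DOCKER_SOCKET = "/var/run/docker.sock"  # common default; an assumption here

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        try:
            sock.connect(self.socket_path)
        except OSError:
            sock.close()
            raise
        self.sock = sock

def daemon_version(socket_path=DOCKER_SOCKET):
    """Return the daemon's version info as a dict, or None if unreachable."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/version")  # Engine API endpoint
        response = conn.getresponse()
        return json.loads(response.read())
    except OSError:
        return None  # no daemon, wrong path, or no permission on the socket
    finally:
        conn.close()

if __name__ == "__main__":
    info = daemon_version()
    print(info.get("Version") if info else "Docker daemon not reachable")
```

Pointing the same kind of request at a TCP address instead of a Unix socket is what allows the client to run on a different machine from the daemon.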
Images are read-only templates that you build from a set of instructions written in your Dockerfile. Images define both what you want your packaged application and its dependencies to look like and what processes to run when it’s launched.
The Docker image is built using a Dockerfile. Each instruction in the Dockerfile that changes the file system (such as RUN, COPY, or ADD) adds a new “layer” to the image, with layers representing a portion of the image’s file system that either adds to or replaces the layer below it. Layers are key to Docker’s lightweight yet powerful structure. Docker uses a Union File System to achieve this.
A Dockerfile is where you write the instructions to build a Docker image. A sample Dockerfile looks like this:
    # our base image
    FROM alpine:3.5

    # Install python and pip
    RUN apk add --update py2-pip

    # install Python modules needed by the Python app
    COPY requirements.txt /usr/src/app/
    RUN pip install --no-cache-dir -r /usr/src/app/requirements.txt

    # copy files required for the app to run
    COPY app.py /usr/src/app/
    COPY templates/index.html /usr/src/app/templates/

    # tell the port number the container should expose
    EXPOSE 5000

    # run the application
    CMD ["python", "/usr/src/app/app.py"]
We won’t go deep into the keywords used in the Dockerfile; the point is just to get an idea of what an image is, how it is built, and how it uses the Union File System explained above.
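To make the link between instructions and layers concrete, here is a rough Python sketch that picks out the instructions in the sample Dockerfile that would add a file system layer. It assumes the behavior of current Docker releases, where only RUN, COPY, and ADD produce layers with content and the other instructions are recorded as image metadata. The helper name is made up, and the parsing is deliberately naive (it ignores line continuations).

```python
# Instructions that change the file system and therefore add a layer.
LAYER_INSTRUCTIONS = {"RUN", "COPY", "ADD"}

# The sample Dockerfile from the text, without its comments.
SAMPLE_DOCKERFILE = """\
FROM alpine:3.5
RUN apk add --update py2-pip
COPY requirements.txt /usr/src/app/
RUN pip install --no-cache-dir -r /usr/src/app/requirements.txt
COPY app.py /usr/src/app/
COPY templates/index.html /usr/src/app/templates/
EXPOSE 5000
CMD ["python", "/usr/src/app/app.py"]
"""

def layer_instructions(dockerfile_text):
    """Return the instructions that add a file system layer, in order.

    Naive on purpose: backslash line continuations are not handled.
    """
    result = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split(maxsplit=1)[0].upper()
        if keyword in LAYER_INSTRUCTIONS:
            result.append(line)
    return result

if __name__ == "__main__":
    for number, instruction in enumerate(layer_instructions(SAMPLE_DOCKERFILE), 1):
        print(f"layer {number} on top of the base image: {instruction}")
```

The layers of the base image (alpine:3.5 here) sit underneath these and are shared with every other image built from the same base.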
A Docker registry stores Docker images. Docker Hub is a public registry that anyone can use, and Docker is configured to look for images on Docker Hub by default. You can even run your own private registry. If you use Docker Datacenter (DDC), it includes Docker Trusted Registry (DTR).
When you use the docker pull or docker run commands, the required images are pulled from your configured registry. When you use the docker push command, your image is pushed to your configured registry.
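The Docker Hub default explains why a short name such as hello-world is enough for docker run: the name is expanded to a full reference before the registry is contacted. The Python sketch below approximates that expansion under simplified assumptions. The function name is invented, digest references (name@sha256:...) are not handled, and the real rules live in Docker's own reference parser.

```python
def normalize_image_name(name):
    """Expand a short image name to registry/path:tag (simplified)."""
    # A ':' after the last '/' separates the repository from the tag;
    # a ':' before it belongs to a registry port such as localhost:5000.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        repo, tag = name[:colon], name[colon + 1:]
    else:
        repo, tag = name, "latest"

    first, _, rest = repo.partition("/")
    # The first path component is a registry host only if it looks like one.
    if rest and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", repo
        if "/" not in path:
            path = "library/" + path  # official images live under library/
    return f"{registry}/{path}:{tag}"

if __name__ == "__main__":
    for short in ("hello-world", "alpine:3.5", "localhost:5000/myapp"):
        print(short, "->", normalize_image_name(short))
```

A name that starts with a host, such as a private registry address, is left pointing at that registry, which is how docker pull and docker push reach a registry other than Docker Hub.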
Docker Engine Events
Now that we have seen what the kernel and Docker Engine actually are, it starts to make sense what Docker really is.
They built a thing called “Docker Engine” that uses “Kernel” features :).
That’s all Docker is! Of course Docker has a lot of features these days, but a lot of it is built on these basic Linux kernel primitives.
Comparing Containers and Virtual Machines
Containers and virtual machines have similar resource isolation and allocation benefits, but function differently because containers virtualize the operating system instead of hardware. Containers are more portable and efficient.
Containers are an abstraction at the app layer that packages code and dependencies together. Multiple containers can run on the same machine and share the OS kernel with other containers, each running as isolated processes in user space. Containers take up less space than VMs (container images are typically tens of MBs in size), can handle more applications and require fewer VMs and Operating systems.
Virtual machines (VMs) are an abstraction of physical hardware turning one server into many servers. The hypervisor allows multiple VMs to run on a single machine. Each VM includes a full copy of an operating system, the application, necessary binaries and libraries – taking up tens of GBs. VMs can also be slow to boot.
Thus, Virtualization emulates a virtual hardware environment to run various software stacks; it provides what’s called an abstraction layer to give that cloud-computing environment flexibility over how applications and data are structured and deployed. So, upon a single virtualized OS kernel, you can then run multiple servers or instances. Containers are the instances.
Virtualization provides abstraction and emulation and, with containers, you get a similar kind of abstraction but without the emulation.
Scenario-based comparison
So, let’s say you have a 1 GB container image; if you wanted to use full VMs, you would need 1 GB times the number of VMs you want. With Docker and AuFS (Union File System) you can share the bulk of the 1 GB between all the containers, and if you have 1000 containers you still might only need a little over 1 GB of space for the containers’ OS (assuming they are all running the same OS image).
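The arithmetic behind this scenario can be written down directly. The Python sketch below is a back-of-the-envelope model only: the function names are made up, and the per-container writable layer size is a hypothetical figure, since the real value depends on what each container writes.

```python
def vm_storage_mb(image_mb, instances):
    """Each VM carries its own full copy of the image."""
    return image_mb * instances

def container_storage_mb(image_mb, instances, writable_mb_each):
    """Containers share one read-only image; each adds only its writable layer."""
    return image_mb + instances * writable_mb_each

if __name__ == "__main__":
    image_mb, n = 1024, 1000
    print("VMs:       ", vm_storage_mb(image_mb, n), "MB")
    # 0.1 MB per container is a made-up figure for illustration.
    print("Containers:", container_storage_mb(image_mb, n, 0.1), "MB")
```

The VM total grows with the full image size per instance, while the container total grows only with what each container changes.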
A full virtualized system gets its own set of resources allocated to it, and does minimal sharing. You get more isolation, but it is much heavier (requires more resources). With Docker you get less isolation, but the containers are lightweight (require fewer resources). So you could easily run thousands of containers on a host, and it won’t even blink.
A full virtualized system usually takes minutes to start, whereas Docker/LXC/runC containers take seconds, and often even less than a second.
There are pros and cons for each type of virtualized system. If you want full isolation with guaranteed resources, a full VM is the way to go. If you just want to isolate processes from each other and want to run a ton of them on a reasonably sized host, then Docker/LXC/runC seems to be the way to go.