Monday, January 11, 2016

Docker Filesystems: Understanding the btrfs Backend

Docker's filesystem handling is built on the storage backend abstraction.[14] A storage backend stores a set of layers, each addressed by a unique name.

Docker supports several storage backends:[1]
  • vfs backend
  • devicemapper backend[21]
  • btrfs backend
  • aufs backend
In this article, we will discuss Docker filesystems in general and the btrfs backend in particular.

Images and Containers


A core part of the Docker model is the efficient use of layered images and containers:
  • Images
    • Each Docker image on the system is stored as a layer, with the parent being the layer of the parent image. 
      • To create such an image, Docker creates a new layer (based on the right parent) and then applies the image's changes to the newly mounted filesystem.
    • Docker images have intermediate layers that increase reusability, decrease disk usage, and speed up docker build by allowing each step to be cached. These intermediate layers are not shown by default in the "docker images" command.
      • Each layer is a filesystem tree that can be mounted[2] when needed and modified. New layers can be started from scratch, but they can also be created with a specified parent.
  • Containers
    • Docker containers are isolated, minimal Linux environments built from Docker images, that is, base images with zero or more filesystem layers on top of them.
As shown below, there is 1 container and 118 images in this Docker installation. The configured storage driver is btrfs,[13] which is the focus of this article:

# docker info
Containers: 1
Images: 118
Storage Driver: btrfs
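
When checking several hosts, it can help to extract just the driver name. The following is a minimal sketch that assumes the "docker info" output contains a "Storage Driver:" line like the one above; the function name storage_driver is made up for illustration and is not a Docker command.

```shell
# Print only the storage driver name reported by "docker info".
# storage_driver is a hypothetical helper name, not part of Docker.
storage_driver() {
    # Some Docker releases indent this line, so allow leading spaces.
    docker info 2>/dev/null | sed -n 's/^ *Storage Driver: *//p'
}
```

On the installation above, calling storage_driver would print btrfs.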

To retrieve low-level information on a container or image, use the "docker inspect" command, which takes a required ID argument (either a container's or an image's). Use "docker ps" to find the ID of a specific container, or "docker images" to list the IDs of all images.
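
These commands combine naturally in a small loop. The sketch below prints the image ID behind every container; it assumes only the "docker ps -aq" and "docker inspect --format" options, and the helper name list_container_images is hypothetical.

```shell
# Print "<container ID> <image ID>" for every container, running or stopped.
# list_container_images is a hypothetical helper name.
list_container_images() {
    for cid in $(docker ps -aq); do
        printf '%s %s\n' "$cid" "$(docker inspect --format '{{ .Image }}' "$cid")"
    done
}
```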

Base Image


Base images are typically minimal operating system images and the layers on top of them are added by developers to create convenience images (such as an image which already has Java SE installed and configured) for direct use or for use as building blocks.

Each container is related to a top image which is built up from layers of images starting from a base image.

To find the top image associated with a container, type:

# docker inspect --format "{{ .Image }}" ce483e532466
eca6affff525415c7e2199f1e8b2222ffce31d4bcf4a0cd05a48807d2c1f7647


To find the layers of images that a container is built up from, type:

# docker history eca6affff525415c7e2199f1e8b2222ffce31d4bcf4a0cd05a48807d2c1f7647
IMAGE               CREATED             CREATED BY                                      SIZE
eca6affff525        4 days ago          /bin/sh -c #(nop) WORKDIR /u01/app              0 B
ccf8bd04df89        4 days ago          /bin/sh -c #(nop) ENV APP_HOME=/u01/app/        0 B
c91b83e8c828        4 days ago          /bin/sh -c #(nop) USER [apaas]                  0 B
ab06ea65ece3        4 days ago          /bin/sh -c chown -R apaas:apaas /u01/           9.146 MB
2354b0ad9541        4 days ago          /bin/sh -c #(nop) ADD dir:ff4334d8629caee02b1   9.144 MB
246fb66aa39e        4 days ago          /bin/sh -c chmod -R +x /u01/scripts/            1.383 kB
21c5ddd9b74c        4 days ago          /bin/sh -c #(nop) COPY dir:17f42381efa361f6c6   1.383 kB
c347b96af5be        4 days ago          /bin/sh -c mkdir -p /u01/scripts /u01/logs      0 B
00c1fc450430        4 days ago          /bin/sh -c #(nop) USER [root]                   0 B
f52b843cf97e        7 weeks ago         /bin/sh -c mv java java.orig && chmod +x ./ja   7.718 kB
7c6d6279239c        7 weeks ago         /bin/sh -c #(nop) USER [apaas]                  0 B
5c6ad3a0ad33        7 weeks ago         /bin/sh -c mkdir -p /u01/logs && chown -R apa   306.5 MB
0100a4922bfb        7 weeks ago         /bin/sh -c #(nop) WORKDIR /u01/jdk/jdk1.7.0_9   0 B
4983e8502db6        7 weeks ago         /bin/sh -c #(nop) ADD file:3511bd6019a189ef28   226 B
266e209d77d3        7 weeks ago         /bin/sh -c #(nop) ENV PATH=/u01/jdk/jdk1.7.0_   0 B
c00eef371809        7 weeks ago         /bin/sh -c #(nop) ENV JAVA_HOME=/u01/jdk/jdk1   0 B
db5d61324db8        7 weeks ago         /bin/sh -c #(nop) ADD file:babe1a2cf183ba22e4   306.5 MB
10287b34527b        5 months ago        /bin/sh -c groupadd apaas && useradd -g apaas   296.1 kB
035a8c863461        5 months ago        /bin/sh -c mkdir -p /u01/jdk/ && mkdir -p /u0   0 B
a555d44630e2        10 months ago       /bin/sh -c #(nop) CMD [/bin/bash]               0 B
23a9eb33093d        10 months ago       /bin/sh -c #(nop) ADD file:33b9447cdbd58ef81b   195.1 MB
7258693d533e        10 months ago       /bin/sh -c #(nop) MAINTAINER Oracle Linux Pro   0 B


Note that "eca6affff525" is the top image, which is built on top of "ccf8bd04df89", and so on. The base image is "7258693d533e", which doesn't have a parent. For example, if you display the parent of the base image, it displays nothing (i.e., no parent):

# docker inspect --format "{{ .Parent }}" 7258693d533e
<blank>
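
The same ".Parent" lookup can be repeated to walk from a top image down to its base image. This is a rough sketch, assuming a Docker version that populates the Parent field as in the output above (other versions may leave it empty for pulled images); image_ancestry is a hypothetical helper name.

```shell
# Print an image ID followed by each of its ancestors, one per line.
# image_ancestry is a hypothetical helper; it stops at the first image
# whose Parent field is empty (the base image) or cannot be inspected.
image_ancestry() {
    id=$1
    while [ -n "$id" ]; do
        echo "$id"
        id=$(docker inspect --format '{{ .Parent }}' "$id")
    done
}
```

For example, "image_ancestry eca6affff525" would be expected to list the same chain that "docker history" reports, provided each Parent field is set.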

The btrfs Backend


The btrfs backend requires /var/lib/docker to be on a btrfs filesystem and uses filesystem-level snapshots to implement layers.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb4       11G  7.1G  2.5G  75% /var/lib/docker
# mount -l
/dev/xvdb4 on /var/lib/docker type btrfs (rw)
<snipped>
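
A script can make the same check before relying on the btrfs backend. This is a minimal sketch that assumes GNU coreutils, whose "stat -f -c %T" prints the filesystem type of a path; on_btrfs is a hypothetical helper name.

```shell
# Succeed if the given path (default: /var/lib/docker) lives on btrfs.
# on_btrfs is a hypothetical helper; "stat -f -c %T" is GNU coreutils.
on_btrfs() {
    fstype=$(stat -f -c %T "${1:-/var/lib/docker}") || return 2
    [ "$fstype" = "btrfs" ]
}
```

It could be used as: on_btrfs /var/lib/docker && echo "btrfs backend is possible".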

You can find the image layers in the folder /var/lib/docker/btrfs/subvolumes. Each layer is stored as a btrfs subvolume inside that folder and starts out as a snapshot of the parent subvolume (if any).
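
To see these subvolumes from the btrfs side, the "btrfs subvolume list" command can be used. The sketch below counts them; it assumes the listed paths contain "btrfs/subvolumes/", which depends on where the filesystem is mounted, and count_layer_subvolumes is a hypothetical helper name.

```shell
# Count the layer subvolumes Docker has created on its btrfs filesystem.
# count_layer_subvolumes is a hypothetical helper; it usually needs root.
# Assumption: subvolume paths contain "btrfs/subvolumes/".
count_layer_subvolumes() {
    btrfs subvolume list "${1:-/var/lib/docker}" | grep -c 'btrfs/subvolumes/'
}
```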

This backend is fast because creating a layer is a copy-on-write snapshot rather than a file copy. Mounting /var/lib/docker on a different filesystem than the rest of your system is recommended in order to limit the impact of filesystem corruption.

Image Cleanup


One of the purposes of learning about Docker filesystems and storage backends is to ensure you know what you are doing before cleaning up unwanted images.[15,16]

For example, an image can be removed only when no container, running or stopped, is using it. Once you have confirmed that, the following commands clean up untagged images (see also filtering) or all images.

# batch cleanup untagged images 
docker rmi $(docker images -q -f "dangling=true")

# remove all images by id 
docker rmi $(docker images -aq)
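
Before running either command, it may be worth checking which containers still reference an image. The following is a sketch only: it compares each container's ".Image" field against an ID prefix, so it finds containers created directly from that image, not from its children. The helper name containers_using_image is hypothetical.

```shell
# List containers (running or stopped) created from the given image ID.
# containers_using_image is a hypothetical helper. It accepts a short or
# full ID and tolerates an optional "sha256:" prefix in the inspect output.
containers_using_image() {
    target=$1
    for cid in $(docker ps -aq); do
        img=$(docker inspect --format '{{ .Image }}' "$cid")
        case "$img" in
            "$target"*|"sha256:$target"*) echo "$cid" ;;
        esac
    done
}
```

If "containers_using_image eca6affff525" prints nothing, no container was created directly from that image.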

References

  1. Supported Filesystems (Docker)
  2. Concept of Mounting
    • The concept of mounting allows programs to be agnostic about where your data is structured
    • From an application (or user) point of view, the file system is one tree. Under the hood, the file system structure can be on a single partition, but also on a dozen partitions, network storage, removable media and more.
  3. Displaying Physical Volumes (Redhat)
  4. Docker - How to analyze a container's disk usage? (good)
  5. Finding all storage devices attached to a Linux machine
  6. /dev/dm-1 (block device)
    • /dev/dm-1 stands for "device mapper number 1". Basically, it is a logical unit carved out using the kernel's embedded device mapper layer. From a userspace application's point of view, it is a raw block device.
  7. Linux file system
  8. Docker images command
  9. Docker cp command
    • You can copy to or from either a running or stopped container.
    • Behavior is similar to the common Unix utility cp -a in that 
      • directories are copied recursively with permissions preserved if possible. 
      • Ownership is set to the user and primary group on the receiving end of the transfer.  For example, 
        • Files copied to a container will be created with UID:GID of the root user. 
        • Files copied to the local machine will be created with the UID:GID of the user which invoked the docker cp command.
      • It is not possible to copy certain system files such as resources under /proc,/sys, /dev, and mounts created by the user in the container.
  10. Understanding Volumes in Docker (good) 
  11. Docker Volume Manager
  12. Docker Quicksheet 
  13. Storage Driver (Docker)
    • A storage driver is how docker implements a particular union file system. 
    • In keeping with its “batteries included, but replaceable” philosophy, Docker supports a number of different union file systems.
      • For instance, Ubuntu’s default storage driver is AUFS, whereas for Red Hat and CentOS it’s Device Mapper.
  14. Docker Images
    • Docker images are stored as series of read-only layers. 
    • When we start a container, Docker takes the read-only image and adds a read-write layer on top. 
    • If the running container modifies an existing file, the file is copied out of the underlying read-only layer and into the top-most read-write layer where the changes are applied. 
      • The version in the read-write layer hides the underlying file, but does not destroy it — it still exists in the underlying image. 
    • When a Docker container is deleted, relaunching the image will start a fresh container without any of the changes made in the previously running container — those changes are lost. 
    • Docker calls this combination of read-only layers with a read-write layer on top a Union File System.
  15. Why is docker image eating up my disk space that is not used by docker
  16. Docker error : no space left on device
  17. docker ps -s
    • -s, --size=false Display total file sizes
  18. Advanced Docker Volumes
  19. Resizing Docker containers with the Device Mapper plugin
  20. Question on Resource Limits? (Docker)
  21. devicemapper - a storage backend based on Device Mapper

Wednesday, December 30, 2015

Jumbo Frames—Design Considerations for Efficient Network

Each network has some maximum packet size, or maximum transmission unit (MTU). Ultimately there is some limit imposed by the technology, but often the limit is an engineering choice or even an administrative choice.[1]

Many Gigabit Ethernet switches and Gigabit Ethernet network interface cards can support jumbo frames.[2] Enabling jumbo frames (MTU: 9000) brings performance benefits. However, existing transmission links may still impose a smaller MTU (e.g., 1500). This can cause issues along transit paths, which is referred to here as MTU mismatch.

In this article, we will examine the issues caused by MTU mismatch and the related design considerations.

How to Accommodate MTU Differences


When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum packet size.

Jumbo frames are Ethernet frames with more than 1500 bytes of payload.[3] Conventionally, jumbo frames can carry up to 9000 bytes of payload, but variations exist and some care must be taken using the term. In this article, we will use MTU: 9000 and MTU: 1500 as our examples to discuss MTU-mismatch issues.

Issues

MTU is a maximum: a network device does not drop frames unless they are larger than its MTU. A device with an MTU of 1500 can therefore still communicate with a device with an MTU of 9000. However, when large packets are sent from an MTU-9000 device to an MTU-1500 device, the following happens:
  • If the DF (Don't Fragment) flag is set
    • Packets are dropped, and the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set"
  • If the DF flag is not set
    • Packets are fragmented to accommodate the MTU difference, which incurs a cost[4]
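
To get a feel for the fragmentation cost, the number of fragments can be estimated with simple arithmetic. This is a sketch under stated assumptions: IPv4 with a 20-byte header and no options, and fragment payloads rounded down to a multiple of 8 bytes as IPv4 requires. The helper name fragments_needed is hypothetical.

```shell
# Number of IPv4 fragments needed to carry one packet across a smaller MTU.
# Assumes a 20-byte IPv4 header with no options; helper name is hypothetical.
fragments_needed() {
    packet_size=$1      # total size of the original IP packet, in bytes
    link_mtu=$2         # MTU of the link that forces fragmentation
    payload=$((packet_size - 20))
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment=$(( (link_mtu - 20) / 8 * 8 ))
    echo $(( (payload + per_fragment - 1) / per_fragment ))
}
```

Under these assumptions, "fragments_needed 9000 1500" prints 7: a single 9000-byte packet becomes seven packets on the MTU-1500 link, each with its own IP header.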

How to Test Potential MTU Mismatch


The ping, tracepath, or traceroute (with the --mtu option) command can be used to test for potential MTU mismatches.

For example, you can verify that the path between two end nodes has at least the expected MTU using the ping command:
ping -M do -c 4 -s 8972 <destination>
The -M do option causes the DF flag to be set.
The -c option sets the number of pings.
The -s option specifies the number of bytes of padding that should be added to the echo request. In addition to this number, there are 20 bytes for the IP header and 8 bytes for the ICMP header. The amount of padding should therefore be 28 bytes less than the network-layer MTU that you are trying to test (9000 − 28 = 8972).
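
The subtraction can be wrapped in a small helper so that the probe is expressed in terms of the MTU under test. This is a minimal sketch assuming IPv4 without IP options and the iputils version of ping; the names ping_payload_size and probe_mtu are hypothetical.

```shell
# Payload size to pass to "ping -s" for a given network-layer MTU:
# subtract 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload_size() {
    echo $(( $1 - 28 ))
}

# Probe whether packets of the given MTU reach a host without fragmentation.
# Usage: probe_mtu <host> <mtu>   (hypothetical helper)
probe_mtu() {
    ping -M do -c 4 -s "$(ping_payload_size "$2")" "$1"
}
```

For instance, "ping_payload_size 1500" prints 1472, the padding to use when testing a standard Ethernet path.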

If the test is unsuccessful, then you should see an error in response to each echo request:
$ ping -M do -c 4 -s 8972 10.252.136.96
PING 10.252.136.96 (10.252.136.96) 8972(9000) bytes of data.
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)


--- 10.252.136.96 ping statistics ---
0 packets transmitted, 0 received, +4 errors
Similarly, you can use the tracepath command to test:
$ tracepath -n -l 9000 <destination>
The -n option disables host name lookups (i.e., IP addresses are printed numerically).
The -l option sets the initial packet length to pktlen instead of 65536 for tracepath or 128000 for tracepath6.
The last line of the tracepath output summarizes the whole path to the destination: it shows the detected Path MTU, the number of hops to the destination, and a guess at the number of hops from the destination back to us, which can differ when the path is asymmetric.
/* a packet of length 9000 cannot reach its destination */
$ tracepath -n -l 9000 10.249.184.27
1: 10.241.71.129 0.630ms
2: 10.241.152.60 0.577ms
3: 10.241.152.0 0.848ms
4: 10.246.1.49 1.007ms
5: 10.246.1.106 0.783ms
6: no reply
...
31: no reply
Too many hops: pmtu 9000
Resume: pmtu 9000
/* a packet of length 1500 reached its destination */
$ tracepath -n -l 1500 10.249.184.27
1: 10.241.71.129 0.502ms
2: 10.241.152.62 0.419ms
3: 10.241.152.4 0.543ms
4: 10.246.1.49 0.886ms
5: 10.246.1.106 0.439ms
6: 10.249.184.27 0.292ms reached
Resume: pmtu 1500 hops 6 back 59
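
Note that the "Resume:" line alone is not enough: in the first run it reports pmtu 9000 even though the destination was never reached. A script should therefore check for the "reached" marker as well. The sketch below parses tracepath output from standard input, assuming the output format shown above; both helper names are hypothetical.

```shell
# Extract the path MTU from the "Resume:" line of tracepath output (stdin).
pmtu_from_tracepath() {
    awk '$1 == "Resume:" { for (i = 2; i < NF; i++) if ($i == "pmtu") print $(i + 1) }'
}

# Succeed only if some hop line in tracepath output (stdin) says "reached".
destination_reached() {
    grep -q 'reached'
}
```

One way to use them is to capture the output once, for example out=$(tracepath -n -l 9000 10.249.184.27), and then pipe "$out" into each helper in turn.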

When to Enable Jumbo Frames?


Enabling jumbo frame mode (for example, on Gigabit Ethernet network interface cards) can offer the following benefits:
  • Less bandwidth consumed by non-data protocol overhead
    • Hence increased network throughput
  • Reduction of the packet rate
    • Hence reduced server overhead
      • The use of large MTU sizes allows the operating system to send fewer packets of a larger size to reach the same network throughput.
      • For example, you will see a decrease in CPU usage when transferring large files
The above factors are especially important in speeding up NFS or iSCSI traffic, which normally carries larger payloads.
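
Both benefits can be roughly quantified. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes TCP over IPv4 with no options (40 bytes of headers inside the MTU) and 38 bytes of Ethernet overhead per frame on the wire (preamble, header, FCS, and inter-frame gap). The helper names are hypothetical.

```shell
# Share of wire bandwidth carrying TCP payload, in hundredths of a percent.
# Assumes 40 bytes of IPv4+TCP headers and 38 bytes of Ethernet overhead.
payload_efficiency() {
    mtu=$1
    echo $(( (mtu - 40) * 10000 / (mtu + 38) ))
}

# Number of full-size frames needed to transfer a given number of bytes.
frames_needed() {
    bytes=$1
    mss=$(( $2 - 40 ))      # TCP payload per frame for the given MTU
    echo $(( (bytes + mss - 1) / mss ))
}
```

Under these assumptions, payload_efficiency prints 9492 (about 94.9%) for MTU 1500 and 9913 (about 99.1%) for MTU 9000, while "frames_needed 1000000 1500" prints 685 versus 112 for "frames_needed 1000000 9000", which illustrates the lower packet rate.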

Design Considerations


When jumbo frame mode is enabled, the trade-offs include:
  • Bigger I/O buffer
    • Required for both end nodes and intermediate transit nodes
  • MTU mismatch
    • May cause IP fragmentation or even loss of data
Therefore, some design considerations are required. For example, you can:
  • Avoid situations where jumbo-frame-enabled host NICs talk to non-jumbo-frame-enabled host NICs.
    • One design trick is to send your NFS or iSCSI traffic via a dedicated NIC and your normal host traffic via a non-jumbo-MTU interface
      • If your workload only includes small messages, then the larger MTU size will not help
    • Be sure to use commands with the Don't Fragment bit set to verify that hosts configured for jumbo frames can successfully communicate with each other via jumbo frames.
  • Enable Path MTU Discovery (PMTUD)[18]
    • When possible, use the largest MTU size that the adapter and network support, but constrained by Path MTU
    • Make sure the packet filter on your firewall processes ICMP packets correctly
      • RFC 4821, Packetization Layer Path MTU Discovery, describes a Path MTU Discovery technique which responds more robustly to ICMP filtering.
  • Be aware of extra non-data protocol overhead if you configure encapsulation such as GRE tunneling or IPsec encryption.
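
The Don't Fragment check above can be automated across a set of hosts. This is a sketch only, assuming the iputils ping options -M, -c, -W, and -s and IPv4 without IP options; the helper name check_jumbo_hosts is hypothetical, as are the addresses in the usage note.

```shell
# Send one DF-flagged ping of the given MTU to each host and report the result.
# Usage: check_jumbo_hosts <mtu> <host>...   (hypothetical helper)
# Returns 0 if every host answered, 1 if at least one did not.
check_jumbo_hosts() {
    mtu=$1
    shift
    size=$((mtu - 28))      # subtract IPv4 (20) and ICMP (8) headers
    failed=0
    for host in "$@"; do
        if ping -M do -c 1 -W 2 -s "$size" "$host" >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, "check_jumbo_hosts 9000 192.0.2.10 192.0.2.11" would print one line per host. A FAILED line means the probe did not get a reply, which can indicate a smaller MTU somewhere on the path but also an unreachable host or filtered ICMP.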

References

  1. The TCP Maximum Segment Size and Related Topics
  2. Jumbo/Giant Frame Support on Catalyst Switches Configuration Example
  3. Ethernet Jumbo Frames
  4. IP Fragmentation: How to Avoid It? (Xml and More)
  5. The Great Jumbo Frames Debate
  6. Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC
  7. Sites with Broken/Working PMTUD
  8. Path MTU Discovery
  9. TCP headers
  10. bad TCP checksums
  11. MSS performance consideration
  12. Understanding Routing Table
  13. route (Linux man page)
  14. Docker should set host-side veth MTU #4378
  15. Add MTU to lxc conf to make host and container MTU match
  16. Xen Networking
  17. TCP parameter settings (/proc/sys/net/ipv4)
  18. Change the MTU of a network interface
    • To enable TCP MTU probing on Linux (the RFC 4821 technique, which helps when ICMP is filtered), type:
      • echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
      • echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
  19. MTU manipulation
  20. Jumbo Frames, the gotcha's you need to know! (good)
  21. Understand container communication (Docker)
  22. calicoctl should allow configuration of veth MTU #488 - GitHub
  23. Linux MTU Change Size
  24. Changing the MTU size in Windows Vista, 7 or 8
  25. Linux Configure Jumbo Frames to Boost Network Performance
  26. Path MTU discovery in practice