Wednesday, December 30, 2015

Jumbo Frames—Design Considerations for Efficient Network

Each network has some maximum packet size, or maximum transmission unit (MTU). Ultimately there is some limit imposed by the technology, but often the limit is an engineering choice or even an administrative choice.[1]

Many Gigabit Ethernet switches and Gigabit Ethernet network interface cards can support jumbo frames.[2] Enabling jumbo frames (MTU: 9000) brings performance benefits. However, existing transmission links may still impose a smaller MTU (e.g., 1500). This can cause problems along transit paths, which we refer to here as MTU mismatch.

In this article, we will examine the issues caused by MTU mismatch and the related design considerations.

How to Accommodate MTU Differences

When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum packet size.

Jumbo frames are Ethernet frames with more than 1500 bytes of payload.[3] Conventionally, jumbo frames can carry up to 9000 bytes of payload, but variations exist and some care must be taken using the term. In this article, we will use MTU: 9000 and MTU: 1500 as our examples to discuss MTU-mismatch issues.


MTU is a maximum: you tell a network device NOT to drop frames unless they are larger than the maximum. A device with an MTU of 1500 can still communicate with a device with an MTU of 9000. However, when large packets are sent from an MTU-9000 device to an MTU-1500 device, the following happens:
  • If the DF (Don't Fragment) flag is set
    • Packets will be dropped, and the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set"
  • If the DF flag is not set
    • Packets will be fragmented to accommodate the MTU difference, which incurs a cost[4]
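
The two cases above can be sketched in a few lines of Python. This is only an illustration of the decision, not router code: the function name forward is hypothetical, and we assume a 20-byte IPv4 header without options.

```python
import math

IP_HEADER = 20  # bytes; assumes an IPv4 header without options


def forward(datagram_len: int, link_mtu: int, df_set: bool) -> str:
    """Describe what happens to an IPv4 datagram that meets a smaller-MTU link."""
    if datagram_len <= link_mtu:
        return "forwarded unchanged"
    if df_set:
        return "dropped; ICMP 'fragmentation needed and DF set' returned"
    # Each fragment carries its own IP header, and every fragment except the
    # last must carry a payload that is a multiple of 8 bytes.
    per_fragment = (link_mtu - IP_HEADER) // 8 * 8
    fragments = math.ceil((datagram_len - IP_HEADER) / per_fragment)
    return f"fragmented into {fragments} fragments"


print(forward(9000, 1500, df_set=True))
print(forward(9000, 1500, df_set=False))  # 8980 payload bytes, 1480 per fragment
```

With these assumptions, a full 9000-byte datagram becomes seven fragments on a 1500-byte link, which is where the extra cost comes from.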

How to Test Potential MTU Mismatch

The ping, tracepath, or traceroute (with the --mtu option) commands can be used to test for potential MTU mismatches.

For example, you can verify that the path between two end nodes has at least the expected MTU using the ping command:
ping -M do -c 4 -s 8972
The -M do option causes the DF flag to be set.
The -c option sets the number of pings.
The -s option specifies the number of bytes of padding that should be added to the echo request. In addition to this number there will be 20 bytes for the IP header and 8 bytes for the ICMP header. The amount of padding should therefore be 28 bytes less than the network-layer MTU that you are trying to test (9000 − 28 = 8972).
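
The padding arithmetic can be written down as a small helper. This is a minimal sketch: ping_payload_size is a hypothetical name, and the header sizes assume IPv4 without options.

```python
def ping_payload_size(mtu: int, ip_header: int = 20, icmp_header: int = 8) -> int:
    """Return the value to pass to `ping -s` so the packet exactly fills the MTU."""
    return mtu - ip_header - icmp_header


print(ping_payload_size(9000))  # value for testing a jumbo-frame path
print(ping_payload_size(1500))  # value for testing a standard Ethernet path
```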

If the test is unsuccessful, then you should see an error in response to each echo request:
$ ping -M do -c 4 -s 8972
PING ( 8972(9000) bytes of data.
From icmp_seq=1 Frag needed and DF set (mtu = 1500)
From icmp_seq=1 Frag needed and DF set (mtu = 1500)
From icmp_seq=1 Frag needed and DF set (mtu = 1500)
From icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- ping statistics ---
0 packets transmitted, 0 received, +4 errors
Similarly, you can use the tracepath command to test:
$ tracepath -n -l 9000
The -n option specifies not looking up host names (i.e., only print IP addresses numerically).
The -l option sets the initial packet length to pktlen instead of 65536 for tracepath or 128000 for tracepath6.
In the tracepath output, the last line summarizes the whole path to the destination: it shows the detected path MTU, the number of hops to the destination, and a guess at the number of hops from the destination back to us, which can differ when the path is asymmetric.
/* a packet of length 9000 cannot reach its destination */
$ tracepath -n -l 9000
1: 0.630ms
2: 0.577ms
3: 0.848ms
4: 1.007ms
5: 0.783ms
6: no reply
31: no reply
Too many hops: pmtu 9000
Resume: pmtu 9000
/* a packet of length 1500 reached its destination */
$ tracepath -n -l 1500
1: 0.502ms
2: 0.419ms
3: 0.543ms
4: 0.886ms
5: 0.439ms
6: 0.292ms reached
Resume: pmtu 1500 hops 6 back 59

When to Enable Jumbo Frames?

Enabling jumbo frame mode (for example, on Gigabit Ethernet network interface cards) can offer the following benefits:
  • Less consumption of bandwidth by non-data protocol overhead
    • Hence higher network throughput
  • Reduction of the packet rate
    • Hence lower server overhead
      • A large MTU allows the operating system to send fewer, larger packets to reach the same network throughput.
      • For example, you will see a decrease in CPU usage when transferring large files
The above factors are especially important in speeding up NFS or iSCSI traffic, which normally carries larger payloads.
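
To put rough numbers on both benefits, here is a back-of-the-envelope sketch. It assumes TCP over IPv4 with 20-byte headers each and 38 bytes of per-frame Ethernet overhead (preamble and start delimiter 8, header 14, FCS 4, inter-frame gap 12); the function names are hypothetical.

```python
ETH_OVERHEAD = 38   # bytes per frame: preamble+SFD 8, header 14, FCS 4, gap 12
TCPIP_HEADERS = 40  # bytes: IPv4 header 20 + TCP header 20, no options


def efficiency(mtu: int) -> float:
    """Fraction of the bytes on the wire that are TCP payload."""
    return (mtu - TCPIP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(mtu: int, link_bps: int = 1_000_000_000) -> float:
    """Full-size frames per second needed to saturate the link."""
    return link_bps / 8 / (mtu + ETH_OVERHEAD)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {efficiency(mtu):.1%} payload, "
          f"{frames_per_second(mtu):,.0f} frames/s at line rate")
```

Under these assumptions the payload share rises from about 94.9% to about 99.1%, while the packet rate needed to fill a Gigabit link drops by almost a factor of six. The second effect is the one that shows up as lower CPU usage.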

Design Considerations

When jumbo frame mode is enabled, the trade-offs include:
  • Bigger I/O buffer
    • Required for both end nodes and intermediate transit nodes
  • MTU mismatch
    • May cause IP fragmentation or even loss of data
Therefore, some design considerations are required. For example, you can:
  • Avoid situations where jumbo-frame-enabled host NICs talk to non-jumbo-frame-enabled host NICs.
      • One design trick is to send your NFS or iSCSI traffic via a dedicated jumbo-frame NIC and your normal host traffic via an interface with the standard MTU
        • If your workload only includes small messages, then the larger MTU size will not help
      • Be sure to test with the Don't Fragment bit set to ensure that hosts configured for jumbo frames can successfully communicate with each other via jumbo frames.
  • Enable Path MTU Discovery (PMTUD)[18]
    • When possible, use the largest MTU size that the adapter and network support, constrained by the path MTU
    • Make sure the packet filter on your firewall processes ICMP packets correctly
      • RFC 4821, Packetization Layer Path MTU Discovery, describes a Path MTU Discovery technique which responds more robustly to ICMP filtering.
  • Be aware of extra non-data protocol overhead if you configure encapsulation such as GRE tunneling or IPsec encryption.


  1. The TCP Maximum Segment Size and Related Topics
  2. Jumbo/Giant Frame Support on Catalyst Switches Configuration Example
  3. Ethernet Jumbo Frames
  4. IP Fragmentation: How to Avoid It? (Xml and More)
  5. The Great Jumbo Frames Debate
  6. Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC
  7. Sites with Broken/Working PMTUD
  8. Path MTU Discovery
  9. TCP headers
  10. bad TCP checksums
  11. MSS performance consideration
  12. Understanding Routing Table
  13. route (Linux man page)
  14. Docker should set host-side veth MTU #4378
  15. Add MTU to lxc conf to make host and container MTU match
  16. Xen Networking
  17. TCP parameter settings (/proc/sys/net/ipv4)
  18. Change the MTU of a network interface
    • To enable packetization-layer path MTU probing (RFC 4821) for TCP on Linux, type:
      • echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
      • echo 1024 > /proc/sys/net/ipv4/tcp_base_mss
  19. MTU manipulation
  20. Jumbo Frames, the gotcha's you need to know! (good)
  21. Understand container communication (Docker)
  22. calicoctl should allow configuration of veth MTU #488 - GitHub
  23. Linux MTU Change Size
  24. Changing the MTU size in Windows Vista, 7 or 8
  25. Linux Configure Jumbo Frames to Boost Network Performance
  26. Path MTU discovery in practice

Sunday, December 27, 2015

IP Fragmentation: How to Avoid It?

Most Ethernet LANs use an MTU of 1500 bytes (modern LANs can use Jumbo frames, allowing for an MTU up to 9000 bytes); however, border protocols like PPPoE will reduce this.

Jumbo frames offer performance benefits when enabled on 10 Gigabit and faster networks, especially with Ethernet-based storage access, vMotion, or other high-throughput applications such as Oracle RAC interconnects.[1]

Enabling Jumbo Frames may also aggravate IP fragmentation due to more frequent MTU mismatches along the transmission path. In this article, we will review IP fragmentation overhead and the way to avoid it.


To begin with, you can think of MSS and MTU as measuring two different sizes:
  • MSS is related to receiving buffer(s)
    • Can be announced via TCP Header(s)[22,24]
      • When a TCP client initiates a connection to a server, it includes its MSS as an option in the first (SYN) packet
      • When a TCP server receives a connection from a client, it includes its MSS as an option in the SYN/ACK packet
  • MTU is related to transmission link(s)
    • Can be configured on network interface(s)[15,19,20]
IP was designed for use on a wide variety of transmission links. Although the maximum length of an IP datagram is 64K, most transmission links enforce a smaller maximum packet length, called an MTU (Maximum Transmission Unit).

Originally, MSS meant how big a buffer was allocated on a receiving station to store the TCP data. In other words, the TCP Maximum Segment Size (MSS) defines the maximum amount of data that a host is willing to accept in a single TCP/IP datagram. MSS and MTU are related in this way: MSS is just the TCP data size, which does not include the IP header and the TCP header:[22]
  • MSS = MTU - sizeof(TCPHDR) - sizeof(IPHDR)
    • For example, assuming both TCPHDR and IPHDR have the default size of 20 bytes
      • MSS = 8960 if MTU= 9000
      • MSS = 1460 if MTU= 1500
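
The same formula as a minimal Python sketch. The function name mss_from_mtu is hypothetical, and the default header sizes assume no IP or TCP options.

```python
def mss_from_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS = MTU - sizeof(IPHDR) - sizeof(TCPHDR)."""
    return mtu - ip_header - tcp_header


for mtu in (9000, 1500):
    print(f"MTU {mtu} -> MSS {mss_from_mtu(mtu)}")
```

If options are in use (for example, TCP timestamps), pass a larger tcp_header value and the usable segment size shrinks accordingly.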

IP Fragmentation Overhead

The design of IP accommodates MTU differences by allowing routers to fragment IP datagrams as necessary. The receiving station is responsible for reassembling the fragments back into the original full size IP datagram. Read [2] for a good illustration of IP Fragmentation and Reassembly.
There are several issues that make IP fragmentation undesirable:
  • Overhead for Sender (or Router)
    • There is a small increase in CPU and memory overhead to fragment an IP datagram
  • Overhead for Receiver
    • The receiver must allocate memory for the arriving fragments and coalesce them back into one datagram after all of the fragments are received.
  • Overhead for handling dropped frames
    • Fragments could be dropped because of a congested link
      • If this happens, the complete original datagram will have to be retransmitted
  • Trouble for firewalls enforcing their policies
    • If the IP fragments are out of order, a firewall may block the non-initial fragments because they do not carry the information that would match the packet filter.
      • Read [2] for more explanation

How to Avoid Fragmentation

Contrary to popular belief, the MSS value is not negotiated between hosts.[22,24] However, the sender can take an additional step to avoid fragmentation on the local and remote wires. Scenario 2 illustrates this additional step:
The way MSS now works is that each host first compares its outgoing interface MTU with its own MSS buffer and chooses the lower value as the MSS to send. Each host then compares the MSS it received against its own interface MTU and again chooses the lower of the two values.
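
A minimal sketch of this two-step selection follows. The function names and the buffer sizes in the example are hypothetical. We also assume that 40 bytes of IP and TCP headers are subtracted from the interface MTU before comparing, which is our reading of the description, not something the text spells out.

```python
HEADERS = 40  # bytes; assumed IPv4 + TCP headers without options


def mss_to_send(own_mss_buffer: int, outgoing_mtu: int) -> int:
    """Step 1: the MSS a host announces in its SYN or SYN/ACK."""
    return min(own_mss_buffer, outgoing_mtu - HEADERS)


def mss_to_use(received_mss: int, outgoing_mtu: int) -> int:
    """Step 2: the segment size a host actually uses when sending."""
    return min(received_mss, outgoing_mtu - HEADERS)


# Hypothetical hosts: A sits on a 1500-byte link, B on a 9000-byte link.
a_announces = mss_to_send(own_mss_buffer=16384, outgoing_mtu=1500)
b_announces = mss_to_send(own_mss_buffer=16384, outgoing_mtu=9000)
print("A sends segments of at most", mss_to_use(b_announces, 1500))
print("B sends segments of at most", mss_to_use(a_announces, 9000))
```

In this example both directions settle on 1460 bytes, so neither end wire forces fragmentation. A smaller MTU on a link in the middle of the path is not visible to this check, which is where PMTUD comes in.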
Path MTU Discovery (PMTUD) is another technique that can help you avoid IP fragmentation. It was originally intended for routers in IPv4. However, all modern operating systems use it on endpoints. PMTUD is especially useful in network situations where intermediate links have smaller MTUs than the MTU of the end links. To avoid IP fragmentation, it dynamically determines the lowest MTU along the path from a packet's source to its destination. However, as discussed here, there are three things that can break PMTUD.

Unfortunately, IP fragmentation issues could be more widespread than you thought due to:
  • Dynamic Routing
    • The MTUs on the intermediate links may vary on different routes
  • IP tunnels
    • Tunnels cause more fragmentation because the tunnel encapsulation adds "overhead" to the size of a packet.
      • For example, Generic Routing Encapsulation (GRE) adds 24 bytes to a packet, and after this increase the packet may need to be fragmented because it is larger than the outbound MTU.
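
The GRE arithmetic can be checked with a short sketch. The function names are hypothetical, and the 24-byte figure is the one quoted above.

```python
GRE_OVERHEAD = 24  # bytes added by GRE encapsulation, as quoted above


def needs_fragmentation(packet_len: int, outbound_mtu: int,
                        overhead: int = GRE_OVERHEAD) -> bool:
    """True if the packet no longer fits the outbound link once encapsulated."""
    return packet_len + overhead > outbound_mtu


def largest_inner_packet(outbound_mtu: int, overhead: int = GRE_OVERHEAD) -> int:
    """Largest packet that can enter the tunnel without being fragmented."""
    return outbound_mtu - overhead


print(needs_fragmentation(1500, 1500))  # a full-size packet no longer fits
print(largest_inner_packet(1500))       # largest size that still fits
```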

Reference [3] lists sites that use PMTUD or alternatives to resolve IP fragmentation, as well as sites that simply allow their packets to be fragmented rather than using PMTUD. This demonstrates the complexity of fragmentation avoidance.


  1. The Great Jumbo Frames Debate
  2. Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC
  3. Sites with Broken/Working PMTUD
  4. Path MTU Discovery
  5. TCP headers
  6. bad TCP checksums
  7. MSS performance consideration
  8. Understanding Routing Table
  9. route (Linux man page)
  10. Docker should set host-side veth MTU #4378
  11. Add MTU to lxc conf to make host and container MTU match
  12. Xen Networking
  13. TCP parameter settings (/proc/sys/net/ipv4)
  14. Change the MTU of a network interface
    • tcp_base_mss, tcp_mtu_probing, etc
  15. MTU manipulation
    • Ethernet MTU vs IP MTU
      • The default Ethernet MTU is 1500 bytes and can be raised on Cisco IOS with the system mtu command under global configuration.
      • As with Ethernet frames, the MTU can be adjusted for IP packets. However, the IP MTU is configured per interface rather than system-wide, with the ip mtu command.
  16. Jumbo Frames, the gotcha's you need to know! (good)
  17. Understand container communication (Docker)
  18. calicoctl should allow configuration of veth MTU #488 - GitHub
  19. Linux MTU Change Size
  20. Changing the MTU size in Windows Vista, 7 or 8
  21. Linux Configure Jumbo Frames to Boost Network Performance
  22. The TCP Maximum Segment Size and Related Topics (RFC 879)
    • The MSS can be used completely independently in each direction of data flow.
  23. Changing TCP MSS under LINUX - Networking
  24. TCP negotiations

Sunday, December 13, 2015

Routing Table: All Things Considered

Routing is the decision over which interface a packet is to be sent. This decision has to be made for locally created packets, too. Routing tables contain network addresses and the associated interface or next hop.

In this article, we will review the main aspects of the routing table (i.e., ip route and ip rule) on a Linux server.

Routing Table

A routing table is a set of rules, often viewed in table format, that is used to determine where data packets traveling over an Internet Protocol (IP) network will be directed. All IP-enabled devices, including routers and switches, use routing tables.

Routing tables can be maintained in two ways:[1,11]
  • Manually 
    • Tables for static network devices do not change unless a network administrator manually changes them
      • Useful when only one or a few routes exist
      • Can be an administrative burden
      • Frequently used for default route[12]
    • Static routes can be added via
      • "route add" command. 
        • To persist it, you can also add the route command to rc.local
  • Dynamically
    • Devices build and maintain their routing tables automatically by using routing protocols[8] to exchange information about the surrounding network topology. 
    • Dynamic routing tables allow devices to "listen" to the network and respond to occurrences like:
      • Device failures 
      • Network congestion.

Kernel IP Routing Table

Beyond the two commonly used routing tables (the local and main routing tables), the Linux kernel supports up to 252 additional routing tables.

The ip route and ip rule commands have built in support for the special tables main and local. Any other routing tables can be referred to by number or an administratively maintained mapping file, /etc/iproute2/rt_tables.

Typical content of /etc/iproute2/rt_tables[13,14]

# reserved values
255     local
254     main
253     default
0       unspec
# local
1       inr.ruhep

Notes on these entries:
  • local: The local table is a special routing table maintained by the kernel. Users can remove entries from the local routing table at their own risk. Users cannot add entries to the local routing table. The file /etc/iproute2/rt_tables need not exist, as the iproute2 tools have a hard-coded entry for the local table.
  • main: The main routing table is the table operated upon by route and, when not otherwise specified, by ip route. The file /etc/iproute2/rt_tables need not exist, as the iproute2 tools have a hard-coded entry for the main table.
  • default: The default routing table is another special routing table.
  • unspec: Operating on the unspec routing table appears to operate on all routing tables simultaneously.
  • inr.ruhep: This is an example indicating that table 1 is known by the name inr.ruhep. Any references to table inr.ruhep in an ip rule or ip route will substitute the value 1 for the word inr.ruhep.
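
To illustrate how such a mapping file resolves table names to numbers, here is a minimal Python sketch. It is not how iproute2 itself is implemented; parse_rt_tables is a hypothetical helper and the sample text simply mirrors the entries listed above.

```python
SAMPLE_RT_TABLES = """\
# reserved values
255     local
254     main
253     default
0       unspec
# local
1       inr.ruhep
"""


def parse_rt_tables(text: str) -> dict:
    """Map table names to table numbers, ignoring comments and blank lines."""
    tables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        number, name = line.split()[:2]
        tables[name] = int(number)
    return tables


tables = parse_rt_tables(SAMPLE_RT_TABLES)
print(tables["main"], tables["inr.ruhep"])
```

On a real server you could pass the contents of /etc/iproute2/rt_tables instead of the sample text, if that file exists.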

Format of Routing Table

Each rule in the routing table usually consists of the following fields:
  • Network Destination 
    • The one which is outside your subnet
      • Basically it has different subnet mask compared to the local's
  • NetMask[9]
    • Makes it easier for the Router (i.e., layer 3 device, which isolates 2 subnets). 
      • This is used to identify which subnet the packet must go to
    • aka GenMask
      •  shows the “generality” of the route, i.e., the network mask for this route
  • Gateway
    • There could be more than one gateway within a network, so to reach the destination we configure which could be the best possible gateway
    • A gateway is an essential feature of most routers, although other devices (such as any PC or server) can function as a gateway.
  • Interface
    • You could have multiple interfaces (Ethernet interfaces eth0, eth1, eth2...) on your device, each of which is assigned an IP address
    • This provides instruction on how to reach the gateway and through which interface it needs to push the packet.
  • Metrics (Cost)[10]
    • Provides the path cost. For static routing the value is 1 by default (but it can be changed); for dynamic routing (RIP, IGRP, OSPF) it varies.
  • MSS
    • Maximum Segment Size for TCP connections over this route
      • Usually has the value of 0, meaning “no changes”
    • MTU vs MSS
      • MTU = MSS + Header (40 bytes or more)
      • For example,
        • MTU = 576 -> MSS = 536
        • MTU = 1500 -> MSS = 1460 
    • Fragmentation of data segment 
      • If the data segment size is too large for any of the routers through which the data passes, the oversize segment(s) are fragmented. 
      • This slows down the connection speed as seen by the computer user. In some cases the slowdown is dramatic.
      • The likelihood of such fragmentation can be minimized by keeping the MSS as small as reasonably possible. 
  • Window
    • Default window size, which indicates how many TCP packets can be sent before at least one of them has to be ACKnowledged. 
    • Like the MSS, this field is usually 0, meaning “no changes”
  • irtt (Initial Round Trip Time)
    • May be used by the kernel to guess about the best TCP parameters without waiting for slow replies. 
    • In practice, it’s not used much, so you’ll probably never see anything else than 0 here.

How do I get there from here?

A device uses its routing table to decide whether a packet stays in the current subnet or is pushed outside it. Given the destination address in the packet, the following rules apply:[2]
  • Position of route in the table has no significance. 
  • When more than one route matches a destination address
    • The route with the longest subnet mask (most consecutive 1-bits) is used
  • When multiple routes match the destination and have subnet masks of the same length
    • The route with the lowest metric is used
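
These selection rules can be sketched with Python's standard ipaddress module. This is an illustration rather than the kernel's implementation: the Route type, the select_route function, and the documentation-range addresses in the sample table are all hypothetical.

```python
import ipaddress
from typing import NamedTuple, Optional


class Route(NamedTuple):
    network: ipaddress.IPv4Network
    gateway: str
    metric: int
    iface: str


def select_route(routes, destination: str) -> Optional[Route]:
    """Pick the matching route with the longest mask, then the lowest metric."""
    addr = ipaddress.ip_address(destination)
    matches = [r for r in routes if addr in r.network]
    if not matches:
        return None
    # Table order plays no role: only prefix length and metric are compared.
    return min(matches, key=lambda r: (-r.network.prefixlen, r.metric))


ROUTES = [
    Route(ipaddress.ip_network("0.0.0.0/0"), "192.0.2.1", 0, "eth0"),
    Route(ipaddress.ip_network("198.51.100.0/24"), "192.0.2.2", 10, "eth0"),
    Route(ipaddress.ip_network("198.51.100.0/24"), "192.0.2.3", 5, "eth1"),
    Route(ipaddress.ip_network("198.51.100.128/25"), "192.0.2.4", 20, "eth1"),
]

print(select_route(ROUTES, "198.51.100.200").gateway)  # /25 wins despite metric 20
print(select_route(ROUTES, "198.51.100.10").gateway)   # two /24 routes: lower metric
print(select_route(ROUTES, "203.0.113.9").gateway)     # only the default route matches
```

The default route is just the match with the shortest possible mask (/0), so it is chosen only when nothing more specific applies.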

Scenario 1

Assume that a packet with destination IP address w.x.y.z arrives at a router, which checks its routing table. If the router finds that w.x.y.0/24 is present in the table, it forwards the packet to the corresponding gateway through the respective interface.

If there are two matching entries with the same mask but different metrics, the router chooses the one with the lower metric. If the router does not find any matching entry in the routing table, it sends the packet to the default gateway.

Scenario 2

Now assume another scenario: a packet with destination IP address k.l.m.n arrives at the router, and k.l.m.0/24 is the router's own network. This implies that the packet is destined for the same network, so the router will not push the packet to a peer subnet.

Linux Commands

The following Linux commands can be used to print the routing table on the server: route, netstat, and ip route.
In the discussion below, we use a server that has the following entry in its /etc/sysconfig/network configuration file:
  • Default Gateway:

and whose /sbin/ifconfig output for the interface shows:

eth0      Link encap:Ethernet  HWaddr 00:ll:mm:nn:aa:bb
          inet addr:  Bcast:  Mask:

route Command


  • Ref
    • Number of references to this route (not used in the Linux kernel)
  • Use
    • Count of lookups for the route. Depending on the use of -F and -C this will be either route cache misses (-F) or hits (-C).
  • Flags
    • U (route is up)
    • H (target is a host)
    • G (use gateway)
    • R (reinstate route for dynamic routing)
    • D (dynamically installed by daemon or redirect)
    • M (modified from routing daemon or redirect)
    • A (installed by addrconf)
    • C (cache entry)
    • ! (reject route)


  • -F
    • Operate on the kernel's FIB (Forwarding Information Base) routing table. This is the default.
  • -C
    • Operate on the kernel's routing cache.
  • -n
    • Show numerical addresses instead of trying to determine symbolic host names. This is useful if you are trying to determine why the route to your name server has vanished.

$ /sbin/route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
                                                UG    0      0        0 eth0
                                                U     0      0        0 eth0
                                                U     1002   0        0 eth0
                                                U     1010   0        0 usb0
                                                U     0      0        0 usb0

netstat Command

netstat options:
  • -r 
    • Display the kernel routing tables
    • -n 
      • Show numerical addresses instead of trying to determine symbolic host, port or user names

    $ netstat -rn
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
                                                    UG        0 0          0 eth0
                                                    U         0 0          0 eth0
                                                    U         0 0          0 eth0
                                                    U         0 0          0 usb0
                                                    U         0 0          0 usb0

    In the context of servers, 0.0.0.0 means "all IPv4 addresses on the local machine". If a host has two IP addresses and a server running on the host listens on 0.0.0.0, it will be reachable at both IPs.

    ip route Command


    • protocol
      • redirect - the route was installed due to an ICMP  redirect
      • kernel  -  the  route was installed by the kernel during autoconfiguration
      • boot  -  the  route  was  installed  during  the  bootup sequence.  If a routing daemon starts, it will purge all of them.
      • static - the route was installed by the administrator to override  dynamic  routing.  Routing daemon will respect them and, probably, even advertise them to its peers.
      • ra - the route was installed by Router Discovery  protocol
    • scope
      • global - the address is globally valid
      • site - (IPv6 only) the address is site local, i.e. it is valid inside this site
      • link - the address is link local, i.e. it is valid  only on this device
      • host - the address is valid only inside this host
    • src
      • The  source  address  to prefer when sending to the destinations  covered by the route prefix
      • Most commonly used  on multi-homed hosts, although almost every machine out there uses this hint for connections to localhost

    • ip route show - list routes
      • the command displays the contents of the routing tables or the route(s) selected by some criteria
      • table (sub command)
        • show  the  routes from provided table(s).  The default setting is to show table main.  TABLEID may either be the ID of a real table or one of the special values:
          • all - list all of the tables.
          • cache - dump the routing cache.

    $ /sbin/ip route
    default via dev eth0
    dev eth0  proto kernel  scope link  src
    dev eth0  scope link  metric 1002
    dev usb0  scope link  metric 1010
    dev usb0  proto kernel  scope link  src

    # Viewing the local routing table with ip route show table local

    $ /sbin/ip route show table local
    broadcast dev eth0  proto kernel  scope link  src
    local dev eth0  proto kernel  scope host  src
    broadcast dev eth0  proto kernel  scope link  src
    broadcast dev lo  proto kernel  scope link  src
    local dev lo  proto kernel  scope host  src
    local dev lo  proto kernel  scope host  src
    broadcast dev lo  proto kernel  scope link  src
    broadcast dev usb0  proto kernel  scope link  src
    local dev usb0  proto kernel  scope host  src

    broadcast dev usb0  proto kernel  scope link  src


    1. Routing Table Definition
    2. How do I read/interpret a (netstat) routing table ?
    3. Static Routes and the Default Gateway (Redhat)
    4. How Routing Algorithms Work
      • Hierarchical routing -- When the network size grows, the number of routers in the network increases. Consequently, the size of routing tables increases, as well, and routers can't handle network traffic as efficiently. We use hierarchical routing to overcome this problem.
      • Internet -> Clusters -> Regions -> Nodes
    5. Routing Table
    6. route
    7. ip route
    8. Introduction to routing protocols (good)
    9. IP Address Mask Formats—the Router will display different Mask formats at different times:
      • bitcount —
      • hexadecimal — 0xFFFFFF00
      • decimal — 
    10. Metrics (Cost).  Different protocols use different metrics:
      • RIP/RIPv2 is hop count and ticks (IPX)
        • Ticks are used to determine server timeout
      • OSPF/ISIS is interface cost (bandwidth) 
      • (E)IGRP is compound 
      • BGP can be complicated
    11. 3 ways of building forwarding table in router:
      • Directly connected 
        • Routes that the router is attached to
      • Static 
        • Routes are manually defined 
      • Dynamic 
        • Routes are learned from a routing protocol
    12. Default route
      • Route used if no match is found in routing table
      • Special network number: (IP) 
    13. Multiple Routing Table
    14. IPROUTE2 Utility Suite Howto
    15. Difference between routing and forwarding table
    16. Loopback address
      • A special IP number that is designated for the software loopback interface of a machine.
        • The loopback interface has no hardware associated with it, and it is not physically connected to a network.
    17. Netstat: network analysis and troubleshooting, explained

    Tuesday, December 1, 2015

    Cloud: Find Out More about Your VM

    If you can access a VM in a cloud and would like to find out more about it, here is what you can do on a Linux system.


    lscpu is a useful Linux command to uncover CPU architecture information.  It can print out the following VM information:
    • Hypervisor vendor[1]
    • Virtualization type 
    • CPU virtualization extension[2]
    Using two different servers (see below) as examples, we have found that:
    • Server #1
      • This is a Xen guest fully virtualized (HVM).
    • Server #2
      • This is a physical server.  However, it does have the virtualization extensions in hardware.
        • Another way to verify it is to:[3]
          • cat /proc/cpuinfo | grep vmx
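
The same check can be done programmatically. The sketch below scans cpuinfo-style text for the vmx (Intel VT-x) and svm (AMD-V) flags. The function name and the shortened sample text are hypothetical; on a real Linux host you would read /proc/cpuinfo instead.

```python
SAMPLE_CPUINFO = """\
processor : 0
vendor_id : GenuineIntel
flags     : fpu vme de pse tsc msr pae mce cx8 apic sep vmx est tm2 ssse3
"""


def hw_virt_flags(cpuinfo_text: str) -> set:
    """Return which hardware virtualization flags (vmx, svm) are listed."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            found.update(f for f in flags if f in ("vmx", "svm"))
    return found


print(hw_virt_flags(SAMPLE_CPUINFO))
# On a Linux host: hw_virt_flags(open("/proc/cpuinfo").read())
```

An empty result means that neither flag is listed in the text that was scanned.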
    Server #1

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Thread(s) per core:    4
    Core(s) per socket:    1
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 63
    Stepping:              2
    CPU MHz:               2294.924
    BogoMIPS:              4589.84
    Hypervisor vendor:     Xen
    Virtualization type:   full

    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              46080K
    NUMA node0 CPU(s):     0-3

    Server #2

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                16
    On-line CPU(s) list:   0-15
    Thread(s) per core:    2
    Core(s) per socket:    4
    CPU socket(s):         2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 26
    Stepping:              5
    CPU MHz:               1600.000
    BogoMIPS:              4521.27
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              8192K
    NUMA node0 CPU(s):     0-3,8-11
    NUMA node1 CPU(s):     4-7,12-15


    virt-what is another useful Linux command to detect if we are running in a virtual machine.  For example, Server #1 is a Xen guest fully virtualized as shown below:[4]

    $ sudo virt-what

    If virt-what is not installed in the system, you can install it using yum (an interactive, rpm based, package manager).

    To find out which package "virt-what" is in, type:

    $ yum whatprovides "*/virt-what"
    Loaded plugins: aliases, changelog, downloadonly, kabi, presto, refresh-
                  : packagekit, security, tmprepo, verify, versionlock
    Loading support for kernel ABI
    virt-what-1.11-1.1.el6.x86_64 : Detect if we are running in a virtual machine
    Repo        : installed
    Matched from:
    Filename    : /usr/sbin/virt-what

    To install the matched package, type (note that "-1.11-1.1.el6" in the middle of the full name has been removed):

    # yum install virt-what.x86_64
    Loaded plugins: aliases, changelog, downloadonly, kabi, presto, refresh-
                  : packagekit, security, tmprepo, verify, versionlock
    Loading support for kernel ABI
    Setting up Install Process
    Nothing to do

    Finally, if nothing is printed and virt-what exits with code 0 (no error), as on Server #2, it can mean either that the program is running on bare metal or that it is running inside a type of virtual machine which virt-what does not know about or cannot detect.


    1. How to check which hypervisor is used from my VM?
    2. Linux: Find Out If CPU Support Intel VT and AMD-V Virtualization Support
        • Hardware virtualization support:
          • vmx — Intel VT-x, virtualization support enabled in BIOS.
          • svm — AMD SVM,virtualization enabled in BIOS.
    3. Enabling Intel VT and AMD-V virtualization hardware extensions in BIOS
    4. Hardware-assisted virtualization (HVM)
        • HVM support requires special CPU extensions - VT-x for Intel processors and AMD-V for AMD based machines.
    5. Oracle Process Cloud Service 16.2.1 Release
    6. All Cloud-related articles on Xml and More

      Saturday, November 7, 2015

      Security and Isolation Implementation in Docker Containers

      Multitenancy is regarded as an important feature of cloud computing. If we consider an application running in a container as a tenant, the goal of good security-and-isolation design is to ensure that tenants running on a host only use resources visible to them.

      As container technology evolves, its implementation of security, isolation and resource control has been continually improved.  In this article, we will review how Docker containers achieve security and isolation using native container features of Linux such as namespaces, cgroups, capabilities, etc.

      Virtualization and Isolation

      Operating system-level virtualization, containers, zones, or even "chroot with steroids" are names that define the same concept of user-space isolation. Products such as Docker make use of user-space isolation on top of OS-level virtualization facilities to provide extra security.

      Since version 0.9, Docker includes the libcontainer library as its own way to directly use virtualization facilities provided by the Linux kernel, in addition to using abstracted virtualization interfaces via LXC,[1] systemd-nspawn,[2] and libvirt.[3]

      These virtualization libraries all utilize native container features of Linux (see Diagram above):
      • namespaces
      • cgroups
      • capabilities
      and more. Docker combines these components into a wrapper which it calls a container format.


      The default container format is called libcontainer. Docker also supports traditional Linux containers using LXC. In the future, Docker may support other container formats, for example, by integrating with BSD Jails or Solaris Zones.

      An execution driver is the implementation of a specific container format and is used for running Docker containers. In the latest release, libcontainer
      • Is the default execution driver for running docker containers
      • Is shipped alongside the LXC driver
      • Is a pure Go library which is developed to access the kernel’s container APIs directly, without any other dependencies
        • Docker out of the box can now manipulate namespaces, control groups, capabilities, apparmor profiles, network interfaces and firewalling rules – all in a consistent and predictable way, and without depending on LXC or any other userland package.[6]
        • You provide a root filesystem and a configuration on how libcontainer is supposed to execute a container and it does the rest.
        • It allows spawning new containers or attaching to an existing container.
        • In fact, libcontainer delivered so much needed stability that the team decided to make it the default.
          • As of Docker 0.9, LXC is now optional
            • Note that LXC driver will continue to be supported going forward.
          • To switch back to the LXC driver, simply restart the Docker daemon with
            • docker -d -e lxc


      Docker isn't virtualization as such; instead, it's an abstraction on top of the kernel's support for namespaces, which provides the isolated workspace (or container). When you run a container, Docker creates a set of namespaces for that container.

      Some of the namespaces that Docker uses on Linux are:
      • pid namespace
        • Used for process isolation (PID: Process ID).
        • Processes running inside the container appear to be running on a normal Linux system although they are sharing the underlying kernel with processes located in other namespaces.
      • net namespace
        • Used for managing network interfaces (NET: Networking).
        • DNAT allows you to configure your guest's networking independently of your host's and have a convenient interface for forwarding only the ports you want between them.
          • However, you can replace this with a bridge to a physical interface.
      • ipc namespace
        • Used for managing access to IPC resources (IPC: InterProcess Communication).
      • mnt namespace
        • Used for managing mount-points (MNT: Mount).
      • uts namespace
        • Used for isolating kernel and version identifiers. (UTS: Unix Timesharing System).
      These isolation benefits naturally come with costs. Depending on your network access patterns and memory constraints, you can choose how to configure namespaces for your containers with Docker.
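To see which namespaces a process actually belongs to, one option is to read the symbolic links under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host with procfs mounted; the function name list_namespaces and its proc_root parameter are illustrative choices, not part of Docker.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace name: identifier} for a process.

    Each entry under <proc_root>/<pid>/ns is a symbolic link whose target
    looks like "pid:[4026531836]". Two processes that share a namespace
    see the same identifier. Returns an empty dict if the directory is
    missing or unreadable (for example, on a non-Linux system).
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    namespaces = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in entries:
        try:
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Skip entries we are not permitted to read.
            continue
    return namespaces


if __name__ == "__main__":
    for name, identifier in list_namespaces().items():
        print(f"{name:12s} {identifier}")
```

Running the script once on the host and once inside a container, then comparing the identifiers, should show which namespaces differ between the two.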

      cgroups (or Control Groups)

      Docker on Linux makes use of another technology called cgroups. Because containers are ordinary Linux processes, all normal Linux resource management facilities such as scheduling and cgroups apply to them. Furthermore, there is only one level of resource allocation and scheduling because a containerized Linux system only has one kernel and the kernel has full visibility into the containers.

      In summary, cgroups allow Docker to
      • Group processes and manage their aggregate resource consumption
      • Share available hardware resources among containers
      • Limit the memory and CPU consumption of containers
        • A container can be resized by simply changing the limits of its corresponding cgroup.
        • You can gather information about resource usage in the container by examining Linux control groups in /sys/fs/cgroup.
      • Provide a reliable way of terminating all processes inside a container.
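Before inspecting resource usage, it helps to know where the cgroup hierarchies are mounted. The sketch below parses text in the /proc/mounts format and lists the cgroup mounts; the function name find_cgroup_mounts is a hypothetical helper, and the script assumes a Linux host when run directly.

```python
def find_cgroup_mounts(mounts_text):
    """Return (mount_point, fstype, options) for every cgroup mount.

    mounts_text uses the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # Ignore blank or malformed lines.
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype in ("cgroup", "cgroup2"):
            mounts.append((mount_point, fstype, options.split(",")))
    return mounts


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as mounts_file:
            text = mounts_file.read()
    except OSError:
        text = ""  # Not on Linux, or procfs is unavailable.
    for mount_point, fstype, options in find_cgroup_mounts(text):
        print(f"{fstype:8s} {mount_point}  ({','.join(options)})")
```

On older systems the mount options name the controllers (memory, cpu, and so on) attached to each hierarchy.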


      "POSIX capabilities" is what Linux uses.[9] These capabilities are a partitioning of the all powerful root privilege into a set of distinct privileges. You can see a full list of available capabilities in Linux manpages. Docker drops all capabilities except those needed, a whitelist instead of a blacklist approach.

      Your average server (bare metal or virtual machine) needs to run a bunch of processes as root. Those typically include:
      • SSH 
      • cron 
      • syslogd 
      • Hardware management tools (e.g., load modules) 
      • Network configuration tools (e.g., to handle DHCP, WPA, or VPNs),
      and much more.

      A container is very different, because almost all of those tasks are handled by the infrastructure around the container. By default, Docker starts containers with a restricted set of capabilities. In most cases, containers will not need “real” root privileges at all. For example, processes (like web servers) that just need to bind to a port below 1024 do not have to run as root: they can just be granted the CAP_NET_BIND_SERVICE capability instead. Containers can therefore run with a reduced capability set, meaning that “root” within a container has far fewer privileges than the real “root”.
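One way to check which capabilities a process really holds is to decode the CapEff bitmask that Linux reports in /proc/<pid>/status. The following is a rough sketch: the bit numbering follows <linux/capability.h>, only the first 32 capabilities are named, and the helper names decode_capabilities and read_effective_mask are made up for this example.

```python
# Capability names indexed by bit number, as defined in <linux/capability.h>.
# Only the first 32 are listed; higher bits are reported as "cap_<n>".
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP",
]


def decode_capabilities(mask_hex):
    """Turn a hexadecimal capability mask into a list of capability names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        mask >>= 1
        bit += 1
    return names


def read_effective_mask(status_path="/proc/self/status"):
    """Return the CapEff value of a process, or None if it cannot be read."""
    try:
        with open(status_path) as status_file:
            for line in status_file:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass
    return None


if __name__ == "__main__":
    mask = read_effective_mask()
    if mask is None:
        print("CapEff not available on this system")
    else:
        print(mask, decode_capabilities(mask))
```

Run as root on the host and again as root inside a container, the script should print a noticeably shorter list in the container, which illustrates the reduced capability set described above.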

      Capabilities are just one of the many security features provided by modern Linux kernels. To harden a Docker host, you can also leverage other existing, well-known systems such as AppArmor and SELinux. If your distribution comes with security model templates for Docker containers, you can use them out of the box. For instance, Docker ships a template that works with AppArmor, and Red Hat comes with SELinux policies for Docker.



      1. LXC—Linux containers.
      2. Control Centre: The systemd Linux init system
      3. The virtualization API: libvirt
      4. Solomon Hykes and others. What is Docker?
      5. How is Docker different from a normal virtual machine? (Stackoverflow)
      6. Docker 0.9: introducing execution drivers and libcontainer
        • Uses layered filesystems AuFS.
      7. Is there a formula for calculating the overhead of a Docker container?
      8. An Updated Performance Comparison of Virtual Machines and Linux Containers
      9. capabilities(7) - Linux man page
      10. Netlink (Wikipedia)
      11. The lost packages of docker
      12. ebtables/iptables interaction on a Linux-based bridge
      13. Comparsion Between AppArmor and Selinux
      14. The docker-proxy (netfilter)
      15. Hardware isolation
      16. Understand the architecture (docker)
      17. Linux kernel capabilities FAQ
      18. Docker: Differences between Container and Full VM (Xml and More)
      19. Docker vs VMs
        • There is one key metric where Docker Containers are weaker than Virtual Machines, and that’s “Isolation”. Intel’s VT-d and VT-x technologies have provided Virtual Machines with ring-1 hardware isolation of which, it takes full advantage. It helps Virtual Machines from breaking down and interfering with each other.
      20. Docker Security
      21. Introduction to Control Groups (Cgroups)
      22. Docker Runtime Metrics
        • Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup
        • On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies.
          • To figure out where your control groups are mounted, you can run:
            • $ grep cgroup /proc/mounts
      23. Oracle WebLogic Server on Docker Containers (white paper)
      24. WebLogic on Docker (GitHub)
        • Sample Docker configurations to facilitate installation, configuration, and environment setup for DevOps users. This project includes quick start dockerfiles and samples for both WebLogic 12.1.3 and 12.2.1 based on Oracle Linux and Oracle JDK 8 (Server).

      Sunday, November 1, 2015

      Docker: Differences between Container and Full VM

      A virtual machine (VM) is an emulation of a particular computer system. Virtual machines operate based on the computer architecture and functions of a real or hypothetical computer, and their implementations may involve specialized hardware, software, or a combination of both.

      In this article, we will examine the differences between a Docker Container and a Full VM  (see Note 1).

      Docker Container

      Docker is a facility for creating encapsulated computer environments; each encapsulated computer environment is called a container.[2,7]

      Starting up a Docker container is lightning fast because each container shares the host computer's copy of the kernel:
      • Each container still has its own Linux user space, but not its own kernel
      • This means there's no hypervisor and no extended bootup

      In contrast, virtual machine implementations in KVM, VirtualBox, or VMware work differently.


      • Host OS vs Guest OS
        • Host OS
          • is the original OS installed on a computer
        • Guest OS
          • is installed in a virtual machine or disk partition in addition to the host or main OS
            • In virtualization, a guest OS can be different from the host OS
            • In disk partitioning, a guest OS must be the same as the host OS
      • Hypervisor (or virtual machine monitor)
        • is a piece of computer software, firmware or hardware that creates and runs virtual machines.
        • A computer on which a hypervisor is running one or more virtual machines is defined as a host machine
        • Each virtual machine is called a guest machine.
      • Docker Container
        • An encapsulated computer environment created by Docker
        • Docker on Linux platforms
          • Building on top of facilities provided by the Linux kernel (primarily cgroups and namespaces)
          • Unlike a virtual machine, does not require or include a separate operating system
        • Docker on non-Linux platforms
          • Uses a Linux virtual machine to run the containers (see Note 1)
      • Docker daemon
        • is the persistent process that manages containers. 
          • Docker uses the same binary for both the daemon and client.
        • uses Linux-specific kernel features

      Container vs Full VM

      A fully virtualized system gets its own set of resources allocated to it and does minimal sharing. You get more isolation, but it is much heavier (it requires more resources).  With Docker containers you get less isolation, but they are more lightweight and require fewer resources, so you could easily run thousands on a host and it wouldn't even blink.[1]

      Basically, a Docker container (see Note 1) and a full VM have fundamentally different goals:
      • The goal of a VM is to fully emulate a foreign environment
        • The hypervisor in a full VM implementation is required to translate commands between the guest OS and the host OS
        • Each VM requires a full copy of the OS, the application being run, and any supporting libraries
        • If you need to simultaneously run different operating systems (like Windows, OS X, or BSD), or run programs compiled for other operating systems, you need a full virtual machine implementation.
          • In contrast, the container OS (or, more accurately, the kernel) must be the same as the host OS and is shared between container and host (see Note 1).
      • The goal of a container is to make applications portable and self-contained
        • Each container shares the host computer's copy of the kernel. 
          • This means there's no hypervisor and no extended bootup.
        • The container engine is responsible for starting and stopping containers in a similar way to the hypervisor on a VM. 
          • However, processes running inside containers are equivalent to native processes on the host and do not incur the overheads associated with hypervisor execution.


      1. In this article, we focus only on Docker implementations on Linux platforms. In other words, our discussion here excludes non-Linux platforms (e.g., Windows and Mac OS X).[2]
        • Because the Docker daemon uses Linux-specific kernel features, you can’t run Docker natively in either Windows or Mac OS X.
        • Docker on non-Linux platforms uses a Linux virtual machine to run the containers.



      1. How is Docker different from a normal virtual machine? (Stackoverflow)
      2. Newbie's Overview of Docker
      3. Supported Installation (Docker)
      4. EXTERIOR: Using Dual-VM Based External Shell for Guest-OS Introspection, Configuration, and Recovery
      5. Comparing Virtual Machines and Linux Containers Performance
      6. An Updated Performance Comparison of Virtual Machines and Linux Containers
      7. Security and Isolation Implementation in Docker Containers

      Saturday, October 24, 2015

      Does SPECjbb 2015 Have Less Variability than SPECjbb 2013?

      What's the advantage of using SPECjbb 2015 over SPECjbb 2013? The answer may be "less variability". In this article, I'm going to examine the credibility of this claim.

      Note that the experiments run here are not intended to evaluate the ultimate performance of JVMs (OpenJDK vs. HotSpot) with different configurations. Different JDK 7 versions were used for OpenJDK and HotSpot here, but the author believes that the differences between the builds should be minor.

      To recap, the main purpose of this article is comparing the variability of SPECjbb 2013 and SPECjbb 2015.

      OpenJDK vs. HotSpot

      To begin with, we want to emphasize that OpenJDK and HotSpot actually share most of their code base. Henrik Stahl, VP of Product Management in the Java Platform Group at Oracle, has shed some light on it:
      Q: What is the difference between the source code found in the OpenJDK repository, and the code you use to build the Oracle JDK?
      A: It is very close - our build process for Oracle JDK releases builds on OpenJDK 7 by adding just a couple of pieces, like the deployment code, which includes Oracle's implementation of the Java Plugin and Java WebStart, as well as some closed source third party components like a graphics rasterizer, some open source third party components, like Rhino, and a few bits and pieces here and there, like additional documentation or third party fonts. Moving forward, our intent is to open source all pieces of the Oracle JDK except those that we consider commercial features such as JRockit Mission Control (not yet available in Oracle JDK), and replace encumbered third party components with open source alternatives to achieve closer parity between the code bases.

      Raw Data

      Common Settings of the Tests
      Different Settings of Tests
      • Test 1
        • OpenJDK 
        • Huge Pages:[7] 0
      • Test 2
        • OpenJDK 
        • Huge Pages: 2200
      • Test 3
        • HotSpot
        • Huge Pages: 0
      • Test 4
        • HotSpot
        • Huge Pages: 2200
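To confirm the huge page setting before each run, the counters in /proc/meminfo can be read. This sketch assumes that the "Huge Pages" value in the test settings corresponds to the HugePages_Total counter on Linux; the function name parse_hugepages is hypothetical.

```python
def parse_hugepages(meminfo_text):
    """Extract huge page counters from text in the /proc/meminfo format.

    Returns a dict with whichever of HugePages_Total, HugePages_Free and
    Hugepagesize (in kB) are present in the input.
    """
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    info = {}
    for line in meminfo_text.splitlines():
        key, separator, rest = line.partition(":")
        values = rest.split()
        if separator and key.strip() in wanted and values:
            info[key.strip()] = int(values[0])
    return info


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as meminfo_file:
            text = meminfo_file.read()
    except OSError:
        text = ""  # Not on Linux, or procfs is unavailable.
    print(parse_hugepages(text))
```

A run configured like Test 2 or Test 4 would be expected to report HugePages_Total as 2200, while Test 1 and Test 3 would report 0.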

                       Test 1                    Test 2                    Test 3                    Test 4
                       Mean    Std Dev  CV       Mean    Std Dev  CV       Mean    Std Dev  CV       Mean    Std Dev  CV
      critical-jOPS    514     20.21    3.93%    581.80  45.26    7.78%    522.40  18.90    3.62%    571.6   36.67    6.42%


      The coefficient of variation (CV), i.e., the standard deviation divided by the mean, is used to compare the variability of the experiments.  Based on 5 sample points for each test, we have come to the following preliminary conclusions:
      • SPECjbb 2013 vs SPECjbb 2015
        • On critical-jOPS, SPECjbb 2015 exhibits less variability in all tests.
        • On max-jOPS, SPECjbb 2015 exhibits less variability in all tests except Tests 3 and 4.
      • Enabling Huge Pages or not
        • In both SPECjbb 2013 and 2015, enabling huge pages results in more variability (exception: SPECjbb 2013/OpenJDK/critical-jOPS)
      • OpenJDK vs HotSpot
        • Both JVMs exhibit similar average performance (i.e., max-jOPS and critical-jOPS).
      By no means are our results conclusive.  As more tests are run in the future, we could gain a better picture.  From the preliminary results, it seems wise to choose SPECjbb 2015 over SPECjbb 2013 for future Java SE performance tests.
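The CV values reported in the raw data can be reproduced from the mean and standard deviation. The sketch below shows the arithmetic; it assumes the sample standard deviation when working from raw scores, which may or may not be what was used for the original figures, and the function names are illustrative.

```python
import statistics


def coefficient_of_variation(mean, std_dev):
    """Return the CV as a percentage: (std_dev / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100.0


def cv_from_samples(samples):
    """Compute the CV (in percent) from raw scores.

    Uses the sample standard deviation (n - 1 in the denominator), which
    is an assumption made for this sketch.
    """
    return coefficient_of_variation(statistics.mean(samples),
                                    statistics.stdev(samples))


if __name__ == "__main__":
    # Mean and standard deviation of critical-jOPS for Test 1 and Test 2,
    # as reported in the raw data.
    print(f"Test 1: {coefficient_of_variation(514, 20.21):.2f}%")     # 3.93%
    print(f"Test 2: {coefficient_of_variation(581.80, 45.26):.2f}%")  # 7.78%
```

Because the CV is normalized by the mean, it allows variability to be compared between benchmarks whose scores are on different scales.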