Xml and More: 2015

Wednesday, December 30, 2015

Jumbo Frames—Design Considerations for Efficient Network

Each network has some maximum packet size, or maximum transmission unit (MTU). Ultimately there is some limit imposed by the technology, but often the limit is an engineering choice or even an administrative choice.^[1]

Many Gigabit Ethernet switches and Gigabit Ethernet network interface cards can support jumbo frames.^[2] Enabling jumbo frames (MTU: 9000) brings performance benefits. However, existing transmission links may still impose a smaller MTU (e.g., 1500). This can cause problems along transit paths, a situation referred to here as MTU mismatch.

In this article, we will examine the issues caused by MTU mismatch and the related design considerations.

How to Accommodate MTU Differences

When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum packet size.

Jumbo frames are Ethernet frames with more than 1500 bytes of payload.^[3] Conventionally, jumbo frames can carry up to 9000 bytes of payload, but variations exist and some care must be taken using the term. In this article, we will use MTU: 9000 and MTU: 1500 as our examples to discuss MTU-mismatch issues.

Issues

MTU is a maximum: you tell a network device NOT to drop frames unless they are larger than the maximum. A device with an MTU of 1500 can still communicate with a device with an MTU of 9000. However, when packets larger than 1500 bytes are sent from the MTU-9000 device to the MTU-1500 device, the following happens:

If the DF (Don't Fragment) flag is set

Packets will be dropped, and a router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set"

If the DF flag is not set

Packets will be fragmented to accommodate MTU differences, which incurs a cost^[4]
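
As a rough illustration of the fragmentation cost, the following sketch counts how many IPv4 fragments a single datagram turns into when it crosses a link with a smaller MTU. It assumes a 20-byte IPv4 header without options; the function name fragment_count is hypothetical and used only for this example.

```python
import math

def fragment_count(datagram_len, link_mtu, ip_header=20):
    """Number of IPv4 fragments needed to carry one datagram over a link."""
    if datagram_len <= link_mtu:
        return 1  # fits as-is, no fragmentation needed
    payload = datagram_len - ip_header
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (link_mtu - ip_header) // 8 * 8
    return math.ceil(payload / per_fragment)

print(fragment_count(9000, 1500))
```

Under these assumptions, a 9000-byte datagram crossing an MTU-1500 link is split into 7 fragments, each carrying its own IP header.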

How to Test Potential MTU Mismatch

The ping, tracepath, or traceroute (with the --mtu option) command can be used to test for potential MTU mismatches.

For example, you can verify that the path between two end nodes has at least the expected MTU using the ping command:

ping -M do -c 4 -s 8972 <destination>

The -M do option causes the DF flag to be set.

The -c option sets the number of pings.

The -s option specifies the number of bytes of padding that should be added to the echo request. In addition to this number there will be 20 bytes for the IP header and 8 bytes for the ICMP header. The amount of padding should therefore be 28 bytes less than the network-layer MTU that you are trying to test (9000 − 28 = 8972).
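
The arithmetic above can be sketched as a small helper. It assumes IPv4 without IP options; the name ping_payload_size is hypothetical.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo request header

def ping_payload_size(mtu):
    """Value to pass to ping -s so the whole IP packet is exactly mtu bytes."""
    return mtu - IP_HEADER - ICMP_HEADER

for mtu in (1500, 9000):
    print(f"MTU {mtu}: ping -M do -s {ping_payload_size(mtu)}")
```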

If the test is unsuccessful, then you should see an error in response to each echo request:

$ ping -M do -c 4 -s 8972 10.252.136.96

PING 10.252.136.96 (10.252.136.96) 8972(9000) bytes of data.
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)
From 10.249.184.27 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.252.136.96 ping statistics ---
0 packets transmitted, 0 received, +4 errors

Similarly, you can use the tracepath command to test:

$ tracepath -n -l 9000 <destination>

The -n option disables host name lookups (i.e., IP addresses are only printed numerically).

The -l option sets the initial packet length to pktlen instead of 65536 for tracepath or 128000 for tracepath6.

In the tracepath output, the last line summarizes information about the whole path to the destination:

It shows the detected Path MTU, the number of hops to the destination, and a guess at the number of hops from the destination back to us, which can be different when the path is asymmetric.

/* a packet of length 9000 cannot reach its destination */
$ tracepath -n -l 9000 10.249.184.27
1: 10.241.71.129 0.630ms
2: 10.241.152.60 0.577ms
3: 10.241.152.0 0.848ms
4: 10.246.1.49 1.007ms
5: 10.246.1.106 0.783ms
6: no reply
...
31: no reply
Too many hops: pmtu 9000
Resume: pmtu 9000

/* a packet of length 1500 reached its destination */
$ tracepath -n -l 1500 10.249.184.27
1: 10.241.71.129 0.502ms
2: 10.241.152.62 0.419ms
3: 10.241.152.4 0.543ms
4: 10.246.1.49 0.886ms
5: 10.246.1.106 0.439ms
6: 10.249.184.27 0.292ms reached
Resume: pmtu 1500 hops 6 back 59
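
If you want to check the result in a script, the summary line can be parsed as in the following sketch. It is based only on the two "Resume:" lines shown above; the regular expression and the name parse_resume are assumptions for illustration, not part of tracepath.

```python
import re

# Matches the summary line printed by tracepath, e.g.
# "Resume: pmtu 1500 hops 6 back 59". The hops and back fields are optional
# because they are absent in the output above when the destination
# was not reached.
RESUME_RE = re.compile(r"^\s*Resume: pmtu (\d+)(?: hops (\d+))?(?: back (\d+))?")

def parse_resume(line):
    """Return the fields of a tracepath summary line as a dict."""
    match = RESUME_RE.match(line)
    if match is None:
        raise ValueError(f"not a tracepath summary line: {line!r}")
    pmtu, hops, back = match.groups()
    return {
        "pmtu": int(pmtu),
        "hops": int(hops) if hops is not None else None,
        "back": int(back) if back is not None else None,
    }

print(parse_resume("Resume: pmtu 1500 hops 6 back 59"))
print(parse_resume("Resume: pmtu 9000"))
```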

When to Enable Jumbo Frames?

Enabling jumbo frame mode (for example, on Gigabit Ethernet network interface cards) can offer the following benefits:

Less consumption of bandwidth by non-data protocol overhead

Hence increased network throughput

Reduction of the packet rate

Hence reduced server overhead

The use of large MTU sizes allows the operating system to send fewer packets of a larger size to reach the same network throughput.
For example, you will see a decrease in CPU usage when transferring large files

The above factors are especially important in speeding up NFS or iSCSI traffic, which normally have larger payload sizes.
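
To put rough numbers on these benefits, the sketch below compares the two MTU sizes. It assumes 38 bytes of per-frame Ethernet overhead (preamble, header, frame check sequence, and inter-frame gap, with no VLAN tag) and 20-byte IP and TCP headers without options; the function names are hypothetical.

```python
import math

# Assumed per-frame Ethernet overhead in bytes: preamble and start
# delimiter (8), header (14), frame check sequence (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
# Assumed IPv4 and TCP headers without options.
IP_TCP_HEADERS = 20 + 20

def tcp_efficiency(mtu):
    """Fraction of the bytes on the wire that carry TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)

def packets_needed(total_bytes, mtu):
    """Number of TCP segments needed to move total_bytes of payload."""
    return math.ceil(total_bytes / (mtu - IP_TCP_HEADERS))

for mtu in (1500, 9000):
    print(f"MTU {mtu}: efficiency {tcp_efficiency(mtu):.1%}, "
          f"packets for 1 GB: {packets_needed(10**9, mtu)}")
```

Under these assumptions the payload share of each frame rises from roughly 95% to roughly 99%, and the number of packets drops by about a factor of six.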

Design Considerations

When jumbo frame mode is enabled, the trade-offs include:

Bigger I/O buffer

Required for both end nodes and intermediate transit nodes

MTU mismatch

May cause IP fragmentation or even loss of data

Therefore, some design considerations are required. For example, you can:

Avoid situations where jumbo-frame-enabled host NICs talk to non-jumbo-frame-enabled host NICs.

One design trick is to send your NFS or iSCSI traffic via a dedicated NIC and your normal host traffic via a non-jumbo-MTU interface

If your workload only includes small messages, then the larger MTU size will not help

Be sure to test with commands that set the Don't Fragment bit, to ensure that your hosts which are configured for jumbo frames are able to successfully communicate with each other via jumbo frames.

Enable Path MTU Discovery (PMTUD)^[18]

When possible, use the largest MTU size that the adapter and network support, constrained by the Path MTU
Make sure the packet filter on your firewall processes ICMP packets correctly

RFC 4821, Packetization Layer Path MTU Discovery, describes a Path MTU Discovery technique which responds more robustly to ICMP filtering.

Be aware of extra non-data protocol overhead if you configure encapsulation such as GRE tunneling or IPsec encryption.

References

The TCP Maximum Segment Size and Related Topics
Jumbo/Giant Frame Support on Catalyst Switches Configuration Example
Ethernet Jumbo Frames
IP Fragmentation: How to Avoid It? (Xml and More)
The Great Jumbo Frames Debate
Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC
Sites with Broken/Working PMTUD
Path MTU Discovery
TCP headers
bad TCP checksums
MSS performance consideration
Understanding Routing Table
route (Linux man page)
Docker should set host-side veth MTU #4378
Add MTU to lxc conf to make host and container MTU match
Xen Networking
TCP parameter settings (/proc/sys/net/ipv4)
Change the MTU of a network interface

To enable TCP MTU probing (packetization-layer PMTUD) on Linux, type:

echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
echo 1024 > /proc/sys/net/ipv4/tcp_base_mss

MTU manipulation
Jumbo Frames, the gotcha's you need to know! (good)
Understand container communication (Docker)
calicoctl should allow configuration of veth MTU #488 - GitHub
Linux MTU Change Size
Changing the MTU size in Windows Vista, 7 or 8
Linux Configure Jumbo Frames to Boost Network Performance
Path MTU discovery in practice
Odd tracepath and ping behavior when using a 9000 byte MTU
How to Read a Traceroute

Sunday, December 27, 2015

IP Fragmentation: How to Avoid It?

Most Ethernet LANs use an MTU of 1500 bytes (modern LANs can use Jumbo frames, allowing for an MTU up to 9000 bytes); however, border protocols like PPPoE will reduce this.

Jumbo frames have performance benefits when enabled on 10 Gigabit and faster networks, especially when using Ethernet-based storage access, vMotion, or other high-throughput applications such as Oracle RAC interconnects.^[1]

Enabling Jumbo Frames may also aggravate IP fragmentation due to more frequent MTU mismatches along the transmission path. In this article, we will review IP fragmentation overhead and the way to avoid it.

MSS vs MTU

To begin with, you can think of these terms as measuring two different sizes:

MSS is related to receiving buffer(s)

Can be announced via TCP Header(s)^[22,24]

When a TCP client initiates a connection to a server, it includes its MSS as an option in the first (SYN) packet
When a TCP server receives a connection from a client, it includes its MSS as an option in the SYN/ACK packet

MTU is related to transmission link(s)

Can be configured on network interface(s)^[15,19,20]

The IP protocol was designed for use on a wide variety of transmission links. Although the maximum length of an IP datagram is 64K, most transmission links enforce a smaller maximum packet length limit, called an MTU (or Maximum Transmission Unit).

Originally, MSS meant how big a buffer was allocated on a receiving station to store the TCP data. In other words, the TCP Maximum Segment Size (MSS) defines the maximum amount of data that a host is willing to accept in a single TCP/IP datagram. MSS and MTU are related as follows. MSS is just the TCP data size, which includes neither the IP header nor the TCP header:^[22]

MSS = MTU - sizeof(TCPHDR) - sizeof(IPHDR)

For example, assuming both TCPHDR and IPHDR have the default size of 20 bytes:

MSS = 8960 if MTU = 9000
MSS = 1460 if MTU = 1500

IP Fragmentation Overhead

The design of IP accommodates MTU differences by allowing routers to fragment IP datagrams as necessary. The receiving station is responsible for reassembling the fragments back into the original full size IP datagram. Read [2] for a good illustration of IP Fragmentation and Reassembly.

There are several issues that make IP fragmentation undesirable:

Overhead for Sender (or Router)

There is a small increase in CPU and memory overhead to fragment an IP datagram

Overhead for Receiver

The receiver must allocate memory for the arriving fragments and coalesce them back into one datagram after all of the fragments are received.

Overhead for handling dropped frames

Fragments could be dropped because of a congested link

If this happens, the complete original datagram will have to be retransmitted

Trouble for firewalls enforcing their policies

If the IP fragments are out of order, a firewall may block the non-initial fragments because they do not carry the information that would match the packet filter.

Read [2] for more explanation

How to Avoid Fragmentation

Contrary to popular belief, the MSS value is not negotiated between hosts.^[22,24] However, the sender can take an additional step to avoid fragmentation on the local and remote wires. This additional step works as follows:

Each host will first compare its outgoing interface MTU (less 40 bytes for the IP and TCP headers) with its own MSS buffer and choose the lower value as the MSS to send. The hosts will then compare the MSS size received against their own interface MTU (again less the headers) and again choose the lower of the two values.
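
The two comparisons can be sketched as follows. The hosts, their MTUs, and their buffer sizes are hypothetical values chosen for illustration, and the 40-byte figure assumes IP and TCP headers without options.

```python
IP_TCP_HEADERS = 40  # 20-byte IP header + 20-byte TCP header, no options

def advertised_mss(interface_mtu, mss_buffer):
    """MSS a host announces in its SYN or SYN/ACK."""
    return min(mss_buffer, interface_mtu - IP_TCP_HEADERS)

def send_mss(received_mss, interface_mtu):
    """Largest segment a host will actually send to its peer."""
    return min(received_mss, interface_mtu - IP_TCP_HEADERS)

# Hypothetical hosts: A uses jumbo frames, B does not.
host_a = {"mtu": 9000, "buffer": 16384}
host_b = {"mtu": 1500, "buffer": 16384}

a_announces = advertised_mss(host_a["mtu"], host_a["buffer"])
b_announces = advertised_mss(host_b["mtu"], host_b["buffer"])
print("A announces", a_announces, "and B announces", b_announces)
print("A sends at most", send_mss(b_announces, host_a["mtu"]))
print("B sends at most", send_mss(a_announces, host_b["mtu"]))
```

In this example both directions end up limited to 1460 bytes, which avoids fragmentation on the two end links but says nothing about smaller MTUs in between.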

Path MTU Discovery (PMTUD) is another technique that can help you avoid IP fragmentation. It was originally intended for routers in IPv4. However, all modern operating systems use it on endpoints. PMTUD is especially useful in network situations where intermediate links have smaller MTUs than the MTU of the end links. To avoid IP fragmentation, it dynamically determines the lowest MTU along the path from a packet's source to its destination. However, as discussed here, there are three things that can break PMTUD.

Unfortunately, IP fragmentation issues could be more widespread than you thought due to:

Dynamic Routing

The MTUs on the intermediate links may vary on different routes

IP tunnels

Tunnels cause more fragmentation because the tunnel encapsulation adds "overhead" to the size of a packet.

For example, adding Generic Routing Encapsulation (GRE) adds 24 bytes to a packet, and after this increase the packet may need to be fragmented because it is larger than the outbound MTU.
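
The effect of the encapsulation overhead can be sketched as below, using the 24-byte GRE figure quoted above. The function names are hypothetical, and other tunnel types would need their own overhead value.

```python
GRE_OVERHEAD = 24  # bytes added by GRE encapsulation, as quoted in the text

def tunnel_payload_mtu(link_mtu, overhead=GRE_OVERHEAD):
    """Largest inner packet that still fits on the link after encapsulation."""
    return link_mtu - overhead

def needs_fragmentation(packet_len, link_mtu, overhead=GRE_OVERHEAD):
    """True if the encapsulated packet exceeds the outbound MTU."""
    return packet_len + overhead > link_mtu

print(tunnel_payload_mtu(1500))
print(needs_fragmentation(1500, 1500))
print(needs_fragmentation(1476, 1500))
```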

Reference [3] lists which sites use PMTUD or alternatives to resolve IP fragmentation. It also lists sites that simply allow their packets to be fragmented rather than using PMTUD. This demonstrates the complexity of fragmentation avoidance.

References

The Great Jumbo Frames Debate
Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC
Sites with Broken/Working PMTUD
Path MTU Discovery
TCP headers
bad TCP checksums
MSS performance consideration
Understanding Routing Table
route (Linux man page)
Docker should set host-side veth MTU #4378
Add MTU to lxc conf to make host and container MTU match
Xen Networking
TCP parameter settings (/proc/sys/net/ipv4)
Change the MTU of a network interface

tcp_base_mss, tcp_mtu_probing, etc

MTU manipulation

Ethernet MTU vs IP MTU

The default Ethernet MTU is 1500 bytes and can be raised on Cisco IOS with the system mtu command in global configuration mode.
As with Ethernet frames, the MTU can be adjusted for IP packets. However, the IP MTU is configured per interface rather than system-wide, with the ip mtu command.

Jumbo Frames, the gotcha's you need to know! (good)
Understand container communication (Docker)
calicoctl should allow configuration of veth MTU #488 - GitHub
Linux MTU Change Size
Changing the MTU size in Windows Vista, 7 or 8
Linux Configure Jumbo Frames to Boost Network Performance
The TCP Maximum Segment Size and Related Topics (RFC 879)

The MSS can be used completely independently in each direction of data flow.

Changing TCP MSS under LINUX - Networking
TCP negotiations

Sunday, December 13, 2015

Routing Table: All Things Considered

Routing is the decision over which interface a packet is to be sent. This decision has to be made for locally created packets, too. Routing tables contain network addresses and the associated interface or next hop.

In this article, we will review all aspects of the routing table (i.e., ip route and ip rule) on a Linux server.

Routing Table

A routing table is a set of rules, often viewed in table format, that is used to determine where data packets traveling over an Internet Protocol (IP) network will be directed. All IP-enabled devices, including routers and switches, use routing tables.

Routing tables can be maintained in two ways:^[1,11]

Manually

Tables for static network devices do not change unless a network administrator manually changes them

Useful when few or just one route exist
Can be administrative burden
Frequently used for default route^[12]

Static routes can be added via the "route add" command.

To persist a route, you can also add the route command to rc.local

Dynamically

Devices build and maintain their routing tables automatically by using routing protocols^[8] to exchange information about the surrounding network topology.
Dynamic routing tables allow devices to "listen" to the network and respond to occurrences like:

Device failures
Network congestion.

Kernel IP Routing Table

Beyond the two commonly used routing tables (the local and main routing tables), the Linux kernel supports up to 252 additional routing tables.

The ip route and ip rule commands have built in support for the special tables main and local. Any other routing tables can be referred to by number or an administratively maintained mapping file, /etc/iproute2/rt_tables.

Typical content of /etc/iproute2/rt_tables^[13,14]

#
# reserved values
#
255     local      
254     main       
253     default    
0       unspec     
#
# local
#
1      inr.ruhep

	The `local` table is a special routing table maintained by the kernel. Users can remove entries from the local routing table at their own risk. Users cannot add entries to the local routing table. The file `/etc/iproute2/rt_tables` need not exist, as the iproute2 tools have a hard-coded entry for the `local` table.
	The main routing table is the table operated upon by route and, when not otherwise specified, by ip route. The file `/etc/iproute2/rt_tables` need not exist, as the iproute2 tools have a hard-coded entry for the `main` table.
	The `default` routing table is another special routing table.
	Operating on the `unspec` routing table appears to operate on all routing tables simultaneously.
	This is an example indicating that table 1 is known by the name inr.ruhep. Any references to `table inr.ruhep` in an ip rule or ip route will substitute the value 1 for the word inr.ruhep.
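
The name-to-number mapping in this file is simple enough to read programmatically. The following sketch parses the sample content shown above; the function name parse_rt_tables is hypothetical, and the parsing rules (text after "#" is a comment, first two fields are ID and name) are an assumption based on that sample.

```python
SAMPLE_RT_TABLES = """\
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
1      inr.ruhep
"""

def parse_rt_tables(text):
    """Map routing table names to their numeric IDs."""
    tables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        table_id, name = parts[0], parts[1]
        tables[name] = int(table_id)
    return tables

print(parse_rt_tables(SAMPLE_RT_TABLES))
```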

Format of Routing Table

Each rule in the routing table usually consists of the following fields:

Network Destination

The network (or host) that the route leads to, typically one outside your own subnet

Basically, it has a different subnet mask compared to the local one

NetMask^[9]

Makes forwarding decisions easier for the router (i.e., a layer 3 device, which isolates two subnets).

This is used to identify which subnet the packet must go to

aka GenMask

Shows the “generality” of the route, i.e., the network mask for this route

Gateway

There could be more than one gateway within a network, so we configure which gateway is the best one to use to reach the destination
A gateway is an essential feature of most routers, although other devices (such as any PC or server) can function as a gateway.

Interface

You could have multiple interfaces (Ethernet interfaces such as eth0, eth1, eth2...) on your device, each of which would be assigned an IP address
This tells the device how to reach the gateway, i.e., through which interface it needs to push the packet.

Metrics (Cost)^[10]

Provides the path cost. For static routing the value is 1 by default (but it can be changed); for dynamic routing (RIP, IGRP, OSPF) it varies.

MSS

Maximum Segment Size for TCP connections over this route

Usually has the value of 0, meaning “no changes”

MTU vs MSS

MTU = MSS + Header (40 bytes or more)
For example,

MTU = 576 -> MSS = 536
MTU = 1500 -> MSS = 1460

Fragmentation of data segment

If the data segment size is too large for any of the routers through which the data passes, the oversize segment(s) are fragmented.
This slows down the connection speed as seen by the computer user. In some cases the slowdown is dramatic.
The likelihood of such fragmentation can be minimized by keeping the MSS as small as reasonably possible.

Window

Default window size, which indicates how much TCP data can be sent before some of it has to be acknowledged.
Like the MSS, this field is usually 0, meaning “no changes”

irtt (Initial Round Trip Time)

May be used by the kernel to guess about the best TCP parameters without waiting for slow replies.
In practice, it’s not used much, so you’ll probably never see anything other than 0 here.

How do I get there from here?

A device uses its routing table to decide whether a packet stays in the current subnet or is pushed outside it. Here are the rules that apply, given the destination address in the packet:^[2]

Position of route in the table has no significance.
When more than one route matches a destination address

The route with the longest subnet mask (most consecutive 1-bits) is used

When multiple routes match the destination and have subnet masks of the same length

The route with the lowest metric is used
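
The selection rules above can be sketched with Python's standard ipaddress module. The route entries below are sample values chosen for illustration, and select_route is a hypothetical helper, not how the kernel implements its lookup.

```python
import ipaddress

# Sample routes: (destination prefix, gateway, interface, metric)
ROUTES = [
    ("0.0.0.0/0", "10.244.0.1", "eth0", 0),
    ("10.244.0.0/21", None, "eth0", 0),
    ("169.254.0.0/16", None, "eth0", 1002),
    ("169.254.0.0/16", None, "usb0", 1010),
    ("169.254.182.0/24", None, "usb0", 0),
]

def select_route(destination, routes=ROUTES):
    """Pick the matching route with the longest prefix, then the lowest metric."""
    addr = ipaddress.ip_address(destination)
    matches = [r for r in routes if addr in ipaddress.ip_network(r[0])]
    if not matches:
        return None
    # Table order plays no role: sort key is (longest prefix, lowest metric).
    return min(matches,
               key=lambda r: (-ipaddress.ip_network(r[0]).prefixlen, r[3]))

for dest in ("169.254.182.5", "169.254.1.1", "10.244.3.87", "8.8.8.8"):
    print(dest, "->", select_route(dest))
```

The second lookup shows the metric tie-break: both 169.254.0.0/16 routes match with the same prefix length, so the one with metric 1002 (eth0) is chosen.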

Scenario 1

Assume that a packet with destination IP address w.x.y.z arrives at a router, which checks its routing table. If the router finds that w.x.y.0/24 is present in the table, it will forward the packet to the corresponding gateway through the respective interface.

If there are two entries for the same destination with different metrics, it will choose the one with the lower metric. If the router does not find any matching entry in the routing table, it will send the packet to the default gateway.

Scenario 2

Now assume another scenario: a packet with the destination IP address k.l.m.n arrives at the router, and k.l.m.0/24 is the router's own subnet. This implies that the packet is destined for the same network, so the router will not push the packet to the peer subnet.

Linux Commands

The following Linux commands (route, netstat, and ip route) can be used to print the routing table on the server.

In the discussions below, we will use a server which has the following entry in its /etc/sysconfig/network configuration file:

Default Gateway: 10.244.0.1

and whose eth0 entry in the /sbin/ifconfig output shows:

eth0 Link encap:Ethernet HWaddr 00:ll:mm:nn:aa:bb

inet addr:10.244.3.87 Bcast:10.244.7.255 Mask:255.255.248.0
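
As a quick check of how this address and mask relate to the routes shown later, the following sketch uses Python's standard ipaddress module to derive the network, the broadcast address, and whether the default gateway is on-link. The variable names are arbitrary.

```python
import ipaddress

# Address and mask as reported for eth0 above.
iface = ipaddress.ip_interface("10.244.3.87/255.255.248.0")
gateway = ipaddress.ip_address("10.244.0.1")

print(iface.network)                    # 10.244.0.0/21
print(iface.network.broadcast_address)  # 10.244.7.255
print(gateway in iface.network)         # True
```

The /21 prefix corresponds to the 255.255.248.0 mask, and the broadcast address matches the Bcast value reported for the interface.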

route Command

Output

Ref

Number of references to this route (not used in the Linux kernel)

Use

Count of lookups for the route. Depending on the use of -F and -C this will be either route cache misses (-F) or hits (-C).

Flags

U (route is up)
H (target is a host)
G (use gateway)
R (reinstate route for dynamic routing)
D (dynamically installed by daemon or redirect)
M (modified from routing daemon or redirect)
A (installed by addrconf)
C (cache entry)
! (reject route)

Option

-F

Operate on the kernel's FIB (Forwarding Information Base) routing table. This is the default.

-C

Operate on the kernel's routing cache.

-n

Show numerical addresses instead of trying to determine symbolic host names. This is useful if you are trying to determine why the route to your name server has vanished.

$ /sbin/route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.244.0.1      0.0.0.0         UG    0      0        0 eth0
10.244.0.0      0.0.0.0         255.255.248.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1002   0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     1010   0        0 usb0
169.254.182.0   0.0.0.0         255.255.255.0   U     0      0        0 usb0
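
To read the Flags column at a glance, a small lookup table is enough. The sketch below reuses the flag descriptions listed above; decode_flags is a hypothetical helper.

```python
# Flag meanings as listed for the route command output.
FLAG_MEANINGS = {
    "U": "route is up",
    "H": "target is a host",
    "G": "use gateway",
    "R": "reinstate route for dynamic routing",
    "D": "dynamically installed by daemon or redirect",
    "M": "modified from routing daemon or redirect",
    "A": "installed by addrconf",
    "C": "cache entry",
    "!": "reject route",
}

def decode_flags(flags):
    """Translate a Flags column value such as 'UG' into descriptions."""
    return [FLAG_MEANINGS.get(f, f"unknown flag {f!r}") for f in flags]

print(decode_flags("UG"))
print(decode_flags("U"))
```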

netstat Command

netstat options:
-r 
Display the kernel routing tables
-n 
Show numerical addresses instead of trying to determine symbolic host, port or user names

$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.244.0.1      0.0.0.0         UG        0 0          0 eth0
10.244.0.0      0.0.0.0         255.255.248.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 usb0
169.254.182.0   0.0.0.0         255.255.255.0   U         0 0          0 usb0

In the context of servers, 0.0.0.0 means "all IPv4 addresses on the local machine". If a host has two IP addresses, 192.168.1.1 and 10.1.2.1, and a server running on the host listens on 0.0.0.0, it will be reachable at both  IPs.

ip route Command

Output

protocol

redirect - the route was installed due to an ICMP redirect
kernel - the route was installed by the kernel during autoconfiguration
boot - the route was installed during the bootup sequence. If a routing daemon starts, it will purge all of them.
static - the route was installed by the administrator to override dynamic routing. Routing daemon will respect them and, probably, even advertise them to its peers.
ra - the route was installed by Router Discovery protocol

scope

global - the address is globally valid
site - (IPv6 only) the address is site local, i.e. it is valid inside this site
link - the address is link local, i.e. it is valid only on this device
host - the address is valid only inside this host

src

The source address to prefer when sending to the destinations covered by the route prefix
Most commonly used on multi-homed hosts, although almost every machine out there uses this hint for connections to localhost

Option

ip route show - list routes

the command displays the contents of the routing tables or the route(s) selected by some criteria
table (sub command)

show the routes from provided table(s). The default setting is to show table main. TABLEID may either be the ID of a real table or one of the special values:

all - list all of the tables.
cache - dump the routing cache.

$ /sbin/ip route
default via 10.244.0.1 dev eth0
10.244.0.0/21 dev eth0 proto kernel scope link src 10.244.3.87
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev usb0 scope link metric 1010
169.254.182.0/24 dev usb0 proto kernel scope link src 169.254.182.77

# Viewing the local routing table with ip route show table local

$ /sbin/ip route show table local
broadcast 10.244.0.0 dev eth0 proto kernel scope link src 10.244.3.87
local 10.244.3.87 dev eth0 proto kernel scope host src 10.244.3.87
broadcast 10.244.7.255 dev eth0 proto kernel scope link src 10.244.3.87
broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
broadcast 169.254.182.0 dev usb0 proto kernel scope link src 169.254.182.77
local 169.254.182.77 dev usb0 proto kernel scope host src 169.254.182.77
broadcast 169.254.182.255 dev usb0 proto kernel scope link src 169.254.182.7

References

Routing Table Definition
How do I read/interpret a (netstat) routing table ?
Static Routes and the Default Gateway (Redhat)
How Routing Algorithms Work

Hierarchical routing -- When the network size grows, the number of routers in the network increases. Consequently, the size of routing tables increases, as well, and routers can't handle network traffic as efficiently. We use hierarchical routing to overcome this problem.
Internet -> Clusters -> Regions -> Nodes

Routing Table
route
ip route
Introduction to routing protocols (good)
IP Address Mask Formats: the router will display different mask formats at different times:

bitcount: 172.16.31.6/24
hexadecimal: 172.16.31.6 0xFFFFFF00
decimal: 172.16.31.6 255.255.255.0

Metrics (Cost). Different protocols use different metrics:

RIP/RIPv2 is hop count and ticks (IPX)

Ticks are used to determine server timeout

OSPF/ISIS is interface cost (bandwidth)
(E)IGRP is compound
BGP can be complicated

3 ways of building forwarding table in router:

Directly connected

Routes that the router is attached to

Static

Routes are manually defined

Dynamic

Routes are learned from a routing protocol

Default route

Route used if no match is found in routing table
Special network number: 0.0.0.0 (IP)

Multiple Routing Table
IPROUTE2 Utility Suite Howto
Difference between routing and forwarding table
Loopback address

A special IP number (127.0.0.1) that is designated for the software loopback interface of a machine.

The loopback interface has no hardware associated with it, and it is not physically connected to a network.

Netstat: network analysis and troubleshooting, explained

Tuesday, December 1, 2015

Cloud: Find Out More about Your VM

If you can access a VM in a cloud and would like to find out more about it, here is what you can do on a Linux system.

lscpu

lscpu is a useful Linux command to uncover CPU architecture information. It can print out the following VM information:

Hypervisor vendor^[1]
Virtualization type
CPU virtualization extension^[2]

Using two different servers (see below) as examples, we have found that:

Server #1

This is a Xen guest fully virtualized (HVM).

Server #2

This is a physical server. However, it does have the virtualization extensions in hardware.

Another way to verify it is to:^[3]

cat /proc/cpuinfo | grep vmx
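
The same check can be scripted. The sketch below looks for the vmx (Intel VT-x) and svm (AMD SVM) flags in the text of /proc/cpuinfo; the sample text is a hypothetical, shortened excerpt, and virtualization_flags is a name chosen for this example.

```python
def virtualization_flags(cpuinfo_text):
    """Return which of the vmx/svm CPU flags appear in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Compare whole flag names rather than substrings.
            found |= {"vmx", "svm"} & set(value.split())
    return found

# Hypothetical, shortened excerpt of /proc/cpuinfo.
sample = "processor\t: 0\nflags\t\t: fpu vme de pse vmx sse2\n"
print(virtualization_flags(sample))

# On a Linux host, the real file could be read instead:
# with open("/proc/cpuinfo") as f:
#     print(virtualization_flags(f.read()))
```

An empty result means neither flag is exposed to the system being inspected.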

Server #1

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    4
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2294.924
BogoMIPS:              4589.84
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-3

Server #2

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              5
CPU MHz:               1600.000
BogoMIPS:              4521.27
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15

virt-what

virt-what is another useful Linux command to detect if we are running in a virtual machine. For example, Server #1 is a Xen guest fully virtualized as shown below:^[4]

$ sudo virt-what
xen
xen-hvm

If virt-what is not installed in the system, you can install it using yum (an interactive, rpm based, package manager).

To find out which package "virt-what" is in, type:

$ yum whatprovides "*/virt-what"
Loaded plugins: aliases, changelog, downloadonly, kabi, presto, refresh-
              : packagekit, security, tmprepo, verify, versionlock
Loading support for kernel ABI
virt-what-1.11-1.1.el6.x86_64 : Detect if we are running in a virtual machine
Repo : installed
Matched from:
Filename : /usr/sbin/virt-what

To install the matched package, type (note that "-1.11-1.1.el6" in the middle of the full name has been removed):

# yum install virt-what.x86_64
Loaded plugins: aliases, changelog, downloadonly, kabi, presto, refresh-
              : packagekit, security, tmprepo, verify, versionlock
Loading support for kernel ABI
Setting up Install Process
Nothing to do

Finally, if nothing is printed and "virt-what" exits with code 0 (or no error) as in Server #2, then it can mean either that the program is running on bare metal or that it is running inside a type of virtual machine which we don't know about or cannot detect.

References

How to check which hypervisor is used from my VM?
Linux: Find Out If CPU Support Intel VT and AMD-V Virtualization Support

Hardware virtualization support:

vmx: Intel VT-x, virtualization support enabled in BIOS.
svm: AMD SVM, virtualization enabled in BIOS.

Enabling Intel VT and AMD-V virtualization hardware extensions in BIOS
Hardware-assisted virtualization (HVM)

HVM support requires special CPU extensions - VT-x for Intel processors and AMD-V for AMD based machines.

Oracle Process Cloud Service 16.2.1 Release
All Cloud-related articles on Xml and More

Saturday, November 7, 2015

Security and Isolation Implementation in Docker Containers

Multitenancy is regarded as an important feature of cloud computing. If we consider an application running in a container as a tenant, the goal of good security-and-isolation design is to ensure that tenants running on a host only use resources visible to them.

As container technology evolves, its implementation of security, isolation and resource control has been continually improved. In this article, we will review how Docker containers achieve security and isolation by utilizing native container features of Linux such as namespaces, cgroups, capabilities, etc.

Virtualization and Isolation

Operating system-level virtualization, containers, zones, or even "chroot with steroids" are names that define the same concept of user-space isolation. Products such as Docker make use of user-space isolation on top of OS-level virtualization facilities to provide extra security.

Since version 0.9, Docker includes the libcontainer library as its own way to directly use virtualization facilities provided by the Linux kernel, in addition to using abstracted virtualization interfaces via LXC,^[1] systemd-nspawn,^[2] and libvirt.^[3]

These virtualization libraries all utilize native container features of Linux:

namespaces
cgroups
capabilities

and more. Docker combines these components into a wrapper which it calls a container format.

libcontainer

The default container format is called libcontainer. Docker also supports traditional Linux containers using LXC. In the future, Docker may support other container formats, for example, by integrating with BSD Jails or Solaris Zones.

An execution driver is the implementation of a specific container format and is used for running Docker containers. In the latest release, libcontainer:

Is the default execution driver for running docker containers
Is shipped alongside the LXC driver
Is a pure Go library which is developed to access the kernel’s container APIs directly, without any other dependencies

Docker out of the box can now manipulate namespaces, control groups, capabilities, apparmor profiles, network interfaces and firewalling rules – all in a consistent and predictable way, and without depending on LXC or any other userland package.^[6]
You provide a root filesystem and a configuration on how libcontainer is supposed to execute a container and it does the rest.
It allows spawning new containers or attaching to an existing container.
In fact, libcontainer delivered so much needed stability that the team decided to make it the default.

As of Docker 0.9, LXC is now optional

Note that the LXC driver will continue to be supported going forward.

To switch back to the LXC driver, simply restart the Docker daemon with

docker -d -e lxc

namespaces

Docker isn't virtualization as such. Instead, it's an abstraction on top of the kernel's support for namespaces, which provide the isolated workspace (or container). When you run a container, Docker creates a set of namespaces for that container.

Some of the namespaces that Docker uses on Linux are:

pid namespace

Used for process isolation (PID: Process ID).
Processes running inside the container appear to be running on a normal Linux system although they are sharing the underlying kernel with processes located in other namespaces.

net namespace

Used for managing network interfaces (NET: Networking).
DNAT allows you to configure your guest's networking independently of your host's and have a convenient interface for forwarding only the ports you want between them.

However, you can replace this with a bridge to a physical interface.

ipc namespace

Used for managing access to IPC resources (IPC: InterProcess Communication).

mnt namespace

Used for managing mount-points (MNT: Mount).

uts namespace

Used for isolating kernel and version identifiers (UTS: Unix Timesharing System).

These isolation benefits naturally come with costs. Based on your network access patterns and memory constraints, you may choose how to configure namespaces for your containers with Docker.
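
To see which namespaces a process belongs to, we can read the symbolic links under /proc/<pid>/ns, whose targets look like "pid:[4026531836]". The sketch below assumes a Linux host; the helper names parse_ns_link and namespaces_of are invented for this example and are not part of Docker.

```python
import os
import re

# Namespace link targets look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns/* link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to the inode of the namespace it points at."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result
```

Two processes share a namespace exactly when the corresponding inode numbers are equal, so comparing namespaces_of() for a host process and a containerized process shows which namespaces Docker created for the container. Reading the links of another user's process may require root.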

cgroups (or Control Groups)

Docker on Linux makes use of another technology called cgroups. Because the processes in a container are ordinary Linux processes, all normal Linux resource management facilities such as scheduling and cgroups apply to them. Furthermore, there is only one level of resource allocation and scheduling because a containerized Linux system has only one kernel, and that kernel has full visibility into the containers.

In summary, cgroups allow Docker to

Group processes and manage their aggregate resource consumption
Share available hardware resources to containers
Limit the memory and CPU consumption of containers

A container can be resized by simply changing the limits of its corresponding cgroup.
You can gather information about resource usage in the container by examining Linux control groups in /sys/fs/cgroup.

Provide a reliable way of terminating all processes inside a container.
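
As a rough illustration of how a process maps to its control groups, the sketch below parses the contents of /proc/<pid>/cgroup, where each line has the form "hierarchy-ID:controller-list:cgroup-path". The function name parse_cgroup_file and the container ID in the usage note are hypothetical.

```python
from pathlib import Path


def parse_cgroup_file(text):
    """Map each controller name to the cgroup path the process belongs to."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # cgroup v2 entries have an empty controller list ("0::/path").
        names = controllers.split(",") if controllers else ["(unified)"]
        for name in names:
            membership[name] = path
    return membership


if __name__ == "__main__":
    cgroup_file = Path("/proc/self/cgroup")
    if cgroup_file.exists():  # only present on Linux
        for controller, path in sorted(parse_cgroup_file(cgroup_file.read_text()).items()):
            print(f"{controller:12} {path}")
```

For a containerized process, a line such as "4:memory:/docker/0123abcd" (a made-up ID) would point to a directory under the memory hierarchy in /sys/fs/cgroup, where the limit and usage files for that container can be read.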

Capabilities^[20]

"POSIX capabilities" is what Linux uses.^[9] These capabilities partition the all-powerful root privilege into a set of distinct privileges. You can see a full list of available capabilities in the Linux manpages. Docker drops all capabilities except those needed, which is a whitelist rather than a blacklist approach.

Your average server (bare metal or virtual machine) needs to run a bunch of processes as root. Those typically include:

SSH
cron
syslogd
Hardware management tools (e.g., load modules)
Network configuration tools (e.g., to handle DHCP, WPA, or VPNs),

and much more.

A container is very different, because almost all of those tasks are handled by the infrastructure around the container. By default, Docker starts containers with a restricted set of capabilities. In most cases, containers will not need “real” root privileges at all. For example, processes (like web servers) that just need to bind to a port below 1024 do not have to run as root: they can be granted the CAP_NET_BIND_SERVICE capability instead. Containers can therefore run with a reduced capability set, meaning that “root” within a container has far fewer privileges than the real “root”.
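
The reduced capability set can be observed from inside a process: the kernel reports the effective capabilities as a hexadecimal bitmask on the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask. Only a subset of the capability bits is listed, the function names are invented for this example, and the mask in the usage note is an illustrative value rather than something measured for this article.

```python
from pathlib import Path

# Bit positions as defined in <linux/capability.h>; only a subset is listed.
CAP_BITS = {
    0: "CAP_CHOWN",
    5: "CAP_KILL",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
}


def decode_capabilities(mask_hex):
    """Return the names (from CAP_BITS) of the capabilities set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


def effective_capabilities(status_text):
    """Decode the CapEff line of /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_capabilities(line.split(":", 1)[1].strip())
    return []


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # only present on Linux
        print(effective_capabilities(status.read_text()))
```

For example, decoding the illustrative mask 00000000a80425fb reports CAP_NET_BIND_SERVICE but neither CAP_SYS_ADMIN nor CAP_SYS_MODULE, which matches the idea that “root” in a container can bind low ports without being able to load kernel modules.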

Capabilities are just one of the many security features provided by modern Linux kernels. To harden a Docker host, you can also leverage other existing, well-known systems like

TOMOYO
AppArmor
SELinux
GRSEC, etc.

If your distribution comes with security model templates for Docker containers, you can use them out of the box. For instance, Docker ships a template that works with AppArmor and Red Hat comes with SELinux policies for Docker.

Photo Credit

Docker Blog

References

LXC—Linux containers.
Control Centre: The systemd Linux init system
The virtualization API: libvirt
Solomon Hykes and others. What is Docker?
How is Docker different from a normal virtual machine? (Stackoverflow)
Docker 0.9: introducing execution drivers and libcontainer

Uses layered filesystems AuFS.

Is there a formula for calculating the overhead of a Docker container?
An Updated Performance Comparison of Virtual Machines and Linux Containers
capabilities(7) - Linux man page
Netlink (Wikipedia)
The lost packages of docker
ebtables/iptables interaction on a Linux-based bridge
Comparison Between AppArmor and SELinux
The docker-proxy (netfilter)
Hardware isolation
Understand the architecture (docker)
Linux kernel capabilities FAQ
Docker: Differences between Container and Full VM (Xml and More)
Docker vs VMs

There is one key metric where Docker containers are weaker than virtual machines, and that’s “isolation”. Intel’s VT-d and VT-x technologies have provided virtual machines with ring -1 hardware isolation, of which they take full advantage. It helps prevent virtual machines from breaking down and interfering with each other.

Docker Security
Introduction to Control Groups (Cgroups)
Docker Runtime Metrics

Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under /sys/fs/cgroup.
On older systems, the control groups might be mounted on /cgroup, without distinct hierarchies.

To figure out where your control groups are mounted, you can run:

$ grep cgroup /proc/mounts

Oracle WebLogic Server on Docker Containers (white paper)
WebLogic on Docker (GitHub)

Sample Docker configurations to facilitate installation, configuration, and environment setup for DevOps users. This project includes quick start dockerfiles and samples for both WebLogic 12.1.3 and 12.2.1 based on Oracle Linux and Oracle JDK 8 (Server).

Sunday, November 1, 2015

Docker: Differences between Container and Full VM

A virtual machine (VM) is an emulation of a particular computer system. Virtual machines operate based on the computer architecture and functions of a real or hypothetical computer, and their implementations may involve specialized hardware, software, or a combination of both.

In this article, we will examine the differences between a Docker Container and a Full VM (see Note 1).

Docker Container

Docker is a facility for creating encapsulated computer environments; each such environment is called a container.^[2,7]

Starting up a Docker container is lightning fast because:

Each container shares the host computer's copy of the kernel.

However, each container has its own running copy of the Linux user space.

This means there's no hypervisor, and no extended bootup.

In contrast, virtual machine implementations in KVM, VirtualBox, or VMware work differently.

Terminology

Host OS vs Guest OS

Host OS

is the original OS installed on a computer

Guest OS

is installed in a virtual machine or disk partition in addition to the host or main OS

In virtualization, a guest OS can be different from the host OS
In disk partitioning, a guest OS must be the same as the host OS

Hypervisor (or virtual machine monitor)

is a piece of computer software, firmware or hardware that creates and runs virtual machines.
A computer on which a hypervisor is running one or more virtual machines is defined as a host machine.
Each virtual machine is called a guest machine.

Docker Container

An encapsulated computer environment created by Docker
Docker on Linux platforms

Builds on top of facilities provided by the Linux kernel (primarily cgroups and namespaces)
Unlike a virtual machine, does not require or include a separate operating system

Docker on non-Linux platforms

Uses a Linux virtual machine to run the containers.

Docker daemon

is the persistent process that manages containers.

Docker uses the same binary for both the daemon and client.

Uses Linux-specific kernel features

Container vs Full VM

A fully virtualized system gets its own set of resources allocated to it and does minimal sharing. You get more isolation, but it is much heavier (it requires more resources). With Docker containers you get less isolation, but they are more lightweight and require fewer resources, so you could easily run thousands on a host without it even blinking.^[1]

Basically, a Docker container (see Note 1) and a full VM have fundamentally different goals:

A VM is meant to fully emulate a foreign environment

The hypervisor in a full VM implementation is required to translate commands between the guest OS and the host OS.
Each VM requires a full copy of the OS, the application being run, and any supporting libraries.
If you need to simultaneously run different operating systems (like Windows, OS X, or BSD), or run programs compiled for other operating systems, you need a full virtual machine implementation.

In contrast, the container OS (or, more accurately, the kernel) must be the same as the host OS and is shared between container and host (see Note 1).

A container is meant to make applications portable and self-contained

Each container shares the host computer's copy of the kernel.

This means there's no hypervisor and no extended bootup.

The container engine is responsible for starting and stopping containers in a similar way to the hypervisor on a VM.

However, processes running inside containers are equivalent to native processes on the host and do not incur the overheads associated with hypervisor execution.

Notes

In this article, we focus only on Docker implementations on Linux platforms. In other words, our discussion excludes non-Linux platforms (e.g., Windows and Mac OS X).^[2]

Because the Docker daemon uses Linux-specific kernel features, you can’t run Docker natively in either Windows or Mac OS X.
Docker on non-Linux platforms uses a Linux virtual machine to run the containers.

Photo Credit

Docker Blog

References

Saturday, October 24, 2015

Does SPECjbb 2015 Have Less Variability than SPECjbb 2013?

What's the advantage of using SPECjbb 2015 over SPECjbb 2013? The answer may be "less variability". In this article, I'm going to examine the credibility of this claim.

Note that the experiments run here are not intended to evaluate the ultimate performance of the JVMs (OpenJDK vs. HotSpot) with different configurations. Different JDK 7 versions were used for OpenJDK and HotSpot here, but the author believes that the differences between these builds should be minor.

To recap, the main purpose of this article is comparing the variability of SPECjbb 2013 and SPECjbb 2015.

OpenJDK vs. HotSpot

To begin with, we want to emphasize that OpenJDK and HotSpot actually share most of their code base. Henrik Stahl, VP of Product Management in the Java Platform Group at Oracle, has shed some light on it:

Q: What is the difference between the source code found in the OpenJDK repository, and the code you use to build the Oracle JDK?

A: It is very close - our build process for Oracle JDK releases builds on OpenJDK 7 by adding just a couple of pieces, like the deployment code, which includes Oracle's implementation of the Java Plugin and Java WebStart, as well as some closed source third party components like a graphics rasterizer, some open source third party components, like Rhino, and a few bits and pieces here and there, like additional documentation or third party fonts. Moving forward, our intent is to open source all pieces of the Oracle JDK except those that we consider commercial features such as JRockit Mission Control (~~not yet available in Oracle JDK~~), and replace encumbered third party components with open source alternatives to achieve closer parity between the code bases.

Raw Data

Common Settings of the Tests

SPECjbb

OOTB settings
-m composite

JDK 7

4GB heap
Parallel GC (default)

OEL 6.6
Data sampling

Total Runs: 5 for each test
Metrics

Sample Standard Deviation (Std Dev)
Coefficient of variation (CV)

Different Settings of Tests

Test 1

OpenJDK
Huge Pages:^[7] 0

Test 2

OpenJDK
Huge Pages: 2200

Test 3

HotSpot
Huge Pages: 0

Test 4

HotSpot
Huge Pages: 2200
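
As a sanity check on the "Huge Pages: 2200" setting, the following sketch reads the huge-page counters from /proc/meminfo-style text and computes how much memory the reservation covers. It assumes the common 2048 kB huge-page size on x86-64, which is not stated in the test settings above; the function names are invented for this example.

```python
def hugepage_summary(meminfo_text):
    """Extract huge-page counters from /proc/meminfo-style text."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Keep the number only; Hugepagesize carries a trailing "kB" unit.
            values[key] = int(rest.split()[0])
    return values


def reserved_mib(summary):
    """Total memory reserved for huge pages, in MiB (Hugepagesize is in kB)."""
    return summary["HugePages_Total"] * summary["Hugepagesize"] // 1024


# Assumed sample: 2200 pages of 2048 kB each.
SAMPLE = (
    "HugePages_Total:    2200\n"
    "HugePages_Free:     2200\n"
    "Hugepagesize:       2048 kB\n"
)

if __name__ == "__main__":
    print(reserved_mib(hugepage_summary(SAMPLE)), "MiB reserved")
```

Under that assumption, 2200 pages reserve 4400 MiB, which is slightly more than the 4GB heap used in the tests.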

SPECjbb 2013	Test 1			Test 2			Test 3			Test 4
	Mean	Std Dev	CV	Mean	Std Dev	CV	Mean	Std Dev	CV	Mean	Std Dev	CV
max-jOPS	2973.40	76.01	2.56%	3490.60	92.54	2.65%	3124.20	43.22	1.38%	3538	71.98	2.03%
critical-jOPS	748.40	70.02	9.36%	852.20	76.37	8.96%	741.60	36.28	4.89%	828.8	76.54	9.24%

SPECjbb 2015	Test 1			Test 2			Test 3			Test 4
	Mean	Std Dev	CV	Mean	Std Dev	CV	Mean	Std Dev	CV	Mean	Std Dev	CV
max-jOPS	2853.80	55.22	1.93%	3233.40	82.32	2.55%	2893.20	53.89	1.86%	3238.60	92.89	2.87%
critical-jOPS	514	20.21	3.93%	581.80	45.26	7.78%	522.40	18.90	3.62%	571.6	36.67	6.42%

Analysis

The coefficient of variation (CV), defined as the sample standard deviation divided by the mean, is used to compare the variability of the experiments. Based on the 5 sample points of each test, we have come to the following preliminary conclusions:

SPECjbb 2013 vs SPECjbb 2015

On critical-jOPS, SPECjbb 2015 exhibits less variability in all tests.
On max-jOPS, SPECjbb 2015 exhibits less variability in Tests 1 and 2, but more in Tests 3 and 4.

Enabling Huge Pages or not

In both SPECjbb 2013 and 2015, enabling huge pages results in higher variability (exception: SPECjbb 2013/OpenJDK/critical-jOPS).

OpenJDK vs HotSpot

Both JVMs exhibit similar performance on average (i.e., in mean max-jOPS and critical-jOPS).

Our results are by no means conclusive. As more tests are run in the future, we could gain a better picture. From the preliminary results, it seems wise to choose SPECjbb 2015 over SPECjbb 2013 for future Java SE performance tests.
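
For readers who want to reproduce the CV column, here is a minimal sketch of the calculation using Python's statistics module. The raw per-run scores are not listed in this article, so the second helper works from the reported mean and standard deviation instead; both function names are chosen for illustration only.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """CV in percent: sample standard deviation divided by the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return stdev(samples) / mean(samples) * 100.0


def cv_from_summary(mean_value, std_dev):
    """CV in percent from an already computed mean and sample standard deviation."""
    return std_dev / mean_value * 100.0


if __name__ == "__main__":
    # SPECjbb 2013, Test 1, max-jOPS as reported in the table above.
    print(f"{cv_from_summary(2973.40, 76.01):.2f}%")
```

Note that statistics.stdev computes the sample (n - 1) standard deviation, which matches the metric named in the test settings.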

References

Coefficient of variation
When to use the sample or population standard deviation
SPECjbb2015_release_notes.txt
SPECjbb2013 Result File Fields
Java 7 Questions & Answers
java - Difference between JVM and HotSpot?
Huge Pages

Java Mission Control

Java Mission Control (JMC) is to HotSpot what JRockit Mission Control (JRMC) is to JRockit.

Standard Deviation Calculator
