Archive for the ‘sysadmin’ Category.

Why I hate Java HTTPClient MaxConnectionsPerHost

Background

The Java HttpClient package is used by many software developers in SOA shops to make back-end connections from service to service. The Apache developers who wrote the library were clearly considering its use in web browsers. As such they included a parameter, MaxConnectionsPerHost, which limits the number of simultaneous connections a client can make to a single host to 2 by default, in order to avoid overloading the site. This made more sense back in the 90s when RFC 2068 was written with this recommendation. Firefox has since upped its default limit to 15, and I believe that IE has raised its default limit to 8.

My contention is that in SOA shops, where servers are calling services, this connection limit is useless and should be disabled or raised to a very, very large value (10,000+).

The Problem

If you have a bank of servers calling a bank of other servers in an SOA environment, you can wind up hitting an artificial ceiling imposed by this httpclient limit, and the behavior is nearly identical to exhaustion of a database connection pool. It should be noted that database connection pooling, however, is necessary in order to reduce the cost of expensive database connection establishment and to reduce the memory impact on the database server of large numbers of database connections (particularly with Oracle, less so with a thinner database like MySQL).

What this looks like is a bank of Java (tomcat or whatever) servers which are idle but periodically spike to insane latency and timeouts. There’s no maxing out of CPU, I/O, network bandwidth or any other server resource. Similarly, the back end service that these servers are trying to contact is also scaled out adequately for the load and is not hitting any obvious performance issues, but response times are very, very slow as measured by the Java app making the httpclient call. This problem can masquerade as networking or load balancer issues and can drive network engineers nuts trying to track down why “the network is slow”.

What is observed, however, on the java app in thread dumps is potentially hundreds of threads stuck in doGetConnection:

"XXX THREAD NAME CHANGED TO PROTECT THE GUILTY XXX" daemon prio=10 tid=0x0000002ddb745800 nid=0x5edd in Object.wait() [0x000000005587f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:509)
        - locked <0x0000002aa37a9208> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:394)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:152)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
[...etc...]
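
A quick way to confirm this diagnosis is to count how many threads in a dump are parked in that frame. The following is only a sketch: the class name PoolWaitCounter is made up for illustration, and it assumes the dump is plain text in the format shown above, with the doGetConnection frame appearing once per stuck thread.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.stream.Collectors;

/** Hypothetical helper: counts threads parked waiting on the httpclient pool. */
class PoolWaitCounter {

    // Frame seen in the thread dump above when a caller waits for a pooled connection.
    static final String POOL_WAIT_FRAME =
            "MultiThreadedHttpConnectionManager.doGetConnection";

    /** Counts stack-trace lines containing the pool-wait frame (one per stuck thread). */
    static long countPoolWaiters(String threadDump) {
        long count = 0;
        for (String line : threadDump.split("\n")) {
            if (line.contains(POOL_WAIT_FRAME)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Reads a thread dump from standard input.
        String dump = new BufferedReader(new InputStreamReader(System.in))
                .lines()
                .collect(Collectors.joining("\n"));
        System.out.println(countPoolWaiters(dump) + " thread(s) waiting in doGetConnection");
    }
}
```

Once compiled, it could be fed a dump on standard input, for example the output of the JDK's jstack tool piped into it. A count that rises and falls with the latency spikes points at the pool rather than the network.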

So, we have idle service on one end, idle service on the other end, everything is working but the service is effectively crashed because of this artificial limit. This happens because in an SOA environment, having a limit of 2 simultaneous connections from server to service VIP is way, way too low. In a bank of 8 servers that means you can only have 16 simultaneous connections and if you’re doing in aggregate 100 tps it only takes a few slow, expensive calls to the service to “clog up the pipes”.
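
A rough way to see how little headroom that is: by Little's law, the average number of requests in flight equals the arrival rate times the average latency. The sketch below is illustrative only. The class and method names are made up, and the numbers are the 8 servers, limit of 2, and 100 tps used above.

```java
/** Back-of-the-envelope capacity arithmetic for a per-host connection limit. */
class PoolCapacity {

    /** Total simultaneous connections the farm can hold open to one service VIP. */
    static int farmConnectionLimit(int servers, int maxConnectionsPerHost) {
        return servers * maxConnectionsPerHost;
    }

    /** Little's law: mean requests in flight = arrival rate x mean latency. */
    static double requiredConnections(double requestsPerSecond, double meanLatencySeconds) {
        return requestsPerSecond * meanLatencySeconds;
    }

    /** Mean latency at which the farm-wide limit is fully used at a given rate. */
    static double saturationLatencySeconds(int farmLimit, double requestsPerSecond) {
        return farmLimit / requestsPerSecond;
    }

    public static void main(String[] args) {
        int limit = farmConnectionLimit(8, 2);
        System.out.println("farm-wide limit: " + limit + " connections");
        System.out.println("saturates at mean latency of "
                + saturationLatencySeconds(limit, 100.0) + " s at 100 tps");
        System.out.println("connections needed at 100 tps and 0.5 s latency: "
                + requiredConnections(100.0, 0.5));
    }
}
```

With 16 connections and 100 tps, the pool is full as soon as the mean call takes 160 ms, so a handful of multi-second calls is enough to start callers queueing.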

Why raising the limit to 10 or 20 or 30 is bad

The obviously prudent thing to do is to raise this limit up to some limit which is reasonable but still “protects” the back end service. We wouldn’t want to disable the limit entirely because… well, something bad would clearly happen and someone put that limit in there for a reason.

I’m going to argue that the reason that limit was put in there had nothing to do with your SOA environment. I’ve worked in truly massive non-java SOA environments that didn’t have this kind of limit and never saw an issue. There are much better ways to deal with scaling limits and brown-outs, and this limit always produces the behavior you don’t want.

I just diagnosed this problem, again, for the hundredth time in a situation where the MaxConnectionsPerHost limit had been raised to 10 on a bank of 8 servers. This had been running fine for a long time, but 5 of the servers crashed at once due to a memory exhaustion issue. That was bad, but the set of servers is so overscaled that there was still 60% idle cpu cycles available on the clients. The problem was that the farm went from having a limit of 80 simultaneous connections down to having a limit of 30 simultaneous connections. That was the only thing that caused the entire farm to fail (due to timeouts).

Granted, having 5 of 8 servers out of rotation is a bad thing, but the farm actually could have taken the load and this would have been an “oops, we had 5 servers out, damn we’re overscaled, good no customers were impacted” problem, but the “prudent” limit of 10 resulted in an outage. I’d rather jack the limit up to something very, very large and make this problem simply go away and stop encountering it. It didn’t do any good, and just caused us another outage.

Effect of Removing the Limit Completely

I worked at a very large Seattle-based Internet retailer for 5 years as one of the “Tier 3 or 4” Senior SEs who would see any kind of crazy infrastructure problem like this bubble up to us. We were not java-based at the time and were instead using process-based clients that simply had no concept of this kind of connection pooling to back end SOA services. Any server could open up as many connections to a back end service as it liked, each process could open up as many back end connections as it liked, and the processes did not share any state to know in aggregate how many connections were open to any back end server. With 30,000 servers and thousands of different deployed applications (literally) we never encountered any issues that a MaxConnectionsPerHost-style limit would have solved. In my opinion, in an SOA shop this is a solution looking for a problem. I lived for 5 years in a massive environment and never once saw an issue which made me think to use something like this limit.

And I would argue that this is because HTTP connections are *massively* cheap compared to Oracle connections. Sure you need to use a little bit of TCP/IP to get it going, but modern processors can do many more of those connection opens per second than your servers are ever going to want to be submitting (100 tps coming from a typical java tomcat is going to be impressive programming, but the TCP/IP stack won’t break a sweat). Trying to do some kind of HTTP/1.1-based connection pooling with a finite limit on it (which in my experience is *not* what is typically going on when I see the httpclient bug — most of the time these connections are not being reused at all) is a premature optimization in the Knuth-sense.

Poor Behavior on Surge Traffic

A common thought is that this “protects” your back-end services from surges. But the infinite-queuing behavior of the httpclient is precisely what you don’t want. As soon as the client overall starts to require more simultaneous connections than it can submit to the back end service the queue will attempt to grow infinitely long, creating infinite latency. What effectively happens is that every single request will take as long as the timeout period of whatever meta-client is calling the java service that uses httpclient.
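
To make the queue growth concrete, here is a small deterministic simulation of a blocking pool. It is a sketch under simplifying assumptions that are not from any real measurement: requests arrive at a fixed interval, every back-end call takes the same time, and callers wait for a free connection forever. The class name is hypothetical.

```java
import java.util.PriorityQueue;

/** Deterministic sketch of a blocking connection pool under steady load. */
class BlockingPoolSimulation {

    /**
     * Returns how long (in ms) the last of `requests` callers waits for a
     * connection when the pool blocks callers instead of dropping them.
     */
    static long waitOfLastRequestMs(int poolSize, long serviceMs,
                                    long arrivalIntervalMs, int requests) {
        // Each entry is the time at which one pooled connection becomes free.
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < poolSize; i++) {
            freeAt.add(0L);
        }
        long lastWait = 0;
        for (int i = 0; i < requests; i++) {
            long arrival = i * arrivalIntervalMs;
            long earliestFree = freeAt.poll();
            // A caller starts when it has arrived and a connection is free.
            long start = Math.max(arrival, earliestFree);
            lastWait = start - arrival;
            freeAt.add(start + serviceMs);
        }
        return lastWait;
    }

    public static void main(String[] args) {
        // 16 connections, one request every 10 ms (100 tps).
        System.out.println("fast backend (100 ms calls), wait of 1000th request: "
                + waitOfLastRequestMs(16, 100, 10, 1000) + " ms");
        System.out.println("slow backend (1000 ms calls), wait of 100th request: "
                + waitOfLastRequestMs(16, 1000, 10, 100) + " ms");
        System.out.println("slow backend (1000 ms calls), wait of 1000th request: "
                + waitOfLastRequestMs(16, 1000, 10, 1000) + " ms");
    }
}
```

As long as the calls are fast enough that fewer than 16 are in flight, nobody waits. Once the calls slow down, the wait grows with every additional request and never levels off, which is the behavior that turns a slow back end into client-side timeouts.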

In brown-out loading what you want is to start aggressively dropping connections to take load off, but you want to do that based on real brown-out of your back end service. SDEs are terrible at estimating what level of simultaneous connections would actually result in a real brown-out of the back end service. Nobody measures this in QA or load testing, or looks at it in production, and it would probably take a team of people in a large site to keep measuring and tweaking all the clients in production. The only way to reliably tell that you really are in a brown-out situation is by wrapping your back-end calls with timeouts, and not queueing.

Retries are also poor behavior unless you have exponential backoff like TCP/IP uses. Otherwise you ensure that a momentary brown-out produces a permanent overload of the back-end service.
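
For retries, a minimal sketch of capped exponential backoff might look like the following. The random jitter is an addition of this sketch and not something TCP itself does; the class name, the 100 ms base and the 5 second cap are arbitrary illustrative choices.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of capped exponential backoff with random jitter for retries. */
class Backoff {

    /** Upper bound for the delay before retry number `attempt` (0-based). */
    static long ceilingMs(int attempt, long baseMs, long capMs) {
        long ceiling = Math.min(baseMs, capMs);
        // Double once per previous attempt, stopping at the cap.
        for (int i = 0; i < attempt && ceiling < capMs; i++) {
            ceiling = Math.min(ceiling * 2, capMs);
        }
        return ceiling;
    }

    /** Picks a random delay in [0, ceiling] so retrying clients do not move in lockstep. */
    static long delayMs(int attempt, long baseMs, long capMs) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt, baseMs, capMs) + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt + ": wait up to "
                    + ceilingMs(attempt, 100, 5000) + " ms, chose "
                    + delayMs(attempt, 100, 5000) + " ms");
        }
    }
}
```

The point of the doubling is that each failed attempt takes more load off the struggling service instead of adding to it.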

You could still use a simultaneous connection limit if you must, but you must not queue, you must drop. If you queue, in an overload you will build latency without limit once the pipes are filled up, causing every request to time out, which results in a 100% outage anyway. If you immediately drop requests over the limit then there is the possibility that dropping 10% of the requests allows the other 90% to succeed in a timely manner. However, again, this requires being able to accurately measure exactly what the simultaneous connection threshold should be. Set it too low and you start to deny requests before you have overloaded your back-end service. Set it too high and you overload your back-end service anyway. You will never manage to set it correctly and budget adequate time to maintain it as the software changes, so it’s effectively useless to go down that road. Anyway, the httpclient blocks requests in doGetConnection when all the connections are being used and does not drop them, so the httpclient does not implement this kind of behavior.
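
If you did want a limit that drops instead of queueing, one way to sketch it is with a plain java.util.concurrent.Semaphore, whose tryAcquire() returns immediately rather than waiting. This is not how httpclient behaves and not a drop-in replacement for it; FailFastLimiter and RejectedException are hypothetical names used only for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Sketch of a concurrency limit that rejects excess calls instead of queueing them. */
class FailFastLimiter {

    /** Thrown when a call is dropped because the limit is already reached. */
    static class RejectedException extends RuntimeException {
        RejectedException(String message) {
            super(message);
        }
    }

    private final Semaphore permits;

    FailFastLimiter(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the back-end call if a permit is free right now, otherwise drops it. */
    <T> T call(Supplier<T> backendCall) {
        // tryAcquire() never blocks, so no queue of waiting threads can build up.
        if (!permits.tryAcquire()) {
            throw new RejectedException("over the concurrency limit, dropping request");
        }
        try {
            return backendCall.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        FailFastLimiter limiter = new FailFastLimiter(1);
        String result = limiter.call(() -> {
            try {
                // A second call while the first is in flight is rejected at once.
                limiter.call(() -> "inner");
            } catch (RejectedException e) {
                System.out.println("inner call dropped: " + e.getMessage());
            }
            return "outer call succeeded";
        });
        System.out.println(result);
    }
}
```

The caller sees the rejection immediately and can return an error upstream, which keeps latency bounded for the requests that do get through. Choosing the number is still the hard part described above.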

MaxConnectionsPerHost recommendation

Find some way to disable it, or set it to something “insane” like 10,000. All it does, in an SOA environment, is cause problems without usefully solving any problem.

You will then never again see the problem where an idle java app is having problems talking to an idle back end service when everything in the network checks out fine — an outage due simply to configuration.

You can then focus on scalability and brownouts using timeouts, exponential backoff, or simply elastic or “cloudy” scaling of services in response to demand.
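
As a sketch of what that could look like with the commons-httpclient 3.x classes seen in the thread dump above: this assumes the 3.x jar is on the classpath, the class name ServiceClientFactory is made up, and the timeout values are placeholders rather than recommendations. Note that the connection manager also has a separate cap on total connections (20 by default in 3.x, as far as I recall), so raising only the per-host value may not be enough.

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

/** Hypothetical factory for an HttpClient used for service-to-service calls. */
class ServiceClientFactory {

    static HttpClient newServiceClient() {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = manager.getParams();

        // Effectively remove the per-host limit and the overall pool limit.
        params.setDefaultMaxConnectionsPerHost(10000);
        params.setMaxTotalConnections(10000);

        // Rely on timeouts, not pool size, to detect a browned-out back end.
        // These values are illustrative assumptions only.
        params.setConnectionTimeout(2000); // ms to establish the TCP connection
        params.setSoTimeout(5000);         // ms to wait for data on the socket

        HttpClient client = new HttpClient(manager);
        // Bound the time a thread may sit in doGetConnection waiting for the pool.
        client.getParams().setConnectionManagerTimeout(1000);
        return client;
    }
}
```

Every service-calling component would then obtain its client from a factory like this, so the limit is raised in one place.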

And in general, you should not try to “protect” back end services with any kind of artificial limits. Invariably this results in exercising the Law of Unintended Consequences when those limits are accidentally hit while all the services are fine. It will not ‘save’ you but will cause you problems. Of course, when you encounter a problem like Oracle’s expensive connections you need to manage that resource and establish limits, but still those limits should be pushed to the point where you are using the limits to protect Oracle’s memory and not in an attempt to prevent apps from delivering ‘too many’ requests to Oracle. If your database melts down under load you need to address the problem well upstream with whatever is misbehaving, or just buy a bigger database machine, or use some more horizontally scalable NoSQL solution.

Summary

The MaxConnectionsPerHost parameter is a throwback to RFC 2068, written in 1997, before anyone dreamed up the term “Service Oriented Architecture”. In an SOA shop, this behavior is worse than useless and wastes time and causes needless outages without solving any problems.

The apache httpclient authors probably will not by default remove this limit, since they have to consider that their code could also be used to write web browsers, where some kind of finite limit is certainly prudent. But in any SOA shop this limit should be disabled or raised to something like 10,000 in all service-calling code — effectively making it unlimited and removing it.


Beginning to configure Chef

Setup

First, you can sign up for the free hosting service for 5 servers, which eliminates the need to screw around setting up the server infrastructure:

https://cookbooks.opscode.com/users/new

Second, go through the five step tutorial here:

http://help.opscode.com/faqs/start/how-to-get-started

Configuring ~/.chef

The opscode instructions have you create ~/chef-repo/.chef and put your knife.rb and .pem file in there.

I prefer to move those files into ~/.chef so that I don’t have to cd ~/chef-repo in order to use knife. You can also push out these config files in ~/.chef just like ~/.bashrc files so that you can use knife on different nodes. Of course a hacker who gains root on one of the servers where your knife credentials are can impersonate you to the chef server, so it may be wise to restrict your knife config to a set of bastion servers. Obviously, chef can help you manage this config.

My ~/.chef/knife.rb looks like:

log_level                :info
log_location             STDOUT
node_name                "USERNAME"
client_key               "#{ENV['HOME']}/.chef/USERNAME.pem"
validation_client_name   "ORGANIZATION-validator"
validation_key           "#{ENV['HOME']}/.chef/ORGANIZATION-validator.pem"
chef_server_url          "https://api.opscode.com/organizations/ORGANIZATION"
cache_type               'BasicFile'
cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )
cookbook_path            ["#{ENV['HOME']}/chef-repo/site-cookbooks"]

Here, USERNAME and ORGANIZATION should be replaced by your chef username and your organization’s name. You will also need to put your credentials in ~/.chef/USERNAME.pem (replace USERNAME by your username).

You have now configured knife to know who you are, where your credentials are, and configured some paths in your environment.

You should now be able to run commands like knife node list in any directory on the host that you have configured.

I’ve also set up my cookbook_path to ~/chef-repo/site-cookbooks so that from any directory, knife can find my cookbooks in order to upload them. You should modify that to wherever you have your site’s cookbooks.

Adding New Nodes

Now, when you’re done with that, to add a new node, you can deploy these two files to the new node:

/etc/chef/client.rb
- this points chef at the server
/etc/chef/validation.pem
- your organization key to authenticate first time to the server

Then become root and run your chef-client to add the node to the server and create the client.pem.

Now you can run knife node list to see the newly added node.

You should protect /etc/chef by making it mode 0700 or something similar. Once the /etc/chef/client.pem file has been bootstrapped down from the server, you (or a chef recipe) can delete the /etc/chef/validation.pem file.

Summary

So, now you should have set up an account at opscode, successfully set up your organization, configured knife to be easy to use on your host, and bootstrapped a few hosts to use chef. Now you can move on to getting chef to do useful things.


KVM vs. XenServer vs. VMware Memory Overcommitment

KVM claims to support 3 different kinds of memory overcommitment (I wouldn’t count live migration as memory management). Here is my comparative analysis of these features against its competitors, based on what has been written about them (I don’t have any direct experience with KVM, and have not played with XenServer 5.6’s balloon driver so far):

Swapping

Since KVM-based VMs run as processes, a large amount of swap configured on the host server will allow pages in the VMs to get swapped out. This seems inefficient since pages in the VM which are already file-backed may get paged to swap and be doubly-committed to disk, but should be usable in pre-production environments with many idle and completely unused VMs.

As far as I know VMware does not support this operation. This mechanism should be highly stable in KVM since it leverages the VM-as-process model and the underlying code has been debugged for literally decades.

It looks like the community Xen 4.0 hypervisor supports swap-to-disk for HVM based guests and not PV guests, but XenServer 5.6 does not yet support this.

Page Sharing or Memory De-Duplication

VMware patented this process. The base Linux kernel added a similar feature, KSM, in 2.6.32, along with the ksmd kernel thread, which can de-duplicate memory across different processes. As KVM VMs run as processes, KVM immediately benefits from this.

Again it looks like the community Xen 4.0 hypervisor supports KSM for HVM guests, but not PVs and this feature is not present in XenServer 5.6.

Since KSM is relatively new code in the Linux kernel, it is going to be less tested than VMware’s implementation. The KSM code also uses a slightly different algorithm than VMware and avoids the use of hashes to do comparison of pages, which may impact performance (conceivably it could impact performance positively, since it drops the computationally heavier hash in favor of just doing memcmp).

Ballooning

VMware has supported dynamic ballooning for a long time. XenServer 5.6 has recently added dynamic ballooning, although the balloon driver in the Xen hypervisor has been present since before 2005. The only new piece in XenServer 5.6 is the xenballoond guest daemon, which dynamically tweaks memory ballooning.

KVM also has memory ballooning and memory ballooning has been back-ported to the 2.6.18-194.el5 RHEL/CentOS 5.5 kernel. So far I can’t find any information on support for automatic dynamic balancing of memory ballooning the way that VMware or XenServer 5.6 does.

Scorecard

VMware is clearly mature and just works. Swapping VMs out to swap on the host is probably not a necessary feature for VMware, given that memory de-duplication and ballooning work reasonably well under VMware.

XenServer 5.6 relies entirely on newly-introduced dynamic ballooning and I expect that under heavier workloads there could be some performance instability discovered in xenballoond under edge conditions. XenServer seems to be lagging behind. While the Xen community hypervisor looks to be able to easily leverage swapping and KSM page de-duplication on HVM guests, the architecture of Xen PVs means that any swapping or page de-duplication needs to be patched directly into the underlying Xen hypervisor.

KVM benefits from its design in being able to leverage kernel swapping of VM pages, and leveraging KSM page de-duplication. Swapping VM pages to disk should be remarkably stable, and should be useful to massively overcommit memory in pre-production environments and compete with VMware’s balloon driver. The KSM page de-duplication is not yet mature code, but inclusion in the vanilla Linux kernel should rapidly increase the maturity of the codebase and allow it to compete with VMware’s page de-duplication. Once KSM and the KVM balloon driver mature, it should put KVM on equal footing with VMware.


XenServer 5.6 extending disk on a Linux VM

Three things need to happen to increase the disk available to a Linux VM:

  1. Increase the disk size available to a virt in XenCenter
  2. Increase the size of the VM partition in the VM’s partition table
  3. Grow the size of the ext3 filesystem in the VM

First of all the VM needs to be shut down in order to extend the size of the drive in XenCenter (I don’t know how to do this on-the-fly).

On the storage tab of the VM, click on properties and increase the size of the storage in XenCenter.

Increase the size of the linux / partition using fdisk. The utility parted should allow you to resize partitions as well and be much smarter about it and let you move filesystem data around as well, but parted does not allow you to edit running partitions. In this case since I want to edit /dev/xvda3 which is my / partition and it is at the end of the virtual disk I only need to extend the size of the partition and not move it around. On the running VM use fdisk to increase the partition:

# fdisk /dev/xvda

The number of cylinders for this disk is set to 5221.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/xvda: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *           1          38      305203+  83  Linux
/dev/xvda2              39         103      522112+  82  Linux swap / Solaris
/dev/xvda3             104        4177    32724405   83  Linux

Command (m for help): d
Partition number (1-4): 3

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 3
First cylinder (104-5221, default 104): 104
Last cylinder or +size or +sizeM or +sizeK (104-5221, default 5221): 5221

Command (m for help): p

Disk /dev/xvda: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

    Device Boot      Start         End      Blocks   Id  System
/dev/xvda1   *           1          38      305203+  83  Linux
/dev/xvda2              39         103      522112+  82  Linux swap / Solaris
/dev/xvda3             104        5221    41110335   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table.
The new table will be used at the next reboot.
Syncing disks.
#

Now, the last warning is accurate: you need to reboot the VM again in order for the kernel to pick up the changes to the partition geometry. Once the VM has rebooted, you can run resize2fs in the running VM, which will automatically notice the difference between the partition geometry and the size of the filesystem and will extend the filesystem to its maximum size:

# resize2fs /dev/xvda3
resize2fs 1.39 (29-May-2006)
Filesystem at /dev/xvda3 is mounted on /; on-line resizing required
Performing an on-line resize of /dev/xvda3 to 10277583 (4k) blocks.
The filesystem on /dev/xvda3 is now 10277583 blocks long.
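
As a sanity check, the numbers reported by fdisk and resize2fs above can be reproduced with a little arithmetic. The sketch below is just that check written out; the class name is made up, and the formula assumes a partition that starts on a cylinder boundary (it does not hold for the first partition, which starts a track into the disk).

```java
/** Reproduces the block counts shown in the fdisk and resize2fs output above. */
class PartitionMath {

    // 255 heads x 63 sectors/track x 512 bytes/sector, as printed by fdisk.
    static final long BYTES_PER_CYLINDER = 255L * 63L * 512L;

    /** Size in 1K blocks of a partition spanning the given cylinders, inclusive. */
    static long oneKBlocks(long firstCylinder, long lastCylinder) {
        long cylinders = lastCylinder - firstCylinder + 1;
        return cylinders * BYTES_PER_CYLINDER / 1024;
    }

    /** Whole 4k filesystem blocks that fit in the given number of 1K blocks. */
    static long fourKBlocks(long oneKBlocks) {
        return oneKBlocks / 4;
    }

    public static void main(String[] args) {
        long before = oneKBlocks(104, 4177);
        long after = oneKBlocks(104, 5221);
        System.out.println("old partition: " + before + " 1K blocks");
        System.out.println("new partition: " + after + " 1K blocks");
        System.out.println("filesystem:    " + fourKBlocks(after) + " 4k blocks");
    }
}
```

The new partition works out to 41110335 1K blocks, matching the fdisk listing, and dividing by 4 and rounding down gives the 10277583 blocks that resize2fs reports, so the filesystem now fills the partition.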

The need to reboot the VM twice here is a little bit annoying; there should be a better way to do this on-the-fly. It should also be straightforward to replace the use of the GUI with xe commands, and if you’re using ext3 volumes in XenServer and have access to the VHD files, it should be possible to do the partitioning and filesystem resizing from the dom0 as well (although doing this on-the-fly would require some kind of co-operation with the VM kernel, and probably requires the VM to be shut down anyway).


XenServer 5.6 thin provisioning with ext3

XenServer 5.6 allows admins a choice between 3 different kinds of volume management: LVM, LVHD or ext3. With the default in XenServer 5.6 of LVHD you gain quick snapshots and have thin provisioning of snapshots and suspended virtual machines, but running virtual machines have 100% of their disk allocation counted against the disk usage. In order to get thin provisioning of running VMs you need to build/rebuild your SRs as ext3 volumes. You lose rapid snapshots in the process. I also am not sure that this meets everyone’s definition of “thin provisioning” since this is just lazy allocation of blocks on ext3. If you fill up the disk on the VM and then delete a large amount of space, I don’t believe you will see the disk usage affected on your virtual machine. Still, with most server images in the Enterprise being nearly un-utilized, this should still be effective — particularly if you are good about log rotation and don’t let your partitions fill up.

In order to convert the default local storage volume on a XenServer 5.6 host you need to use the console xe utilities to destroy and recreate the SR. This is destructive to VMs on the host, so these instructions assume a newly built XenServer 5.6. Adapting them to adding a new drive to a host and creating a new SR with ext3 using ‘xe sr-create’ with these arguments is also straightforward. If you’ve already got VMs on the SR you’ll need to migrate them off and migrate them back one way or another. Don’t try this for the first time on a VM host that you care about, particularly if you aren’t skilled with the command line.

First there’s a default template in XenServer 5.6 which needs to be removed from the storage:

# xe vbd-list
uuid ( RO)             : f5c9f545-2019-7299-be87-fc7ef00be1e2
          vm-uuid ( RO): e2ad0921-dea8-5a1a-77e8-d3257fdcf48d
    vm-name-label ( RO): XenServer Transfer VM 5.6.0-31124p
         vdi-uuid ( RO): c3a8d327-2036-4ce2-9946-f0522f7572f4
            empty ( RO): false
           device ( RO):
# xe template-uninstall template-uuid=e2ad0921-dea8-5a1a-77e8-d3257fdcf48d
The following items are about to be destroyed
VM : e2ad0921-dea8-5a1a-77e8-d3257fdcf48d (XenServer Transfer VM 5.6.0-31124p)
VDI: c3a8d327-2036-4ce2-9946-f0522f7572f4 (XenServer Transfer VM system disk)
Type 'yes' to continue
yes
All objects destroyed

If you really needed that template, you don’t have it anymore. You’ll have to figure out how to get it back. I’m not sure what its purpose is. It is installed by default on all new XenServer 5.6 images, so you should be able to export it from a fresh install and re-import it to fix this, but I’m not going to offer instructions on how to do that, and haven’t tested it.

Next, find the uuid of the Local Storage SR:

# xe sr-list name-label="Local storage"
uuid ( RO)                : dacfea90-263e-0811-ab88-22f01b89b1b4
          name-label ( RW): Local storage
    name-description ( RW):
                host ( RO): vmhost.example.com
                type ( RO): lvm
        content-type ( RO): user

Then find the PBD that is attached to that:

# xe pbd-list sr-uuid=dacfea90-263e-0811-ab88-22f01b89b1b4
uuid ( RO)                  : daabdf71-641c-900b-3451-bd5c70675fab
             host-uuid ( RO): 23d8a9a0-a317-47a5-a1e6-858ab120b57b
               sr-uuid ( RO): dacfea90-263e-0811-ab88-22f01b89b1b4
         device-config (MRO): device: /dev/disk/by-id/scsi-36001c230bd1017000e4f2ee6554b21c8-part3
    currently-attached ( RO): true

Then unplug the PBD:

# xe pbd-unplug uuid=daabdf71-641c-900b-3451-bd5c70675fab

Now destroy the SR:

# xe sr-destroy uuid=dacfea90-263e-0811-ab88-22f01b89b1b4

Now you can create the SR. I’ve been using servers that have /dev/sda, so the storage partition is /dev/sda3. If you’re doing this on a SATA system (ick) you might have to use /dev/hda3 here, or on an HP probably /dev/cciss/c0d0p3. If you have FibreChannel or iSCSI-attached disk on a SAN you’re on your own to figure out what your block device is.

# xe sr-create content-type=user type=ext device-config:device=/dev/sda3 shared=false name-label="Local storage"
76ec3072-ae85-cd38-e363-34cf6b63d520

This command will take some time to return as it creates the SR.

You now probably want to tune the reserved space down on the ext3 partition to make more of it available. The filesystem reserves 5% of the storage to make block allocation and defragmentation more efficient, but you probably want to manage that yourself (set monitoring alarms at 95% and migrate VMs off if the storage gets above 95%).

The block device to tune is not /dev/sda3, but you can find it from df -k:

# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              4128448   3214896    703840  83% /
none                    384512         0    384512   0% /dev/shm
/opt/xensource/packages/iso/XenCenter.iso
                         44410     44410         0 100% /var/xen/xc-install
/dev/mapper/XSLocalEXT--76ec3072--ae85--cd38--e363--34cf6b63d520-76ec3072--ae85--cd38--e363--34cf6b63d520
                     279556112    191652 265163836   1% /var/run/sr-mount/76ec3072-ae85-cd38-e363-34cf6b63d520

Use tune2fs against that really ugly block device name to set the reserve to 0%:

# tune2fs -m 0 /dev/mapper/XSLocalEXT--76ec3072--ae85--cd38--e363--34cf6b63d520-76ec3072--ae85--cd38--e363--34cf6b63d520
tune2fs 1.39 (29-May-2006)
Setting reserved blocks percentage to 0% (0 blocks)

You should now be able to see the new “Local storage” device in XenCenter and can set it as the default storage location for new VMs. You will also see VHDs associated with your VMs showing up in the /var/run/sr-mount/[...etc...] directory.
