Amazon EC2 micro instances really, really suck
Amazon claims that their EC2 micro instance provides a “small amount of consistent CPU resources”:
“Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available.”
Well, one of my micro instances has looked like this for hours, with steal time mostly at 94-100% (sar output):
08:59:32 CPU %user %nice %system %iowait %steal %idle
09:00:01 all 5.51 0.00 0.00 0.00 94.49 0.00
09:01:09 all 1.04 0.00 0.01 0.00 98.85 0.09
09:02:02 all 3.41 0.00 0.00 0.00 96.59 0.00
09:03:01 all 1.09 0.00 0.02 0.00 98.84 0.05
09:04:02 all 1.74 0.00 0.00 0.00 98.24 0.02
09:05:02 all 10.15 0.00 1.46 1.74 69.08 17.56
09:06:03 all 4.66 0.00 0.03 0.31 94.75 0.25
09:07:05 all 1.46 0.00 0.00 0.00 98.54 0.00
09:08:02 all 2.98 0.00 0.00 0.00 97.00 0.02
09:09:04 all 3.40 0.00 0.02 0.00 96.28 0.30
09:10:04 all 1.59 0.00 0.00 0.02 98.40 0.00
09:11:15 all 10.16 0.00 0.79 0.83 80.32 7.91
09:12:02 all 2.06 0.00 0.02 0.00 97.88 0.04
09:13:03 all 4.32 0.00 0.00 0.00 95.61 0.07
09:14:01 all 0.00 0.00 0.00 0.00 100.00 0.00
09:15:01 all 3.99 0.00 0.00 0.00 95.99 0.02
09:16:01 all 2.35 0.00 0.00 0.48 94.37 2.80
09:17:01 all 16.01 0.00 0.81 0.61 76.10 6.46
09:18:01 all 0.00 0.00 0.00 0.00 100.00 0.00
09:19:02 all 0.79 0.00 0.00 0.00 99.19 0.02
09:20:02 all 4.56 0.00 0.02 0.00 95.43 0.00
09:21:03 all 0.00 0.00 0.00 0.00 100.00 0.00
09:22:15 all 10.84 0.00 0.76 0.61 83.74 4.04
09:23:01 all 3.70 0.00 0.00 0.00 96.27 0.02
09:24:05 all 5.87 0.00 0.00 0.00 94.08 0.05
09:25:02 all 0.00 0.00 0.00 0.00 100.00 0.00
09:26:01 all 2.02 0.00 0.00 0.03 97.95 0.00
09:27:01 all 0.00 0.00 0.00 0.00 100.00 0.00
09:28:02 all 11.65 0.00 1.09 1.06 77.05 9.16
09:29:09 all 3.24 0.00 0.00 0.00 94.92 1.84
09:30:13 all 2.13 0.00 0.00 0.00 97.87 0.00
09:31:01 all 1.99 0.00 0.00 0.00 97.98 0.02
09:32:02 all 3.43 0.00 0.02 0.00 96.56 0.00
09:33:01 all 0.22 0.00 0.05 0.25 96.12 3.36
09:34:02 all 14.96 0.00 1.16 1.29 70.74 11.84
09:35:01 all 0.95 0.00 0.00 0.00 99.05 0.00
09:36:17 all 5.62 0.00 0.03 0.00 94.36 0.00
09:37:02 all 0.00 0.00 0.00 0.00 100.00 0.00
09:38:02 all 1.84 0.00 0.00 0.00 98.13 0.03
09:39:01 all 1.92 0.00 0.27 0.71 87.51 9.59
09:40:14 all 8.11 0.00 0.43 0.35 87.92 3.19
09:41:01 all 2.46 0.00 0.02 0.00 97.50 0.02
09:42:02 all 2.22 0.00 0.00 0.00 97.78 0.00
09:43:02 all 2.00 0.00 0.00 0.00 98.00 0.00
Note that sar is taking up to 17 seconds at times (e.g. the sample stamped 09:36:17) just to gather its statistics, even though it is quite lightweight and does little more than read a few values out of /proc.
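For anyone who wants to check their own instance without sar, here is a rough Python sketch of how a %steal figure can be derived from the aggregate "cpu" line in /proc/stat. The function names (parse_cpu_line, steal_percent, measure_steal) are made up for illustration, and this is only an approximation of what sar reports, not its actual implementation.

```python
import time

# Column order of the aggregate "cpu" line in /proc/stat. The guest columns
# are left out on purpose: the kernel already counts them in user and nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks between two samples that were stolen."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["steal"] / total


def measure_steal(interval_seconds=5.0, path="/proc/stat"):
    """Sample /proc/stat twice (Linux only) and return %steal in between."""
    with open(path) as stat_file:
        first = parse_cpu_line(stat_file.readline())
    time.sleep(interval_seconds)
    with open(path) as stat_file:
        second = parse_cpu_line(stat_file.readline())
    return steal_percent(first, second)
```

Calling measure_steal(60) on the instance should give a number roughly comparable to one row of the sar output above, assuming the kernel exposes the steal column.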
This is so bad that ping times to the instance regularly exceed 600 ms, and sometimes a full second:
64 bytes from 50.18.x.y: icmp_seq=0 ttl=53 time=609.608 ms
64 bytes from 50.18.x.y: icmp_seq=1 ttl=53 time=1107.454 ms
64 bytes from 50.18.x.y: icmp_seq=2 ttl=53 time=107.230 ms
64 bytes from 50.18.x.y: icmp_seq=3 ttl=53 time=605.416 ms
64 bytes from 50.18.x.y: icmp_seq=4 ttl=53 time=1104.728 ms
64 bytes from 50.18.x.y: icmp_seq=5 ttl=53 time=104.510 ms
64 bytes from 50.18.x.y: icmp_seq=6 ttl=53 time=633.574 ms
64 bytes from 50.18.x.y: icmp_seq=7 ttl=53 time=1133.371 ms
64 bytes from 50.18.x.y: icmp_seq=8 ttl=53 time=133.149 ms
64 bytes from 50.18.x.y: icmp_seq=9 ttl=53 time=631.604 ms
That means that even interrupt context, which needs only enough horsepower to answer an ICMP ping, is not able to run at all for 600+ ms at a time.
And the problem is definitely on the EC2 side. The traceroute is clean across the internet right up until it reaches the instance:
1 a.b.c.d 102.019 ms 15.208 ms 13.735 ms
2 69.17.83.233 11.109 ms 10.859 ms 10.363 ms
3 209.247.91.169 10.861 ms 12.741 ms 12.855 ms
4 4.68.105.30 11.484 ms 16.482 ms 17.860 ms
5 4.69.132.49 27.975 ms 28.601 ms 30.103 ms
6 4.69.153.18 30.226 ms 28.726 ms 28.478 ms
7 4.69.152.16 28.352 ms 61.582 ms 31.225 ms
8 4.53.208.22 31.100 ms 30.775 ms 30.788 ms
9 72.21.222.208 32.222 ms 32.219 ms 30.851 ms
10 72.21.222.255 33.224 ms 31.350 ms 31.226 ms
11 * * *
12 * * *
13 * * *
14 50.18.x.y 393.064 ms 1500.394 ms 989.786 ms
This isn’t “a small amount of consistent CPU resources”, this is useless. They’re clearly prioritizing micro instances down to the point where, if the host server is otherwise 100% utilized, the micro instances get completely starved in the scheduling queue.