<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Spherical Chicken &#187; sysadmin</title>
	<atom:link href="http://www.scriptkiddie.org/blog/category/sysadmin/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.scriptkiddie.org/blog</link>
	<description>Climate, Technical Diving, Economics, System Engineering, IT Security</description>
	<lastBuildDate>Wed, 25 Jan 2012 20:36:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Amazon EC2 micro instances really, really suck</title>
		<link>http://www.scriptkiddie.org/blog/2011/03/24/amazon-ec2-micro-instances-really-really-suck/</link>
		<comments>http://www.scriptkiddie.org/blog/2011/03/24/amazon-ec2-micro-instances-really-really-suck/#comments</comments>
		<pubDate>Thu, 24 Mar 2011 16:51:14 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=616</guid>
		<description><![CDATA[Amazon claims that their EC2 micro instance provides a &#8220;small amount of consistent CPU resources&#8221;: Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. Well, one of my micro instances has looked like this with 98-100% steal cycles for hours: [...]]]></description>
			<content:encoded><![CDATA[<p>Amazon claims that their EC2 micro instance provides a &#8220;small amount of consistent CPU resources&#8221;:</p>
<blockquote><p>Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available.</p></blockquote>
<p>Well, one of my micro instances has looked like this with 98-100% steal cycles for hours:</p>
<pre>
<code>
08:59:32          CPU     %user     %nice   %system   %iowait    %steal     %idle
09:00:01          all      5.51      0.00      0.00      0.00     94.49      0.00
09:01:09          all      1.04      0.00      0.01      0.00     98.85      0.09
09:02:02          all      3.41      0.00      0.00      0.00     96.59      0.00
09:03:01          all      1.09      0.00      0.02      0.00     98.84      0.05
09:04:02          all      1.74      0.00      0.00      0.00     98.24      0.02
09:05:02          all     10.15      0.00      1.46      1.74     69.08     17.56
09:06:03          all      4.66      0.00      0.03      0.31     94.75      0.25
09:07:05          all      1.46      0.00      0.00      0.00     98.54      0.00
09:08:02          all      2.98      0.00      0.00      0.00     97.00      0.02
09:09:04          all      3.40      0.00      0.02      0.00     96.28      0.30
09:10:04          all      1.59      0.00      0.00      0.02     98.40      0.00
09:11:15          all     10.16      0.00      0.79      0.83     80.32      7.91
09:12:02          all      2.06      0.00      0.02      0.00     97.88      0.04
09:13:03          all      4.32      0.00      0.00      0.00     95.61      0.07
09:14:01          all      0.00      0.00      0.00      0.00    100.00      0.00
09:15:01          all      3.99      0.00      0.00      0.00     95.99      0.02
09:16:01          all      2.35      0.00      0.00      0.48     94.37      2.80
09:17:01          all     16.01      0.00      0.81      0.61     76.10      6.46
09:18:01          all      0.00      0.00      0.00      0.00    100.00      0.00
09:19:02          all      0.79      0.00      0.00      0.00     99.19      0.02
09:20:02          all      4.56      0.00      0.02      0.00     95.43      0.00
09:21:03          all      0.00      0.00      0.00      0.00    100.00      0.00
09:22:15          all     10.84      0.00      0.76      0.61     83.74      4.04
09:23:01          all      3.70      0.00      0.00      0.00     96.27      0.02
09:24:05          all      5.87      0.00      0.00      0.00     94.08      0.05
09:25:02          all      0.00      0.00      0.00      0.00    100.00      0.00
09:26:01          all      2.02      0.00      0.00      0.03     97.95      0.00
09:27:01          all      0.00      0.00      0.00      0.00    100.00      0.00
09:28:02          all     11.65      0.00      1.09      1.06     77.05      9.16
09:29:09          all      3.24      0.00      0.00      0.00     94.92      1.84
09:30:13          all      2.13      0.00      0.00      0.00     97.87      0.00
09:31:01          all      1.99      0.00      0.00      0.00     97.98      0.02
09:32:02          all      3.43      0.00      0.02      0.00     96.56      0.00
09:33:01          all      0.22      0.00      0.05      0.25     96.12      3.36
09:34:02          all     14.96      0.00      1.16      1.29     70.74     11.84
09:35:01          all      0.95      0.00      0.00      0.00     99.05      0.00
09:36:17          all      5.62      0.00      0.03      0.00     94.36      0.00
09:37:02          all      0.00      0.00      0.00      0.00    100.00      0.00
09:38:02          all      1.84      0.00      0.00      0.00     98.13      0.03
09:39:01          all      1.92      0.00      0.27      0.71     87.51      9.59
09:40:14          all      8.11      0.00      0.43      0.35     87.92      3.19
09:41:01          all      2.46      0.00      0.02      0.00     97.50      0.02
09:42:02          all      2.22      0.00      0.00      0.00     97.78      0.00
09:43:02          all      2.00      0.00      0.00      0.00     98.00      0.00
</code>
</pre>
<p>Note that sar is taking up to 17 seconds at times (e.g. 09:36:17) in order to gather statistics, and it is quite lightweight and doesn&#8217;t do much other than read a bit out of /proc.</p>
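<p>For anyone who wants to check the same number without sar, here is a minimal sketch of how %steal can be derived from two samples of the aggregate cpu line in /proc/stat.  The helper name steal_pct is hypothetical, and the sketch assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal):</p>
<pre>
<code>
# steal_pct OLD NEW: percent of CPU time stolen between two "cpu" lines
# taken from /proc/stat (hypothetical helper, not part of sysstat).
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Sum user..steal; guest time is already counted inside user.
        { total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 }
        NR == 1 { t0 = total; s0 = $9 }
        NR == 2 { printf "%.2f\n", 100 * ($9 - s0) / (total - t0) }
    '
}
</code>
</pre>
<p>To use it, capture the output of <code>grep '^cpu ' /proc/stat</code> into a variable, sleep a few seconds, capture it again, and pass both strings to steal_pct.  The two samples must differ, otherwise the division has nothing to work with.</p>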
<p>This is so bad that the instance is having 600+ms ping times:</p>
<pre>
<code>
64 bytes from 50.18.x.y: icmp_seq=0 ttl=53 time=609.608 ms
64 bytes from 50.18.x.y: icmp_seq=1 ttl=53 time=1107.454 ms
64 bytes from 50.18.x.y: icmp_seq=2 ttl=53 time=107.230 ms
64 bytes from 50.18.x.y: icmp_seq=3 ttl=53 time=605.416 ms
64 bytes from 50.18.x.y: icmp_seq=4 ttl=53 time=1104.728 ms
64 bytes from 50.18.x.y: icmp_seq=5 ttl=53 time=104.510 ms
64 bytes from 50.18.x.y: icmp_seq=6 ttl=53 time=633.574 ms
64 bytes from 50.18.x.y: icmp_seq=7 ttl=53 time=1133.371 ms
64 bytes from 50.18.x.y: icmp_seq=8 ttl=53 time=133.149 ms
64 bytes from 50.18.x.y: icmp_seq=9 ttl=53 time=631.604 ms
</code>
</pre>
<p>That means that not even interrupt context, with just enough horsepower to respond to an ICMP ping, is able to run at all for 600+ ms.</p>
<p>And the problem is definitely on the EC2 side; the path across the internet is clean right up until the instance itself:</p>
<pre>
<code>
 1  a.b.c.d  102.019 ms  15.208 ms  13.735 ms
 2  69.17.83.233  11.109 ms  10.859 ms  10.363 ms
 3  209.247.91.169  10.861 ms  12.741 ms  12.855 ms
 4  4.68.105.30  11.484 ms  16.482 ms  17.860 ms
 5  4.69.132.49  27.975 ms  28.601 ms  30.103 ms
 6  4.69.153.18  30.226 ms  28.726 ms  28.478 ms
 7  4.69.152.16  28.352 ms  61.582 ms  31.225 ms
 8  4.53.208.22  31.100 ms  30.775 ms  30.788 ms
 9  72.21.222.208  32.222 ms  32.219 ms  30.851 ms
10  72.21.222.255  33.224 ms  31.350 ms  31.226 ms
11  * * *
12  * * *
13  * * *
14  50.18.x.y  393.064 ms  1500.394 ms  989.786 ms
</code>
</pre>
<p>This isn&#8217;t &#8220;small amounts of consistent CPU&#8221;; this is useless.  They&#8217;re clearly prioritizing micro instances down to the point where, if the server is otherwise 100% utilized, the micro instances get completely queue starved.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2011/03/24/amazon-ec2-micro-instances-really-really-suck/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Heretical Sysadmin: Outbound ACLs need to die</title>
		<link>http://www.scriptkiddie.org/blog/2011/02/25/the-heretical-sysadmin-outbound-acls-need-to-die/</link>
		<comments>http://www.scriptkiddie.org/blog/2011/02/25/the-heretical-sysadmin-outbound-acls-need-to-die/#comments</comments>
		<pubDate>Fri, 25 Feb 2011 17:45:25 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[networking]]></category>
		<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=594</guid>
		<description><![CDATA[It&#8217;s time. The idea that all your servers need to be by default ACL&#8217;d so that they can&#8217;t talk to the internet is an idea that is dying and needs to be put out of its misery. First of all, outbound ACLs create constant rollout failures of software. Unless you&#8217;ve got really excellent process around [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time.  The idea that all your servers need to be by default ACL&#8217;d so that they can&#8217;t talk to the internet is an idea that is dying and needs to be put out of its misery.</p>
<p><span id="more-594"></span></p>
<p>First of all, outbound ACLs create constant rollout failures of software.  Unless you&#8217;ve got really excellent process around replicating the ACLs in pre-production and tracking the ACLs as part of rollout procedures you will inevitably wind up with rollout/deployment failures in production.  I&#8217;ve been on way too many post mortems of software deployments where the failure to track production outbound ACLs has been the cause.  I&#8217;ll say what doesn&#8217;t get said at these meetings, which is that the outbound ACLs themselves are the cause of the problem.</p>
<p>Second of all, with all the cloud services now the outbound ACLs are only going to cause more and more problems.  All my servers now talk to chef servers &#8220;in the cloud&#8221; run externally by opscode.  All my Dell servers pull OMSA and firmware directly from Dell.  The software dev teams are using github more and more and pulling directly from those servers &#8220;in the cloud&#8221;.  As more and more external platform dependencies are built it becomes impossible to deal with them all, particularly as the platform vendors change their own IPs around on their own schedule.  Opscode migrated from Amazon EC2 to rackspace and I never knew about that &#8212; if we had outbound ACLs everything would have broken, or it would have been a huge firedrill.  In the past 9 months I&#8217;ve gone from having basically no external dependencies on &#8220;the cloud&#8221; and burning tons of time carefully mirroring all my CentOS and RHEL RPMs, to having 3 major external dependencies to &#8220;the cloud&#8221;, and I&#8217;ve gotten more done and slept fine at night.</p>
<p>Sure, you can say that we should minimize all of this.  Replicate the Dell stuff with in-house mirrors, stop using github and run our internal git servers, set up chef servers internally and host them ourselves.   But you just cut three big checks on our time, and we&#8217;re simply not staffed to handle that.  Even if we were staffed to handle it, if you stack rank what I care about none of that comes close to the top of the list of stuff that I would do if we had 2 more headcount.  I&#8217;ve given up thinking that it&#8217;s an appropriate job for a system admin to in-house everything that can just be outsourced to &#8220;the cloud&#8221; in order to remove external dependencies of the enterprise.</p>
<p>Sure this increases risk, but the job of the system admin is not to decrease every single risk that they find.  Everything needs to be stack ranked and needs to be assessed in terms of its importance.  I still have huge issues with account management and our deployment process is crap and we need to patch our servers and I don&#8217;t have enough time to do any of that.  How much time do I have to worry that my servers point directly at Dell&#8217;s repo for OMSA?  I don&#8217;t have time to care about that.  There are much riskier things in my infrastructure that, if I ignore them, will get me hacked or will definitely cause me downtime.</p>
<p>And sure, outbound ACLs help after you do get hacked.  I&#8217;d argue, however, that the cost of the outbound ACLs is starting to greatly exceed the benefit they provide.  I&#8217;d also argue that you need to focus on preventing the attacks from being successful in the first place.  The use of outbound ACLs is lazy security engineering and damages the rest of the business.  Become proactive: scan your network for open proxies and close them.  Don&#8217;t rely on outbound ACLs so that you can leave open proxies exposed to the internet and tell yourself it&#8217;s okay (because nobody can use them to bounce into the internet and suck up your bandwidth &#8212; they can only scan your whole internal network &#8212; it&#8217;s okay!).  Tighten up your border security and, if you must, install &#8220;detect&#8221; controls instead of &#8220;prevent&#8221; controls.  Log when you see anomalous behavior.</p>
<p>There also will still be PCI-DSS to deal with, and every security best practices document out there probably starts by approaching the network from the perspective of locking everything down and then poking tiny little holes in it &#8212; writing checks on your time to manage all of that.  The people who wrote those docs don&#8217;t have to manage your network and it only took them a few minutes to draft that policy, while you&#8217;re stuck with the fallout from the policies for the rest of your career.  You don&#8217;t have to follow them if they&#8217;re no longer working.  Except for VISA, you have to follow that, but VISA would prefer that your servers all crash rather than having you violate their standard.  So, wall off the PCI-DSS environment and get used to managing the outbound firewall ACLs there and suck it up as a cost to the business.  But do not make the mistake of thinking that what you do for PCI-DSS is necessarily a best practice and that it should be applied to the rest of your Enterprise.</p>
<p>Also, certain outbound ACLs are fine.  None of your production servers should probably be making IRC connections.  By all means block that port and have any outbound connection to IRC servers set off security alarms that wake up your security engineers and put them to work.  Similarly, you can probably engineer your SMTP sending infrastructure so that most leaf nodes forward to dedicated SMTP internal relays and only those servers are allowed to send outgoing SMTP.  On a case-by-case basis that is fine.  But port 80 and port 443 to the internet are going to need to be wide open in the future.</p>
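<p>As a rough sketch of what those case-by-case rules could look like on a single Linux host using iptables (the post is about network ACLs in general, so this is only one possible form).  Port 6667 for IRC and the relay address 192.0.2.25 are assumptions for illustration, and the rules need root to apply:</p>
<pre>
<code>
# Log, then reject, outbound IRC.  6667/tcp is assumed here; IRC networks
# use other ports too, so adjust to what you actually want to catch.
iptables -A OUTPUT -p tcp --dport 6667 -j LOG --log-prefix "OUTBOUND-IRC: "
iptables -A OUTPUT -p tcp --dport 6667 -j REJECT

# Allow outbound SMTP only to the internal relay (placeholder address),
# and reject SMTP to anywhere else.
iptables -A OUTPUT -p tcp --dport 25 -d 192.0.2.25 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 25 -j REJECT

# Nothing is added for ports 80 and 443: they stay open.
</code>
</pre>
<p>The LOG rule is the &#8220;detect&#8221; half: whatever watches the kernel log for the OUTBOUND-IRC prefix is what would page the security engineers.</p>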
<p>Get used to it.  At some point, you will not be able to keep up and &#8220;the cloud&#8221; will solve too many problems to be able to continue to burn time replicating internet resources in-house, and the ACL management will become a nightmare and more of a risk to uptime to manage badly than to simply open it up.</p>
<p>It&#8217;s time to stop considering outbound ACLs a &#8220;best practice&#8221; and start considering them something that bad SEs and NEs do who are just control freaks that haven&#8217;t evolved out of the late 90s best practices.  We have passed an inflection point where this practice is no longer net positive to the Enterprise and it is net negative in terms of management and availability.  Stop thinking that it&#8217;s part of doing a good job to block everything from being able to talk to the Internet.  You are in the way of the business.  You are an impediment to getting things done.  Servers, both inside your company and outside, desperately want to talk to each other to get work done; stop getting in the way.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2011/02/25/the-heretical-sysadmin-outbound-acls-need-to-die/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Amazon&#8217;s CentOS-based AMIs are useless</title>
		<link>http://www.scriptkiddie.org/blog/2011/01/22/amazons-centos-based-amis-are-useless/</link>
		<comments>http://www.scriptkiddie.org/blog/2011/01/22/amazons-centos-based-amis-are-useless/#comments</comments>
		<pubDate>Sat, 22 Jan 2011 18:36:50 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=514</guid>
		<description><![CDATA[I posted some instructions here on using knife to bootstrap EC2 instances. I used Amazon&#8217;s AMIs thinking that they&#8217;d be more appropriate for EC2 because I&#8217;m an EC2 newbie. FAIL. The stated goal of Amazon&#8217;s AMI is to provide essentially a RHEL6 kernel on a RHEL5 ABI. However, if you really try to use RHEL5 [...]]]></description>
			<content:encoded><![CDATA[<p>I posted some instructions <a href="http://www.scriptkiddie.org/blog/2010/12/18/bootstrapping-amazons-centos-based-ami-in-ec2-using-knife/">here</a> on using knife to bootstrap EC2 instances.  I used Amazon&#8217;s AMIs thinking that they&#8217;d be more appropriate for EC2 because I&#8217;m an EC2 newbie.</p>
<p>FAIL.</p>
<p>The stated goal of Amazon&#8217;s AMI is to provide essentially a RHEL6 kernel on a RHEL5 ABI.  However, if you really try to use RHEL5 RPMs you&#8217;ll run into things like curl being upgraded to:</p>
<p>curl-7.19.7-15.16.amzn1.x86_64</p>
<p>Which is basically the RHEL6 version:</p>
<p>curl-7.19.7-6.el6.x86_64.rpm</p>
<p>Not the RHEL5 version:</p>
<p>curl-7.15.5-9.el5.x86_64.rpm</p>
<p>That breaks all kinds of RHEL5 RPMs, while the rest of the O/S isn&#8217;t RHEL6 enough to install RHEL6 RPMs successfully.  The result is a painful inability to consistently install either RHEL5 or RHEL6 RPMs.</p>
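<p>One quick way to spot this kind of mixing is to look at the dist tag baked into each package&#8217;s release string.  This is a minimal sketch that only knows about the el and amzn tags seen above; dist_tag is a hypothetical helper name, not an rpm feature:</p>
<pre>
<code>
# dist_tag PACKAGE: print the dist tag (el5, el6, amzn1, ...) found in an
# RPM package or file name; prints nothing if no known tag is present.
dist_tag() {
    printf '%s\n' "$1" | sed -n -E 's/.*\.((el|amzn)[0-9]+)[^.]*(\..*)?$/\1/p'
}
</code>
</pre>
<p>Running it over each line of <code>rpm -qa</code> output gives a rough picture of which dist tags are installed on a box.</p>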
<p>It&#8217;s possible that the Amazon AMI might collect itself a following of people who build up public repos with all the RPMs that you can get from DAG, EPEL, ELFF, Jpackage, etc for the Amazon AMI.  However, right now it&#8217;s basically its own beast &#8212; not RHEL5 or RHEL6 compatible &#8212; with no publicly-available prebuilt package repos for it.  Since I have better things to do than recompile packages for it, it&#8217;s kinda useless to me, so I&#8217;d recommend the Rightscale CentOS 5.4 EBS-based images (to use on the t1.micro instance) instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2011/01/22/amazons-centos-based-amis-are-useless/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Bootstrapping Amazon&#8217;s CentOS-based AMI in EC2 using knife</title>
		<link>http://www.scriptkiddie.org/blog/2010/12/18/bootstrapping-amazons-centos-based-ami-in-ec2-using-knife/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/12/18/bootstrapping-amazons-centos-based-ami-in-ec2-using-knife/#comments</comments>
		<pubDate>Sat, 18 Dec 2010 17:57:35 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=437</guid>
		<description><![CDATA[Quick note on how to use chef 0.9.12&#8217;s knife command to provision EC2 instances from the command line using the Amazon CentOS-based AMI. I&#8217;m assuming you already have your ~/.chef/knife.rb set up to talk to either the chef platform or some chef servers that you&#8217;ve built and that you can run commands like &#8216;knife node list&#8217; [...]]]></description>
			<content:encoded><![CDATA[<p>Quick note on how to use chef 0.9.12&#8217;s knife command to provision EC2 instances from the command line using the Amazon CentOS-based AMI.  I&#8217;m assuming you already have your ~/.chef/knife.rb set up to talk to either the chef platform or some chef servers that you&#8217;ve built and that you can run commands like &#8216;knife node list&#8217; successfully and have your chef client_key set up.  Setting up knife to talk to your chef server is outside the scope of this post.</p>
<blockquote><p>
NOTE: It turns out that the Amazon AMI looks like it&#8217;s based on RHEL6, with some upgraded RPMs.  The centos5-based chef install seems to work fine though.
</p></blockquote>
<p><span id="more-437"></span><br />
First of all, the server you run knife on needs some extra gems in order to talk to EC2:</p>
<blockquote>
<pre>
sudo gem install net-ssh net-ssh-multi fog highline
</pre>
</blockquote>
<p>Second, you need to set up your Amazon AWS keys in your ~/.chef/knife.rb file.  These are *not* your ssh keys associated with your EC2 account.  These are the key strings that you use to authenticate to AWS, which you should be able to find <a href="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&#038;action=access-key">here</a>.  Add these two lines to your knife.rb:</p>
<blockquote>
<pre>
knife[:aws_access_key_id] = "&lt;your access key goes here&gt;"
knife[:aws_secret_access_key] =  "&lt;your secret key goes here&gt;"
</pre>
</blockquote>
<p>Then you need to download your EC2 ssh private key.  If you use multiple regions (us-west and us-east for example) you will need multiple ssh key pairs.  I put mine in ~/.ssh/ec2-west.pem and ~/.ssh/ec2-east.pem and I&#8217;ve named the keys in the UI &#8220;ec2-west&#8221; and &#8220;ec2-east&#8221;.</p>
<p>Once you&#8217;ve done that, bootstrapping is just one command line, although there is a considerable amount of magic:</p>
<blockquote>
<pre>
knife ec2 server create --region us-west-1 -Z us-west-1b -i ami-d40e5e91 -f t1.micro -G server -I ~/.ssh/ec2-west.pem -S ec2-west --ssh-user ec2-user -d centos5-gems
</pre>
</blockquote>
<p>The breakdown of what the arguments do:</p>
<ul>
<li><b>--region us-west-1</b>:  This is the EC2 region (east coast, west coast, EU, etc.).</li>
<li><b>-Z us-west-1b</b>: This is the availability zone.  The default is a us-east-1 zone, so you need to specify this to make knife work in other regions.</li>
<li><b>-i ami-d40e5e91</b>: This is the 32-bit Amazon CentOS-based AMI for the us-west region.  If you use an instance size larger than t1.micro you can use the 64-bit version, and if you change the region then you will need to choose the appropriate AMI in that region.</li>
<li><b>-f t1.micro</b>: This is the micro/cheap EC2 instance that only supports 32-bit.</li>
<li><b>-G server</b>:  This is your security group.  I&#8217;ve created a security group named &#8220;server&#8221; which opens up DNS and HTTP in addition to SSH.  If you have not created a security group try &#8220;-G default&#8221; here.</li>
<li><b>-I ~/.ssh/ec2-west.pem</b>:  This is the path to the ssh secret key that I downloaded off my EC2 dashboard &#8212; this is also region specific.</li>
<li><b>-S ec2-west</b>:  This is the name of the ssh key pair that I set up in the EC2 dashboard; this is also region specific.  The -I argument tells chef what to use to ssh in, the -S argument tells EC2 what to set up on the server.  They must obviously agree.</li>
<li><b>--ssh-user ec2-user</b>:  Use the ec2-user to ssh into the Amazon AMI.</li>
<li><b>-d centos5-gems</b>:  This overrides the default ubuntu-centric knife install of chef on the target host with a CentOS-5 RPM-based install.</li>
</ul>
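<p>Since the key file and key pair name are both region specific, the pieces that have to change together can be wrapped up in one place.  This is a minimal sketch that only prints the command rather than running it.  knife_create_cmd is a hypothetical helper, the key names are the ones used in this post, and every flag is copied from the command line above:</p>
<pre>
<code>
# knife_create_cmd REGION ZONE AMI: print the knife command for a region.
# Assumes ssh keys named ec2-west and ec2-east stored under ~/.ssh/.
knife_create_cmd() {
    # Pick the ssh key pair that matches the region.
    case "$1" in
        us-west-*) key=ec2-west ;;
        us-east-*) key=ec2-east ;;
        *) return 1 ;;
    esac
    echo "knife ec2 server create --region $1 -Z $2 -i $3 -f t1.micro -G server -I ~/.ssh/$key.pem -S $key --ssh-user ec2-user -d centos5-gems"
}
</code>
</pre>
<p>The AMI is still passed in by hand because the right image ID differs per region.</p>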
<p>You should then see your instance successfully start up and register with the chef platform.  After a bunch of output it should look something like:</p>
<blockquote>
<pre>
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:33 +0000] INFO: Client key /etc/chef/client.pem is not present - registering
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:37 +0000] WARN: HTTP Request Returned 404 Not Found: Cannot load node i-32ae3376
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:39 +0000] INFO: Setting the run_list to [] from JSON
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:40 +0000] INFO: Starting Chef Run (Version 0.9.12)
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:40 +0000] WARN: Node i-32ae3376 has an empty run list.
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:41 +0000] INFO: Chef Run complete in 1.13335 seconds
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:41 +0000] INFO: cleaning the checksum cache
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:41 +0000] INFO: Running report handlers
ec2-50-18-12-5.us-west-1.compute.amazonaws.com [Sat, 18 Dec 2010 17:29:41 +0000] INFO: Report handlers complete

Instance ID: i-32ae3376
Flavor: t1.micro
Image: ami-d40e5e91
Availability Zone: us-west-1b
Security Groups: server
SSH Key: ec2-west
Public DNS Name: ec2-50-18-12-5.us-west-1.compute.amazonaws.com
Public IP Address: 50.18.12.5
Private DNS Name: ip-10-162-215-71.us-west-1.compute.internal
Private IP Address: 10.162.215.71
Run List:
</pre>
</blockquote>
<p>You should be able to use &#8216;knife node list&#8217; to see the new instance, and you should be able to ssh into the instance using your ssh key.</p>
<blockquote>
<pre>
> knife node list
[
  "i-32ae3376"
]
> ssh -i ~/.ssh/ec2-west.pem ec2-50-18-12-5.us-west-1.compute.amazonaws.com -l ec2-user
Last login: Fri Dec 17 06:32:10 2010 from &lt;W.X.Y.Z&gt;

       __|  __|_  )  Amazon Linux AMI
       _|  (     /     Beta
      ___|\___|___|

See /etc/image-release-notes for latest release notes. :-)
[ec2-user@ip-10-162-215-71 ~]$
</pre>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/12/18/bootstrapping-amazons-centos-based-ami-in-ec2-using-knife/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>XenServer, yum and &#8220;Error processing your License File&#8221;</title>
		<link>http://www.scriptkiddie.org/blog/2010/11/13/xenserver-yum-and-error-processing-your-license-file/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/11/13/xenserver-yum-and-error-processing-your-license-file/#comments</comments>
		<pubDate>Sat, 13 Nov 2010 17:40:05 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=422</guid>
		<description><![CDATA[Turns out it&#8217;s fairly bad practice to take a XenServer instance and point your yum repos at a CentOS 5 yum repository. It makes it easy to install some useful utility stuff like lsof and strace. However, if you mistakenly run &#8216;yum -y upgrade&#8217; on a XenServer which is pointed at an enabled CentOS 5 [...]]]></description>
			<content:encoded><![CDATA[<p>Turns out it&#8217;s fairly bad practice to take a XenServer instance and point your yum repos at a CentOS 5 yum repository.</p>
<p>It makes it easy to install some useful utility stuff like lsof and strace.  However, if you mistakenly run &#8216;yum -y upgrade&#8217; on a XenServer which is pointed at an enabled CentOS 5 repo, it will upgrade the gnupg RPM which nukes some special sauce that Citrix has put in there.  This also updates your /etc/redhat-release file to suggest that your XenServer is now CentOS, which can break cfengine, chef, etc., which all look in that file to determine what flavor of redhat-ish server you are running and will now treat your box just like any other CentOS box.</p>
<p>One solution to this is simply to run something like this on a sane XenServer:</p>
<p>tar -cvzf /tmp/gnupg.tar.gz `rpm -q -l gnupg`</p>
<p>Then copy that to a broken XenServer and run something like:</p>
<p>cd / &#038;&#038; tar -xvzpf /path/to/gnupg.tar.gz</p>
<p>That&#8217;ll screw up your RPM database a little bit, but it&#8217;ll repair the problem.</p>
<p><span id="more-422"></span><br />
If you can get the gnupg RPM out of the base distro via yum and install that, it would probably be better to repair the system that way.</p>
<p>I also have a suspicion that it&#8217;s possible to do an &#8216;upgrade&#8217; of the XenServer dom0 from the busted 5.6 to a vanilla 5.6 which should do a wipe-and-reinstall of the dom0 while preserving all the domUs on the server (I discovered doing upgrades from 5.5 to 5.6 that this is what actually happens with a XenServer &#8216;upgrade&#8217; &#8212; slightly surprising, but possibly useful in this kind of a situation).  This would probably be the best solution since it would back out any other problems caused by the errant &#8216;yum upgrade&#8217;.  I haven&#8217;t tried this yet, but it&#8217;s on my list of things to explore.</p>
<p>I think that best practice with XenServer for &#8220;third-party&#8221; RPMs should probably be to set up a blank repo and copy only the RPMs that you want from CentOS5 into that repo.  Either that or to have the CentOS5 repo be disabled in your yum config and only --enablerepo it when you want to install a few utilities like strace and lsof.  That will keep admins from accidentally causing quite a bit of chaos to the distro.</p>
<p>Of course I think this only points out how KVM is going to ultimately be far preferable to XenServer.  Since KVM just runs on a vanilla CentOS image, a &#8216;yum upgrade&#8217; is something that you really want to be doing on that image.  You want to manage it more-or-less like all the rest of your CentOS distributions.  XenServer will undoubtedly wind up going towards a more closed VMWare-ESXi-like model where the hypervisor is a thinner shim that no longer is a linux O/S image that you can log into and cause this kind of chaos.  After I hosed my XenServers this way and knew it was my own damn fault, I had the good sense not to bother Citrix customer support with this stupid problem, but I&#8217;m sure that others have engaged Citrix on this, and those people (legitimately) don&#8217;t want to see this kind of issue &#8212; this will be yet another point that drives them towards a more closed platform &#8212; and will drive me towards KVM.</p>
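<p>A minimal sketch of the disabled-by-default variant.  The repo file name, the repo id centos5-base and the mirror URL are all placeholders, so substitute whatever CentOS 5 mirror you actually use:</p>
<pre>
<code>
# Assumed contents of /etc/yum.repos.d/centos5.repo (placeholder names):
#
#   [centos5-base]
#   name=CentOS 5 base, disabled by default
#   baseurl=http://mirror.example.com/centos/5/os/$basearch/
#   enabled=0
#
# With enabled=0 a plain "yum upgrade" ignores the repo entirely.
# Pull utilities in explicitly, one command at a time:
yum --enablerepo=centos5-base install strace lsof
</code>
</pre>
<p>The repo is only enabled for that single command, so nothing changes for the next admin who runs yum without the flag.</p>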
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/11/13/xenserver-yum-and-error-processing-your-license-file/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Chef&#8217;s /etc/chef/client.rb file and code as configuration</title>
		<link>http://www.scriptkiddie.org/blog/2010/10/17/chefs-etcchefclient-rb-file-and-code-as-configuration/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/10/17/chefs-etcchefclient-rb-file-and-code-as-configuration/#comments</comments>
		<pubDate>Sun, 17 Oct 2010 18:30:55 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=293</guid>
		<description><![CDATA[One interesting thing to wrap my head around in the cfengine-vs-chef world is code being used all over the place as configuration. I ran into an issue where I needed to set the PATH environment that chef executes in order to fix ohai to be able to reliably find dmidecode in /usr/sbin (among other things). [...]]]></description>
			<content:encoded><![CDATA[<p>One interesting thing to wrap my head around in the cfengine-vs-chef world is code being used all over the place as configuration.  I ran into an issue where I needed to set the PATH environment variable that chef executes with, so that ohai could reliably find dmidecode in /usr/sbin (among other things).  The way that I would normally expect to do this is to look at a configuration file in /etc and find documentation for a &#8220;PATH&#8221; keyword, or some more general keyword for passing environment variables into the code.</p>
<p>Well, I surfed around through the options available in the /etc/chef/client.rb file and wound up frustrated because nobody had spec&#8217;d out the obvious use case of needing to set environment variables.  I considered submitting a bug for this.  I wrote out the bug in my head, and it was going to suggest either a specific &#8220;path=&#8221; directive to set the path, or some more complicated directive, ideally, to support setting arbitrary environment variables.</p>
<p>Then I pondered what Adam had told me a couple of days earlier: just drop code in there.  In that case what we were discussing was setting the node_name from the output of `/bin/hostname`.  I momentarily meditated on the fact that it really was a client.*RB* file and not client.*CONF*.  Could it be this easy?</p>
<pre>
# cat /etc/chef/client.rb
log_level        :info
log_location     STDOUT
chef_server_url  'https://api.opscode.com/organizations/CENSORED'
validation_client_name 'CENSORED-validator'

Ohai::Config[:plugin_path] << '/etc/ohai/plugins'

ENV['PATH'] = "/usr/local/ruby-1.8.7/bin:/usr/local/perl-5.10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/sbin:/usr/bin:/sbin:/bin"
</pre>
<p>And, yes, apparently it really is that easy.  Nobody at Opscode had to think about supporting ENV variables in the conf file, because they simply execute the conf file as Ruby, so I can embed Ruby in there and change the behavior in ways that the authors possibly never considered.</p>
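<p>To make the mechanism concrete, here is a minimal sketch of how a Ruby program can treat its config file as code.  This is not Chef&#8217;s actual loader: the class TinyConfig, the sample settings and the TINYCONFIG_DEMO_PATH variable are hypothetical names invented for illustration.  The only point is that once the file is evaluated rather than parsed, any Ruby statement in it (such as an assignment to ENV) simply runs:</p>
<pre>
# Hypothetical mini config loader; NOT Chef's real implementation.
class TinyConfig
  attr_reader :settings

  def initialize
    @settings = {}
  end

  # Any unknown "directive" called with exactly one argument is stored
  # as a setting, so a line like `log_level :info` works without the
  # loader having to declare log_level anywhere.
  def method_missing(name, *args)
    return super unless args.length == 1
    @settings[name] = args.first
  end

  # The config text is evaluated as Ruby in the context of this object.
  def load_string(text)
    instance_eval(text)
    self
  end
end

config_text = [
  "log_level :info",
  "node_name 'web01.example.com'",
  # Nobody had to design an "environment" directive for this to work:
  "ENV['TINYCONFIG_DEMO_PATH'] = '/usr/sbin:/usr/bin:/sbin:/bin'"
].join("\n")

config = TinyConfig.new.load_string(config_text)
puts config.settings.inspect
puts ENV['TINYCONFIG_DEMO_PATH']
</pre>
<p>The flip side of this design is that the config file can do anything the process can do, so it has to be trusted exactly as much as the code itself.</p>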
<p>Code-as-configuration FTW.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/10/17/chefs-etcchefclient-rb-file-and-code-as-configuration/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Issues with DevOps</title>
		<link>http://www.scriptkiddie.org/blog/2010/09/26/issues-with-devops/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/09/26/issues-with-devops/#comments</comments>
		<pubDate>Sun, 26 Sep 2010 19:42:02 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=267</guid>
		<description><![CDATA[DevOps background DevOps is a developing meme which seems to be attacking the wall between &#8216;development&#8217; and &#8216;operations&#8217;, incorporating agile methodologies in &#8216;operations&#8217; and expressing &#8216;infrastructure as code&#8217;. This is a very good thing, and it&#8217;s about damn time. But I&#8217;ve been thinking about these things for a long time, and I&#8217;m turning into a [...]]]></description>
			<content:encoded><![CDATA[<p><strong>DevOps background</strong></p>
<p>DevOps is a developing meme which seems to be attacking the wall between &#8216;development&#8217; and &#8216;operations&#8217;, incorporating agile methodologies in &#8216;operations&#8217; and expressing &#8216;infrastructure as code&#8217;.  This is a very good thing, and it&#8217;s about damn time.  But I&#8217;ve been thinking about these things for a long time, and I&#8217;m turning into a cantankerous old fart at this point, so here&#8217;s my critique of the memes that I&#8217;m seeing so far&#8230;</p>
<p><span id="more-267"></span><br />
<strong>My Background</strong></p>
<p>I&#8217;ve been a Unix power user since 1988 on the Eskimo North BBS running Xenix and teaching myself C.  I&#8217;ve been a Unix SA since around 1996 (UW HITLab, UW MBT, various dotcoms).  From 2001-2006 I worked at [A large Internet Retailer] on the CFengine infrastructure there.  I CFengine-ized the 400 servers we maintained in 2001 on the front end of the website.  By 2006 I was pushing out CFengine code to 30,000 servers, and was by far the most prolific configuration management engineer at [Internet Retailer] (80 commits to what was the &#8216;base role&#8217; in that system vs. the next runner up having 5).  Since then I&#8217;ve worked at Real Networks (ask me about that after some drinks) and now the Rhapsody spin-off managing 2,000+ servers.</p>
<p><strong>Issue #1: the name &#8220;Operations&#8221;</strong></p>
<p>I find myself having radically more in common with software developers than with operations people.  I actually have a substantially worse time dealing with bureaucratic process than most software developers do.  The job that I do requires pushing out to every server in the organization.  The common notification and approval processes in Enterprise organizations tend to focus on operations fixing individual issues on individual servers, or code being deployed to a given set of servers with a very finite list of users that need to be notified of the changes.  The case of having to make changes on every server, potentially affecting every user is not a &#8216;use case&#8217; considered by the average policy.</p>
<p>When it does get handled, often it gets handled as a months-long operational fire-drill to get something simple done.  A case in point would be the 2007 Daylight Saving Time change, which at the organization I was at required all-hands-on-deck for 2 months, nearly killed one SA, and burned up massive amounts of Program Management time doing the notification and approvals to reboot every service in the company.  That kind of process is unsustainable when your basic job is to repeatedly do changes potentially affecting every service in the company.  If every change like that is going to take 2 months to do, that means that I would only be able to do 6 things a year.</p>
<p>Obviously there is a clear need for agile methodologies here.  But I find myself more and more being part of &#8220;Operational&#8221; departments and yet being substantially more at odds with the &#8220;Operational&#8221; mindset than almost any software development team.  I have quit jobs because it was clear to me that I would not get a change control policy that was sufficiently agile that I could do my job and have adequate air cover to deal with inevitable mistakes.  Through working on massive environments and doing repeated global pushes of config changes, I&#8217;m very good at assessing and mitigating risk, but I&#8217;m not perfect.  I could do the math and see that inevitably I&#8217;d make a mistake trying to do my job, and simply get fired, while no manager was interested in trying to solve my unique change management issues.  Every time I brought up the problem, the response was some massively bureaucratic and heavy process designed to cover my manager&#8217;s butt, which would not allow me to do my job.</p>
<p>I believe so strongly that what I primarily do is not &#8220;operations&#8221; that I got into an argument with my present manager that our group should be named &#8220;infrastructure&#8221; but was overridden and wound up in yet another &#8220;operations&#8221; group.</p>
<p>I&#8217;d like to see more discussion of &#8220;Agile Infrastructure&#8221;, &#8220;Infrastructure Engineering&#8221;, &#8220;Infrastructure Architecture&#8221;, etc. and less of an assumption that just because I know how to set up a DNS server I&#8217;m &#8220;Operational&#8221;.  For a while on Orkut (blast from the past, anyone?) there was an &#8220;Infrastructure Architects&#8221; group, which didn&#8217;t get a whole lot of traffic, but that&#8217;s what I&#8217;ve thought of as my job, even as I&#8217;ve suffered under being &#8220;Operational&#8221;.</p>
<p><strong>Issue #2:  The Whole Organization Needs to be Operational</strong></p>
<p>In web-centric service-oriented-architecture agile shops (buzzword bingo!) the domain experts of the software are the software developers.  I&#8217;ve watched &#8220;Operational&#8221; managers attempt and fail to execute on the strict playbook of having software developers write code and operations run the code.  This has extended to the ludicrous extreme of having a requirement for massively scalable real-time logfile publishing so that software developers could still debug production issues while having their accounts removed from the app servers that their code ran on.  I&#8217;ve also seen policy decisions made that &#8220;operations runs Java&#8221;, which leads to fairly ludicrous behavior where SAs that couldn&#8217;t explain the first thing about how Java garbage collection runs are in charge of all the GC options to Java (because they are defined as the experts, without actually being the experts).</p>
<p>Software developers in this world are the domain experts.  They don&#8217;t write shrink-wrapped software and they need to develop expertise in tuning their apps, debugging their apps and they need access to their software and arguably even need access to bounce servers to debug simple production issues.  The circus that ensues when software development orgs are not trusted to &#8216;touch&#8217; production and they have to use the &#8216;remote hands&#8217; of operations, when the SAs involved aren&#8217;t qualified to understand what the software team is telling them to do is amusing at first and then quickly gets tiresome.</p>
<p>SAs are not genetically born understanding &#8220;operations&#8221; and there&#8217;s nothing preventing software teams from developing good operational practices and being held accountable by their management chain.  Having software teams only responsible for writing software, while &#8220;operations&#8221; is entirely the responsibility of the operations teams, doesn&#8217;t work well when you need to get operational work out of the software teams.  I would argue that placing operational responsibility for the software back onto the software teams will produce better uptime and operations and better code.</p>
<p>The point of all this, though, is to attempt to break down the dichotomy between operations people vs. everything else.  We need Operations in Infrastructure teams, in Security teams, and in Development teams.  It may be that you have dedicated people to Operations in those teams, but Operations-as-a-discipline needs to be smeared out across the Enterprise and needs to not be just the sole responsibility of one department.</p>
<p>This probably aligns strongly with DevOps, but I haven&#8217;t seen anyone get hardcore enough about this problem.  To paraphrase Slim Shady: &#8220;Guess there&#8217;s a little Operations in all of us.  Fuck it, lets all stand up.&#8221;</p>
<p><strong>Issue #3:  Operations is not Already Agile</strong></p>
<p>This isn&#8217;t so much a criticism as it is a rallying cry.  There&#8217;s an idea that operations is inherently more Agile than development teams already because they&#8217;ve got ticketing systems and they work in a bit of an ADHD manner on simple problems and knock them out all the time.  I&#8217;d like to pop that bubble and point out how truly horrible operations is at doing agile methodologies.</p>
<p>First of all, I repeatedly see scrum used by PMs in operations as an excuse to have a daily meeting to go over action items every day and report progress back to the PMs.  If you go through any Agile documentation on how not to do scrum, that&#8217;s the only way that I&#8217;ve ever seen it done in Operations.  Somewhere there&#8217;s a huge disconnect between software development managers and PMs and &#8216;Operations&#8217; managers and PMs.</p>
<p>I also repeatedly see project work in &#8216;Operations&#8217; being done waterfall and that being presented as a Best Practice.  I tend to believe that there&#8217;s a psychological reason behind this: because operations tends to be highly ADHD and interrupt-driven all the time, and that tends to produce poorly designed and thought-out work, operational managers are naturally driven towards wanting the complete other end of the pendulum.  That means a massive process where everything is designed and documented ahead of time, with an absolutely unassailable list of requirements and the perfect implementation that will never go out of date, and we&#8217;ll never wind up looking back at it 5 years from now wanting to scrap the whole thing.  Focusing on just a few different major business objectives, getting the big issues right from the start, and then scrumming in tight repetitive cycles is something that feels like it&#8217;s just more of the poor practices that operations does all the time, so the virtues of dealing with uncertainty and complexity in an Agile methodology are completely missed.  I would kill to work for a manager that just understood YAGNI as a means of reducing complexity.</p>
<p>Operational people are also plagued by the <a href="http://www.unixguide.net/freebsd/faq/16.19.shtml">&#8216;bikeshed&#8217;</a> problem.  The IT infrastructure needed to build a company is very complicated.  I&#8217;ve found that in large Enterprise institutions it is very, very difficult to meet even the simplest of requirements, so I aggressively whittle down what I want to accomplish to what I can accomplish.  This is almost universally met in meetings by people who want to be intelligent and develop lots of smart requirements about what the perfect solution would be in an ideal world with an infinite amount of time and resources to throw at the problem, or what could be accomplished in a smaller environment.  I&#8217;m not dumb, and you&#8217;re not proving how smart you are to me by making the problem as complicated as possible and solving all the problems of the world.  If I&#8217;m trying to accomplish something very stupid and simple it is because getting that simple problem solved across thousands of servers and hundreds of different types of apps is actually very hard.</p>
<p>A case in point would be centralized logfile collection.  After probably a decade of having no universally centralized logfile processing in one particular company (but lots of one-offs scattered throughout the Enterprise), I simply wanted to set up a syslog-ng server and point all system syslog traffic from all servers at it to get centralized logs for security.  Throw away all the requirements about application logs, centralized logs, GUIs for novice users, AAA to protect certain logs, etc.  I just wanted security-related logs to be centralized and to be able to have clueful SAs use grep on the logs.  Eminently achievable and basically out of the playbook of System Admin 101, but it was dismissed as insufficiently clever, lazy, and not taking the requirements of the rest of the business into consideration.</p>
<p>Anyway, the problem is bad.  Very bad.  This is not going to be an easy fight.  Far from finding a fertile field of people who are used to doing quick iterative processes, Operations is dominantly made up of people who really believe they&#8217;d be doing a better job if they were in charge of really massive projects, with huge requirements and piles of resources and planning that would make any nuclear plant manager proud.</p>
<p><strong>Issue #4:  Is ITIL actually the Enemy?</strong></p>
<p>Honestly, my knowledge of ITIL is largely limited to interacting with people who are ITIL-certified (certifiable?) and that leaves me not being very interested in learning much more about it.  My understanding is that ITIL codifies the wall between development and operations and strictly defines them as two entirely different things.</p>
<p>I know that I consistently see this diagram of ITIL-compliant escalation which involves issues going through tier 1, then often a tier 2, then kicked up to systems/network/database administrators at tier 3, then software developers at tier 4.  That just annoys me because I think the responsibility should be on the 24/7 staffed tier 1+2 to properly determine how to route the problem, and they should route *most* problems to software operations in tier 3, bypassing systems and networking folks entirely.  The bulk of issues in a well-designed systems/network infrastructure lie either with application code or performance issues with SQL databases, and I don&#8217;t need to get woken up at 3AM to rule out systems and networks in order to kick it over to software.  I&#8217;m also not interested in being woken up at 3AM to kick over services repeatedly because of a software bug which isn&#8217;t a high priority for the software development team, because features are more important.  I, personally, am a bit of an outright prima donna because I&#8217;m usually the tier 5 person who can deal with having multiple app departments, along with systems, networking and databases, all doing a circular firing squad blaming each other for an outage, and come into that kind of a problem and correctly determine the root cause and get the appropriate department working on fixing whatever is broken.  Still, I don&#8217;t see why systems engineers wind up being the go-to people at 3AM to figure out why it&#8217;s all broken.  We do have lives, many of us have kids, and it gets really old really fast.  Operational SAs really need an actual Union and we need to go out on strike over how we&#8217;re expected to do oncall.  The way that the 24/7 workplace has crept into all of our lives and just become a natural part of it that we have to do or else we&#8217;re not doing it right is utterly appalling from a labor-relations perspective.</p>
<p>And if the other teams don&#8217;t have access to figure out issues, they need it.  Network engineers need to be able to log in to servers and run tcpdump and traceroute.  Software devs need access to each other&#8217;s servers and logs, not just their own servers and logs.  Separation of debugging responsibility just leads to the circular firing squad.  One reason why I can be the tier 5 go-to is because I have root on all the servers, and often I&#8217;ve managed to get myself enable on the switches, routers and load balancers (and can we get a job title for this kind of person?  I&#8217;m tired of having my networking access yanked because I&#8217;m not a network engineer.  I&#8217;ve never, ever done something nasty with enable access, but sometimes it is highly useful and prevents me from wasting NEs&#8217; time on tickets that I cut just to ask stupid questions that I could log in and answer myself.  I&#8217;ll get CCNA/CCDP certified if I have to).</p>
<p>Anyway, my understanding of ITIL is that it&#8217;s rigid and inflexible in its definition of roles and responsibilities around operations vs. development and systems vs. network vs. software vs. databases, etc.  It seems to me to be the opposite of Agile.  If someone who really understands both ITIL and Agile can explain to me how ITIL can be Agile, by all means I&#8217;ll listen&#8230;</p>
<p><strong>On the other hand&#8230;</strong></p>
<p>Way to go.  I&#8217;ve obviously been thinking about the problems for a long time and just getting old and grouchy about it.  Hopefully DevOps or some intellectual descendant can catch on and start getting me excited&#8230;  It&#8217;s about time someone got a meme started on what I&#8217;m really trying to do with my job&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/09/26/issues-with-devops/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reducing Server I/O on virtualized hosts</title>
		<link>http://www.scriptkiddie.org/blog/2010/09/19/reducing-server-io-on-virtualized-hosts/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/09/19/reducing-server-io-on-virtualized-hosts/#comments</comments>
		<pubDate>Sun, 19 Sep 2010 17:26:02 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=222</guid>
		<description><![CDATA[The Problem I/O is a pretty large deal with virtualized hosts. If you have 10:1 or 35:1 VM compression (guest-to-physical ratio) then any useless I/O which your servers are doing is going to cost you big in the overall I/O load on your physical servers. Many O/S images, however, have a lot of useless I/O [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The Problem</strong></p>
<p>I/O is a pretty large deal with virtualized hosts.  If you have 10:1 or 35:1 VM compression (guest-to-physical ratio) then any useless I/O which your servers are doing is going to cost you big in the overall I/O load on your physical servers.</p>
<p>Many O/S images, however, have a lot of useless I/O which goes on, which can be a factor in scaling VMs and pushing towards expensive SAN/NAS centralized storage solutions to deal just with all the useless I/O.  I&#8217;ve been successful in identifying some fairly severe I/O issues from default RedHat installs which allow me to run a lot of services off of internal storage and don&#8217;t require centralized storage.</p>
<p><strong>Useless I/O Sucking Cronjobs</strong></p>
<p>On RHEL5 there is a pile of cronjobs that run periodically and like to suck up all the I/O on a server.  These cronjobs do try to do something useful, but the I/O load is counterproductive.  A classic example is the slocate cronjob that runs every night.  This cronjob indexes your entire disk, so it walks the entire inode tree every night.  If you have 35 virtual images on the same physical server, this will crush your I/O every single night.  While having the ability to easily locate files from the commandline sounds useful, the nightly cost of generating this index is too expensive and this cronjob must be turned off for servers, more so than ever in the virtualized era of computing.</p>
<p>There are also cronjobs that generate the indexes for man pages that run every night.  While this produces less I/O than walking the entire inode tree, it&#8217;ll still produce synchronized I/O every night as all your VMs spin up at the same time to perform this task.  What you lose is the ability to do &#8216;man -k&#8217; or &#8216;apropos&#8217; on your servers (since every image should be the same you can have a bastion host with this cronjob enabled so you can look things up when you need to).</p>
<p>Here is a Chef recipe for RH7.3 through RHEL5 for reducing cronjob I/O load on servers:</p>
<pre>
#
# remove default cronjobs that suck up I/O
#

# The file names differ between releases, so list every known variant;
# deleting a file that does not exist is a no-op.
%w{
  /etc/cron.daily/mlocate.cron
  /etc/cron.daily/slocate.cron
  /etc/cron.daily/makewhatis.cron
  /etc/cron.weekly/makewhatis.cron
  /etc/cron.daily/00-makewhatis.cron
  /etc/cron.weekly/00-makewhatis.cron
  /etc/cron.daily/00-logwatch
  /etc/cron.daily/0logwatch
}.each do |cronjob|
  file cronjob do
    action :delete
  end
end
</pre>
<p><strong>Turning off fsync on syslog</strong></p>
<p>The fsync() or fdatasync() system calls force writes to be sent to the disk, which is something that databases need to do for ACID compliance and which reduces throughput.  For some reason the default install of /etc/syslog.conf in RedHat has syslogd treat the syslog files as databases and call fsync() after every syslog line is written to the disk.  If your servers do any syslog at all, or you enable debug logging for some syslog functionality, this can crush your I/O if multiple virts are syslog&#8217;ing at the same time (particularly out of cron).</p>
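<p>As a rough way to see the cost for yourself, the sketch below writes the same log lines to two temp files, once calling fsync after every line and once leaving the writes buffered.  It is an illustration in Ruby rather than anything syslogd itself does; the line count and message text are arbitrary assumptions, and the timings will vary widely with the filesystem and storage underneath:</p>
<pre>
require 'tempfile'
require 'benchmark'

LINES = 200  # small on purpose; raise it to see a bigger gap

# Write LINES log lines, optionally forcing each one to disk.
def write_log(path, sync_each_line)
  File.open(path, 'w') do |f|
    LINES.times do |i|
      f.write("Sep 19 17:26:02 host demo: message #{i}\n")
      # IO#fsync flushes Ruby's own buffer and then asks the OS to
      # write the file's data out to the disk.
      f.fsync if sync_each_line
    end
  end
end

synced   = Tempfile.new('synced')
buffered = Tempfile.new('buffered')

t_sync = Benchmark.realtime { write_log(synced.path, true) }
t_buf  = Benchmark.realtime { write_log(buffered.path, false) }

puts format('fsync per line: %.4fs', t_sync)
puts format('buffered:       %.4fs', t_buf)
</pre>
<p>Both files end up with identical contents; the only difference is how many times the disk was forced to commit along the way.</p>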
<p>Another admin at work tipped me off by asking me about this backtrace showing syslog stuck in fsync() [also note that the hung_task messages from newer kernels can be quite useful]:</p>
<pre>
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D 0007f2bb68a22287     0 28885      1         28888 28872 (NOTLB)
 ffff8801e3e7fd88  0000000000000282  0000000000000000  0000000000000001
 000000000000000a  ffff8801d8f4a080  ffff880000d32080  000000000000d03d
 ffff8801d8f4a268  0000000000000000

Call Trace:
 [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff8029c3fb>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff80230f7f>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff802d321b>] do_readv_writev+0x26e/0x291
 [<ffffffff802e5a88>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80251c10>] do_fsync+0x52/0xa4
 [<ffffffff802d3a1f>] __do_fsync+0x23/0x36
 [<ffffffff802602f9>] tracesys+0xab/0xb6
</pre>
<p>Syslog, in my opinion, is nowhere near important enough to be calling fsync() on everything it does, so I hit the man pages to determine how to disable this:</p>
<blockquote><p>You may prefix each entry with the minus &#8220;-&#8221; sign to omit syncing the file after every logging.  Note that you might lose information if the system crashes right behind a write attempt.  Nevertheless this might give you back some performance, especially if you run programs that use logging in a very verbose manner.</p></blockquote>
<p>And all I did was to add the &#8216;-&#8217; sign to all the logfiles that syslog.conf is configured to write to:</p>
<pre>
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none                -/var/log/messages

# The authpriv file has restricted access.
authpriv.*                                              -/var/log/secure

# Log all the mail messages in one place.
mail.*                                                  -/var/log/maillog

# Log cron stuff
cron.*                                                  -/var/log/cron

# Everybody gets emergency messages
*.emerg                                                 *

# Save news errors of level crit and higher in a special file.
uucp,news.crit                                          -/var/log/spooler

# Save boot messages also to boot.log
local7.*                                                -/var/log/boot.log

*.*                                                     @SYSLOG-SERVER.example.com
</pre>
<p>Note that every file is preceded by a &#8216;-&#8217; which is what does the magic.</p>
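<p>If you have a lot of images to fix, the edit is mechanical enough to script.  The following is a minimal sketch, assuming a sysklogd-style syslog.conf with one selector and one target per line; the function name disable_sync is made up for this example, and skipping targets under /dev is my own conservative choice rather than something the man page requires:</p>
<pre>
# Prefix plain-file targets with '-' so syslogd stops syncing after
# every line. Comments, blank lines, non-file targets (remote hosts,
# user lists), device files and already-prefixed targets are returned
# unchanged.
def disable_sync(line)
  stripped = line.strip
  return line if stripped.empty? || stripped.start_with?('#')

  _selector, target = stripped.split(/\s+/, 2)
  return line if target.nil?
  return line unless target.start_with?('/')
  return line if target.start_with?('/dev/')

  # Insert the '-' right before the first path on the line.
  line.sub(/(\s+)\//, '\1-/')
end

sample = [
  "# Log cron stuff",
  "cron.*                                                  /var/log/cron",
  "authpriv.*                                              -/var/log/secure",
  "*.emerg                                                 *",
  "*.*                                                     @SYSLOG-SERVER.example.com"
]

sample.each { |line| puts disable_sync(line) }
</pre>
<p>Only the cron line changes in the sample; syslogd still has to be restarted or sent a HUP to pick up the new config.</p>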
<p><strong>Summary</strong></p>
<p>If you are using Ubuntu, I expect that there&#8217;s a pile of useless cronjobs that you can also track down, but I haven&#8217;t spent the time to do so (I use RHEL rather than Ubuntu at work).  I&#8217;d also be a little surprised if Ubuntu did not have the same behavior with syslog, and I expect the syntax is similar (although it&#8217;s always possible that Ubuntu got this behavior correct right out of the gate&#8230;)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/09/19/reducing-server-io-on-virtualized-hosts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why I hate Java HTTPClient MaxConnectionsPerHost</title>
		<link>http://www.scriptkiddie.org/blog/2010/09/06/why-i-hate-java-httpclient-maxconnectionsperhost/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/09/06/why-i-hate-java-httpclient-maxconnectionsperhost/#comments</comments>
		<pubDate>Mon, 06 Sep 2010 18:11:44 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=207</guid>
		<description><![CDATA[Background The Java httpclient package is used by many software devs in SOA shops to make back end connections from service to service. The Apache developers who wrote the httpclient library clearly were considering the use of the httpclient library to make web browsers. As such they included a parameter, MaxConnectionsPerHost which would limit [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Background</strong></p>
<p>The Java httpclient package is used by many software devs in SOA shops to make back end connections from service to service.  The Apache developers who wrote the httpclient library clearly were considering the use of the httpclient library to make web browsers.  As such they included a parameter, MaxConnectionsPerHost, which would limit the number of simultaneous requests a browser could make to a website to 2, in order to avoid overloading the site.  This made more sense back in the 90s, when RFC 2068 was written with this recommendation; Firefox has since raised its default limit to 15, and I believe that IE has raised its default limit to 8.</p>
<p>My contention is that in SOA shops where servers are calling services that this connection limit is useless and should be disabled or raised to a very, very large value (10,000+).</p>
<p><strong>The Problem</strong></p>
<p>If you have a bank of servers calling a bank of other servers in an SOA environment, you can wind up hitting an artificial limit caused by this httpclient setting, which is nearly identical in behavior to exhaustion of database connection pools.  It should be noted that database connection pooling, however, is necessary in order to reduce the cost of expensive database connection establishment and to reduce the memory impact on the database server of large numbers of database connections (particularly with Oracle, less so with a thinner database like MySQL).</p>
<p>What this looks like is that you have a bank of Java (Tomcat or whatever) servers, which are idle but periodically spiking to insane latency and timeouts.  There&#8217;s no maxing out of CPU, I/O, network bandwidth or any other server resources.  Similarly, the back end service that these servers are trying to contact is also scaled out adequately for the load and there are no obvious performance issues that it is hitting, but response times for replies from the service are very, very slow as measured by the Java code making the httpclient call.  This problem can masquerade as networking or load balancer issues and can drive network engineers nuts trying to track down why &#8220;the network is slow&#8221;.</p>
<p>What is observed, however, on the java app in thread dumps is potentially hundreds of threads stuck in doGetConnection:</p>
<pre>
"XXX THREAD NAME CHANGED TO PROTECT THE GUILTY XXX" daemon prio=10 tid=0x0000002ddb745800 nid=0x5edd in Object.wait() [0x000000005587f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:509)
        - locked <0x0000002aa37a9208> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:394)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:152)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
[...etc...]
</pre>
<p>So, we have idle service on one end, idle service on the other end, everything is working but the service is effectively crashed because of this artificial limit.  This happens because in an SOA environment, having a limit of 2 simultaneous connections from server to service VIP is way, way too low.  In a bank of 8 servers that means you can only have 16 simultaneous connections and if you&#8217;re doing in aggregate 100 tps it only takes a few slow, expensive calls to the service to &#8220;clog up the pipes&#8221;.</p>
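<p>To put rough numbers on this: by Little&#8217;s law, the number of requests in flight equals throughput times average latency.  The sketch below (the class and method names are mine, purely for illustration, and are not part of httpclient) computes the average back end latency at which a farm&#8217;s connection ceiling is completely used up for a given aggregate request rate.</p>

```java
// Hypothetical helper for illustration; not part of httpclient.
final class ConnectionCeiling {
    /** Total simultaneous connections the whole farm may hold open to one service VIP. */
    static int farmConnections(int servers, int maxConnectionsPerHost) {
        return servers * maxConnectionsPerHost;
    }

    /**
     * Little's law: inFlight = throughput * latency. Solving for latency gives the
     * average call time at which every allowed connection is busy.
     */
    static double saturationLatencyMillis(int connections, double requestsPerSecond) {
        return connections * 1000.0 / requestsPerSecond;
    }

    public static void main(String[] args) {
        int connections = farmConnections(8, 2);   // 8 servers at the default limit of 2
        double millis = saturationLatencyMillis(connections, 100.0);
        System.out.println(connections + " connections are all busy at "
                + millis + " ms average latency");
    }
}
```

<p>With 8 servers, a limit of 2 and 100 tps in aggregate, the ceiling is reached once the average call takes 160 ms.  That is an average, so a handful of multi-second calls is enough to push the farm over it.</p>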
<p><strong>Why raising the limit to 10 or 20 or 30 is bad</strong></p>
<p>The obviously prudent thing to do is to raise this limit to some value which is reasonable but still &#8220;protects&#8221; the back end service.  We wouldn&#8217;t want to disable the limit entirely because&#8230;  well, something bad would clearly happen and someone put that limit in there for a reason.</p>
<p>I&#8217;m going to argue that the reason that limit was put in there had nothing to do with your SOA environment.  I&#8217;ve worked in truly massive non-Java SOA environments that didn&#8217;t have this kind of limit and never saw an issue.  There are much better ways to deal with scaling limits and brown-outs, and the behavior this limit gives you is always the behavior you don&#8217;t want.</p>
<p>I just diagnosed this problem, again, for the hundredth time in a situation where the MaxConnectionsPerHost limit had been raised to 10 on a bank of 8 servers.  This had been running fine for a long time, but 5 of the servers crashed at once due to a memory exhaustion issue.  That was bad, but the set of servers is so overscaled that there was still 60% idle cpu cycles available on the clients.  The problem was that the farm went from having a limit of 80 simultaneous connections down to having a limit of 30 simultaneous connections.  That was the only thing that caused the entire farm to fail (due to timeouts).</p>
<p>Granted, having 5 of 8 servers out of rotation is a bad thing, but the farm actually could have taken the load and this would have been an &#8220;oops, we had 5 servers out, damn we&#8217;re overscaled, good no customers were impacted&#8221; problem, but the &#8220;prudent&#8221; limit of 10 resulted in an outage.  I&#8217;d rather jack the limit up to something very, very large and make this problem simply go away.  The limit didn&#8217;t do any good, and just caused us another outage.</p>
<p><strong>Effect of Removing the Limit Completely</strong></p>
<p>I worked at a very large Seattle-based Internet Retailer for 5 years as one of the &#8220;Tier 3 or 4&#8221; Senior SEs who would see any kind of crazy infrastructure problem like this bubble up to us.  We were not Java-based at the time and were instead using process-based clients that simply had no concept of this kind of connection pooling to back end SOA services.  Any server could open up as many connections to a back end service as it liked, each process could open up as many back end connections as it liked, and the processes did not share any state to know in aggregate how many connections were open to any back end server.  With 30,000 servers and thousands of different deployed applications (literally) we never encountered any issues that the MaxConnectionsPerHost limit would have solved.  In my opinion, in an SOA shop this is a solution looking for a problem.  I lived for 5 years in a massive environment and never once saw an issue that made me want to reach for something like this limit.</p>
<p>And I would argue that this is because HTTP connections are *massively* cheap compared to Oracle connections.  Sure you need to use a little bit of TCP/IP to get it going, but modern processors can do many more of those connection opens per second than your servers are ever going to want to be submitting (100 tps coming from a typical java tomcat is going to be impressive programming, but the TCP/IP stack won&#8217;t break a sweat).  Trying to do some kind of HTTP/1.1-based connection pooling with a finite limit on it (which in my experience is *not* what is typically going on when I see the httpclient bug &#8212; most of the time these connections are not being reused at all) is a premature optimization in the Knuth-sense.</p>
<p><strong>Poor Behavior on Surge Traffic</strong></p>
<p>A common thought is that this &#8220;protects&#8221; your back-end services from surges.  But the unbounded queuing behavior of the httpclient is precisely what you don&#8217;t want.  As soon as the client needs more simultaneous connections than it is allowed to open to the back end service, the queue of waiting requests grows without bound, and latency grows with it.  What effectively happens is that every single request will take as long as the timeout period of whatever meta-client is calling the Java service that uses httpclient.</p>
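<p>A back-of-the-envelope model shows how fast this gets out of hand.  The sketch below is a simple fluid approximation, not a model of httpclient internals, and the rates in it (100 requests per second arriving, a connection cap that only lets 80 per second through) are numbers I made up for illustration.</p>

```java
// Hypothetical model for illustration; the rates used in main are assumed.
final class QueueGrowth {
    /** Requests waiting after `seconds` of sustained overload (zero if capacity keeps up). */
    static double backlog(double arrivalsPerSecond, double capacityPerSecond, double seconds) {
        return Math.max(0.0, (arrivalsPerSecond - capacityPerSecond) * seconds);
    }

    /** Time a request arriving at `seconds` waits for the backlog ahead of it to drain. */
    static double waitSeconds(double arrivalsPerSecond, double capacityPerSecond, double seconds) {
        return backlog(arrivalsPerSecond, capacityPerSecond, seconds) / capacityPerSecond;
    }

    public static void main(String[] args) {
        // Assumed: 100 requests/s arriving, the connection cap allows 80 requests/s through.
        for (int t : new int[] {1, 10, 60}) {
            System.out.println("after " + t + " s: backlog " + backlog(100, 80, t)
                    + " requests, wait " + waitSeconds(100, 80, t) + " s");
        }
    }
}
```

<p>In this model the wait grows linearly for as long as the overload lasts, so it passes any fixed caller timeout sooner or later, and from then on every request fails.</p>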
<p>In brown-out loading what you want is to start aggressively dropping connections to take load off, but you want to do that based on real brown-out of your back end service.  SDEs are terrible at estimating what level of simultaneous connections would actually result in a real brown-out of the back end service.  Nobody measures this in QA or load testing, or looks at it in production, and it would probably take a team of people in a large site to keep measuring and tweaking all the clients in production.  The only way to reliably tell that you really are in a brown-out situation is by wrapping your back-end calls with timeouts, and not queuing.</p>
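<p>As a minimal sketch of what wrapping a call with a timeout can look like, here is a version that uses only the JDK&#8217;s <tt>java.util.concurrent</tt> classes.  The <tt>TimedCall</tt> class and its method are hypothetical names of mine, and the lambdas in <tt>main</tt> stand in for a real back end call.</p>

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper for illustration; the names are not from httpclient.
final class TimedCall {
    // Cached pool: threads are created on demand, so a call never waits in a queue for a worker.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-call");
        t.setDaemon(true); // abandoned calls must not keep the JVM alive
        return t;
    });

    /** Returns the call's result, or the fallback if it fails or exceeds timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker and give up on this request
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the back end call itself threw
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "ok", 1000, "timed out");
        String slow = callWithTimeout(() -> { Thread.sleep(5000); return "ok"; }, 100, "timed out");
        System.out.println(fast + " / " + slow);
    }
}
```

<p>The caller gets an answer or a failure within the deadline either way, which is the signal you can actually act on.  Note that this sketch puts no cap on the number of worker threads; that is deliberate here, but it is a choice you would want to revisit in real code.</p>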
<p>Retries are also poor behavior unless you have exponential backoff like TCP/IP uses; otherwise you ensure that a momentary brown-out produces a permanent overload of the back-end service.</p>
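<p>For reference, the core of exponential backoff is tiny.  This is a sketch with hypothetical names and assumed values (100 ms base delay, 2 second cap), not something taken from httpclient:</p>

```java
// Hypothetical helper for illustration.
final class Backoff {
    /** Delay before retry number `attempt` (0-based): baseMillis * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling again would pass the cap (and could overflow)
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // Assumed values: 100 ms base delay, 2000 ms cap.
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("retry " + attempt + " waits "
                    + delayMillis(attempt, 100, 2000) + " ms");
        }
    }
}
```

<p>Each failed attempt doubles the wait, so a client that keeps failing backs away from the service instead of hammering it.  Real retry loops usually add some random jitter on top so that many clients do not retry in lockstep; I have left that out to keep the sketch deterministic.</p>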
<p>You could still use a simultaneous connection limit if you must, but you must not queue, you must drop.  If you queue, in an overload you will build latency without limit once the pipes are filled up, causing every request to time out, which results in a 100% outage anyway.  If you immediately drop requests over the limit then there is the possibility that you could hit a situation where dropping 10% of the requests allows the other 90% to succeed in a timely manner.  However, again, this requires being able to accurately measure exactly what the simultaneous connection threshold should be.  Set it too low and you start to deny requests before you have overloaded your back-end service.  Set it too high and you overload your back-end service anyway.  You will never manage to set it correctly and budget adequate time to maintain it as the software changes, so it&#8217;s effectively useless to go down that road.  Anyway, the httpclient blocks requests in doGetConnection when all the connections are being used and does not drop them, so the httpclient does not implement this kind of behavior.</p>
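<p>To make the queue-versus-drop distinction concrete, here is a minimal sketch of a limiter that drops.  It is a hypothetical class of mine built on the JDK&#8217;s <tt>Semaphore</tt>, not something httpclient provides.  The important part is <tt>tryAcquire()</tt>, which returns immediately instead of parking the thread the way a blocking <tt>acquire()</tt> would.</p>

```java
import java.util.concurrent.Semaphore;

// Hypothetical limiter for illustration; httpclient does not provide this behavior.
final class SheddingLimiter {
    private final Semaphore permits;

    SheddingLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free right now; returns false (request dropped) otherwise. */
    boolean tryRun(Runnable backendCall) {
        if (!permits.tryAcquire()) { // non-blocking: the caller never waits in a queue
            return false;
        }
        try {
            backendCall.run();
            return true;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SheddingLimiter limiter = new SheddingLimiter(1);
        boolean[] innerAccepted = new boolean[1];
        // The outer call holds the only permit, so the nested call is dropped at once.
        boolean outerAccepted = limiter.tryRun(() -> {
            innerAccepted[0] = limiter.tryRun(() -> { });
        });
        System.out.println("outer accepted: " + outerAccepted
                + ", inner accepted: " + innerAccepted[0]);
    }
}
```

<p>A dropped request fails in microseconds, so the caller can shed it or try elsewhere while the requests that did get a permit proceed at normal speed.  The hard part is still picking <tt>maxConcurrent</tt>, and this sketch does nothing to help with that.</p>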
<p><strong>MaxConnectionsPerHost recommendation</strong></p>
<p>Find some way to disable it, or set it to something &#8220;insane&#8221; like 10,000.  All it does, in an SOA environment, is cause problems without usefully solving any problem.</p>
<p>You will then never again see the problem where an idle java app is having problems talking to an idle back end service when everything in the network checks out fine &#8212; an outage due simply to configuration.</p>
<p>You can then focus on scalability and brownouts using timeouts, exponential backoff, or simply elastic or &#8220;cloudy&#8221; scaling of services in response to demand.</p>
<p>And in general, you should not try to &#8220;protect&#8221; back end services with any kind of artificial limits.  Invariably this results in exercising the Law of Unintended Consequences when those limits are accidentally hit while all the services are fine.  It will not &#8216;save&#8217; you but will cause you problems.  Of course, when you encounter a problem like Oracle&#8217;s expensive connections you need to manage that resource and establish limits, but still those limits should be pushed to the point where you are using the limits to protect Oracle&#8217;s memory and not in an attempt to prevent apps from delivering &#8216;too many&#8217; requests to Oracle.  If your database melts down under load you need to address the problem well upstream with whatever is misbehaving, or just buy a bigger database machine, or use some more horizontally scalable NoSQL solution.</p>
<p><strong>Summary</strong></p>
<p>The MaxConnectionsPerHost parameter is a throwback to RFC 2068, written in 1997, before anyone dreamed up the term &#8220;Service Oriented Architecture&#8221;.  In an SOA shop, this behavior is worse than useless and wastes time and causes needless outages without solving any problems.</p>
<p>The apache httpclient authors probably will not by default remove this limit, since they have to consider that their code could also be used to write web browsers, where some kind of finite limit is certainly prudent.  But in any SOA shop this limit should be disabled or raised to something like 10,000 in all service-calling code &#8212; effectively making it unlimited and removing it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/09/06/why-i-hate-java-httpclient-maxconnectionsperhost/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Beginning to configure Chef</title>
		<link>http://www.scriptkiddie.org/blog/2010/09/04/beginning-to-configure-chef/</link>
		<comments>http://www.scriptkiddie.org/blog/2010/09/04/beginning-to-configure-chef/#comments</comments>
		<pubDate>Sun, 05 Sep 2010 00:32:11 +0000</pubDate>
		<dc:creator>Lamont Granquist</dc:creator>
				<category><![CDATA[sysadmin]]></category>

		<guid isPermaLink="false">http://www.scriptkiddie.org/blog/?p=169</guid>
		<description><![CDATA[Setup First, you can signup for the free hosting service for 5 servers, which eliminates the need to screw around setting up the server infrastructure: https://cookbooks.opscode.com/users/new Second, go through the five step tutorial here: http://help.opscode.com/faqs/start/how-to-get-started Configuring ~/.chef The opscode instructions have you create ~/chef-repo/.chef and put your knife.rb and .pem file in there. I prefer [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Setup</strong></p>
<p>First, you can sign up for the free hosting service for 5 servers, which eliminates the need to screw around setting up the server infrastructure:</p>
<p><a href="https://cookbooks.opscode.com/users/new">https://cookbooks.opscode.com/users/new</a></p>
<p>Second, go through the five-step tutorial here:</p>
<p><a href="http://help.opscode.com/faqs/start/how-to-get-started">http://help.opscode.com/faqs/start/how-to-get-started</a></p>
<p><strong>Configuring ~/.chef</strong></p>
<p>The Opscode instructions have you create ~/chef-repo/.chef and put your knife.rb and &lt;username&gt;.pem file in there.</p>
<p>I prefer to move those files into ~/.chef so that I don&#8217;t have to <tt>cd ~/chef-repo</tt> in order to use knife.  You can also push out these config files in ~/.chef just like ~/.bashrc files so that you can use knife on different nodes.  Of course a hacker who gains root on one of the servers where your knife credentials are can impersonate you to the chef server, so it may be wise to restrict your knife config to a set of bastion servers.  Obviously, chef can help you manage this config.</p>
<p>My ~/.chef/knife.rb looks like:</p>
<pre>
log_level                :info
log_location             STDOUT
node_name                "USERNAME"
client_key               "#{ENV['HOME']}/.chef/USERNAME.pem"
validation_client_name   "ORGANIZATION-validator"
validation_key           "#{ENV['HOME']}/.chef/ORGANIZATION-validator.pem"
chef_server_url          "https://api.opscode.com/organizations/ORGANIZATION"
cache_type               'BasicFile'
cache_options( :path => "#{ENV['HOME']}/.chef/checksums" )
cookbook_path            ["#{ENV['HOME']}/chef-repo/site-cookbooks"]
</pre>
<p>Here, USERNAME and ORGANIZATION should be replaced by your chef username and your organization&#8217;s name.  You will also need to put your credentials in ~/.chef/USERNAME.pem (replace USERNAME by your username).</p>
<p>You have now configured knife to know who you are, where your credentials are, and configured some paths in your environment.</p>
<p>You should now be able to run commands like <tt>knife node list</tt> in any directory on the host that you have configured.</p>
<p>I&#8217;ve also set up my cookbook_path to ~/chef-repo/site-cookbooks so that from any directory, knife can find my cookbooks in order to upload them.  You should modify that to wherever you have your site&#8217;s cookbooks.</p>
<p><strong>Adding New Nodes</strong></p>
<p>Now, when you&#8217;re done with that, to add a new node, you can deploy these two files to the new node:</p>
<p><tt>/etc/chef/client.rb</tt><br />
- this points chef at the server<br />
<tt>/etc/chef/validation.pem</tt><br />
- your organization key, used to authenticate to the server the first time</p>
<p>Then become root and run <tt>chef-client</tt> to add the node to the server and create the client.pem.</p>
<p>Now you can run <tt>knife node list</tt> to see the newly added node.</p>
<p>You should protect /etc/chef by making it mode 0700 or something similar.  Once the /etc/chef/client.pem file has been bootstrapped down from the server, you (or a chef recipe) can delete the /etc/chef/validation.pem file.</p>
<p><strong>Summary</strong></p>
<p>So, now you should have set up an account at Opscode, successfully set up your organization, configured knife to be easy to use on your host, and bootstrapped a few hosts to use chef.  Now you can move on to getting chef to do useful things.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scriptkiddie.org/blog/2010/09/04/beginning-to-configure-chef/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

