Slacker Elves in the Amazon Cloud
One of my current projects has been building a framework for automatically provisioning and deploying servers into the Amazon EC2 cloud. So, I have scripts that ask Amazon to start an EC2 instance, wait for Amazon to tell me that they are 'running', and then try to SSH into the instances to do things to them.
One vexing issue I ran into is that often the ssh connection would fail while running the script, yet I would have no problem ssh'ing manually afterwards. I spent some time googling the problem and found some people having similar mysterious failures, but no concrete solutions.
The failure mode for me during script execution looked exactly like what you see when you've screwed up your keys somehow: running with -v, you can see it tries public-key, has no luck, and moves on to ask for a password. That definitely is not going to work because a) it's not interactive and b) there is no root password on a fresh EC2 instance.
So, what is the problem? Well, I have not really figured it out, but I have come to realize that there is a period of time after the Amazon says the instance is 'running' during which it will still reject all SSL connections. The period of time seems to vary between 30 and 60 seconds, and seems to increase during what I would assume are peak usage hours.
I speculate that somewhere in the cloud is a little elf that runs around distributing Amazon's half of the keypairs to all newly-started EC2 instances. The lag in SSL connectivity is due to the elf getting tired and/or busy, and the monitoring machinery is not designed to wait for him to finish before it declares that an instance is 'running'. I've not been able to confirm that this is the case, but I'm sure Occam would favor explanations involving lazy elves.
A crude way around the problem is to simply sleep for a minute or so before trying to establish a connection. A slightly less crude way is to repeatedly attempt to SSH in after the instance starts.