Here at Melon Media we’ve tried many different hosting options for scalable infrastructure. From dedicated servers to Amazon EC2 to Rackspace Cloud Sites, they all have their strengths and weaknesses. What’s really interesting is how they cope when hit with a large traffic spike. When the pressure is on and time is critical, some systems shine and others… cause nervous breakdowns.
With ManageTwitter we decided to use a Rackspace Cloud Server instance for the first time in production. It has some interesting benefits over other cloud options, including the ability to be resized to a larger instance on the fly. Since this was a small app that we didn’t really want to expend huge resources on, starting with a small Cloud Server and resizing it if needed seemed like a good solution for scaling a small project.
ManageTwitter does not involve much writing to the hard drive. Most of the data used in the application is stored temporarily in memory. We started off with a 512 MB / 20 GB server, which handled our initial load easily. The application’s performance is mostly CPU-bound.
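To illustrate the pattern (this is a hypothetical sketch, not ManageTwitter’s actual code): keeping working data in an in-memory store with an expiry time means almost no disk I/O, which is why CPU, not disk, becomes the bottleneck.

```python
import time

class MemoryCache:
    """Minimal in-memory store with per-entry expiry.

    Illustrative only: data lives in a dict and is discarded after
    `ttl_seconds`, so nothing touches the disk.
    """

    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._data[key] = (value, time.time() + self.ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if time.time() > expires:
            # Entry has expired; drop it.
            del self._data[key]
            return None
        return value
```

The trade-off is the one we hit later in this story: anything held only in memory is lost if the server goes away.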
Our first traffic spike came when TechCrunch posted an article about the service. I’d just briefly stopped by the computer to check my email after a healthy dinner of leftover crackers, yogurt and chocolate when I saw a retweet of TechCrunch’s Twitter stream about ManageTwitter pop up on TweetDeck. The article had only gone up minutes earlier, but after firing up http://www.manageflitter.com in my browser I saw the server was already not responding. A quick check confirmed SSH access wasn’t going to respond under the load either. While Cloud Servers do provide burst capability, we’d clearly exhausted all our resources.
I triggered a resize on the server within the Rackspace Cloud control panel and waited anxiously for the server to come back up. The control panel shows how the resize is progressing.
Flicking through TechCrunch and Twitter in other windows, I saw numerous complaints coming in about ManageTwitter being inaccessible. “Another Techcrunch review effect ?” posted jacopogio in the TechCrunch comments.
The site had been inaccessible for about 15 minutes at this point. Beads of sweat were starting to form.
19%… 20%… 10%…
What? I quickly fired up a chat window with Rackspace Cloud support.
“Hi, I’ve got a cloud server that’s undergoing a high traffic spike. I’m trying to resize it, but the progress meter is falling back down. Will it complete, or is something wrong?”
RS: “It will complete.”
“Is it normal for it to go slowly like this? I’ve resized similar servers that went significantly faster.”
RS: “Yes, it’s normal under high traffic.”
“Is there anything I can do to make it go faster?”
RS: “Stop the traffic.”
Not the most useful thing to hear. I decided that I wasn’t going to get much out of that conversation, so I went and made some changes to the code, moving static files to a CDN, to help reduce load once the server came back up.
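The change itself is simple in principle: rewrite static asset URLs to point at a CDN host so the app server only has to answer dynamic requests. A minimal sketch of the idea, with a hypothetical CDN hostname and path prefix (not our actual code):

```python
# Assumed CDN endpoint and static path prefix -- placeholders for illustration.
CDN_HOST = "https://cdn.example.com"
STATIC_PREFIX = "/static/"

def cdn_url(path: str) -> str:
    """Map a local static asset path (e.g. /static/app.css) to the CDN.

    Dynamic paths are returned unchanged, so only static files
    are offloaded from the overloaded app server.
    """
    if path.startswith(STATIC_PREFIX):
        return CDN_HOST + path
    return path
```

In a template layer you’d call something like `cdn_url("/static/app.css")` for every stylesheet, script and image reference; under spike load, those requests then never reach the origin server at all.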
After about another 20 minutes things looked promising. The resize operation had reached 99%.
Suddenly it dropped to 0%. Frantically I jumped back into the Rackspace support chat and explained what had happened. I gave them my IP. Several minutes later a pop-up asked me to verify that the server had resized successfully. I decided to confirm without checking the integrity; we didn’t have much valuable data stored at that point, and at worst I could rebuild the code base from our development repository.
Fortunately most of the data was copied across successfully. It looked like a few hours of data had been lost from the database for some reason, but otherwise it was a clean migration. I copied across the CDN-integrated code and fired the site up in my browser. The IP change seemed to happen almost instantly, as the site came up straight away. We’ve been running smoothly on the 8192 MB / 320 GB instance ever since.
Ultimately the resize operation took 40 minutes to go from a 512 MB / 20 GB instance to 8192 MB / 320 GB under TechCrunch review traffic load. While the process was a little stressful and support could have been a little more helpful (a rare complaint from my experience with Rackspace!), it all worked pretty much as expected. For very little in resources and cost, the outcome was about as good as we could have hoped for.
One interesting idea we might recommend to Rackspace: use whatever technology they already use for IP switching to temporarily give a server a new IP, so that repairs and resizes under unexpected high load can be performed faster.
What have your experiences been? Do you have any recommendations for how we could have handled this situation better?