Friday, July 13, 2012

How to Scale in the Real World

Every so often I'll see a blog post about "scaling lessons I learned when launching my startup." It's painful to read. Lessons like "use monitoring" and "you can use metrics for your application!" and "sharding is good."

Then they go into the hacks. "Add artificial load to your server so when it breaks, you can remove the extra load, and you have more capacity!" "Take a server offline every once in a while." "Automatically kill anything which has too much memory or CPU or takes too long." "Memcache memcache memcache." "Giving developers root makes their lives easier, and mine too!"

Yes, we were all beginners once. These little nuggets are a window into years ago, when I too thought it was a good idea to make untested changes and restart services in the middle of the day, or to up the limit on max connections until the server fell over and then scramble to add another server. We learn from our mistakes.

The thing is, no book or blog post you read about scaling will work for everyone. Everyone experiences it differently because both the technology and what you're using it for are different. So you need to be flexible while relying on a little bit of old folks' wisdom. And yes, I just called myself 'old folks' at 28. Jesus, I'm arrogant.

Anyway. Here's how I think of scaling in the real world. Keep in mind I'm only talking about "scaling" and not "keeping a fully-redundant high-performance site operating at peak optimization", because that's five different things and way more complex than a single blog post.


Step 1. FEAR

A good mindset of fear and paranoia will help you plan and execute everything you do to scale your site. You should be aware of everything you do and what its consequences could be. Fear of the site going down, fear of what happens when I push this commit out, fear of bottlenecking i/o, fear of accidental ddos'ing, fear of getting hacked.

Fear is a great motivator. You should also keep in mind it's just a job and calm the hell down, but in general being wary of things breaking or degrading should be high in your mind when you do anything. It will help you plan and execute your plan in ways that will minimize risk and maximize the value of changes.


Step 2. GOALS

So your startup is going to revolutionize the way people take a bath by making a social network for rubber ducky owners. Great idea, but that's not the goal I'm talking about. Your scaling goals should be specific things you want to accomplish, such as a number of users on your site at the same time, or the average speed of anyone browsing any part of the site. You will execute your goals by building out your site to meet exactly these criteria.

Now you might be saying, but I want to scale infinitely! Can't you just tell me how to configure Redis so I'll never run out of capacity? The answer is, of course, no. All scaling has upper limits. The point is to figure out how far you can go ahead of time, so that when you're getting nearer to the limit, you know to make a new goal and plan for that.

Imagine eBay. At some point they probably had a generic way to scale for a while, so they could keep adding servers and bandwidth and keep up with demand. But at some point, you outgrow datacenters. You outgrow coasts and continents. Will your little auction site keep churning away when it's stretched out across the globe, still using a static map file in Apache that needs to be reloaded every time you add an application server? The goals have to be re-imagined at some point. Figuring yours out will make it easier to focus on the 'now' while keeping an eye on the future.

Step 3. PLAN

A scaling plan is basically your architecture manifesto. Keep in mind, it's based on your goals, which should change as you grow, so don't be stuck on one kind of technology or way of doing something. Whatever it is you're doing, there's a different, probably better way to do it, so don't get too caught up with the details. To begin, take your goal and look at every single layer from the client to your app's guts and back.

Let's take a goal which says "I want to maintain 30,000 hits per second of traffic." Starting with the browser client, where is your traffic going? Probably to a web server. If it's going straight to your web servers, you're going to need to sustain over 30K connections, which is a problem for just one web server. If you were going to a CDN that would be much easier to deal with, and you can probably get by with one frontend caching proxy server like Varnish (though that's not redundant at all, your goal didn't include redundancy...). It will have to be a really beefy box to keep a good and fast cache, though. You'll probably also want to enforce cache headers to the CDN to make sure it's not pulling your whole site from the origin every 2 seconds.
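As a rough sketch of what "enforcing cache headers" could look like, here's a tiny helper that builds a Cache-Control value with one lifetime for browsers and a longer one for shared caches like a CDN. The function name and the TTL values are made up for illustration; `max-age` and `s-maxage` are standard Cache-Control directives, but the right numbers depend entirely on your content.

```python
def cache_control(browser_ttl: int, cdn_ttl: int) -> str:
    """Build a Cache-Control header value.

    max-age applies to any cache (including the browser);
    s-maxage overrides it for shared caches such as a CDN.
    """
    return f"public, max-age={browser_ttl}, s-maxage={cdn_ttl}"


# Hypothetical values: browsers revalidate after a minute,
# the CDN only goes back to the origin once an hour.
headers = {"Cache-Control": cache_control(browser_ttl=60, cdn_ttl=3600)}
print(headers["Cache-Control"])
```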

So you have 30K HPS to static content. Wonderful! Oh what's that? You wanted to display a social graph of your rubber ducky empire to every user? Shit. I guess we need more stuff. MySQL for a database (because it's easy and universal), Starman for an application server (because fuck you Perl is more than good enough), Memcached for your "fast" application cache, and one of those Map/Reduce thingies for making your social graph (I'm not a real developer, I don't know how that shit works). But how do you configure them? How many do you need? What happens if you outgrow something? Calm down. And keep in mind it doesn't really matter what you pick, you'll figure out how to scale it soon enough.

First write your application for the stack you picked. It doesn't matter what your application is or how shitty it runs as that has nothing to do with scaling. Scaling happens once the piece of crap code is done. This is how scrappy start-ups can afford to write terrible on-the-fly hacks and still survive launch week. So now that your app is running, you need to gather benchmarks.

To gather benchmarks we need metrics. To get metrics you either write something yourself or grab something that's actually good, like collectd. Configure it to gather everything under the sun and send it somewhere not on the box it's collecting on. Then populate your system with fake data and start hammering all the parts of the site. This is useful later as you can keep testing functionality and capacity as your site grows.

As you test your site, see how much of each resource is used up by the meager benchmark you've made. Now compare that to your goal and add about 20% to that number, and you know how many resources you'll need to hit your goal. Now just allocate enough capacity to get there. Keep in mind disk i/o, bandwidth, cpu, database queries, connection pool numbers, cache hit percentage, etc etc.
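The arithmetic here can be sketched in a few lines. This assumes resource use grows linearly with traffic, which is a simplification (real systems often fall off a cliff before 100%), and the function name and example numbers are invented for illustration.

```python
import math


def servers_needed(goal_hps, benchmark_hps, benchmark_utilization, headroom=0.20):
    """Estimate how many servers it takes to hit a traffic goal.

    benchmark_hps: hits/sec pushed at one server during the benchmark
    benchmark_utilization: fraction (0-1) of the bottleneck resource
        that one server used at that rate
    headroom: extra margin on top of the goal (the "add about 20%")
    """
    # Linear extrapolation: what one server could take at 100% of the resource.
    per_server_capacity = benchmark_hps / benchmark_utilization
    required_hps = goal_hps * (1 + headroom)
    return math.ceil(required_hps / per_server_capacity)


# Hypothetical benchmark: 500 hits/sec used 25% of one server's disk i/o.
print(servers_needed(goal_hps=30_000, benchmark_hps=500, benchmark_utilization=0.25))
```

Run it once per resource (cpu, bandwidth, connections, and so on) and plan around whichever one asks for the most servers.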

These numbers are not just basic information you need for capacity planning; they're also critical for monitoring your live site to see when you unexpectedly hit a bottleneck. All of these criteria should have monitoring alerts that trigger if they get anywhere near 80%, or double in a less-than-manageable amount of time. (Can you double your database capacity in an hour? No? Then you should probably get alerts if any of your database metrics go up by 50% in a half-hour.)
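Here's a minimal sketch of those two alert rules: one on absolute utilization and one on growth over a window. The function name and signature are hypothetical, and the 80% and 50% defaults just mirror the numbers above; a real monitoring system would have its own way of expressing this.

```python
def should_alert(current, previous, capacity, max_utilization=0.80, max_growth=0.50):
    """Return the list of reasons this metric should page someone.

    current: latest reading of the metric
    previous: reading from one window ago (e.g. 30 minutes earlier)
    capacity: the configured maximum for this metric
    """
    reasons = []
    # Rule 1: getting anywhere near the ceiling.
    if current / capacity >= max_utilization:
        reasons.append("utilization")
    # Rule 2: growing faster than you could add capacity.
    if previous > 0 and (current - previous) / previous >= max_growth:
        reasons.append("growth")
    return reasons


# Hypothetical database connection counts against a limit of 1000.
print(should_alert(current=850, previous=800, capacity=1000))
print(should_alert(current=300, previous=200, capacity=1000))
```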

Now that you know the basic resources you'll need to achieve your goals, tune your stack. This is where "premature optimization" is actually a great thing. For example, your resource numbers for MySQL probably look ridiculous - 50 servers just to handle 30k HPS? Apparently people forget that MySQL (like most tools) needs to be tuned to reach its peak performance. Once you tune your stack you can go back to your benchmarking tools and fine-tune the performance to get the numbers more efficient.

But let's be honest: the goal is not to get the fastest performing stack, it's to get a stack that can perform. You might start to rethink your application when you find out it's just not performing very well. In general it's a mistake to redesign your app just because it looks like scaling is taking a lot more resources than it should. As a famous customer support representative once said, "The future is gonna cost more money," and your application will get slower over time. Focus on scaling and let someone else optimize the application.

With realistic numbers about how your site can perform, you can start allocating resources. Your goal was 30K HPS, but you only get 100 hits per second right now. If you have no historical data to plot the growth of traffic, just shoot for 10 times the traffic you're doing now and allocate resources for that. Before you have a launch day or big advertising push or something, check your historical data and do another 10x increase beforehand. If you're not using the cloud, make sure your provider can allocate resources at the drop of a hat for you, or that you have spares to use. If you're using the cloud, make sure you have all the steps down pat for adding your resources in real time, so if you suddenly get a million users signing up to your site you know how to throw more resources in place.
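To make the 10x stepping concrete, here's a small sketch that lists the capacity targets between where you are and where you want to be. Capping the last step at the goal is my own assumption, as is the function name.

```python
def staged_targets(current_hps, goal_hps, factor=10):
    """List the capacity targets on the way from current traffic to the goal."""
    if current_hps <= 0 or factor <= 1:
        raise ValueError("need positive current traffic and a factor above 1")
    targets = []
    target = current_hps
    while target < goal_hps:
        # Step up by the factor, but never plan past the goal itself.
        target = min(target * factor, goal_hps)
        targets.append(target)
    return targets


# 100 hits/sec today, 30K hits/sec as the goal.
print(staged_targets(100, 30_000))  # [1000, 10000, 30000]
```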

The "we just got 10,000,000 signups!" scenario is extremely rare. But for cases of unexpected, goal-smashing growth, you need to have an emergency plan as well. You can find examples of them around the web. Typically it's a combination of handicaps to your site to keep some core functionality running. The last thing you want is for everything to go down. It's better to cap the number of incoming connections and allow a slow stream of users to use the site while you rush to obtain more resources to grow the site in time. Anything can become a bottleneck - network traffic, disk i/o, memory ceiling, database connections/queries, etc. Be aware of the maximum level for each criterion by comparing the resource use from your metrics with the configuration of each software component.
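One way to picture the "cap incoming connections" handicap is an admission gate: a fixed number of slots, and anyone who can't get one is turned away immediately instead of piling up. This is only a sketch inside a single process, with a hypothetical class name; in practice you'd more likely set the equivalent limit in your proxy or web server.

```python
import threading


class AdmissionGate:
    """Let at most max_active requests in at once; reject the rest right away."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def try_enter(self):
        # Non-blocking: returns False instead of queueing when the site is full.
        return self._slots.acquire(blocking=False)

    def leave(self):
        # Call once for every successful try_enter, when the request finishes.
        self._slots.release()


gate = AdmissionGate(max_active=2)
print(gate.try_enter(), gate.try_enter(), gate.try_enter())
```

A rejected request would get a cheap "come back shortly" page, which costs far less than letting it drag everything else down.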

The final piece you want, which you'll probably add once you realize you don't have the money or capacity to just keep adding resources, is caching. In short: Cache Everything. Cache on your frontends. Cache on your backends. Cache to disk. Cache in memory. Cache the highest-used pages. Use a bigger journal to cache in the filesystem. If you desperately need iops, using tmpfs and writing changes occasionally with rsync is a form of caching. You can send users to the same servers to maximize cache hits at the cost of high-resource hot spots, or send them to random servers for better spread-out load at the loss of cache hits and an increase in global resource use. Figure out what works best for your application.
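The sticky-versus-random trade-off can be sketched as two routing functions. The server names and user id are made up, and hashing the user id modulo the server count is just one simple way to get stickiness.

```python
import hashlib
import random

SERVERS = ["app1", "app2", "app3"]  # hypothetical backend pool


def sticky_server(user_id, servers):
    """Always send the same user to the same server, so its cache stays warm."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def random_server(servers):
    """Spread load evenly, at the price of colder caches everywhere."""
    return random.choice(servers)


print(sticky_server("ducky42", SERVERS) == sticky_server("ducky42", SERVERS))
print(random_server(SERVERS) in SERVERS)
```

Note that plain modulo hashing moves most users to a different server whenever the pool size changes; consistent hashing is the usual way to soften that.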


Step 4. EXECUTE

So you have your goal, you have your plan, now you need to put it into practice. Scaling is one of those things where you don't need it until you need it. So being prepared to execute your plan at a moment's notice is pretty important. Usually it involves fire drills where your site goes down or you lose capacity and you need to add more quickly. But the management of your site is important as well.

Are your changes automatic? Do you have good revision control and deployment, and can you revert your changes immediately? Is your application's use of your infrastructure abstract enough that you can change backend pieces without ever touching your code? Can you roll out new services at the push of a button? Have you been testing your changes?

It seems obvious, but many times the problem with rapid scaling is simply a lack of best practices. All those little things you ignore because you're a startup and you don't have time to implement configuration management because of your 'just ship it' mentality? Once you've shipped it, and you suddenly need to scale, you get bit in the ass by the eventuality of your apathy to best practices.

Scaling is a never-ending process of analyzing data, testing limits, and growing your infrastructure. There's no easy way to do it, but at the same time, pretty much anyone can do it. The reason scrappy kids right out of school who jump on the startup bandwagon can keep tiny sites operating at huge numbers is that the actual work of adding resources is trivial. You figure out what you're lacking and you add more of it. The key is being constantly aware of what is going on and keeping one step ahead.
