Monday, July 30, 2012

Authentic Steam Watches

Why don't more steampunk novels involve the mechanics/engineering of Victorian-era devices? Watches and gears are everywhere in steampunk culture, but it seems there's never any technical discussion of them in or relating to the plot.

I got a pocket watch recently, and as is natural to me I find myself gravitating to the technical aspects. (Wikipedia) For instance, railroads demanded very specific standards for their timepieces, lest a train derail from being off-time by a couple of minutes. Therefore there are specific watches which are engineered to be very robust and keep time much more accurately.

There are even technical words which I've never heard of before, such as 'isochronism' - keeping the same rate even as conditions inside the watch change, such as the mainspring winding down. Apparently the root is even used in some modern technical documents; USB has an isochronous transfer mode.

One of the things that stood out was how temperature affects the operation of the device. Extreme cold or heat will contract or expand the steel balance, causing the watch to run slow or fast. Watchmakers engineered a solution which involves mating it to a brass balance and having two cuts in both, though that leads to only keeping accurate time in either extreme cold or hot environments. Special alloys ended up fixing the problem for good, though that was post-Victorian, essentially.

You could even extend some knowledge of watchmaking to other parts of a story. Jewels are often used as a hard, durable, low-friction mounting point for the moving pieces of a pocket watch. You could include in your story a plot device where one particular jewel (though normally valueless) unlocks some key to some device made thousands of years ago, as part of some detective novel revolving around ancient devices with a modern spin.

There's a plethora of technical jargon specific to pocket watches which might be nice to include in the story. If you want to write for geeks/nerds, including technical details like this can't hurt.

Some neat facts:

Wrist watches ('wristlets') were considered feminine and unmanly until they were introduced by the military and finally made standard issue in the 1940s.
The vest-pocket in a three piece suit is intended for a pocket watch. Since vests fell out of fashion, the only place to put a pocket for a watch was in trousers. Hence that little pocket you tend to put change in or try to cram a cellphone into.
A four-minute delay in one watch caused a train wreck, hence railroad chronometers will (among other things) keep time to within 30 seconds in a week.

Tuesday, July 24, 2012

HSTS makes CAs obsolete

I was in the toilet, where most of my brilliant ideas come from, and I was thinking about HTTP. How it's a bit crufty and old (13 years), how it could use significant upgrades to enhance delivery of content. I thought about SPDY and how I'm wary of the 'features' it mandates, like SSL. I don't like SSL in general (it's a pain in the ass) and I like being forced to pay to serve my own content even less.

Then I thought about HSTS and how it makes it easier to connect to a site securely. Sure, it has nothing to do with transport or encryption directly, but the aim was to keep the connection secure by preventing an attack on a browser's ability to *not* use SSL. And I remembered that browsers like Chrome ship with implicit lists of sites which should have HSTS enabled by default. And then it hit me.
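To make the HSTS idea concrete, here's a rough sketch of what a browser does with its built-in list: any plain-HTTP URL for a listed host gets rewritten to HTTPS before a request is ever sent. The host list and the function name are made up for illustration, and a real browser also handles ports, subdomains and expiry, which this ignores.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical preloaded list; a real browser ships a much longer one.
PRELOADED_HSTS_HOSTS = {"example.com"}


def upgrade_url(url, hsts_hosts=PRELOADED_HSTS_HOSTS):
    """Rewrite http:// to https:// when the host is on the HSTS list."""
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in hsts_hosts:
        # The insecure request is never made, so there is nothing
        # for an attacker to intercept or downgrade.
        return urlunsplit(parts._replace(scheme="https"))
    return url
```

Anything not on the list is left alone, which is exactly the first-visit gap discussed further down.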

With an HSTS-enabled flag for a website in your browser, if you also shipped a certificate fingerprint, it basically bypasses the need for a Certificate Authority.

Think of SSH. What's the one time your connection is in peril?

The authenticity of host 'syw4e.info (97.107.132.9)' can't be established.
RSA key fingerprint is 57:f9:cf:53:3a:fb:a4:af:e0:96:3c:20:99:30:82:8e.
Are you sure you want to continue connecting (yes/no)?

That question is all that stands between you logging into a real server and a fake server, and establishing a secure connection or not. If you had that key fingerprint already you would know if it was authentic, and you could go on with your life without answering stupid questions.

This is exactly what happens when you visit a site with a self-signed certificate, only it's much more complicated than it needs to be. If your browser simply had a list of those fingerprints (similar to the list of HSTS-enabled websites Chrome has) it could connect securely, automatically, without having to verify against a 3rd-party Certificate Authority.
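As a minimal sketch of that fingerprint list: hash the server's certificate and compare it to the value you already have for that host. I'm assuming SHA-256 over the DER-encoded certificate here (the SSH prompt above shows an MD5 key fingerprint), and the function names are mine, not any browser's.

```python
import hashlib
import hmac


def fingerprint(cert_der):
    """Return a colon-separated SHA-256 fingerprint of a DER certificate."""
    digest = hashlib.sha256(cert_der).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def is_pinned(host, cert_der, pins):
    """True only if we hold a fingerprint for host and it matches."""
    expected = pins.get(host)
    if expected is None:
        return False  # unknown host: no basis for trust
    return hmac.compare_digest(expected, fingerprint(cert_der))
```

In a real client the `cert_der` bytes would come from something like `ssl.SSLSocket.getpeercert(binary_form=True)`; the sketch takes raw bytes so it runs without a network.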

Though this would make the browser's connections about as secure as with SSH, this isn't practical. There are lots of websites out there. We can't possibly keep a list of all of them in the browser. But if you believe in the idea that HSTS makes us more secure than without it, simply accepting and keeping the first certificate you got would be just as secure, right? Well, there's a problem there.

Websites are constantly changing their certificates. They add and remove hardware regularly, and sometimes revoke certificates if there's suspicion their private key might have been compromised. They also change before they expire. So even if we had a list of the initial certificates' fingerprints, what happens when they change?
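Here's roughly what "accept and keep the first certificate" looks like, and where it gets stuck. This is a sketch of the trust-on-first-use idea only; the class and the three result strings are hypothetical.

```python
import hashlib


class TofuStore:
    """Remember the first certificate seen for each host."""

    def __init__(self):
        self._seen = {}  # host -> SHA-256 hex digest of the certificate

    def check(self, host, cert_der):
        fp = hashlib.sha256(cert_der).hexdigest()
        known = self._seen.get(host)
        if known is None:
            # First visit: nothing to compare against, so we just trust it.
            self._seen[host] = fp
            return "first-visit"
        # A mismatch could be a routine rotation or an attack;
        # the store alone cannot tell which.
        return "match" if known == fp else "changed"
```

The "changed" case is the whole problem: without some update channel, a legitimate renewal and a man-in-the-middle look identical.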

In order to support a dynamic network of secured systems, there would need to be an extension to the encryption protocol that allowed downloading updates for future certificates. In addition, sites could publish a list of 'trusted' hosts (perhaps even on different domains) which can also update the fingerprint, so that if host A.B.C is down, you can still get an update for its certificate fingerprint from B.B.C, or even D.C. In this way, a very simple peer-to-peer network of whitelisted hosts could share updates about the security of the network, without being explicitly tied to a few for-profit corporations (SSL CAs).

So then you may ask, what about internal servers? How can we ensure a user going to a closed, internet-less site knows they're connecting securely? The answer there is, of course, hypocrisy.

If you want your client to be connected securely, you have to exchange some secrets. It's mandatory for encryption to be secure. Maybe it's an HMAC'd challenge-response pair, or a shared key or certificate. You need to share something ahead of time.
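For the HMAC'd challenge-response case, a minimal sketch looks like this. It assumes both sides already hold the same secret key (that's the "share something ahead of time" part); the function names are illustrative.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Server side: a fresh random challenge for each login attempt."""
    return secrets.token_bytes(32)


def respond(shared_key, challenge):
    """Client side: prove knowledge of the key without sending it."""
    return hmac.new(shared_key, challenge, hashlib.sha256).digest()


def verify(shared_key, challenge, response):
    """Server side: recompute the answer and compare in constant time."""
    return hmac.compare_digest(respond(shared_key, challenge), response)
```

The key never crosses the wire, and because the challenge is random each time, a recorded response is useless for a replay.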

With SSL, it's been the chain of certificates from Certificate Authorities that live in your computer and in your browser - up to 650 or more of them! So we can keep that system alive, and keep paying CAs for the 'extra assurance' we need for things like online banking, and offline encrypted connections. But we can also share our own secrets.

Take the case of the 'pay-for-wifi' connection abroad. You connect to the wifi and try to check your e-mail, and are redirected to a page asking you to put in your credit card details. Well wait just a minute! Is that the real page, or did some hacker put that up? With CAs you would be assured, because you have their chain of trust in your browser. But if you don't want to use CAs, you could input the certificate fingerprint yourself, perhaps if it was printed on the wall next to the access point's SSID and WPA-PSK passphrase.

tl;dr

But I digress. The main thing to take away from this post is how HSTS has dramatically changed our perception of 'secure web'. Instead of demanding that all connections are secure, we accept that on the very first visit, we might be getting a 'real' HSTS response from a website, or there might be an attacker lying in wait.

Of course I'm not as smart as I seem. Somebody else has already thought of all this and created it, and is trying to get the browsers and big internet players to buy in. It's not going so well. But since HSTS is already implemented in extremely popular browsers, they must have already accepted the idea of the no-assured-trust-on-first-visit model of security - for the HTTP protocol, anyway. If they accept that, then they're only a step away from the same security that SSH depends upon.

Considering all this, it seems that there's a disconnect in the reality of browser security. The browser and big-internet-guys already assume their connection might be compromised on the first visit. Yet they won't accept this new model that avoids the need for Certificate Authorities. Once implemented, SPDY adoption might actually skyrocket, because the protocol wouldn't be beholden to paying third-parties and depending on all 650 of them for security. I just hope progress and the pursuit of better security wins out over commercial interests.

Friday, July 13, 2012

How to Scale in the Real World

Every so often I'll see a blog post about "scaling lessons I learned when launching my startup." It's painful to read. Lessons like "use monitoring" and "you can use metrics for your application!" and "sharding is good".

Then they go into the hacks. "Add artificial load to your server so when it breaks, you can remove the extra load, and you have more capacity!" "Take a server offline every once in a while." "Automatically kill anything which has too much memory or CPU or takes too long." "Memcache memcache memcache." "Giving developers root makes their lives easier, and mine too!"

Yes, we were all beginners once. These little nuggets are a window into years ago when I too thought it was a good idea to make untested changes and restart services in the middle of the day, or to up the limit on max connections until the server fell over and then scramble to add another server. We learn from our mistakes.

The thing is, no book or blog post you read about scaling will work for everyone. Everyone experiences it differently because both the technology and what you're using it for are different. So you need to be flexible while relying on a little bit of old folks' wisdom. And yes, I just called myself 'old folks' at 28. Jesus I'm arrogant.

Anyway. Here's how I think of scaling in the real world. Keep in mind I'm only talking about "scaling" and not "keeping a fully-redundant high-performance site operating at peak optimization", because that's five different things and way more complex than a single blog post.

Step 1. BE AFRAID

A good mindset of fear and paranoia will help you plan and execute everything you do to scale your site. You should be aware of everything you do and what its consequences could be. Fear of the site going down, fear of what happens when I push this commit out, fear of bottlenecking i/o, fear of accidental ddos'ing, fear of getting hacked.

Fear is a great motivator. You should also keep in mind it's just a job and calm the hell down, but in general being wary of things breaking or degrading should be high in your mind when you do anything. It will help you plan and execute your plan in ways that will minimize risk and maximize the value of changes.

Step 2. HAVE A GOAL

So your startup is going to revolutionize the way people take a bath by making a social network for rubber ducky owners. Great idea, but that's not the goal I'm talking about. Your goals towards scaling should have specific things you want to accomplish, such as a number of users on your site at the same time, or the average speed of anyone browsing any part of the site. You will execute your goals by building out your site to meet exactly these criteria.

Now you might be saying, but I want to scale infinitely! Can't you just tell me how to configure Redis so I'll never run out of capacity? The answer is of course, No. All scaling has upper limits. The point is to figure out how far you can go ahead of time, so that when you're getting nearer to the limit, you know to make a new goal and plan for that.

Imagine eBay. At some point they probably had a generic way to scale for a while, so they could keep adding servers and bandwidth and keep up with demand. But at some point, you outgrow datacenters. You outgrow coasts and continents. Will your little auction site keep churning away when it's stretched out across the globe, still using a static map file in Apache that needs to be reloaded every time you add an application server? The goals have to be re-imagined at some point. Figuring yours out will make it easier to focus on the 'now' while keeping an eye on the future.

Step 3. PLAN

A scaling plan is basically your architecture manifesto. Keep in mind, it's based on your goals, which should change as you grow, so don't be stuck on one kind of technology or way of doing something. Whatever it is you're doing, there's a different, probably better way to do it, so don't get too caught up with the details. To begin, take your goal and look at every single layer from the client to your app's guts and back.

Let's take a goal which says "I want to maintain 30,000 hits per second of traffic." Starting with the browser client, where is your traffic going? Probably to a web server. If it's going straight to your web servers, you're going to need to sustain over 30K connections, which is a problem for just one web server. If you were going to a CDN that would be much easier to deal with, and you can probably get by with one frontend caching proxy server like Varnish (though that's not redundant at all, your goal didn't include redundancy...). It will have to be a really beefy box to keep a good and fast cache, though. You'll probably also want to enforce cache headers to the CDN to make sure it's not pulling your whole site from the origin every 2 seconds.

So you have 30K HPS to static content. Wonderful! Oh what's that? You wanted to display a social graph of your rubber ducky empire to every user? Shit. I guess we need more stuff. MySQL for a database (because it's easy and universal), Starman for an application server (because fuck you Perl is more than good enough), Memcached for your "fast" application cache, and one of those Map/Reduce thingies for making your social graph (I'm not a real developer, I don't know how that shit works). But how do you configure them? How many do you need? What happens if you outgrow something? Calm down. And keep in mind it doesn't really matter what you pick, you'll figure out how to scale it soon enough.

First write your application for the stack you picked. It doesn't matter what your application is or how shitty it runs as that has nothing to do with scaling. Scaling happens once the piece of crap code is done. This is how scrappy start-ups can afford to write terrible on-the-fly hacks and still survive launch week. So now that your app is running, you need to gather benchmarks.

To gather benchmarks we need metrics. To get metrics you either write something yourself or grab something that's actually good, like collectd. Configure it to gather everything under the sun and send it somewhere not on the box it's collecting on. Then populate your system with fake data and start hammering all the parts of the site. This is useful later as you can keep testing functionality and capacity as your site grows.

As you test your site, see how much of each resource is used up by the meager benchmark you've made. Now scale that up to your goal and add about 20% to that number, and you know how much of each resource you'll need to hit your goal. Now just allocate enough capacity to get there. Keep in mind disk i/o, bandwidth, cpu, database queries, connection pool numbers, cache hit percentage, etc etc.
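The arithmetic is simple enough to sketch. This assumes resource use grows linearly with traffic, which is a big assumption that real systems often break; the function and metric names are made up for the example.

```python
def required_capacity(measured_usage, benchmark_hps, goal_hps, headroom=0.20):
    """Scale benchmark resource usage up to the goal, plus headroom.

    measured_usage maps a metric name to what the benchmark consumed.
    Assumes usage grows linearly with hits per second.
    """
    scale = goal_hps / benchmark_hps
    return {
        name: used * scale * (1 + headroom)
        for name, used in measured_usage.items()
    }


# Hypothetical benchmark: 100 hits/sec used 2 CPU cores and 400 queries/sec.
needs = required_capacity(
    {"cpu_cores": 2, "db_queries_per_s": 400},
    benchmark_hps=100,
    goal_hps=30000,
)
```

With those made-up numbers you'd be planning for about 720 cores and about 144,000 queries per second, which is the kind of figure that sends you off to tune the stack.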

These numbers are not just basic information you need for capacity planning; they're critical in monitoring your live site to see when you unexpectedly hit a bottleneck. All of these criteria should have monitoring alerts that trigger if they get anywhere near 80% of capacity, or double in a less-than-manageable amount of time. (Can you double your database capacity in an hour? No? Then you should probably get alerts if any of your database metrics go up by 50% in a half-hour.)
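Both alert rules fit in a few lines. This is only a sketch of the logic, not a config for any real monitoring tool; the function name, the sample format and the default thresholds (80% of capacity, 50% growth in 30 minutes) are assumptions lifted from the paragraph above.

```python
def should_alert(samples, capacity, window_s=1800,
                 usage_limit=0.80, growth_limit=0.50):
    """Decide whether a metric needs attention.

    samples is a list of (timestamp_seconds, value), oldest first.
    """
    now_t, now_v = samples[-1]
    # Rule 1: close to the ceiling.
    if now_v >= usage_limit * capacity:
        return True
    # Rule 2: growing faster than we could add capacity.
    in_window = [(t, v) for t, v in samples if now_t - t <= window_s]
    _, base_v = in_window[0]
    return base_v > 0 and (now_v - base_v) / base_v >= growth_limit
```

The second rule is the one people skip: a metric at 16% of capacity looks fine until you notice it was at 10% half an hour ago.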

Now that you know the basic resources you'll need to achieve your goals, tune your stack. This is where "premature optimization" is actually a great thing. For example, your resource numbers for MySQL probably look ridiculous - 50 servers just to handle 30k HPS? Apparently people forget that MySQL (like most tools) needs to be tuned to reach its peak performance. Once you tune your stack you can go back to your benchmarking tools and fine-tune the performance to get the numbers more efficient.

But let's be honest: the goal is not to get the fastest performing stack, it's to get a stack that can perform. You might start to rethink your application when you find out it's just not performing very well. In general it's a mistake to redesign your app just because it looks like scaling is taking a lot more resources than it should. As a famous customer support representative once said, "The future is gonna cost more money," and your application will get slower over time. Focus on scaling and let someone else optimize the application.

With realistic numbers about how your site can perform, you can start allocating resources. Your goal was 30K HPS, but you only get 100 hits per second right now. If you have no historical data to plot the growth of traffic, just shoot for 10 times the traffic you're doing now and allocate resources for that. Before you have a launch day or big advertising push or something, check your historical data and do another 10x increase beforehand. If you're not using the cloud, make sure your provider can allocate resources at the drop of a hat for you, or that you have spares to use. If you're using the cloud, make sure you have all the steps down-pat for adding your resources in real time, so if you suddenly get a million users signing up to your site you know how to throw more resources in place.

The "we just got 10,000,000 signups!" scenario is extremely rare. But for cases of unexpected, goal-smashing growth, you need to have an emergency plan as well. You can find examples of them around the web. Typically it's a combination of handicaps to your site to keep some core functionality running. The last thing you want is for everything to go down. It's better to cap the number of incoming connections and allow a slow stream of users to use the site while you rush to obtain more resources to grow the site in time. Anything can become a bottleneck - network traffic, disk i/o, memory ceiling, database connections/queries, etc. Be aware of the maximum level for each criterion by comparing the resource use from your metrics with the configuration of each software component.
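Capping incoming connections can be as dumb as a counter with a ceiling. Here's a minimal sketch using a semaphore; the class name is hypothetical, and in practice you'd more likely set the equivalent limit in your proxy or web server than write it yourself.

```python
import threading


class AdmissionGate:
    """Let at most max_active requests in; turn the rest away quickly."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def try_enter(self):
        # Non-blocking: a full site answers "busy" instead of queueing
        # more work onto servers that are already struggling.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()
```

A request handler would call `try_enter()` first, serve a cheap "try again shortly" page when it returns False, and call `leave()` when a served request finishes.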

The last thing you want, which you'll add probably as you realize you don't have the money or capacity to just keep adding resources, is caching. In short: Cache Everything. Cache on your frontends. Cache on your backends. Cache to disk. Cache in memory. Cache the highest-used pages. Use a bigger journal to cache in the filesystem. If you desperately need iops, using tmpfs and writing changes occasionally with rsync is a form of caching. You can send users to the same servers to maximize cache hits at the cost of high-resource hot spots, or send them random places for better spread-out load at the loss of cache hits and increase in global resource use. Figure out what works best for your application.
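The sticky-versus-random trade-off comes down to how you pick a server. A rough sketch, with made-up function names and plain hash-modulo routing (real load balancers usually do something smarter so that adding a server doesn't reshuffle everyone):

```python
import hashlib
import random


def sticky_server(user_id, servers):
    """Same user always lands on the same server: warm cache, hot spots."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def random_server(servers, rng=random):
    """Any server will do: even load, colder caches everywhere."""
    return rng.choice(servers)
```

Note that with the modulo approach, changing the number of servers moves most users to a different box and throws away their cached state, so expect a dip in hit rate whenever you add capacity.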

Step 4. EXECUTION

So you have your goal, you have your plan, now you need to put it into practice. Scaling is one of those things where you don't need it until you need it. So being prepared to execute your plan at a moment's notice is pretty important. Usually it involves fire drills where your site goes down or you lose capacity and you need to add more quickly. But the management of your site is important as well.

Are your changes automatic? Do you have good revision control and deployment, and can you revert your changes immediately? Is your application's use of your infrastructure abstract enough that you can change backend pieces without ever touching your code? Can you roll out new services at the push of a button? Have you been testing your changes?

It seems obvious, but many times the problem with rapid scaling is simply a lack of best practices. All those little things you ignore because you're a startup and you don't have time to implement configuration management because of your 'just ship it' mentality? Once you've shipped it, and you suddenly need to scale, you get bit in the ass by the eventuality of your apathy to best practices.

Scaling is a never-ending process of analyzing data, testing limits, and growing your infrastructure. There's no easy way to do it, but at the same time, pretty much anyone can do it. The reason scrappy kids right out of school that jump on the startup bandwagon can keep tiny sites operating at huge numbers is because the actual work of adding resources is trivial. You figure out what you're lacking and you add more of it. The key is being constantly aware of what is going on and keeping one step ahead.
