Tuesday, October 20, 2009

mistakes by developers when creating a... whatever

  1. Not communicating with sysadmins. You want to discuss technical issues with your sysadmins early on so they can figure out what kind of hardware will be needed to handle the load, and they can propose methods that work within the current infrastructure to do what you want. I know you like working with Storable files, but loading them off an NFS filer vs. getting just the data you want from a MySQL cluster is an easy pick for a sysadmin. And can we say "application servers" (e.g. FastCGI)?

  2. Picking the latest tech, or any tech based on interest or novelty or perceived design gains. Do the opposite: start with the oldest tech and work your way up as you look at your requirements. The reason, again, is the system as a whole: does the tech you want to use scale well? Does it have a long history as a stable, production-quality system? Does it support all the methods you'll need to work with it in Dev, QA and Production? Is there a straightforward deployment model that works well with it? Most importantly: do you know it backwards and forwards, and can you actually debug it once it breaks on the live site?

  3. Not keeping security in mind at design time. I still meet web app developers who don't know what XSS or SQL injection is. You need to take this seriously, because a hosed website can cost you your job (see the quick sketch after this list).

  4. Not using automated tests for your code. You need to know when a change breaks an expected result so it doesn't find its way into your site. Not QAing or *laugh* not even syntax-checking your code before pushing it also falls into this category. Test, test, test.

  5. Writing crappy code. Oh yes - I went there. There's nothing more annoying than doing a code review 2 years in and looking at all the bloated, slow, confusing, undocumented, unreliable, crappy code from the early guys who just wanted to get the site off the ground. There's always going to be bit rot and nobody's code is perfect. Just try your best not to cut corners. There is never a good time to rewrite, so try to make it stand the test of time. A good example is some of the backend code I've seen at some sites: the same app being used for over 10 years without a single modification and never breaking once. Also good to keep in mind is portability. When the big boss says it's time to run your code on Operating System X on Architecture Z, what'll it take to get it running? (hint: you'll probably be doing that work on the weekends for no extra pay)
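To make #3 concrete, here's a minimal sketch of the SQL injection half in Perl/DBI (the table, column and variable names are all made up for illustration). The whole issue is whether user input gets treated as data or as part of the query:

    use strict;
    use warnings;
    use DBI;

    # hypothetical connection details
    my $dbh = DBI->connect('DBI:mysql:database=myapp;host=localhost',
                           'appuser', 'secret', { RaiseError => 1 });

    my $name = shift @ARGV;   # pretend this came straight from a web form

    # Vulnerable: the value is interpolated into the SQL string, so input like
    #   x' OR '1'='1
    # rewrites the query instead of being treated as data.
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, email FROM users WHERE name = '$name'");

    # Safer: a placeholder lets the driver handle quoting for you.
    my $sth = $dbh->prepare('SELECT id, email FROM users WHERE name = ?');
    $sth->execute($name);

XSS is the same idea on the output side: escape anything user-supplied before it lands in your HTML.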

large scale ganglia

Ganglia is currently the be-all and end-all of open source host metric aggregation and reporting. There are a couple of other solutions slowly emerging to replace it, but nothing as well entrenched. Hate on RRDs as much as you want, but they're basically the de facto standard for storing and reporting numeric metrics. It can be hairy trying to figure out how to configure it all, though, so here's an overview of getting it set up on your network.

First install rrdtool and all of the ganglia tools on every host. Keep the web interface ('web' in the ganglia tarball) off to the side for now. The first thing you should consider is enabling IGMP snooping on your switches so they can take advantage of multicast groups. If you don't, multicast traffic gets flooded to every port much like a broadcast storm, which could potentially wreak havoc on your network equipment, causing ignored packets and other anomalies depending on your traffic.

In order to properly manage a large-scale installation you'll need to juggle some configuration management software. You need to configure gmond on each cluster node so that the cluster multicasts on a unique port for that LAN (in fact, keep all of your clusters on unique ports for simplicity's sake). If you have a monitoring or admin host on each LAN, you'll configure its gmond to listen for multicast [or unicast] metrics from one or more clusters on that LAN. Then you'll need to configure your gmetads to collect stats from either each individual cluster node or the "collector" gmond nodes on the monitoring/admin hosts. Our network topology is as follows:
Monitor box ->
  DC1 ->
    DC1.LAN1 ->
      Cluster 1
      Cluster 2
    DC1.LAN2 ->
      Cluster 3
    DC1.LAN3 ->
      Cluster 4
  DC2 ->
    DC2.LAN1 ->
      Cluster 1
      Cluster 2
      Cluster 3
      Cluster 4
    DC2.LAN2 ->
      Cluster 1

The monitor box runs gmetad and only has two data_source entries: DC1 and DC2.
DC1 and DC2 run gmetad and each has a data_source for each LAN.
Each LAN has its own monitor host running gmond, which collects metrics for all clusters on its respective LAN.
The clusters themselves are just multicasting gmonds running on the cluster nodes, each configured with a specific cluster name and multicast port.
The main monitor box, the DC boxes and the LAN boxes all run apache2+php5 with the same docroot (the ganglia web interface). The configs are set to load gmetad data from localhost on one port.
Each gmetad has its "authority" pointed at its own web server URL.
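To make that a little more concrete, here's a rough sketch of one way the three layers could be wired up. All hostnames, cluster names and ports below are invented, and it assumes each LAN monitor host runs both the collector gmond(s) and a gmetad for its clusters:

    # gmond.conf on every node in "Cluster 1" on DC1.LAN1
    cluster {
      name = "Cluster 1"
    }
    udp_send_channel {
      mcast_join = 239.2.11.71
      port       = 8701           # unique multicast port for this cluster
    }
    udp_recv_channel {
      mcast_join = 239.2.11.71
      port       = 8701
    }
    tcp_accept_channel {
      port = 8701                 # the cluster's XML is polled from here
    }

    # gmetad.conf on the DC1.LAN1 monitor host: one data_source per cluster,
    # pointed at the local collector gmonds (one gmond instance per port)
    data_source "Cluster 1" localhost:8701
    data_source "Cluster 2" localhost:8702
    gridname  "DC1.LAN1"
    authority "http://mon.lan1.dc1.example.com/ganglia/"

    # gmetad.conf on the DC1 box: one data_source per LAN,
    # polling each LAN gmetad's xml_port (8651 by default)
    data_source "DC1.LAN1" mon.lan1.dc1.example.com:8651
    data_source "DC1.LAN2" mon.lan2.dc1.example.com:8651
    data_source "DC1.LAN3" mon.lan3.dc1.example.com:8651
    gridname  "DC1"
    authority "http://mon.dc1.example.com/ganglia/"

    # gmetad.conf on the main monitor box: just the two DCs
    data_source "DC1" mon.dc1.example.com:8651
    data_source "DC2" mon.dc2.example.com:8651
    gridname  "Monitor"
    authority "http://monitor.example.com/ganglia/"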

(Tip: in theory you could run all of this off a single host by making sure all the gmetads use unique ports and modifying the web interface code to load its config settings based on the requesting URL, changing the gmetad port as necessary.)
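For what it's worth, the unique-ports half of that is just a matter of giving each gmetad instance its own xml_port and interactive_port in its config file (port numbers here are arbitrary):

    # gmetad-dc1.conf on the single host
    xml_port         8661
    interactive_port 8662

    # gmetad-dc2.conf on the same host
    xml_port         8663
    interactive_port 8664

The web interface half is the hand-rolled part, since the stock ganglia web frontend only knows how to talk to one gmetad at a time.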

In the end what you get is a main page which only shows the different DCs as grids. As you click one it loads a new page which shows that DC's LANs as grids. Clicking those will show summaries of that LAN's clusters. This allows you to lay out your clusters across your infrastructure in a well-balanced topology and gives you the benefit of some additional redundancy if one LAN or DC's WAN link goes down.

We used to use a single gmetad host and web interface for all clusters. That made it extremely easy to query any given piece of data from a single gmetad instance and to see all the clusters on one web page. The problem was we had too much data: gmetad could not keep up with the load and the box was crushed by disk IO. We lessened this by moving to a tmpfs mount, archiving RRDs and pruning any older than 60 days. Spreading out to multiple hosts also lessened the need for additional RAM and lowered network use and latency across long-distance links.

If you think you won't care about your lost historical data, think again. Always archive your RRDs to keep fine-grained details for later planning. Also keep in mind that as your clusters change your RRD directory can fill up with clutter. Hosts which used to be in one cluster and are now in another are not cleaned up by ganglia; they will sit on the disk unused. Only take recently-modified files into account when determining the total size of your RRDs. Also, for big clusters make your poll time a little longer to prevent load from ramping up too often.
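In case it's useful, here's roughly what that bookkeeping looks like in practice. The RRD path and archive location are assumptions; adjust them for your install:

    # total size of the "live" data: only count RRDs written to in the last day
    find /var/lib/ganglia/rrds -name '*.rrd' -mtime -1 -print0 | xargs -0 du -ch | tail -n 1

    # archive, then prune, anything that hasn't been updated in 60 days
    cd /var/lib/ganglia/rrds
    find . -name '*.rrd' -mtime +60 -print0 \
        | tar czf /srv/rrd-archive/rrds-$(date +%Y%m%d).tar.gz --null -T -
    find . -name '*.rrd' -mtime +60 -delete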

I'll add full config examples later. There are many guides out there that cover the fine details. As far as I can tell this is the simplest way to lay out multiple grids and lessen the load on an overtaxed gmetad host.

Monday, October 19, 2009

reverse versioning system

this is one of those ideas that is either completely stupid or genius.

so, you know how everyone versions their software off the cuff? this is 0.1, that is 2.3.4, this is 5.00048, etc. it all seems so arbitrary. package management has to attempt to deal with that in a sane way, and if you've ever tried to manage the versions of CPAN modules... just forget it. there should be a simpler way.

why not go backwards? right now versions just climb toward infinity, bumped whenever someone thinks their change is important enough to deserve a major number bump. instead you could count down all the way to 0. it all hinges on the idea that you clearly define what your software does, the goals you want to reach and the tasks you need to complete to get there. in theory, once you have completed it all you should be done with your program and it should never need another version, because it accomplishes everything you set out to do. thus you will have "counted down" to zero.

first take your goals: these will be the major numbers and are broad ideas. your tasks are the minor numbers. you can have a third field of numbers for revisions but these would have to count up to make logical sense (if they are in fact only touched for revisions). as you complete each task and goal your version goes down by that much, so completing 5 tasks would bring your minor version down by 5.
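here's a made-up countdown to illustrate, assuming the minor number tracks every task remaining across all goals and the third field counts revisions upward:

    3 goals, 14 tasks defined up front         ->  3.14.0
    finish 5 tasks (no goal closed out yet)    ->  3.9.0
    finish the 4 tasks that complete goal 1    ->  2.5.0
    ship a bugfix against that release         ->  2.5.1
    finish everything                          ->  0.0.x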

this of course would not work for applications which keep increasing their goals. in theory you could add a billion tasks to your next goal so you can still add features, but eventually all your goals and tasks will be 0, so you need to design it for a purpose and not just keep throwing in junk for each new goal. this system may also only work to provide a "stable base", where upon reaching version zero you know that the system is complete and ready for use. perhaps negative versions after that, since in theory those would only indicate new features?

package management/version control systems would all need to be modified to fit such a system, but in the end i think it would be a more sane standard than just "these numbers went up" and having to figure out for yourself what that means for that application.

Monday, October 12, 2009

why brightkite is about to die

They had it all. A nice niche in the social networking world. A [semi-]long history of service that was stable and efficient. A sizeable global user base. They had a leg up on their competitors with years of head start. And then they made the one fatal mistake of any start-up: They upgraded.

The riskiest thing you can do as a start-up (or online venture in general) is to upgrade your system. Even something as small as a "graphical revamp" can lead to droves of your users leaving for something less pretentious or easier to use. Because nobody really depends on these kinds of sites, your whole business is based on keeping your users happy and making sure the competition's site isn't more attractive.

Brightkite has been working on a revamp of their site ("BrightKite 2.0") for a while now. They finally unveiled it sometime last week. It crashed. They said they'd get some of the bugs worked out pretty quickly. The site was still down about a week later, when they finally announced it was back. Random outages over the weekend continued to plague them and users slowly filtered back in. Even today the service was still spotty. All this downtime caused a massive surge onto competing services such as Twitter, and though I don't have a definite reason why, I'd bet Twitter's downtime was in part due to over-stressing of their system by fleeing BK users (the 503 errors basically confirm their backend web servers couldn't handle what the proxies were throwing at them and were probably toppling over from load).

They're sticking with the new 2.0 site of course, and still trying to work out the bugs. The system is slow. The mechanisms which made their old site usable (such as searching for a business near you) have gone from mildly broken to nonexistent. The new features on the site read like a laundry list of nice-to-haves from other social networks that nobody needs.

We all know Twitter is coming out with their own geotagging system Real Soon Now(TM). Google Maps has GeoRSS and other sites are slowly developing their location-aware services. As soon as a viable alternative appears, the BKers will try it out, and as long as it doesn't crash for a week straight they'll probably jump ship entirely. The new site just needs to have an "Add your BrightKite friends!" option and the titanic 2.0 will be rendered below the surface.

It's really sad because they've basically fallen for all the common traps inherent to a site upgrade like this. First and foremost you need a beta site. There may have been one but I never heard of it. You need a *long* period of beta, incrementally adding new users so you can see the trends in system load and look for hidden, load-driven bugs. They should have known everything they'd need to scale in the future based on that beta site, and come out of it with zero open bug reports.

Secondly, you need a backup. If you try to launch the new site and it ends up being down for 24 hours, you *have* to go back to the original site. There's no excuse for this one. If you can't handle switching your site from the new code base to the old one you've made some major planning errors.

I don't know where they plan to go from here. The reviews of the new site were somehow positive - I guess they had a preview copy of the site or they'd never seen the old one. But they fucked up by turning away all their customers. The biggest lesson you could take away from this is how web services have to work: release early, release often. Constant dynamic development on the live site. You simply can't afford to launch a new release after a long cycle of development. It requires too much testing and one or two things missed can sink the whole ship. Test your features one at a time and make sure your code is modular yet lightweight.

From my limited experience, the biggest obstacle to this method of design is in keeping your app reasonably scalable. The same hunk of code hacked on for years will result in some pretty heinous stuff if you don't design it right and keep a sharp eye on your commits.

A side note: this is a good lesson in backing up your files. People who sent their pics to BK and deleted them from their phones may eventually want the pics back. If BK dies they may find themselves wishing they had set up a Flickr or Picasa account to catch their photos. Facebook probably has the biggest free allocation of picture storage ("Unlimited") but they also don't cater to fotogs. As for the geotagging, well... This might not be a bad time for someone to create a lightweight geotagging app or library.