By now, it’s hard to decide if the launch failure of the Obamacare exchange websites isn’t funny anymore, or just keeps getting funnier.
Sites went down — including the individual state sites for states that are running their own exchanges. When people weren’t getting “due to an extraordinarily high volume of calls” errors, they were getting 404 Not Found messages, and pages were finding new and creative ways of erroring out. Even Wednesday afternoon, I was getting server errors just trying to finish the account creation process on the California site.
Almost as quickly as the train wreck unfolded, the explanations for it evolved. First, President Obama and then Press Secretary Jay Carney claimed with straight faces that the failures were a result of the massive interest in the exchanges. Then, others claimed that these were normal rollout errors that occur with all large, complex systems. Finally, as the engineers rolled the platform back to the hangar for retooling, there was no hiding the fact that this was indeed a software failure, not just a set of normal launch “glitches” (to use the press’s word du jour).
The exchanges’ bad day brought to mind a number of other high-profile website failures, including the Romney campaign’s spectacular white elephant of a killer whale, Orca.
I’ve been in web development for most of my professional career. I’ve participated in successful launches, and launches that needed to be rolled back and fixed. I’ve spent very long days dealing with one error after another, and equally long, uneventful days waiting for the deluge that mercifully never came.
It’s always easy to criticize someone else’s failures, and with my luck, tomorrow the QA guys will rain down trouble tickets on my head like nobody’s business. Nevertheless, it remains inescapably true that while there were reasons this happened, they weren’t good reasons, and the failures could have been avoided. Given three years and hundreds of millions of dollars for development, they should have been.
Here’s why, and how.
How Web Systems Work
First, a very simplified description of how large, commercial websites are put together nowadays. They basically have three layers of servers – 1) the web layer, which talks to you, the user; 2) the database layer, where the data is stored; and 3) middle-tier layers, which figure out what questions they need to ask the database, and what they need to tell the database, in order for the front-end that you see to work properly.
Each layer consists of many servers. You may be talking to Web Server 1 for a little bit, and then switch over to talk to Web Server 2. And Web Server 1 may send your first request to Middle Tier 1, and your next request to Middle Tier 5. This lets them answer many more questions at once, and talk to many users at once. It’s how Google is able to get results back to literally millions of simultaneous requests almost instantaneously.
These layers have traffic cops (called “load balancers”) to make sure that no one computer is trying to handle too many questions at once. Other traffic managers keep track of who you are and where you are on the site, so you don’t have to keep starting over.
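For the programmers in the audience, here is a toy Python version of such a traffic cop. It is only a sketch: the class name, the server names, and the page names are made up for illustration, and a real system would do this in dedicated hardware or software rather than in a few lines of script. It hands each request to the next server in rotation, and keeps your place on the site in one shared spot so it doesn’t matter which server answers.

```python
import itertools


class LoadBalancer:
    """Toy traffic cop: spreads requests across servers in round-robin order."""

    def __init__(self, servers):
        # Endless rotation over the pool of servers.
        self._cycle = itertools.cycle(servers)
        # Session id -> the page the user is on, shared by every server.
        self.sessions = {}

    def route(self, session_id, page):
        server = next(self._cycle)        # pick the next server in rotation
        self.sessions[session_id] = page  # remember where this user is
        return server


# Hypothetical server and page names, purely for illustration.
balancer = LoadBalancer(["web1", "web2", "web3"])
first = balancer.route("alice", "create-account")
second = balancer.route("alice", "choose-plan")
```

The same user lands on two different servers, but because her progress lives outside any one server, she never has to start over.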
There are even multiple databases. Data that change a lot (this is called “volatile”), like information about you, or your orders, or billing information, may only be stored once (and backed up regularly). But information that doesn’t change very often, like plan pricing and terms, may be stored in more than one database, to make it faster and easier to get to.
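A rough Python sketch of that split, assuming one primary store for volatile data and several read-only copies for the slow-changing data. The class, the method names, and the sample values are all invented for illustration, and the dictionaries stand in for real database servers.

```python
import random


class ReplicatedStore:
    """Toy model: volatile data lives once, reference data is copied everywhere."""

    def __init__(self, replica_count):
        self.primary = {}                                   # single home for volatile data
        self.replicas = [{} for _ in range(replica_count)]  # copies of reference data

    def write_volatile(self, key, value):
        # Orders, billing, and account details are stored in one place only.
        self.primary[key] = value

    def read_volatile(self, key):
        return self.primary[key]

    def publish_reference(self, key, value):
        # Rarely changing data is pushed to every copy.
        for replica in self.replicas:
            replica[key] = value

    def read_reference(self, key):
        # Any copy can answer, so reads are spread across all of them.
        return random.choice(self.replicas)[key]


# Made-up sample data, not real plan pricing.
store = ReplicatedStore(replica_count=3)
store.publish_reference("plan-a-monthly-premium", 199)
store.write_volatile("order-1", {"user": "alice", "plan": "plan-a"})
```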
Web systems have used this basic architecture for over a decade now, and launching large, complex sites is now less art and more science.
What Can Go Wrong
Of course, no technology is foolproof, and large, complex websites do fail.
First, users are unpredictable. There’s a saying that you can make something foolproof, but you can’t make it damn-foolproof. People are ingenious in the ways they will misuse something that you put in front of them, and programmers are always complaining about users “doing it wrong.” Of course, it’s not the users who are “doing it wrong,” it’s the programmers who didn’t anticipate their doing it that way.
Second, servers will fail, network connections will fail, routers will fail. Sometimes this just happens, and there’s not much you can do about it, except hope that whatever’s left can handle the load, while you work to get the servers back up.
Sometimes, the load really is too large for the servers’ performance limits and number of servers. Web servers can only handle so many questions per second; the same is true for middle-tier and database servers. This is what happened to the Colorado Rockies in 2007, when seemingly all of Colorado tried to buy World Series tickets at once. The traffic jam brought the website to its knees, and people had to wait a day for the engineers to rework it so that wouldn’t happen again.
And sometimes, programmers just mess up. The database isn’t designed right, and it either loses information or takes too long to answer questions. The middle tier doesn’t ask the database the right questions, or fails to store what the customer needs stored. The web server can ask for information that isn’t there, not keep track of where you are in the site, show you stuff you didn’t ask for, or let you choose things that don’t make sense in combination.
And the layers can send the wrong information to each other, or misread the information that gets sent to them by other layers.
How You Keep Things From Going Wrong
Of course, programmers are responsible for testing their own code as far as possible. But programmers are usually the worst people to test their own code. They know where all the bodies are buried, and only the most disciplined are likely to test things they know are likely to break. After all, they’ve fixed it before, and are heartily sick of making sure that the date field doesn’t bomb when someone enters 11//1994, instead of 1/1/1994.
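The date-field case is the kind of thing that takes only a few lines to guard against. Here is a minimal Python sketch, assuming a month/day/year format; the function name is my own, and a real form would also want a friendly error message rather than a bare None.

```python
from datetime import datetime


def parse_date(text):
    """Return a date for input like 1/1/1994, or None if it is malformed."""
    try:
        # strptime raises ValueError on anything that is not month/day/year.
        return datetime.strptime(text.strip(), "%m/%d/%Y").date()
    except ValueError:
        return None


good = parse_date("1/1/1994")
bad = parse_date("11//1994")
```

The malformed entry comes back as None instead of bombing, and the page can ask the user to try again.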
There are QA testers, who make sure that things work as advertised. They’re given a list of expected behaviors, and run through the site, making sure that it does the things the programmers say it will do. More importantly, they run through the site, deliberately making mistakes, to be sure that the site doesn’t break.
There’s beta testing, which basically is a larger group of people who aren’t given any specific instruction. They’re the ones most likely to imitate actual users, since ideally, they have no preconceptions of how the site is supposed to behave, and where it might break.
There’s load testing, which simulates a huge number of hits, all at once, to make sure that the servers don’t buckle and fold like a cheap suit when everyone tries to buy that cool toy all at the same time.
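In miniature, a load test looks something like the Python sketch below. It is an illustration only: real load tests fire requests over the network at real servers, usually with purpose-built tools, while this one calls a stand-in function from many threads at once. The handler, which fails on every tenth request, is a made-up stand-in for a site under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_test(handler, total_requests, concurrency):
    """Call handler many times in parallel; report failures and worst latency."""

    def one_call(request_number):
        start = time.perf_counter()
        try:
            handler(request_number)
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total_requests)))

    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "slowest_seconds": max(elapsed for _, elapsed in results),
    }


def flaky_handler(request_number):
    # Hypothetical server stand-in: every tenth request blows up.
    if request_number % 10 == 0:
        raise RuntimeError("simulated server error")


report = load_test(flaky_handler, total_requests=100, concurrency=20)
```

The point is to learn the failure count and the worst-case response time before launch day, not on it.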
What Went Wrong
From the evidence, it’s clear that the Obamacare exchange servers saw errors of all different kinds. They weren’t prepared for the load, even though this was never very heavy. California reported about 600,000 unique visitors, and Colorado reported about 55,000 unique visitors.
There were screen captures of database errors, not because the data was bad, but because the structure that holds the data was misdesigned.
There were 404 errors, which are totally design errors, meaning that the web server was trying to get to a page that didn’t exist. (This led to the best hashtag of the day, #404care.)
There were nondescript server errors like the one I got from the California server.
There were user-interface errors. At about 10:00 AM, Colorado suspended new accounts on its site (it’s one of the ones using its own site, not the main exchange site), and didn’t get around to allowing new accounts again until 3:00 PM. At that point, the “New Account” button sent you to the login page for existing accounts. If you chose to enter your childhood phone number for a secret question, it wouldn’t take it, no matter what format (certainly not the format it used when asking for your current phone number).
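Accepting a phone number in whatever format the user types is not hard. A minimal Python sketch, assuming ten-digit US numbers (the function name and the sample numbers are mine, not anything from the Colorado site):

```python
import re


def normalize_us_phone(raw):
    """Return the ten digits of a US phone number, or None if it is not one."""
    digits = re.sub(r"\D", "", raw)  # throw away everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading country code
    return digits if len(digits) == 10 else None


# All of these made-up inputs describe the same number.
samples = ["(303) 555-0100", "303.555.0100", "1-303-555-0100"]
normalized = [normalize_us_phone(s) for s in samples]
```

Strip the punctuation first, validate second, and the user never has to guess which format the form wants.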
This is why I say it was clear that this wasn’t just one of those things. The volume of inquiries wasn’t high by large-system standards, and the rest of the errors were in the control of the programmers.
These were design and execution errors, pure and simple. They were all catchable, with proper beta and load testing.
What Could Have Been Done
Test. Test. Test.
If you’re going to have a big, splashy rollout of a controversial government service that half the country is rooting against anyway, you need to test it until it’s bulletproof.
Because failures are often ambiguous from the user side, it’s hard to tell exactly where a lot of these errors originated. It’s certainly true that the data (involving as it does multiple insurance companies, with multiple plans, at different prices based on location and number of people covered) is incredibly complicated, and that some states didn’t have final price and deductible information available.
As a programmer, I can tell you with certainty that simply logging into a system shouldn’t produce an error.
And with three years and tens of millions per site at the ready, this was inexcusable.
It didn’t have to be that way. Instead of announcing October 1 as the date that Obamacare would save the world, they could have had a series of smaller rollouts, opening up various portions of the registration process at, say, monthly intervals.
In effect, ask the public to act as your beta testers. They would have lost some of the sizzle in return for a robust system that wasn’t freighted with unrealistic expectations, but right now, I think that’s a trade they would happily have made.
It’s true that it’s hard to get a real feeling for how much of the problem was data-driven, since many times we couldn’t get far enough into the site to find out. But again, it could have been rolled out in pieces, letting people browse before the law said they could buy.
All of the code would still have needed merciless QA testing and beta testing, but each section would have been solid before the next one was rolled out, and where that wasn’t possible, the potential weaknesses would have been known beforehand, making it easier to locate the launch-day failures that remained.
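One common way to stage a rollout like this is a simple schedule of feature switches. The Python sketch below is hypothetical from top to bottom: the feature names and the earlier dates are invented to show the monthly cadence, and only the October 1 enrollment date comes from the actual launch.

```python
from datetime import date

# Hypothetical schedule: each piece of the site opens a month after the last.
ROLLOUT = {
    "browse_plans": date(2013, 7, 1),
    "create_account": date(2013, 8, 1),
    "compare_prices": date(2013, 9, 1),
    "enroll": date(2013, 10, 1),  # the one real date: the legal start of enrollment
}


def open_features(today):
    """List the parts of the site that are switched on as of the given day."""
    return [name for name, opens in ROLLOUT.items() if today >= opens]


mid_august = open_features(date(2013, 8, 15))
```

Each section gets a month of real traffic before the next one goes live, so by the time enrollment opens, everything underneath it has already been beaten on by the public.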
In the cases cited earlier, the damage was either limited, or over. People’s irritation at not being able to score World Series tickets was tempered somewhat by the fact that they were seeing their team in the World Series at all. The Romney campaign had one day to make Orca work. Once it didn’t, it was game over, and there was no payoff at all for getting it working Wednesday.
Obamacare exchanges are different. Not only are they supposed to be the tool by which tens of millions of Americans will — forever — select their health insurance, they’re a precursor to the systems that will store actual medical information for patients, insurers, hospitals, doctors, regulators.
In the end, the only good thing about these websites is that nobody’s actual health depended on their working.