A Programming Sutra
This is the way I heard it. (That’s the way all sutras start.) Long, long ago — about 1997, and I’m not naming names to protect the innocent and because I figure the statute of limitations is up for the guilty, and the company I’m going to talk about has been through bankruptcy and several acquisitions so it’s not the same company anyway — a major toy retailer, ToysForKids (TFK), with stores in malls all over America heard about this nifty new thing called “the web.” As I heard the story, two programmers in IT had the idea that TFK should be selling toys on the internet. They got permission to do a sort of side project, semi-bootleg, to build a demonstration e-commerce web site, ToysForKids.com. (By the way, that domain name is now owned by a domain-squatter in Hong Kong called “iGenesis Limited,” but then ToysForKids never existed anyway.)
They built the web site on a desktop server using a scripting language called Tcl, and demonstrated it. It looked so good they got permission to take it live, and they happily started making dozens of sales a day with it. It really was a lovely site, too; it won lots of awards.
The CIO was so pleased that he arranged a demo for the CEO. The CEO was so pleased that he arranged a big advertising buy for Thanksgiving Day during the football game — as I recall, $50 million — so that everyone would know about the new ToysForKids.com.
Everyone did. And everyone’s mom, wife, and girlfriend who had a computer went and tried to start their Christmas shopping sometime in the first quarter.
Now, remember, this was 15 years ago. The desktop server they were using wouldn’t make a good iPad now, and the Internet connection, while good for the time, had less capacity than Comcast promises me today.
And everyone who was bored with football and had computer access was trying to use it. The site pretty much melted down; it wasn’t long before the programmers had found different jobs, the CIO wanted to spend more time with his family, and the CEO, um, retired.
Now, this little sutra may remind you of something, specifically the debacle of healthcare.gov trying to go live (that’s a HuffPo link, just in case you don’t want to go there), which I wrote about in this column last week.
There has been a lot of discussion of the various flaws, much of it good, a few pieces hilariously poorly informed. But as a consultant I used to go to companies building new web systems all the time — roughly 200 of them — and I think I see one other thing here that no one has mentioned.
Basically, there are several problems that people seem to agree are a big part of this:
- The web site itself is buggy — things that should be working perfectly, like the login pages, don’t. (See Joshua Sharf’s piece today.)
- Each page loads a large number of individual files — nearly 100.
- Sources tell me that the authentication — the secure part that handles login names and accounts — is being handled by a single database.
- A number of the pages include very large graphics; they’re pretty, but some of them are megabytes in size.
The result is, as a number of people have commented, a lot like a “distributed denial of service” attack on their own site, and it begins to explain why a load that was in fact not particularly unusual or unpredictable brought the system to a weeping, shuddering, gasping halt. Last week I was looking at the effort of loading 26,000 web pages (comparing it to Connecticut), but with this implementation the actual effort would have been more comparable to 100 times that — 2.6 million pages. If each time someone visited they got that big graphic, we’re talking about 13 gigabytes of data.
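To make that arithmetic concrete, here is a back-of-the-envelope sketch that simply reproduces the figures above; the half-megabyte average graphics payload per visit is an assumption chosen to match the 13-gigabyte estimate, not a measured number from the site.

```python
# Back-of-the-envelope load estimate using the figures from this column.
# The graphics payload per visit is an assumption picked to match the ~13 GB figure.
page_views = 26_000           # last week's estimate of page loads
files_per_page = 100          # "nearly 100" separate files per page
graphics_mb_per_visit = 0.5   # assumed average graphics data sent per visit

total_requests = page_views * files_per_page
total_gb = page_views * graphics_mb_per_visit / 1000

print(f"{total_requests:,} individual file requests")   # 2,600,000
print(f"about {total_gb:.0f} GB of graphics served")    # about 13 GB
```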
Now I begin to believe load was a problem. Probably the real limit is the number of connections required to send all those separate files. Each time a web page makes a connection, there is a handshake to establish communications, and — at least in this example — a fetch of the image from disk. Those connections can take a long time. (Watch the bottom bar of your browser sometime as you load PJ … you can see a lot of the same thing going on; the ad servers in particular sometimes seem to be taking their own sweet time.)
In effect, it’s like a circus clown car: this one little car arrives — but one clown after another keeps getting out.
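To put rough numbers on the clown-car effect, here is a small sketch of why connection setup dominates when a page is split into nearly 100 tiny files. The setup time, transfer time, and browser connection limit are illustrative assumptions, not measurements of the actual site.

```python
# Rough model: every file pays a connection-setup cost before any data moves.
# All timing numbers below are illustrative assumptions, not measured values.
files_per_page = 100   # "nearly 100" separate files per page
setup_ms = 100         # assumed handshake plus disk-fetch overhead per connection
transfer_ms = 20       # assumed transfer time for one small file
parallel = 6           # rough browser limit on simultaneous connections per host

# Files are fetched in batches of `parallel`; each batch pays the setup cost again.
batches = (files_per_page + parallel - 1) // parallel
split_ms = batches * (setup_ms + transfer_ms)

# The same content bundled into a handful of combined files pays setup far less often.
bundles = 5
content_per_bundle = files_per_page // bundles            # ~20 files' worth of data each
bundled_ms = setup_ms + content_per_bundle * transfer_ms  # all bundles fit in one batch

print(f"~{split_ms / 1000:.1f} s per page view with {files_per_page} separate files")
print(f"~{bundled_ms / 1000:.1f} s per page view with {bundles} bundled files")
```

It is a crude model, but it shows why combining those files into a few bundles changes the picture so dramatically.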
All of these are problems that we run into every day, and all of them are easily solved. But they’re also the problems that you try to solve after you get the site running.
Now, another source — I turn out to know people involved — told me that in September the word inside the development team was “They can’t be serious about going live with this!” Of course, with the delays and waivers, it clearly wasn’t politically feasible to delay.
It just as clearly wasn’t ready for prime time, and to me, it looks like they pushed out what was meant to be an initial prototype for alpha testing.
So here’s a prediction. When the final story comes out — almost certainly not until after the end of the Obama Administration — what we’ll find out is this:
- This was not ready for prime time, and everyone on the technical side knew it. It was political pressure that led to it being rolled out. (And remember the Challenger disaster if you think they wouldn’t have responded to pressure.)
- What we’re seeing is the development code, probably pushed out to the web site at about 11:02 PM Eastern Time on September 30.
Most importantly, I will bet cash money that we will eventually find out the government was demanding major changes — like waivers and coverage changes — up to within a couple months of the rollout.