How to Scale Stuff, Part 2: You don’t know how bad it is, and the little things count

For some insane reason, the people at Macy’s California hired me, literally right out of High School as something called a “Jr. Systems Programmer”. In today’s terms, that was sort of like “Jr. Systems Administrator”, but turned out to be a lot different.

I actually got this job by spending a lot of time in my prior job at a local college, as a “Computer Operator”, reading manuals, which was the equivalent of “Surfing the Web” in those days. It was because of that that I could rattle off terms like VSAM, VTAM, RPL, ACB, you name it.

(BTW, in those days, the “Computer Operator” controlled access to the Computer, which meant I spent a lot of time feeding student’s punched card decks into the computer. Around the end of the semester, when people had to get their Cobol 101 programs done, it was absolutely amazing what you could get for feeding someone’s deck in sooner, rather than later. But that’s another story for another blog.)

Back to Macy’s. At that time, the computer room, and our offices, were on the top floor of the Macy’s department store in San Francisco. That meant that you had to go through some part of the store to get to work, and I sometimes just rode the escalators all the way up just to remember where I worked (OK, it was mostly because of The Miracle on 34th Street where they were OPENING THE DEPARTMENT STORE, which happened to be Macy’s).

Unlike what we’d call a Jr. System Administrator today, about a year in, one of my jobs was to help write their first online credit transaction approval system. That’s right, up till then they didn’t have an online system, and almost all credit card transactions were manually approved. For some reason, it was decided that the system was to be written in System/370 Assembler Language, as opposed to a higher level language such as Cobol (this was probably because back then, compilers did generate much less efficient code than hand-coded assembly language: things like peephole optimization and loop unrolling were still a ways off). These days, I’m pretty sure that this choice was a case of premature optimization, but it could have also been that the project manager wouldn’t “lower himself” to programming in Cobol.

Let’s review what a credit authorization application would do:

  • Get the information about the charged (Credit Card #, Amount, etc)
  • Figure out if the card number is valid (checksum)
  • See if the card is stolen (negative credit check)
  • See if the card has exceed an intra-day purchase limit
  • Log the transaction

Pretty simple, huh?

Before I go on a couple of data points:

  • These were memo transactions. This meant that while it was important that they get recorded, it wasn’t the End Of The World if they weren’t. The real money came off the actual charge slips (in that day and age). So, if some disaster happened, and you lost transactions, that just meant some person could charge more than they were supposed to that day. The distinction between “memo” transactions and “audited” transactions is super important. More on that later.
  • Then, as now, “acceptable” response time was 1 second or less. In those days, it was that so the data could get across 9600-baud SDLC line from the register to the store controller to the mainframe, and back. Granted they were short messages, but that was a pretty long path, in that day. And, since the store controller was really a computer in itself (it had CPU, a disk, and a proprietary IBM microcode code base), it was really a distributed system.
  • This application ran under something called CICS, or “Customer Information Control System” as a Pseudo Conversational transaction. CICS was essentially “Apache for Mainframes”, and pseudo conversational transactions were just as cool as today’s CGI or even Apache modules.

So we wrote a bunch of assembler code, and did a bunch of testing, and eventually rolled it out to a few stores. I don’t remember what all they were, but they included Serramonte (our biggest suburban store), and, of course, the San Francisco store, which is the Flagship store.

Oh, I forgot. One more thing. At the time, Macy’s was a actually a collection of divisions, with the mothership division being in New York, of course. In those day’s, “Bambergers”, which was Macy’s New Jersey, was the “top dog”, and really frowned on us folks in the Macy’s California Division doing all this newfangled technology stuff, especially before they did, and especially on our own. So this whole “Host Positive Credit Authorization” thing was kind of stuck in their craw, if you know what I mean.

At first, everything was OK. We solved a bunch of bugs, and response time was good. Mostly. Well, except for busy times. But in those times, it was only a little over a second, ok maybe it was more occasionally, but we all thought it was Ok.

Then Black Friday came. For those of you who don’t know, Black Friday is the day after Thanksgiving, and is always the busiest retail day of the year. Seriously. For some reason, people think Black Friday marks the beginning of the Christmas Shopping season, and they ALL come to shop.

Things went to hell. Response times less than a second were few and far between, and zillions of transactions were just plain timing out after some obscene timeout period (1.5 minutes or something). The only good news was that the store controller would fall back to “negative credit authorization” at that point, and just look at a local list of stolen cards, so that risk was minimized.

If you’re reading this blog, you may have been here more than once in your life. You deploy something, it works great, and then, on a big day, it all falls apart. Depending on where you are in the organization, you might experience this as watching your co-workers tear their hair out to getting calls from the CEO screaming “Web SIte Found But No Reply!” (they were reading me Internet Explorer Error Messages). Well, whatever happened to you, I’ll bet I’ve done you better.

It’s Black Friday. It’s the Biggest Day Of The Year for a retail store. No matter what year it is, retail stores operate on razor thin margins, so days like Black Friday were their lifeblood.

Except now, every charge transaction at the register is taking at least one and a half minutes longer, usually more as the poor clerk tried to figure things out.

Here’s where the queueing theory gurus will chime in and start explaining how this is a bad thing.

Ignore them. Here’s how it felt. Since it was technically a holiday, I wasn’t at work, but it was about 30 minutes after the store opened that I got a call telling me there “were problems” and could I please come in (in those days, we’d didn’t have pagers, much less cell phones, or even a way to access work computers from home). So I drove in, parked, and went into the store.

What I saw were lines and lines at every register of angry customers, and equally angry and frustrated clerks (the clerks are the ones who do the real work in stores 🙂 ). It wasn’t a few weeks before that I’d walked through the same store proudly beaming at the shiny new cash registers, and the shiny new authorization system (which got subsecond response time most of the time, so it was ok, right? Now, just slinked through as quietly as I could, hoping no one knew that I might be somehow even partially responsible for the problem.

Once I got up to the 8th floor (where the “datacenter” and our offices were) it actually got worse. The datacenter was actually just our own private ISP, where we had our own raised floors, our own computer (yes, that’s singular), our own disk drives, etc. And, like most ISPs, only a few people were allowed in there, and there were keycards and people stationed outside to keep unauthorized people out.

Except on holidays and weekends. Then the weekend crew would leave the doors propped open (because it was troublesome to use your card to get in and out), and there was no one to control access to the computer room itself.

So after slinking through the store, I am in the computer room talking to the operators about what’s going on, and this imposing man walks in to the computer room and starts yelling about “Who’s in charge??”, “When is this going to get fixed??”, and “Do you know who I am?”. Of course I imagine that this is some irate clerk, or maybe even department manager, and keep saying “Sir, you’re not allowed in here, let’s go outside and discuss it.”. Eventually I got him outside (boy was he MAD) and explained we were doing our best to fix it.

Turns out the guy was the Manager of Stores for Macy’s California. All the stores. And here I was, not even 21, shooing him out of the computer room like an errant tourist. The following week, my bosses bosses bosses (actually, the Controller, which is a financial function) spent significant time bragging about how some kid had stuck to the rules and tossed the Manager Of Stores out of his(the Controller’s) computer room. I guess there was some organizational friction there :-).

Anyway, that’s how bad it was. Let’s get back to the good stuff.

We didn’t have a lot of performance tools in those days (but it was a growing market — tools to monitor, evaluate, and make recommendations about performance would become big businesses, on their own over time), but a little work told us that it was the last part of the transaction – the logging – which was holding things up. Up to that point, things were flying, so it was a real shame.

The log file (or, more accurately, the log cluster was a VSAM KSDS, or keyed file (or database, what have you). Hmm. Keyed? For a log file? Why? Well I, who you remember was very young at the time, looked at the log file and saw a “timestamp” field in there, and thought “You know, one day someone might want to query that file across time ranges. Let’s make the timestamp a key.”.

Keep in mind that there was never any sort of requirement for this in the application specification, and, in retrospect, and application to look at these records by time would have been freaking useless, if not a violation of privacy (not that that was even a word in those days 🙂 ). But, I, in my far-less-than-infinite wisdom, decided to put it in, not know the impact of adding a key to a frequently-written-to file, much less a key like a timestamp.

Most of you know this, but the fact that this was a keyed (or, in today’s vernacular, “indexed”) file meant that for every recorded added, we had to index the key, write index records, and, depending on things like key collisions, perhaps do other things to the file’s index itself. And, oh, the file was probably locked for a significant part of this time, which meant that transactions has to wait in line to finish. If you’re new to this, that’s a big deal, since those lines of transactions waiting to complete mean additional processing, possibly additional locks on other resources, etc, etc.

The first thing we did was change that file from “Keyed” to “Sequential”, which meant that every transaction wasn’t burdened with indexing the log record it just added.

That was actually OK. It got response time down to “Well, most over them were still over a second, but lots less were timing out”. Of course, it was also NOT Black Friday any more, so it could have also been a function of fewer (angry) customers. The overall situation had changed from “mostly sucks” to “Still sucky, but sometimes OK”.

On Monday, two things happened. My boss didn’t come to work, because he had to fly to New Jersey to meet with the head of DP (Data Processing) (that was what you all do now) at Macy’s Corporate (Bamberger’s) about the “Status of our POS Project”. That would be Point Of Sale, not Piece of S**t, thank you. This meeting was scheduled looonng before Black Friday, but it was certain to be a topic of discussion.

The other thing that happened was that I (in my far less than infinite wisdom) decided that logging was still a bottleneck, and we should just change the situation completely by “queuing” log records in main memory, and let a background task actually write them out. Of course, if the machine (well, mainframe) or CICS (well, “Apache”) crashed, we’d lose any queued log records, but remember this was a memo system, so it wasn’t the end of the world. Certainly, the performance issues were the End Of The World, so the tradeoff was pretty obvious.

Here’s the part of this post where I really don’t have any good advice. After he arrived in New Jersey, my boss called me and I described my theory, and solution. His response was that that “couldn’t be the problem, and that wasn’t the solution”, and I should look for something else. I listened to what he said, and decided that I really did know what was what, and spent the night implementing and testing it.

(Don’t be too impressed here, CICS has built in facilities to handle queueing, dequeuing, queue locking/latching, and background tasks. All I had to to was write the proper application logic to use these facilities. Ok, it was still in assembler, so you can be a little impressed).

The next day the store opens, and things are schweeet. Response time is actually better than it was before Black Friday, in fact it was in the .3 – .5 second range. I actually couldn’t believe it myself, and kept peeking at the log file to see if it really did have all the data is was supposed to. The other cool thing is that there were hardly even more than 1 or 2 log records queued up to be written.

Around noon, I got a call from New Jersey, where my boss was sitting in the office of the Bamberger’s guy, and presumably on the hot seat about “Black Friday”. He said “I just wanted to check on how things were”. I said “Well sir, response time is 100% below 1 second, and the vast majority are less then .5 seconds.”. There was a moment of silence, and then he responded “So yes, response time is less than one second. That’s excellent. Ok, thank you”. (Ok yes, there was some performance art in there for the Big Boss, we all know that.).

Did I do the right thing by defying my boss? In my mind, yes. I’d been in all weekend trying to fix the situation (which he hadn’t) so I felt like I knew what was what. Also, at the time, I really didn’t have any other solutions, or even vague ideas what the problem could be, so there really weren’t a lot of other choices at the time. So, in my mind, if you wanted something to do with someone young, eager, and able to work for days without sleep, this was the right thing to do. I knew it couldn’t make anything worse, and it would eliminate a whole class of problems from the picture. Bottom line is when all else failed, I fixed the problems I understood at the time.

And, that night when I left work, it was OK to walk through the department store again.

Wow, long post. Let’s sum up:

  • Think very hard about adding stuff this isn’t called for in the specification. Even good intentions can go bad.
  • Keyed, or indexed, data costs. It doesn’t matter if it’s old style keyed files or indices on columns in SQL databases, it still costs. Not only does it cost indexing time, it can also lead to locking and queuing issues.
  • Make sure you understand the requirements for memo data and audited data. You should never disregard the requirement for data integrity, but in a pinch, this distinction may make a difference. Note: I seriously wonder if this applies any more. Data systems (Filesystems, databases, noSQL systems, etc, have evolved pretty far).
  • Response time is never just some numbers on a page, or on a screen. It’s real people, waiting. “Mostly Ok” response time is 100% not OK for the poor sucker that gets the slow transaction(s).
  • When all else fails, solve the problems you understand and get them out of the picture.

Thanks for reading.

P.S. This incident gave me a phobia called “Fear of Indexes”. It would come back to bite me later.

  1. #1 by Brad on 2010.04.10 - 8:16 pm

    This was extremely interesting! Hearing about how the older computing systems worked and tieing it into issues that are still relevant in the tech world today (indexing keys, slow load times, etc) is always nice to hear about.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: