How to Scale Stuff, Part 3: Use the right tool, and scaling organizations is important, too

I left Macy’s after about 3 years.

I had a lot of fun there, and learned a lot too. We converted all the stores to electronic point-of-sale, put in a magnetic ticketing system, and changed operating systems from DOS/VS to OS/VS1 (that was a big deal at the time, and a major résumé bullet).

More importantly, I learned what retail is: a lot of hard work, and a lot of skill. This is why I believe most “e-commerce” sites failed in the early days of the internet — the people who ran those sites weren’t retailers, and couldn’t, or wouldn’t, bring the years of experience brick-and-mortar retailers had to the online “shopping experience”. I also learned that retail is an insane business with razor-thin margins, one which had to operate at the speed of light even then.

Here’s an example: about 3 weeks after the season began (Winter, Spring, Summer, or Fall), the buyer had to decide, based on 3 weeks’ worth of data, whether to re-order a line. If they guessed right, the line sold, and they were a success. If they guessed wrong, they ran out, or, worse yet, ended up with a huge surplus of clothing no one wanted at the end of the season. In a world where magnetically encoded tickets were a huge technological leap, you can only imagine how little information people had in those days. You were lucky to have the first week’s worth of sales data by the third week, much less this week’s!

I also became the youngest senior executive in the history of Macy’s — all of Macy’s. That was a big deal for me, too, except I couldn’t go out and have a drink to celebrate because I wasn’t 21 yet.

In the end, though, I felt that learning to manage in an organization where “long range planning” meant one month from now might not have been a good thing (yes, I now realize that was a silly way to perceive things, and more a problem with my perceptions than anything else).

I moved on to Chevron. Yes, the Oil Company. In those days the public hatred of oil companies was high so my stock answer to “Where do you work?” was “Chevron, and no, we don’t squeeze oil out of live baby whales to make gasoline”.

Chevron did a lot of things right with regards to how they treated and trained their employees throughout the organization:

  • If you were doing well, you could count on changing jobs every 18 months, in many cases to departments or fields you knew little or nothing about. This was designed to broaden your understanding of the business, and was probably a test of sorts.
  • They had an in-house training program called “SSKP” or “Supervisor Skills and Knowledge Program”, which was a week-long retreat for young managers. During this week, you learned a lot about management from successful people from all over the organization, and, as a bonus, could invite anyone from the company to dinner and they had to accept. That meant that if your group decided to invite the Chairman of the Board of Chevron Corporation to dinner that week, they had to come, and did so happily.

I’m going to have to write a whole series of posts on How Many Mistakes I Made As a Manager, but that’s another day.

My best scaling stories at Chevron were not when I was doing programming stuff, or even “System Programming” (that’s SysAdmin to you) stuff, but when I ran some departments in what was called “Data Storage”.

Yes, we had a department of about 25 people which was called “Data Storage”, and we were in charge of, yes, you guessed it, data. This included a group which was solely responsible for backing up, taking care of, and, when necessary, recovering databases, and another group responsible for disk drives and mass storage. The first might make sense to you, but the second?

In those days, people didn’t have PCs – instead we ran mainframes. Big mainframes. And, to serve them, we had hundreds of disk drives. Hundreds. We had so many that we had to put them on the floors above and below the mainframes due to cable length restrictions. While disk space allocation and deallocation were somewhat automated, there was still considerable manual intervention required, plus the whole job of backing up and recovering data when drives invariably went bad. (There’s a whole story about an episode where a non-IBM disk drive vendor’s disks, of which we had a few score, started failing en masse when the magnetic surface started flaking off the platters. We had something like 3 drives failing a night.)

Anyway, at one point, I was put in charge of the “Database Recovery” group, whose whole job was, as I said, backing up, tending to, and recovering something like 300 IMS/VS databases. In those days the “database wars” weren’t about Oracle versus Sybase versus Informix, they were about hierarchical databases versus network databases, and IBM versus everybody else.

Just like today, these databases had transaction logs to keep track of changes made to the databases, which were then used to roll back transactions in the case of a failure, or forward recovery after restoring from a backup.
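The mechanics of log-based recovery are generic enough to sketch. This is a toy model, not IMS: the record shapes and names below are made up, but the idea is the same one we lived by: restore the last good backup, then re-apply the logged changes in order.

```python
# A minimal, generic sketch of forward recovery from a transaction log.
# Not IMS-specific; the (operation, key, value) log format is illustrative.

def forward_recover(backup, log):
    """Restore a database image from backup, then re-apply logged changes in order."""
    db = dict(backup)  # start from the last good backup
    for op, key, value in log:
        if op == "put":
            db[key] = value       # insert or update a record
        elif op == "delete":
            db.pop(key, None)     # remove a record if present
    return db

backup = {"acct1": 100}
log = [("put", "acct1", 150), ("put", "acct2", 50), ("delete", "acct1", None)]
print(forward_recover(backup, log))  # {'acct2': 50}
```

Rollback is the mirror image: walk the log backwards, undoing uncommitted changes instead of re-applying committed ones.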

And, like many organizations today, our organization was functionally “far” from the Database Administrators (DBAs), the people who designed the databases, and from the application programmers, who wrote the programs which accessed them. The result was that, despite all good intentions, there was little documentation on which databases were related to which, which was kind of a bummer when you were trying to recover from a failed disk.

And, just like today, our service organization didn’t have any programmers in it, and we were wayyyy at the bottom of the queue when it came to getting programming type stuff done.

So, they improvised. It turned out that all the compiled database definitions (DBDs) and application definitions (PSBs) lived in a file with its own special binary format. The folks in the group wrote a script which essentially trawled the file for strings and then did some huge merge operation to relate the databases to each other. Because of how it worked, it took hours and hours to run, and often didn’t finish in the window allotted to it. As a result, this little hand-wrought list was often out of date, and worse than useless if it was too out of date.

Since I was a programmer, and this was an actual problem, I decided to solve it. Instead of trawling through the file in an antique recursive-grep fashion, I wrote a program which loaded the DBDs and PSBs in an “official” way and then discerned the relationships. Imagine everyone’s surprise when a job (they were called “Jobs” then) which used to take hours now took less than a minute. Voilà: up-to-date reports.
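The difference between scraping strings out of a binary file and loading records through their intended format is worth a sketch. Everything below is hypothetical: real compiled DBDs and PSBs have their own layouts, so this invents a toy record format (a count followed by fixed-width 8-byte names) just to show the “official” parse.

```python
import struct

# Hypothetical toy format standing in for a compiled PSB: a big-endian
# 4-byte count, then that many fixed-width 8-byte database names.
def pack_psb(db_names):
    data = struct.pack(">I", len(db_names))
    for name in db_names:
        data += name.ljust(8).encode("ascii")
    return data

def load_psb(data):
    """Parse the record 'officially': read the count, then each fixed field."""
    (count,) = struct.unpack_from(">I", data, 0)
    names = []
    for i in range(count):
        raw = data[4 + i * 8 : 4 + (i + 1) * 8]
        names.append(raw.decode("ascii").strip())
    return names

blob = pack_psb(["CUSTDB", "ORDERDB"])
print(load_psb(blob))  # ['CUSTDB', 'ORDERDB']
```

A string-trawler would scan the same bytes for anything that looked like a name and hope for the best; the structured parse reads exactly the fields the format defines, which is both faster and reliably correct.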

There are a couple of morals to this story. The first one is, obviously, to know when to use the right tool for the job. There are times when brute-force methods work (see my prior post on writing silly macros to do a job I could have done in less time the brute-force way), and there are times when you need to do things the “right” way. How do you know when? Well, there’s no rule of thumb, but if you have an important tool which isn’t working because something’s being done one way, it’s probably time to find a different way to do it (in all fairness, part of the problem here was that there were no programming resources available to the group until I came along, and it wasn’t even my job).

The second story is about scaling in organizations, sort of.

One of the other things our group was peripherally responsible for was the “Development” version of the system we ran for the Applications programmers to test with (think of it as a “developer machine”, except it’s a mainframe). When I joined I found out (from the grapevine) that the Applications folks were very unhappy with their little sandbox, but, again, since we didn’t have any engineers or systems programmers, we couldn’t do anything about it.

So, I took what was apparently the unheard-of step of having a sit-down between the applications group, the systems programmers responsible for IMS, and our group. I guess it was unheard of because, up till then, the way groups dealt with being unhappy with each other was to sulk and not talk to each other. Once we got all those people into a room, we found out that their #1 issue was… performance. Up till then, the answer had always been “Well, it’s a test system, that’s why it’s slow, now please go away.”

As a group, we decided we didn’t like that answer, so we dug into it. The first thing we did was move the development system to a bigger mainframe. Chevron had mainframes coming out of its ears, so we actually took one of the spares they had laying around (for when there were failures) and moved the system there. The deal was, of course, that if they needed the spare, the test system would go down, but that hardly ever happened.

That actually didn’t do a lot of good. Which was disappointing.

Then we looked at how the test environment was set up. That was shocking. Here’s what we found: if you use Windows or OS X, somewhere in the system is something called a “path variable” which tells the OS where to look for programs to load. On modern computers, it’s OK to have a lot of entries in it, since the whole program-loading process is efficient, and, frankly, single users don’t load programs that often.

In this environment, there were something like 50 different concatenated libraries in this list, which meant that the system had to search through them for each request. Not only were there a lot of these libraries, they were constantly updated (with new versions of programs), and they were never “compressed”, which is like never, ever defragging your Windows hard drive. And, of course, no one knew what half of them were, since they were created pretty much “on demand” and never audited.

Soooo, you can all guess what we did. We got rid of the dead libraries, set up a regular compression (defrag) schedule, and did a little organizational engineering to get from 50 libraries to … 5.
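You can model why that helped with a few lines of code. This is a toy, and it assumes (as was true for us, but is an assumption here) that each program load probes the concatenated libraries in order until it finds the member:

```python
# Toy model of a concatenated program-search path: each lookup scans
# libraries in order until the requested member is found.

def lookup_cost(libraries, member):
    """Return how many libraries are probed before `member` is found."""
    for probes, lib in enumerate(libraries, start=1):
        if member in lib:
            return probes
    return len(libraries)  # searched everything and never found it

# 50 mostly-dead libraries with the real one near the end...
long_path = [set() for _ in range(49)] + [{"PAYROLL"}]
# ...versus 5 curated libraries with the same member up front.
short_path = [{"PAYROLL"}] + [set() for _ in range(4)]

print(lookup_cost(long_path, "PAYROLL"))   # 50
print(lookup_cost(short_path, "PAYROLL"))  # 1
```

Multiply that per-load difference by every program load on a busy test system and the response-time improvement stops being surprising.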

That alone was enough to make a huge improvement in response time. Orders-of-magnitude improvement. In turn, that improved the productivity of the applications group, which made them happier (we never measured their productivity, but we could experience their happiness 🙂).

The first moral to this story is that a key part of scaling something is to first meet with as many people as you can (within reason) to find out what the problems are. In most cases, this step alone will help clear the air and get the thought processes going again, since you’re apt to ask new questions, or force people to re-think what’s going on by explaining it to you. It also makes you look like a collaborative person. 🙂

The second moral is to start simple. In our case, we didn’t dig into the applications being tested, or how the test system was configured (the “IMS SYSGEN”, to be specific). We just looked at how it was started up, and asked “why” about as many things as we could. When we saw 50 libraries concatenated to each other, we asked why instead of just accepting “no one knows” as an answer. Get the stupid stuff out of the way; it will make the hard problems easier to solve.

And the last moral to this story is that scaling applies to organizations too – whether you’re a service organization with a broken tool, or an applications (engineering) group with a broken test system. Since people are the most expensive asset an organization has, scaling their tools always makes sense.

Whew. Sorry that was so long. I promise, next I’ll write about Aliens, the time I had to wear a HazMat suit, and something even worse than angry customers in line at the checkout stand.
