An Interview with Chris Hendrickson

Chris Hendrickson

CH = Chris Hendrickson
GAM = George Michael

GAM: Today is the 26th of July, 2004 and we have the good fortune to talk with Chris Hendrickson, a person who has moved easily between worker and manager positions twice. So, I'll ask Chris to start by telling us where he went to school and how he got to the Lab.

CH: OK, well I went to school at Fordham University in New York City (Bronx) and got a Bachelors Degree in Physics in 1960. I went into the Army immediately after that. I was in an Army unit for two years, after which they sent me to Purdue for a two-year program where I got a Masters Degree in Engineering Science, which was kind of Mathematical Engineering. The program doesn't exist anymore, because it's been subsumed into the different engineering fields but, back then, it was very new and very good. I really liked it. One of the guys, Gerald Richards, who was at Purdue the year before I was, had applied for work at the Lab as a Military Research Associate (MRA) and got accepted, and everything I heard back from him said that it was really great. So, I applied. Interestingly, my employment interview was conducted at Chicago's O'Hare Airport by a Lab physicist, Charles Violet. I was accepted to work as an MRA and started working there in June, 1964.

I was assigned to A Division, where I spent two and one-half years working on Monte Carlo physics running on the 7094 computers. I actually had no programming experience whatsoever but, with the help of another programmer, John Kimlinger, I taught myself FORTRAN, as well as FAP (FORTRAN Assembly Program). I found that FAP got me close to the IBM 7094 and helped me understand machine architecture to some extent. I also taught myself to program the DD80 graphics device. I happily used the physics and mathematical knowledge I had acquired at Purdue for two and one-half years before the Army terminated my MRA status and posted me to Korea.

I spent thirteen months in Korea. When I returned, I resigned my commission and then applied at the Laboratory to see if I could get on as a full time employee. After a little bit of hassle about my starting salary, I was accepted and went back to A Division. They had just lost both the physicists and a programmer on their most highly-used physics code. I started as the brand new physicist on this code and Bob Gulliford was assigned as the new programmer. So, the two of us, both new, were given responsibility for the code which, at that time, consisted of 250,000 IBM punched cards. I should mention there were no comments, only flow diagrams on how the code was put together. And that was what we were expected to maintain and develop.

After working on the code for a couple of years, I found that my interests were less along the lines of Physics and more those of Computer Science (although I still enjoyed "practicing" Physics). Thus, I spent quite a bit of my time not only developing the code, but working with the computer science aspects of it with the other folks in the group, like Jan Moura and Norb Ludke, and, to some extent, Bob Cralle. But my real love was programming (we had yet to call it Computer Science).

We started getting more and more competent machines like the CDC 6600, and it brought a tremendous increase in computer capability. I learned that machine inside and out. I was fascinated by machine architecture. We were also, at that time, beginning to develop our own operating systems (Livermore Time Sharing System, etc.) for these machines. For years and years they were, I think, certainly state-of-the-art. They were also very flaky and, as users, we were very frustrated working on them. One of the good things I did at that time, at least in terms of the programming I enjoyed, was precipitated by my blowing up at the unreliability of the LTSS operating system and sending Sid Fernbach a nasty memo, which I broadcast to everybody, about the terrible state of the computers. The memo was titled "The computers are like the weather...", and contained a day-by-day account of my experiences with LTSS. After Sid calmed down, he asked me if I would like to be part of the solution. After checking with my management, I sat down with Pete (Pierre) DuBois, who was the system architect with a reputation of having little patience with users. But, once he realized I knew what I was talking about, he turned out to be a pretty good guy to work with. I developed a piece of code, a small program called SYSCHECK, which would run as a probe of the operating system. It was very small, only about 10K, and it would sit and take snapshots of what the operating system was doing at the time. I believe this was our first interactive system diagnostic tool. We were able to watch as system tables were filled and used, and that gave us a good handle on what was happening in the machine. We also wrote a file containing each system snapshot and, if the machine hung (which seemed to happen every 20-30 minutes during the day's interactive period), DuBois would look at this output and he'd look at something else and he'd say, "Oh, wow, OK, I know why that's happening." And we ran this thing for probably three or four months and stability of the operating systems was tremendously improved. And, I think, that was my contribution to helping LTSS get going.

GAM: Yes, that's great.

CH: One of the other things I was interested in, always had been interested in, was computer architecture. And when George Michael asked me to start looking at Data Flow architectures, they turned out to be very interesting concepts to think about. And so, I was involved with that for four or five years. In the process of being involved with that, George suggested it would be nice if we wrote a more realistic test code for people in academia to use. It would have interesting physics in it and would allow for large problem sets to run with it. Up to that point, academics were limited to calculating Fibonacci numbers and the like. So, with Pat Crowley and Tim Rudy, we developed a code called SIMPLE, which was about the simplest thing we could think of that might have a chance of modeling a little bit of our workload. We put it together in a package, with documentation of the physics, and made it available to the academic community. It was used quite successfully for several years. They used that code to refine some of the approaches they had toward dataflow architectures. Those were exciting times!

GAM: Yes, I think it was great.

CH: It was really good, and we got lots of feedback from the academics about how helpful SIMPLE had been.

GAM: That was the idea actually.

CH: One of the things that was funny was that I had used the equation of state for an ideal gas, which we expressed using a bi-quadratic interpolation scheme. It turns out that all but one of the constants in a bi-quadratic interpolation were zero. These equations of state tended to be computationally expensive for our codes and one of the guys in the University of Manchester dataflow group noticed that the equation of state calculation didn't cost anything to compute. What had happened, of course, was the compiler had gotten smart enough to recognize that all these constants were zero, so it didn't bother including them in the computation. It was just returning the single constant and there were no real computational costs. So we learned some things as well, about how to write codes for academics to use. But that was great fun.
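As an illustration of the effect Chris describes, here is a minimal Python sketch. The interview does not give SIMPLE's actual interpolation form or coefficients, so the nine-term polynomial, the names `biquadratic`, `IDEAL_GAS_COEFFS`, and `pressure_folded`, and the value of gamma are all assumptions made for this example. For an ideal gas, p = (gamma - 1) * rho * e, so only one cross term of a bi-quadratic fit in density and energy is nonzero, and an optimizing compiler that knows the table at compile time can collapse the whole sum.

```python
GAMMA = 5.0 / 3.0  # assumed ratio of specific heats (monatomic ideal gas)

# Hypothetical coefficient table c[i][j] for p = sum c[i][j] * rho**i * e**j.
# Only the rho * e term is nonzero for an ideal gas.
IDEAL_GAS_COEFFS = [
    [0.0, 0.0, 0.0],
    [0.0, GAMMA - 1.0, 0.0],
    [0.0, 0.0, 0.0],
]


def biquadratic(coeffs, rho, e):
    """General bi-quadratic evaluation: nine multiply-add terms."""
    total = 0.0
    for i in range(3):
        for j in range(3):
            total += coeffs[i][j] * rho**i * e**j
    return total


def pressure_folded(rho, e):
    """What is left once the zero terms are dropped: one coefficient."""
    return (GAMMA - 1.0) * rho * e


print(biquadratic(IDEAL_GAS_COEFFS, 2.0, 3.0))
print(pressure_folded(2.0, 3.0))
```

Both functions return the same pressure, but the second does almost no work, which is why a benchmark built on such a table can understate the cost of a realistic equation of state.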

GAM: We went to Europe somewhere around that era.

CH: Yes, two or three times, yes. We went to a conference in...

GAM: Newcastle.

CH: Yes. Phil Treleven put together a conference in Newcastle and, gosh, we got to meet many of the greats of English computing, including Maurice Wilkes.

GAM: Yes, it was a marvelous experience.

CH: Ah, well, I insulted Maurice. I basically accused him of not knowing what he was talking about when he was presenting a talk on the Cambridge Ring. I kept telling him the speeds he was talking about were much too slow to do anything real. It turns out I was right, but he was kind of irritated that some brash thirty-something American would be so rude. I could tell he was upset but he kept his cool: very good and very British. It was a lot of fun meeting all those folks. I was very arrogant about my skills at that point. Returning from that series of trips (there was at least one other we did as well), I realized that I really didn't know very much, especially after talking to all those academics and listening to all the stuff they knew. There was a huge amount of theory they were developing and, while I was very good in programming practice, I realized I was way short in theory. That's when I decided to sign up at the Department of Applied Science at UC Davis and try to codify what computer science I knew. That's what drove me to get my second Masters Degree, which I got in 1981. So, it was a really good thing for me. When I finished, I found I could talk to academics because I now knew what they were talking about, as I had the language and the jargon for it. Now I could be more helpful, and they could be more helpful, because we could talk more completely to each other. So it was a good experience for me.

GAM: Well, I remember when we were at Manchester, we both gave talks to their community about what we were doing and they liked your talk quite a bit. I mean, it opened their eyes about a whole new era, or a whole new branch, of computing that they hadn't thought about before.

CH: Yes, I think it did. They hadn't had anybody get up and talk to them honestly about how we used computers and why anyone would want more than one computer. We had, at that time, I think, four 7600s and that was more computing power than the British really had ever thought about because it was expensive and they didn't have the money.

GAM: Ohhhh, it was good duty, and learning all about "Boddingtons" was also good.

CH: Well, yes, and we made some good friends over there too, which was really nice. The Gurds, John and his wife, as you know, are very, very nice.

Earlier on, in the late 60s, one of the interesting things I was involved in was LLNL's first attempt at vector computing. Sid Fernbach decided that the next stage in the evolution of computing hardware was vector computing. He was very interested in working with Control Data Corporation (CDC) because we had their machines, the CDC 6600 and the CDC 7600. He was interested in their proposal for a vector machine called the STAR-100. The machine was very complicated, with lots of complicated microcode, and took several years to actually become real. The Lab eventually bought two of these machines, and migrated LTSS to them. Eventually, CDC fixed a lot of the STAR's problems. They went on to market the Cyber 205 as the CDC vector product, but it was not a real commercial success. We never bought a Cyber 205, and Sid's insistence that we should, probably precipitated his fall from grace. The STAR-100 was very expensive and very hard to use, mostly because of the vector startup time: the core memory was slow, so the startup time on vectors was really long. You had to write code very carefully, which made conversion of codes to the new machine very time-consuming and, therefore, expensive. But I actually proved then that I could take a code that was running very fast on the 7600 (about three megaflops) and move it to the STAR, and it would run on the STAR at about ten megaflops, which was great and wonderful then. Nowadays, the SIMPLE code runs on my Macintosh at home at 350 megaflops without any work. The world has changed a lot in thirty years! But the STAR-100 was a really good experience for the people who went through it, who were mostly the B Division folks. When they moved their codes over they saw tremendous increases for two reasons: first, because the machine could really run fast if programmed correctly and, second, because they took all their dirty FORTRAN and cleaned it up using STACKLIB, which Frank McMahon developed for the STAR [1]. Cleaning up the codes generated factors of 2-3 in efficiency even without moving to a new machine. Then, when they ran the code on the STAR, they got another factor of three. My code was originally written in assembler for the 6600/7600, so I didn't get any speed-up by going to STACKLIB. In fact, the code slowed down by a factor of 4! But, having STACKLIB as a STAR emulator allowed us to debug in a robust environment, the 7600, which made porting the code infinitely easier. But B Division also learned better programming techniques in the process of cleaning up their codes.

GAM: Yes.

CH: Despite the fact that everybody said the STAR wasn't a success, all that stuff is said by people who didn't run on it. It was successful for a lot of reasons.

GAM: There were several factors about the STAR that made it not a success. One of them was the long gestation time when you were trying to convert your code. You can't do any physics while you are doing that. And the other thing was that it was just an awkward machine to get turnaround on.

CH: Yes, it was. It was not a good machine to do timesharing on because it was so slow. It was like being back on the CDC 6600, and worse actually.

GAM: Actually, I've used the STAR as my watershed point. I've not paid any attention to all the neat machines that have come after the STAR and I've concentrated only on the machines that led up to the STAR. And I think the STAR was kind of guilty, or responsible for, the eclipsing of the computer science developments at the Lab. We were a leader and then all of a sudden we weren't.

CH: Well we got the CRAYs and we did some really good work on the CRAYs.

GAM: Yes, and we learned vectorization and spread that around.

CH: Well, we really didn't spread it around a lot. When you and I went to the very first European Vector and Parallel Processing Conference (VAPP), in Chester, maybe it was in Cambridge, we sat there listening to people talk, giving their papers about what they were doing with the first vector machines they were getting in Europe. We had done all that stuff and, of course, we never published it because we were not very good at publishing such things.

GAM: That's right.

CH: But we deserved the credit for doing it. They were relearning all the things we had learned five, ten years before. But eventually they caught up.

GAM: Well, I don't think they've caught up, I think that they've gone off on a slightly different tangent. They're not running the big kinds of calculations that we were struggling to run.

CH: Well, the weather programs certainly are.

GAM: The weather thing is running on vector machines.

CH: Yes, which is what Sid expected it to do. With respect to efficiency, on a vector machine it is possible, I think, to get fifty percent efficiency if you work at it. Parallel machines perhaps get five percent. I mean it's very hard to get the full efficiency because if you've got any kind of bulk messaging that has to be done, it makes it very hard to get your numbers up.

GAM: Is there any likelihood that by re-engineering the algorithms or changing the computer architecture a little bit that we could get better than they're getting now?

CH: I don't know about the architecture. I still think the dataflow in some form or another is still the best way to do computation. It's much easier for me to write code if I can concentrate on making the code more parallel and don't have to concentrate on hiding the latency. If latency is handled automatically by the hardware, then it's a lot easier to write code for. Dataflow is being used in other places to optimize for long instruction word machines and stuff like that, but it's not in the program as such, it's more in the firmware.

GAM: Why didn't dataflow catch on at the Lab? It was working well, and was showing that it was faster than FORTRAN.

CH: I think it's because it was another language and the Lab at that point had gotten this religion that, as you had mentioned, we couldn't do any computer science any more. We got very close to having NLTSS done, and I was proud to have been part of the project. We actually got NLTSS going and made the demonstration happen within a year after I took over the project. But then we said, "Ah, but we can't afford the staff it's going to take to do this kind of work anymore" and they started cutting the Computation Department back and back and back and wound up with a department full of system administrators and that was it. So, what the future held for the Computation Department was bringing in hardware and going with "off the shelf" software and keeping the machines running and that's the way things went. We did HPSS, but that was done as part of a Consortium with lots of other people involved. We had a small part in that, not a really big part. So, all of that stuff we developed went away. What I'm seeing now is that we are starting again to develop things like CHAOS. This system is being developed by another Consortium. The system is going to run on top of connected Linux boxes and handle all the resource management stuff.

GAM: I don't know much about it.

CH: Well, I don't either, other than that we had CHAOS 1.2 two or three weeks ago, and then we went to CHAOS 2.0 and all of a sudden all the machines are running about five times slower than they used to. Here we are, repeating the past, but 20 years later. I guess a whole new crop of people are in place, ready to re-learn our mistakes! I think we shorted ourselves because the operating systems don't have the goodies that NLTSS had for us when we developed it years ago.

GAM: Is this based on UNIX or Linux?

CH: Yes, these are all Linux. We're getting Linux boxes that are very fast. We're also getting new ASC (formerly ASCI) stuff from IBM which are interconnected Power5s and they run really well. IBM has also done some nice networks for us. I gather dealing with IBM is difficult because you have to talk with them through their lawyers. But it's pretty easy now to network Linux boxes. With the new switches that have come out like the Elan switch (from the former MEIKO machine folks, our first big parallel machine), you buy a bunch of Linux boxes and you get public domain stuff that runs on top to tie it all together and you can have a five thousand processor machine for ten million dollars that'll run ten teraflops. Kind of makes you wonder what export control means anymore because, if you buy a thousand machines, like the Macintosh on my desk (it's a two gigahertz dual processor Macintosh, with two gigabytes of memory and 160 GB of storage, running SIMPLE at 360 megaflops), you've got a very fast piece of hardware made from Apple computers. And when I think back on what we ran on and how much money we paid for what, a hundred and twenty-eight megabytes of memory on the CRAY machines. I mean, jeez.

GAM: It's incredible.

CH: It's incredible, yes. So, it's a different world and I'm still not sure the commodity thing is the right way to go, because the efficiencies are not as good as they are on something like Japan's Earth Simulator. The Earth Simulator costs about $300,000,000 and you can put together a pretty good commodity super-computer for probably one percent of that.

GAM: Well, your simulator is cheating in a way. You can run those parallel things really well, as well as doing all the vectors.

CH: Yes, yes, yes.

GAM: That's a tribute to the weather research groups.

CH: Yes, and the geological stuff that they did. But, I really had hoped that the Tera (now Cray) dataflow product would go someplace, but it just never did seem to go far. I think the problem was that unless you had a really big problem to run on it, it's not cost effective. You know, if you run a small problem on it you're going to use one processor and you're not going to be fully parallel. It's going to cost you seven million dollars to do something you could do on a two thousand dollar Intel box. But if you want to run a really big problem really fast with a sixteen processor Tera computer, which is very heavily parallel, then you get to be cost effective. So it's only cost effective in the large, not in the small. It drives people away I think. Plus, it's different (but not difficult) to program.

One of the things that's really nice about what's been going on for us, is that the architectures are all pretty much the same whether you're running on an Intel Itanium four or five, or whatever, or whether you're running on an IBM Power4 or Power5. They're all cache based, they've all got a long instruction word and they've all got cheap (i.e., not fast) access to memory. If you've got sixteen processors per node like on the ASC White machine, and if you keep all sixteen processors busy, the memory can't keep up. But if you keep only eight of them busy then the bus will pretty much keep up with them. For the Intel boxes, you've only got two processors, so the memory system tends to keep up with the processors pretty well.

GAM: Well, that was partly what I had in mind when I was talking about architectural changes because all the things you are suggesting suffer from that thing called the Von Neumann bottleneck.

CH: Yes, they do. And the only thing that they should work on, really, is instead of trying to shorten latency, find some way to hide it.

GAM: Well there may be a new technology coming along for memories, so we won't have to worry about the latencies and stuff like that, but right now you've got latencies to fight.

CH: Well, no. Compilers will have to discover all of that parallelism that's going to be necessary to hide the latency and, although compilers are really good, I don't think they're up to that. The compilers aren't up to, for example, reorganizing your code to make best use of cache. And so what you wind up with is code you can't generally tune because the caches on all machines are different. That is a difference, I mean your code runs more or less the same on all the machines. But, if it's got a 2 MB cache versus a 256 KB cache it's going to run differently. You're not going to rewrite your code to handle both a small cache and a large cache. In fact, that's probably not possible. And so, cache can be as much as a factor of two in terms of your performance if you use it correctly. It suffers from the same Von Neumann bottleneck if you don't get up to, say, a 97% cache utilization. If you're at 90%, you might as well be at 20%. And that's very hard to program for: very, very hard. You can program for good register utilization, you can program loops so they're easily optimized, and you can do all kinds of other good things the compiler knows how to do. Actually, I did an exercise about five years ago, where I took a piece of code that we needed to make run faster, and I did all the tricks that I'd learned in my thirty years of programming. And, when I got finished, I'd lost the physics in the extra coding I did, and it didn't run any faster than what the compiler did when you turned on optimization level 3. So the compilers are doing all the stuff I was doing. All the loop unrolling and strength reduction and all that kind of stuff that I did, the compiler does automatically. So they are very good at that. They're very good for knowing how to use the computational registers. But, they're not good at knowing how to use the cache.
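The two hand optimizations named here, loop unrolling and strength reduction, can be sketched as follows. This is an illustrative Python example and not code from the interview; the function names and the unroll factor of 4 are assumptions. Python itself does not apply these transformations, so the sketch only shows what an optimizing compiler does automatically to a simple loop, and how doing it by hand buries the original one-line computation.

```python
def weighted_sum_plain(values, stride):
    """Straightforward loop: one multiply (i * stride) per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i] * (i * stride)
    return total


def weighted_sum_hand_tuned(values, stride):
    """Same result after manual strength reduction and 4-way unrolling."""
    n = len(values)
    total = 0
    offset = 0               # strength reduction: i * stride kept as a running sum
    i = 0
    limit = n - n % 4        # largest multiple of 4 not exceeding n
    while i < limit:         # unrolled body handles four elements per pass
        o1 = offset + stride
        o2 = o1 + stride
        o3 = o2 + stride
        total += values[i] * offset
        total += values[i + 1] * o1
        total += values[i + 2] * o2
        total += values[i + 3] * o3
        offset = o3 + stride
        i += 4
    while i < n:             # remainder loop for the last n % 4 elements
        total += values[i] * offset
        offset += stride
        i += 1
    return total


data = list(range(1, 11))
print(weighted_sum_plain(data, 3), weighted_sum_hand_tuned(data, 3))
```

The tuned version is several times longer for the same answer, which is the readability cost described above when the physics gets lost in the extra coding.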

GAM: A lot of effort has been put into those things, so they ought to pay off.

CH: Oh, yeah, they have. I think they're really good and they're very stable and compiler writers have gotten to know how to write multi-targeted back ends so they can have a single front-end and multiple code generation stages. Once we're off the node however, it's a lot of hard work. We mix programming paradigms quite a bit. For example, the code I work on uses MPI for one level of parallelism. However, we also use threads. So if the machine comes with shared memory, we can do some of the physics using threads and some of the physics using MPI, or all of the physics using threads, or all of the physics using MPI and that makes programming really ugly. It's not like we can just do what's right for the Physics and then let the compiler and the run-time system take care of the rest of it for us. We've still got to do assembler-like work for MPI and threads. It certainly creates job security.

GAM: That's great.

CH: Actually, MPI is pretty easy to use after the learning curve is over. What's really ugly is doing threads. Unless you use OMP threads, for simple stuff like parallelizing a simple region or threading loops, everything else is ugly as hell. I wrote a simple code using pthreads, and bloated the code by a factor of 4. The algorithm disappeared in the pthreads implementation.

GAM: So, at some point you left programming and you took on being a manager.

CH: Yes, I did. That was kind of interesting. I was unhappy with my pay and my job ranking, and I felt my division leader wasn't being fair to me. But, he left and we got a new division leader, Roy Woodruff. Roy became the A Division leader and I really liked him. He and I had had some history back when he was a designer. I had worked with Bill Comfort on a code Roy used to help design one of his shots. I had put together a very quick little one-dimensional code and then hooked it up to Bill Comfort's optimizer. The idea was to use this code and optimizer to automatically calculate the correct parameters for the device. We actually got it running and Roy successfully used it on one of his shots. That was the first time, the only time, I went out to the Nevada Test Site for a shot. It was great fun. I think Roy thought I was pretty good and so, when he became Division Leader, the next thing I know I got a half-way decent pay raise. One day, he came up to me and said, "Bob Bell is leaving Computations and they need somebody to manage APD-II, I'd like you to apply. Why don't you think about it?" As naïve as I was, I said, "Well, OK, but I probably won't get the job," not realizing that he was the A Division leader, and that if he wanted me in the job, I was going to be in the job. So, lo and behold, I got the job and I spent seven or eight very happy years managing that group of programmers.

At the same time, I was going to school for my second Masters, and traveling with George for the dataflow stuff. It was a lot of fun, and fun makes you really enjoy the job. I think I was a good division leader. I did some really good and innovative things for the folks in the Division. I always felt my job was to make the jobs of the folks I worked with easier. To smooth the way, so they could work without a lot of crap getting in their way, and I think I managed to do that. Then I was asked to manage the NLTSS project, from the applications side, for a year, because it was going nowhere. And I did that, and we got the project off the dime and moving and actually had the first demonstration of what we called production multi-processing at the Lab. Unfortunately, the problem with multi-processing at the Lab was that nobody was really taking it seriously. The management wanted it, I don't know why, but none of the physicists wanted to do it because it was hard work and not Physics. In addition, the Computer Center never set the machines up so that they could run parallel codes efficiently. They were always running in the timesharing mode so your code would be in there trying to get four processors and somebody else would have one and they'd give it up and then someone else would get it and then you'd get it but you'd lost three of your other ones. And so, by the time you're running a four processor job, it would take you about ten times as long as running a one processor job. So, of course, nobody would run that way. It was awful.

Then I was asked to kind of manage, or more accurately, overlook the Weapons Program computers. And so I managed a project for two or three years to bring workstations into the Weapons Program. We brought in the first Sun workstations, which I think were Sun-3s. Of course, A Division didn't want to go that way and B Division did, so A Division had Computations spend a couple of million dollars developing what they called an intelligent terminal. This was basically an X terminal before X terminals really got to be anything. And that made my job very difficult because the A Division code manager was very hostile toward any oversight by me and that made my life really tough. I don't much enjoy constant confrontation. I did enjoy the project, though; doing it despite all the troubles, because you have troubles in any project and that's why it needs a leader. But I was happy when it was done.

After that, I went on to John Immele's staff to oversee all of the Weapons Program computational stuff. John was the Deputy Associate Director (DAD) for Nuclear Design. I was one of his ADADs. That was a very hard, unhappy time for me. I never did get people to work together mostly because one guy actually refused to participate. He eventually left the Lab for greener pastures. I think I drove him crazy just as much as he did me. But he finally left. After that, I was kind of kicking around. John Immele got removed by the new Director, John Nuckolls. So I reported directly to George Miller, who oversaw the entire Weapons Program. Unfortunately, George was much too busy with more important things to be concerned about what I was doing, so for a couple of years I was just marking time, holding things together. Not my favorite pastime!

Then Randy Christensen was selected as the head of Computations and asked me to come over and be his Deputy and I worked over there for three years. And, again, I didn't like it. There were just too many personalities that were really hard to deal with over there. So when the Lab lost the NMFECC contract...

GAM: I don't think they lost the contract...

CH: Actually, we did lose it. We lost the contract to site it at our Lab.

GAM: Right, they just moved to Berkeley. And, in the process, took a bunch of good programmers with them.

CH: They took a bunch of good people and, in the upheaval, upper management removed Randy Christensen from the job and put the new guy in. And, of course, I disappeared with Randy. At that point I was kind of looking for something else to do and I looked around for some management jobs and couldn't find anything. And I was five years from retirement. This thing about doing things at a high level, you know, it takes forever. You go over and say you've got an opening and I'd like to apply. And they answer, "OK we'll let you know". And then two months later they say, "We're ready to interview you", and you interview and then two months later they say, "Well, we're still interviewing people", and so you're five or six months into this thing before they'll let you know what's going on. So while I was waiting, I taught myself C++. I'd been wanting to learn about the language and about object-oriented programming and I got pretty good at it. I finally gave up waiting for the management job and I went looking for a job to get back into programming. For a long time, I couldn't find anybody to hire me. Finally, Linnea Cook, who used to work for me, convinced B Division that they ought to take a chance on a guy who'd been a manager for twenty years and who really, really did know how to program. So I brought in some of my work in C, they liked it, and they nervously hired me. Within a year, I was indispensable. It helps when your first love is programming. Linnea was really good to work for, and I liked the organization. B Division was really good, a very nice place to work. They were really good to me and I really enjoyed being there.

So, I went from being a physicist to being a programmer. I had gotten a Master's in Computer Science years ago because that was what I really loved. But, looking for a better salary, I went into middle management. That worked pretty well, but when I went to higher level management, I got Peter Principled; I got to a place where I didn't like it and I really wasn't good at it either, and then stayed there actually for a lot longer than I would have liked. I actually spent about six or seven years working at something I disliked. I was exceptionally lucky to be able to go back to what I loved doing, programming. I think most managers aren't able to return to the technical pursuits they followed when they came to the Lab. I don't do any physics. I can understand the physics, but I don't do it. I leave that to the physicists.

GAM: I think that's an interesting career.

CH: It was a very good career. When I talk to young people, I try to tell them that just because they are doing what they are doing right now, it doesn't mean they're going to be doing that the rest of their lives. I can pretty much guarantee that, especially at the Laboratory where there are so many opportunities to do exciting things, when you're tired of doing one thing you'll find something else exciting to do. And you won't have to leave the Lab to do it.

GAM: I guess I'm fairly well convinced that I don't know anything, really, about the Lab, the new Lab, I mean it's just different.

CH: Yes, it's changing.

GAM: It's no longer elegant, and I can't predict what it's going to do. There seems to be a loss of cohesiveness there, nobody feels like they are on a team. And we used to talk to each other across disciplinary boundaries. It just doesn't seem to be that way anymore.

CH: Well, the guys I work with are a team...

GAM: Yes, but you're inside of a division.

CH: Yes, inside a division and it's very cohesive.

GAM: You're talking about a cross step.

CH: Oh, I don't know, this always was bad. There were all those years when the Weapons Program was giving twenty-three million dollars a year to the Physics Department to do Weapons Supporting Research. And every year they would say, "This is what we'd like you to do" and every year Physics would say, "Well, OK, but this is what we're going to be doing" and they didn't match. And Weapons Program felt like they were having $23M taken out of their hide and got nothing for it. And that went on for years and years. So, twenty years ago, thirty years ago, there were still all these fiefdoms and they stayed separate.

GAM: It could be I had blinders on, I don't know but, when I was there, I thought it was great.

CH: Well because, yeah, because of what you were doing and what you were allowed to do.

GAM: Yes.

CH: I think the Lab has one of its best directors right now. I think Anastasio is probably the best one since Johnny Foster.

GAM: He's sure photographed a lot.

CH: Yes, he is, well his job really is trying to keep his head down so that the Lab doesn't get in the news the way Los Alamos does.

GAM: Yes.

CH: He spends a lot of time trying to make sure we don't do something stupid. We didn't worry so much about that in the old days.

GAM: Well, I met him when he came to the Salishan Meeting one year and he was going to talk about this micro-radar thing and he was very nice and I enjoyed meeting him and his family. His wife and his daughter, they were very nice. But I haven't talked to him since.

CH: Oh, he's a good guy, he's a guy who actually worries about the people issues. Which is not the way John Nuckolls was and, to some extent, Bruce Tarter, and certainly not Roger Batzel. I mean, Duane Sewell maybe worked the people issues, but Batzel didn't worry about them. So, you have to go pretty far back, I don't even know if you can go back far enough to find a Director who cared about the folks the way I think Anastasio does. He's not got the pizzazz of some of these other guys, so to speak, but I like him a lot. I really do.

GAM: I think we should go back to Herb York.

CH: Yeah, maybe so. That's a long ways back too. A long ways back.

GAM: Well, can you think of anything else that you want to cover here? Of all the computers, for instance, that you used, did you ever develop a taste for one particular computer?

CH: Oh yeah, I think the machine I liked working on the most was the CDC 6600. I liked the balance. I liked the instruction set. It had 64 instructions and it was almost perfect.

GAM: That was our first adventure in a RISC architecture.

CH: Yes, it was a RISC computer and it was very well balanced. The instruction set was beautiful except I think it was lacking an integer divide. I mean from that point of view and for its time that was a very, very beautiful machine. It was elegant, it was really elegant to look at.

GAM: It was a milestone if you will.

CH: And the 7600 was nice because it was so much faster. But, they had to fudge with Large Core memory (LCM) and that made programming somewhat uglier on that machine than it was on the 6600.

GAM: Well that was everybody's complaint about that machine.

CH: I never got that excited about Seymour's vector machine, the CRAY-1, probably because I worked on the STAR-100. The CRAYs were kind of nice, once they put SECDED into it. But then they started grafting more stuff on, like multiple processors. They got more ugly again.

GAM: Which is in violation of what Seymour would do.

CH: Yeah, but he kept adding some, not that he kept adding much. He was a very difficult man to convince about adding features. He needed to understand the feature AND the applications need before he'd do anything. He said he never put cache into the CRAY because he didn't really understand it. I'm sure he did, but he meant a deep understanding, including how it was used by the applications. He was unique that way.

GAM: He was right, Chris.

CH: Oh, I agree.

GAM: And I think that the country lost an incredible asset when he was killed.

CH: Yeah, I think so too. In fact, when you think about what he did versus what was in the STAR, the CRAY vector machines were still pretty much RISC machines. Operands went to vector registers for computations, so you got data reuse without hitting the memory as much. But, when you look at what they did for the STAR-100 with all the micro-code they did, one of the reasons why the STAR wouldn't work really well was that if you tried to use some of the really fancy instructions that they had, now I mean sparse instructions or matrix transpose, they were often flaky. I think that some of the esoteric STAR-100 instructions never worked reliably. It was difficult because you had a small program running in the micro-code to do one instruction for you. So the STAR was an example of a complex instruction set machine versus something like the CRAY which was more RISC-like. When we ran our code for the STAR, I made sure I kept it very simple, using very little of a very large instruction set. I learned in the Army to "Keep It Simple, Stupid". So I wrote the code using only a very basic set of instructions that were simple. Adds, subtracts, multiplies, divides, moves, you know things like that. Where B Division was always having trouble with their codes running on the STAR, we never had trouble. We would run and nobody else would run, because we were only using about ten percent of the instructions. The simple ones. They couldn't get them wrong. Later on, Lowell Wood was building the S-1 and the S-II machines, which were complex instruction-set machines, that would do a fast Fourier transform (FFT) as a single instruction. And I thought, "Why?" It seemed to me that made the machine difficult to build and hurt its reliability. Gordon Bell came out to look at Lowell's project and he was really intrigued by the S-1, and especially about its CIS nature. I couldn't understand why a guy as smart as he could be so dumb about a machine like that. And then after we got into the meeting and started talking about it, I realized the PDP machines were kind of complex instruction set machines. I mean you could, you had all this instruction modification table abilities in the PDP-6s and...

GAM: ...could do that within an instruction.

CH: Yes, you could. But I think Gordon was really interested in whether or not complex instruction sets were going to work. And, of course, I think it's turned out they don't, at least for very complex instructions. Intel's architecture is still fairly complicated and is very successful in its own way. Of course, so is IBM and their RISC architecture. We're using both types of machines in our clusters.

GAM: I can't, what shall I say, rationalize Gordon's attitude about the S-1.

CH: No, no I didn't understand it at all. But anyway, yes, those were some interesting times. I think a lot of times, when we got into building stuff ourselves, we just didn't have enough expertise. We didn't know enough to do it. And we certainly weren't going to be able to manufacture the S-1 ourselves. We couldn't keep up the support staff necessary to maintain NLTSS across all the machines we were going to be running on. Building compilers, you know, as much fun as that would have been, they just couldn't afford the 50 people it would take to do it.

There was that trip we made to IBM one time to talk to George Paul. We wanted to talk about NLTSS and about vectorization. I remember them asking how many people we had on the code developing NLTSS and CHAT and the boss said, "I don't know, twenty" and how many lines of code? And the boss said, "Oh about a million" and then, well, how many lines of code in the IBM and it was twenty million and, I don't know, 300 people and there was a database in the system as well. I mean, their operating system was a database manager as well as all this other stuff. And he asked and I remember, "What kind of cache management do your users get?" and he says, "Oh well, we've checked the users and they typically run about 97% or 98% cache utilization." I asked, "How well does your software do?" "Oh, we run about 40%", he responded. So their stuff either was not amenable to cache or they just weren't putting their time in on it. To the user, efficiency is everything because there's never enough machine. So a factor of two in doing cache right is worth spending the effort on. Whereas, the operating system is usually so big you don't have much locality to optimize.

GAM: Well, our intersection at the IBM was at the T.J. Watson research labs. And that again is a different thing from the IBM production facilities.

CH: Well, but this thing was a little different because IBM was toying with the idea of going into the vector market. Matter of fact, they hired a couple of our guys out of the compiler group who knew how to write vector compilers, but IBM decided, I think, not to do it.

GAM: What happened was that they learned more about miniaturization of the RS/6000 and that's become their main line of mainframes if you want to call it that. And, in the Blue Gene, they are putting two of those RS/6000s on a chip.

CH: Yes, we're going to see how that goes. Actually, that's going to be an interesting architecture. Because the machines we've been on for the last ten years, I think, have all been classical machines. I mean you just tried to slap as many of them as you can onto a node, and hook them together by a bus or small crossbar. The Blue Gene's going to be real different. It's going to be interesting to see how well one can program for it. I may actually get to do that if I don't fully retire.

GAM: I don't know of anybody who's making any plan to use a significant fraction of the machines that are available on the Blue Gene.

CH: Well, if it turns out to be a useful machine, I'm sure we'll figure out how to use most, if not all, of the processors.

GAM: Well, some people are talking about using a thousand machines, two thousand, but it's a different story for sixty-five thousand.

CH: We've run, I don't know, six or seven thousand processors, and there's really no reason why we can't run more than that although we haven't done the scaling. So, I mean the issue is always scaling or not...

GAM: Well, I think before you start using more machines, aren't you going to need an enormous amount of infrastructure so that you can be told in reasonable terms what's going on all over the machine? Something can go wrong in a certain section and you'll never know it until...

CH: Well, the way they find out about it, mostly, is the user calls up and says, "My code isn't running right now, and it's clear to me it's processor 12 on node 5 that's causing the trouble". And then they can look and say, "Oh yes". So the user is a critical element of the diagnostics path.

GAM: Yeah, as usual.

CH: Diagnostics are, I think, still a black art for some of these things.

GAM: Well, I think that's the entire secret to making parallelism more attractive.

CH: It's easy to put two processors on a chip. Getting as much on a chip as you can makes for more capability.

GAM: Why, you can exploit a little of what you might call locality, by having these, your algorithm focused the right way on two processors instead of just one and then having to reach out for all nearest neighbors or something like that.

CH: Yeah, well, we do that. There are a couple of ways to do it. With the two processors per node, you could do two-way threading on a node. And then use MPI to go off node. Or, you could just do MPI everywhere. When ASCI White came in, one of the nice things about it was that they did shared memory MPI as well as non-local MPI.
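Editor's note: The distinction drawn above between on-node and off-node communication can be illustrated with a small sketch. It is not taken from any Lab code. It assumes that ranks are numbered consecutively within each node, and the names `node_of`, `transport`, and `PROCS_PER_NODE` are hypothetical, chosen only for this illustration.

```python
# Illustrative sketch only: classify a message as on-node or off-node.
# Assumption (not from the interview): ranks are numbered consecutively
# within each node, so ranks 0..p-1 live on node 0, p..2p-1 on node 1, etc.

PROCS_PER_NODE = 2  # assumed value, matching "two processors per node"


def node_of(rank: int, procs_per_node: int = PROCS_PER_NODE) -> int:
    """Return the index of the node that hosts the given rank."""
    return rank // procs_per_node


def transport(src: int, dst: int, procs_per_node: int = PROCS_PER_NODE) -> str:
    """Return the path a message would take under the assumed layout."""
    if node_of(src, procs_per_node) == node_of(dst, procs_per_node):
        return "shared-memory"  # both ranks sit on the same node
    return "network"            # the message has to leave the node


if __name__ == "__main__":
    for src, dst in [(0, 1), (1, 2), (4, 5)]:
        print(src, dst, transport(src, dst))
```

Under this layout, a code that keeps most of its exchanges between ranks on the same node sends most of its traffic over the shared-memory path, whether that path is reached through threads or through a shared-memory MPI implementation.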

GAM: That's a very important part of that.

CH: Yeah, our code was constructed so its communication pattern was a master and, say, seven adjacent processors as workers. They communicated via shared memory MPI. So we had maybe 16 masters who would be doing most of the communication via memory sharing, and only occasionally did they need to go off-node to exchange data. We got really good speedups. And we went faster by structuring it that way. But you can get into other subtle problems, like when you're using all 16 processors on a node and, every once in a while, a system periodic would kick in and preemptively take one. If you are given 15 processors to do a 16 processor job, your code will thrash until the system periodic finishes. We got so we were only using fifteen out of sixteen processors, always idling one so it could take care of the system. And we run faster. We run faster using fifteen processors than you do using sixteen.
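Editor's note: A toy model suggests why leaving one processor idle can pay off. It is a deliberate simplification and not a description of the actual code or system: it assumes the work divides evenly, that every step waits for the slowest worker, and that a system task takes a fixed fraction (`daemon_share`, a made-up parameter) of one processor's time. The thrashing described above would make the real penalty larger than this model shows.

```python
# Toy model only: compare using all processors on a node with leaving one idle.
# Assumptions (not from the interview): perfectly divisible work, a barrier at
# the end of every step, and a system task that steals a fixed fraction of one
# processor's time.


def step_time_all_busy(work: float, procs: int, daemon_share: float) -> float:
    """All procs run the job; one loses daemon_share of its time to the system.

    The step ends when the slowest (preempted) worker finishes.
    """
    per_worker = work / procs
    return per_worker / (1.0 - daemon_share)


def step_time_one_idle(work: float, procs: int) -> float:
    """One processor is left free for the system, so no worker is preempted."""
    return work / (procs - 1)


if __name__ == "__main__":
    work, procs = 16.0, 16
    for share in (0.02, 0.0625, 0.10, 0.25):
        busy = step_time_all_busy(work, procs, share)
        idle = step_time_one_idle(work, procs)
        print(f"daemon share {share:.4f}: all busy {busy:.4f}, one idle {idle:.4f}")
```

In this model, idling one of 16 processors wins whenever the system task would otherwise take more than 1/16 of a processor, since 1/(16(1 - f)) exceeds 1/15 exactly when f is greater than 1/16.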

GAM: Chris, this has all been so very interesting and we've gone rather far from remembering things from our earliest days. But it's just been fascinating. However, we're now at a natural breakpoint. So, I'd like to thank you for taking the time to talk about your adventures at the Lab.

[1] Editor's note: STACKLIB is a complete library of programs to implement efficient vector operations. STACKLIB programs typically executed about four times faster than the equivalent FORTRAN versions. It was proposed by Chuck Leith based on his experiences with the CDC 6600. The original implementation was done by Fred Andrews. Later improvements were made by Frank McMahon and Lansing Sloan and implemented on other computers like the CDC 7600, STAR-100, and the CRAY-1. These routines were adopted by supercomputer users all over the world. Full details were written by McMahon for an internal report, but apparently were never published. Readers wishing to dig deeper into STACKLIB will need to contact McMahon directly. A series of OCTOPUS COMMUNIQUES, numbers 847, 872, 930, and 979, available through the LLNL Archives and Research Center, also relate to STACKLIB.