A Second Interview with John Fletcher

John Fletcher

JF = John Fletcher
GAM = George Michael

GAM: Today is January 9, 1995, and this is my second interview with John Fletcher. Let's just start, John.

JF: Okay. Well, in reference to our previous conversation, I said that maybe I would discuss something later, but that I didn't. That had to do with the error-correcting code I worked out that became part of an American National Standard.

I felt that this was one of my two biggest successes in terms of publication, the other one being the pentomino program that made the cover of "Communications." Well, the error-correcting code came about in an era where we did everything ourselves here, and we did it down to the very lowest levels. That's mainly because things weren't available any other way. In particular, I was putting into microprocessors some of the low-level communication activity for our network, and it was decided that we should have some form of check summing. The more-or-less standard way of check summing�which isn't quite the right word, because it isn't really a sum, but�

GAM: Well, error-correcting code.

JF: Right. Really, it was error-detecting, because actually in this case it was enough that it be an error-detecting code.

GAM: Can you remember who was involved in this decision?

JF: Well, I pretty much did it myself. I can't remember in particular who thought we needed some kind of error detection on the wire. I think it was just kind of obvious to everybody that we did. We had these little microprocessors�I forget which one at that time. At a later stage we used something called the 8X300.

Well, I could tell another story about the 8X300. We also used the 8039. And then there were things like PDP-8s and PDP-11s, which really aren't microprocessors. They're mini computers, as I guess we'd call them. Anyway, the more-or-less standard way for error detection or correction was to use what's called a CRC�a cyclic redundancy check. The mathematics of that is that you treat a string of bits as the coefficients of a polynomial over the binary field. Then you divide it by some fixed polynomial and keep the remainder. Well, in fact, what you do to put the check sum on the end of the transmission is that you put on whatever is needed to make the quotient come out to zero. So the sender does that, and the receiver does the division, and if the quotient doesn't come out to zero, he knows there's an error and asks for a retransmission.

Well, a cyclic redundancy check done in software on a typical computer where there aren't any special instructions to handle binary polynomials, requires that you test every bit, and shift, and exclusive OR. So it's very time consuming; whereas, if you do ordinary arithmetic, it's much faster because that's built in to the processor.

So, I fooled around and came up with what I thought was an arithmetic check sum that was better than just the simple-minded "well, let's sum it up, and keep the sum." I was coding this when, if I remember correctly, Danny Nessett more or less challenged me. He said, "Well, here, you know, you're just doing this thing, and you don't know if it's good or bad!" And I said, "Well, you know, I have some reasons for thinking it's good." And he said, "Well, that's not good enough."

So, what I did was I got Peterson's original publication about CRCs, in which he proves that CRCs have four desirable properties. And I said that I would study this thing that I've done in regard to these same four properties and see how it does. I fussed around and finally developed some proofs about its properties; they weren't as good as the CRC, but they were really pretty good. It was pretty amazing, because I'd just come up with this thing haphazardly. It wasn't like I had scientifically gone out on a search. So that more or less convinced Danny Nessett.

But, then, having gone to this effort and written it up, I decided I would try to publish it. I submitted this paper to some IEEE journal�I can't remember which one right now. After a while I got back some rather negative referee reports. What I'd hoped the referees would do was improve my proof, which seemed to me to be very long-winded. I thought there must be a better way to prove some of these things, and shorten the paper. But these guys hadn't looked at the mathematics in any serious way it seemed to me. One of them said, "You know, there's no need for this. We can do all this with CRCs," completely ignoring the fact that one might not have the hardware or the fact that the CRCs took longer. Another guy picked at some small point in the proof, and he was just wrong. I forget what the third referee did.

Anyway, I thought, "Well, if I'm going to be thrown out, I shouldn't be thrown out for these foolish reasons." So I wrote a rather lengthy rejoinder, point by point, as to why the referees were wrong, and sent it back. Then something like a year passed, and nothing! So I finally wrote the journal, and I said, in effect, "Look�are you going to publish this thing or aren't you?" And almost immediately, by return mail, they said, "Yes, we're going to publish it."

What I pieced together reading between the lines is that they'd somehow lost track of my paper! My letter, after a year, had drawn their attention to the whole thing again�that I had sent back a rejoinder, and they'd apparently ignored it. They were so embarrassed by the situation that they just said they would go ahead. My feeling is that I got published just because they'd made a mistake, and not because they were really convinced it was worthwhile.

So, the paper came out in, I think, a January issue. About May, I got a letter from the same, or maybe it was another, journal to act as a referee myself. They'd obviously picked me because the paper to be refereed referred to the work I'd done that had been published in January. The paper made a little modification to what I had done so that the check bytes were not at the end of the message but in the middle somewhere. And I thought, boy, that is a strange thing to do. But their mathematics was correct; they'd clearly done it. So, I essentially wrote back and said, I don't know why anybody would want to do this, but they did it right.

No sooner than I'd done that, than Dick Watson, who was perhaps Division Leader�I can't remember whether this was the era when he was the Division Leader, but he was one of us working on the network�came back from some kind of meeting he'd been to in the east. I think it was in Washington, D.C. And he said to me, "You know, your check sum is going to go into a national standard." And I said, "A national standard for what?" He said, "For check sums of the headers for some protocol." And I said, "Well, that's really strange, because that isn't what this thing was intended for." It was intended to be a check sum of a whole message, and would be an appended byte at the end of a message that was, in general, of arbitrary length; whereas, of course, a header�at least this header�was fixed length, and the bytes were to be imbedded in it. But I immediately recognized now what the refereed paper was about. Somebody had been converting my check sum into one for a header.

But then I said to Watson, "You know, it's really strange that they would use my check sum when that isn't what it was intended for. Why do you think they did that?" And I've always remembered his answer. He said, "Because it's written down." In other words, the way standards happen is not because people sit down in the standards committees and think, we have a problem�what is the best way to solve it�"let's get some experts and give them a grant," or anything like that. It's that they look around for something that's already written up and say, "We'll take it. And if it's not quite right, we'll modify it." That is just exactly what happened in this case.

GAM: I think that's precious!

JF: Yes, I was just really amazed at that. Right now, today�1994-1995 time frame�I got a request, not to referee, but to review a paper. This was from "Computing Reviews." And this paper is another idea for, again, doing something in software�actually, the way the paper is written it also suggested that you can do it also in hardware�something better than a CRC. And it mentions the so-called Fletcher check sum. (I had actually done more or less unintentionally something that I'd always thought was a clever move: If you invent something, don't give it a good name, and then people will refer to it using your name. That's essentially what had happened.)

It turns out that this new paper is badly flawed, but it has a good idea. It's just that the mathematics supporting the idea is all gummed up. So, in my review I said that this paper has a very good idea in it, but it's full of so many mistakes that you can't really appreciate that. And my last line of the review was, "This paper cries out for a rewrite."

So, that's what I most recently have done�I've not rewritten the guy's paper, but I've written a sort of follow-up paper for it that corrects its mistakes, hopefully very cleverly hiding that I'm criticizing him too heavily. So possibly now I'll have another publication in this field. It's currently at the journal, waiting for a year, maybe.

GAM: That's great!

JF: It seems to me at the very beginning of this little session I said I would have�

GAM: An anecdote about the 8X300 or whatever it was?

JF: Oh, yes. This is not really very relevant to anything, but the 8X300 was being sold by Signetics at that time. Before we actually had an 8X300, we had the manual. And the situation was such that I was starting to write the programs for this machine that we didn't have.

I was working on it, and it became clear that one aspect of the operation of this machine�the programmer's interface to it�was not clear. There were two possible ways it could go. It seemed to say it went one way, but in thinking it over, you would think it really should work the other way. It would be much better.

So, I finally decided I would call Signetics and ask them how it really worked. Well, it turns out nobody at Signetics knew, at least I couldn't find anybody there, because Signetics had not developed the chip. Another company named SMS had developed it, and then sold it to Signetics. Finally, somebody at Signetics gave me the name of somebody at SMS. I'd been talking to all these people who just knew nothing of what I was talking about�about their own chip.

I called the guy at SMS, and it was just amazing. As soon as I started talking to him, I thought, "Oh, yes, here's somebody who really knows what this is all about." And there was a tricky way to compute the parity of an 8-bit byte in three instructions on this machine, which I had already discovered. And he mentioned it, you know, "Oh, yes! Yes! We both know that!" It was just wonderful talking to him.

Then I asked him about the point I wanted raised. And he told me, "Yes, the way the manual says it is, is actually right." And I said, "Well, you know, that creates some big problems for the assembler to figure out just what you're doing." He said, "Yes, we had no end of trouble with that. You couldn't really tell the assembler everything, and you had to keep filling it in yourself," and so on. So it was just like talking to an old friend.

Actually, Gale Marshall's assembler had an added feature to take care of the problem. I told Gale, "We need something in this assembler that�" You were going to say?
John Fletcher

GAM: You were talking to a kindred spirit. That's great.

JF: Yes, right. Anyway, that's the story of my foray into the American National Standard. I eventually had to write a little squib that I guess appears in some of the standard supporting documents that explains how good and how bad my check sum is.

In fact, a little more of the story: It turns out that a possible kind of error that was never analyzed by Peterson or myself (because Peterson didn't analyze it) is the idea of some bits just disappearing from the middle of a message, or from the end�in other words, not being changed from 1 to 0 or back, but just gone!

GAM: So that would get everything out of step.

JF: Right, yes. But it turns out that it is possible�well, I've never analyzed this for a CRC, but the result might be about the same as for my check sum�that if the bits drop out at a crucial place, it's undetected. If the right number of bits drop out at a point when the check sum is more or less zero�if you drop out a bunch of bits that would really have been zero but now are just missing, then you don't notice it.

The way this particular flaw came to light is that we were using this check sum in our network, and there was a particular terminal that you couldn't always log in from. You would try to log in, and the thing�I forget just how it misbehaved�would not work right. And you'd have to try again. It was just this one terminal. The guy who had the terminal obviously didn't like this situation. I don't remember just why it was that we really got interested in it, because you'd think, well, you know, this guy's got a flaky piece of hardware.

But I guess, when they replaced the hardware, eventually it became clear that there was something wrong. The problem followed the number that the terminal had. In other words, you could change the hardware, you could change the wires and everything, but it was the number that was assigned to the terminal that was important.

Well, it turned out that the number of the terminal appeared in the header of the messages, and, depending on what the number was, when you got to a certain point in a message the check sum would have a different value. And since the numbers of the terminals pretty much ranged over all the possible values of the 8-bit bytes, there would be one particular terminal number for which, at a critical point in the message, the sum was zero. At that same critical point in the message, a flaw in some piece of hardware in the network was such that it would just drop a byte out at that moment. So, for this one terminal, the error would be undetected. If it were any other terminal, the error would be detected and we would retransmit, and then it would work probably the second time or the third time. But this one terminal was undetected, and therefore this message went God knows where on the network, because it was flawed, and in any event didn't do the right thing.

GAM: Well, those are great things to discover; they make you feel good when you find the trouble.

JF: Yes, right. That was one of the things I always like about programming: It was a kind of detective work.

I remember once we were having trouble with a PDP-8 or a PDP-11 concentrator. It was attached to something called a selector line unit, which was a home-built piece of equipment for interfacing the minicomputer to the mainframes, the Control Data 6600 or 7600s. There was a problem that we were having with these machines. There were several PDP-11s and several selector line units, and they were all having this trouble, which was called "hanging the selector line unit." That is the name that was used for the trouble, because when the trouble happened, the lights on the selector line unit would just freeze up in a pattern like nothing was happening. You'd have to restart the minicomputer. The engineers were just driving themselves crazy trying to find this problem. I would be called in, and I would study the thing in the light of this idea that everybody said there was something wrong in the hardware that was causing this hanging.

Then one day, driving to work, I said to myself, you know, I'm no good at solving the problem as it's been described to me, because all I can do is tell them what they can see�that nothing's happening. But if this were a software problem, I would be able to solve it myself. So I said, I'm going to go to work with the idea that this isn't a hardware problem like everybody tells me it is�it's a software problem. And I sat down, and in less than hour, I found the bug in my own code that was causing the trouble. It did nothing to the selector line unit. It just caused the computer to go into a tight loop, and because of that, of course, the selector line unit stopped doing anything, because nothing was telling it to do anything, and it stopped.

This is a thing that, even today, I don't think I've really learned: When people tell you they're having trouble, you shouldn't believe anything more than their statement that they're having trouble. Any of their analysis about what the trouble is, you should take with a grain of salt. It often, much too often, turns out that it's not what they're telling you.

Another story along that line, as one thing leads to another, was at a time when I'd gotten wary about the possibility that people would say one thing and mean another. So, some guy called from way out somewhere in the Lab and said that his terminal didn't work at all. He would type and nothing would happen. I could think of several things that were possibly wrong, depending on exactly what all those words meant. So I went over it very carefully. I said, "Now, let me make clear that what you're saying is you press a key on the terminal, and absolutely nothing happens. Nothing appears on the paper." (This was a long time ago.) "Nothing appears on the paper at all." "Yes." "You push a key�nothing on the paper?" "Nothing at all." I said, "Okay, I'll be out."

I went out there, he showed me the terminal, and I tapped a key, like "a". Bing�"a" comes up on the paper. And I said, "Oh look, something did come up." He said, "Oh, I didn't think you meant that!" In other words, the echo was not included among the "nothing at all." Of course, that completely�I can't remember in what way now�changed the nature of the problem as to what was wrong. And, in fact, in particular, I could have solved that problem without coming all the way to where the guy was! So I was less than happy.

It's hard to get people who don't know the technical aspects of a thing to tell you the important facts, is how it turns out.

GAM: Well, I think really the problem is that an awful lot of people haven't a deep enough awareness of the meanings in the English language. They just don't know what the words mean!

JF: Yes, it's amazing. In fact, this reflects a philosophical belief of mine that a lot of disagreements are merely over: What do the words mean?

GAM: You say something and the other guy hears something different.

JF: Yes, in fact, an interesting thing about that is what I call the "but you said" syndrome. You talk to someone and there's a misunderstanding; and you talk some more, and finally it's clear that both people now know what was said and what the meaning was. And you think, now we can go ahead. But the other guy in the conversation then says, "But you said�." And what he's trying to do now is prove to you that what you said before doesn't mean what you now both agree is the meaning. And I think, what a useless enterprise! You can't prove to someone that he meant something else�obviously the same words meant something else to him. That's just the unfortunate way it is!

GAM: I have that experience lots of times when I deal with making�in business, with e-mail. I send something that's perfectly clear, and the guy doesn't read it! There's nothing you can do about that. You just have to suffer it. That's the way it goes!

JF: Yes, you sometimes think people are more interested in words than in meaning. Just before I retired, I got involved in a project with one of these CRADAs (Cooperative Research And Development Agreements) that involved a large computer firm from outside the Lab. We were doing this project according to their way of doing things, which involved lots of paperwork. And I finally just sort of dropped out�I knew I was going to retire, so I wasn't under any strong compulsion to keep working on it. It didn't interest me, because there was an endless�it seemed to me endless�sequence of meetings in which we would write up one of the requirements documents or design documents or something, and it just was wordsmithing. You know, how are we going to say this? It always seemed to me, the important thing is: Do we all understand what it is that we mean to do here? But the important thing to them was to get the words on the paper.

GAM: Actually, that syndrome, to me, is indicative that there is a non-technical person at the decision point. They don't understand what you're saying, and they feel very unsafe, so they continue to ask for clarification.

JF: Yes, something along that line in�I forget what era, but Bob Lee was the head of Comp at the time. Lee and a couple of other people, including me, went to Clearwater, Florida. It was to some large computer company�I forget which one now. Anyway, one of the things at that time was provably secure kernels. In other words, the idea was that you were going to prove that something was somehow safe in a security sense. So this company had been working on that, and they were going to give us a talk about how they did it.

What it came down to was that they had written down the specifications of what they wanted in some kind of formal language. Then they wrote down a description of what they were doing�sort of a design thing�in some other quasi-programming language. And then they had a methodology to prove that the one met the other.

Now, there were all kinds of missing steps. For one thing, there were no proofs that the design was correctly reflected in the actual program or that the original requirements really were secure. It was just this little piece of it. But they were very proud of it.

I remember bringing up the point at one time in the discussion. I said, "But what about the people who are actually working on this thing�do they really think it's secure? I mean, aren't they expert enough to be able to figure out for themselves whether what they're doing makes security sense or not?" And the answer, of course, was, "Well, yes, of course, they think it's secure."

I said, "Well, why couldn't you then just skip all this formal stuff? Why couldn't you send those experts to Washington and say, 'Look, we've put together a system here, and in the consensus, or even the unanimous statement, of all these experts working on it (or they could bring in some other outside experts), it's secure.'" And the guy in Washington would say, "Well, if all these experts agree to it, it must be so." And they'd say, "No, no. The people in Washington do not think that's adequate. They want something more than just the opinion of experts."

And I said, "Okay. So, what you're doing is you've written down the spec, and you've written down this design, and you've written down this proof that the one meets the other. You're going to take all that paper to this guy in Washington, and he's going to look it over, and convince himself that you haven't made a mistake!" And they said, "Oh, no. No. Because he isn't himself capable of doing that. You know, this is not his field." I said, "Well, then, how is he convinced?" And they said, "Well, you know, the experts tell him it all works."

So, the expert can't just tell you it works, but he can tell you that the proof that it works, works, and he is believed. So, funny things go on, and the trouble is that you and I can sit here and laugh about it, and maybe people who eventually read what we say or listen to the tape will laugh about it, but it goes on!

Actually, people don't stop doing it, even though other people are laughing about it. It's interesting. I don't know whatever became of the provably secure kernel. I don't hear about it anymore.

GAM: I think it went the way of all other wild ideas that don't have any real basis in technology.

JF: Well, there have been all kinds of fads over the years, and they eventually seem to fade away. And yet, of the people�which includes myself�who all the time said, this is just a fad and it doesn't make sense, nobody ever says, "Hey, those guys were right; maybe we'll believe them next time!" Instead, the next fad comes along, and then you pooh-pooh it. But it has a life of its own, and it doesn't go away.

GAM: That's always been the case. That's the way people are. It's just�sad.

Note (added April 10, 2008): John's first interview can be found here.
Interested persons may also view his recently unearthed 1983 video, "Computing at the Livermore Lab" on Page 4.