A Second Interview with John Fletcher
JF = John Fletcher
GAM = George Michael
GAM: Today is January 9, 1995, and this is my second interview with John
Fletcher. Let's just start, John.
JF: Okay. Well, in our previous conversation I said there was something I
might discuss later, but I never got to it. That had to do with the
error-correcting code I worked out that became part of an American National
Standard.
I felt that this was one of my two biggest successes in terms of publication,
the other one being the pentomino program that made the cover of
"Communications." Well, the error-correcting code came about in an era where we
did everything ourselves here, and we did it down to the very lowest levels.
That's mainly because things weren't available any other way. In particular, I
was putting into microprocessors some of the low-level communication activity
for our network, and it was decided that we should have some form of check
summing. The more-or-less standard way of check summing--which isn't quite the
right word, because it isn't really a sum, but--
GAM: Well, error-correcting code.
JF: Right. Really, it was error-detecting, because actually in this case it
was enough that it be an error-detecting code.
GAM: Can you remember who was involved in this decision?
JF: Well, I pretty much did it myself. I can't remember in particular who
thought we needed some kind of error detection on the wire. I think it was just
kind of obvious to everybody that we did. We had these little microprocessors--I
forget which one at that time. At a later stage we used something called the
8X300.
Well, I could tell another story about the 8X300. We also used the 8039. And
then there were things like PDP-8s and PDP-11s, which really aren't
microprocessors. They're minicomputers, as I guess we'd call them. Anyway, the
more-or-less standard way for error detection or correction was to use what's
called a CRC--a cyclic redundancy check. The mathematics of that is that you
treat a string of bits as the coefficients of a polynomial over the binary
field. Then you divide it by some fixed polynomial and keep the remainder. Well,
in fact, what you do to put the check sum on the end of the transmission is that
you put on whatever is needed to make the remainder come out to zero. So the
sender does that, and the receiver does the division, and if the remainder
doesn't come out to zero, he knows there's an error and asks for a
retransmission.
Well, a cyclic redundancy check done in software on a typical computer where
there aren't any special instructions to handle binary polynomials requires
that you test every bit, and shift, and exclusive OR. So it's very time
consuming; whereas, if you do ordinary arithmetic, it's much faster because
that's built into the processor.
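(For illustration only, a minimal C sketch of the kind of bit-at-a-time CRC loop
being described, assuming the common CRC-16 polynomial 0x8005; every input bit
costs a test, a shift, and an exclusive OR:)

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-at-a-time CRC: each message bit takes a test, a shift, and an XOR. */
    uint16_t crc16(const uint8_t *msg, size_t len)
    {
        uint16_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)msg[i] << 8;               /* bring the next byte into the register */
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 0x8000)                       /* test the top bit */
                    crc = (uint16_t)((crc << 1) ^ 0x8005);  /* shift and XOR in the polynomial */
                else
                    crc = (uint16_t)(crc << 1);         /* shift only */
            }
        }
        return crc;   /* the sender appends this; the receiver recomputes and compares */
    }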
So, I fooled around and came up with what I thought was an arithmetic check
sum that was better than just the simple-minded "well, let's sum it up, and keep
the sum." I was coding this when, if I remember correctly, Danny Nessett more or
less challenged me. He said, "Well, here, you know, you're just doing this thing, and you don't know if it's good or bad!" And I said, "Well, you know, I have some reasons for thinking it's
good." And he said, "Well, that's not good enough."
So, what I did was I got Peterson's original publication about CRCs, in which
he proves that CRCs have four desirable properties. And I said that I would
study this thing that I've done in regard to these same four properties and see
how it does. I fussed around and finally developed some proofs about its
properties; they weren't as good as the CRC, but they were really pretty good.
It was pretty amazing, because I'd just come up with this thing haphazardly. It
wasn't like I had scientifically gone out on a search. So that more or less
convinced Danny Nessett.
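(Again for illustration, a minimal C sketch of the arithmetic check sum under
discussion, in the two-running-sum, modulo-255 form in which it is usually
quoted as the "Fletcher checksum"; only ordinary additions are involved:)

    #include <stdint.h>
    #include <stddef.h>

    /* Two running sums, each kept modulo 255; no bit-level operations needed. */
    uint16_t fletcher16(const uint8_t *msg, size_t len)
    {
        uint32_t sum1 = 0, sum2 = 0;
        for (size_t i = 0; i < len; i++) {
            sum1 = (sum1 + msg[i]) % 255;   /* plain sum of the bytes */
            sum2 = (sum2 + sum1) % 255;     /* sum of the running sums */
        }
        return (uint16_t)((sum2 << 8) | sum1);   /* the two check bytes */
    }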
But, then, having gone to this effort and written it up, I decided I would
try to publish it. I submitted this paper to some IEEE journal--I can't remember
which one right now. After a while I got back some rather negative referee
reports. What I'd hoped the referees would do was improve my proof, which
seemed to me to be very long-winded. I thought there must be a better way
to prove some of these things, and shorten the paper. But these guys hadn't looked at the mathematics in any serious way, it seemed to
me. One of them said, "You know, there's no need for this. We can do all this
with CRCs," completely ignoring the fact that one might not have the hardware or
the fact that the CRCs took longer. Another guy picked at some small point in
the proof, and he was just wrong. I forget what the third referee did.
Anyway, I thought, "Well, if I'm going to be thrown out, I shouldn't be
thrown out for these foolish reasons." So I wrote a rather lengthy
rejoinder, point by point, as to why the referees were wrong, and sent it back.
Then something like a year passed, and nothing! So I finally wrote the
journal, and I said, in effect, "Look--are you going to publish this thing or aren't you?" And almost immediately, by return mail, they said, "Yes,
we're going to publish it."
What I pieced together reading between the lines is that they'd somehow lost track of my paper! My letter, after a year, had drawn their
attention to the whole thing again--that I had sent back a rejoinder, and they'd
apparently ignored it. They were so embarrassed by the situation that
they just said they would go ahead. My feeling is that I got published just
because they'd made a mistake, and not because they were really convinced it was worthwhile.
So, the paper came out in, I think, a January issue. About May, I got a
letter from the same, or maybe it was another, journal to act as a referee
myself. They'd obviously picked me because the paper to be refereed referred to
the work I'd done that had been published in January. The paper made a little
modification to what I had done so that the check bytes were not at the end of the message but in the middle somewhere. And I thought,
boy, that is a strange thing to do. But their mathematics was correct; they'd
clearly done it. So, I essentially wrote back and said, I don't know why anybody
would want to do this, but they did it right.
No sooner had I done that than Dick Watson, who was perhaps Division
Leader--I can't remember whether this was the era when he was the Division
Leader, but he was one of us working on the network--came back from some kind of
meeting he'd been to in the east. I think it was in Washington, D.C. And he said
to me, "You know, your check sum is going to go into a national standard." And I
said, "A national standard for what?" He said, "For check sums of the headers
for some protocol." And I said, "Well, that's really strange, because that isn't
what this thing was intended for." It was intended to be a check sum of a whole
message, and would be an appended byte at the end of a message that was,
in general, of arbitrary length; whereas, of course, a header--at least this
header--was fixed length, and the bytes were to be embedded in it. But I
immediately recognized now what the refereed paper was about. Somebody had been converting my check sum into one for a header.
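(One way such an embedding can be done, sketched here purely as an illustration
rather than as the standard's actual procedure: with the two check-byte
positions zeroed, compute the running sums over the fixed-length header, then
solve for the two bytes that force both final sums to zero:)

    #include <stdint.h>
    #include <stddef.h>

    /* Choose check bytes at positions p and p+1 (0-based) of a fixed-length
       buffer so that both running sums come out to zero over the whole buffer.
       A byte b at index i contributes b to sum1 and (len - i) * b to sum2,
       so the two check bytes fall out of two linear congruences modulo 255. */
    void embed_check_bytes(uint8_t *buf, size_t len, size_t p)
    {
        uint32_t c0 = 0, c1 = 0;
        buf[p] = 0;
        buf[p + 1] = 0;
        for (size_t i = 0; i < len; i++) {
            c0 = (c0 + buf[i]) % 255;
            c1 = (c1 + c0) % 255;
        }
        uint32_t k1 = (uint32_t)((len - p - 1) % 255);
        uint32_t k2 = (uint32_t)((len - p) % 255);
        buf[p]     = (uint8_t)((k1 * c0 % 255 + 255 - c1) % 255);  /* X = (len-p-1)*C0 - C1 */
        buf[p + 1] = (uint8_t)((c1 + 255 - k2 * c0 % 255) % 255);  /* Y = C1 - (len-p)*C0  */
    }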
But then I said to Watson, "You know, it's really strange that they would use
my check sum when that isn't what it was intended for. Why do you think they did that?" And I've always remembered his answer. He said,
"Because it's written down." In other words, the way standards
happen is not because people sit down in the standards committees and think, "We
have a problem--what is the best way to solve it? Let's get some experts and
give them a grant," or anything like that. It's that they look around for
something that's already written up and say, "We'll take it. And if it's not
quite right, we'll modify it." That is just exactly what happened in this case.
GAM: I think that's precious!
JF: Yes, I was just really amazed at that. Right now, today--1994-1995 time
frame--I got a request, not to referee, but to review a paper. This was
from "Computing Reviews." And this paper is another idea for, again, doing
something in software--actually, the way the paper is written, it suggested
that you can do it in hardware as well--something better than a CRC. And it
mentions the so-called Fletcher check sum. (I had actually done more or less
unintentionally something that I'd always thought was a clever move: If you
invent something, don't give it a good name, and then people will refer
to it using your name. That's essentially what had happened.)
It turns out that this new paper is badly flawed, but it has a good idea.
It's just that the mathematics supporting the idea is all gummed up. So, in my
review I said that this paper has a very good idea in it, but it's full of so
many mistakes that you can't really appreciate that. And my last line of the
review was, "This paper cries out for a rewrite."
So, that's what I most recently have done--I've not rewritten the guy's
paper, but I've written a sort of follow-up paper for it that corrects its
mistakes, hopefully very cleverly hiding that I'm criticizing him too heavily.
So possibly now I'll have another publication in this field. It's currently at
the journal, waiting for a year, maybe.
GAM: That's great!
JF: It seems to me at the very beginning of this little session I said I
would have--
GAM: An anecdote about the 8X300 or whatever it was?
JF: Oh, yes. This is not really very relevant to anything, but the 8X300 was
being sold by Signetics at that time. Before we actually had an 8X300, we
had the manual. And the situation was such that I was starting to write the
programs for this machine that we didn't have.
I was working on it, and it became clear that one aspect of the operation of
this machine--the programmer's interface to it--was not clear. There were two
possible ways it could go. It seemed to say it went one way, but in thinking it
over, you would think it really should work the other way. It would be
much better.
So, I finally decided I would call Signetics and ask them how it really
worked. Well, it turns out nobody at Signetics knew, at least I couldn't
find anybody there, because Signetics had not developed the chip. Another
company named SMS had developed it, and then sold it to Signetics. Finally,
somebody at Signetics gave me the name of somebody at SMS. I'd been talking to
all these people who just knew nothing of what I was talking about--about
their own chip.
I called the guy at SMS, and it was just amazing. As soon as I started
talking to him, I thought, "Oh, yes, here's somebody who really knows what this
is all about." And there was a tricky way to compute the parity of an 8-bit byte
in three instructions on this machine, which I had already discovered. And he
mentioned it, you know, "Oh, yes! Yes! We both know that!" It was just
wonderful talking to him.
Then I asked him about the point I wanted to raise. And he told me, "Yes, the
way the manual says it is, is actually right." And I said, "Well, you know, that
creates some big problems for the assembler to figure out just what you're
doing." He said, "Yes, we had no end of trouble with that. You couldn't really
tell the assembler everything, and you had to keep filling it in yourself," and
so on. So it was just like talking to an old friend.
Actually, Gale Marshall's assembler had an added feature to take care of the
problem. I told Gale, "We need something in this assembler that--" You were
going to say?
GAM: You were talking to a kindred spirit. That's great.
JF: Yes, right. Anyway, that's the story of my foray into the American
National Standard. I eventually had to write a little squib that I guess appears
in some of the standard's supporting documents that explains how good and how bad
my check sum is.
In fact, a little more of the story: It turns out that a possible kind of
error that was never analyzed by Peterson or myself (because Peterson didn't
analyze it) is the idea of some bits just disappearing from the middle of
a message, or from the end--in other words, not being changed from 1 to 0 or
back, but just gone!
GAM: So that would get everything out of step.
JF: Right, yes. But it turns out that it is possible--well, I've never
analyzed this for a CRC, but the result might be about the same as for my check
sum--that if the bits drop out at a crucial place, it's undetected. If the right
number of bits drop out at a point when the check sum is more or less zero--if
you drop out a bunch of bits that would really have been zero but now are just missing, then you don't notice it.
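(A small made-up example of the flaw, again assuming the two-running-sum form:
if the first running sum happens to be zero at some point, zero bytes dropped
at that point leave the final check sum unchanged:)

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t fletcher16(const uint8_t *msg, size_t len)
    {
        uint32_t sum1 = 0, sum2 = 0;
        for (size_t i = 0; i < len; i++) {
            sum1 = (sum1 + msg[i]) % 255;
            sum2 = (sum2 + sum1) % 255;
        }
        return (uint16_t)((sum2 << 8) | sum1);
    }

    int main(void)
    {
        /* After 0x01 + 0xFE the first running sum is 255, which is 0 mod 255, */
        /* so the three zero bytes that follow add nothing to either sum.      */
        uint8_t full[]    = { 0x01, 0xFE, 0x00, 0x00, 0x00, 0x42, 0x17 };
        uint8_t dropped[] = { 0x01, 0xFE, 0x42, 0x17 };  /* same bytes with the zeros gone */

        printf("%04x %04x\n",
               fletcher16(full, sizeof full),
               fletcher16(dropped, sizeof dropped));     /* the two values come out equal */
        return 0;
    }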
The way this particular flaw came to light is that we were using this check
sum in our network, and there was a particular terminal that you couldn't always
log in from. You would try to log in, and the thing--I forget just how it
misbehaved--would not work right. And you'd have to try again. It was just this one terminal. The guy who had the terminal obviously didn't like
this situation. I don't remember just why it was that we really got interested
in it, because you'd think, well, you know, this guy's got a flaky piece of
hardware.
But I guess, when they replaced the hardware, eventually it became clear that
there was something wrong. The problem followed the number that the terminal
had. In other words, you could change the hardware, you could change the wires
and everything, but it was the number that was assigned to the terminal
that was important.
Well, it turned out that the number of the terminal appeared in the header of
the messages, and, depending on what the number was, when you got to a certain
point in a message the check sum would have a different value. And since
the numbers of the terminals pretty much ranged over all the possible values of
the 8-bit bytes, there would be one particular terminal number for which,
at a critical point in the message, the sum was zero. At that same critical point in the message, a flaw in some piece of hardware in the network
was such that it would just drop a byte out at that moment. So, for this one
terminal, the error would be undetected. If it were any other terminal, the
error would be detected and we would retransmit, and then it would work probably
the second time or the third time. But for this one terminal the error went undetected, and
therefore this message went God knows where on the network, because it was
flawed, and in any event didn't do the right thing.
GAM: Well, those are great things to discover; they make you feel good when
you find the trouble.
JF: Yes, right. That was one of the things I always liked about programming:
It was a kind of detective work.
I remember once we were having trouble with a PDP-8 or a PDP-11 concentrator.
It was attached to something called a selector line unit, which was a home-built
piece of equipment for interfacing the minicomputer to the mainframes, the
Control Data 6600 or 7600s. There was a problem that we were having with these
machines. There were several PDP-11s and several selector line units, and they
were all having this trouble, which was called "hanging the selector line unit."
That is the name that was used for the trouble, because when the trouble
happened, the lights on the selector line unit would just freeze up in a pattern
like nothing was happening. You'd have to restart the minicomputer. The
engineers were just driving themselves crazy trying to find this problem. I
would be called in, and I would study the thing in the light of this idea that
everybody said there was something wrong in the hardware that was causing this
hanging.
Then one day, driving to work, I said to myself, you know, I'm no good at
solving the problem as it's been described to me, because all I can do is tell
them what they can see--that nothing's happening. But if this were a software problem, I would be able to solve it myself. So I said, I'm
going to go to work with the idea that this isn't a hardware problem like
everybody tells me it is--it's a software problem. And I sat down, and in less than an hour, I found the bug in my own code that was causing the
trouble. It did nothing to the selector line unit. It just caused the computer
to go into a tight loop, and because of that, of course, the selector line unit
stopped doing anything, because nothing was telling it to do anything,
and it stopped.
This is a thing that, even today, I don't think I've really learned:
When people tell you they're having trouble, you shouldn't believe anything more
than their statement that they're having trouble. Any of their analysis about
what the trouble is, you should take with a grain of salt. It often, much too
often, turns out that it's not what they're telling you.
Another story along that line, as one thing leads to another, was at a time
when I'd gotten wary about the possibility that people would say one thing and
mean another. So, some guy called from way out somewhere in the Lab and said
that his terminal didn't work at all. He would type and nothing would happen. I
could think of several things that were possibly wrong, depending on exactly
what all those words meant. So I went over it very carefully. I said, "Now, let
me make clear that what you're saying is you press a key on the terminal, and absolutely nothing happens. Nothing appears on the paper." (This was a
long time ago.) "Nothing appears on the paper at all." "Yes." "You push a
key--nothing on the paper?" "Nothing at all." I said, "Okay, I'll be out."
I went out there, he showed me the terminal, and I tapped a key, like "a".
Bing--"a" comes up on the paper. And I said, "Oh look, something did come
up." He said, "Oh, I didn't think you meant that!" In other words,
the echo was not included among the "nothing at all." Of course, that
completely--I can't remember in what way now--changed the nature of the problem
as to what was wrong. And, in fact, in particular, I could have solved that
problem without coming all the way to where the guy was! So I was less than
happy.
It turns out that it's hard to get people who don't know the technical
aspects of a thing to tell you the important facts.
GAM: Well, I think really the problem is that an awful lot of people haven't
a deep enough awareness of the meanings in the English language. They just don't
know what the words mean!
JF: Yes, it's amazing. In fact, this reflects a philosophical belief of mine
that a lot of disagreements are merely over: What do the words mean?
GAM: You say something and the other guy hears something different.
JF: Yes, in fact, an interesting thing about that is what I call the "but you
said" syndrome. You talk to someone and there's a misunderstanding; and you talk
some more, and finally it's clear that both people now know what was said
and what the meaning was. And you think, now we can go ahead. But the other guy
in the conversation then says, "But you said--." And what he's trying to do now
is prove to you that what you said before doesn't mean what you
now both agree is the meaning. And I think, what a useless enterprise! You can't prove to someone that he meant something else--obviously the
same words meant something else to him. That's just the unfortunate way it is!
GAM: I have that experience lots of times when I deal with making--in
business, with e-mail. I send something that's perfectly clear, and the
guy doesn't read it! There's nothing you can do about that. You just have
to suffer it. That's the way it goes!
JF: Yes, you sometimes think people are more interested in words than
in meaning. Just before I retired, I got involved in a project with one
of these CRADAs (Cooperative Research And Development Agreements) that involved
a large computer firm from outside the Lab. We were doing this project according
to their way of doing things, which involved lots of paperwork. And I finally
just sort of dropped out--I knew I was going to retire, so I wasn't under any
strong compulsion to keep working on it. It didn't interest me, because there
was an endless--it seemed to me endless--sequence of meetings in which we would
write up one of the requirements documents or design documents or something, and
it just was wordsmithing. You know, how are we going to say this? It always
seemed to me, the important thing is: Do we all understand what it is
that we mean to do here? But the important thing to them was to get the
words on the paper.
GAM: Actually, that syndrome, to me, is indicative that there is a
non-technical person at the decision point. They don't understand what you're
saying, and they feel very unsafe, so they continue to ask for clarification.
JF: Yes, something along that line happened--I forget in what era, but Bob Lee was the
head of Comp at the time. Lee and a couple of other people, including me, went
to Clearwater, Florida. It was to some large computer company--I forget which
one now. Anyway, one of the things at that time was provably secure kernels. In
other words, the idea was that you were going to prove that something was
somehow safe in a security sense. So this company had been working on that, and
they were going to give us a talk about how they did it.
What it came down to was that they had written down the specifications of
what they wanted in some kind of formal language. Then they wrote down a
description of what they were doing--sort of a design thing--in some other
quasi-programming language. And then they had a methodology to prove that the
one met the other.
Now, there were all kinds of missing steps. For one thing, there were no
proofs that the design was correctly reflected in the actual program or that the
original requirements really were secure. It was just this little piece of it.
But they were very proud of it.
I remember bringing up the point at one time in the discussion. I said, "But
what about the people who are actually working on this thing--do they really think it's secure? I mean, aren't they expert enough to be able to figure
out for themselves whether what they're doing makes security sense or not?" And
the answer, of course, was, "Well, yes, of course, they think it's secure."
I said, "Well, why couldn't you then just skip all this formal stuff? Why
couldn't you send those experts to Washington and say, 'Look, we've put
together a system here, and in the consensus, or even the unanimous statement,
of all these experts working on it (or they could bring in some other outside
experts), it's secure.'" And the guy in Washington would say, "Well, if
all these experts agree to it, it must be so." And they'd say, "No, no. The
people in Washington do not think that's adequate. They want something more than
just the opinion of experts."
And I said, "Okay. So, what you're doing is you've written down the spec, and
you've written down this design, and you've written down this proof that the one
meets the other. You're going to take all that paper to this guy in Washington,
and he's going to look it over, and convince himself that you haven't made a
mistake!" And they said, "Oh, no. No. Because he isn't himself capable of
doing that. You know, this is not his field." I said, "Well, then, how is
he convinced?" And they said, "Well, you know, the experts tell him it
all works."
So, the expert can't just tell you it works, but he can tell you that
the proof that it works, works, and he is believed. So, funny things go
on, and the trouble is that you and I can sit here and laugh about it, and maybe
people who eventually read what we say or listen to the tape will laugh about
it, but it goes on!
Actually, people don't stop doing it, even though other people are laughing about it. It's interesting. I don't know whatever became of the
provably secure kernel. I don't hear about it anymore.
GAM: I think it went the way of all other wild ideas that don't have any real
basis in technology.
JF: Well, there have been all kinds of fads over the years, and they
eventually seem to fade away. And yet, of the people--which includes myself--who
all the time said, this is just a fad and it doesn't make sense, nobody ever
says, "Hey, those guys were right; maybe we'll believe them next time!" Instead,
the next fad comes along, and then you pooh-pooh it. But it has a life of its
own, and it doesn't go away.
GAM: That's always been the case. That's the way people are.
It's just--sad.
Note (added April 10, 2008): John's first interview and his recently unearthed
1983 video, "Computing at the Livermore Lab," are also available.