Capability Computing at LLNL

by Jed Donnelley
May 4, 2005


This document provides a brief overview of a little known but very rich thread of computing history at the Lawrence Livermore Laboratory (now LLNL), that of "Capability Computing." It focuses largely on the development of two operating systems, RATS (the RISOS ARPAnet Terminal System) [2] and NLTSS (the Network LTSS) [7]. However, it also touches on some other historical threads, including some network protocol development (most notably LINCS, the Livermore Interactive Network Communication System), and provides some basic background on the technology involved.

Since this document focuses on "Capability" computing, and since (sadly) the vast majority of information technology (IT) workers and even the majority of computer scientists are unfamiliar with what the term "Capability" means in the context of access control in computing systems, I'll define it here. A "Capability" is a computer representation of the authority to access a computer resource. A capability both identifies a resource and defines some authority for accessing that resource. For most people in the IT area the simplest resource to think of is a file, though it's important to note that the capability concept generalizes to the authority to access any resource: a process, a directory, a network port, the authority to print to a printer, the authority to write to a screen or read from a keyboard, and so on. For the case of a file, one capability might grant the authority to read the file and another capability might grant the authority to both read and write the file.
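To make the definition concrete, here is a minimal sketch in Python of what a capability pairs together. The names (Capability, Rights) are my illustration, not drawn from any of the systems discussed here:

    from dataclasses import dataclass
    from enum import Flag, auto

    class Rights(Flag):
        READ = auto()
        WRITE = auto()

    @dataclass(frozen=True)
    class Capability:
        resource_id: str   # identifies the resource (here, a file)
        rights: Rights     # the authority granted over that resource

    # Two distinct capabilities to the same file: one grants read only,
    # the other grants both read and write.
    read_cap = Capability("payroll.dat", Rights.READ)
    read_write_cap = Capability("payroll.dat", Rights.READ | Rights.WRITE)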

For most information technology people this concept is unfamiliar because it doesn't appear in most operating systems. Most operating systems, and all of today's commercially significant systems (Unix, Windows, VMS, TOPS-10 and Tenex/TOPS-20, OS/360, and so on), use an "ambient authority" model for granting authorities to running processes. In such ambient authority systems, when a program runs in a process, the process almost always has the authorities of a "user" - that is, the authorities of a person who has access to the system (an account on the system). In such systems programs run, in some sense, as a surrogate of the user, with all the authorities the user has. If the process runs as user Jed then it has user Jed's authorities (access to user Jed's files, directories, network access, etc.). If the process runs as user Root or user Administrator then the program has all authorities available on the system.

Capability based operating systems are quite different. In a capability based operating system processes run only with the authorities they are explicitly granted - as capabilities. For example, if a person wants to edit a file they run an editor program. In a traditional operating system the editor program has access to all the user's files. When the user commands the editor to begin editing a specific file, the editor can open the file because, in its position as proxy for the user, it has the authority to open any of the user's files. In a capability based system, when the editor runs it has access only to capabilities for the file, the keyboard, and the display. It doesn't run as the user, but on behalf of the user, with more limited authorities.
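A small sketch, again with illustrative names only, shows the difference in shape: the capability style editor is handed exactly the objects it may touch, so its parameter list is its entire authority.

    class FileCap:
        """An illustrative capability to one file."""
        def __init__(self, contents):
            self._contents = contents
        def read(self):
            return self._contents
        def write(self, data):
            self._contents = data

    class DisplayCap:
        """An illustrative capability to write to the user's display."""
        def show(self, text):
            print(text)

    def run_editor(file_cap, display_cap):
        # The editor can act only on what it was given: one file and one
        # display. It has no way even to name any other resource.
        display_cap.show(file_cap.read())
        file_cap.write(file_cap.read() + " (edited)")

    run_editor(FileCap("draft text"), DisplayCap())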

Why is this distinction important? The easiest way to understand the importance of the difference between having processes run as a user and having them run with a specific set of capabilities (authorities) is to consider the concerns about computer viruses. It is fairly common for people to receive programs in email or pick them up off the Web. A program might be a computer game, or a funny interactive Christmas card. It might be a video rendering or graphical display program. It might help people sort and manage databases of pictures or audio/music files, and so on. When a person runs such a program in a traditional operating system (e.g. Unix or Windows), the program runs with all the authorities of the user. It might actually be a "Trojan horse" and do more than was intended. It might destroy all the user's files, or look for sensitive information and send it out on the Internet, or do any number of other things that it has the authority to do by virtue of running as the user's proxy.

In a capability based operating system any program (e.g. one picked up in email or off the Web) can run in a process with just the authorities it needs to do what is requested by the user. This ability to restrict authorities to what is needed is commonly referred to as the Principle Of Least Authority, POLA. With programs running under POLA the damage that they might be able to do if they are untrustworthy is limited.

While the case of process initialization is important, the most general case that capability based operating systems deal with is communication between mutually suspicious processes. The way this works is commonly described these days with an ABC, or Alice, Bob, and Carol, diagram like the one shown on this Web page. Alice and Bob are processes that are mutually suspicious. Alice has an authority to a resource managed by Carol. Alice wishes to send a message to Bob (e.g. make a request of Bob) that grants Bob the authority that Alice had to the resource managed by Carol. For example, Carol could be a file server and Bob could be a remote editor: Alice sends Bob her "capability" to the file resource served by Carol so that Bob's editing service can operate on the file.
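A sketch of the ABC pattern under assumed names: Carol serves a resource, Alice holds a capability to it, and Alice's message to Bob carries that capability so that Bob can invoke the resource directly.

    class Carol:
        """A file server that hands out capabilities to its resource."""
        def __init__(self, contents):
            self._contents = contents
        def make_read_cap(self):
            # Here the capability is simply a closure: holding it is the
            # only way to reach the file contents.
            return lambda: self._contents

    class Bob:
        """A (mutually suspicious) editing service."""
        def handle_request(self, file_cap):
            # Bob exercises exactly the authority Alice delegated, nothing more.
            return file_cap().upper()

    carol = Carol("shared document")
    alice_cap = carol.make_read_cap()       # Alice's authority to Carol's file
    print(Bob().handle_request(alice_cap))  # Alice's message delegates it to Bob

Note that Bob never learns who Alice is or what else she can do; he receives one authority and nothing more.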

A common metaphor used to clarify the capability concept is to compare a capability to a physical key. If I want to have my car serviced, I take it to a repair shop and give the service person the key to the car. This gives the service person the authority to operate my car. When I ask for such a service I don't give the service person all my authorities as a person, e.g. my house key, keys to my safe deposit boxes, bank PINs and credit cards, etc. Yet this is what traditional operating systems (all of today's commercially significant operating systems) demand whenever a user requests a service. Capability based operating systems not only allow compartmentalized (POLA) running of applications, but also allow most parts of the system - all but the base capability communication mechanism - to be broken into domains of mutual suspicion. This naturally forms what's been referred to as a micro kernel architecture. With a capability based system, programming is object oriented at the interfaces between processes - domains with separate authorities.

With that introduction it's time to pick up the historical thread of capability computing and how it passed through LLNL. In 1965 a paper was published by Jack Dennis and Earl Van Horn titled "Programming Semantics for Multiprogrammed Computations" [1]. That paper is typically considered the seminal paper describing capabilities, capability computing, and capability communication. It describes what is now considered a traditional capability-list operating system, where each process has an associated protected list of descriptors, referred to as its capability list or c-list, that constitutes its authorities.
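A minimal sketch of the c-list idea, with invented names: the kernel holds a protected list of descriptors for each process, and user code can refer to its authorities only by index into that list.

    class Kernel:
        def __init__(self):
            self._clists = {}            # process id -> list of capabilities

        def grant(self, pid, capability):
            clist = self._clists.setdefault(pid, [])
            clist.append(capability)
            return len(clist) - 1        # the process sees only this index

        def invoke(self, pid, index, request):
            # The process names a capability by c-list index; the kernel
            # looks up the actual descriptor, which the process never holds.
            capability = self._clists[pid][index]
            return capability(request)

    kernel = Kernel()
    idx = kernel.grant(pid=1, capability=lambda req: f"file server handled {req!r}")
    print(kernel.invoke(pid=1, index=idx, request="read"))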

After that paper was written an operating system was designed and implemented following the design principles in that paper for a PDP-1 computer at the Massachusetts Institute of Technology. One of the many people who did programming work on that system was Charles Landau. Charlie and I both came to Livermore Laboratory in 1972 and joined an ARPA funded project called RISOS, Research Into the Security of Operating Systems.

The RISOS work is discussed some in my interview and elsewhere, e.g. [4]. Basically the idea of the RISOS project was to find security problems in operating systems and in doing so improve OS security. One aspect of this work was that RISOS had a computer to use for communicating with other computers. The computer that was purchased was a PDP-11/45. One thing a bit unusual about that system was that it was installed in a large trailer, with the idea that it could be trucked into a secure installation and connected to systems in such a physically secured facility in order to use the RISOS computer for testing. In addition to supporting telephone dial out/in and communication, this system needed to be connected to the early ARPA network. Charlie implemented the operating system for this computer, RATS [2]. It was natural that in doing so he would implement a system that was capability based and similar to the system on the PDP-1 at MIT. Beyond being of value in its role in communicating with other systems, there was the hope that this system could be an illustrative example of secure system design and implementation.

I was the technical liaison to the ARPA network community for our ARPA network node. In that role I was involved in the development of the early ARPA network protocols such as Telnet (for remote connection of interactive logon sessions), the File Transfer Protocol (FTP) and others. It also happened that I was involved in the security analysis of the Tenex operating system developed by Bolt Beranek and Newman (BBN). BBN was the prime contractor to ARPA for the development of the ARPA network interface nodes called Interface Message Processors or IMPs. The Tenex system had the most advanced network support in that time period, though Tenex fell into the category of a traditional "ambient authority" operating system (all user processes ran as a proxy for a human user with all that person's authorities).

One project that BBN worked on for Tenex was called the Resource Sharing Executive or RSExec [5]. The idea of the RSExec was to allow resources available on one Tenex system to be utilized by other Tenex systems across the ARPA network. My exposure to the ARPA network resource sharing protocols like Telnet, FTP, and RSExec, combined with my capability experience with RATS, started me thinking about a capability sharing protocol. The advantage that I saw in such a mechanism was that one could develop a protocol to share a generic resource as represented by a capability, and then have that one protocol suffice to share all resources on a system, since all such resources are accessed via capabilities. Doing so would eliminate the need for developing or supporting resource specific protocols like Telnet, FTP, and all the mechanisms in RSExec.
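A sketch of the point, under an invented wire format: because every resource is reached through a capability, a single generic request shape can replace the family of resource specific protocols.

    import json

    def make_request(capability_id, operation, payload=b""):
        # One generic message format, whatever the resource behind the
        # capability happens to be (file, process, printer, ...).
        return json.dumps({
            "capability": capability_id,
            "operation": operation,
            "payload": payload.hex(),
        }).encode()

    # The same protocol carries a file read and a process signal.
    file_read = make_request("cap:file:42", "read")
    proc_kill = make_request("cap:proc:7", "signal", b"\x09")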

When I started working on what eventually became the Distributed Capability Computing System (DCCS) [6], it was one of the few times in my life when I experienced what might be referred to as a "brainstorm". That was a situation where the more I worked on the DCCS protocol and implementation design, the more it seemed the obviously "right" basis for network resource sharing. I remember spending a long sleepless night where I put together all the basic mechanisms of the DCCS, e.g. as pictured in this diagram from that period:


The basic idea of the DCCS is to extend what has more recently been referred to as a transparent proxying mechanism (creating a new capability that works just like an existing capability but is supported by another "server" process) across a network. When it comes time to pass a capability from one process to another process on another system on the network, the extension mechanism saves the capability being passed locally and forwards identifying information to the remote system. The remote system then emulates the capability by accepting any requests on it, bundling each request up, and sending it across the network to the system that kept the real capability in a service list just for such requests. There are some interesting issues that must be solved as authorities to resources (network capabilities) are passed around between systems on the network [6], but it all seems to work out well.
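The mechanism can be sketched as follows (illustrative names, with the network modeled as direct calls): the exporting system parks the real capability in a service list and ships only an identifier, and the receiving system wraps that identifier in a proxy that forwards every request back.

    class SendingSystem:
        def __init__(self):
            self._service_list = {}      # id -> the real, local capability

        def export(self, capability):
            cap_id = len(self._service_list)
            self._service_list[cap_id] = capability
            return cap_id                # only this crosses the network

        def handle_remote_request(self, cap_id, request):
            # Requests arriving from remote proxies are replayed on the
            # real capability kept here in the service list.
            return self._service_list[cap_id](request)

    class RemoteProxy:
        """Looks just like a local capability, but forwards each request."""
        def __init__(self, home_system, cap_id):
            self._home, self._id = home_system, cap_id
        def __call__(self, request):
            return self._home.handle_remote_request(self._id, request)

    home = SendingSystem()
    cap_id = home.export(lambda req: f"served {req!r} locally")
    proxy = RemoteProxy(home, cap_id)    # lives on the remote system
    print(proxy("read"))                 # to its holder, indistinguishable
                                         # from the original capability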

After the DCCS mechanism was designed and published, we naturally considered implementing it using RATS as the base capability system and the ARPA network as the communication infrastructure. There was even an ARPA network RFC written and registered to this end [6]. Unfortunately, when we looked more carefully at the base RATS capabilities we found that one fundamentally important capability, the file capability, could not be transparently proxied and therefore could not be extended across the network using the DCCS mechanism. This was unfortunate in some respects, but it clarified for me how important the ability to do transparent proxying is, and prompted me to consider network extension as fundamental to any capability communication mechanism - what I now refer to as "network discipline."

The RISOS work, including the RATS and DCCS development, was done in relative isolation at the Livermore Laboratory. Independent of the RISOS work, a rich tradition and technical base of computer network (Octopus) and operating system (LTSS) development had been ongoing at LLL since at least the middle 1960s. By the time the RATS/DCCS work was winding down in the middle 1970s, there was some thought that Livermore could benefit from an update to the operating systems and networking protocols that had been developed to that time.

An early effort to update the Livermore operating systems was an OS design effort led by Charlie Landau. Not surprisingly, the programmer's manual that he produced for this MUTT (LTSS + (1,1,1,1)) system looked a lot like that for the RATS system. However, about that time Charlie left Livermore for Tymshare. There he, Norm Hardy, Bill Frantz, and others continued what might be considered a fork of this capability development thread and developed the GNOSIS and later the KeyKOS operating systems, descriptor based capability systems along the lines of RATS. Perhaps the history of those developments will be described elsewhere.

At Livermore the idea of revamping the operating system and networking protocols emerged again in about 1978. At that time I had just finished some local area networking research [7] and found myself involved in these discussions. After some preliminary discussions it fell out that John Fletcher, a fairly newly hired Dick Watson, and I ended up with the lead design roles for this effort.

One aspect that I remember most about these early design discussions was the clash between John Fletcher and me over the capabilities as descriptors model. I think it relevant to note that there was an independent "capability" thread that had been injected into Livermore some 10 years earlier. John Fletcher had read some of the early Multics [9] papers out of MIT (also from Project MAC, like the Dennis and Van Horn paper) and had picked up, adopted, and implemented a descriptor based directed graph directory structure for the Elephant storage system at Livermore. I still consider this architecture to have the most effective sharing mechanism that I've seen. To simplify a bit, each user had a "home" directory and a "Take" directory for receiving objects (files and directories in the Elephant system) from others. Every user also had access to what amounted to a large directory containing insert only access to the "Take" directories of all users.

With this scheme any user could create a new "group" directory for the purpose of sharing with others. By placing the new directory into the "Take" directories of a group of other users (e.g. placing a "group project" directory into the "Take" directories of Alice, Bob, and Carol), the creator could let anybody given the directory share any of its contents (e.g. shared access to files or other directories). One aspect of this system is that anybody who has access to such a shared directory can give it to anybody else - effectively expanding the group. This is very unlike the group mechanisms in systems like Unix or Windows or other "third generation" operating systems, where administrative (root) intervention is needed to create and support resource sharing groups.
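A minimal sketch of the scheme as described above, with invented names; the essential property is that Take directories are insert only, so any user can grant, and regrant, sharing without administrative help.

    class Directory(dict):
        """A directory maps names to objects (files or other directories)."""

    class User:
        def __init__(self, name):
            self.name = name
            self.home = Directory()
            self.take = Directory()      # others may insert here, not read

    # The system-wide map gives insert only access to everyone's Take.
    users = {n: User(n) for n in ("alice", "bob", "carol", "dave")}

    def insert_into_take(recipient_name, label, obj):
        users[recipient_name].take[label] = obj   # insert only operation

    # Dave creates a group directory and shares it; no root needed.
    project = Directory(plan="design notes")
    for member in ("alice", "bob"):
        insert_into_take(member, "group-project", project)

    # Alice can pass her reference on, effectively expanding the group.
    insert_into_take("carol", "group-project", users["alice"].take["group-project"])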

The debate that John Fletcher and I had would today be described as a debate over whether to implement capabilities (resource authorities) as system protected descriptors (think Unix file descriptors) or as data (more like a password, but with resource identifying and server address information encoded in the data). John argued persuasively that even if one implemented a descriptor based system for one computer, sharing resources on the network it was attached to (the network was the computer, after all) required a mechanism (e.g. like the DCCS) for essentially flattening the capabilities into some sort of "data" form. If such a mechanism could work for the network, why not for local capabilities within a single computer system, using an inter process communication mechanism?
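The "capability as data" side of that debate can be sketched as follows; the field layout and the use of an HMAC check are my illustration, not the format of NLTSS or any other system mentioned here.

    import hmac, hashlib, os

    SERVER_SECRET = os.urandom(32)       # known only to the resource server

    def mint_capability(server_addr, resource_id, rights):
        body = f"{server_addr}:{resource_id}:{rights}"
        check = hmac.new(SERVER_SECRET, body.encode(), hashlib.sha256).hexdigest()
        return f"{body}:{check}"         # ordinary data: storable, mailable

    def validate(capability):
        body, _, check = capability.rpartition(":")
        good = hmac.new(SERVER_SECRET, body.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(check, good)

    cap = mint_capability("fileserver-3", "payroll.dat", "r")
    assert validate(cap)     # any holder of the string holds the authority

Because the capability is just a string, it can be written to a file, sent in a message, or carried across a network with no kernel involvement - which is exactly what made it attractive for the networked case.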

We were not alone in having such a debate. There were a number of systems developed that used each type of capability. Capability as descriptor systems include the Dennis and Van Horn system, RATS, CMU Hydra, GNOSIS and KeyKOS, the more recent EROS, the LANL developed Demos system, CalTSS, Mach and others. This represents a rich heritage that is still ongoing to some extent today. Capability as data systems include NLTSS, Amoeba, and the Australian "Monash" systems and one or two others. Not quite as rich a tradition, but still with significant representation. Developments along these lines also continue today in mechanisms like YURLs and others.

In some respects, looking back, I feel that I failed to make the case effectively for capabilities as descriptors. One aspect of least authority that capabilities as data don't capture is the authority to communicate. That is, if a process has a capability to some resource (e.g. a file), then to exercise that authority it must be able to communicate with the file server. Following POLA reasoning, a process should only be able to communicate with the processes that service capabilities the subject process owns. This makes the capability system essentially support what today one might call a local firewall mechanism. However, with capabilities the restrictions on communication are much more natural and direct, because they arise from the need to exercise the authority to access resources that are explicitly granted under the POLA policy.
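A sketch of that restriction under assumed names: the kernel delivers a message only if the sender holds a capability served at the destination, so the "firewall" behavior falls directly out of POLA rather than being a separate policy.

    class Kernel:
        def __init__(self):
            self._owned = {}             # pid -> set of (server, resource) caps

        def grant(self, pid, server, resource):
            self._owned.setdefault(pid, set()).add((server, resource))

        def send(self, pid, server, message):
            # Deliverable only where the process holds some authority.
            if not any(s == server for s, _ in self._owned.get(pid, set())):
                raise PermissionError(f"process {pid} may not talk to {server}")
            return f"{server} got {message!r} from process {pid}"

    k = Kernel()
    k.grant(pid=1, server="fileserver", resource="log.txt")
    print(k.send(1, "fileserver", "read log.txt"))   # allowed
    # k.send(1, "mailserver", "exfiltrate")          # raises PermissionError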

While both John Fletcher and I understood this sort of restriction on communication from the descriptor based capability systems, such restrictions seemed to fly in the face of the nominal opportunity to communicate anywhere in computer networks. If computers on a network could communicate anywhere on the network, what sense did it make to restrict processes (virtual processors) within our operating system to only be able to communicate where they had explicit authorities to communicate? Today I think it is much easier to see the value in naturally restricting communication by processes within an operating system to POLA access just as their resource access is restricted.

In some ways it was lucky that John Fletcher won this argument, in that it greatly simplified the design of the NLTSS [10] system that we developed at LLNL. In fact it was that simplification that partly sold the pure message passing model to me. NLTSS had only one system call - a call that said "communicate" - where any number of send and receive buffers could be specified in the call. This simplicity at the lowest level of the system made possible some quite elegant mechanisms at higher levels (e.g. the separation of the transport and application level services, support for mixing multiprocessing with thread services that, uniquely in NLTSS, allowed blocking at the thread level - not the process level - for I/O, and other aspects of the system).
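The shape of that interface might be sketched like this; the Python modeling is mine, and only the single entry point taking lists of send and receive buffers reflects the description above.

    from dataclasses import dataclass

    @dataclass
    class SendBuf:
        destination: str      # where this buffer's message is bound
        data: bytes

    @dataclass
    class RecvBuf:
        source: str           # where a message is expected from
        data: bytes = b""     # filled in when the message arrives

    def communicate(send_buffers, receive_buffers):
        """The one and only 'system call': post sends, post receives."""
        for sb in send_buffers:
            print(f"queued {len(sb.data)} bytes for {sb.destination}")
        for rb in receive_buffers:
            rb.data = b"(reply)"         # stand-in for an arriving message

    # File opens, process creation, etc. are all just messages:
    communicate([SendBuf("fileserver", b"open log.txt")],
                [RecvBuf("fileserver")])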

I now believe that this base simplification could have been achieved even if the underlying message passing kernel knew just enough about capabilities to identify their network addresses and to validate the authority to communicate that they carried. Still, in those early days we didn't know this, and trying to incorporate such a mechanism into NLTSS at that time would have effectively blocked development. It wasn't until we later developed a public key model for protecting capabilities as data [8] that I believe such authority to communicate could have been effectively bundled with our capabilities as data that could be easily moved across the network.

Still, the weakness of the NLTSS system from a user perspective had nothing to do with any of these underlying issues. The problem it had was that the only user libraries available to it - and which it therefore had to support - were the libraries developed around the earlier LTSS system. Ultimately NLTSS, which had a much richer and more capable base resource model, ended up having to effectively emulate the older LTSS system - even to the extent of having processes run with the authorities of users (ambient authority). A glaring example of this restriction is the process model for the two systems. In the NLTSS system processes were simply resources protected with capabilities. Such capabilities could be stored in directories, communicated in messages, etc. However, the only user libraries supported at LLNL at the time were built around a simple "controller - controllee" model of processes in which a linear chain of processes (not even as rich as the Unix tree structure for processes) could send signals up and down the chain in a rather complex way. Naturally an NLTSS emulation of this structure was no more capable than the LTSS base implementation of the mechanism, and it was necessarily somewhat less efficient. The more flexible resource sharing that NLTSS offered (e.g. of processes, directories, and files, even over a network) was not available in the older LTSS system, and so didn't show up in the support libraries, making these facilities irrelevant to users. NLTSS ended up looking to users like a somewhat inefficiently implemented port of the older LTSS system.

This aspect of adapting to a set of existing user libraries might be referred to as the "tyranny of the API" and is seen quite clearly in the dominance of the Unix and Windows APIs and even in their relative similarities today. It's effectively impossible to introduce explicit resource sharing (POLA) into these APIs, so one instead sees awkward add-ons such as the mechanisms of SELinux (who knows what Microsoft will come up with) to try to address POLA concerns. Backward compatibility is the key, but it also makes any innovative POLA resource sharing mechanisms nearly impossible to introduce. I personally believe that the only route to acceptance for such mechanisms is to introduce them in an area (e.g. such as network resource sharing) outside the realm of the typical OS API and then have libraries gradually develop to incorporate these more flexible and powerful (e.g. for protection/security) facilities.

It was because of this "tyranny of the API" that the capability thread at LLNL was effectively submerged and disappeared when NLTSS (and LTSS) were displaced by Unix.


[1] Jack B. Dennis and Earl C. Van Horn, "Programming Semantics for Multiprogrammed Computations," Association for Computing Machinery Conference on Programming Languages and Pragmatics, San Dimas, California, August 1965.

[2] C. R. Landau, "The RATS Operating System," Lawrence Livermore Laboratory Report UCRL-77378, 1975.

[3] J. E. Donnelley, "DCAS - A Distributed Capability Access System," Lawrence Livermore Laboratory Report UCID-16903, August 1975.

[4] R. P. Abbott, J. E. Donnelley, et al., "Security Analysis and Enhancement of Computer Operating Systems," National Bureau of Standards Report NBSIR 76-1041, April 1976.

[5] B. P. Cosell, P. R. Johnson, et al., "An Operating System for Computer Resource Sharing," Proceedings of the Fifth Symposium on Operating System Principles, November 19-21, 1975. Also available online from the ACM.

[6] J. E. Donnelley, "A Distributed Capability Computing System," Proceedings of the Third International Conference on Computer Communication, August 1976, pp. 432-440. Also available at: http://www.webstart.com/jed/papers/DCCS/.

[7] J. E. Donnelley, "Components of a Network Operating System," Fourth Conference on Local Networks, Minneapolis, 1979. Also in Computer Networks 3 (1979) 389-399; also available at: http://www.computer-history.info/Page4.dir/pages/LTSS.NLTSS.dir/pages/Components.dir/index.htm.

[8] J. E. Donnelley, "Managing Domains in a Network Operating System," Proceedings of the Local Networks and Distributed Office Systems Conference, London, May 1981, pp. 345-361.

[9] R. C. Daley and J. B. Dennis, "Virtual Memory, Processes, and Sharing in Multics," Commun. ACM 11, 306-312, May 1968.

[10] An amusing aspect of the NLTSS saga is the name - Network LTSS, or Network Livermore Time Sharing System. This name started being used on a temporary basis while early design was underway, with the thought that we would choose a more permanent name if it looked like the system was going to be coded. When it became clear that we were going to put the system into production, we had a meeting (the whole LINCS/NLTSS design effort was very meeting heavy, with infamous "Friday afternoon meetings" that were knock-down, drag-out affairs) and chose the name "LINOS" for LINCS Operating System. LINCS was itself an acronym for the Livermore Interactive Network Communication System - the protocol set for the Livermore network and NLTSS. Unfortunately, by the time we got around to choosing this name, some budget documents had already found their way to Washington with the "NLTSS" name on them. We were told that it was too late to change the name.