Policy as a System Administration Tool

Elizabeth Zwicky
Steve Simmons
Ron Dalton

Abstract

All decisions about how to manage a given system are made with respect to local policy. This is true even in the absence of such policy, as the consistent actions of the system manager become de facto policy. This paper will discuss the interactions between policy and systems management. Using a series of case studies, we will illustrate two points: how proper policies can be used to ease the day-to-day tasks of systems administration; and how technical issues can and should be used as one of the driving forces in policy decisions.

Introduction

At first glance, policy is a political issue rather than a technical issue. But policy made without regard to technical issues is a recipe for organizational and administrative disaster. Conversely, letting technical considerations dictate policy is a recipe for political disaster.

Policies are often a major factor in technical decisions; for instance, a backup system cannot be satisfactorily designed without an existing policy about what files get backed up, by whom, and how often. Ohio State University’s Computer and Information Science department (OSU-CIS) and SRI International’s Information, Telecommunication and Automation Division (for brevity, called “SRI” throughout this document) back up roughly equivalent amounts of disk space, on the same types of machines, to the same sorts of tape drives, using programs written in the same language, and containing some of the same code. The programs are nevertheless completely different, because they implement very different policies.

To further complicate the situation, most systems are administered in a policy vacuum. The administrator may set de facto policies, but rarely will there be any formal recognition of those policies by management. This may at first seem depressing, but a policy vacuum, used properly, can be a means of easing system administration.

This paper will discuss a number of policies both formal and informal, the technical and administrative needs driving those policies, and the results of their application. Unless otherwise stated, no names have been changed to protect the innocent. The guilty are left anonymous.

Points of Contention

Policies are developed primarily because of conflicts between users and system administrators. Most system administrators have a sense that certain things are common sense (for instance, it seems intuitive to most that a single user should not have multiple accounts on the same system). It comes as a blow to discover that users do not usually share these intuitions. The following sections discuss some common points of conflict in large installations.

Independence vs. Service

Users often want or need to do eccentric things with their machines. By contrast, system administrators prefer that all the machines be as close to identical as possible, so that they are easier to manage. Policies in this area serve two purposes: they provide polite and relatively unarguable ways in which to say “Not on my network you don’t,” and they provide clear statements of the price a user must pay for independence. In general, users have to give up something in order to have full control, and they are made deeply miserable if they have their own control and do something stupid. Users need to have a good idea exactly what they give up and exactly what the consequences are.

Some relevant questions: Do users get root on their machines? Do they get disks on them? Do they get to modify the system software? Do they get to run other operating systems? Can they inflict any sort of machine they take a fancy to on you? If they do any of this, how do they get backups, operating system upgrades, mail, news, access to printers, access to networks? Who decides what the machines are named?

Case Study: The Untrustworthy Hosts

At the Industrial Technology Institute (ITI) there was (and still is) a fairly large laboratory devoted to implementing MAP/TOP protocols on UNIX systems. This required a great deal of device driver work, kernel builds and installs, and kernel-level debugging. This was not a problem for the systems administration staff because the researchers gladly gave up central support in return for unlimited root access (they later came to regret that, but that’s another paper). Relations between the two groups were reasonably cordial once areas of responsibility and authority had been decided. The administrative staff did the backups and co-ordinated hardware and software maintenance projects; the researchers added and deleted accounts, managed their own disk usage, etc.

Problems began once the project was completed. The systems had been purchased by the research group with research funds. They were on a private ethernet, not connected to the central network. The researchers were not anxious to give up their windowed dedicated development environment, particularly when this meant returning to ASCII terminals on an overloaded VAX 785. They wanted to connect their systems to the central network and work from the lab. They wanted to send and receive mail, read news, have access to the Internet, mount NFS partitions, and all the goodies one expects in a well-connected environment. But they refused to give up root access, justifiably citing the ongoing support tasks of the original project.

Describing lab systems as “insecure” would be an understatement. Many accounts did not have passwords; other accounts existed for users who had left years before. Given we had been at least brushed by previous break-in attempts and the Morris worm, there was a great deal of resistance to allowing unlimited connectivity.

The policy decision made was largely driven by technical issues, and with very little management involvement. The issues:

External Security

It was decided to attach the lab machines to the network, but not provide any external routing (we use static routing internally, including our gateway). This permitted the lab machines to be attached but not be accessible from (or to) the Internet. This effectively removed the issue of mail: they could do it only if they developed the sendmail expertise to forward everything to a trusted host and faked the return addresses via a hidden net. News works fine through NNTP, and in this case the service was carefully configured to hide the laboratory hosts.

Internal Security

Entries in host tables and domain name service were made to identify the hosts on the trusted network, but those hosts were not placed in /etc/hosts.equiv on the trusted systems. Thus they could not rlogin, rsh, etc., without supplying a password. While this was an inconvenience, most users quickly came to terms with it. Now that the users have discovered .rhosts files (and use them in spite of requests not to) the appropriate changes are being made to no longer allow .rhosts to override hosts.equiv. For a broader solution to the same problem, see [Harrison].
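The discovery of .rhosts files suggests a simple audit. The following is a minimal sketch in Python, assuming that home directories sit directly under a single root directory; the function name and that layout are illustrative assumptions, not the procedure used at ITI.

```python
import os


def find_rhosts(home_root):
    """Return sorted paths of .rhosts files found directly inside
    each home directory under home_root (an assumed layout)."""
    found = []
    for name in sorted(os.listdir(home_root)):
        candidate = os.path.join(home_root, name, ".rhosts")
        if os.path.isfile(candidate):
            found.append(candidate)
    return found
```

A report like this only identifies the files; what happens to them afterwards remains a policy decision.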

Improved Security

The users still desire NFS mounts, access to Internet, etc. They were understanding of the security needs, and requested a technical solution that would permit it. We proposed and they accepted the use of cops [Farmer] as a security check to validate their security. When all their systems pass a cops audit, they will be added to the trusted hosts.
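As an illustration of the kind of check an automated security audit can make, here is a minimal Python sketch that flags one of the problems described earlier: accounts with no password. It assumes the traditional colon-separated passwd format with the password in the second field. It is not a description of how cops itself works, and the function name is hypothetical.

```python
def accounts_without_passwords(passwd_text):
    """Return login names whose password field (the second
    colon-separated field) is empty."""
    names = []
    for line in passwd_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and NIS-style "+"/"-" entries.
        if not line or line[0] in "#+-":
            continue
        fields = line.split(":")
        if len(fields) >= 2 and fields[1] == "":
            names.append(fields[0])
    return names
```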

Results and Re-Evaluation

In this case purely technical issues drove the creation of a policy. In every request we were able to provide both a technical reason for a policy and technical means that would permit modification of the policy. (In retrospect it was actually a benefit to have been touched by previous security problems; they convinced the user community that security was a real issue.)

This policy has eased the integration of new computers into our network. As workstations and PC-based UNIXes appear on desktops, the policy developed for laboratory machines has been extended to apply to desktop systems. Having a policy in place made it much simpler to deal with objections. In the case of users who wished to fight the policy, we invited them to form a committee and make a policy acceptable to all. This being an impossible task, the users have thus far yielded to the inevitable.

Case Study: SRI

Because of SRI’s somewhat baroque financial arrangements, it is very clear which machines are and are not maintained by the staff; if we charge you an hourly fee to use your machine, it’s a facility machine. You can, of course, refuse to pay us, in which case the machine is your own; we can also refuse to accept your machine as a part of the facility, if it is not like the rest of our machines. If you do not pay an hourly fee, you pay on a time and materials basis, for a minimum of half an hour, every time we do anything at all for your machine.

For practical reasons we need to offer some services to non-facility machines. (We own all of the networks and all of the printers.) On the other hand, the money we charge pays our salaries; we can’t afford to offer all of our services to people who aren’t paying us. Furthermore, it is unfortunately easy for poor configuration on a non-facility machine to make life unlivable for facility machines.

Our compromise has been to allow non-facility machines to connect to the network, charging only for required hardware, and to register them in our name servers. In return, they are required to register all networks and hosts with us, and to configure their machines so as not to interfere with network operations. Hosts that are not well-behaved are disconnected from the ethernet, without any particular attempt at kindness. Hosts that can manage either to speak directly to our Ethernet-based Imagens, or to speak to a Berkeley line printer daemon, get printer access (in the latter case, via a special printer equivalence file, not hosts.equiv).

Other services are available at an hourly charge for the time we spend providing them, with other restrictions as needed. For instance, we will provide backups for hosts; we require control of root privileges on machines that we need to trust for this purpose, we charge for the hours required to set the system up (on a modern Sun running a modern SunOS, this is our minimum half-hour charge; on other machines it may run to 20 or 40 hours, especially if they are non-UNIX machines for which we have to devise new backup systems), and we charge on a weekly basis for the labor involved in running and monitoring the backups (usually half an hour a week). We do not attempt to charge for media, and we do not charge for restores, so long as they are infrequent.

The result is that there is usually considerable monetary advantage to a project in turning over machines to us if we are willing to take them. The hourly fee works out to much less than people normally end up paying us for assistance, especially if they want to be reasonably integrated with the rest of the division. For machines that are capable of using facility services like NIS (previously YP) service, NFS mounting of file systems, and so on, there is an uncomfortably large grey area. Hosts that we trust because of backups end up being able to avail themselves of services that do not require human intervention without being charged for them. So far, this has always worked itself out, if only because hardware support contracts are also covered by the facility; projects living in the grey area usually find that hardware repairs alone make it more economical to come into the facility all the way.

Security vs. Ease of Use

The ITI case study above shows a second common point of conflict; users want to be able to do anything they want to without trouble, but they also want to be safe from malicious others. It is left to the system administrators to provide the security. There is obviously a large technical component to this, but there is also a major political component.

Rules that are technically uncomplicated, like rules mandating that passwords must be changed regularly, or that users cannot share accounts, or cannot have root access, turn out to be emotionally complex. (A user once explained at length in a public meeting that he was too eminent a professor to be required to change from the password he had always used - which had just been broken in the first pass of an automated password tester.)

Security concerns are relatively easy to get management support on; security violations are highly visible in the media, and the technical issues surrounding passwords are easily understood. (At a commercial site, the argument that competitors could exploit security holes to gain access to internal information is extremely effective.) Password changing policies can be approved at a high level, and then implemented impartially in software. Shared accounts can be replaced with groups, usually with minimal resistance.
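Implementing a password-changing policy impartially in software can be as simple as applying one rule to every account. A minimal sketch in Python, assuming a record of the date each password was last changed is available; the data structure and function name are hypothetical.

```python
from datetime import timedelta


def passwords_due(last_changed, max_age_days, today):
    """Return the sorted logins whose password is older than
    max_age_days. last_changed maps login -> datetime.date."""
    limit = timedelta(days=max_age_days)
    return sorted(
        user for user, changed in last_changed.items()
        if today - changed > limit
    )
```

The same limit applies to every login in the mapping, however eminent its owner.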

Root passwords, however, remain a point of contention. Some people actually need them; some people sincerely but incorrectly believe that they need them; and some people want them just as a sort of merit badge, to indicate that they are powerful and competent. Some sites have had success in discouraging people in the latter two classes by giving out root access to machines conditionally; one favorite is a site which requires all people with root access to wear beepers so that they can be summoned to fix the machines when they break.

Some relevant questions: What rules are there about choosing and changing passwords and how are they enforced? Can multiple users share a single account? What does it take to make a machine trusted? Can users have .rhosts files?

Resource Utilization

It is a recognized law of computing that usage will increase to consume all the available resources; what appears one day to be endless amounts of free disk space turns out to be barely enough on the next. Furthermore, there are cases where resources can be temporarily monopolized, even when they are generally in ample supply. Printers are usually the victims of this syndrome; there’s plenty of printing capability, until the day someone prints out accept/reject letters for an entire conference from an automated script, and puts over 200 jobs in one print queue. One such occurrence is enough to produce large numbers of users who want Something Done.

Disk space, printer pages, and CPU cycles are the three most commonly abused computing resources. There are systems for accounting for all of them; these systems differ widely from one machine to another, but are more or less uniformly unsatisfying. Most of them simply report the usage, and let you try to figure out what to do about it. Even those that do apply restrictions need to be told which restrictions to apply. Any way you look at it, it turns out to be almost a pure policy decision.

For disk space and printer pages there are two common methods: assign an allocation and cut people off when they go above it, or charge per-page or per-kilobyte in either real or imaginary money. Methods that impose quotas may be impractical, since users with a critical need may run over the quota when nobody is available to restore service to them. It is also tricky to determine where quotas should be set. Quotas need to be high enough so that users do not normally exceed them; on the other hand, they should be low enough so that if people do reach their quotas they do not exceed the available resources. We have never actually seen a system that reliably met both these goals. Instead, quotas are usually positioned where 90 percent of the users fall into them, and resources are allocated so that problems are rare in practice, disregarding the possible results of all users using up their quotas at the same time.

Money-based methods, even if they are based on imaginary money, tend to bring out the worst in users. Many become paranoid about getting charged correctly. Since accurate charging is difficult, system administrators may find themselves spending large amounts of time fixing accounting systems which do not really reflect costs. Users also spend a great deal of time and energy questioning the basic accounting structure in hopes of changing it to their benefit.
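The tension in positioning quotas can be made concrete with a little arithmetic. The sketch below, in Python, picks the quota that covers a given percentage of current users and then reports how far the total promised quota exceeds real capacity. The nearest-rank percentile method, the function names, and the sample figures are all assumptions chosen for illustration.

```python
def quota_at_percentile(usages, percentile=90):
    """Smallest observed usage that covers at least `percentile`
    percent of users (nearest-rank method, integer arithmetic)."""
    ordered = sorted(usages)
    rank = max(1, -(-percentile * len(ordered) // 100))  # ceiling
    return ordered[rank - 1]


def oversubscription(quota, user_count, capacity):
    """Ratio of total promised quota to real capacity. A value
    above 1.0 means not everyone could fill their quota at once."""
    return quota * user_count / capacity


usage_mb = [10, 20, 30, 40, 50, 60, 70, 80, 90, 500]  # invented figures
quota = quota_at_percentile(usage_mb)                  # 90
ratio = oversubscription(quota, len(usage_mb), 600)    # 1.5
```

With these invented figures, nine of ten users fit under the quota, yet the disk would overflow if all ten filled it, which is the situation the text describes.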

Informal systems can be quite effective. For instance, OSU-CIS controlled disk space usage effectively for some years by simply publicizing the usage statistics for the top 10 users on any partition that got too full. As long as the largest users on a partition are not also the most powerful, peer pressure is very effective. (The system adopted after that became impractical is detailed in [Zwicky].) Similarly, if you track pages printed, you can deal individually with excessive users.
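A report of the largest users on a partition takes very little code. The following Python sketch assumes per-user usage figures (here in kilobytes) have already been gathered by some other means; the function names and the report layout are hypothetical and are not the program used at OSU-CIS.

```python
def top_users(usage_by_user, n=10):
    """Return the n largest (user, usage) pairs, biggest first,
    with ties broken alphabetically by user name."""
    ranked = sorted(usage_by_user.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def format_report(partition, usage_by_user, n=10):
    """Build a plain-text report suitable for posting or mailing."""
    lines = [f"Top {n} users on {partition}:"]
    for user, kb in top_users(usage_by_user, n):
        lines.append(f"{user:<12}{kb:>10} KB")
    return "\n".join(lines)
```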

Some relevant questions: How much disk space do users get, and what happens when they overflow it? How many pages, at what time of day, on what printer, constitutes fair printer usage? What can you do on someone else’s workstation? Who has priority use on public workstations? On a multi-user system, how much of the machine’s capacity can you use for what?

Accounts

At first glance, there seem to be relatively few issues about accounts, aside from the security issues discussed above. However, in a multiple-machine environment, there are considerable difficulties in deciding who gets accounts on what machines, as well as the technical problems in reconciling accounts between machines that interact with each other. Technical solutions are a dime a dozen, and come in three forms: network user database services like Sun’s Network Information Service (NIS, formerly called YP) or Project Athena’s Hesiod; services that provide unique and consistent user ids for a site, which are then used as administrators wish on individual machines; and systems that reconcile password files between machines as users are added (for instance, the one described below).

Some relevant questions: Who gets accounts on which machines? When do accounts expire? What do you have to do to get an account? What are accounts named? Can a single user get more than one account on the same machine? Can multiple users share a single account?

Case Study: ITI

In the past, unofficial policy was to grant user accounts only on the systems needed by the individual user. This kept down the total number of accounts, and made dealing with loosely connected systems easier.

As technology progressed, this became more and more of a problem. Cross-mounting NFS systems between hosts with disjoint passwd files was a nightmare. Having a user home cross-mounted between systems was difficult due to different setups on different systems.

Over the course of time, a user’s needs would change. Accounts once required on one system became inactive, while new accounts were required elsewhere.

We also make extensive use of PC-NFS. The initial installation dedicated a Sun file server to PC-NFS usage, while requiring users to have other accounts on other systems. Disk space crunch quickly made this infeasible, and PC-NFS-mounted directories became intermixed with user home directories. As our user community and our use of PC-NFS became more sophisticated, this became a bottleneck. It also led to such bizarre circumstances as users ftping files from their home directories to their PC-NFS directories when both were in the same partition.

In addition, each of our central systems is quite different. Vendor and resource constraints limited us in trying to make them identical; expensive third-party software that only ran on one system made it inevitable that the systems would be different.

The Solution

Briefly, we decided to adopt a rule of “one user, one uid, one home directory.” To avoid problems of disjoint access to systems, we decided to change the policy on systems so that all users had access to all central UNIX systems (MIS systems are an exception). This had to be done without use of yellow pages (highly insecure, and not available on all systems) or Hesiod (some systems could not easily be retrofitted). In addition, to defeat previous break-in attempts we were running custom login programs with shadow passwords. We were forced to continue with flat files.

Mass implementation would be a nightmare; we didn’t even attempt it. Instead we went to a sliding implementation.

All new users were immediately added to a central system. A variant of the new user script was written expressly for the purpose of duplicating a user entry from one system to another. The new user script was run to create the user, then the duplicator was run on all other systems. This gave a common home, login name, uid/gid, and common initial password on all systems.

Reconciling the old users was (is) a stepwise process. The machines which were the primaries are gradually being removed from service. As each user is moved to the new systems, his account is cloned. If the uid and login id were unique, they are carried over. If not, new ones are assigned. However formed, the new account is then distributed to all systems. When the old system is decommissioned, conflicts with old uids and login names become irrelevant.
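The carry-over rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ITI script: it replaces only the conflicting value (login or uid), invents replacement logins by appending a number, and draws new uids starting from a hypothetical base of 1000. The text does not say how ITI chose new values.

```python
def migrate_account(login, uid, used_logins, used_uids, next_uid=1000):
    """Carry login and uid over if unused on the new systems;
    otherwise assign replacements. Updates both sets in place
    and returns the (login, uid) pair finally assigned."""
    new_login = login
    suffix = 2
    while new_login in used_logins:
        new_login = f"{login}{suffix}"  # assumed naming scheme
        suffix += 1
    new_uid = uid
    if new_uid in used_uids:
        new_uid = next_uid              # assumed allocation base
        while new_uid in used_uids:
            new_uid += 1
    used_logins.add(new_login)
    used_uids.add(new_uid)
    return new_login, new_uid
```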

Results and Retrospective

The change of policy was justified to management by claiming it would simultaneously decrease administrative cost while increasing user access. This process is still continuing as of this writing; it is expected to be complete by presentation of this paper. The preliminary results are bearing out our estimate.

Requests for accounts on other systems have dropped to almost zero, and will vanish when implementation is complete. This has not only reduced our unplanned administrative tasks, but has also eliminated the problem of duplicated disk space, resulting in more available disk without purchasing additional spindles.

Reconciling system setups was daunting but doable; we’re quite proud of the design, implementation, and result of this reconciliation. Previously, giving a user an account on a new system immediately led to a flurry of phone calls on what was different where; these have been greatly reduced. At some small per-user cost in loading initial accounts, we have eliminated a great deal of ongoing support. The time and effort expended in designing system-sensitive user initialization files is quickly being paid back.

Without the change in policy, these savings would not have been realized.

Case Study: OSU-CIS

Originally, OSU-CIS maintained a single password file for all workstations that would run both NIS and NFS, ensuring that each user had one account and one home directory. Machines that did not run NIS each had individual password files; user numbers were distinguished by giving each password file a unique range of IDs, and giving an account the first unused ID in the range for the machine it was first installed on. Accounts on the individual machines were given to faculty on request; students had to get a faculty member’s signature to get accounts. The machines that had individual password files were primarily the CPU-intense machines (a selection of Pyramids, a BBN Butterfly, and an Encore Multimax). In fact, the Pyramids all used the same password file, distributed from a central machine via rcp by cron. The Sun servers were not YP clients, and had password files with only staff members in them. To complicate matters, while some undergraduates had permanent accounts, most were given accounts only when they were taking classes; approximately 1500 of these temporary accounts were created at the beginning of every quarter, and deleted at the end of the quarter.
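Giving each password file its own range of IDs and handing out the first unused one is easy to express in code. A minimal Python sketch follows; the function name is hypothetical, and the actual ranges used at OSU-CIS are not given in the text.

```python
def first_unused_uid(uid_range, used_uids):
    """Return the lowest ID in uid_range that is not already in
    used_uids, or raise ValueError if the range is exhausted."""
    for uid in uid_range:
        if uid not in used_uids:
            return uid
    raise ValueError("no free IDs left in this machine's range")
```

For example, with an invented range of 2000 to 2999 for one machine, `first_unused_uid(range(2000, 3000), {2000, 2001, 2003})` yields 2002. Because the ranges do not overlap, an ID also identifies the machine on which the account was first installed.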

In order to manage this, OSU-CIS developed two account installation programs (both originally written primarily by Chris Lott). One of them, for regular accounts, allowed you to enter the information about a single user; it then polled each machine which had a password file to determine whether the user already had an account and, if so, told you the user name and number. You were free to override this, especially since the program might find multiple accounts (usually because of multiple users with similar names, but sometimes because somebody made a mistake). The other one read a tape, produced by the university’s registration system, and created a single account for each student on it. The registration tape contained university ID numbers, which allowed that program to be completely certain which were duplicate entries for the same student, and which were entries for different students with the same name. Since this information was not available for existing users, there was no attempt to avoid giving the same student both a regular and a temporary account.
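The value of a unique university ID is easy to see in a sketch. The Python below assumes the registration data has already been read into (university ID, name) pairs; the record layout and function names are hypothetical, and the real tape format is not described in the text.

```python
from collections import defaultdict


def one_account_per_student(records):
    """Collapse duplicate registration entries. records is an
    iterable of (university_id, name); returns {id: name}."""
    accounts = {}
    for student_id, name in records:
        accounts.setdefault(student_id, name)  # first entry wins
    return accounts


def shared_names(accounts):
    """Return {name: [ids]} for names held by more than one
    student, i.e. cases where name matching alone would fail."""
    by_name = defaultdict(list)
    for student_id, name in accounts.items():
        by_name[name].append(student_id)
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}
```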

This system, while workable, was inconvenient: limitations on root privileges meant that system administrators tended to deal with user files from the file servers, where the users did not have accounts, so that all the files were shown by numeric ID; mail could not be delivered to students on the central department machines, since those were the CPU-intense machines that the students didn’t have accounts on; and password files tended to slowly diverge from each other, as administrators made “temporary” changes. Furthermore, maintaining the password files on the servers became burdensome as the number of servers increased from 1 to 14, and the number of people who needed access went from 8 to approximately 30. On the other hand, there was no interest in changing the fundamental policies about access; giving the world at large access to either the CPU-intense machines or the servers was obviously undesirable. (The per-quarter accounts were a temporary expedient, due to be replaced by a user database allowing undergraduates to have accounts for the duration of their time as CIS majors at OSU.)

Client NIS was enabled on the Sun servers, as well as the clients, but instead of simply pulling in all accounts, two separate lines were added. One pulled in all the accounts for systems staff members, using a netgroup. The other pulled in all remaining accounts, overriding the passwords and the shells. A modified version of su, created by Paul Placeway, allowed systems staff to su to users without forking the user’s shell. Thus, the systems staff could not only see real names on files, they could also run as users on machines that the users could not log in on.

The password files were reconciled with a perl program, written by J Greely, which took each password file in turn, and added lines for the users that were present in the other files but absent in it, with a dummy password and shell, ignoring system accounts. It ran once a night, from cron.
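The reconciliation step can be sketched as follows. This is a Python illustration of the behaviour described, not a translation of the perl program: it assumes seven-field passwd lines, treats any uid below a hypothetical threshold of 100 as a system account, and uses "*" and /bin/false as the dummy password and shell. The text does not say which values were actually used.

```python
def reconcile(passwd_files, min_uid=100,
              dummy_password="*", dummy_shell="/bin/false"):
    """passwd_files maps host -> list of passwd lines. Returns a
    new mapping in which each host also lists the non-system
    users found only on other hosts, with login disabled."""
    parsed = {
        host: [f for f in (line.split(":") for line in lines) if len(f) == 7]
        for host, lines in passwd_files.items()
    }
    # Every non-system account seen anywhere; first sighting wins.
    everyone = {}
    for fields_list in parsed.values():
        for f in fields_list:
            if f[2].isdigit() and int(f[2]) >= min_uid:
                everyone.setdefault(f[0], f)
    result = {}
    for host, lines in passwd_files.items():
        present = {f[0] for f in parsed[host]}
        extra = []
        for name in sorted(everyone):
            if name not in present:
                f = list(everyone[name])
                f[1], f[6] = dummy_password, dummy_shell
                extra.append(":".join(f))
        result[host] = list(lines) + extra
    return result
```

The added entries give every file the same mapping from user name to numeric ID without granting anyone new login access.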

Software Support

As systems are used, they accumulate more and more software. This has to be installed, upgraded to new versions as they become available, ported to new machines as they become available, fixed when bugs are noticed, and explained to users. If software is allowed to accumulate at the whim of users, the tasks involved in supporting it rapidly take over.

Some relevant questions: Which programs can you expect the staff to fix for you, and how soon? When can you expect to get help, and from whom?

Case Study: SRI

Over the years, SRI’s machines had gathered an immense amount of software in /usr/local; we were providing support for any program anybody had ever asked for or purchased. As we moved from VAXes and Sun-3s to SparcStations, we were being asked to port all of this software and continue its support. Some of these programs had no locatable source code; others would not compile; some we objected to on basically aesthetic grounds; and others were simply the third or fourth program to do the same thing. We rebelled, and refused to invest our time in porting four SunView clocks to SparcStations. We then found ourselves embroiled in a political argument.

We developed an 8-page list of software, which we are in the process of publishing to the division. It details exactly what we are willing to support in formal terms, and carries the approval of three levels of management. The list itself is bound to be controversial, but it will get all the arguments over at once. It will minimize users trickling into our office for months, claiming that their lives are incomplete without a really good digital clock for SunView.

Our list currently divides software into 7 categories:

Fully supported: Fully supported tools are considered necessary for day to day life. If they become unavailable, restoring them is first priority. With the noted exceptions, they are available under all versions of the operating system, on all hardware platforms. They are upgraded to new versions regularly, and they are supported by multiple people on systems staff.
Partially supported: These are considered useful, but not essential. If they become unavailable, some priority is given to restoring them. We attempt to make them available on all versions of the operating system and on all hardware platforms. They are upgraded to new versions as time permits, and are supported by at least one person on systems staff.
Available but unsupported: These tools have been installed on some machines. They may not be available on all operating systems or hardware platforms. If they stop working, they may never be fixed. They are unlikely to be upgraded to new versions. Support for them may be unavailable from systems staff.
Under evaluation: A small number of licenses are available for evaluation purposes, or as part of a beta-test program. These programs are supported by at least one person on systems staff, but may disappear without warning. They should under no circumstances be used for important or long-term work.
Supported in future: These packages are not yet available, but we are in the process of purchasing and/or installing them.
Supported during transition: These packages are supported because they are still in use, but have been replaced. Users who are already using them are encouraged to move to a fully supported option, and new users should choose a fully supported option. However, those users still relying on transition programs will receive full support as far as is possible.
Completely unsupported: We do not believe that these packages are currently available on our systems. They will not be made available in the future, and any copies that may have escaped our notice are not supported. This category includes programs that we have previously supported, but which are no longer available, and programs which have been evaluated and rejected.

The list is divided up into rough categories (window systems, programming languages and tools, text editors, and so on). Most categories simply include all the programs in the category, sorted by support levels. In some cases, we found it useful to add extra information about what we do and don’t support. For instance, we have discovered that people assume that we will lovingly preserve any changes they make to the disks on the supposedly dataless workstations on their desks. Our opinion on the subject is not really repeatable in polite company, so we added a paragraph explaining which changes to a workstation we would and would not preserve.
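A support list of this shape maps naturally onto a small data structure. The Python sketch below encodes the seven support levels in the order given above and sorts the programs in one category by level. Treating that order as the sort order is an assumption, and the program names in the usage note are placeholders, not entries from the actual SRI list.

```python
from enum import IntEnum


class Support(IntEnum):
    """The seven support categories, in the order listed above."""
    FULLY_SUPPORTED = 1
    PARTIALLY_SUPPORTED = 2
    AVAILABLE_UNSUPPORTED = 3
    UNDER_EVALUATION = 4
    SUPPORTED_IN_FUTURE = 5
    SUPPORTED_DURING_TRANSITION = 6
    COMPLETELY_UNSUPPORTED = 7


def sorted_category(programs):
    """programs maps program name -> Support. Returns the names
    ordered by support level, then alphabetically."""
    return sorted(programs, key=lambda name: (programs[name], name))
```

Given a hypothetical category `{"editor-b": Support.AVAILABLE_UNSUPPORTED, "editor-a": Support.FULLY_SUPPORTED}`, `sorted_category` lists editor-a before editor-b.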

We also discovered that there were bitmapped backgrounds installed in system space that were not repeatable in polite company either. Since all opinions on the subject of nude and semi-nude backgrounds can be classified as fascist, sexist, or both, we made a blanket decision that we would not install or provide support for images that didn’t come with operating system or window system releases, and added that to the support list.

Ideally, this support list should be accompanied by a policy that states who gets to control the list, a question we have so far managed to finesse. The list reflects primarily the opinions of the people who were willing to spend the time compiling and editing it. The process was considerably simplified by having management who understand that it is advantageous to limit the number of programs supported, and to move to new technology as it becomes available. This makes them unsympathetic to users who claim that we need to port the Rand “e” editor to SparcStations “for backwards compatibility”.

In the case of programs that must be purchased, there is an unofficial policy that multiple choices will be evaluated by the staff and the users, and a final decision will be made by a group of the primary users for the program, and approved by the people who spend the money. After the public evaluation period, people who object to the choice can simply be informed that they should have spoken up when we asked them to, and that it is now too late. This procedure has been used in the last several major software purchases, and has been quite successful. Our major problem was restraining enthusiastic users who wanted to buy the first program that they tested.

Case Study: ITI

In moving from a loosely coupled to a tightly integrated environment, one immediate problem was differences in utilities from system to system. Our heavily populated VAXes were loaded with things from users, from USENET, and from unknown sources. In order to make users mobile between the systems, we had to somehow deal with these differences.

Licensed software was not a difficult issue. This usually came in object form only, with restrictions that it could only be used on a given system. (We actually have very little software that has restrictive licenses.) Users who wished to have some licensed utility on another system were asked to justify the cost of obtaining it for the other system; on learning the cost of same the user usually dropped the request. Other custom but non-sharable items like databases were distributed so as to be closest to their user communities.

Most difficult was the wealth of software that had shown up in /usr/local/bin over the years. This actually became another case of turning a problem into a policy for preventing problems. Previous administrators had been lax in such areas as documenting and archiving these utilities. They had also been fairly firm about not letting users put things into /usr/local/bin. Starting with the installation of a new central system, we established a policy that all programs to be installed must include source. This ensured that it was at least minimally possible to provide a program in other environments.

Programs were broken into four categories: vendor-supported (i.e., came with the system), ITI-supported (such as MIS systems), ITI-installed, and user-installed. The last two categories are almost identical, the only difference being whether the program came because the systems staff thought it was useful or because a user supplied it. Neither of the last two categories is really supported, although for ITI-installed programs the systems staff agrees to at least look at problems and consider fixes. User-installed programs are the responsibility of the user donating the program. If the user leaves, the program either becomes orphaned, gets adopted by another user, or (if sufficiently popular) gets adopted by the systems staff.
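The four categories amount to a small routing table for problem reports. The following Python sketch is one way to write that down; it is an illustration under assumptions, not ITI's tooling. The names `Category` and `route_problem`, and the strings returned, are hypothetical.

```python
from enum import Enum


class Category(Enum):
    """The four support categories named in the text."""
    VENDOR_SUPPORTED = "vendor-supported"
    ITI_SUPPORTED = "ITI-supported"
    ITI_INSTALLED = "ITI-installed"
    USER_INSTALLED = "user-installed"


def route_problem(category, donor=None):
    """Say who hears about a problem with a program (hypothetical helper).

    donor is the login of the user responsible for a user-installed
    program, or None if that user has left and nobody adopted it.
    """
    if category is Category.VENDOR_SUPPORTED:
        return "vendor"
    if category is Category.ITI_SUPPORTED:
        return "systems staff"
    if category is Category.ITI_INSTALLED:
        # Not really supported, but the staff will at least look.
        return "systems staff (best effort)"
    # User-installed: the donating user, or nobody if orphaned.
    return donor if donor is not None else "orphaned"
```

For example, `route_problem(Category.USER_INSTALLED, donor="alice")` refers the question to `alice`, which is the behaviour that prompts the "Never mind" responses described in the next paragraph.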

This last policy has had an interesting effect on programs from users. Previously we had a regular series of requests that amounted to “Gee, I found this neat program. Would you install it?” Now that we say “Yes, but we’ll refer questions and problems to you” the response is often “Never mind.”

Changing Technology

As time goes by, new computers, operating systems, and programs become available. Usually, the new technology fixes things that were broken before; without exception, it breaks things that worked fine before. Users are usually split between the people who want the newest thing, today, and expect the systems staff to figure out how to work around the bugs, and the people who never want to change anything. The staff has to hold out until the technology becomes reasonably usable, and then has to pry the remaining users off the old technology when it becomes unusable.

Case Study: SRI

SRI is in the process of moving from being based on SunOS 3.5 running on Sun 3s to being based on SunOS 4.1 running on Sparc machines. The process has actually been simplified by changing hardware and software at the same time; the users find it logical that the software should be different on different hardware platforms, for one thing.

Initially, we converted the staff to Sparc, starting with SunOS 4.0 Beta. We declined to move users to the new OS until 4.0.3 was released, at which point we moved a few servers’ worth of Sun 3 clients whose users either wanted the new operating system, or were purely administrative and did not care which operating system they were running under. We introduced SparcStations as 4.1 Beta came out; users were told that they could have SparcStations running the Beta software, or no SparcStations at all, and quite a few took the deal. When 4.1 was released, we began the move in earnest.

We purchased a Sparc server, and Sparc upgrades for two of our eleven servers. We brought up the new Sparc server, and freed up one of the existing servers by moving clients to other servers, or changing them to dataless SparcStations and moving the relevant home directories to the new server. We then upgraded this server, and moved the clients and home directories from three of the remaining old servers onto it. Two of those servers were decommissioned completely, and their disks re-used on the remaining one. We took advantage of the complete change to make the hardware and software layouts on the servers more consistent, which involved re-using most of the racks as well. (The CPUs and the remaining odd-sized racks will be used to upgrade remote sites running on older Sun 3 hardware.) When the third new server was brought up, we moved the clients and home directories from most of the remaining servers onto it, and decommissioned them. Of the remaining machines, one is a staff server, one is a dedicated database server, and one holds the remaining programmers who have projects that cannot be moved to 4.1. Because conditions have changed since we started, the original server needs to be re-configured before we can move the last three clients off the last machine scheduled to be decommissioned, but the move is otherwise complete.

The results have been quite satisfactory. Since each machine ran in parallel with the machines it was replacing for a few days, we were able to go back and fix things that we had failed to move correctly the first time. We were able to introduce some minor changes that increased consistency and security as part of the global change. The users actually have found the change smooth enough so that they occasionally forget it happened, and call us up to ask why they can’t log into machines that no longer exist. We did discover some odd side effects of decommissioning central file servers while leaving most of the systems running; mysterious performance problems cropped up, which were eventually traced to machines that were desperately trying to arp for servers that had ceased to exist weeks before. These problems had to be traced by watching the network, since the machines in question had all been reconfigured for the new configuration, but not rebooted.
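The kind of network watching described above can be sketched as a simple tally. The Python below is a minimal illustration, not the procedure SRI used: it assumes ARP requests have already been captured by some tool and reduced to (requester, target) pairs, and the function name and sample addresses are hypothetical.

```python
from collections import Counter


def stale_arp_requesters(requests, decommissioned):
    """Count ARP requests aimed at servers that no longer exist.

    requests       -- iterable of (requester, target) address pairs,
                      assumed to come from a packet capture
    decommissioned -- set of addresses of retired servers
    Returns (requester, count) pairs, busiest requester first.
    """
    counts = Counter()
    for requester, target in requests:
        if target in decommissioned:
            counts[requester] += 1
    return counts.most_common()


# Hypothetical capture: ws1 and ws2 still look for a retired server.
observed = [
    ("ws1", "oldserver"),
    ("ws2", "oldserver"),
    ("ws1", "oldserver"),
    ("ws3", "newserver"),
]
for host, count in stale_arp_requesters(observed, {"oldserver"}):
    print(host, count)
```

The machines that show up in such a list are the ones that were reconfigured but never rebooted, and are therefore the candidates for a reboot.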

Enforcing Commonality

The single biggest headache in administering a network of systems is trying to remember the differences from system to system. The obvious solution is to reduce those differences. While this cannot be completely done, enforcement of several simple policies can greatly improve consistency.

Case Study: ITI

As mentioned above, we have established conformance of logins and aliases across networked systems at a given site. In the past it was practice to divide the user community across the various systems to maximize load balancing. This resulted in a nightmare of administrative activity to keep everything “straight” on the various platforms. By mandating that all users will have ids on all systems, we have reduced this problem somewhat.

We soon expect to be automated to the point of having master user id/password and aliases files that get distributed to the other systems when updated. A new adduser script has been written not only to generate the ids, initial password, home directories and the other normal functions of such a script, but also to distribute the created entries to the other connected systems.
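The core of such a script is allocating an id once, in the master file, so that every system receives the same entry. The following Python sketch shows that step only, under assumptions: the function names, the minimum uid of 100, the default shell, and the locked password field are all hypothetical, and the distribution step is left out because the text does not say how the files are copied.

```python
def next_free_uid(passwd_lines, minimum=100):
    """Return the lowest unused uid at or above minimum."""
    used = set()
    for line in passwd_lines:
        fields = line.strip().split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            used.add(int(fields[2]))
    uid = minimum
    while uid in used:
        uid += 1
    return uid


def add_user(master_lines, login, gid, full_name, home, shell="/bin/csh"):
    """Return a new master list with one entry appended.

    The entry uses the standard seven-field passwd layout; the
    password field is "*" (locked) until an initial password is set.
    """
    for line in master_lines:
        if line.split(":")[0] == login:
            raise ValueError(f"login {login!r} already exists")
    uid = next_free_uid(master_lines)
    entry = f"{login}:*:{uid}:{gid}:{full_name}:{home}:{shell}"
    return master_lines + [entry]
```

Because the uid is chosen from the master file rather than from any one machine's local file, the same login maps to the same uid everywhere, which is what shared home directories over NFS require.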

With common accounts, the next step was to force common NFS layouts. We adopted the /home style for user accounts, such as seen in Sun 4.0.0 and other more recent UNIXes. Each partition on a given system is named after the system and a number, 1 through n. In all cases, an entire partition is given over to a home area. A standard /home directory is present on all systems, with the mount points being /home/systemN. This ensures that all homes are identical on all systems. Enforcement is almost trivial, as it is much simpler for systems to comply with the policy than to use some other method. This also has the benefit that no matter what system one views the network from, the configuration is identical.
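The naming rule can be illustrated with a short Python sketch. The system names, partition counts, and helper names here are hypothetical, and the fstab-style lines show a generic NFS form rather than ITI's actual configuration.

```python
def home_mounts(partitions_per_system):
    """List every /home/systemN mount point implied by the rule.

    partitions_per_system maps a system name to the number of home
    partitions on that system (illustrative data only).
    """
    mounts = []
    for system in sorted(partitions_per_system):
        for n in range(1, partitions_per_system[system] + 1):
            mounts.append(f"/home/{system}{n}")
    return mounts


def remote_home_entries(local_system, partitions_per_system):
    """Sketch the NFS entries one system needs for everyone else's homes."""
    lines = []
    for system in sorted(partitions_per_system):
        if system == local_system:
            continue  # local partitions are mounted directly, not via NFS
        for n in range(1, partitions_per_system[system] + 1):
            path = f"/home/{system}{n}"
            # Same path on server and client, so every system sees
            # an identical /home tree.
            lines.append(f"{system}:{path} {path} nfs rw 0 0")
    return lines
```

With `{"vax": 2, "sun": 1}` as input, every system ends up with `/home/sun1`, `/home/vax1`, and `/home/vax2`, regardless of which of them are local.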

With common IDs and homes, there must be a standardized method for delivering mail. Each user has a home system, defined at this site as the system upon which the user receives electronic mail. While this is usually where the user performs day-to-day activities, we do not require this. Delivery is managed by a master alias file which is distributed to all systems and then automatically localized as needed on the individual systems. These common aliases make mail delivery easy to manage. They also have the curious benefit of allowing an administrator to quickly find the home system for a given user by looking at the alias file.
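One plausible reading of "localized" is that each system forwards mail for users whose home system is elsewhere and delivers the rest locally. The Python sketch below illustrates that reading only; the paper does not describe the actual mechanism, and the function names, sample users, and the `user: user@host` alias form are assumptions.

```python
def localize_aliases(master, local_system):
    """Build the alias lines for one system from the master map.

    master maps each login to that user's home system, i.e. the
    system on which the user receives mail.
    """
    lines = []
    for user in sorted(master):
        home = master[user]
        if home == local_system:
            continue  # delivered locally, so no forwarding alias
        lines.append(f"{user}: {user}@{home}")
    return lines


def home_system(master, user):
    """Look up where a user reads mail (None if unknown)."""
    return master.get(user)


# Hypothetical master map for three users on two systems.
master = {"alice": "vax1", "bob": "sun", "carol": "vax1"}
for line in localize_aliases(master, "sun"):
    print(line)
```

The `home_system` lookup is the "curious benefit" mentioned above: the same master data that drives delivery also answers the question of where a given user lives.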

These policies have proven quite useful across a variety of system types. Our current systems include two DEC VAX 11/785 systems running BSD 4.3, one Encore (nee Gould) PowerNode 6040, a SUN 3/160 file server, a DECSystem 5810 running Ultrix, and a number of PC-based and other small UNIXes. In spite of our best efforts there are system differences, but standardizing disk configurations and user ids has greatly reduced the systems administration burden. While this has not made our site any more user-friendly, it has made it less user-hostile.

Selling The Policies

How does one go about establishing policies such as those discussed above? Most of the time it is a matter of simply stating them as policy, and the great bulk of the users simply follow along. In many cases the users simply don’t care or are willing to put up with minor inconveniences (especially temporary ones) if they are assured of better (faster, more understandable, less surprising) systems as a result.

Does management care? If such policies are presented as improvements in user or administrator productivity, management usually eagerly approves. But be prepared to back up the proposed policies with facts, and don’t exaggerate the benefits. The policies discussed here will not make a system administrator 1000 percent more productive (though it often seems so to us once things are in place). State reasonable numbers that can be expected. Management loves to hear of 10 and 20 percent productivity gains, but is usually skeptical of 50 and 100 percent.

Be prepared to show that such improvements did occur. Not only will positive and truthful results increase your credibility, and thereby allow management to give you obscene raises in salary, but they really will make your life as a system administrator much more comfortable.

References

Daniel Farmer and Eugene H. Spafford, “The COPS Security Checker System” Proceedings of the Summer USENIX Conference, pp. 165-170.

Helen E. Harrison and Tim Seaver, “Enhancements to 4.3BSD Network Commands” Proceedings of the Workshop on Large Installation Systems Administration III, pp. 49-52.

Bud Hovell, “System Administration Policies” UNIX REVIEW, March 1990, pp. 28-39.

Elizabeth Zwicky, “Disk Space Management Without Quotas” Proceedings of the Summer USENIX Conference, pp. 41-44.

About the Authors

Elizabeth Zwicky

At the time this paper was written, Elizabeth Zwicky was a system administrator for the Information, Telecommunications, and Automation Division of SRI International in Menlo Park, California. She is working on compiling a perfect record as the speaker with the smallest number of slides at every LISA conference. She is currently with Silicon Graphics.

Steve Simmons

Steve Simmons is a graduate of the University of Michigan, and has done UNIX-based development at Bell Northern Research, Schlumberger Technologies, and ADP Network Services. At the time this paper was written he was the UNIX systems manager at the Industrial Technology Institute and a consultant. His publications include music, humor, essays, and software. He has published no intentional fiction.

He is currently running his own consultancy.

Ron Dalton

Ron Dalton is a graduate of Ohio State University, and has a long career as a systems and software development manager at ITT and Schlumberger. At the time this paper was written he was a systems and MIS manager at the Industrial Technology Institute. He is currently with Libby-Owens-Corning in Toledo, Ohio.
