From jrussell@gpo.gov Mon Feb 5 11:06:25 1996
Date: Sun, 4 Feb 1996 01:18:36 -0800
From: "Judith C. Russell"
Reply to: Discussion of Government Document Issues
To: Multiple recipients of list GOVDOC-L
Subject: FDLP Study: Task 8D: OTA Web Site

STUDY TO IDENTIFY MEASURES NECESSARY FOR A SUCCESSFUL TRANSITION TO A MORE ELECTRONIC FEDERAL DEPOSITORY LIBRARY PROGRAM (FDLP)

PRELIMINARY REPORT: TASK 8D: OTA WEB SITE

As part of the Study, a task force examined the issues that must be addressed when an agency no longer makes electronic information dissemination products and services available at its Web site, and the site contains information that needs to remain available to the public through the FDLP and/or be transferred to the National Archives and Records Administration (NARA). This task force was led by Fynnette Eaton and Tom Brown, NARA Center for Electronic Records.

This preliminary report of the task force is being made available for review and comment. Comments should be submitted by Friday, February 16, 1996, by internet e-mail to study@gpo.gov, by fax to FDLP Study at (202) 512-1262, or by mail to FDLP Study, Mail Stop SDE, U.S. Government Printing Office, Washington, DC 20401.

*****************************************************************

TASK 8D: Identify issues that must be addressed when an agency no longer makes electronic information dissemination products and services available at its Web site, and the site contains information that needs to remain available to the public through the Federal Depository Library Program (FDLP) and/or be transferred to the National Archives and Records Administration (NARA).

BACKGROUND

The use of Web sites as a means to disseminate information is becoming increasingly common among Government agencies. It is also likely that agencies will begin to use their Web sites to distribute information not available in any other format.
These Web sites are in essence forms of publication and therefore may be Federal records as defined by 44 USC 3301. However, the ease with which these sites can be established and modified creates problems for both the Government Printing Office (GPO) and the National Archives and Records Administration (NARA), which share an interest in identifying and preserving the valuable information on these Web sites.

GPO and NARA have dissimilar, but complementary, goals to assure public access for the full life cycle of this information. GPO must address measures that ensure continued short-term access (5 years minimum) for much of the information on the Web sites. NARA focuses narrowly on that portion of the information which has historic value, and its goal is to assure long-term access (indefinitely) to that information. Records schedules can serve as a tool for identifying these sites, but GPO and NARA will have to work together to create ways in which information can be transferred without added burden to publishing agencies.

Issues concerning short- and long-term access to information on agency Web sites were brought to the forefront by the closing of the Office of Technology Assessment (OTA) on September 29, 1995. OTA's Web site, OTA Online, included a catalog of all the reports produced by OTA from 1972 to 1995, ASCII text files of the 1994 reports, and ASCII as well as Adobe Portable Document Format (PDF) texts of the 1995 reports. The 1995 reports include some reports that will not be formally published.

OTA made arrangements to mount information from OTA Online on GPO's Web site. The final transfer to GPO will be sometime in February 1996. Since November 1, 1995, the OTA Web site also has been mirrored by the National Academy of Sciences and the Woodrow Wilson School of Public and International Affairs at Princeton University.

OTA also has a contract to scan the texts of all its reports dating back to 1972 to PDF format.
The PDF files will be packaged along with much of the information available via OTA Online, and some additional historical material, on a set of five CD-ROMs. The CD-ROM collection will be sold through GPO.

FEDERAL DEPOSITORY LIBRARY DISTRIBUTION

Most of the OTA information available in electronic format is available in other formats through the FDLP. The only exceptions are the reports and/or summaries that are still being completed and will not be formally published.

DISSEMINATION ALTERNATIVES

Alternative A

GPO will mount the information from OTA on its own Web site for depository library access. When available, both ASCII and PDF files will be offered. The CD-ROM collection of OTA reports will be distributed to depository libraries upon completion.

Benefits

Public access to the information is maintained through the FDLP.

A variety of methods are available for accessing OTA information.

More depository libraries are equipped to use CD-ROMs than have Web access for the public.

Disadvantages/Problems

Some OTA information is distributed to depository libraries in three different formats: paper, CD-ROM, and online through the GPO Web site. This is not consistent with the Transition Plan for the FDLP which proposes eliminating all dual distribution.

GPO incurs additional costs for maintaining the information on its Web site. OTA is responsible only for the costs related to the initial mounting of the information.

Reports that have been scanned are not entirely searchable. Although the reports will be scanned using Adobe Acrobat Capture, which will convert them to machine readable form, non-recognizable portions will be retained as images. In addition, due to time constraints, the scanned reports will not be reviewed.

PDF is software dependent and therefore not an acceptable format for long term retention.

Alternative B

The OTA CD-ROMs would be distributed to depository libraries.
After a predetermined period of time, OTA information will be removed from the GPO Web site.

Benefits

Public access to the information is maintained through the FDLP.

More depository libraries are equipped to use CD-ROMs than have Web access for the public.

Dual distribution in electronic format is eliminated.

Disadvantages/Problems

Scanned reports contain non-searchable portions and are not reviewed.

The CD-ROMs cannot be archived because they use the PDF software-dependent format. [See above.]

Public access to the reports is available only at depository libraries, although, as mentioned, there are two other private Web sites that will be providing this information for at least a period of time.

ISSUES TO BE ADDRESSED (FDLP)

Archival Responsibilities

GPO will coordinate with NARA to transfer to NARA electronic information which no longer warrants maintenance for the FDLP. If GPO places agency data on a server and makes it available via GPO Access, then the data becomes part of GPO records and GPO will be responsible for its disposition (or transfer) to NARA. If an agency has maintained electronic Government information and GPO points to the information for the FDLP, it will be the legal responsibility of the individual agency to transfer this information to NARA.

GPO and NARA will need to determine whether statutory changes are needed to clarify each agency's respective roles and responsibilities for extended access and preservation of electronic information dissemination products and services.

Life-Cycle of Electronic Information Dissemination Products and Services

GPO and NARA will need to define a life-cycle for electronic information dissemination products and services, beginning with the original document as an electronic file and ending with its disposition. It is NARA's responsibility to determine whether an electronic information dissemination product warrants permanent retention or no longer warrants continued preservation by the Government.
In accordance with the goal of providing extended access, GPO will assume such costs as data preparation for mounting, maintenance, storage, and ongoing costs to minimize deterioration and assure technological currency.

Format Standards

GPO plans to receive electronic information provided by agencies in any format. However, GPO needs to address the prospect of determining a small number of "recommended standard formats" for agency information, prior to receipt. Also, GPO will need to develop standards for formats of data that have been received and need to be mounted on GPO Access for public availability.

It is anticipated that certain electronic source files provided to GPO by agencies will not readily lend themselves to GPO Access in their original formats. Steps may need to be taken to make information received in these types of formats more suitable for extended access.

GPO will offer this information to NARA once it is determined that usage no longer warrants maintaining the information at a GPO authorized site. This does not imply that GPO will assume the responsibility of converting this information for NARA if the file format used for extended access through GPO Access is not suitable for the preservation requirements of NARA. It is expected that GPO may have electronic information whose usage no longer warrants maintenance but which NARA will not accept because of its file format. GPO and NARA must seek to coordinate their efforts to assure that format standards used by GPO for extended access to electronic information can be converted easily to formats acceptable to NARA.

Software Dependent Information

Some electronic information dissemination products and services produced by agencies in particular formats (such as certain types of spreadsheet files) are embedded with file structures that only have intrinsic value when used with particular software. If this information is converted to another generic format, such as ASCII, it loses value for the user.
This poses a concern for GPO, which will need to make this information available via GPO Access, and NARA, which currently will not accept electronic information that is software dependent.

ARCHIVAL BACKGROUND

The OTA Web Site contains two main types of information: 1) Organizational Structure and Members, and 2) Publications. The organizational structure and the lists of Technology Assessment Board (TAB) and Technology Assessment Advisory Council (TAAC) members can be found in the annual reports of OTA, which are scheduled for permanent retention under N1-444-94-1. Additional information on the members' work with OTA is scheduled as permanent in TAB/TAAC Member Files. The site also contains information on ongoing projects (moot issue), how to contact the staff, different online methods of obtaining publications, and links to other government sites.

All of the information in the OTA Web Site has been scheduled in a variety of different records covered by different items in the schedule. However, the schedule does not directly apply to the OTA Web Site. The OTA Web Site can be viewed as another "publication" used by OTA to disseminate information. The existence of the Web Site, as well as its content, provides evidence of the image OTA wanted to portray to the public and the work it accomplished. Even though the information exists, in bits and pieces, among the records of OTA (records covered by the schedule), by bringing this information together, and "packaging" it in a different way, OTA has created a different record that is not covered in the schedule. Thus, the OTA Web Site should be scheduled as an item under the office that manages and maintains the Web Site.

In FY 1995, the National Archives, Center for Electronic Records, scheduled and appraised the ASCII text files of the 1994 and 1995 reports (N1-444-94-1).
These ASCII files were appraised as temporary because they do not contain the graphs, charts, and photographs which are integral to the publication, thus diminishing their value.

At present, the Center for Electronic Records will not accession files that are dependent on any specific software package. This is referred to as software dependence. This precludes the Center from accessioning the reports produced using Adobe software. For these reasons, NARA has chosen to maintain the print formats of all the reports produced by OTA. However, NARA will accession the ASCII text file for the Catalog of Publications, 1972-1995 (N1-444-96-1). This file is used to upload the Catalog onto the OTA Web Site.

In the case of OTA electronic information, NARA will accession only the ASCII file used to create the Catalog of Publications, 1972-1995. Since OTA is able to send the file in the software independent format specified in 36 CFR 1228.188, OTA will transfer the file directly to NARA, Center for Electronic Records.

NARA also will receive electronic versions of the OTA reports in three different formats: ASCII, Hypertext Markup Language (HTML), and PDF. These files will not be accessioned by NARA, but will be used to examine technical issues of the different formats. However, NARA may retain for a limited time the HTML and/or PDF format as an extra copy for convenience of reference.

HTML files are essentially ASCII files that contain text which is "tagged" using a standardized language. HTML was created as a standardized way to format documents so that they could be read and interpreted by a variety of different computer platforms. These commands are written using ASCII characters. Any word processing software package can be used to tag a document with HTML commands. However, there are software packages which were developed to "markup" documents with HTML commands. If a tagged document is printed out, the HTML commands are visible along with the text of the document.
Therefore, these files are software independent and can be treated as ASCII files. If needed, PDF files also can be converted to ASCII.

Despite the fact that all these files are or can be transferred into software independent files, the original reports contain graphics, which cannot be software independent. PDF files contain graphics and the HTML files contain links to graphics. That is, the graphics "reside" elsewhere, not in the tagged document.

APPRAISAL CONSIDERATIONS

What information is in the Home Page, and which files (and addresses) does it link to? What is the structure/"hierarchy" of the Site?

There is a distinction between a Home Page and a Web Site. A Home Page is the first "page" of a site. It usually contains an introduction or welcome statement. This Home Page provides links to other pages. There are two main types of links: a) links to other files (pages) in the same location, and b) links to other Web sites. A Web Site can be described as the sum of a Home Page and all the files that are linked to it. It is important to determine which file is the Home Page and trace how other pages are linked to the Home Page and to each other. The structure of the page can provide evidence as to what the agency feels its primary mission is and how it wants to portray itself to the general public.

Need to determine criteria/"draw lines" to limit the "links" that will be appraised.

In appraising a Site it is necessary to examine the Home Page and the files that are linked. However, the links to other sites should be appraised with the records of the agencies that maintain those sites. If there is a link to a site which maintains information for the site being appraised, and the agency (of the records being appraised) is responsible for the content, then that particular link should be considered for appraisal. This does not mean that a whole new site is to be appraised along with the first site.
A precedent for this can be found in N1-149-95-1P, Item 20.8, VAX Client Server, memo from NSXA to NIR dated January 9, 1995: "[Electronic Photocomposition Division (EPD)] uploads the publications, which they receive on tape or disk. EPD is not responsible for the creation or content of the publications. The individual agencies that send the publications to be uploaded into the system are responsible for all the data and information. For these reasons, the files in the VAX Client Server should not be appraised as GPO records..."

Which files within a site should be accessioned? Do all the files need to be brought in? Is it adequate to simply document that a particular link contained certain information which can be obtained among the other records of the agency? If there are links to other sites, should the name of each site and the agency which maintained it be documented?

The determination of specific files in a Web Site that should be accessioned and which links should be documented or appraised must be done on a case by case basis.

APPRAISAL ALTERNATIVES

Alternative A

Accession the records of the persons or committees responsible for maintaining the Web Site. The records of these persons or committees should reflect the content and structure of the site. In fact, these files serve as documentation of the electronic files posted on the Web Site. Thus, the information that appeared on the Web Site could be reconstructed. In this case, we would be documenting the existence of a Web Site without actually accessioning the information on the Web Site.

Benefits

This approach avoids duplicating information that NARA would already be accessioning. The information provided by the persons or committees in charge of the Site would provide researchers with evidence of the information which was posted, and they would then search out the desired documents from the records of that agency. This would be especially true of larger agencies which strictly control the information on their Web sites.
Disadvantages/Problems

Not all agencies have a centralized place where this information can be found. In smaller agencies, the Web sites might be constructed and maintained by interns or interested personnel, yet their records may not provide adequate information on the content and structure of the Web Site.

This option also ignores the possibility that in the future, the information posted on the Web site might not appear in any other format. In these cases, it is necessary to appraise not only the records of those maintaining the files, but also the files on the Web site itself.

Alternative B

Accession all the files within the Web Site. These could be viewed through a browser. However, it is important to note that different browsers will "interpret" the HTML commands differently. Also, most Web sites contain links to graphics and other sites; those links or graphics would not be functional. In this case, the links can be documented by identifying the institution maintaining that site and providing a brief description of the content of those sites.

Benefits

The Web site can be preserved in a fashion that researchers will be able to "navigate" through. Researchers would also get a better idea of the structure of the site.

Disadvantages/Problems

At the moment graphics, an integral part of most Web sites, cannot be preserved.

The sheer size of some Web sites and the number of links that must be accounted for make them difficult to document.

The possibility exists for duplicating information that already exists among the records of the agency.

Alternative C

Accession selected files from the Web Site, as well as preserving the records of the persons, offices, or committees maintaining the Site. Valuable files, which may not exist in any other format, or are more valuable in electronic format, can be preserved. These files could either be requested from the agency without HTML markup (in plain ASCII), or NARA could maintain the markup.
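
As an illustration only, the difference between a tagged file and its plain ASCII form can be sketched in a few lines of Python using the standard `html.parser` module. This is a minimal sketch, not a NARA or GPO procedure: the class `MarkupStripper`, the function `to_plain_ascii`, the set `BLOCK_TAGS`, and the sample page are all hypothetical names and data introduced here. The sketch removes the HTML commands from one tagged document and lists the links and graphics the tags pointed to, so that they could be documented separately.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags after which a line break is assumed.
BLOCK_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "br"}


class MarkupStripper(HTMLParser):
    """Collect the text of a tagged page and the targets its tags point to."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # character data found between tags
        self.references = []   # href/src values (links and graphics)

    def handle_starttag(self, tag, attrs):
        # Record where links and graphics point; the targets "reside" elsewhere.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.references.append(value)

    def handle_endtag(self, tag):
        # Keep headings and paragraphs from running together once tags are gone.
        if tag in BLOCK_TAGS:
            self.text_parts.append("\n")

    def handle_data(self, data):
        self.text_parts.append(data)


def to_plain_ascii(html_text):
    """Return (plain_text, references) for one tagged document."""
    parser = MarkupStripper()
    parser.feed(html_text)
    parser.close()
    # Collapse the whitespace left behind by the removed tags.
    plain = " ".join("".join(parser.text_parts).split())
    # Replace any non-ASCII character so the result is strictly ASCII.
    plain = plain.encode("ascii", errors="replace").decode("ascii")
    return plain, parser.references


# Made-up sample page, not taken from OTA Online.
sample = (
    '<h1>Catalog</h1>'
    '<p>See the <a href="reports.html">reports</a>.</p>'
    '<img src="logo.gif">'
)
text, refs = to_plain_ascii(sample)
print(text)
print(refs)
```

The plain text loses the structure that the markup carried, which is one reason the context of a selected file would need to be documented if the markup is not maintained.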
Benefits

This approach ensures the preservation of unique files or valuable information without the burden of accessioning the whole site.

Disadvantages/Problems

In accessioning select files, it is important to document the context. The documentation package would include technical information, but also information on the content of the site where the selected file was originally placed.

Web sites are always changing. Files can easily be added, updated, and deleted. This poses a problem for accessioning files in a Web site. The solution proposed in the "Preserving Digital Information: Draft Report of the Task Force on Archiving of Digital Information" (August 24, 1995) is to take "periodic snapshots" of the pages in a site. Ultimately, the agency is responsible for scheduling the files in its Web site. NARA can work with the agency to develop a strategy for accessioning files which constantly are being changed.

ISSUES TO BE ADDRESSED

Identifying Information for Preservation

How can Web sites with valuable information be identified?

Federal agencies are creating a large number of Web sites. Once agencies are no longer interested in maintaining that information, there is no mechanism in place to preserve that information for future users. Both GPO and NARA share an interest in preserving this information for future use. However, as Federal records, the Web sites must be scheduled along with other agency records. Therefore, records schedules could serve as a tool to identify valuable Government information on Web sites.

Transfer of Information to GPO and NARA

Once identified, what information from the Web sites should be transferred?

As explained earlier, GPO and NARA have different goals. Each agency will have to decide what information on the Web sites will be of value to its customers. Sometimes both agencies will be interested in the same information. However, GPO is primarily interested in providing information for short-term access.
Since NARA is interested in maintaining indefinitely information with historic value, it needs to apply criteria for determining which information from Web sites warrants continued preservation by the Government.

How should this information be transferred to GPO and/or NARA without added burden to the agencies?

GPO and NARA will have to work together to identify ways in which agencies can transfer the information without an added burden.

Extended FDLP Access to Electronic Information Dissemination Products and Services

What is the most cost-effective and useful method for preserving FDLP access to electronic Government information available from agency Web sites or online services?

The maintenance and migration of electronic information over a period of years can be very costly. If information already has been distributed in paper, microfiche, or CD-ROM, does it make sense to provide continued online access to the information? If an agency decides to discontinue access to information through its Web site, does GPO have a responsibility to obtain the information and provide funds and resources for its continued access through the FDLP?

Differences Between the Life-Cycle of Information Dissemination Products and Services in Electronic vs. Traditional Formats

How is the life-cycle for electronic information different from that of traditional formats like paper and microfiche?

What part of the information dissemination process must be changed in order to ensure extended access and the archivability of information on agency Web sites?

*****************************************************************

Judy Russell

Comments should be submitted by Friday, February 16, 1996, by internet e-mail to study@gpo.gov, by fax to FDLP Study at (202) 512-1262, or by mail to FDLP Study, Mail Stop SDE, U.S. Government Printing Office, Washington, DC 20401.