MetaData Mine-ing

Peter Weinstein, Brian Dunkel, Nandit Soparkar

The quantity and complexity of information on the Web are growing rapidly. While several simple interactive search tools exist to help locate items of interest on the Web, the increasing sophistication of uses and the sheer quantity of data together indicate a need for techniques to locate relevant application-specific information. To this end, we describe a two-stage, customizable information indexing strategy that is especially suited to situations with specialized and long-term needs. In the first stage, a customized filter is configured by selecting from a library of test and data extraction functions. A search tool "crawls" the Web in some appropriate way and assesses encountered documents using the filter. Useful metadata is extracted from selected documents and stored in a proprietary metadatabase. In the second stage, the metadatabase, which may be regarded as a view of the Web, can be queried to locate information relevant to a specific inquiry. Our approach potentially achieves greater flexibility and specificity than currently available search engines. Also, filter libraries can incorporate domain-specific, or ad hoc, information filtering mechanisms. We describe the preliminary design, implementation, and experimentation for our proof-of-concept effort.

Test results for sites passing the filter are stored as relational data.
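
The two-stage strategy can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`mentions`, `min_words`, `extract_title`, `build_filter`, `index_documents`, `query_titles`), the table layout, and the example URLs are all hypothetical, an in-memory list of pages stands in for the crawler, and SQLite stands in for the metadatabase.

```python
import re
import sqlite3

# Hypothetical library of test functions: each returns a predicate over page text.
def mentions(term):
    return lambda text: term.lower() in text.lower()

def min_words(n):
    return lambda text: len(text.split()) >= n

# Hypothetical library of data extraction functions: each maps page text to a value.
def extract_title(text):
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def word_count(text):
    return len(text.split())

def build_filter(tests, extractors):
    """Stage 1 configuration: combine selected tests and extractors into one filter."""
    def apply(url, text):
        if not all(test(text) for test in tests):
            return None  # document does not pass the filter
        record = {"url": url}
        for name, extractor in extractors.items():
            record[name] = extractor(text)
        return record
    return apply

def index_documents(documents, page_filter, conn):
    """Stage 1 execution: assess each document and store metadata for those that pass."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(url TEXT PRIMARY KEY, title TEXT, words INTEGER)"
    )
    for url, text in documents:
        record = page_filter(url, text)
        if record is not None:
            conn.execute(
                "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                (record["url"], record["title"], record["words"]),
            )
    conn.commit()

def query_titles(conn, min_word_count):
    """Stage 2: query the metadatabase instead of the Web itself."""
    rows = conn.execute(
        "SELECT url, title FROM metadata WHERE words >= ? ORDER BY url",
        (min_word_count,),
    )
    return rows.fetchall()

# Stand-in for pages encountered by a crawler (made-up content).
pages = [
    ("http://example.org/a",
     "<title>Data Mining Notes</title> data mining on the web with filters"),
    ("http://example.org/b", "<title>Cooking</title> recipes for soup"),
    ("http://example.org/c", "<title>Mining</title> data mining"),
]

page_filter = build_filter(
    tests=[mentions("data mining"), min_words(5)],
    extractors={"title": extract_title, "words": word_count},
)
conn = sqlite3.connect(":memory:")
index_documents(pages, page_filter, conn)
results = query_titles(conn, 1)
```

In this sketch only the first page satisfies both tests, so the metadatabase holds a single row. Keeping tests and extractors as plain functions is one way to make the filter library easy to extend with domain-specific checks.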

This paper is currently available in hard copy only.