Wednesday, May 31, 2000

A Client Side Agent for Intelligent Internet Search
Scott Parent (scott@DoctorDatabase.com)
Bamshad Mobasher, DePaul University (
mobasher@depaul.edu)
Steve Lytinen, DePaul University (
lytinen@depaul.edu)

Objective

We proposed to investigate and develop a client-side tool--an intelligent agent--to guide the web user towards optimally effective web search by:

Vision

The defining characteristic of this tool is the integration of a dynamic knowledge base and a highly specialized user interface for expressing and manipulating queries as well as browsing results from these queries in a variety of sampling modes. We believe that a client-side agent which combines a multi-modal and highly interactive interface for guiding the user in formulating queries, and unsupervised methods for automatically clustering the search results can provide an effective tool for handling information overload on the Internet.

Requirements

Allow the user to interact seamlessly with these components:

The Query Process

As Marti Hearst points out, the goal of a user searching the web can range anywhere from obtaining a single fact--such as a phone number--to becoming familiar with a broad subject.[He99] In most cases, the first step is becoming familiar with the pattern of representation enjoyed by the topic at hand on the Internet. Highly specific interests may offer only a few web publications--in which case the precision associated with a high-recall search is more likely to be acceptable. However, for well-represented topics--especially those associated with a number of subtopics--the query must be incrementally refined in order to squelch the noise and arrive at a manageable level of precision.

Two key characteristics of this client side agent are the ability to intelligently filter documents resulting from a given search and imposition of a hierarchical organization of those documents that pass the filter. Studies by Ben Shniederman associate hierarchical representation with optimal manageability and point to the commercial success of Yahoo! as strong evidence of user preference for such.[ Sh97,SFR99] Hierarchical presentation of topics offers a kind of graphical analog to the human speech pattern of introducing a topic in terms sufficiently general as to be part of the speaker's and listeners' shared context--then following a chain of subjects into a more specific context.[Pi95] Apart from the disambiguation benefits of telescoping contexts, a tree with a reasonable branching factor (say, ≤7) would be more easily digested by the human cognitive chunking facility than would a large, heterogeneous list.[Mi56,Ca96]

Knowledge Base

The knowledge base is personalized to an individual user. The key components of the knowledge base are:

Here again, the hierarchical representation comes into play--by definition, in the case of the concept classification hierarchy. In the case of the lexicon, key semantic relationships such as hyper-/hyponymy (i.e., "is-kind-of") and holo-/meronymy (i.e. whole/part) tend to offer traversable trees.

User interests, we believe, can also be arranged into topic-subtopic hierarchies using the same learning techniques indicated above for organizing search results.

User Profiles

The user profile component employs data mining techniques--adapted for information retrieval--to characterize a user's interests and vocabulary. Past browsing behavior as well as ongoing query activity are both included in the evidence considered to arrive at this characterization. Given that both of these activities pertain to identifiable sets of documents, we can use the terms and collocations featured in a given document as a basis of comparison to other documents as well as topics already established in the knowledge base. Documents can then be attached to existing topics or clustered into new topics based on the similarity discovered by these comparisons.

The query interface--described below--also gives users a more direct means of expressing interest--or lack thereof--in particular topics. At query time, broad classifications of relevance include:

In the first two cases, we list the evaluated item--either a topic or a specific document--as Interesting in the user profile. In the last case, we list the item as Not Interesting. It follows, then, that the root topics listed in the personal interest topic space include, "Interesting" and "Not Interesting."

By definition, items listed under "Interesting" are expected to be of most interest to the user--and therefore, most likely to be revisited. It is also reasonable to expect that this set will continue to grow with use. For the reasons given above, then, it might well be desirable to continually reorganize this set into an evolving hierarchy. Doing so manually, of course, would be a particularly tedious task, so the benefit of automated organization would be intuitively clear for this set of items.

For the "Not Interesting" set, the user may on occasion wish to reevaluate particular topics. Again, a hierarchical organization is likely to facilitate comprehensive identification of such topics. And, in as much as manual classification of interesting items is likely to be tedious, manual classification of largely uninteresting items promises to be painful. Again, automated classification appears to fit the needs of this scenario.

In either case, it is computationally desirable to operate with smaller numbers of aggregate representations--as opposed to an increasing number of specific document representatives. So clustering documents in advance of employing topic representations at some point becomes necessary.

As with Syskill and Webert [PB97], our tool should allow reactivation of a previous query. This, then, argues for listing "Queries" as a third root topic in the user profiles collection. Also, if a distinction is to be made between queries that occurred before and after the most recent clustering process, the subtopic structure might be adapted to indicate which queries have been so assimilated and which have not.

Concept Hierarchy

While automated classification techniques facilitate organization of items which our agent perceives to hold our user's interest, a manually constructed taxonomy gives us a priori information about the relationship between concepts. Domain specific hierarchies--like MeSH--would certainly be candidates for this role. However, the more general-purpose Yahoo! hierarchy is the initial focus of our investigation here.

All of Yahoo's categories fit into a tree structure. At the time of this writing, the root node of this tree has fourteen children, including, "Arts & Humanities," "Business & Economy," and "Computers & Internet." The URL for each of these fourteen nodes is expressed as an immediate subdirectory of http://dir.yahoo.com. Every other node in Yahoo! is a descendent of one of these fourteen main nodes. And, the URL for each of these categories is the URL of that category's parent node followed by a slash and a subdirectory name unique in the namespace of that parent.

Yahoo! also offers links to from topics in one path to related topics in different paths. In order to distinguish between the two kinds of relationships, we refer to immediate descendents of a given node in that node's path as subtopics. In this vernacular, then, for each category, c1 with a link appearing in category, c0, c1 is a related-topic with respect to c0. However, a category, c2, can only be a subtopic of c0 if the URL for c2 is an immediate subdirectory to that of c0.

Early exploration results indicate that Yahoo! features approximately 213,000 categories and links to more than 1.2 million external URLs. The average depth of a node is 7.4 (where each of the fourteen main nodes has a depth of 1). The maximum node depth observed is 17. Aggregate branching factors are as follows:

 

Mean

Max

Categories to all related-topic categories

2.4

2026

to proper subtopic categories

1.0

486

to non-subtopic related categories

1.4

2016

Categories to category member documents

5.7

4405

Lexicon

The lexicon maps terms to their meanings or senses. A given term will have one or more senses. A term with one sense is called monosemous, and a term with many is polysemous. The different senses associated with a polysemous term effectively represent homographs.

We intend to drive our lexicon with WordNet, an electronic thesaurus produced by the Cognitive Science Laboratory at Princeton University. A quick tally of the records in WordNet's database reveals that WordNet contains 99,642 distinct senses.[Pa99] Each sense has a prosaic definition and belongs to a single Lexicographer File Category. In turn, each of these categories is associated with a single part of speech. The four parts of speech supported by WordNet are:

Part Of Speech

Number of Senses

Noun

66,025

Verb

12,127

Adjective

17,915

Adverb

3,575

Semantic relationships between senses include hyper-/hyponymy and holo-/meronymy. Arguably, each of sixty-six-thousand noun senses corresponds to a topic. And each hyponym of a given noun sense could then be considered a subtopic thereof. In this way, the Lexicon behaves as another concept hierarchy.

A by-product of mapping terms to senses is that all terms associated with a given sense are--by definition--synonyms of one-another. Synonyms can be used to improve the recall of a query--by matching documents referring to the same concept by a different name.

The key function of the lexicon, however, is to support natural language processing operations--including:

Query Generator

The concept hierarchy, lexicon and user profile all act as topic spaces for expansion of query terms. These topic spaces may also be used to seed clusters for organizing query results and will certainly provide criteria for filtering these results.

The query generator manages the creation, manipulation, expansion, execution, persistence and reactivation of queries. A query can be created with one of more terms supplied by the user. Alternatively, the user could start a query by picking one or more topics from the available topic spaces.

The query generator is responsible for expanding any query that is not fully expanded. Expansion is a function of a topic space. That is, a query in any state can be compared to topic representations in a given topic space. Those topics bearing sufficient similarity are added to the expanded state of the query as "suggestions." The threshold used to determine sufficiency here is an important avenue of investigation--and may well be a function of user preference.

An expanded query state can be re-expanded so long as there are additional topic spaces in which matching topics may be sought. Intuitively, the order in which topic spaces are chosen for use with query expansion seems to be an important decision. This too, however, may be left for run time specification as a user preference.

The original query as well as any expansions thereof are all considered "searchable." That is, a query in one of these states can be executed to find matching documents on the Internet. Once sought, the user must manipulate a query--by modifying the set of free terms or by selecting or rejecting particular topics suggested during expansion--before it makes sense to re-execute the query.

The agent may execute promising query expansions on its own to provide sample documents. One purpose of doing so is to give the user a chance to provide relevance feedback on those documents.

The user may save a query that is either searchable or sought. At a later time, then, the user may choose to reactivate the saved query. Given the dynamic nature of the Internet--and therefore of any worthwhile search engine--added to the dynamic nature of the user profiles, a reactivated query should be treated as user-manipulated. The query generator will then regard it as both searchable and potentially expandable.

User Interfaces

Query Interface

The query interface connects the user to the query generator. Feedback is provided to the user via a consistent, graphical representation of topic spaces and query states. Though some studies have found that users who are less spatially adept can find three dimensional presentations frustrating--particularly when features of interest become occluded or inaccessible to the frustum[He99]--employing 3D technology to the extent that neither of these difficulties can occur might steer users around these problems while still offering visualization benefit.

The query interface must also recognize user gestures intended to create, manipulate, save, execute or reactivate a query. These gestures are communicated via pointing device, a free-text entry facility, and ultimately through speech recognition.

Specifically, the interface must allow for activation of a query via one of the following:

Once activated the interface must allow the user to manipulate the query or trigger an operation on the query. Manipulation involves changing the free text or assessing the relevance of attached items. Attached items consist of topics drawn from the knowledge base as well as documents resulting from searches.

Operations that may be triggered on a query include expansion and execution. A query expansion can result in a number of possible states--one for each permutation of topic spaces in the knowledge base. And, any one of these states might interest the user. It is therefore desirable that the user interface present--the to extent practicable--all possible expansions of a query. The user then may choose which state to operate on next. The user may also wish to switch between alternative states or even backtrack by discarding expansions: undo-ing the expansion as it were.

A manageable way to approach this ideal, then, is to limit expansion to a single generation per user request. That is, when a user selects a state to expand, an expanded state results for each topic space in the knowledge base. The user may then select any of these resulting states for further such expansion, execution or manipulation.

One possible metaphor to expose these behaviors through the user interface is a stack of cards for each topic space. The top card represents the next operable expansion based on the associated topic space. Each card has the following features:

A means of toggling between this representation and the resulting term weights should also be available. An issue that must be examined here is the mutability of these terms: evidence [KB96] suggests that users would benefit from the ability to directly manipulate the set of terms resulting from our expansion. Once altered, though, the set of terms becomes inconsistent with the original set of topic- and document- representatives. So, switching back to the "topic" view becomes complex.

Another opportunity afforded by visually representing multiple states (both visited and possible) of the active query is that of preserving a given state as the starting point of an alternative strategy. Semantically this might be characterized as creating a new query to activate later using a "Save As…" button. However, even if a visited state is not so persisted, the user might still return to it using the "discard" operation.

Browsing Interface

Upon execution of a query, the browsing interface collects, filters and organizes resulting documents.

Implications

Language Choices

For a complex, Internet-aware tool such as our agent, Java is an excellent choice for development platform--particularly when execution-platform independence is desirable.

Persistence

Though we currently have a flat-file representation for topic spaces, this representation is not scalable. In particular, the size of a concept hierarchy like Yahoo! calls for keyed, random access. The dynamic nature of the user profile calls for a modification-tolerant format. Combined, a relational database or a B+ tree package seem to be indicated.

To further constrain these options to include platform independence, relational database dependence becomes much more costly.

The Berkeley database (http://www.sleepycat.com/), however, is a platform independent B+ tree package that appears to offer free licensing for academic use.

User Interface Tools and Technologies

Given Java as the development platform for this tool, some user interface extension packages offered by Sun (http://java.sun.com) warrant our consideration:

Java 2d/Java 3d

These graphics packages have implementations for most modern client platforms.

Java Speech API

Sun does not provide implementations of this directly, and vendors who do provide implementations do not appear to offer much in the way of platform independence. IBM provides a version that depends on Via Voice and runs on Windows. Lernout & Hauspie provide an implementation for Sun Solaris.

There are a number of speech recognition packages available in the CMU repository (http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/speech/systems/), but the platform independence of each of these solutions--as well as its affinity to Java interface--needs to be examined.

References

[Ca96]

William H. Calvin, "HOW BRAINS THINK." New York, BasicBooks, 1996. http://williamcalvin.com/bk8/index.htm

[He99]

Marti A. Hearst, "User Interfaces and Visualization," Chapter 10 of, "Modern Information Retrieval." By Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Addison Wesley, 1999. http://www.sims.berkeley.edu/~hearst/irbook/chapters/chap10.html

[KB96]

Jürgen Koenemann and Nicholas J. Belkin, "A Case For Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness." In Proceedings of the Human Factors in Computing Systems Conference (CHI'96). ACM Press, 1996. http://www.acm.org/sigchi/chi96/proceedings/papers/Koenemann/jk1_txt.htm

[Mi56]

George A Miller, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information." The Psychological Review, 1956, vol. 63, pp. 81-97. http://www.well.com/user/smalin/miller.html

[Pa99]

Scott Parent, "WordNet Landscape: Senses and Terms." 1999. http://doctordatabase.com/papers/wn/landscape.htm

[PB97]

Pazzani, M. and Billsus, D. "Learning and Revising User Profiles: The identification of interesting web sites." Kluwer Academic Publishers, 1997. http://www.ics.uci.edu/~pazzani/Publications/SW-MLJ.pdf

[Pi95]

Steven Pinker, "The Language Instinct." HarperCollins, 1995. http://www-bcs.mit.edu/~steve/tli.html

[SFR99]

Ben Shneiderman, David Feldman, and Anne Rose, "Visualizing Digital Library Search Results with Categorical and Hierarchical Axes," 1999. ftp://ftp.cs.umd.edu/pub/hcil/Reports-Abstracts-Bibliography/99-03html/99-03.html

[Sh97]

Ben Shneiderman, " Designing Information-Abundant Websites: Issues and Recommendations," 1997. http://www.cs.umd.edu/projects/hcil/members/bshneiderman/ijhcs/main.html