Xiaoguang Qi, Dawei Yin, Zhenzhen Xue, Brian D. Davison, Computer Science & Engineering, Lehigh University

Choosing your own adventure: Automatic taxonomy expansion to permit many paths

A taxonomy organizes concepts or topics in a hierarchical structure. The dmoz Open Directory Project and the Yahoo! Directory are two well known examples which, with the help of many human editors, organize what are considered to be good quality web pages into topical taxonomies. Besides being created and maintained manually, taxonomies can also be created by automated systems. Such systems often group objects together based on their syntactic similarities or distance in the feature space. Automatic taxonomy generation can reduce the enormous amount of human effort and thus speed up the process. However, automatically generated taxonomies do not always match human’s preexisting mental image about the “world”; and therefore are not always easy for human to understand.

A taxonomy (or hierarchy) is usually constructed based on hypernym/hyponym, or supertopic/subtopic relationships. At any topic in the hierarchy, following a branch to get to a topic at a lower level means adding another constraint to its parent topic. When a user seeks information in a taxonomy, she usually starts from the root (”everything”), and narrows down the topic by choosing one child under the current topic until she finds what she needs. Unlike keyword search, seeking information in a taxonomy does not require the user to formulate her information need into search keywords. The names of topics (sometimes descriptions, too) serve as guidance for the user.

One major drawback of taxonomies is that they require users to have the same view of the topics as the taxonomy creator. That is, when a user follows a top-down path to find the specific topic of her interest, she has to make choices along the constrained sequence that is present in the hierarchy. As a result, users who do not share that mental taxonomy are likely to have additional difficulty in finding the desired topic. For example, in one taxonomy, information about Emacs, an open source text editor, can be organized under /Software/OpenSource/Editors/Emacs. Such a taxonomy will not be helpful for a user who looks for information about Emacs but does not know whether Emacs is an open source package. This problem can be reduced by remedies like cross-topic links (as used in ODP). However, after adding such links, the target nodes of the links will have more than one parent, making the taxonomy no longer a tree. Furthermore, such links bring dependencies among topics and increase editing cost: editing one topic may result in changes in its linked topics and then the linked topics of those topics. Faceted search/browsing has also been proposed to address this problem; however, identifying facets in large scale datasets is a big challenge.

In this talk, we present a new approach to taxonomy generation and expansion which is able to provide more flexible views. Based on an existing taxonomy, our algorithm finds possible alternative paths and generates a new, expanded taxonomy with flexibility in user choices. Furthermore, our algorithm can also be applied to a collection of tagged documents to generate flexible taxonomies of tags.

In particular, we first treat each internal node (topic) in the original taxonomy as a tag, and break down the input taxonomy into a set of tag-object tuples. Then we transform the taxonomy rebuilding problem into a problem to generate a taxonomy based on a bipartite graph of tags and objects. Starting from a single root, we build the new taxonomy in a top-down fashion by selecting a set of subtopics to split each node until no further split is necessary. Every time a topic is split, we generate three types of subtopics, a group of basic subtopics to cover all the associated objects (in some sense, the most obvious split), one or more groups of extension subtopics to provide alternative paths, and a group of shortcut subtopics to allow fast access to the most popular descendant topics. The process of selecting subtopics can be seen as a set cover problem, in which topics are sets, and URLs under the topic are the objects to be covered. Since the set cover problem is known to be NP complete, we adapt an approximation algorithm by changing its objective function. The subtopics are then chosen by optimizing the objective function.

In our experiments on the dmoz Open Directory Project and a social bookmarking dataset, the rebuilt taxonomies show favorable characteristics (more alternative paths and shorter paths to information). Our algorithm can be used to expand current web hierarchies such like those provided by ODP and Yahoo!, as well as generate new hierarchies based on social bookmarking systems such as Delicious and Bibsonomy.

Posted under: Uncategorized