Comparative Evaluation of Three Types of Semantic Distance Metrics – Implications for Use in Semantic Search
Abstract
Semantic relatedness is an important measure for search functionality and design in the 21st century. We envision that a 21st-century search system should be able to accept a sample document or object as a “query” and return results that are “like this” or “related.” Today, search systems that suggest “related results” do so based on the similarity of values in defined properties or bibliographic fields (e.g., faceted search using metadata values) or on high co-occurrence rates of query terms in full-text indexes. Such systems are commonly referred to as similarity search. For search systems to support the envisioned capability in the future, there must be a reliable mechanism for semantically identifying facets and values in the query document, and for calculating its semantic relatedness or similarity to other documents. The literature is rich with discussions of semantic relatedness and similarity measures. Among the measures discussed, semantic distance appears to hold the greatest promise for this future search capability. Semantic relatedness is a concept that has been treated in philosophy, psychology, artificial intelligence, and computational linguistics. This research approaches semantic distance from the perspective of computational linguistics and semantic analysis, namely as the degree of similarity or relatedness of two lexemes in a lexical resource. Semantic distance provides a more practical and quantitative approach to defining “similarity.” In addition, this research expands the definition of a lexical resource to include full text and text corpora, knowledge organization systems, and metadata structures for documents.