To Stopword or Not to Stopword?
By Greg Jones,
Sr. Software Engineer
Many search engines identify a class of words as "stopwords." These are typically the short, frequently occurring words in a language. Stopwords usually have only a grammatical function within a sentence and add little to its meaning. Some examples of stopwords for English are:
the
and
it
is
of
Stopwords include articles, case particles, conjunctions, pronouns, auxiliary verbs, and common prepositions.
How Do Other Search Engines Handle Stopwords?
Some search engines ignore stopwords in queries, for one of two reasons. First, if the engine requires all query terms to match, it may exclude a document simply because a stopword from the query does not appear in it. The stopword is probably irrelevant to what the user is searching for, so the excluded document may actually have been a very good result.
Second, if the engine does not require all terms to match, then any document containing the stopword is included in the results. Because stopwords are very common, that might be nearly every document in the engine's index.
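Both failure modes are easy to reproduce on a toy index. The sketch below is only an illustration: the four documents, the query, and the helper names match_all and match_any are invented here and do not describe any particular engine.

```python
# Toy corpus; the documents and the query are invented for illustration.
DOCS = {
    "d1": "history of rome",
    "d2": "rome travel guide",
    "d3": "the history of the roman empire",
    "d4": "the weather in paris",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def match_all(query, docs):
    """Return ids of documents that contain every query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def match_any(query, docs):
    """Return ids of documents that contain at least one query term."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms & set(tokenize(text)))


QUERY = "the history of rome"

# Requiring all terms drops d1, a good result, only because it lacks "the".
all_hits = match_all(QUERY, DOCS)

# Requiring any term pulls in d4, which shares nothing with the query but "the".
any_hits = match_any(QUERY, DOCS)
```

With this corpus, requiring every term returns nothing at all, while accepting any term returns all four documents.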
Ignoring Stopwords Can Be a Problem
For example, if a user searches for "vitamin a", the letter "a" will be recognized as a stopword. If the search engine ignores it, then the results will include, and rank similarly, documents about "vitamin A" or "vitamin B" or "vitamin K." Generally, words that have two meanings, one of which is a stopword and one of which is not, create this kind of problem. Another problem can occur if users search for phrases consisting entirely of stopwords. Individual stopwords are not usually meaningful, but phrases of them can be; for example, "to be or not to be".
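A short sketch makes the loss of information concrete. The stopword set and the strip_stopwords helper below are assumptions chosen for this example, not a list taken from any real engine.

```python
# Hypothetical stopword list, chosen only to cover the two example queries.
STOPWORDS = {"a", "the", "and", "it", "is", "of", "to", "be", "or", "not"}


def strip_stopwords(query):
    """Return the query terms that survive stopword removal, in order."""
    return [term for term in query.lower().split() if term not in STOPWORDS]


# "vitamin a" collapses to "vitamin", so it can no longer be told apart
# from a query about vitamin B or vitamin K.
vitamin_terms = strip_stopwords("vitamin a")

# A phrase made entirely of stopwords leaves nothing to search for.
hamlet_terms = strip_stopwords("to be or not to be")
```

Here vitamin_terms is ["vitamin"] and hamlet_terms is an empty list.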
Ultraseek Does Not Ignore Stopwords
Ultraseek does not need to ignore stopwords. By default, Ultraseek does not require all query terms to match. So if an Ultraseek query contains a stopword, nearly all documents in the index might be returned. However, the numerous documents that contain none of the query terms except the stopword will be ranked very low. This is because Ultraseek uses a tf/idf relevance algorithm, which gives little weight to common words, and stopwords are very common words.
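The effect of tf/idf weighting can be sketched with the textbook formula, in which a term's weight is its count in the document multiplied by log(N / df). This is an assumption made for illustration: Ultraseek's actual scoring formula is not given here, and the corpus and function names below are invented.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration. "a" appears in three of the
# four documents; "vitamin" appears in two.
DOCS = {
    "d1": "vitamin a is found in the liver",
    "d2": "a guide to the city",
    "d3": "a list of the parks",
    "d4": "vitamin b is found in the grain",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def idf(term, docs):
    """Textbook inverse document frequency: log(N / df), or 0 if unseen."""
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log(len(docs) / df) if df else 0.0


def score(query, doc_text, docs):
    """Sum tf * idf over the query terms; no term has to match."""
    counts = Counter(tokenize(doc_text))
    return sum(counts[term] * idf(term, docs) for term in tokenize(query))


scores = {d: score("vitamin a", text, DOCS) for d, text in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch every document matches the query "vitamin a", but d2 and d3, which share only "a" with it, score lowest, and d1, which contains both terms, ranks first.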
What Ultraseek Does Do with Stopwords
Ultraseek uses a list of stopwords for two other purposes. First, Ultraseek does ignore stopwords in the query when highlighting. It would be distracting to see every occurrence of "the" highlighted in the results. Second, Ultraseek uses the stopword list internally to improve the relevance of phrase searches.
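Skipping stopwords during highlighting might look like the sketch below. The square-bracket markers, the stopword set, and the highlight function are hypothetical and are not Ultraseek's implementation.

```python
# Hypothetical stopword list for this example only.
STOPWORDS = {"a", "the", "and", "it", "is", "of"}


def highlight(text, query):
    """Wrap each non-stopword query term found in the text in brackets."""
    terms = {t for t in query.lower().split() if t not in STOPWORDS}
    return " ".join(
        f"[{word}]" if word.lower() in terms else word
        for word in text.split()
    )


snippet = highlight("The history of the city of Rome", "the history of rome")
# snippet == "The [history] of the city of [Rome]"
```

Only "history" and "Rome" are marked, although "the" and "of" also appear in the query.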
Ultraseek has advanced query handling for queries that contain significant stopwords. For example, Ultraseek will return relevant results for queries like "vitamin a" and "to be or not to be", while other engines may ignore the significant stopwords within those queries. Ultraseek intelligently processes the queries so its results are relevant without requiring users to surround their query with quotation marks to make an explicit phrase search.
Posted September 7, 2005 07:25 AM by editor