
Less is More: Why fewer documents means better search results

When a search engine fails to return the right answer, people tend to assume it has not found all of the content available on the network. In truth, it may have discovered too much content.

For each valuable page of information on your network, there are at least 10 pages of useless information, such as log files, zip archives, and dynamic pages.

For example, do you need to index every page of an on-line calendar that goes to the year 2023? Probably not.

In fact, once this low-value content is included, users can be overwhelmed by the number of results that match their query term, especially if the term is a common one.

One of our customers had 4.5 million pages in their index. They adjusted their spider to skip a lot of low-quality content, and ended up with 1.5 million pages. The quality of the search results improved dramatically. A smaller index, but better results.

To choose what should be in or out of your index, try some of your most popular queries. For each irrelevant result, ask yourself, "Should this be in the index?"

It is easy to control what gets indexed: adjust the filters on the Collections > Filters tab in the admin console. You can disallow content by matching its URL against a wildcard or regular expression pattern. The filters are matched from the top down, so make sure your specific disallow filters come before your more general allow filters.
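To see why filter order matters, here is a minimal Python sketch of top-down, first-match-wins filtering. It is an illustration only, not the admin console's actual filter syntax or implementation: the rule list, the example URLs, the is_indexed helper, and the choice to skip URLs that match no rule are all assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule list: (action, wildcard pattern), evaluated from the top down.
# The specific disallow rules come before the general allow rule.
FILTERS = [
    ("disallow", "*.log"),
    ("disallow", "*.zip"),
    ("disallow", "http://intranet.example.com/calendar/*"),
    ("allow", "http://intranet.example.com/*"),
]


def is_indexed(url, filters=FILTERS):
    """Return True if the first rule that matches the URL is an allow rule."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            return action == "allow"
    # Assumption for this sketch: a URL that matches no rule is not indexed.
    return False


if __name__ == "__main__":
    for url in (
        "http://intranet.example.com/docs/guide.html",
        "http://intranet.example.com/calendar/2023/01",
        "http://intranet.example.com/logs/server.log",
    ):
        print(url, "->", "index" if is_indexed(url) else "skip")
```

If the general allow rule were moved to the top of the list, it would match the calendar and log URLs first, and the disallow rules below it would never be reached.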

by Ryan Weisenberger
Manager, Software Development

Posted June 20, 2005 07:39 AM by admin
Category: Indexing
