
Keeping Your Index Fresh with Add URL

Walter Underwood, Principal Software Architect, Verity

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called "index freshness." A stale index frequently misses new pages and holds outdated information, including pages that have already been deleted and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek's Add URL feature to notify the spider of deleted, changed, or new URLs.

Add URL does what it says and a little more. You pass in a URL, and the spider adds it to the URLs it will crawl, putting it on the highest-priority queue. It also forces a revisit and reindex if the URL is already known to the spider. If a URL is new, it will be visited and indexed (if it isn't a duplicate). If the document at that URL is in the index and has changed, it will be reindexed. If there is no longer a document at that URL, the old document will be removed from the index.

Add URL has both a user interface and a programmatic interface. Users can access it from the help pages. Administrators can access it there or on the URL tab of a spider collection. Programs can do exactly what the UI form does, which is to send an HTTP POST or GET with the URL, or they can use the dedicated Java API or SOAP Web service.

Doing Add URL with HTTP is straightforward. Assuming that Ultraseek is installed at search.example.com, and that you are notifying it about changes at http://example.com/new.html, access this URL with a GET:

http://search.example.com/help/addurlgo.html?url=http://example.com/new.html
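The same request can be built from a script. The following is a minimal Python sketch, assuming the addurlgo.html path and url parameter shown above. The function names are hypothetical, and percent-encoding the parameter is a precaution added here (useful when the notified URL has its own query string), not something the Ultraseek documentation is quoted as requiring.

```python
import urllib.parse
import urllib.request


def build_add_url_request(search_base, changed_url):
    """Return the GET URL that notifies Ultraseek about changed_url.

    search_base is the root of the Ultraseek install, for example
    "http://search.example.com". The /help/addurlgo.html path and the
    "url" parameter come from the example above.
    """
    # urlencode percent-encodes the notified URL so that any "?" or "&"
    # inside it is not mistaken for part of the Add URL request itself.
    query = urllib.parse.urlencode({"url": changed_url})
    return f"{search_base.rstrip('/')}/help/addurlgo.html?{query}"


def send_add_url(search_base, changed_url, timeout=10.0):
    """Send the notification and return the HTTP status code."""
    request_url = build_add_url_request(search_base, changed_url)
    with urllib.request.urlopen(request_url, timeout=timeout) as response:
        return response.status
```

Calling send_add_url("http://search.example.com", "http://example.com/new.html") issues the GET; build_add_url_request can be used alone to inspect the request before sending it.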

For Add URL with the XPA Java API, see SpiderCollection.addURL in the Javadoc shipped with the library. For the SOAP Web service, see the VisitURL operation in the Web services documentation.

The URL to be added is checked against the collection URL filters first, so the search administrator still has control over what is allowed in the collection. Web page authors and site webmasters can use Add URL to keep their pages fresh in the search index without making requests to the search administrator. This saves time for everyone.

Add URL usually updates the index very quickly, often in a few seconds. Sometimes, the document will have been visited and reindexed by the time the URL Status page is shown. Make sure that the website is updated before you send the Add URL notification. We had one customer do the Add URL before pushing changes to the site, and even though it was only a two-second delay, Ultraseek visited the URL before the content arrived and got a 404 response from the Web server. A spider curfew or a suspended spider will also prevent Add URL from taking effect immediately. So push the content first, and make sure the spider is allowed to run.
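One way to guard against the ordering problem described above is to confirm that the page is actually being served before sending the notification. This is a rough Python sketch under that assumption. The function names, the retry count, and the delay are all hypothetical choices, and it treats only a 200 response as "live."

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status for url, or None if the server is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 404 and other error responses arrive as exceptions in urllib.
        return err.code
    except urllib.error.URLError:
        return None


def wait_until_live(url, get_status=http_status, attempts=5, delay=1.0,
                    sleep=time.sleep):
    """Poll url until it returns 200; give up after the given attempts.

    get_status and sleep are parameters so the check can be exercised
    without a network connection.
    """
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A publishing script would call wait_until_live on each changed page and send the Add URL notification only when it returns True. This check is not suitable for deleted pages, where a 404 is the expected result.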

For the best possible index freshness, integrate Add URL into your publishing system. Right after publishing any change to your website, notify Ultraseek of all the URLs with changes through Add URL. Moments later, the search index will be updated and your changes will be published to both the site and the index, so that visitors can find the new pages through browsing or through search.
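As an illustration of that integration, here is a small Python sketch of a post-publish hook. It assumes the HTTP GET form of Add URL described earlier and an Ultraseek install at search.example.com. The function names and the idea of returning the failed URLs for a later retry are assumptions made for this example, not part of Ultraseek.

```python
import urllib.parse
import urllib.request

# Assumed location of the Add URL form handler on the search server.
ADD_URL_ENDPOINT = "http://search.example.com/help/addurlgo.html"


def send_add_url(changed_url, endpoint=ADD_URL_ENDPOINT, timeout=10.0):
    """Notify the spider about one URL; return True on an HTTP 200."""
    query = urllib.parse.urlencode({"url": changed_url})
    with urllib.request.urlopen(f"{endpoint}?{query}", timeout=timeout) as resp:
        return resp.status == 200


def notify_after_publish(changed_urls, send=send_add_url):
    """Send one Add URL notification per changed URL.

    Call this only after the publish step has finished. Returns the
    list of URLs whose notification failed, so the caller can retry.
    """
    failed = []
    # dict.fromkeys removes duplicates while keeping the original order.
    for url in dict.fromkeys(changed_urls):
        try:
            ok = send(url)
        except OSError:
            # urllib's network and HTTP errors are OSError subclasses.
            ok = False
        if not ok:
            failed.append(url)
    return failed
```

The list passed in should include new, changed, and deleted URLs alike, since Add URL handles all three cases.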

Posted July 19, 2005 10:26 AM by editor
Category: Indexing
