
Archive: July 2005

Don't Reindex Every Week!

Walter Underwood, Principal Software Architect, Verity

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has "continuous spidering with real-time indexing and adaptive revisit intervals." It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

The Ultraseek spider is always on, always ready to find URLs and index documents. This is called continuous spidering. When a new or changed document is found, it is immediately indexed and is available for the next query. This is called real-time indexing.

How does the spider decide when to revisit a URL to check for changes? It measures page change rates and adjusts to match them. This is called adaptive revisit intervals. For every URL it visits, the spider tracks how often the page changes and uses that information to choose a revisit interval. If a page changes every day, it is visited every day. If another one changes every week, it is visited weekly.

Let's think about a sample site, with press releases and a page listing recent press releases. The page listing the press releases will change frequently, and Ultraseek will visit it often, finding the new press releases promptly. Individual press releases won't change, so Ultraseek will adjust their revisit interval to the maximum, about one visit per month.
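
The adaptive behavior described above can be illustrated with a toy model. This is a minimal sketch, not Ultraseek's actual algorithm: the halve-or-double rule, the one-hour minimum, and the name `next_revisit_interval` are assumptions made for illustration. Only the maximum of roughly one month comes from the text.

```python
from datetime import timedelta

# Assumed bounds: the article only says the maximum is "about one visit per month".
MIN_INTERVAL = timedelta(hours=1)   # hypothetical lower bound
MAX_INTERVAL = timedelta(days=30)   # approximates "about one visit per month"


def next_revisit_interval(current: timedelta, changed: bool) -> timedelta:
    """Return the next revisit interval for a URL.

    Hypothetical rule: visit twice as often after a change is seen,
    half as often after an unchanged visit, clamped to the bounds above.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# A press release that never changes drifts out to the maximum interval.
interval = timedelta(days=1)
for _ in range(6):
    interval = next_revisit_interval(interval, changed=False)
print(interval)
```

In this model the listing page, which changes on most visits, settles near the minimum interval, while the individual press releases settle at the maximum.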

So, if you are planning to set up regular revisit schedules, don't do it right away. Let Ultraseek run for a while and adjust to your website. Then, when you want a really fresh index, integrate your publishing system with Add URL. That will get new pages into the index in a few seconds, not just once a week.

Posted July 21, 2005 by editor
Category: Indexing

Keeping Your Index Fresh with Add URL

Walter Underwood, Principal Software Architect, Verity

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called "index freshness." A stale index often misses new pages and holds old information, including pages that have already been deleted and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek's Add URL feature to notify the spider of deleted, changed, or new URLs.

Add URL does what it says and a little more. You pass in a URL, and the spider adds it to the URLs it will crawl, putting it on the highest-priority queue. It also forces a revisit and reindex if the URL is already known to the spider. If a URL is new, it will be visited and indexed (if it isn't a duplicate). If the document at that URL is in the index and has changed, it will be reindexed. If there is no longer a document at that URL, the old document will be removed from the index.
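
The outcomes listed above can be summarized in a toy model. This is only a sketch under stated assumptions, not how the spider is implemented: the dictionary-based index, the `fetch` callable, the function name `handle_add_url`, the returned labels, and duplicate detection by exact content match are all hypothetical.

```python
from typing import Callable, Dict, Optional


def handle_add_url(
    index: Dict[str, str],
    url: str,
    fetch: Callable[[str], Optional[str]],
) -> str:
    """Apply one Add URL notification to a toy index and report what happened.

    index maps URL -> indexed content; fetch returns the current content,
    or None when there is no longer a document at the URL.
    """
    content = fetch(url)
    if content is None:
        # Document is gone: drop any old copy from the index.
        return "removed" if index.pop(url, None) is not None else "not found"
    if url not in index:
        if content in index.values():
            return "duplicate"  # new URL, but same content as an indexed page
        index[url] = content
        return "indexed"
    if index[url] != content:
        index[url] = content
        return "reindexed"
    return "unchanged"


# Usage: a dict stands in for the website; dict.get returns None for missing pages.
site = {"http://example.com/new.html": "first version"}
index: Dict[str, str] = {}
print(handle_add_url(index, "http://example.com/new.html", site.get))
site["http://example.com/new.html"] = "second version"
print(handle_add_url(index, "http://example.com/new.html", site.get))
```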

Add URL has both a user interface and a programmatic interface. Users can access it from the help pages. Administrators can access it there, or on the URL tab of a spider collection. Programs can do exactly what the UI form does (send an HTTP POST or GET with the URL), or they can use the dedicated Java API or SOAP Web service.

Doing Add URL with HTTP is straightforward. Assuming that Ultraseek is installed at search.example.com, and that you are notifying it about changes at http://example.com/new.html, access this URL with a GET:

http://search.example.com/help/addurlgo.html?url=http://example.com/new.html
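
A script can build and send the same request. The sketch below uses only the Python standard library and the host and path from the example above. Percent-encoding the `url` parameter is an assumption on my part (the example passes it unescaped); it keeps page URLs that contain `?` or `&` from being split into separate parameters. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint path taken from the example above; the host is the article's placeholder.
ADD_URL_ENDPOINT = "http://search.example.com/help/addurlgo.html"


def build_add_url_request(page_url: str, endpoint: str = ADD_URL_ENDPOINT) -> str:
    """Build the GET URL that notifies the search server about page_url."""
    # urlencode percent-escapes ':', '/', '?' and '&' so the page URL
    # travels as a single query parameter (assumed to be acceptable to the server).
    return endpoint + "?" + urlencode({"url": page_url})


def notify_add_url(page_url: str, timeout: float = 10.0) -> int:
    """Send the notification with a GET and return the HTTP status code."""
    with urlopen(build_add_url_request(page_url), timeout=timeout) as response:
        return response.status


print(build_add_url_request("http://example.com/new.html"))
# prints http://search.example.com/help/addurlgo.html?url=http%3A%2F%2Fexample.com%2Fnew.html
```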

For Add URL with the XPA Java API, see SpiderCollection.addURL in the Javadoc shipped with the library. For the SOAP Web service see the VisitURL operation in the Web services documentation.

The URL to be added is checked against the collection URL filters first, so the search administrator still has control over what is allowed in the collection. Web page authors and site webmasters can use Add URL to keep their pages fresh in the search index without making requests to the search administrator. This saves time for everyone.

Add URL usually updates the index very quickly, often in a few seconds. Sometimes, the document will have been visited and reindexed by the time the URL Status page is shown. Make sure that the website is updated before you send the Add URL notification. We had one customer do the Add URL before pushing changes to the site, and even though the delay was only two seconds, Ultraseek visited the URL before the content arrived and got a 404 response from the Web server. Also note that a spider curfew or a suspended spider will prevent Add URL from taking effect immediately. So push the content first, and make sure the spider is allowed to run.

For the best possible index freshness, integrate Add URL into your publishing system. Right after publishing any change to your website, notify Ultraseek of all the URLs with changes through Add URL. Moments later, the search index will be updated and your changes will be published to both the site and the index, so that visitors can find the new pages through browsing or through search.
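
The ordering matters, so a publishing hook might look like the sketch below. It is an illustration only: `publish_then_notify` and its two callables are hypothetical names, and a real publishing system would supply its own deployment step and its own Add URL call (HTTP, Java API, or SOAP).

```python
from typing import Callable, Iterable, List


def publish_then_notify(
    changed_urls: Iterable[str],
    push_to_site: Callable[[List[str]], None],
    notify_search: Callable[[str], None],
) -> List[str]:
    """Push content first, then notify the search server about each URL."""
    urls = list(changed_urls)
    # Step 1: the content must be live before the spider is told about it,
    # otherwise the spider may fetch the URL too early and see a 404.
    push_to_site(urls)
    # Step 2: one notification per new, changed, or deleted URL.
    for url in urls:
        notify_search(url)
    return urls


# Usage with stand-in callables that only record the order of events.
events = []
publish_then_notify(
    ["http://example.com/new.html", "http://example.com/index.html"],
    push_to_site=lambda urls: events.append(("push", len(urls))),
    notify_search=lambda url: events.append(("notify", url)),
)
print(events)
```

Passing the two steps in as callables keeps the ordering rule in one place, whatever deployment tool or Add URL interface is used.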

Posted July 19, 2005 by editor
Category: Indexing

'Implementation Went So Smoothly'

Marsha Luevane, Search Engine Manager at the U.S. Department of Energy's National Renewable Energy Laboratory, sits down to discuss her organization's selection and use of Ultraseek, and its longtime loyalty to the product.

How long has the U.S. Department of Energy's National Renewable Energy Laboratory been using Ultraseek technology?
We've been using this software since 1997 when we installed it, to our amazement, in a couple of hours. We actually thought we had done something wrong because the implementation process went so smoothly.

Where is Ultraseek used today in your organization?
We have Ultraseek deployed on our employee intranet and two major public websites (www.nrel.gov and www.eere.energy.gov). Last year, our users made more than 2 million searches with Ultraseek across 128,000 documents.

Why did you choose Ultraseek?
A number of the Ultraseek capabilities are important to NREL. We needed a search engine that can index all of our document formats. About 15,000 of our most important documents are in PDF, Word, Excel and PowerPoint formats. Ultraseek indexes these beautifully. We needed a search engine that handles natural language queries effectively. We needed a search engine that can perform field searches, find similar documents, and handle stemming. Ultraseek met all of these criteria.

What are some of your best Ultraseek search examples?
There are several ways Ultraseek simplifies the search experience. Visitors to www.eere.energy.gov can easily narrow their search from more than 30,000 pages across hundreds of separate websites. For example, someone researching 'electricity generation' can use Ultraseek's scoped search capability to find relevant documents inside only EERE's Distributed Energy Program site. Alternatively, they can broaden their search to find information on 'electricity generation' across all of www.eere.energy.gov.

Ultraseek's Spell Suggest feature is also widely used because there are so many technical terms on www.nrel.gov and www.eere.energy.gov. For example, a user who types in the search query 'photovoltiacs' automatically gets the correct spelling suggested, 'photovoltaics.' This is an excellent feature.

How do you use Ultraseek's Reporting Manager capabilities?
The search logs are a goldmine. They give us an idea of what our users are looking for and the terms they are entering in searches. We use this very valuable information for content development, search engine optimization and site information architecture. The information has been used to support the decisions we make about our sites.

How much work goes on behind the scenes to keep Ultraseek running smoothly at NREL?
The NREL is a one-person search management shop, just me, so administration has to be extremely easy, if not automatic. Ultraseek provides the tools to make automatic additions, deletions, and revisiting. It is also easy to manually add documents on demand, easy to get stats of documents and selectively re-index sites.

Posted July 14, 2005 by editor
Category: User Stories

Learn Python in 10 Minutes or Less

Ryan Weisenberger
Manager, Software Development

Python is the primary language used throughout Ultraseek. Python has been around since the early 1990s and has become robust and powerful. Having an understanding of Python can help you perform advanced customizations of the user interface, and easily augment Ultraseek’s behavior through patches.py. Anyone with experience in programming languages can pick up the basics of Python in just a few minutes.

Step 1. Download the Python interpreter

  1. Go to the Python website.
  2. Click the “Download” link.
  3. Click on the latest version of the Python interpreter. As of June 2005, this is Python 2.4.1.
  4. Download the binary package of Python for your operating system.
  5. Install Python on your system according to the Python installation instructions.

Step 2. Run Python

After you have installed Python on your system, you should run it according to the Python installation instructions. As Python starts, it will display something like the following:

Python 2.4 (#60, Nov 30 2004, 11:49:19)
[MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license"
for more information.
>>>

That >>> is the prompt for the Python command-line interpreter. You can enter Python expressions directly into the interpreter and immediately see the result.

For example, enter 2+2, and the interpreter will return 4.

Step 3. Take the Python tutorial

Now that you have a fully functioning Python interpreter on your system, you should take the Python tutorial.

This tutorial will guide you through the basics of Python, such as the syntax, flow control, and the use of modules.
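
As a small taste of those three topics, here is a short sketch. It is written for current Python 3, not the 2.4 release mentioned above (where `print` is a statement and type annotations do not exist), and the function name `describe` is made up for the example.

```python
import math  # modules are brought in with "import"


def describe(n: int) -> str:
    """Return 'even' or 'odd' for n, using an if branch."""
    if n % 2 == 0:
        return "even"
    return "odd"


# Flow control: a for loop over a range, calling our function and a module function.
for n in range(1, 4):
    print(n, describe(n), math.sqrt(n))
```

Note that indentation, not braces, marks the body of the function, the branch, and the loop.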

Step 4. Read a book or two

At this point, you should have a strong enough grasp of the basic Python programming concepts to go meddle around in the Ultraseek user interface, or maybe even patches.py. But if you really want to delve into Python, you should check out some of the Python books available. A comprehensive list can be found at the Python wiki.

Posted July 08, 2005 by editor
Category: Customizing

My Favorite Customer Problem

By Walter Underwood
Principal Architect

We are concerned about all of the problems reported by our customers, but there is one problem I don't mind hearing about.

At least once a month, we hear from a customer who has not had to touch their Ultraseek installation for months and has forgotten the password. Sometimes, the search administrator changes jobs, and Ultraseek runs with no attention for a year or more.

We immediately explain how to set a new admin password, but I'm always happy that Ultraseek has been so reliable that it has needed no attention at all for months, maybe even years.

We don't know the record for unattended operation, but one university ran Ultraseek for five years without making a support call or upgrading to a new version, so we assume it didn't have any serious problems.

Posted July 05, 2005 by editor
Category:
