
Archive: Indexing

Craft Powerful Regular Expressions

By Yusuf Mohsinally
Sr. Quality Assurance Engineer

Wildcards can be used to define filters that allow or disallow URLs in Ultraseek. They can also be used for various other purposes in the admin interface, such as blocking certain IP addresses or increasing the quality score of certain documents. In most cases, wildcards are sufficient. Sometimes, however, you need a more powerful syntax to express your rules. Regular expressions can be used as an alternative to wildcards.

A First Attempt
Suppose you want to allow all URLs from your domain, mycorp.com. You might first attempt to define a wildcard allow filter for

http://*.mycorp.com/*

That would just about do it. It allows all the URLs from the domain, but it could also allow some that are not on the domain. For example, if someone linked to an archive of your site on "The Wayback Machine", at

http://web.archive.org/web/19990116234941/http://www.mycorp.com/

that URL would be allowed. The star from the wildcard expression could match all of

web.archive.org/web/19990116234941/http://www
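
The over-matching can be reproduced with a small sketch. Python's fnmatch module is used here only as a stand-in for a wildcard matcher whose star can match any characters, slashes included. It is an assumption that this approximates the Ultraseek wildcard behavior described above.

```python
from fnmatch import fnmatchcase

WILDCARD = "http://*.mycorp.com/*"

urls = [
    "http://www.mycorp.com/",
    "http://web.archive.org/web/19990116234941/http://www.mycorp.com/",
]

# fnmatchcase() lets "*" match any run of characters, including "/".
results = {url: fnmatchcase(url, WILDCARD) for url in urls}

for url, allowed in results.items():
    print("allow" if allowed else "skip ", url)
```

Under this stand-in, both URLs are allowed, including the archived copy that is not on the domain.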

Use Regular Expressions
First, change the type of syntax by selecting the drop-down box that shows "wildcard" and changing it to "regex". Then enter an expression into the text box using the syntax for regular expressions. Ultraseek is built on the Python language and uses Python to process regular expressions, so you must use Python's regular expression syntax, which is similar to Perl's.

Next Attempt
Try replacing the star in the wildcard expression with a more restrictive pattern:

http://[^/]*.mycorp.com/*

The contents of the square brackets define a set of characters. The caret indicates that any character except what follows can match. This pattern will allow a URL only if the sequence of characters between http:// and .mycorp.com does not contain a slash. The string that matched the star in the previous example contains a slash and therefore would not match the new pattern.
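
A quick check of this pattern with Python's re module might look like the sketch below. It assumes that a filter has to match the whole URL, which is why re.fullmatch() is used. How Ultraseek anchors its filters internally is not stated here, so treat that as an assumption.

```python
import re

# Assumption: the filter has to match the whole URL, so fullmatch() is used.
PATTERN = re.compile(r"http://[^/]*.mycorp.com/*")

site_root = "http://www.mycorp.com/"
archived = "http://web.archive.org/web/19990116234941/http://www.mycorp.com/"

# [^/]* cannot cross the "/" that follows web.archive.org, so the archived
# URL never reaches the "mycorp" part of the pattern.
root_allowed = PATTERN.fullmatch(site_root) is not None
archived_allowed = PATTERN.fullmatch(archived) is not None

print("site root allowed:", root_allowed)
print("archived copy allowed:", archived_allowed)
```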

Oops
If you revisited your collection, you would find that only the root URLs of sites are allowed. The cause of the problem is the star at the end of the pattern. In wildcard syntax, this star means a sequence of almost any characters. In regular expression syntax, it means zero or more repetitions of the preceding pattern element, which in this case is a slash. The pattern would allow all of the following URLs, but nothing below the root:

http://www.mycorp.com
http://www.mycorp.com/
http://www.mycorp.com//
http://www.mycorp.com///

Instead of a star by itself, you need

.*
The dot matches almost any single character (in Python, anything except a newline). Following it with a star repeats that match, so together they match a sequence of almost any characters, including an empty one.
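
The difference between the two endings can be seen side by side in a short sketch. As before, whole-URL matching with re.fullmatch() is an assumption about how the filter is applied, and the sample URLs are illustrative.

```python
import re

# Assumption: filters must match the whole URL, hence fullmatch().
slash_star = re.compile(r"http://[^/]*.mycorp.com/*")  # star repeats the slash
dot_star = re.compile(r"http://[^/]*.mycorp.com/.*")   # ".*" matches the rest

urls = [
    "http://www.mycorp.com/",
    "http://www.mycorp.com//",
    "http://www.mycorp.com/products/index.html",
]

def allowed(pattern, url):
    """Return True when the pattern matches the entire URL."""
    return pattern.fullmatch(url) is not None

for url in urls:
    print(url, allowed(slash_star, url), allowed(dot_star, url))
```

Only the pattern ending in .* accepts the page below the root.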

Almost
http://[^/]*.mycorp.com/.*

The above regular expression will probably do a pretty good job, but it's not quite right. Remember, the dot in regular expression syntax means almost any character, and Python is interpreting the dots in the hostname to mean that as well. To make sure these only match a dot literally, escape them with backslashes:

http://[^/]*\.mycorp\.com/.*

Done
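This regular expression will allow all HTTP URLs from hosts under the mycorp.com domain, such as www.mycorp.com, and will reject URLs on other sites that merely contain the domain name, like the archived copy above.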
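
The effect of escaping the dots can be checked with a sketch like the following. The look-alike host name is made up for illustration, and whole-URL matching with re.fullmatch() is again an assumption.

```python
import re

# Assumption: filters must match the whole URL, hence fullmatch().
unescaped = re.compile(r"http://[^/]*.mycorp.com/.*")
escaped = re.compile(r"http://[^/]*\.mycorp\.com/.*")

look_alike = "http://wwwxmycorpxcom/"   # hypothetical host with no dots at all
real_site = "http://www.mycorp.com/about/"
bare_domain = "http://mycorp.com/"

def allowed(pattern, url):
    """Return True when the pattern matches the entire URL."""
    return pattern.fullmatch(url) is not None

for url in (look_alike, real_site, bare_domain):
    print(url, allowed(unescaped, url), allowed(escaped, url))
```

Under these assumptions the escaped pattern rejects the look-alike host. It also requires a dot before mycorp, so a URL on the bare host mycorp.com would need a separate filter if your site uses one.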

When writing regular expressions for matching URLs, follow these tips to avoid common pitfalls:

1. Use .* where you would use * in a wildcard expression.
2. Escape characters with special meaning in regular expression syntax by using \. The special characters you're most likely to encounter in a URL are ., ?, and +. A full list of special characters is given in the Python documentation for regular expression syntax.
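
If you build patterns in a script before pasting them into the admin interface, Python's re.escape() can apply the second tip for you. The sketch below is one way to do that. The sample URL is hypothetical, and whole-URL matching with re.fullmatch() is an assumption.

```python
import re

literal_url = "http://www.mycorp.com/view.cgi?id=1"

# Used as-is, "." and "?" keep their regex meanings, so the pattern
# does not even match the URL it was copied from.
naive = re.compile(literal_url)

# re.escape() backslash-escapes the characters that are special in a regex.
safe = re.compile(re.escape(literal_url))

naive_matches = naive.fullmatch(literal_url) is not None
safe_matches = safe.fullmatch(literal_url) is not None

# The same helper can escape just the literal part of a domain filter.
domain_filter = "http://[^/]*" + re.escape(".mycorp.com") + "/.*"

print(naive_matches, safe_matches)
print(domain_filter)
```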

Posted October 12, 2005 by editor

Don't Reindex Every Week!

Walter Underwood, Principal Software Architect, Verity

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has "continuous spidering with real-time indexing and adaptive revisit intervals." It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

The Ultraseek spider is always on, always ready to find URLs and index documents. This is called continuous spidering. When a new or changed document is found, it is immediately indexed and is available for the next query. This is called real-time indexing.

How does the spider decide when to revisit a URL to check for changes? It measures page change rates and adjusts to match them. This is called adaptive revisit intervals. For every URL it visits, the spider tracks how often the page changes. It uses that information to choose a revisit interval. If a page changes every day, it is visited every day. If another one changes every week, it is visited weekly.

Let's think about a sample site, with press releases and a page listing recent press releases. The page listing the press releases will change frequently, and Ultraseek will visit it often, finding the new press releases promptly. Individual press releases won't change, so Ultraseek will adjust their revisit interval to the maximum, about one visit per month.
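
To make the idea concrete, here is a toy model of an adaptive revisit interval. It is not Ultraseek's actual algorithm, which is not described here. The halving and doubling rule and the one-hour minimum are assumptions chosen for illustration, and only the roughly one-month maximum comes from the text above.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(hours=1)   # assumed lower bound, purely illustrative
MAX_INTERVAL = timedelta(days=30)   # "about one visit per month"

def next_interval(current, changed):
    """Shorten the interval after a change, lengthen it otherwise."""
    candidate = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))

def simulate(start, change_history):
    """Apply next_interval() once per visit and return the final interval."""
    interval = start
    for changed in change_history:
        interval = next_interval(interval, changed)
    return interval

# A press release that never changes drifts out to the maximum interval.
static_page = simulate(timedelta(days=1), [False] * 10)

# A listing page that changed on each of three visits is checked more often.
busy_page = simulate(timedelta(days=7), [True] * 3)

print(static_page, busy_page)
```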

So, if you are planning to set up regular revisit schedules, don't do it right away. Let Ultraseek run for a while and adjust to your website. Then, when you want a really fresh index, integrate your publishing system with Add URL. That will get new pages into the index in a few seconds, not just once a week.

Posted July 21, 2005 by editor

Keeping Your Index Fresh with Add URL

Walter Underwood, Principal Software Architect, Verity

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called "index freshness." A stale index misses new pages and holds outdated information, including pages that have already been deleted and old copies of pages that have changed since they were indexed. For maximum index freshness, use Ultraseek's Add URL feature to notify the spider of deleted, changed, or new URLs.

Add URL does what it says and a little more. You pass in a URL, and the spider adds it to the URLs it will crawl, putting it on the highest-priority queue. It also forces a revisit and reindex if the URL is already known to the spider. If a URL is new, it will be visited and indexed (if it isn't a duplicate). If the document at that URL is in the index and has changed, it will be reindexed. If there is no longer a document at that URL, the old document will be removed from the index.

Add URL has both a user interface and a programmatic interface. Users can access it from the help pages. Administrators can access it there, or on the URL tab of a spider collection. Programs can do exactly what the UI form does (send an HTTP POST or GET with the URL), or they can use the dedicated Java API or SOAP Web service.

Doing Add URL with HTTP is straightforward. Assuming that Ultraseek is installed at search.example.com, and that you are notifying it about changes at http://example.com/new.html, access this URL with a GET:

http://search.example.com/help/addurlgo.html?url=http://example.com/new.html
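
From a script, the same request could be assembled roughly as follows. This is a minimal sketch using only the Python standard library. The host, path, and url parameter come from the example above, while the function names are made up for illustration. Percent-encoding the parameter value is a general precaution for URLs that contain characters such as ? or &, not something the example above requires.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and path are taken from the example above.
ADD_URL_ENDPOINT = "http://search.example.com/help/addurlgo.html"

def build_add_url_request(changed_url, endpoint=ADD_URL_ENDPOINT):
    """Return the GET URL that reports changed_url to Add URL."""
    # urlencode() percent-encodes ":" and "/" inside the parameter value.
    return endpoint + "?" + urlencode({"url": changed_url})

def notify_add_url(changed_url, timeout=10):
    """Send the GET request and return the HTTP status code."""
    with urlopen(build_add_url_request(changed_url), timeout=timeout) as resp:
        return resp.status

request_url = build_add_url_request("http://example.com/new.html")
print(request_url)
```

Calling notify_add_url() performs the actual request, so it only makes sense against a real Ultraseek installation.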

For Add URL with the XPA Java API, see SpiderCollection.addURL in the Javadoc shipped with the library. For the SOAP Web service see the VisitURL operation in the Web services documentation.

The URL to be added is checked against the collection URL filters first, so the search administrator still has control over what is allowed in the collection. Web page authors and site webmasters can use Add URL to keep their pages fresh in the search index without making requests to the search administrator. This saves time for everyone.

Add URL usually updates the index very quickly, often in a few seconds. Sometimes, the document will have been visited and reindexed by the time the URL Status page is shown. Make sure that the website is updated before you send the Add URL notification. We had one customer send the Add URL request before pushing changes to the site, and even though the delay was only two seconds, Ultraseek visited the URL before the content arrived and got a 404 response from the Web server. A spider curfew or a suspended spider will also prevent Add URL from taking effect immediately. So make sure the content is pushed first, and allow the spider to run.

For the best possible index freshness, integrate Add URL into your publishing system. Right after publishing any change to your website, notify Ultraseek of all the URLs with changes through Add URL. Moments later, the search index will be updated and your changes will be published to both the site and the index, so that visitors can find the new pages through browsing or through search.
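
The ordering matters more than the mechanics, so a publishing hook might be structured like the sketch below. The publish and notify callbacks are hypothetical stand-ins for your own publishing step and for whatever Add URL client you use.

```python
def publish_and_notify(changed_urls, publish, notify):
    """Publish every change first, then report each URL to Add URL."""
    for url in changed_urls:
        publish(url)   # content must be live before the spider visits
    notified = []
    for url in changed_urls:
        notify(url)    # only now tell the search engine about the change
        notified.append(url)
    return notified

# Stand-in callbacks that only record the order of operations.
log = []
publish_and_notify(
    ["http://example.com/new.html", "http://example.com/index.html"],
    publish=lambda url: log.append(("publish", url)),
    notify=lambda url: log.append(("notify", url)),
)

for step, url in log:
    print(step, url)
```

Keeping all notifications after all publishing steps avoids the 404 race described above, even when one change touches several pages.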

Posted July 19, 2005 by editor

Less is More: Why fewer documents means better search results

When the right answer is not returned by the search engine, people tend to believe the engine has not found all of the content available on the network. The truth is that it may have discovered too much content.

For each valuable page of information on your network, there are at least 10 pages of useless information, such as log files, zip archives, and dynamic pages.

For example, do you need to index every page of an on-line calendar that goes to the year 2023? Probably not.

In fact, when this low-value content is included, users can be overwhelmed by the number of results that match their query term, especially if it is a common one.

One of our customers had 4.5 million pages in their index. They adjusted their spider to skip a lot of low-quality content, and ended up with 1.5 million pages. The quality of the search results improved dramatically. A smaller index, but better results.

To choose what should be in or out of your index, try some of your most popular queries. For each irrelevant result, ask yourself "should this be in the index?"

It is easy to control what gets indexed: adjust the filters on the Collections > Filters tab in the admin console. You can disallow content by matching its URL against a wildcard or regular expression pattern. The filters are matched from the top down, so make sure your specific disallow filters come before your more general allow filters.
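
The effect of filter order can be illustrated with a small model. The rules below are hypothetical examples, and the sketch assumes that the first matching filter decides, that filters must match the whole URL, and that a URL matching no filter is skipped.

```python
import re

# Hypothetical filter list, evaluated top-down; the first matching rule wins.
FILTERS = [
    ("disallow", re.compile(r"http://[^/]*\.mycorp\.com/calendar/.*")),
    ("disallow", re.compile(r"http://[^/]*\.mycorp\.com/.*\.(log|zip)")),
    ("allow", re.compile(r"http://[^/]*\.mycorp\.com/.*")),
]

def decide(url, filters=FILTERS, default="disallow"):
    """Return the action of the first filter that matches the whole URL."""
    for action, pattern in filters:
        if pattern.fullmatch(url):
            return action
    return default  # assumption: URLs that match no filter are skipped

for url in (
    "http://www.mycorp.com/calendar/2023/01.html",
    "http://www.mycorp.com/logs/access.log",
    "http://www.mycorp.com/press/2005.html",
):
    print(decide(url), url)
```

If the general allow rule were moved to the top, it would match first and the calendar and log pages would be indexed after all.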

by Ryan Weisenberger
Manager, Software Development

Posted June 20, 2005 by admin
