
Craft Powerful Regular Expressions

By Yusuf Mohsinally
Sr. Quality Assurance Engineer

Wildcards can be used to define filters that allow or disallow URLs in Ultraseek. They can also be used for various other purposes in the admin interface, such as to block some IP addresses or to increase the quality score of certain documents. In most cases, wildcards are sufficient. However, sometimes, you need a more powerful syntax to express your rules. Regular expressions are a syntax that can be used as an alternative to wildcards.

A First Attempt
Suppose you want to allow all URLs from your domain, mycorp.com. You might first attempt to define a wildcard allow filter for

http://*.mycorp.com/*

That would just about do it. It allows all the URLs from the domain, but it could also allow some that are not on the domain. For example, if someone linked to an archive of your site on "The Wayback Machine", at

http://web.archive.org/web/19990116234941/http://www.mycorp.com/

that URL would be allowed. The star from the wildcard expression could match all of

web.archive.org/web/19990116234941/http://www
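
The over-matching can be reproduced outside Ultraseek. The sketch below uses Python's fnmatch module as a stand-in for the wildcard filter. That is an assumption: fnmatch's star matches any run of characters, slashes included, which is the behavior described above, but it is not Ultraseek's own matcher. The on-domain page URL is made up for illustration.

```python
from fnmatch import fnmatchcase

# Assumption: fnmatch's "*" (any run of characters, slashes included)
# approximates the Ultraseek wildcard star described in the article.
WILDCARD = "http://*.mycorp.com/*"

on_domain = "http://www.mycorp.com/products/index.html"  # hypothetical page
archived = "http://web.archive.org/web/19990116234941/http://www.mycorp.com/"

on_domain_allowed = fnmatchcase(on_domain, WILDCARD)
archived_allowed = fnmatchcase(archived, WILDCARD)

# Both URLs pass the filter, including the archived copy on another host.
print(on_domain_allowed, archived_allowed)  # True True
```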

Use Regular Expressions
First, change the type of syntax by selecting the drop-down box that shows "wildcard" and changing it to "regex". Then enter an expression into the text box using the syntax for regular expressions. Ultraseek is built on the Python language and uses Python to process regular expressions, so you must use Python's syntax for regular expressions. Python's regular expression syntax is similar to Perl's.

Next Attempt
Try replacing the star in the wildcard expression with a more restrictive pattern:

http://[^/]*.mycorp.com/*

The contents of the square brackets define a set of characters. A caret at the start of the set negates it, so [^/] matches any single character except a slash. This pattern will allow a URL only if the sequence of characters between http:// and .mycorp.com does not contain a slash. The string that matched the star in the previous example contains a slash and therefore would not match the new pattern.
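
A quick check with Python's re module shows the archived URL being rejected. This is only a sketch: it assumes the filter has to match the entire URL, which is modeled here with re.fullmatch, and Ultraseek's exact matching mode may differ.

```python
import re

# Assumption: the filter must match the whole URL (modeled with fullmatch).
PATTERN = re.compile(r"http://[^/]*.mycorp.com/*")

archived = "http://web.archive.org/web/19990116234941/http://www.mycorp.com/"
root = "http://www.mycorp.com/"

archived_allowed = PATTERN.fullmatch(archived) is not None
root_allowed = PATTERN.fullmatch(root) is not None

# [^/]* cannot cross the slash after web.archive.org, so the archive is rejected.
print(archived_allowed, root_allowed)  # False True
```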

Oops
If you revisited your collection, you would find that only the root URLs of sites are allowed. The cause of the problem is the star at the end of the pattern. In wildcard syntax, this star means a sequence of almost any characters. In regular expression syntax, it means zero or more repetitions of the preceding pattern element, which in this case is a slash. The pattern would allow all of the following URLs:

http://www.mycorp.com
http://www.mycorp.com/
http://www.mycorp.com//
http://www.mycorp.com///
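
The same behavior can be seen in a short Python sketch. It assumes whole-URL matching (re.fullmatch), and the deeper page URL is invented for the example.

```python
import re

# Assumption: the filter must match the whole URL (modeled with fullmatch).
PATTERN = re.compile(r"http://[^/]*.mycorp.com/*")

urls = [
    "http://www.mycorp.com",
    "http://www.mycorp.com/",
    "http://www.mycorp.com///",
    "http://www.mycorp.com/products/index.html",  # hypothetical deeper page
]

# Map each URL to whether the pattern allows it.
results = {url: PATTERN.fullmatch(url) is not None for url in urls}

for url, allowed in results.items():
    print(allowed, url)
```

Only the last URL is rejected: the trailing /* accepts zero or more slashes and nothing else, so any URL with a path after the hostname fails.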

Instead of a star by itself, you need

.*

The dot is a pattern matching almost any single character (by default, any character except a newline). Using the star to look for repetitions of the pattern matches a sequence of almost any characters.

Almost
http://[^/]*.mycorp.com/.*

The above regular expression will probably do a pretty good job, but it's not quite right. Remember, the dot in regular expression syntax means almost any character, and Python is interpreting the dots in the hostname to mean that as well. To make sure these only match a dot literally, escape them with backslashes:

http://[^/]*\.mycorp\.com/.*
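
To see what the escaping buys you, the sketch below compares the two patterns against a lookalike hostname. Both test URLs are made up, and whole-URL matching via re.fullmatch is an assumption about how the filter is applied.

```python
import re

UNESCAPED = re.compile(r"http://[^/]*.mycorp.com/.*")
ESCAPED = re.compile(r"http://[^/]*\.mycorp\.com/.*")

# Hypothetical URLs: one on the domain, one on a lookalike domain.
real = "http://www.mycorp.com/products/index.html"
lookalike = "http://notmycorp.com/index.html"

def allowed(pattern, url):
    """Return True if the pattern matches the entire URL."""
    return pattern.fullmatch(url) is not None

# The unescaped dot before "mycorp" matches the "t" in "notmycorp".
print(allowed(UNESCAPED, real), allowed(UNESCAPED, lookalike))  # True True
# With the dots escaped, a literal "." is required, so the lookalike fails.
print(allowed(ESCAPED, real), allowed(ESCAPED, lookalike))      # True False
```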

Done
This regular expression will allow all URLs from the mycorp.com domain and only URLs from the mycorp.com domain.

When writing regular expressions for matching URLs, follow these tips to avoid common pitfalls:

1. Use .* where you would use * in a wildcard expression.
2. Escape characters with special meaning in regular expression syntax by preceding them with a backslash (\). The special characters you're most likely to encounter in a URL are ., ?, and +. A full list of special characters is given in the Python documentation for regular expression syntax.
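
If you have Python at hand, its standard re.escape function can do the escaping from the second tip for you. The sketch below is a convenience for preparing a pattern before pasting it into the admin interface, not an Ultraseek feature. The search URL is hypothetical, and the exact set of characters re.escape escapes varies between Python versions, so check the output before using it.

```python
import re

# Hypothetical URL containing ".", "?", and "+".
literal_url = "http://www.mycorp.com/search?q=a+b"

# Used as-is, "?" makes the "h" optional and "+" repeats the "a",
# so the pattern no longer matches the URL it was copied from.
unescaped_matches = re.fullmatch(literal_url, literal_url) is not None

# re.escape puts a backslash in front of the special characters.
escaped_pattern = re.escape(literal_url)
escaped_matches = re.fullmatch(escaped_pattern, literal_url) is not None

print(escaped_pattern)
print(unescaped_matches, escaped_matches)  # False True
```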

Posted October 12, 2005 10:14 AM by editor
Category: Indexing
