Craft Powerful Regular Expressions

By Yusuf Mohsinally

Wildcards can be used to define filters that allow or disallow URLs in Ultraseek. They can also be used for other purposes in the admin interface, such as blocking certain IP addresses or increasing the quality score of certain documents. In most cases, wildcards are sufficient. Sometimes, however, you need a more powerful syntax to express your rules. Regular expressions are an alternative syntax that can be used in place of wildcards.

A First Attempt

http://*.mycorp.com/*

This wildcard pattern would just about do it. It allows all the URLs from the domain, but it could also allow some that are not on the domain. For example, if someone linked to an archive of your site on "The Wayback Machine" at

http://web.archive.org/web/19990116234941/http://www.mycorp.com/

that URL would be allowed. The star from the wildcard expression could match all of

web.archive.org/web/19990116234941/http://www

Use Regular Expressions

Next Attempt

http://[^/]*.mycorp.com/*

The contents of the square brackets define a set of characters. The caret at the start of the set negates it, so any character except those listed can match. This pattern allows a URL only if the sequence of characters between http:// and .mycorp.com does not contain a slash. The string that matched the star in the previous example contains a slash and therefore would not match the new pattern.

Oops

http://www.mycorp.com

In regular expression syntax, a star does not mean "any sequence of characters" as it does in a wildcard. It means zero or more repetitions of the element before it, so the trailing /* in the pattern above matches only a run of slashes. Instead of a star by itself, you need .*

Almost

http://[^/]*.mycorp.com/.*

The above regular expression will probably do a pretty good job, but it's not quite right. Remember, the dot in regular expression syntax means almost any character, and Python interprets the dots in the hostname that way as well. To make sure they match only a literal dot, escape them with backslashes:

http://[^/]*\.mycorp\.com/.*

Done

When writing regular expressions for matching URLs, follow these tips to avoid common pitfalls:

1. Use .* where you would use * in a wildcard expression.
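The progression above can be checked with a short Python sketch. It rests on several assumptions: Python's fnmatch module stands in for Ultraseek wildcards, re.fullmatch stands in for however Ultraseek anchors its patterns (the article does not say), and the helper names and the "lookalike" URL are invented for illustration.

```python
import fnmatch
import re

# Patterns from the article, in the order they were developed.
WILDCARD = "http://*.mycorp.com/*"          # wildcard syntax, not a regex
NAIVE = r"http://[^/]*.mycorp.com/*"        # bare star only repeats the slash
UNESCAPED = r"http://[^/]*.mycorp.com/.*"   # dots still match any character
FINAL = r"http://[^/]*\.mycorp\.com/.*"     # dots escaped to match literally


def wildcard_allowed(url: str) -> bool:
    """Assumption: fnmatch approximates wildcard matching, where * spans slashes."""
    return fnmatch.fnmatchcase(url, WILDCARD)


def allowed(pattern: str, url: str) -> bool:
    """Assumption: the whole URL must match the regular expression."""
    return re.fullmatch(pattern, url) is not None


# Sample URLs. The page path and the lookalike host are hypothetical.
page = "http://www.mycorp.com/products/index.html"
archive = "http://web.archive.org/web/19990116234941/http://www.mycorp.com/"
lookalike = "http://wwwXmycorpYcom/index.html"

if __name__ == "__main__":
    print("wildcard allows archive:", wildcard_allowed(archive))
    for name, pattern in [("naive", NAIVE), ("unescaped", UNESCAPED), ("final", FINAL)]:
        for label, url in [("page", page), ("archive", archive), ("lookalike", lookalike)]:
            print(name, label, allowed(pattern, url))
```

Under these assumptions, only the final pattern accepts the ordinary page while rejecting both the archived copy and the lookalike host, whose "dots" are really other characters.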
Posted October 12, 2005 10:14 AM by editor