Archive: Indexing

Craft Powerful Regular Expressions
By Yusuf Mohsinally

Wildcards can be used to define filters that allow or disallow URLs in Ultraseek. They can also be used for various other purposes in the admin interface, such as to block some IP addresses or to increase the quality score of certain documents. In most cases, wildcards are sufficient. However, sometimes, you need a more powerful syntax to express your rules. Regular expressions are a syntax that can be used as an alternative to wildcards.

A First Attempt

http://*.mycorp.com/*

That would just about do it. It allows all the URLs from the domain, but it could also allow some that are not on the domain. For example, if someone linked to an archive of your site on "The Wayback Machine", at

http://web.archive.org/web/19990116234941/http://www.mycorp.com/

that URL would be allowed. The star from the wildcard expression could match all of

web.archive.org/web/19990116234941/http://www

Use Regular Expressions

Next Attempt

http://[^/]*.mycorp.com/*

The contents of the square brackets define a set of characters. The caret indicates that any character except what follows can match. This pattern will allow a URL only if the sequence of characters between http:// and .mycorp.com does not contain a slash. The string that matched the star in the previous example contains a slash and therefore would not match the new pattern.

Oops

http://www.mycorp.com

Instead of a star by itself, you need .*

Almost

The above regular expression will probably do a pretty good job, but it's not quite right. Remember, the dot in regular expression syntax means almost any character, and Python is interpreting the dots in the hostname to mean that as well. To make sure these only match a dot literally, escape them with backslashes:

http://[^/]*\.mycorp\.com/.*

Done

When writing regular expressions for matching URLs, follow these tips to avoid common pitfalls:

1. Use .* where you would use * in a wildcard expression.
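The difference between the escaped and unescaped patterns above can be checked with Python's standard re module. This is only a sketch: it assumes a pattern must cover the whole URL (re.fullmatch), which may not be exactly how Ultraseek applies its filters, and the helper name allows and the test URLs other than the Wayback Machine one are made up for illustration.

```python
import re

# Both patterns come from the article. The second one escapes the dots
# in the hostname so that they only match a literal dot.
UNESCAPED = r"http://[^/]*.mycorp.com/.*"
ESCAPED = r"http://[^/]*\.mycorp\.com/.*"


def allows(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the entire URL.

    Assumption: whole-URL matching via re.fullmatch. The article does not
    say how Ultraseek anchors its patterns.
    """
    return re.fullmatch(pattern, url) is not None


# A page on the domain is allowed.
print(allows(ESCAPED, "http://www.mycorp.com/index.html"))

# The archived copy is rejected, because [^/]* cannot cross a slash.
print(allows(ESCAPED, "http://web.archive.org/web/19990116234941/http://www.mycorp.com/"))

# A hypothetical look-alike host slips past the unescaped pattern,
# because each unescaped dot matches any character.
print(allows(UNESCAPED, "http://www-mycorp-com/"))
print(allows(ESCAPED, "http://www-mycorp-com/"))
```

Trying a handful of URLs that should and should not match is a quick way to catch an unescaped dot before the pattern goes into a filter.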
Posted October 12, 2005 by editor

Don't Reindex Every Week!
Walter Underwood, Principal Software Architect, Verity

If you have used other search engines, you probably had to manually configure your indexing schedule to make sure new content was found and indexed. This is not necessary with Ultraseek.

Ultraseek has "continuous spidering with real-time indexing and adaptive revisit intervals." It sounds complicated, but it means that Ultraseek will automatically spider most pages at the right times.

The Ultraseek spider is always on, always ready to find URLs and index documents. This is called continuous spidering.

When a new or changed document is found, it is immediately indexed and is available for the next query. This is called real-time indexing.

How does the spider decide when to revisit a URL to check for changes? It measures page change rates and adjusts to match them. This is called adaptive revisit intervals. For every URL it visits, the spider tracks how often it changes. It uses that information to choose a revisit interval. If a page changes every day, it is visited every day. If another one changes every week, it is visited weekly.

Let's think about a sample site, with press releases and a page listing recent press releases. The page listing the press releases will change frequently, and Ultraseek will visit it often, finding the new press releases promptly. Individual press releases won't change, so Ultraseek will adjust their revisit interval to the maximum, about one visit per month.

So, if you are planning to set up regular revisit schedules, don't do it right away. Let Ultraseek run for a while and adjust to your website. Then, when you want a really fresh index, integrate your publishing system with Add URL. That will get new pages into the index in a few seconds, not just once a week.
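The idea behind adaptive revisit intervals can be illustrated with a small Python sketch. This is not Ultraseek's actual algorithm, which the article does not describe in detail. The halve-or-double rule, the one-day minimum, and the name next_interval are all assumptions chosen for illustration; only the roughly one-month maximum comes from the text above.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(days=1)   # assumed lower bound, not from the article
MAX_INTERVAL = timedelta(days=30)  # "about one visit per month"


def next_interval(current: timedelta, changed: bool) -> timedelta:
    """Pick the next revisit interval for one URL.

    Hypothetical rule: visit twice as often after a change was seen,
    half as often otherwise, and stay within the two bounds.
    """
    proposed = current / 2 if changed else current * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


# An individual press release that never changes drifts to the maximum.
interval = timedelta(days=1)
for _ in range(6):
    interval = next_interval(interval, changed=False)
print(interval.days)

# A listing page that changed on the last visit is checked sooner.
print(next_interval(timedelta(days=8), changed=True).days)
```

Under this rule the press release listing page stays near the minimum interval while the individual releases settle at the maximum, which mirrors the sample site described above.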
Posted July 21, 2005 by editor

Keeping Your Index Fresh with Add URL
Walter Underwood, Principal Software Architect, Verity

Everyone wants their search index to be an accurate, timely reflection of their content. Ultraseek automatically revisits pages to find new URLs, and that is very effective, but some sites have even stronger requirements for how quickly documents need to be available in search results. This is called "index freshness." A stale index frequently misses new pages and holds old information, including pages that have already been deleted and old copies of pages that have changed since they were indexed.

For maximum index freshness, use Ultraseek's Add URL feature for notifications of deleted, changed, or new URLs.

Add URL does what it says and a little more. You pass in a URL, and the spider adds it to the URLs it will crawl, putting it on the highest-priority queue. It also forces a revisit and reindex if the URL is already known to the spider. If a URL is new, it will be visited and indexed (if it isn't a duplicate). If the document at that URL is in the index and has changed, it will be reindexed. If there is no longer a document at that URL, the old document will be removed from the index.

Add URL has both a user and a programmatic interface. Users can access it from the help pages. Administrators can access it there, or on the URL tab of a spider collection. Programs can do exactly what the UI form does (send an HTTP POST or GET with the URL), or they can use the dedicated Java API or SOAP Web service.

Doing Add URL with HTTP is straightforward. Assuming that Ultraseek is installed at search.example.com, and that you are notifying it about changes at http://example.com/new.html, access this URL with a GET:

http://search.example.com/help/addurlgo.html?url=http://example.com/new.html

For Add URL with the XPA Java API, see SpiderCollection.addURL in the Javadoc shipped with the library.
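The HTTP form of Add URL described above can be scripted with Python's standard library. This is a minimal sketch: the /help/addurlgo.html path and the url parameter come from the example above, while the function names, the timeout, and the choice to percent-encode the parameter value are assumptions made here. The article's own example passes the URL unencoded.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_add_url_request(search_host: str, changed_url: str) -> str:
    """Build the Add URL GET request for one changed page."""
    # Percent-encoding keeps any query string in changed_url from being
    # confused with the parameters of the Add URL request itself.
    query = urlencode({"url": changed_url})
    return f"http://{search_host}/help/addurlgo.html?{query}"


def notify_add_url(search_host: str, changed_url: str, timeout: float = 10.0) -> int:
    """Send the notification and return the HTTP status code."""
    request_url = build_add_url_request(search_host, changed_url)
    with urlopen(request_url, timeout=timeout) as response:
        return response.status


print(build_add_url_request("search.example.com", "http://example.com/new.html"))
```

A publishing script could call notify_add_url once for each URL it has just changed.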
For the SOAP Web service, see the VisitURL operation in the Web services documentation.

The URL to be added is checked against the collection URL filters first, so the search administrator still has control over what is allowed in the collection. Web page authors and site webmasters can use Add URL to keep their pages fresh in the search index without making requests to the search administrator. This saves time for everyone.

Add URL usually updates the index very quickly, often in a few seconds. Sometimes, the document will have been visited and reindexed by the time the URL Status page is shown.

Make sure that the website is updated before you send the Add URL notification. We had one customer do the Add URL before pushing changes to the site, and even though it was only a two-second delay, Ultraseek visited the URL before the content arrived and got a 404 response from the Web server.

A spider curfew or a suspended spider will prevent Add URL from taking effect immediately. So allow the spider to run and make sure the content is pushed first.

For the best possible index freshness, integrate Add URL into your publishing system. Right after publishing any change to your website, notify Ultraseek of all the URLs with changes through Add URL. Moments later, the search index will be updated and your changes will be published to both the site and the index, so that visitors can find the new pages through browsing or through search.

Posted July 19, 2005 by editor

Less is More: Why fewer documents means better search results

When the right answer is not returned by the search engine, people tend to believe the engine has not found all of the content available on the network. The truth is that it may have discovered too much content. For each valuable page of information on your network, there are at least 10 pages of useless information, such as log files, zip archives, and dynamic pages.
For example, do you need to index every page of an on-line calendar that goes to the year 2023? Probably not. In fact, with the inclusion of this low-value content, the user can be overwhelmed by the number of results that match their query term, especially if it is a common one.

One of our customers had 4.5 million pages in their index. They adjusted their spider to skip a lot of low-quality content, and ended up with 1.5 million pages. The quality of the search results improved dramatically. A smaller index, but better results.

To choose what should be in or out of your index, try some of your most popular queries. For each irrelevant result, ask yourself "should this be in the index?"

It is easy to control what gets indexed: adjust the filters on the Collections > Filters tab in the admin console. You can disallow content by matching its URL against a wildcard or regular expression pattern. The filters are matched from the top down, so make sure your specific disallow filters come before your more general allow filters.

by Ryan Weisenberger
Posted June 20, 2005 by admin
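The top-down filter matching mentioned above can be sketched in Python to show why order matters. This is an illustration only: the filter list, the example.com URLs, the is_allowed helper, and the choice to disallow URLs that match no filter are all hypothetical, and fnmatch wildcards stand in for Ultraseek's own wildcard syntax.

```python
from fnmatch import fnmatchcase

# Hypothetical filter list: specific disallow rules first, general allow last.
FILTERS = [
    ("disallow", "http://example.com/calendar/*"),
    ("disallow", "*.zip"),
    ("allow", "http://example.com/*"),
]


def is_allowed(url: str, filters=FILTERS) -> bool:
    """Apply filters from the top down. The first matching rule decides."""
    for action, pattern in filters:
        if fnmatchcase(url, pattern):
            return action == "allow"
    # Assumption: a URL that matches no filter is left out of the index.
    return False


calendar_page = "http://example.com/calendar/2023/01.html"

# The specific disallow rule is reached before the general allow rule.
print(is_allowed(calendar_page))

# With the order reversed, the general allow rule wins and the page gets in.
print(is_allowed(calendar_page, list(reversed(FILTERS))))
```

Reversing the list is a quick way to see the failure mode: once the general allow filter comes first, the disallow rules below it are never consulted for URLs on the site.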