ULTRASEEK SERVER AND ULTRASEEK SERVER CCE 3.1 RELEASE NOTES

New License Key Required if Upgrading From Version 1.x or 2.x - If you are upgrading an existing version 1.x or 2.x of Ultraseek Server, you may need to get a new license key from Ultraseek Corporation. You can tell whether your license key will work by the fourth number in it. This version of Ultraseek Server requires version 3 license keys, of the form xxxx-xxxx-x-3-x-xxx-xxxx-xxxx. If you do not have a version 3 license key, contact your Ultraseek Corporation sales representative or software-sales@ultraseek.com to get a new license key. Do not upgrade until you have a version 3 license key.

CHANGES SINCE VERSION 3.0

New Features

Ad Server Support - Ultraseek Server now has support for ad servers that can serve ads using specially formatted URLs, for instance Accipiter DirectServer or NetGravity AdServer Network.

Maximum Termlist Length - Under the "parameters" tab on the "server" admin screen, a new "maximum termlist length" parameter has been added. The maximum termlist length parameter specifies the maximum length of any termlist in an index file. To keep its index file size and query cache size limited, Ultraseek Server keeps track of a limited number of documents per term in each of its index files. This parameter specifies the maximum number of documents per term to keep track of. If, as a result of an index merge operation, a termlist would be longer than this parameter, the termlist is truncated to this length, and references to the least relevant documents are dropped. Smaller maximum termlist lengths allow the use of smaller query caches and result in smaller index files. Larger maximum termlist length values ensure the correct operation of the require (+) and exclude (-) query operators on popular terms. The maximum termlist length value need not be larger than the total number of documents indexed, since no termlist can contain more documents than the index holds.

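The truncation described for "maximum termlist length" can be pictured with a small Python sketch. This is an illustration only: the (document id, relevance) pair layout and the function name truncate_termlist are assumptions made for the example, not Ultraseek Server's actual index format or code.

```python
import heapq


def truncate_termlist(postings, max_length):
    """Keep at most max_length postings, most relevant first.

    postings is a list of (doc_id, relevance) pairs. This layout is
    hypothetical; the real index format is not documented here.
    """
    return heapq.nlargest(max_length, postings, key=lambda posting: posting[1])


# Four documents contain the term, but only two references are kept.
postings = [(1, 0.9), (2, 0.2), (3, 0.7), (4, 0.4)]
kept = truncate_termlist(postings, 2)
```

In this sketch documents 2 and 4 are dropped from the termlist, so a query that requires (+) or excludes (-) the term can no longer see them. That is the trade-off the parameter controls.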
Automatic Email Tests - There is now a simple way to test whether Ultraseek Server's automatic notification or bug-report mail generating mechanism is working. The system administrator can trigger the immediate generation of a test mail message by clicking on a "test" button next to the "Email addresses for important automatic notifications" or the "Email addresses for automatic bug reports" form element under the "parameters" tab on the "server" admin screen.

Title Cache Size - Under the "parameters" tab on the "server" admin screen, a new "title cache size" parameter has been added. The title cache size parameter specifies how much main memory to devote to Ultraseek's search engine title cache. This cache holds the per-hit title information displayed on search results pages, including each URL, title, summary, publisher, and date. This cache is useful for improving the performance of queries against large collections. Each document title record takes approximately 448 bytes. A size of 0 disables the title cache.

Disallow Parent Directory Links From Apparent Directory Listing Documents - Under the "filter" tab on the spider "collection" admin screen, there is a new checkbox labeled "Disallow parent directory links from apparent directory listing documents". By checking this box you can disable the following of links to parent directories from documents that are directory listings.

Ignore Case Differences in URLs - Under the "tuning" tab on the spider and scanner "collection" admin screens, there is a new checkbox labeled "Ignore case differences in URLs". When this box is checked, the spider ignores case differences in URLs. This box is unchecked by default for collections created before version 3.1, and is checked by default for all new collections.

Thesaurus Expansion for Queries - Administrators may put sets of synonyms in the thesaurus.txt file, one set per line.
When a query matches one of the terms in that file, the synonyms will be shown as a checklist, to be added to the query. The thesaurus edit interface is available from the Query Parameters section of the Server Parameters page. A default thesaurus of American and British spellings is provided.

Fielded XML Search - Element names can be mapped to search field names. This allows searches to be focused on a part of an XML document. The administrator can set mappings for particular DTDs (identified by the root element). There is a generic mapping for documents that do not match a specific mapping.

XML Namespaces - XML documents using the namespace mechanism can be parsed, and elements with namespace qualifiers can be mapped for fielded search.

CCE Per-URL Topic Assignments - Administrators can now assign specific URLs to topics, and mark them for quality, from three stars down to no stars. All "starred" pages are displayed before other pages when a topic is browsed. One, two, or three stars are displayed before each hand-picked URL. The stars do not appear in regular search results.

CCE Cross-references - Cross-reference topics are a new kind of topic in CCE. They provide a pointer to another topic. The cross-reference topic has its own name. Searches on a topic will match results in all subtopics, whether they are regular topics or cross-reference topics. Cross-reference topics are never shown as related topics.

CCE Related Topics Display - Under the "parameters" tab on the "server" admin screen, a new "show full topic path" option is available for related topics. When this is selected, the full path is shown rather than just the topic name.

CCE Topic Keywords - When editing a topic, keywords may be specified. If a query matches one of the keywords, the topic will be shown as a related topic.

CCE Reload Topics Definition - A reload button has been added to the Topics Status page. This will reload the topics from the topics.xml file on disk.
This should only be needed if that file has been modified outside of Ultraseek Server CCE.

Proxy Authentication - The spider and mirror collections now work with authenticating proxy servers using either "Basic" or NT Challenge/Response authentication. Proxy authorization values for spider collections are specified in the "Proxy Server Specification" section under the "network" tab on the "collection" admin screen. Proxy authorization values for mirror collections are specified in the "Parameters" section under the "status" tab on the "collection" admin screen. NT Challenge/Response authentication is only available when Ultraseek Server is running on Windows NT.

View Sites - The total number of indexed documents for each site is now shown on the "view sites" page.

NT Challenge/Response Authentication - Spider collections will automatically use NT Challenge/Response (NTLM) authentication (in addition to HTTP Basic authentication) if a web server requests it. This makes use of Microsoft security libraries, and is only available when Ultraseek Server is running on Windows NT.

Term Highlighting in Search Results Page - Under the "parameters" tab on the "server" admin screen, a new "highlight query terms in results page" parameter has been added. When this is selected, query terms are highlighted in titles and summaries on the search results page.

Cascading Style Sheets - HTML in the search home page and results pages has class attributes and
tags to support customization with CSS.

Customizable Dictionaries for Stemming - The built-in dictionaries for natural-language stemming can be augmented with additional entries. For each language, a dictionary of additional terms may be supplied. This allows domain-specific terms (like "webmasters") or variant usages ("landmines" instead of "land mines") to be processed by the stemmer (matching across plurals, past tense, etc.). After changes to the stemming dictionary, documents containing the added terms should be reindexed.

URL Override - A "URL override" field has been added to the "HTML Meta Tag Names" section under the "tuning" tab on the "collection" admin screen. This field allows you to specify a meta tag that will override the URL displayed for the document in a search results page. You can use "URL override" to index one document while having the link in the search results lead to another document. This would be useful if you wanted to provide search of an image, audio, or video document database.

Update Access - Users with restricted admin access can now be defined under the "users" tab on the "server" admin screen. When defining a new administrative user, access privileges to make updates to the "server", "topics", and "collection" admin modes can be selectively enabled. Collection-level update access can also be limited to selected collections. Access restrictions for existing users can be changed on the password change form.

Microsoft Exchange Public Folders - There is a new collection type that will allow you to index content that resides in Exchange Public Folders if that content has been made web-accessible.

Otherwise Allowed and Always Disallowed Sites - Under the filters tab for spider collections, otherwise allowed and always disallowed sites can now be listed for FTP and HTTPS sites in addition to the previous HTTP list.

Query Deduping Control - A new checkbox has been introduced to allow administrators to disable query-time deduping.
In the "Query Parameters" section under the "parameters" tab on the "server" admin screen, the "remove from results duplicate hits with the same URL" checkbox enables the automatic removal of duplicate hits with the same URL from the results list. Normally enabled, this is useful when searching multiple collections and there is an overlap of URL coverage between the collections. By un-checking the checkbox this query-time deduping can be disabled.

Provide Option to Search the Internet - In the "Query Parameters" section under the "parameters" tab on the "server" admin screen, a checkbox has been introduced to allow administrators to disable the "Search the Internet" option. When checked, the "Provide option to search the Internet" parameter specifies that an option to search the Internet is provided in the initial search form page and on results pages. By un-checking this box, this option can be disabled.

Implementation Changes

Faster XML Parser - James Clark's Expat parser is now used for XML. The ISO 8859-1 (ISO Latin 1) and UTF-8 encodings are decoded, but characters outside of ISO 8859-1 are not indexed.

Mirroring Between Big-Endian and Little-Endian Hardware - Index mirroring collections now work between big-endian and little-endian machines. The instance of Ultraseek Server sending the index automatically byte-swaps the binary data as it sends it.

Deferred Queuing in Topic Edits - When a topic is added or edited, the topics are saved to disk before any URLs are queued. In 3.0, administrators had to wait for a topic edit to complete in the background before making more changes to that topic. Now the topic is immediately available for further edits.

Expires HTTP Header Line - An "Expires:" header line in an HTTP response will be used to help set the next revisit time. Since the expiration time is a hint for caches, rather than a promise about when the page will change, the spider does not use that information unconditionally.
If a document has expired, but is unmodified when revisited, the revisit interval will be lengthened.

Collection Rules - In topic rules, a collection can be required or forbidden by specifying the field "collection:" with the internal name of the collection.

CCE Topics Mirrors - Topics mirroring in mirror collections is now pickier about when topic mirroring is allowed. Also, when the last topic mirror is unchecked, local topics are cleared.

CCE Related Topics and Subtopics Display - Related topics and subtopics are now shown on only the first page of search results. This combines easy access to the topics (on the first page) with access to search results without scrolling down (on the subsequent pages).

Spider Black Hole Warning Message - The spider now automatically sends an e-mail message to the administrator when it finds an unusually large number of URLs on a site. This warning message indicates that the spider may have found a "black hole" and is following an infinite number of dynamically generated links.

Refresh Meta Tags - The spider will follow URLs found in "refresh" meta tags.

New Form Variables

"ht" (Home Topic) - The index.html and query.html files now support the "ht" form variable. This variable can be used to specify an alternate browsing root of the CCE topic tree. This is useful if you would like to provide a topic browsing interface to a subtree of your entire topic tree. The default value of ht is 0, the topic id of the root of the CCE topic hierarchy. By explicitly specifying ht with an alternate topic id, you can provide an interface that limits browse topics and related topics to subtopics of that alternate root. The ht form variable is most useful when used in combination with the qp, col, and qc form variables to provide a complete search and directory interface to a subset of your content.

"si" (Search Internet) - The index.html and query.html files now support the "si" form variable.
If this variable is non-zero, then an option to search the Internet is presented in the search forms. The default value of si is determined by the "Search the Internet" option in the "Query Parameters" section under the "parameters" tab on the "server" admin screen. An explicit value of this variable is useful to override the default "Search the Internet" setting. For example, you could have the "Search the Internet" option enabled by default so your global intranet search page provides the option, but you could also provide a narrower site search facility with the option disabled, simply by having the URL of the site search form explicitly specify a value of 0 for si.

"ex" (Extra) - The query.html file now supports the "ex" form variable. Values of this form variable are appended with commas to the query text. This form variable is used internally in Ultraseek Server in checkboxes by the Thesaurus feature to add extra thesaurus words to queries.

RELEASE NOTES FOR THE LINUX RELEASE

SUPPORTED OS VERSION
--------------------
Ultraseek Server has been tested on RedHat Linux 5.1, with a kernel version of 2.0.34 and glibc 2.0.7-19 (upgrade).

REPORTING A BUG
---------------
When reporting a problem, be sure to include your Linux distribution, the version of your kernel (uname -r), and the version of your glibc. Send problem reports to software-bugs@ultraseek.com.

SUPPORT FOR POSTSCRIPT
----------------------
Ultraseek Server will index PostScript files if you have installed Ghostscript and it is on your path.

KNOWN PROBLEMS
--------------
* The standard version of glibc (2.0.7-17) that ships with RedHat 5.1 has problems with the thread manager that may cause Ultraseek Server to crash. If you do a 'ps' and see this:

    daemon 8644 0.0 0.0 0 0 p0 Z 14:13 0:00 (pyseekd )

  and the pid is the next-to-the-smallest pid associated with Ultraseek Server, this is your problem. You need to upgrade to at least the 2.0.7-19 version of glibc.

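The glibc requirement above comes down to comparing two "version-release" strings. A minimal Python sketch of that comparison follows. The helper names parse_glibc_version and glibc_at_least are hypothetical and are not part of Ultraseek Server; the sketch assumes purely numeric strings of the form shown in these notes (for example 2.0.7-19). How you obtain the installed version string depends on your distribution and is not shown.

```python
def parse_glibc_version(text):
    """Split a 'version-release' string such as '2.0.7-19' into
    a tuple of version numbers and an integer release number."""
    version, _, release = text.partition("-")
    numbers = tuple(int(part) for part in version.split("."))
    return numbers, int(release or 0)


def glibc_at_least(installed, required="2.0.7-19"):
    """Return True if the installed glibc is at least the required one.

    The default of 2.0.7-19 is the minimum named in KNOWN PROBLEMS.
    """
    return parse_glibc_version(installed) >= parse_glibc_version(required)
```

With these assumptions, glibc_at_least("2.0.7-17") is false for the stock RedHat 5.1 library and glibc_at_least("2.0.7-19") is true for the upgraded one.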
Patchlevel 1

Fix Windows NT PDF CreationDate/ModDate parsing problem.
Fix navigation link problem with advanced query and search these results.
Fix problem with advanced query and thesaurus terms.
Fix problem with find similar and search these results.
Add Expat copyright notice to help/copyright.html.
When processing query, only stem capitalized words when using German stemmer.
Fix mirror collection creation blowup when not using CCE.
Put cmp2 back into topic.py.

Patchlevel 2

Catch KeyError when deleting dictionary entries.
When sending mail through SMTP make sure we include a \r before every \n.
Catch EPIPE in httpsrver.py.
Fix error on 401 with an empty www-authenticate.
Fix error in proxy authentication.
Fix "pending" error in init_sitelist_opened.
Fix "view sites" problem with version 3.0 help files.
Make the "HTML Meta Tag Names" applicable to XML documents.
Delete old z3950 log files.
Improve phrase searching to handle phrase fragments, like "end of" or "the end".
Recognize "C++", "COBOL++", etc. as phrases.
Increase internal buffers to handle more text, and generate more informative messages when internal limits are reached.
Fix problem parsing wildcard patterns ending with $.
Add Expat copyright notice to admin/help/copyright.html.
Update year in copyright notices to 1999.
Add .shtml to the initial "doc types" table of known file extensions.
Improve term highlighting to handle required terms and quoted phrases.
Include the version number in "restarted" log messages.
Fix blowup in query report when query log file has a bad line in it.
Fix blowup when using pyqueryall.spy and a required phrase has no entries in the index.
Don't truncate long URLs used as titles in saquery.spy.
Update inline DTD in topics.xml file. is obsolete, replaced by .
Enclose "the name" advanced query elements in doublequotes.
Don't report an error if a year entered in the advanced query form is earlier than 1970 or later than 2037.
Fix a delmap merging bug.

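The Patchlevel 2 SMTP item (a \r before every \n) can be illustrated with a short Python sketch. This is not the code used in Ultraseek Server; the function name to_smtp_line_endings is made up for the example. It only shows one way to make sure every line feed in a message is preceded by a carriage return without doubling carriage returns that are already present.

```python
import re


def to_smtp_line_endings(text):
    """Return text with every bare LF converted to CRLF.

    Line feeds that already follow a CR are left alone, so applying
    the function twice gives the same result as applying it once.
    """
    return re.sub(r"(?<!\r)\n", "\r\n", text)
```

For example, to_smtp_line_endings("Subject: test\n\nbody\r\n") returns "Subject: test\r\n\r\nbody\r\n".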
Patchlevel 3

Fix "exceptions.TypeError: loop over non-sequence" bug that happens when "Remove from results duplicate hits with the same URL" option is disabled.
Make oifilter /tmp file cleanup pattern more specific. It now only matches alphanumeric filenames starting with three capital letters (no dots).
Generate sufficient phrases at index time.

Patchlevel 4

Fix problem parsing links in HTML documents that are relative URLs but contain colons.
Don't blow up if we can't get the site from the URL of a document to be deleted from the index.
In spider and scanner, when revisiting URLs of duplicate documents, force a reindex if the original is not found in the index.
Fix problem with URL override HTML Meta Tag.
Add a form field for the domain when indexing exchange public folders.
Make sure exchange collection parameter changes will take effect immediately.
Make sure NTLM security will be used when both NTLM and Basic authorization are requested.
URL quote the mailalias used for exchange public folder indexing.
Change qt maxlength from 2033 to the correct value of 1991 in indexform.html, queryform0.html, and queryform1.html.
Improve the way we compute our heap size on Windows NT.
Fix "exceptions.NameError: reindex" problem in netnewscolprmconfirm.html.
Fix segmentation fault merging no indexes.
Prevent IE 4.0 blowup on svcgo.html.
Version 6.0 INSO filters.
Allow locale to be specified in the configuration file.
Enforce 5 minute min interval for scanner, mirror, netnews, exchange, and merge collections.
Allow stemmed uncapitalized phrase matches of capitalized phrases.
Change the "handle double-byte Chinese characters" option from a per-collection tuning option to a global server parameters option.
Eliminate the confusing "allow documents with non-western characters" per-collection tuning option.
When importing topics from a sitemap, check for no URL given.
Version 6.0 INSO filters for Linux.
Extract PDF document properties in Linux.

Patchlevel 5

Add support for HP-UX.
Add diagnostic logging to exchange collection.
Improve NT performance of GETs on long file names.
Remove potential crash with GETs on long file names.
Pass si form variable through advanced query forms instead of forgetting it.
Make "Try this query on the entire web" visible on qti instead of qt so it works on advanced query results.
Fix xpa related_topics error when CCE is disabled.
Fix xpa urlstatus error when url not found.
Fix xpa insert URLdecode problem.
New "Duplicate Determination" option: "don't do any document de-duping".
Fix allhits to allow the first document in the index to be returned.
Log warning instead of shutdown if cannot set locale at initialization.
Update link to Mozilla Public License in agreement.
Correct nesting errors in
tags.
Add spaces to make HTML comments pass weblint.
Corrected spelling in thesaurus.
Fix "exceptions.NameError: proxyauth" in getdoc_http_https when using proxy and keep-alives.
Fix misspelling of KeyError in adddoc_sitelist.
Extend checksum to cover entire document and eliminate false "essentially unchanged" decisions in indexing.

Patchlevel 6

Automatically recompute optimal "maximum termlist length" and "termlist cache size" parameters whenever a new license key is entered.
Read all pages of exchange public folders messages.
Send more HTTP headers when importing topics from a URL.
Strip blanks from ends of URL when importing topics.
Catch zero-length CCE rules when queuing URLs to be re-indexed.
Fix xpa insert "seek.error: invalid language" problem.
Add volume type info to OS Status on Windows NT.
Improve performance receiving extremely large POSTs in http server.
Send warning e-mail when disk free space might not be sufficient for a merge.
Skip comments in thesaurus.txt.
Check for URL too long in idoc.insert().
Add anti-loop checks when editing both parent and topic in a cross-reference topic.
Don't allow subtopics of cross-reference topics.

Patchlevel 7

Fix problem parsing HTML documents ending in stopindex mode.
Change initial default netnews poll interval to 1 hour.
Add XPA support for SearchResult.getTopics() and for topic administration.
Use smaller initial defaults for maximum heap size, termlist cache size, title cache size, and in-memory index size in big memory configurations to reduce the number of MemoryError exceptions.
When logging HTTP redirects encountered by the spider, include the URL the redirect gives and also note if the URL is not allowed by the URL filter.
In spider and scanner, when revisiting URLs of duplicate documents, force a reindex if the original was recently reindexed. This should reduce the number of false duplicates.
In spider, cache fewer open sites, reducing memory and file descriptor consumption in big configurations, but increasing the amount of work done opening and closing site URL database files.
In spider, try three times to open a site URL database file, sleeping in between, before giving up due to a bsddb.error exception. This should reduce the occurrence of spurious 'Permission denied' bsddb.error exceptions in Windows NT due to a race condition.
Add the .phtml, .rfc822, and .vcf file extensions to the list of known file extensions.
Include the util/exception.html page in distributions so error pages can be customized.
Add support for message and multipart content types. This includes renaming the unused parameter "charset" to "params" in the parse procedure patched in patches.py. Also extract uuencodes from text/plain documents. Netnews collections now index articles with attachments.
Don't show "Find Similar" link in onehit1.html if iterms variable is empty.
Recognize URLs with Apache httpd directory index column-sorting parameters as directory URLs (that is, /?M=A to sort ascending by modified date).
Save actual referring URLs in the spider URL database, not just URL hashes, and change urlstatusgo.html to show these referring URLs. This will take more room in the URL database, but will show the referring URL in the "URL status" page even if the document with the referring URL isn't in the search index.
Fix "exceptions.NameError: reindex" problem in scannercolprmconfirm.html and indexercolprmconfirm.html.
Allow license keys to work for aliases and binding addresses as long as they work in the DNS.
Fix SMTP mail client to not send extra blank line after message.
Remove qdb, bfi, and toc code.
Change spider to only present HTTP authentication header when challenged for one. This fixes an incompatibility with Microsoft IIS where it would respond with a 401 if it got an authentication header when it didn't expect one.
Fix "exceptions.OverflowError: float too large to convert" problem in merge_space_check.
Raise the maximum allowed values for "Maximum number of directories in a URL" and "Maximum number of hops from root URL" from 9999 to 2147483647.
Remove nonfunctional "Recognize uncapitalized proper names" option.
Change "Maximum termlist length" server parameter from a dropdown to a text input to allow a much wider selection of values.
Fix to CCE to allow URLs to be starred in more than one topic.
Ignore empty links (href attributes) in XML documents.
In HTML parser, if an apparently malformed entity reference ends with an equals sign, interpret it as an unescaped ampersand instead.

Patchlevel 8

In spider, when "Filter Lotus Domino Navigation Links" is enabled and the page being processed has a URL ending with ?OpenView, ?OpenDatabase, or ?OpenNavigator, automatically construct a link to the same URL but with &ExpandView appended to it. Also adjust Domino filter patterns so that they only look in the query string.
Clean up HTML in Edit Topic admin page.
IE 4.x on Macintosh caches pages regardless of the "no-cache" pragma. Add Microsoft-approved JavaScript workaround to force "no-cache" on admin pages.
Add semicolon statement separators to multi-statement JavaScript onsubmit form attributes on help and admin pages so Netscape 6.0 is happy.
Add support for user-defined file type filtering.
In parsehtml.c strip trailing newline characters from ends of URLs.
Fix JavaScript typo in svc.html so that the termlist cache length parameter can be changed to a value greater than the minimum allowed.
Fix "exceptions.AttributeError: 'None' object has no attribute 'urls'" in indexer.py when deduping and topics are allowed by the license key but no topics are defined.
Ignore capitalization when looking for "Basic" authentication scheme in WWW-Authenticate HTTP MIME headers.
When indexing and walking a topic's parents and cross-references, catch invalid topic IDs (caused an exceptions.KeyError in indexer.py).

Patchlevel 9

Prevent "exceptions.TypeError: expected integer index" in spider saveinfo().
Handle "exceptions.EOFError: EOF read where object expected" in spider getpathdict() and scanner geturldict().
Improve error messages from XML parser when there are illegal characters in a document.
In spider automatically disallow URLs with adjacent slashes in path.
Fix problem deleting topics when reloading topics.xml.
Handle binascii.Error when decoding malformed base64 rfc822 attachments.
Allow administrative users with "server" access to reload scripted HTML pages.
Fix mutex leak in idbm.c.
Replace   with   in python-scripted HTML.
Avoid "exceptions.TypeError: string without null bytes, string" in config save_locked_internal().
Don't report error on "interrupted system call" in httpsrvr step().
Fix collection: field so that it is not stemmed.
In Linux, do zmap as mmap of /dev/zero.
Adobe Acrobat 4.0 filter.
Update copyrights, trademarks, license agreement, and links for Ultraseek Corporation.
Close exchange sessions after indexing.
Save related_topics_count in configuration file.
Make sure that summary returned by parsehtml is null-terminated.
Modify ultraseek rc script so that when stopping, it first tries to gracefully stop ultraseek server using SIGTERM, and if that fails after 10 seconds, it uses SIGKILL.
Fix mirror collection to not delete old indexes prematurely, which was causing merge collections to fail.

Patchlevel 10

Change term+ to term* in saquery.xml dtd, as no terms are included when a search matches no results.
Fix mirror collection problem that causes unnecessarily large amounts of elapsed time and disk space to be used sometimes when mirroring large active collections.
Replace references to infoseek.com with ultraseek.com.
Remove circle-i logo from images.
Rearrange "os status" output so it is useful even when Ultraseek Server is in a restarting loop.
Update copyrights, trademarks, license agreement, and links for Inktomi Corporation.
Prevent thesaurus expansions for URL-like search terms (url:, site:, link:, and so on).
Fix runaway CPU and memory consumption when license key is invalid and there were more than two acceptable hostnames that the key was checked against.

Patchlevel 11

Fix "exceptions.NameError: text" parsing multipart documents.
Change Ultraseek Corporation to Inktomi Corporation on copyright page.
Change HTML parser so that it accepts tags ending with slashes so that it accepts XHTML.
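
The last Patchlevel 11 item concerns tags that end with a slash, as in XHTML. The sketch below shows what accepting such tags looks like, using the html.parser module from the Python standard library as a stand-in. It is not Ultraseek Server's own HTML parser, and the TagCollector class and collect_start_tags function are invented for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every start tag, including XHTML-style
    self-closing tags such as <br/>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports <br/> through handle_startendtag, whose
        # default implementation calls this method and then
        # handle_endtag, so self-closing tags are recorded here too.
        self.start_tags.append(tag)


def collect_start_tags(markup):
    """Return the start tag names found in markup, in document order."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags
```

A parser that accepts the trailing slash treats <br/> and <img src="x.png" /> as ordinary br and img tags instead of rejecting or mangling them.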