Release Notes for Inktomi Search Software 4.0
---------------------------------------------

New License Key Required if Upgrading From Version 1.x, 2.x, or 3.x
-------------------------------------------------------------------

If you are upgrading an existing version 1.x, 2.x, or 3.x of Ultraseek Server, you may need to get a new license key from Inktomi. You can tell if your license key will work by the fourth number in it. This version of Inktomi Search Software requires version 4 license keys, of the form xxx-xxx-xxx-4-xxx-xxx-xxx-xxx. If you do not have a version 4 license key, contact your Inktomi Corporation sales representative or software-sales@ultraseek.com to get a new license key. Do not upgrade until you have a version 4 license key.

Unicode
-------

Inktomi Search 4.0 uses Unicode to represent text internally. Documents to be searched may use any of more than 150 character encodings. Each document is converted into the internal Unicode character set, so all documents are searched, regardless of their original encoding.

Unicode includes all major world scripts (alphabets or other writing systems). For example, documents written with Greek, Cyrillic, or Thai characters will be properly decoded and indexed.

Unicode also includes ideographic characters from the major character encodings for Japanese, Chinese, and Korean. Effective search for these languages also requires the optional language-specific packages for Chinese, Japanese, or Korean.

HTML pages generated by Inktomi Search 4.0 are transmitted in a character encoding appropriate to the language selected. You can specify the default language and encoding in the "Server Parameters" section under the "parameters" tab on the "server" admin screen.

Inktomi Search 4.0 uses version 2.1.2 of Unicode. See www.unicode.org for more information on the Unicode standard.

Asian Language Support
----------------------

Optional language support for Chinese and Japanese is now available.
The support provides word segmentation for languages that do not use spaces between all words and provides stemming for inflected languages.

Localized HTML Pages
--------------------

Inktomi Search 4.0 now provides localized search and help pages in fourteen languages: Simplified Chinese, Traditional Chinese, Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Norwegian, Portuguese, Spanish, and Swedish. The initial search form, search results pages, and search help pages are localized. Localized search and help pages are included in the optional language support for the respective language.

Improved Index Format
---------------------

Inktomi Search 4.0 can make indexes in an improved index format. This new index format supports extensible title records and high precision URL and phrase searches.

Inktomi Search 4.0 still fully supports the old index format. One Inktomi Search 4.0 instance can search collections in both index formats, with two important limitations. The first limitation is that on one Inktomi Search 4.0 instance, all spider, scanner, netnews, exchange, and indexer collections must use the same index format. The second limitation is that merge collections can only work with source collections that are all in the same index format.

A configuration parameter controls whether index files are made in the new format. On new Inktomi Search 4.0 instances, the default behavior is to make index files in the new format. When an existing instance is upgraded, Inktomi Search 4.0 will offer to automatically clear all spider, scanner, netnews, exchange, and indexer collections so that they are rebuilt with the new index format. If you decline the offer, Inktomi Search 4.0 will continue to make index files in the old format. You can switch to the new index format at a later time by unchecking the "Make pre-version-4.0 index files" checkbox in the "Server Parameters" section under the "parameters" tab on the "server" admin screen.
Since older versions of Ultraseek Server do not understand the new index format, if you are using mirror collections, make sure that your Ultraseek Server instances with the mirror collections are upgraded to Inktomi Search 4.0 before you convert the Inktomi Search 4.0 instance that holds the mirrored source collection to the new index format.

Server Parameters
-----------------

Language - Available in the "Server Parameters" section under the "parameters" tab on the "server" admin screen, the "language" parameter specifies the default language for the server's query and help pages.

Encoding - Available in the "Server Parameters" section under the "parameters" tab on the "server" admin screen, the "encoding" parameter specifies the default character encoding for pages returned by the built-in webserver.

Make Pre-Version-4.0 Index Files - Available in the "Server Parameters" section under the "parameters" tab on the "server" admin screen, the "Make pre-version-4.0 index files" checkbox controls the format of index files on all spider, scanner, netnews, exchange, and indexer collections.

Handle Double-byte Chinese Characters - The old "Handle double-byte Chinese characters" checkbox has been eliminated. It is no longer necessary.

Tip file - The old "Tip file" parameter has been eliminated. It is no longer used. Tips are now localized by language and are stored in each language configuration XML file.

Show Search Image on Initial Search Form Page - The old "Show search image on initial search form page" checkbox has been eliminated.

Document Type Parsing - In the "Document Type Parsing" section under the "doc types" tab on the "server" admin screen, the "Q" column has been eliminated.

All Collections
---------------

Scheduled Activity - A new "Scheduled Activity" section now exists under the "tuning" tab on the "collection" admin screen. In this section you can schedule tasks to be started at regular intervals. The tasks that can be scheduled depend on the type of the collection.
For example, you can specify that a scanner collection start rescanning every night, or that a spider collection start revisiting a particular site every week.

Activity Curfew - The old "Schedule" section under the "tuning" tab on the "collection" admin screen has been renamed the "Activity Curfew" section, since the section controls when the collection is not allowed to run at all. The new "Scheduled Activity" section is more useful for specifying that the collection start specific new tasks at regular intervals.

Spider, Scanner, Netnews, Exchange, and Indexer Collections
-----------------------------------------------------------

Default Encoding - Located in the "Spider Tuning Parameters" section under the "tuning" tab on the "collection" admin screen, the "Default encoding" option specifies which character encoding to use to decode documents when no encoding can be determined by any other means.

Title Record - On Inktomi Search 4.0 instances configured to make indexes in the new format, a new "Title Record" section exists under the "tuning" tab on the "collection" admin screen. This section specifies how document metadata is assembled into title records as documents get indexed. Title records contain the information that is displayed on search results pages.

Title Record Size - Located in the new "Title Record" section, the "Title record size" parameter specifies the amount of space on disk to reserve for each title record. The size can be any value from 512 bytes to 8192 bytes. Choose the smallest title record size that can hold the information you need. Inktomi Search 4.0 performs best when the size is an integer multiple of 512 bytes.

Maximum Component Lengths - Located in the new "Title Record" section, the "Maximum component lengths" title record parameters specify the maximum number of characters allowed in each variable-length title record component. Default values have been provided for the standard components title, description, URL, and publisher.
No component is allowed to contain more characters than the title record size minus 129 bytes. For all but the URL component, Inktomi Search 4.0 automatically truncates the text of the component to fit. For the URL component, Inktomi Search 4.0 disallows longer URLs.

Extra Title Record Components - Extra title record components can be specified below the standard components in the new "Title Record" section. To specify an extra component, enter the field from which you want the text for the component to come, and also specify the maximum number of characters you would like saved. For example, if you would like up to 63 bytes of the contents of the "author" meta tag to be displayed in your search results, specify the component "author:" and the maximum length of 63. For each document containing an "author" HTML meta tag, Inktomi Search 4.0 will automatically save the contents of the meta tag in the title record for the document, and will automatically display the contents on search results pages.

Allowable Languages - The list of languages in the "Allowable Languages" section under the "filters" tab has been extended to include Arabic, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Icelandic, Italian, Hebrew, Japanese, Korean, Lithuanian, Latvian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Swedish, Thai, Turkish, Simplified Chinese, and Traditional Chinese.

Quality Factor - The "Quality Factor" section under the "dupes" tab on the "collection" admin screen has been improved. The range of quality values has been expanded fourfold to a minimum value of -16 and a maximum value of +15, and the "Add one to a documents quality for every N documents on other sites that link to it." setting has been removed. The section is now used to specify a "baseline" quality based on URL pattern matches.
The indexer will assign a quality factor to each document it indexes based on this baseline quality value, with some automatic adjustments based on built-in heuristics. The forms have been improved so that it is easier to assign baseline quality values to a larger number of URL patterns.

Spider Collections
------------------

Disallow FrontPage Special Directories - Located in the "URL Filter Specification" section under the "filters" tab on the "collection" admin screen, the "Disallow FrontPage special directories" checkbox controls the indexing of documents in special directories used by the Microsoft FrontPage Server Extensions. The box is checked by default. This new rule is equivalent to entering these filter patterns in the topmost "disallow" box:

    http*://*/_vti_*          [disallow]
    http*://*/_derived/*      [wildcard]
    http*://*/_overlay/*
    http*://*/_borders/*
    http*://*/_themes/*
    http*://*/_fpclass/*
    http*://*/_private/*

Accept Cookies - Located in the "Spider Tuning Parameters" section under the "tuning" tab on the "collection" admin screen, the "Accept cookies" checkbox controls whether the spider should accept HTTP session cookies from each site it spiders and return these cookies in subsequent HTTP requests. This box is checked by default.

Remove Session IDs from URLs - Located in the "Spider Tuning Parameters" section under the "tuning" tab on the "collection" admin screen, the "Remove session IDs from URLs" checkbox controls whether the spider should remove commonly used session IDs from newly discovered URLs before adding them to the URL database. This box is checked by default.

Scanner Collections
-------------------

Roots and Filter Changes that Require Reindexing - If a change is made under the "roots" or "filters" tab that more tightly constrains the directories scanned, Inktomi Search 4.0 requires that the collection be cleared. A confirmation page is shown so that you can cancel the change.
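
As an illustration of the "Disallow FrontPage special directories" rule described under Spider Collections, the sketch below tests URLs against the seven patterns listed there. It is written in present-day Python 3, not the Python 1.6 runtime bundled with the product, and it assumes that the filter wildcard "*" behaves like the shell-style "*" in Python's fnmatch module (matching any run of characters, including "/"). The release notes do not state the exact matching semantics, and the function name is_frontpage_special is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the "Disallow FrontPage special directories" rule.
FRONTPAGE_PATTERNS = [
    "http*://*/_vti_*",
    "http*://*/_derived/*",
    "http*://*/_overlay/*",
    "http*://*/_borders/*",
    "http*://*/_themes/*",
    "http*://*/_fpclass/*",
    "http*://*/_private/*",
]


def is_frontpage_special(url):
    """Return True if url matches any of the FrontPage patterns.

    Assumption: "*" matches any run of characters, as in fnmatch.
    The product's own wildcard matcher may differ in detail.
    """
    return any(fnmatchcase(url, pattern) for pattern in FRONTPAGE_PATTERNS)
```

Under that assumption, a URL such as http://example.com/_vti_bin/shtml.dll would be disallowed, while http://example.com/index.html would not.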

Netnews Collections
-------------------

Filter Changes that Require Reindexing - If a change is made under the "filters" tab that more tightly constrains the newsgroups followed, Inktomi Search 4.0 requires that the collection be cleared. A confirmation page is shown so that you can cancel the change.

Indexer Collections
-------------------

Direct Indexing Using the Inktomi XPA Search API - Indexer collections can now be created from the "new collection" form. These collections are filled by direct indexing using the XPA Search API. Inktomi Search 4.0 performs no activity to fill these collections. Instead, each collection is filled by an external Java client application using the Inktomi XPA Search API.

Diagnostic Support
------------------

Memory Warning - If the current license key specifies a configuration with memory requirements that exceed the current memory configuration, a warning is displayed at the top of the "activity" admin screen.

Support Resources - The green "help" button at the top of every admin screen now leads to a "Support Resources" screen. On this screen are links to useful Inktomi Search software support resources.

Ask Software Support Question - Available from the new "Support Resources" page, the "Ask Software Support Question" page provides a form that can be used to ask a question of an Inktomi Search software support representative. The question is submitted to the appropriate e-mail address along with your current configuration information.

Improved Status Information - Improvements have been made to the status information included in the "os status" page and in automatic bug reports. The status information now includes more detail about the current collections.

Recent Log Messages - The record of recent log messages now includes log messages from before the last restart.

Log files - Log files maintained by Inktomi Search 4.0 are encoded using UTF-8.
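
Because the log files are UTF-8, any external tool that reads them should decode them as UTF-8 explicitly and not rely on the platform default encoding. The following is a minimal sketch in present-day Python 3 (not the bundled Python 1.6). The function name read_log_lines is hypothetical, and nothing is assumed about the layout of a log line beyond its encoding.

```python
def read_log_lines(path):
    """Return the lines of a UTF-8 encoded log file as a list of str.

    Bytes that are not valid UTF-8 are replaced instead of raising an
    error, so a single damaged entry does not stop the whole read.
    """
    with open(path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log]
```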

Upgrading Customizations
------------------------

HTML Customizations - You may be able to use customized Python-scripted HTML results pages from an earlier version of Ultraseek Server, provided you do not mix old and new HTML pages. However, we strongly recommend re-customizing the new pages, to make use of bug fixes and new features like localization.

Patches.py - If you have made customizations to your patches.py file, you will not be able to use your old customized patches.py file verbatim. Changes have been made to some parameters and return values of the procedures patched in the patches.py file. If you have customized your patches.py file, you should re-do your customizations in the patches.py file included in this release. The new patches.py file is located in the lib/python1.6 directory under the install directory. An existing patches.py file located in lib/python1.5 will not be read.

New Python Version - Inktomi Search 4.0 is based on Python 1.6. If your customizations have made use of code or features from an earlier version of Python, we recommend updating your customizations.
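
Since a patches.py left in lib/python1.5 is silently ignored, it can be useful to check an install directory for a leftover copy before assuming old customizations are active. The sketch below is one way to do that in present-day Python 3 (not the bundled Python 1.6). The function name patches_status and the shape of its return value are hypothetical; only the two directory names come from the notes above.

```python
from pathlib import Path


def patches_status(install_dir):
    """Report which patches.py files exist under install_dir.

    lib/python1.5/patches.py is the old location, which is not read.
    lib/python1.6/patches.py is the location used by this release.
    """
    root = Path(install_dir)
    old = root / "lib" / "python1.5" / "patches.py"
    new = root / "lib" / "python1.6" / "patches.py"
    return {
        "old_present": old.is_file(),
        "new_present": new.is_file(),
    }
```

If "old_present" is true, the customizations in that file need to be re-done in the lib/python1.6 copy, as described above.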

New Default Program And Data Directories
----------------------------------------

Inktomi Search 4.0 uses new default locations for its program and data directories:

Sun Solaris
    Old Program Directory: /opt/SEEKultra
    Old Data Directory:    /var/SEEKultra
    New Program Directory: /opt/InktomiSearch
    New Data Directory:    /var/opt/InktomiSearch

Microsoft Windows NT
    Old Program Directory: C:\Program Files\Infoseek\UltraseekServer
    Old Data Directory:    C:\Program Files\Infoseek\UltraseekServer\data
    New Program Directory: C:\Program Files\Inktomi\InktomiSearch4.0
    New Data Directory:    C:\Program Files\Inktomi\SearchData

Linux
    Old Program Directory: /opt/UltraseekServer
    Old Data Directory:    /var/opt/UltraseekServer
    New Program Directory: /opt/InktomiSearch
    New Data Directory:    /var/opt/InktomiSearch

Hewlett-Packard HP-UX
    Old Program Directory: /opt/UltraseekServer
    Old Data Directory:    /var/opt/UltraseekServer
    New Program Directory: /opt/InktomiSearch
    New Data Directory:    /var/opt/InktomiSearch

RELEASE NOTES FOR THE LINUX RELEASE

SUPPORTED OS VERSION
--------------------

Inktomi Search has been tested on RedHat Linux 6.0, with a kernel version of 2.2.5 and glibc 2.1.1.

C++ LIBRARY REQUIRED
--------------------

The C++ runtime library libstdc++-libc6.1 is required to run Inktomi Search. If you are using RedHat Linux, this file is part of:

    Version    RPM Package
    -------    -----------
    6.X        libstdc++-2.9.0
    7.X        compat-libstdc++-6.2-2.9.0

REPORTING A BUG
---------------

When reporting a problem, be sure to include your Linux distribution, the version of your kernel (uname -r), and the version of your glibc. Send problem reports to software-bugs@ultraseek.com.

SUPPORT FOR POSTSCRIPT
----------------------

Inktomi Search will index PostScript files if you have installed Ghostscript and it is on your path.

Patchlevel 1

Include support for XPA search 1.0.3.
Fix thesaurus problem with synonyms that contain the original term as a substring.
Fix "exceptions.ValueError: invalid literal for int(): unlimited" on HP-UX.
Fix problem where the xlt user dictionary was ignored when stemming queries.
Fix problem extracting "alt" attributes from img HTML tags.
Fix problem parsing "asctime" dates.
Change allhits to return 0 hits on empty query instead of raising an error.
Be smarter at detecting document encodings for Chinese, Japanese, and Korean documents by using rosette as well as teragram.
In the default "Document Type Specification" section under the "doc types" tab on the "server" admin screen, add an entry for .jsp, specifying text/html.
Save spider accept_cookies parameter in configuration file.
Add missing tags in help/refine.html.
Add "encoding:" header lines to header.html and footer.html.
Treat fullwidth space in queries just like ASCII space.
Treat segmentation points like spaces in unquoted Chinese, Japanese, and Korean query text.
Fix thesaurus to work with Chinese, Japanese, and Korean text.
Eliminate "504 suspended" indexer log messages when indexing is globally suspended for shutdowns, restarts, and topic updates.
Change default encoding for Japanese to shift_jis.
Highlight phrases.
Remove iso-8859-15 from the list of supported encodings.
Make sure data directory is a fully-qualified pathname to prevent filter failures.
Detect title_size problems when incr dbs are opened.
Suspend collection on any adddoc_insert_locked exception.

Patchlevel 2

Fix occasional NT crash at startup.
Raise minimum heapsize from 80 MB to 128 MB.
Fix language_display_names 3.1 docset compatibility problem.
Don't warn about 100% licensed sites if the number of licensed sites equals the number of licensed collections.
Fix the spider URL filter to correctly count the number of directories in URLs.
Eliminate extra colon on "Extra-" lines in saquery.spy.
Treat fullwidth space in thesaurus.txt just like ASCII space.
Fix spelling of avancerad in sv.xml.
Fix "exceptions.AttributeError: default_lang" in edittopicgo.html.
Fix xmlquote to discard control characters illegal in XML so that they never end up in topics.xml.
Fix "exceptions.TypeError: an integer is required" in edittopic.html when no parent is specified.
Fix "exceptions.UnicodeError: ASCII decoding error: ordinal not in range(128)" when handling "URL contains illegal character" error.
Fix "exceptions.AttributeError: 'Collection' instance has no attribute 'now'" when starting a scheduled reindex in a scanner collection.

Patchlevel 3

On NT, make sure we remove the InktomiSearch4.0 registry key when the product is uninstalled.
Fix some corner cases in scheduled activity.
Add a warning if you attempt to schedule a oneshot activity in the past.
Clean up HTML in CCE topics status page.
Don't show document counts or highlight terms in Chinese, Japanese, or Korean.
Unescape special characters when turning URLs into file pathnames for scanner collection add URL.
Support Unicode 3.0.1.
Major update of linguistic support for Japanese (JMA 3.1).
Handle Lotus Domino ReadForm URLs like OpenView URLs.
Percent-encode ASCII quote in URLs (%22).
Strip trailing whitespace from URLs when parsing HTML files.

Patchlevel 4

Make the message width in the URL list for a site configurable by setting config.urllist_msg_len.
Include the actual license key in the message stating a key is bad.
Work around the bug in the PDF library that gives bad titles.
Reduce MemoryErrors by limiting the number of concurrent queries.
Ignore corrupt query log entries.
Add a directory listing command to os status so we can tell if SSL support is installed.
Remove Windows codepage overrides for some ISO charsets.
Exit cleanly if the UNIX username is not a valid user.

Patchlevel 5

New directory listing format.
Fix excessive delete previous bug when URLs have commas.