Release Notes for Verity Ultraseek 5.3
--------------------------------------

Page Expert
-----------

Page Expert helps Ultraseek choose what portions of a page to index in a spider collection. The processing for a page type is configured using built-in and custom content filters. Each filter prevents Ultraseek from indexing portions of the page which are not primarily content. For example, navigation bars, ads, or copyright statements could be removed.

Custom filters match a section of a document template that marks the main content. It might be comments or HTML markup, like "" or "". Configure the custom filter to match the beginning and end of the main content. When the custom filter is enabled, only that section will be indexed.

Filters are combined to make page types. A page type might be "Javadoc" or "intranet portal". URL patterns and content strings are used to select which page type to use for a document.

During spidering, the URL and content patterns are applied to each HTML page. The first set that matches is the page type for that page. The filters are applied, and the resulting text is indexed.

Hit Level Authentication
------------------------

The search interface is now capable of filtering the search results returned to a user based on their credentials. This feature is off by default, but it can be enabled on the Server > Advanced tab by checking the "Filter results based on user credentials" checkbox.

To determine if a user has access to a search result, the search engine tries to access the document over HTTP or HTTPS with the user's credentials. Only HTTP and HTTPS URLs are currently supported for Hit Level Authentication.

Once Hit Level Authentication is turned on, several security features are activated that change the interface to the search engine. First, it is no longer possible to get the total number of hits that matched the query, or the counts for each individual term. All of the interfaces will omit this information, except for XPA and Web Services, which return a result count of 0.

Since the search engine has to fetch each URL in the result list to determine if a user can access it, Hit Level Authentication may have a large impact on query performance. Also, since the server sends the user credentials to each web server from the result list, only results from trusted web servers should be filtered.

If you change the level of authentication required for a URL, you must reindex the URL; otherwise the change will not be reflected in the Ultraseek index.

If confidential information is removed from a document, and the document is then made accessible without authentication, it could be returned in a search result to any user. This is because Ultraseek successfully retrieves the document, regardless of the user's credentials, at search time. However, if the URL hasn't been reindexed, the content associated with it within Ultraseek will be the older version of the document. The confidential information in the older version could be revealed to any user with a search.

If a URL that was not previously under access control is put under access control, it should not be a problem. Although it could be returned in search results to any user until it is re-indexed, the results will be based on the older version of the document that did not require authorization. The user could already have seen this information. Ultraseek will not contain any new confidential information added to the document after it was put under protection until the URL is re-indexed. At that time, Ultraseek will also begin enforcing authentication on that URL.

Users can only supply one username and password at a time. If different URLs in your index require different credentials, they can't all be returned in a single search.

User sign-in and the Hit Level Authentication schemes are customizable. The sign-in method is contained in the files "get_auth.html" and "signin.html" in the docs directory. The Hit Level Authentication scheme can be customized by editing the patch to check_auth in the patches.py file in the lib/python2.2 directory.

API Changes for Hit Level Authentication
----------------------------------------

Hit Level Authentication is not implemented for the interfaces that provided a list of all hits without relevance ranking. In XPA, the allHits method of UltraseekCollection is now unsupported when Hit Level Authentication is turned on. Also, pyqueryall.spy has moved into the admin directory, and requires an admin username and password to access it.

Setting Document Quality
------------------------

The Collections > Quality interface now allows for finer control of the quality of a document. The quality can be adjusted based on:

- the depth of the directory of the document
- the number of incoming links
- whether the URL ends in a slash
- whether the URL contains a query string
- whether the document is from a user directory
- whether the document contains specific meta tags or meta data
- a pattern that matches the URL

Proxy Server for In-document Highlighting
-----------------------------------------

Ultraseek can now use a proxy server when retrieving documents for highlighting. You must configure this proxy separately from the proxies for your spider collections. Configure the proxy on the Server > Parameters > Advanced tab in the In-document highlighting proxy server section.

Import Ultraseek Collections
----------------------------

Collection configuration can now be imported from other instances of Ultraseek Server, either over the network, or from an OS Status or configuration file.

Weblog Ping
-----------

Ultraseek will accept requests from weblogs ("blogs") using the standard blog ping XML-RPC protocol. These pings are treated like an Add URL request. If the blog content URL is accepted by the URL filters for any collection, a successful result is returned for the ping. Failure is returned if the URL is not acceptable to any collection or there is some problem with the request.

Pings should be sent to the path "/help/ping.xml" on the Ultraseek installation. If it is on the host "search.verity.com" at port 8765, pings would go to http://search.verity.com:8765/help/ping.xml.

To verify that blog pings are working, add the Ultraseek ping URL to the ping list in your blog server, then publish a new entry. The entry should be added to the search index almost immediately.

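A ping can also be sent by hand, which may help when checking the setup described above without a blog server. The following is a minimal sketch, not part of the Ultraseek distribution. It assumes a separate client machine with Python 3 (Ultraseek itself ships Python 2.2), the weblogUpdates.ping method name from the standard blog ping protocol, and the example host from the text above. The blog name and blog URL are hypothetical placeholders, and the shape of the reply from Ultraseek is not documented here, so the sketch simply returns it unchanged.

```python
# Sketch only: a Python 3 client for the standard weblog ping call.
# The blog name and blog URL used below are hypothetical placeholders.
import xmlrpc.client

# Example endpoint taken from the release notes text.
PING_ENDPOINT = "http://search.verity.com:8765/help/ping.xml"


def build_ping_payload(blog_name, blog_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(blog_name, blog_url, endpoint=PING_ENDPOINT):
    """Send the ping and return the decoded XML-RPC reply as-is."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)


if __name__ == "__main__":
    # Print the request body instead of contacting a server.
    print(build_ping_payload("Example Blog", "http://blog.example.com/"))
```

Calling send_ping("Example Blog", "http://blog.example.com/") against a real installation would issue the request. Building the payload separately makes it possible to inspect what would be sent before any network traffic occurs.
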
The ping request will also be shown in Ultraseek's HTTP access log and in the blog server log.

The blog ping specification is at http://www.xmlrpc.com/weblogsCom.

PDF Support
-----------

Ultraseek now supports PDF 1.5 (Acrobat 6.0) files using the KeyView document filter. To use KeyView instead of Acrobat to filter PDF documents, set the drop down for Document Type "application/pdf" to "Adobe Acrobat (Key View)" under Server > Doctypes.

Implementation Changes
----------------------

Mirror Collections are improved.

- Files are compressed prior to transfer (both Ultraseek servers must be 5.3).
- https: can be selected as the transfer protocol.
- Admin interface provides better status indications.

HTTP Server.

- Ultraseek now serves pages with Content-Encoding of gzip, x-gzip, or deflate, if the requesting browser supports the encoding.

Spidering.

- The Ultraseek spider now accepts Content-Encodings gzip, x-gzip, and deflate, which speeds up transfer of documents during spidering.
- The spider now uses multiple threads per site. The number of threads per site is set to 5 by default. This can be changed on the Collections/Network tab. The spider will not use more than 1 thread per site for any site that matches a pattern in the Spider Throttle settings.

Spider and Scanner collections now have a more robust error handling interface with the URL database.

NTLM authentication (Windows challenge/response) is now supported on all platforms, rather than Windows only.

Content Assistant can now be limited to specific collections.

In-document highlighting through a proxy server is now supported.

Higher priority is now given to meta tag names at the front of the HTML Meta Tag Names list. The first meta tag in the list that appears in the document will now be used, rather than the first meta tag in the document that appears in the list.

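The difference between the old and new meta tag selection rules is easiest to see side by side. The sketch below is an illustration only and is not Ultraseek's code. The function names, the sample tag names, and the case-insensitive comparison are all assumptions made for the example.

```python
# Illustration only; this is not Ultraseek's implementation.
# Function names and sample data are hypothetical.

def pick_meta_new(configured_names, doc_meta):
    """5.3 rule: the first name in the configured list that the
    document contains wins."""
    present = {}
    for name, value in doc_meta:
        # Keep the first occurrence of each name in the document.
        present.setdefault(name.lower(), value)
    for name in configured_names:
        if name.lower() in present:
            return present[name.lower()]
    return None


def pick_meta_old(configured_names, doc_meta):
    """Earlier rule: the first tag in the document whose name is in
    the configured list wins."""
    wanted = {name.lower() for name in configured_names}
    for name, value in doc_meta:
        if name.lower() in wanted:
            return value
    return None


if __name__ == "__main__":
    configured = ["dc.description", "description"]
    # Meta tags in the order they appear in the document.
    doc_meta = [("description", "Generic site blurb"),
                ("dc.description", "Page-specific summary")]
    print(pick_meta_new(configured, doc_meta))  # Page-specific summary
    print(pick_meta_old(configured, doc_meta))  # Generic site blurb
```

With the new rule, the order of the configured list decides which tag is used, so reordering that list is enough to change the outcome without touching the documents.
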
The functionality to identify and decrease the relevance of mailing list archives built with hypermail or MHonArc has been moved to the Collections > Indexing > Page Expert tab so that it is now configurable.

Interface Changes
-----------------

You can now set the following form variables in a style setting:

    col      Set of collections to search
    qc       Set of collection options to display
    dt       Date restriction for search
    inthe    Seconds from current date
    before   Timestamp for upper range of search
    after    Timestamp for lower range of search
    rl       Require logon for hit level authentication
    qi       Query terms mode (ANY or ALL)

The Style Editor now has controls for the "skip to content" link, the "Help" and "Advanced" links, and the "Powered by Verity" logo on the search form. There are also controls for the lock icon and the sign in text that appears when Hit Level Authentication is activated.

The URL Status, Add URL, and Delete URL tabs in a spider or scanner collection are moved to a sub tab called "URLs".

The View Sites page now has additional statistics for each site.

The interface for the Ad Server feature on the Server tab has been deprecated. Any Ad Server settings you currently have configured will continue to work, but the web interface has been removed.

Double-clicking on a collection checkbox in the search interface will automatically select that collection only.

Patches.py Changes
------------------

The interface to the parse.parse routine is changed. This requires you to rewrite any customization to the new_parse routine in patches.py.

The parse.parse routine now returns a single object instead of a tuple. This should ease any future upgrade in which the parse.parse routine changes. Also, the parse.parse routine no longer takes the site object as the second argument.

Two methods (unpack and repack) have been added to the ParsedDocument object to make this upgrade easier.

The new_parse routine should be changed as follows:

    def new_parse(col,doc,url,size,date,dict,doctype,params):
        ## pre-parse customized code here ##
        parsedDocument = old_parse(col,doc,url,size,date,dict,doctype,params)
        (title,description,publisher,url,size,date,flags,extra,
         httpequivs,robots,fields,terms,hrefs,imgs) = parsedDocument.unpack()
        ## post-parse customized code here ##
        parsedDocument.repack(
            title,description,publisher,url,size,date,flags,extra,
            httpequivs,robots,fields,terms,hrefs,imgs)
        return parsedDocument

Updated Components
------------------

KeyView 8.0
DataDirect ConnectODBC 4.2

New Platforms
-------------

Suse Linux 8.0
RedHat Enterprise Linux 3.0

Bug Fixes
---------

[85584] Multiple tables are now supported in all database collections.
[BZ052] Strip #fragment from URLs before URL matching.
[BZ174] Admin interface: "duplicate" Email sent as part of "Test Autobug" is no longer filtered.
[BZ187] Highlighting not working after unclosed option tags. Was fixed in 5.2.2, but left out of Patchnotes.
[BZ211] Digest creation is now performed by a low priority background thread.
[BZ225] Admin interface: configuration file is now saved immediately after changes in collection pretty name (and similar parameters).
[BZ229] Admin interface: Collection names are now displayed with both pretty name and internal name.
[BZ333] Fix relevance with date queries.
[BZ340] API: saquery.xml now includes a "charset=" parameter on the URL generated for the and elements.
[BZ371] Auto Title Replacement with non-English characters produces incorrect titles.
[BZ378] In KeyView, extract Create and LastSaved dates as dc.date.created and dc.date.modified, respectively.
[BZ389] Fixed error on Collection/Status for file scan collections that occurred when attempting to display non-ASCII file names.
[BZ402] Admin interface: fixed problems when running Ultraseek with an HTTPS server, but no HTTP server.
[BZ424] API (XPA): delete/insert coordination problem that occurred under highly multi-threaded re-indexing of large documents.
[BZ431] '400 Bad Request' from In-document highlighting with Netscape 4.x, 5.x, and 6.x and some web servers.
[BZ458] Outgoing HTTP traffic now uses the IP address configured in the Binding address text box on the Server > Parameters > Main tab.
[BZ459] Correct mappings for AE ligature and O-stroke characters.
[BZ504] Unclicking "Various search options" in the style editor causes the back button to disappear.
[BZ505] Normally, in a database collection, all fields are also treated as body text. However, when a field is specified as an extra value in the title record, the content could not be found with a non-field search.
[BZ528] Remove extra colon from content-type field in SOAP response.
[BZ533] Title parsing error in the HTML page when doctype has no space between attributes.

Correction to Documentation
---------------------------

Known Issues
------------

Spell checking, Group by Location, and passage-based summaries use additional computing time during a query. This overhead can affect the performance of the query server, reducing the maximum number of queries per second. If performance is critical, disabling these features will increase the peak queries per second. You can disable features on a per-style basis on the Interface tab.


RELEASE NOTES FOR THE LINUX RELEASE

WARNING FOR INKTOMI SEARCH 4.1 and 4.2 LINUX USERS
--------------------------------------------------

Due to a change in the URL database format, users of Inktomi Search 4.1 and 4.2 on Linux will notice an error message after upgrading the server. Users upgrading from version 4.0 or earlier will not notice this.

If 4.1 or 4.2 URL databases are detected, the following log message will appear:

    Version 4.1 or 4.2 URL database file detected. This is incompatible with this version. The URL database is being backed up, then it will be erased. The site will be revisited immediately.

Be aware that this may cause extraneous documents to appear in the index that have actually been removed from the site. If this condition is detected, clear the collection to remedy it.

SUPPORTED OS VERSION
--------------------

Ultraseek has been tested on RedHat Linux 7.1 and later, with a kernel version of 2.2.5 and glibc 2.2.4.

C++ LIBRARY REQUIRED
--------------------

The C++ runtime library libstdc++-libc6.1 is required to run Ultraseek. If you are using RedHat Linux, this file is part of:

    Version    RPM Package
    -------    -----------
    7.X        compat-libstdc++-6.2-2.9.0

REPORTING A BUG
---------------

When reporting a problem, be sure to include your Linux distribution, the version of your kernel (uname -r), and the version of your glibc. Send problem reports to software-bugs@ultraseek.com.

SUPPORT FOR POSTSCRIPT
----------------------

Ultraseek will index PostScript files if you have installed Ghostscript and it is on your path.


RELEASE NOTES FOR THE SOLARIS RELEASE

Ultraseek requires Solaris 8 or above. On Solaris 8, update 7 with the T2 threading library is recommended, particularly for SMP hardware.

From Sun's support library:
http://developers.sun.com/solaris/articles/alt_thread_lib.html#question6

    Solaris 8 Update 7 (2/2002) release is recommended for use of the T2 library since it contains all the performance enhancements and patches. If your release level is lower than Update 7, then you can apply the maintenance update 7 patch cluster or the following patches related to T2:

    * 108528-13: SunOS 5.8: kernel update patch
    * 108827-17: SunOS 5.8: /usr/lib/libthread.so.1 patch

If Ultraseek is installed as a suid(root) program, you must install /usr/lib/lwp/ as a secured directory (see Question 11 in the above document).

OS FILE HANDLE REQUIREMENT
--------------------------

Ultraseek requires at least 1024 file handles be available.
If your system's hard file-descriptors limit is set to less than 1024, the server will report an error message and refuse to start up. You can find out your hard file-descriptors limit as follows:

    sh
    ulimit -Hn
    exit

If ulimit returns less than 1024, you will need to manually increase the hard limits in your system configuration. In Solaris 2.4+, this can be accomplished by adding the following lines to /etc/system:

    *Set hard limit on file descriptors
    set rlim_fd_max = 4096

Save the changes and reboot Solaris.

See http://access1.sun.com/technotes/01406.html for more information on file descriptor limits.


Patchnotes for 5.3
------------------

Initial release

Patchnotes for 5.3.1
--------------------

Fixed configuration problem when Mirror collections used HTTPS.

Performance improvements when using Mirror collections to connect different hardware architectures ((Linux or PC) to/from Solaris).

Admin localizations are now kept in an external XML file in the languages directory.

Improved form-based authorization support to allow redirects to the same URL from a form.

Support for "Content-Encoding: deflate" has been disabled in the HTTP server. There are two conflicting interpretations of the "deflate" standard, and which one applies is browser specific.

XPA Indexer collections now have a "show log" button in the Admin interface.

Fixed bugs:
-----------

[BZ034] The "Searched Document" count is now correct for no-hits queries.
[BZ295] Some field searches using quotes returned unexpected results.
[BZ437] A Mirror collection now generates email after 5 failed polls.
[BZ444] Unicode exception from saquery.xml when qt was omitted.
[BZ488] Searches using || without spaces between the terms returned incorrect results. Also, this problem could result in a hung server thread or segmentation fault.
[BZ545] Content Assistant tracing also enabled tracing for web services.
[BZ569] Grammar corrections in admin help pages.
[BZ580] Content Assistant error: SOAP Proxy object is Null.
[BZ584] Mirror collections always poll after restart.
[BZ585] Some documents not deduped on initial index.
[BZ591] Searches using link: do not work.
[BZ593] Multi-spider may insert duplicate documents into the index.
[BZ595] Unable to shutdown/restart when disk is full.
[BZ597] IP address access control was not enabled for web services.
[BZ605] Web services "describe" call threw exception for empty list of sources.
[BZ618] Do not throw exception when web services tracing is enabled.
[BZ619] NameError when doing a wildcard search in Delete URL.
[BZ623] Exception was thrown when parsing or inserting HTML documents into direct indexing collection with XPA.
[BZ625] Race condition during topics initialization after restart.
[BZ626] Mirrored Quicklinks would persist in memory after reset.
[BZ631] Document deletions from XPA throw exception.
[BZ638] "unknown compression method" thrown when older XPA clients connected.
[BZ641] A 5.3 Collection mirrored by another collection would acquire too many on-disk databases.
[BZ642] Idle indexing collections now auto-save on a regular basis.
[BZ644] Exception when Unicode queries sent to search web service.
[BZ645] Content Assistant was sending control characters in XML.
[BZ648] Timeout was not enforced while receiving HTTP headers.
[BZ649] Length limit was not enforced while receiving HTTP headers.
[BZ650] Column overrides were ignored the first time a database table was indexed.
[BZ651] Finnish translation for "Advanced" was incorrect.
[BZ662] Spell sessions were not removed from cache when collection updated.
[BZ669] The URL filter test page misreported a customized patch to quality for certain documents.
[BZ670] Cookie paths did not respect the "Ignore case differences in URLs" checkbox on the Collections/Tuning page.

Patchnotes for 5.3.2
--------------------

Updated Japanese admin localization.

Fixed bugs:
-----------

[BZ654] Too much text was shown as a term match for an inferred phrase.
[BZ700] Spelling was suggesting "rare" terms instead of frequent terms.
[BZ703] Request line was missing from Activity/Threads page.
[BZ707] Terms to the left of a || operator were influencing relevancy scores (including terms defined with "qp" query parameter).
[BZ708] Terms to the left of a || operator were shown with word scores (including terms defined with "qp" query parameter).
[BZ709] STARTINDEX and STOPINDEX tags in uppercase were not functional.
[BZ714] Log files with Japanese characters caused ASCII decoding error when viewed through the admin interface.

Patchnotes for 5.3.3
--------------------

The linguistic analysis software for all languages except Japanese, Korean and Chinese has been replaced by software from the Teragram Corporation. This should have no noticeable effect on searching and indexing, but the format of the user dictionaries has changed. Any customizations to pre-5.3.3 user dictionaries must be re-implemented.

The user dictionaries are contained in:

    [Ultraseek Program Directory]/language

The new format is:

    WORD,ROOT:w

where WORD is the word to be stemmed and ROOT is the stem. For example, to stem the word "webservers" to "webserver" in the English language, enter the following line in the en.usr file in the language directory:

    webservers,webserver:w

Any changes to a user dictionary will require a slight wait during start-up the first time (and only the first time) the changes are loaded.

Other Changes
-------------

Ultraseek now uses Python 2.2.3 (upgraded from 2.2.2).

The Admin interface now has a "Test" button under "Interface" to test a style during modification.

The Admin interface now has a "Test" button under the Form-based Authentication interface to test the authentication parameters.

Ultraseek now throws an "IllegalArgumentException: Unsupported Locale" client exception for XPA API calls which specify a Locale which is unsupported or unlicensed on the Ultraseek server.

The Form-based Authentication interface now has a "Test" button to test the authentication.

Fixed bugs:
-----------

[BZ417] Repeated term counts no longer display for terms that occur multiple times in a query (example: <>).
[BZ694] Internal exception for XPA spelling suggestion with an unsupported Locale.
[BZ717] Segmentation fault occurred along with the message "GC object already in linked list".
[BZ720] Hits were returned for queries using | or || query prefixes, a query suffix, and optional query terms which should result in no-hits.
[BZ722] Topic browse now works when style requires "All Terms Must Match".
[BZ741] Topics rules did not initialize correctly during server start up, causing some topics to lose documents over time.
[BZ750] XSS security fix.
[BZ753] Mirrored documents were counted toward the license.
[BZ758] Internal exception for XPA search with an unsupported Locale.
[BZ791] Corrected multi-thread problem with wildcard expansion. Occasionally, this would incorrectly report "corruption" in a collection's TMS database.
[BZ812] favicon.ico is now distributed in the docs directory.
[BZ813] /util/exception.html can now use "raise done()" and "raise redirect()".
[BZ817] "Word Scores" can now be turned off on the Advanced Search page when the style has set them on by default.
[BZ820] Sort-by-title now uses score & URL as keys when documents have the same title. Sort-by-date now uses score & URL as keys when documents have the same date.
[BZ821] Resolved performance issues when using group-by-location or global setting query_dedup.
[BZ822] Term hit-counts are now reduced by removed duplicates when query_dedup is on.

Patchnotes for 5.3.4
--------------------

[BZ048] Database collections did not always decode the character set of the database table correctly.
[BZ503] Spider reported "URL database file full" before reaching 1M URLs.
[BZ661] Several "403 Disallowed" messages appear in the error log when using the aggressive spider, even when "Log Disallowed Filters" is not checked on the Collections/Filters page.
[BZ730] Using the Page Expert test page from within a scanner collection generated the error: "Collection instance has no attribute 'simple_http_client'".
[BZ739] The default authorization for an Exchange or netnews collection was invalid.
[BZ840] XSS security fix.
[BZ844] Viewing a log file displays "UnicodeError" if the log file contained a non-utf8 character.
[BZ856] Ultraseek was unable to index zip files that contained trailing comments.
[BZ861] Invalid HTTP request caused "exceptions.AttributeError: 'NoneType' object has no attribute 'startswith' (httpsrvr.py:904)".
[BZ862] "Group by topic" feature did not respect the "ht" form variable.
[BZ868] Query response time slows down under query load on Solaris.
[BZ870] Page Expert test page does not display properly.
[BZ871] KeyView did not HTML quote text from binary documents.