Release Notes for Verity Ultraseek 5.3
--------------------------------------
Page Expert
-----------
Page Expert helps Ultraseek choose what portions of a page to index in
a spider collection. The processing for a page type is configured
using built-in and custom content filters. Each filter prevents
Ultraseek from indexing portions of the page which are not primarily
content. For example, navigation bars, ads, or copyright statements
could be removed.
Custom filters match a section of a document template that marks the
main content. The marker might be a comment or a piece of HTML markup.
Configure the custom filter to match the beginning and end of the main
content. When the custom filter is enabled, only that section will be
indexed. Filters are combined to make page types. A page type might be
"Javadoc" or "intranet portal".
URL patterns and content strings are used to select which page type to
use for a document. During spidering, the URL and content patterns are
applied to each HTML page. The first set that matches is the page type
for that page. The filters are applied, and the resulting text is
indexed.
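
The selection and filtering steps described above can be sketched in a
few lines of Python 3. This is only an illustration, not Ultraseek's
implementation: the PageType class, the marker strings, and the rule
that both the URL pattern and the content string must match are
assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageType:
    """Hypothetical page type: selection patterns plus one custom filter."""
    name: str
    url_pattern: str      # regular expression tested against the URL
    content_string: str   # substring that must appear in the page ("" = any)
    start_marker: str     # text that marks the start of the main content
    end_marker: str       # text that marks the end of the main content


def select_page_type(page_types, url, html) -> Optional[PageType]:
    """Return the first page type whose URL pattern and content string match."""
    for page_type in page_types:
        if re.search(page_type.url_pattern, url) and page_type.content_string in html:
            return page_type
    return None


def apply_custom_filter(page_type, html):
    """Keep only the text between the start and end markers.

    If either marker is missing, the page is returned unchanged.
    """
    start = html.find(page_type.start_marker)
    if start < 0:
        return html
    start += len(page_type.start_marker)
    end = html.find(page_type.end_marker, start)
    if end < 0:
        return html
    return html[start:end]
```

Because the first matching set wins, the order of the page types in the
list matters: more specific patterns belong before general ones.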
Hit Level Authentication
------------------------
The search interface is now capable of filtering the search results
returned to a user based on their credentials. This feature is off by
default, but it can be enabled on the Server > Advanced tab by
checking the "Filter results based on user credentials" checkbox.
To determine if a user has access to a search result, the search
engine tries to access the document over HTTP or HTTPS with the user's
credentials. Only HTTP and HTTPS URLs are currently supported for Hit
Level Authentication.
Once Hit Level Authentication is turned on, several security features
are activated that change the interface to the search engine. First,
it is no longer possible to get the total number of hits that matched
the query, or the counts for each individual term. All of the
interfaces omit this information, except for XPA and Web Services,
which return a result count of 0.
Since the search engine has to fetch each URL in the result list to
determine if a user can access it, Hit Level Authentication may have a
large impact on query performance. Also, since the server sends the
user credentials to each web server from the result list, only results
from trusted web servers should be filtered.
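
The per-hit check can be pictured with the following Python 3 sketch.
It is not Ultraseek's code: filter_hits, the can_fetch callback, and
the trusted_hosts set are hypothetical names, and dropping non-HTTP
URLs and hits from untrusted hosts is a conservative assumption made
for the example.

```python
from urllib.parse import urlsplit


def filter_hits(urls, credentials, can_fetch, trusted_hosts):
    """Return the URLs that the user is allowed to see.

    can_fetch(url, credentials) is a caller-supplied function that tries
    to retrieve the URL with the user's credentials and returns True on
    success. One fetch is attempted per hit, which is why this kind of
    filtering slows queries down.
    """
    visible = []
    for url in urls:
        parts = urlsplit(url)
        # Only HTTP and HTTPS URLs can be checked this way.
        if parts.scheme not in ("http", "https"):
            continue
        # Assumption: never send credentials to a host that is not trusted.
        if parts.hostname not in trusted_hosts:
            continue
        if can_fetch(url, credentials):
            visible.append(url)
    return visible
```

Passing the fetch function in as a parameter keeps the sketch free of
any particular HTTP library and makes the cost of one request per hit
explicit.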
If you change the level of authentication required for a URL, you must
reindex the URL; otherwise the change will not be reflected in the
Ultraseek index. If confidential information is removed from a
document, and the document is then made accessible without
authentication, it could be returned in a search result to any
user. This is because Ultraseek successfully retrieves the document,
regardless of the user's credentials, at search time. However, if the
URL hasn't been reindexed, the content associated with it within
Ultraseek will be the older version of the document. The confidential
information in the older version could be revealed to any user with a
search.
If a URL that was not previously under access control is put under
access control, it should not be a problem. Although it could be
returned in search results to any user until it is re-indexed, the
results will be based on the older version of the document that did
not require authorization. The user could already have seen this
information. Ultraseek will not contain any new confidential
information added to the document after it was put under protection
until the URL is re-indexed. At that time, Ultraseek will also begin
enforcing authentication on that URL.
Users can only supply one username and password at a time. If
different URLs in your index require different credentials, they can't
all be returned in a single search.
User sign-in and the Hit Level Authentication schemes are customizable.
The sign-in method is contained in the files "get_auth.html" and
"signin.html" in the docs directory. The Hit Level Authentication scheme
can be customized by editing the patch to check_auth in the patches.py
file in the lib/python2.2 directory.
API Changes for Hit Level Authentication
----------------------------------------
Hit Level Authentication is not implemented for the interfaces that
provide a list of all hits without relevance ranking. In XPA, the
allHits method of UltraseekCollection is now unsupported when Hit
Level Authentication is turned on. Also, pyqueryall.spy has moved into
the admin directory, and requires an admin username and password to
access it.
Setting Document Quality
------------------------
The Collections > Quality interface now allows for finer control of
the quality of a document. The quality can be adjusted based on:
- the depth of the directory of the document
- the number of incoming links
- whether the URL ends in a slash
- whether the URL contains a query string
- whether the document is from a user directory
- if a document contains specific meta tags or meta data
- a pattern that matches the URL
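
Several of the signals listed above can be read directly from the URL.
The Python 3 sketch below shows one way to derive them. The function
name, the dictionary keys, and the convention that a user directory
starts with "~" are assumptions for illustration only. Ultraseek's own
definitions may differ.

```python
from urllib.parse import urlsplit


def url_quality_signals(url):
    """Derive URL-based quality signals from a single URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    ends_in_slash = parts.path.endswith("/")
    # Directory depth: number of directories above the document itself.
    depth = len(segments) if ends_in_slash else max(len(segments) - 1, 0)
    return {
        "directory_depth": depth,
        "ends_in_slash": ends_in_slash,
        "has_query_string": bool(parts.query),
        # Assumption: user directories look like /~name/...
        "user_directory": bool(segments) and segments[0].startswith("~"),
    }
```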
Proxy Server for In-document Highlighting
-----------------------------------------
Ultraseek can now use a proxy server when retrieving documents for
highlighting. You must configure this proxy separately from the
proxies for your spider collections. Configure the proxy on the Server
> Parameters > Advanced tab in the In-document highlighting proxy
server section.
Import Ultraseek Collections
----------------------------
Collection configuration can now be imported from other instances of
Ultraseek Server, either over the network or from an OS Status or
configuration file.
Weblog Ping
-----------
Ultraseek will accept requests from weblogs ("blogs") using the
standard blog ping XML-RPC protocol. These pings are treated like an
Add URL request. If the blog content URL is accepted by the URL
filters for any collection, a successful result is returned for the
ping. Failure is returned if the URL is not acceptable to any
collection or there is some problem with the request.
Pings should be sent to the path "/help/ping.xml" on the Ultraseek
installation. If it is on the host "search.verity.com" at port 8765,
pings would go to http://search.verity.com:8765/help/ping.xml.
To verify that blog pings are working, add the Ultraseek ping URL to
the ping list in your blog server, then publish a new entry. The entry
should be added to the search index almost immediately. The request
will also be shown in Ultraseek's HTTP access log and in the blog
server log.
The blog ping specification is at http://www.xmlrpc.com/weblogsCom.
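
For testing without a blog server, a ping can also be sent by hand.
The Python 3 sketch below uses the standard xmlrpc.client module and
the weblogUpdates.ping method from the blog ping specification. The
helper names are hypothetical, and the shape of the reply (a struct
with an error flag and a message, per the specification) should be
confirmed against your installation.

```python
import xmlrpc.client

# Example endpoint from these notes; replace with your own host and port.
PING_URL = "http://search.verity.com:8765/help/ping.xml"


def build_ping_request(blog_name, blog_url):
    """Return the XML-RPC request body for a weblogUpdates.ping call."""
    return xmlrpc.client.dumps((blog_name, blog_url),
                               methodname="weblogUpdates.ping")


def send_ping(blog_name, blog_url, ping_url=PING_URL):
    """Send the ping and return the server's reply."""
    server = xmlrpc.client.ServerProxy(ping_url)
    return server.weblogUpdates.ping(blog_name, blog_url)
```

build_ping_request is useful for inspecting the payload that would be
posted. send_ping performs the actual network call.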
PDF Support
-----------
Ultraseek now supports PDF 1.5 (Acrobat 6.0) files using the Keyview
document filter. To use Keyview instead of Acrobat to filter PDF
documents, set the drop down for Document Type "application/pdf" to
"Adobe Acrobat (Key View)" under Server > Doctypes.
Implementation Changes
----------------------
Mirror Collections are improved.
  - Files are compressed prior to transfer (both Ultraseek servers
    must be 5.3).
  - https: can be selected as the transfer protocol.
  - The Admin interface provides better status indications.
HTTP Server.
  - Ultraseek now serves pages with Content-Encoding of gzip,
    x-gzip, or deflate, if the requesting browser supports the encoding.
Spidering.
  - The Ultraseek spider now accepts Content-Encodings gzip, x-gzip,
    and deflate, which speeds up transfer of documents during spidering.
  - The spider now uses multiple threads per site. The number of
    threads per site is set to 5 by default. This can be changed on
    the Collections/Network tab. The spider will not use more than 1
    thread per site for any site that matches a pattern in the Spider
    Throttle settings.
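
The HTTP Server item above depends on what the browser advertises in
its Accept-Encoding request header. The Python 3 sketch below shows
one simple way a server could choose among the three encodings. It is
an illustration only: choose_encoding and the fixed preference order
are assumptions, not Ultraseek's actual logic, and the 5.3.1 patch
notes later disable deflate in the HTTP server.

```python
# Encodings named in these notes, in an assumed order of preference.
SUPPORTED = ("gzip", "x-gzip", "deflate")


def choose_encoding(accept_encoding):
    """Pick a Content-Encoding from an Accept-Encoding header value.

    Returns None when the client accepts none of the supported encodings.
    """
    offered = {}
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if token:
            offered[token] = quality
    for encoding in SUPPORTED:
        # A quality of 0 means the client explicitly refuses the encoding.
        if offered.get(encoding, 0.0) > 0:
            return encoding
    return None
```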
Spider and Scanner collections now have a more robust error handling
interface with the URL database.
NTLM authentication (Windows challenge/response) is now supported on
all platforms, rather than Windows only.
Content Assistant can now be limited to specific collections.
In-document highlighting through a proxy server is now supported.
Higher priority is now given to meta tag names at the front of the
HTML Meta Tag Names list. The first meta tag in the list that appears
in the document will now be used, rather than the first meta tag in
the document that appears in the list.
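
The new lookup order can be sketched as follows in Python 3. The
function name and the data shapes are hypothetical. The point is only
that the configured list, not the document, decides which tag wins.

```python
def pick_meta_value(configured_names, document_meta):
    """Return the value of the highest-priority meta tag present.

    configured_names: tag names in priority order (front of the list first).
    document_meta: (name, value) pairs in the order they appear in the page.
    """
    values = {}
    for name, value in document_meta:
        # Keep the first occurrence of each tag name in the document.
        values.setdefault(name.lower(), value)
    for name in configured_names:
        key = name.lower()
        if key in values:
            return values[key]
    return None
```

With the list ["dc.description", "description"] and a page that
declares "description" before "dc.description", the value of
"dc.description" is returned, because it comes first in the configured
list.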
The functionality to identify and decrease the relevance of mailing
list archives built with hypermail or MHonArc has been moved to the
Collections > Indexing > Page Expert tab so that it is now
configurable.
Interface Changes
-----------------
You can now set the following form variables in a style setting:
    col      Set of collections to search
    qc       Set of collection options to display
    dt       Date restriction for search
    inthe    Seconds from current date
    before   Timestamp for upper range of search
    after    Timestamp for lower range of search
    rl       Require logon for hit level authentication
    qi       Query terms mode (ANY or ALL)
The Style Editor now has controls for the "skip to content" link, the
"Help" and "Advanced" links, and the "Powered by Verity" logo on the
search form. There are also controls for the lock icon and the sign in
text that appears when Hit Level Authentication is activated.
The URL Status, Add URL, and Delete URL tabs in a spider or scanner
collection are moved to a sub tab called "URLs".
The View Sites page now has additional statistics for each site.
The interface for the Ad Server feature on the Server tab has been
deprecated. Any Ad Server settings you currently have configured will
continue to work, but the web interface has been removed.
Double-clicking on a collection checkbox in the search interface will
automatically select that collection only.
Patches.py Changes
------------------
The interface to the parse.parse routine has changed. This requires you
to rewrite any customization to the new_parse routine in
patches.py. The parse.parse routine now returns a single object
instead of a tuple. This should simplify future upgrades in which
the parse.parse routine changes. Also, the parse.parse routine no
longer takes the site object as the second argument.
Two methods (unpack and repack) have been added to the ParsedDocument
object to make this upgrade easier. The new_parse routine should be
changed as follows:
    def new_parse(col,doc,url,size,date,dict,doctype,params):
        ## pre-parse customized code here ##
        parsedDocument = old_parse(col,doc,url,size,date,dict,doctype,params)
        (title,description,publisher,url,size,date,flags,extra,
         httpequivs,robots,fields,terms,hrefs,imgs) = parsedDocument.unpack()
        ## post-parse customized code here ##
        parsedDocument.repack(
            title,description,publisher,url,size,date,flags,extra,
            httpequivs,robots,fields,terms,hrefs,imgs)
        return parsedDocument
Updated Components
------------------
KeyView 8.0
DataDirect ConnectODBC 4.2
New Platforms
-------------
Suse Linux 8.0
RedHat Enterprise Linux 3.0
Bug Fixes
---------
[85584] Multiple tables are now supported in all database
collections.
[BZ052] Strip #fragment from URLs before URL matching.
[BZ174] Admin interface: "duplicate" Email sent as part of "Test
Autobug" is no longer filtered.
[BZ187] Highlighting not working after unclosed option tags. Was
fixed in 5.2.2, but left out of Patchnotes.
[BZ211] Digest creation is now performed by a low priority background
thread.
[BZ225] Admin interface: configuration file is now saved immediately
after changes in collection pretty name (and similar
parameters).
[BZ229] Admin interface: Collection names are now displayed with both
pretty name and internal name.
[BZ333] Fix relevance with date queries.
[BZ340] API: saquery.xml now includes a "charset=" parameter on the
URLs it generates.
[BZ371] Auto Title Replacement with non-English characters produced
incorrect titles.
[BZ378] In KeyView, extract Create and LastSaved dates as
dc.date.created and dc.date.modified, respectively.
[BZ389] Fixed error on Collection/Status for file scan collections
that occurred when attempting to display non-ASCII file
names.
[BZ402] Admin interface: fixed problems when running Ultraseek with
an HTTPS server, but no HTTP server.
[BZ424] API (XPA): delete/insert coordination problem that occurred
under highly multi-threaded re-indexing of large documents.
[BZ431] '400 Bad Request' from In-document highlighting with Netscape
4.x, 5.x, and 6.x and some web servers.
[BZ458] Outgoing HTTP traffic now uses the IP address configured in
the Binding address text box on the Server > Parameters >
Main tab.
[BZ459] Correct mappings for AE ligature and O-stroke characters.
[BZ504] Unclicking "Various search options" in the style editor causes
the back button to disappear.
[BZ505] Normally, in a database collection, all fields are also
treated as body text. However, when a field is specified as
an extra value in the title record, the content could not be
found with a non-field search.
[BZ528] Remove extra colon from content-type field in SOAP response.
[BZ533] Title parsing error in the HTML page when doctype has no
space between attributes.
Correction to Documentation
---------------------------
Known Issues
------------
Spell checking, Group by Location, and passage-based summaries use
additional computing time during a query. This overhead can affect
the performance of the query server, reducing the maximum number of
queries per second. If performance is critical, disabling these
features will increase the peak queries per second. You can disable
features on a per-style basis on the Interface tab.
RELEASE NOTES FOR THE LINUX RELEASE
WARNING FOR INKTOMI SEARCH 4.1 and 4.2 LINUX USERS
--------------------------------------------------
Due to a change in the URL database format, users of Inktomi
Search 4.1 and 4.2 on Linux will notice an error message after
upgrading the server. Users upgrading from version 4.0
or earlier will not notice this. If 4.1 or 4.2 URL databases
are detected, the following log message will appear:
Version 4.1 or 4.2 URL database file detected. This is
incompatible with this version. The URL database is being
backed up, then it will be erased.
The site will be revisited immediately. Be aware that this
may cause extraneous documents to appear in the index that
have actually been removed from the site. If this condition
is detected, clear the collection to remedy it.
SUPPORTED OS VERSION
--------------------
Ultraseek has been tested on RedHat Linux 7.1 and later, with
a kernel version of 2.2.5 and glibc 2.2.4.
C++ LIBRARY REQUIRED
--------------------
The C++ runtime library libstdc++-libc6.1 is required
to run Ultraseek.
If you are using RedHat Linux, this file is part of:
    Version   RPM Package
    -------   -----------
    7.X       compat-libstdc++-6.2-2.9.0
REPORTING A BUG
---------------
When reporting a problem, be sure to include your
Linux distribution, the version of your kernel (uname -r),
and the version of your glibc.
Send problem reports to software-bugs@ultraseek.com.
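
If Python 3 is available on the machine, the kernel and glibc versions
can be collected with the standard platform module, as in the sketch
below. The function name is hypothetical, and the Linux distribution
name still has to be added by hand.

```python
import platform


def bug_report_details():
    """Collect the kernel and C library versions for a problem report."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel": platform.release(),  # same value as `uname -r`
        "libc": f"{libc_name} {libc_version}".strip(),
    }


if __name__ == "__main__":
    for key, value in bug_report_details().items():
        print(f"{key}: {value}")
```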
SUPPORT FOR POSTSCRIPT
----------------------
Ultraseek will index postscript files if you have
installed ghostscript and it is on your path.
RELEASE NOTES FOR THE SOLARIS RELEASE
Ultraseek requires Solaris 8 or above.
On Solaris 8, update 7 with the T2 threading library is recommended,
particularly for SMP hardware.
From Sun's support library:
http://developers.sun.com/solaris/articles/alt_thread_lib.html#question6
Solaris 8 Update 7 (2/2002) release is recommended for use of the T2 library
since it contains all the performance enhancements and patches. If your release
level is lower than Update 7, then you can apply the maintenance update 7
patch cluster or the following patches related to T2:
* 108528-13: SunOS 5.8: kernel update patch
* 108827-17: SunOS 5.8: /usr/lib/libthread.so.1 patch
If Ultraseek is installed as a suid(root) program, you must install /usr/lib/lwp/
as a secured directory (see Question 11 in the above document).
OS FILE HANDLE REQUIREMENT
--------------------------
Ultraseek requires at least 1024 file handles be available.
If your system's hard file-descriptors limit is set to less
than 1024, the server will report an error message and
refuse to start up.
You can find out your hard file-descriptors limit as follows:
    sh
    ulimit -Hn
    exit
If ulimit returns less than 1024, you will need to manually increase
the hard limits in your system configuration.
In Solaris 2.4+, this can be accomplished by adding the
following lines to /etc/system:
*Set hard limit on file descriptors
set rlim_fd_max = 4096
Save the changes and reboot Solaris.
See http://access1.sun.com/technotes/01406.html for more information
on file descriptor limits.
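
The same check can be scripted. The Python 3 sketch below reads the
hard limit with the standard resource module (Unix only) and compares
it with the 1024 handles required above. The function names are
hypothetical, and this is not how Ultraseek itself performs the check.

```python
import resource

REQUIRED_HANDLES = 1024  # minimum stated in these notes


def limit_is_sufficient(hard_limit, required=REQUIRED_HANDLES):
    """Return True if the hard limit allows at least `required` handles."""
    # RLIM_INFINITY means the limit is unlimited.
    return hard_limit == resource.RLIM_INFINITY or hard_limit >= required


def check_hard_limit():
    """Check the hard file-descriptor limit of the current process."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return limit_is_sufficient(hard)
```

Run the check as the same user that starts the server, since limits
can differ between accounts.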
Patchnotes for 5.3
------------------
Initial release
Patchnotes for 5.3.1
--------------------
Fixed configuration problem when Mirror collections used HTTPS.
Performance improvements when using Mirror collections to connect
different hardware architectures (Linux or PC to/from Solaris).
Admin localizations are now kept in an external XML file in the
languages directory.
Improved form-based authorization support to allow redirects to the
same URL from a form.
Support for "Content-Encoding: deflate" has been disabled in the
HTTP server. There are two conflicting interpretations of the "deflate"
standard, and which interpretation applies is browser specific.
XPA Indexer collections now have a "show log" button in the Admin
interface.
Fixed bugs:
-----------
[BZ034] The "Searched Document" count is now correct for no-hits queries
[BZ295] Some field searches using quotes returned unexpected results.
[BZ437] A Mirror collection now generates email after 5 failed polls.
[BZ444] Unicode exception from saquery.xml when qt was omitted.
[BZ488] Searches using || without spaces between the terms returned
incorrect results. Also, this problem could result in a hung
server thread or segmentation fault.
[BZ545] Content Assistant tracing also enabled tracing for web services.
[BZ569] Grammar corrections in admin help pages.
[BZ580] Content Assistant error: SOAP Proxy object is Null.
[BZ584] Mirror collections always poll after restart.
[BZ585] Some documents not deduped on initial index.
[BZ591] Searches using link: do not work.
[BZ593] Multi-spider may insert duplicate documents into the index.
[BZ595] Unable to shutdown/restart when disk is full.
[BZ597] IP address access control was not enabled for web services.
[BZ605] Web services "describe" call threw exception for empty
list of sources.
[BZ618] Do not throw exception when web services tracing is enabled.
[BZ619] NameError when doing a wildcard search in Delete URL.
[BZ623] Exception was thrown when parsing or inserting HTML documents into
direct indexing collection with XPA.
[BZ625] Race condition during topics initialization after restart
[BZ626] Mirrored Quicklinks would persist in memory after reset
[BZ631] Document deletions from XPA throw exception.
[BZ638] "unknown compression method" thrown when older XPA clients
connected.
[BZ641] A 5.3 Collection mirrored by another collection would
acquire too many on-disk databases
[BZ642] Idle indexing collections now auto-save on a regular basis
[BZ644] Exception when Unicode queries sent to search web service
[BZ645] Content Assistant was sending control characters in XML
[BZ648] Timeout was not enforced while receiving HTTP headers
[BZ649] Length limit was not enforced while receiving HTTP headers
[BZ650] Column overrides were ignored the first time a database table
was indexed.
[BZ651] Finnish translation for "Advanced" was incorrect.
[BZ662] Spell sessions were not removed from cache when collection
updated.
[BZ669] The URL filter test page misreported a customized patch to
quality for certain documents.
[BZ670] Cookie paths did not respect the "Ignore case differences in
URLs" checkbox on the Collections/Tuning page.
Patchnotes for 5.3.2
--------------------
Updated Japanese admin localization.
Fixed bugs:
-----------
[BZ654] Too much text was shown as a term match for an inferred phrase.
[BZ700] Spelling was suggesting "rare" terms instead of frequent terms.
[BZ703] Request line was missing from Activity/Threads page.
[BZ707] Terms to the left of a || operator were influencing relevancy
scores (including terms defined with "qp" query parameter).
[BZ708] Terms to the left of a || operator were shown with word scores
(including terms defined with "qp" query parameter).
[BZ709] STARTINDEX and STOPINDEX tags in uppercase were not
functional.
[BZ714] Log files with Japanese characters caused ASCII decoding
error when viewed through the admin interface.
Patchnotes for 5.3.3
--------------------
The linguistic analysis software for all languages except Japanese,
Korean and Chinese has been replaced by software from the Teragram
Corporation. This should have no noticeable effect on searching and
indexing, but the format of the user dictionaries has changed. Any
customizations to pre-5.3.3 user dictionaries must be re-implemented.
The user dictionaries are contained in:
[Ultraseek Program Directory]/language
The new format is:
WORD,ROOT:w
where WORD is the word to be stemmed and ROOT is the stem.
For example, to stem the word "webservers" to "webserver" in the
English language, enter the following line in the en.usr file in the
language directory:
webservers,webserver:w
Any changes to a user dictionary will require a slight wait during
start-up the first time (and only the first time) the changes are
loaded.
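
When many customizations have to be re-implemented, the entries can be
generated rather than typed. The Python 3 sketch below produces lines
in the WORD,ROOT:w format shown above. The function names and the
rejection of commas, colons, and line breaks inside a word are
assumptions made as a precaution. They are not documented Ultraseek
rules.

```python
def format_entry(word, root):
    """Return one user dictionary line in the WORD,ROOT:w format."""
    for part in (word, root):
        # Assumption: separators inside a word would corrupt the line.
        if not part or any(ch in part for ch in ",:\n"):
            raise ValueError(f"cannot encode {part!r} in a dictionary entry")
    return f"{word},{root}:w"


def format_entries(pairs):
    """Return the text for several (word, root) pairs, one per line."""
    return "\n".join(format_entry(word, root) for word, root in pairs) + "\n"
```

For example, format_entry("webservers", "webserver") yields the line
shown above for the en.usr file.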
Other Changes
-------------
Ultraseek now uses Python 2.2.3 (upgraded from 2.2.2).
The Admin interface now has a "Test" button under "Interface" to
test a style during modification.
The Admin interface now has a "Test" button under the Form-based Authentication
interface to test the authentication parameters.
Ultraseek now throws an "IllegalArgumentException: Unsupported Locale"
client exception for XPA API calls which specify a Locale which
is unsupported or unlicensed on the Ultraseek server.
Fixed bugs:
-----------
[BZ417] Repeated term counts no longer display for terms that occur
multiple times in a query.
[BZ694] Internal exception for XPA spelling suggestion with an
unsupported Locale
[BZ717] Segmentation fault occurred along with the message "GC object
already in linked list"
[BZ720] Hits were returned for queries using | or || query prefixes,
a query suffix, and optional query terms which should result
in no-hits.
[BZ722] Topic browse now works when style requires "All Terms Must
Match"
[BZ741] Topics rules did not initialize correctly during server start
up, causing some topics to lose documents over time.
[BZ750] XSS security fix
[BZ753] Mirrored documents were counted toward the license
[BZ758] Internal exception for XPA search with an unsupported Locale
[BZ791] Corrected multi-thread problem with wildcard expansion.
Occasionally, this would incorrectly report "corruption" in
a collection's TMS database.
[BZ812] favicon.ico is now distributed in the docs directory
[BZ813] /util/exception.html can now use "raise done()" and "raise
redirect()"
[BZ817] "Word Scores" can now be turned off on the Advanced Search
page when the style has set them on by default.
[BZ820] Sort-by-title now uses score & URL as keys when documents
have the same title.
Sort-by-date now uses score & URL as keys when documents have
the same date.
[BZ821] Resolved performance issues when using group-by-location or
global setting query_dedup
[BZ822] Term hit-counts are now reduced by removed duplicates when
query_dedup is on
Patchnotes for 5.3.4
--------------------
[BZ048] Database collections did not always decode the character set
of the database table correctly.
[BZ503] Spider reported "URL database file full" before reaching 1M
URLs.
[BZ661] Several "403 Disallowed" messages appear in the error log
when using the aggressive spider, even when "Log Disallowed
Filters" is not checked on the Collections/Filters page.
[BZ730] Using the Page Expert test page from within a scanner
collection generated the error: "Collection instance has no
attribute 'simple_http_client'".
[BZ739] The default authorization for an Exchange or netnews
collection was invalid.
[BZ840] XSS Security fix
[BZ844] Viewing a log file displays "UnicodeError" if the log file
contained a non-utf8 character.
[BZ856] Ultraseek was unable to index zip files that contained
trailing comments.
[BZ861] Invalid HTTP request caused "exceptions.AttributeError:
'NoneType' object has no attribute 'startswith'
(httpsrvr.py:904)".
[BZ862] "Group by topic" feature did not respect the "ht" form
variable.
[BZ868] Query response time slows down under query load on Solaris.
[BZ870] Page Expert test page does not display properly.
[BZ871] Keyview did not HTML quote text from binary documents.