Adding a Custom Document Filter
Ryan Weisenberger
While Ultraseek can parse a wide variety of document formats from several different vendors, occasionally an enterprise has a document format that is unknown to Ultraseek. In this case, it is possible to add your own document filter so Ultraseek can parse your custom document type. To add a user-defined document type to Ultraseek, there are three steps you must follow.
Step 1: Edit your
| parameter | type | description |
|---|---|---|
| doctype | string | The name of your document type. On the admin console under Server > Doctypes, in the section titled Document Type Parsing, the menus in the Parse As column will include your additional doctype. |
| convertername | string | The name of the program or script that converts your user-defined type to HTML. |
| errdict | dictionary | A Python dictionary that maps the integer exit codes of your conversion script or program to string error messages. |
For example, here is how you would add support for a rot13 file, a file type in which each letter is rotated through the alphabet by 13 positions (so “abc” becomes “nop”).
Add the following lines to the patches.py file:
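import parse, config

# Map the converter's integer exit codes to the messages that get logged.
errdict = {
    1: "script error - you used the wrong exit code!",
    2: "some other error message",
}

# Look for the script in the Ultraseek lib directory.
convertername = config.program_lib_path("unrot13")

# Register the "rot13" doctype so it appears in the Parse As menu.
parse.define_doctype_filter("rot13", convertername, errdict)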
After making these additions, restart Ultraseek.
The code you added to patches.py will be executed and your new document type will appear in the Parse as menu in the Document Type Parsing section under Server > Doctypes.
Step 2: Configure Ultraseek to use your document filter
In the Ultraseek admin console, go to the Server > Doctypes pane.
Add the following new extension to the Document Type Specification:
.rot13 application/rot13
Add the following new document type under Document Type Parsing:
Document type: application/rot13
Parse As: rot13
Click the OK button to save your new settings.
Note: Your Web server must serve the correct MIME type for this document for Ultraseek to use the custom filter. Make sure the document type and MIME type are registered with your Web server software.
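One way to confirm what the Web server actually sends is to request a document and inspect its Content-Type header. The following is a minimal sketch using only the Python standard library. The function name served_content_type and the URL in the usage note are placeholders for illustration and are not part of Ultraseek.

```python
import urllib.request


def served_content_type(url):
    """Return the MIME type the server reports for url, without parameters."""
    # A HEAD request fetches the headers only, not the document body.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        # get_content_type() drops parameters such as "; charset=utf-8".
        return response.headers.get_content_type()
```

For the rot13 example, a call such as served_content_type("http://intranet.example.com/docs/sample.rot13") would be expected to return application/rot13 once the Web server is configured correctly. If the server sends no Content-Type header at all, this helper reports text/plain, which is the default of the underlying header parser.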
Step 3: Install your conversion program
Your conversion program will be run by Ultraseek as follows:
converter inputfile outputfile errorfile
Your converter program must read the contents of inputfile, convert it to HTML, and write the results to outputfile. Error messages may be written to errorfile.
If the conversion works normally, it should exit with an exit code of 0. If the converter fails for some reason, it should exit with a non-zero exit code.
If the exit code is non-zero, Ultraseek logs an error message that combines the message looked up in the error dictionary passed to define_doctype_filter with any output written to errorfile.
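To try a converter outside Ultraseek, the calling convention described above can be imitated with a small harness. This is a rough sketch in Python, not Ultraseek's own code: the function name run_converter is hypothetical, and the exact layout of the combined error message is an assumption, since the format Ultraseek uses in its log is not specified here.

```python
import os
import subprocess
import tempfile


def run_converter(converter, inputfile, errdict):
    """Run `converter inputfile outputfile errorfile` and report the result.

    converter is a program path, or a list such as ["/bin/sh", "unrot13"].
    Returns (html, error_message); exactly one of the two is None.
    """
    workdir = tempfile.mkdtemp()
    outputfile = os.path.join(workdir, "out.html")
    errorfile = os.path.join(workdir, "err.txt")

    # Same argument order as described above: input, output, error file.
    command = [converter] if isinstance(converter, str) else list(converter)
    code = subprocess.call(command + [inputfile, outputfile, errorfile])

    if code == 0:
        with open(outputfile) as f:
            return f.read(), None

    # Non-zero exit: combine the dictionary text with anything the
    # converter wrote to its error file (assumed message layout).
    details = ""
    if os.path.exists(errorfile):
        with open(errorfile) as f:
            details = f.read().strip()
    message = errdict.get(code, "unknown error (exit code %d)" % code)
    if details:
        message = "%s: %s" % (message, details)
    return None, message
```

Calling run_converter with the path to your converter, a sample input file, and the same errdict used in patches.py gives a quick check of both the success path and the messages tied to each exit code before the filter is installed.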
For example, you can use the following shell script to convert rot13 files back into normal text:
#!/bin/sh
# $1 is the input file
# $2 is the output HTML file that will be indexed
# $3 is the file where we place any error messages we generate
echo "<html><body>" > "$2"
tr 'n-za-mN-ZA-M' 'a-zA-Z' < "$1" >> "$2"
echo "</body></html>" >> "$2"
# Uncomment the next line to see the 'script error' message
# from our sample error dictionary in the log file.
# exit 1
exit 0
To use this script with the example above, save it in the Ultraseek /lib directory in a file named unrot13.
If you are using Windows, the conversion program must be a .exe file and not a script.
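The shell script above copies the decoded text straight into the HTML, so characters such as < or & in a document would be read as markup. As an illustration only, the same converter contract can be sketched in Python with HTML escaping added. This sketch assumes a Python 3 interpreter is available on the host and that input files are UTF-8 text. The exit codes 1 and 2 are arbitrary choices that line up with the sample error dictionary. On Windows, a script like this would still have to be packaged as a .exe.

```python
#!/usr/bin/env python3
import codecs
import html
import sys


def convert(inputfile, outputfile):
    """Decode a rot13 text file and write it out as a minimal HTML page."""
    with open(inputfile, encoding="utf-8") as src:
        text = codecs.decode(src.read(), "rot13")
    with open(outputfile, "w", encoding="utf-8") as dst:
        dst.write("<html><body><pre>\n")
        # Escape <, > and & so document text is not treated as markup.
        dst.write(html.escape(text))
        dst.write("</pre></body></html>\n")


def main(argv):
    # Expected call: converter inputfile outputfile errorfile
    if len(argv) != 4:
        return 1
    inputfile, outputfile, errorfile = argv[1:4]
    try:
        convert(inputfile, outputfile)
    except (OSError, UnicodeError) as exc:
        # Details written here are logged along with the errdict message.
        with open(errorfile, "w", encoding="utf-8") as err:
            err.write(str(exc))
        return 2
    return 0


# Act as a converter only when arguments are supplied, so the file can
# also be imported or tested without exiting.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Keeping the conversion in a separate convert function makes it possible to exercise the decoding and escaping on sample files without going through Ultraseek at all.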
Posted June 28, 2005 09:51 AM by editor
Category: Customizing