Yolinux.com Tutorial

WAIS - Wide Area Information Server




WAIS Introduction:

WAIS is one of the original search facilities used to index and search a collection of documents such as a web site. For something more current, see the YoLinux.com tutorial on htDig, which provides an index and search capability for your web site.

WAIS was developed by Thinking Machines Corporation in 1988 to index collections of documents and to search those indexes. It employs a client/server architecture. It was an advance made necessary by the large number of documents residing on web sites. Free text searches with tools such as "grep" were too slow to be applied against large numbers of documents. WAIS speeds up the process by doing the expensive work up front: the documents are indexed once, and each query consults the index instead of scanning every file. A WAIS search returns the titles of the documents that best match the query.

Indexing a site creates databases (called sources) from the documents. This is done by the program waisindex. The sources it generates are used by waisserver. The program waisq is the client interface to the WAIS server.

WAIS incorporates relevance ranking, which assigns a weighting factor to every indexed word. Words appearing in a title are assigned a higher relevance. Words that are used less often get a higher ranking. The number of times a word is used in a document and the size of the document also influence the weighting of the word in the index.
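
The last two factors can be illustrated with a toy calculation. The sketch below is not the freeWAIS weighting formula; it only reports how often a word occurs in a file and how many words the file holds, which are the two raw numbers such a weighting would start from. The function name term_density is hypothetical, and the word is expected in lower case.

```shell
# term_density WORD FILE
# Print "occurrences/total words" for WORD in FILE (case-insensitive).
# Illustration only: freeWAIS computes its own weights internally.
term_density() {
    word=$1
    file=$2
    # Split the text into one word per line, then lower-case everything.
    tr -cs '[:alnum:]' '\n' < "$file" | tr '[:upper:]' '[:lower:]' |
        awk -v w="$word" '
            NF  { total++; if ($0 == w) hits++ }
            END { printf "%d/%d\n", hits, total }'
}
```

Usage: term_density wais /usr/local/docs/HTML/index.html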


Download:

ftp://sunsite.unc.edu/pub/packages/infosystems/wais/servers/freeWAIS/
Get binaries:
freeWAIS-0.5-<UNIX type>.tar.gz where "<UNIX type>" is SunOS, Linux, AIX...

or get source code: freeWAIS-0.5.tar.gz

Note that the word "source" in the WAIS world does not always mean source code. It often means the source (as in origin) of a search index.

Man pages:

ftp://sunsite.unc.edu/pub/packages/infosystems/wais/documentation/man-pages/*.1

Technical explanation of the protocol: (not needed for setup)

ftp://sunsite.unc.edu/pub/packages/infosystems/wais/documentation/protspec.txt

Note metalab.unc.edu = sunsite.unc.edu.


Unpack:

Unzip-tar the binaries:

  • gzip -d freeWAIS-0.5-<UNIX type>.tar.gz
  • tar xf freeWAIS-0.5-<UNIX type>.tar

The essential elements you need from this are: waisq, waissearch, waisserver and waisindex.

(You can use swais.sh, which calls swais, to run WAIS without the network if you wish.)

Place the WAIS binaries in "/usr/local/bin", "/opt/bin" or another accessible bin directory.
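
The copy step can be sketched as a small shell function. This is a minimal sketch, not part of the freeWAIS distribution: the function name install_wais_bins is hypothetical, and both directories are arguments because the location of the binaries inside the unpacked tree depends on the archive you downloaded.

```shell
# install_wais_bins SRC_DIR DEST_DIR
# Copy the four essential programs from SRC_DIR (assumed to hold the
# unpacked binaries) into DEST_DIR (for example /usr/local/bin).
install_wais_bins() {
    src_dir=$1
    dest_dir=$2
    status=0
    for prog in waisq waissearch waisserver waisindex; do
        if [ -f "$src_dir/$prog" ]; then
            cp "$src_dir/$prog" "$dest_dir/$prog" && chmod 755 "$dest_dir/$prog"
        else
            # Report what is missing but keep going with the rest.
            echo "missing: $prog" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns non-zero if any of the four programs was not found, so a calling script can stop before trying to index.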


Index:

Indexing a collection of documents generates a "sources" database made up of the following files:

  • .dct - Dictionary file.
  • .inv - Inverted file. This holds the association between words and documents.
  • .doc - Pointers to documents and descriptive headers.
  • .fn - File names. List of files used to generate source files.
  • .hl - List of all the headlines.
  • .cat - List of headlines and corresponding documents.
  • .src - Sources database information.
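
After an indexing run it can be handy to confirm that every file in the list above was written. The following is a minimal sketch that assumes all seven suffixes are expected for your index; the function name check_wais_index is hypothetical.

```shell
# check_wais_index PREFIX
# PREFIX is the index path without suffix,
# e.g. /usr/local/http/wais/sources/abc_index
check_wais_index() {
    prefix=$1
    missing=""
    for suffix in dct inv doc fn hl cat src; do
        [ -f "$prefix.$suffix" ] || missing="$missing .$suffix"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "complete"
}
```

Usage: check_wais_index /usr/local/http/wais/sources/abc_index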

My index script: (Indexes for use on the web)

Create a synonym file if required (it is optional): /usr/local/http/wais/sources/abc_index.syn
File entries: (one list of synonyms per line)

Microsoft Monopoly
computer processor server
word synonym synonym ...
...
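
A quick sanity check of a synonym file can be sketched as follows. It assumes only what the sample shows, namely that each line holds a word followed by at least one synonym, and flags lines with a single word. The function name check_synonym_file is hypothetical; this is not a freeWAIS tool.

```shell
# check_synonym_file FILE
# Report lines that contain a single word and therefore define no synonym.
# Blank lines are ignored. Returns non-zero if any such line is found.
check_synonym_file() {
    awk 'NF == 1 { printf "line %d has no synonyms: %s\n", NR, $0; bad = 1 }
         END     { exit bad + 0 }' "$1"
}
```

Usage: check_synonym_file /usr/local/http/wais/sources/abc_index.syn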

Words to be ignored (stop words) are hard coded in waisindex.
See the source code for the list of words: freeWAIS-0.5/src/ir/stoplist.c

Start script:

#!/bin/csh
# Index pages at this site.
# Create reference to pages in the form of a URL
waisindex -l 1 -export -d /usr/local/http/wais/sources/abc_index  \
                       -t URL /usr/local/docs/HTML http://XXX.XXX.XXX.XX  \
                       -r /usr/local/docs/HTML/*

waisindex flags:

-l : Logging level.
-export : Write host and port information into the generated .src file so the index can be offered to network clients.
-d : Path and file name prefix for the generated index files.
     /usr/local/http/wais/sources/abc_index = File name without suffix for the index.
     waisindex will create abc_index.src, abc_index.cat, abc_index.dct, etc.
     The directories in the path must already exist.
-t : Type of index created.
     URL = Results returned from a search will be in the form of a URL.
     /usr/local/docs/HTML = Path name to trim from each reference when creating the URL.
     http://XXX.XXX.XXX.XX = Web path to the same directory. (Best to use a domain name.)
     Note there is no terminating "/".
-r : Recurse through subdirectories.
     /usr/local/docs/HTML = Path of the HTML documents you will be indexing.
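
The effect of the two arguments that follow "-t URL" can be sketched in shell. This only mimics the trimming described above and is not code from waisindex; the function name path_to_url and the host www.example.com are hypothetical.

```shell
# path_to_url TRIM_PREFIX URL_BASE FILE_PATH
# Replace the leading TRIM_PREFIX of FILE_PATH with URL_BASE.
# URL_BASE has no terminating "/", matching the note above.
path_to_url() {
    trim_prefix=$1
    url_base=$2
    file_path=$3
    case $file_path in
        "$trim_prefix"/*)
            printf '%s\n' "$url_base${file_path#"$trim_prefix"}"
            ;;
        *)
            echo "path is outside $trim_prefix" >&2
            return 1
            ;;
    esac
}
```

For example, path_to_url /usr/local/docs/HTML http://www.example.com /usr/local/docs/HTML/faq/wais.html prints http://www.example.com/faq/wais.html. A trailing "/" on the web path would produce a double slash, which is why the note asks you to leave it off.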


Server:

I could never get it working from inetd. Use a start script instead.

Start script used: (place this command in /etc/rc.local, terminated with & so that it runs in the background)

#!/bin/csh
# Start wais server
/usr/local/bin/waisserver -l 0 -p 210 -d /usr/local/http/wais/sources \
                                      -e /usr/local/http/logs/wais &
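Explanation:
-l = Logging level.
-p = Port number. Port 210 is the port assigned to the ANSI/NISO Z39.50 protocol, on which WAIS is based.
-d = Directory of index files.
-e = File to which the server log is written.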

inetd setup: (DID NOT WORK!!)

File: /etc/inetd.conf (the entry must be a single line; the backslash below only marks where it is wrapped for display)

# wais web index server
wais  stream tcp nowait root /usr/local/bin/waisserver waisd.d \
         -d /usr/local/http/wais/sources -e /usr/local/http/logs/wais
#

File: /etc/services

wais            210/tcp                         # wais server for web indexing
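
Before editing the services file, you can check whether the entry is already present. This is a minimal sketch; the function name has_wais_service is hypothetical, and the file defaults to /etc/services when no argument is given.

```shell
# has_wais_service [SERVICES_FILE]
# Succeeds if the file contains an uncommented "wais 210/tcp" entry.
has_wais_service() {
    services_file=${1:-/etc/services}
    grep -Eq '^wais[[:space:]]+210/tcp' "$services_file"
}
```

Usage: has_wais_service || echo "add the wais entry to /etc/services"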

AIX start script: (started from cgi-bin by the web server)

#!/bin/ksh
# Start wais server

echo Content-type: text/plain
echo ""

/usr/bin/ps -ef | /usr/bin/grep "waisserver" | /usr/bin/grep -v grep >/dev/null
if [ "$?" -eq 0 ]; then
   echo "Wais already running!"
   exit 0
fi

if [ -x /userdata/ec/ecc5/bin/waisserver ]; then
   /userdata/ec/ecc5/bin/waisserver -l 0 -p 2210 -d /ad/src/html/wais-sources \
                                    -e /userdata/ec/ecc5/tmp/wais &
   echo "Wais Started!"
else
   echo "waisserver not found or not executable!"
fi


Server CGI - kidofwais:

Perl script to invoke the WAIS client "waisq".

Download scripts kidofwais.pl, print_hit_bold.pl and cgi-lib.pl and place them in your /cgi-bin/.

The cgi Perl script to execute can be found at:
Home page: http://www.cso.uiuc.edu/grady.html
Download script: http://ljordal.cso.uiuc.edu/kidofwais.pl

Edit script:

  • Edit script variables:
    • $waisq (location of waisq - /usr/local/bin/waisq),
    • $waisd (location of wais indexes),
    • $default_src, $openingTitle, $closingTitle, $wwwDocpath, $serverURL and $maintainer.
  • The paths in $wwwDocpath and $serverURL must both end with a "/".
  • Set "require" statement to point to cgi-lib.pl
  • Set variables:
    • $use_Source_table = 0;
    • $use_hilite = 0;
The hilite option works well only on plain text pages; otherwise it performs poorly and messes up pages with embedded graphics. Thus $use_hilite = 0;

Download script: http://ljordal.cso.uiuc.edu/print_hit_bold.pl

Edit script variables $serverURL and $maintainer.

This requires the Perl script cgi-lib.pl:

  • home page: http://cgi-lib.stanford.edu/cgi-lib/
  • Download script: /2.17/cgi-lib.pl.txt
  • Rename to: /.../http/cgi-bin/cgi-lib.pl


Source Table: Using multiple search indexes:

- The previous setup is for one index -

Searching multiple indexes with one query: (OPTIONAL) - Useful for multiple servers

Set variable $use_Source_table = 1;

Create file /usr/local/http/wais/sources/Source_table

Sample:

abc_index~ABC Developer Web Site~1~ABC:~~abc_index,abc_index_2,abc_index_3
abc_index_2~ABC Next Generation~0~ABC_II:
abc_index_3~ABC Future Plans~0~ABC_Plans:

See: http://www.cso.uiuc.edu/grady.html/Source_table.txt

Note: The first line references itself and the lines that follow. Put "1" in the third column of the first line so that it can reference the other lines; those lines use "0" and do not reference anything further.

Format: Table of WAIS sources and how to process them, with columns separated by tildes. Each entry goes on a single line (wrapped here for display):

wais_source_name~title_to_use~search_multiple_indices?~short_name
~file_title_table~list_of_src_names_to_search_if_multiple_indices

This table contains the following info:
  • Name of "source" as passed to kidofwais; this can be a "pseudo" name to stand for a "multiple index search" (not an actual WAIS source itself).
  • Titles to display on "Search" page to describe what is being searched.
  • Is this a "pseudo name" standing in for a list of sources to each be searched and presented back as one hit list? (0 for no, 1 for yes).
  • A short title for cases where we are searching multiple indices and want to prefix the "hit" document title with an indication of which index it was found in.
  • Optionally specify a table to use to look up titles for individual filenames returned on the WAIS hit list -- good for filetypes like PDF that don't otherwise have a way of extracting a title.
  • The names of the sources (as found in the first column of this table) to be searched when this is a "multiple index search". These names should be separated by commas. This is currently not set up to be "recursive", so none of these source names will be further expanded into additional source names (so don't include a name that isn't a "real index" itself). This list can include its own name (the same name in column one of this entry) if it itself is also a "real index" (see the "ccso_main_www" entry in the Source_table.txt file referenced above; the "abc_index" entry in the sample does the same).
  • The URL to go to from the Search screen. This gives the user the option to jump back to a "higher/previous" page without going through the "back button" a bunch of times.
  • A "title" for the above "url to go to"; this is used to label the link. This title is preceded by the words "go to the" when displayed, so pick something that "sounds good" in that context!
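
The column layout can be checked by splitting each line on the tilde. The following is a minimal sketch, not part of kidofwais: the function name list_sources is hypothetical, and it reads only columns 1, 2, 3 and 6 as described in the list above.

```shell
# list_sources SOURCE_TABLE_FILE
# Print one summary line per entry: the source name, its title, and either
# the list of indexes it stands for or a note that it is a single index.
list_sources() {
    awk -F'~' '
        NF >= 4 {
            if ($3 == 1)
                printf "%s (%s): searches %s\n", $1, $2, $6
            else
                printf "%s (%s): single index\n", $1, $2
        }' "$1"
}
```

Run against the three-line sample above, the first entry is reported as searching abc_index,abc_index_2,abc_index_3 and the other two as single indexes. Lines with fewer than four columns are skipped.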

WAIS-sf:

WAIS-sf is a version of WAIS for "Structured Fields", developed in 1993 to extend query functionality. Added functionality includes wildcard searches, boolean searches, numeric searches with operators such as less than and greater than, and searches restricted to defined fields in a document (e.g. author). The fields must be described using a WAIS-sf format description of the document layout.

freeWAIS-sf and SFgate:
See: http://www.igd.fhg.de/www/www95/papers/47/fwsf/fwsf.html
Download SFgate:
ftp://ls6-ftp.cs.uni-dortmund.de/pub/src/SFgate/
  • SFgate-5.111.tar.gz
  • SFgate.ps.gz

freeWAIS-sf:
ftp://ls6-ftp.cs.uni-dortmund.de/pub/src/freeWAIS-sf/

  • freeWAIS-sf-2.2.12.tar.gz
  • fwsf.ps.gz

wais.pm (Perl module)


Books:

  • "Web Publishing Unleashed, HTML, JAVA, CGI, VRML, SGML"
    ISBN #1-57521-051-7, SAMS
    This book dedicates an entire chapter to WAIS and search engines.

   





Copyright © 1999 by Greg Ippolito