htDig - Web Site Search

HtDig: Description

ht://Dig is available with most Linux distributions and is intended for indexing a single web site or domain. Unlike WAIS or Pearlsearch, which index a single server, ht://Dig can span several web servers. It is not an internet-wide search engine like Yahoo or Google.

ht://Dig is a free, open source (GPL) index and search engine that can be installed on a web server. It first generates a database by "indexing" the web content. HtDig provides a CGI program (htsearch) that searches the database and generates a web page of search results pointing to the content on the website.

HtDig indexes HTML and text file content to generate a keyword search database. It can also email you when it finds "expired" documents.

HtDig: Installation

Red Hat/CentOS: yum install htdig htdig-web
Ubuntu: sudo apt-get install htdig
Install from source:
- tar xzf htdig-3.2.0.tar.gz
- cd htdig-3.2.0
- ./configure --prefix=/opt
- make depend
- make
- make install

HtDig: Configuration

Config file: /etc/htdig/htdig.conf

# Specify where the database files need to go. Needs lots of disk space.
database_dir:           /var/lib/htdig

# This specifies the URL where the robot (htdig) will start.
# You can specify multiple URLs here (separate with whitespace).
start_url:              http://www.yourdomain.com

...
...

If supporting multiple virtual domains, you may want to create one config file per domain: /etc/htdig/htdig-domain1.conf, /etc/htdig/htdig-domain2.conf, etc.

File: /etc/htdig/htdig-domain1.conf

database_dir:   /var/lib/htdig/domain1

start_url:      http://www.domain1.com

common_dir:     /usr/share/htdig-domain1

exclude_urls:   /cgi-bin/ .cgi images
...
...
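
When several virtual domains are indexed, the per-domain files differ only in a few attributes, so they can be generated rather than edited by hand. The following is a minimal sketch, assuming the directory layout used in this tutorial. The function name write_domain_conf and its arguments are hypothetical (not part of ht://Dig), and any attributes elided above still have to be added to the generated file.

```sh
#!/bin/sh
# Sketch: generate a per-domain ht://Dig config file.
# write_domain_conf is a hypothetical helper, not part of ht://Dig.
# Usage: write_domain_conf NAME START_URL [CONF_DIR]
write_domain_conf() {
    name=$1
    url=$2
    conf_dir=${3:-/etc/htdig}
    conf_file="$conf_dir/htdig-$name.conf"

    # Only the attributes shown in this tutorial are written.
    # common_dir assumes the /usr/share/htdig-NAME layout created later.
    cat > "$conf_file" <<EOF
database_dir:   /var/lib/htdig/$name
start_url:      $url
common_dir:     /usr/share/htdig-$name
exclude_urls:   /cgi-bin/ .cgi images
EOF
    # Print the path of the file that was written.
    echo "$conf_file"
}
```

For example, write_domain_conf domain1 http://www.domain1.com writes /etc/htdig/htdig-domain1.conf (root privileges are needed for the default directory).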

HtDig search index database directory: /var/lib/htdig
If supporting multiple domains, create one subdirectory per domain: mkdir /var/lib/htdig/domain1, etc.
Generate the database: rundig -c /etc/htdig/htdig-domain1.conf
To avoid down time, add the "-a" command line option, which uses alternate work files so that users can still search the site while you are spidering the content: rundig -c /etc/htdig/htdig-domain1.conf -a
Header and Footer pages: (used to display htDig search results)
- Red Hat:
```
/usr/share/htdig/footer.html
/usr/share/htdig/header.html
/usr/share/htdig/nomatch.html
etc
```
  Virtual/multiple domains: /usr/share/htdig-domain1/, /usr/share/htdig-domain2/, etc
  - mkdir /usr/share/htdig-domain1
  - mkdir /usr/share/htdig-domain2
  - cp /usr/share/htdig/* /usr/share/htdig-domain1/
  - cp /usr/share/htdig/* /usr/share/htdig-domain2/
  - chcon -Rt httpd_sys_content_t /usr/share/htdig-domain1/
- Ubuntu:
```
/etc/htdig/nomatch.html
/etc/htdig/footer.html
/etc/htdig/header.html
etc
```
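
The Red Hat steps above (database directory, copy of the header and footer pages, SELinux label, first indexing run) can be collected into one script per domain. This is a sketch under the paths shown in this tutorial, not a definitive installer. The names run, setup_domain and DRY_RUN are hypothetical, and on Ubuntu the template files live in /etc/htdig instead and the chcon step would not apply.

```sh
#!/bin/sh
# Sketch: prepare directories for one virtual domain and build its index.
# Red Hat paths from this tutorial are assumed. run, setup_domain and
# DRY_RUN are hypothetical names, not part of ht://Dig.

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: setup_domain NAME   (expects /etc/htdig/htdig-NAME.conf to exist)
setup_domain() {
    name=$1
    run mkdir -p "/var/lib/htdig/$name" &&
    run mkdir -p "/usr/share/htdig-$name" &&
    run cp -r /usr/share/htdig/. "/usr/share/htdig-$name/" &&
    run chcon -Rt httpd_sys_content_t "/usr/share/htdig-$name/" &&
    run rundig -c "/etc/htdig/htdig-$name.conf" -a
}
```

Running it first with DRY_RUN=1 (for example DRY_RUN=1; setup_domain domain1) prints the commands without touching the system, which is a cheap way to review the paths before running it as root.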

Add HTML form to web page:

...
...

<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
Match: <select name="method">
<option value="and">All</option>
<option value="or">Any</option>
<option value="boolean">Boolean</option>
</select>
Format: <select name="format">
<option value="builtin-long">Long</option>
<option value="builtin-short">Short</option>
</select>
Sort by: <select name="sort">
<option value="score">Score</option>
<option value="time">Time</option>
<option value="title">Title</option>
<option value="revscore">Reverse Score</option>
<option value="revtime">Reverse Time</option>
<option value="revtitle">Reverse Title</option>
</select>
</font>
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/>
<br />
Search:
<input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</form>

...
...

Note: for multiple domains, reference the configuration for that domain:
<input type="hidden" name="config" value="htdig-domain1"/>

For a simple, single search box, hard-code the options from the previous form as hidden fields:

...
...

<form method="post" action="/cgi-bin/htsearch">
<input type="hidden" name="method" value="and"/>
<input type="hidden" name="format" value="builtin-long"/>
<input type="hidden" name="sort" value="score"/>
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/>
Search:
<input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</form>

...
...

Note: for multiple domains, reference the configuration for that domain:
<input type="hidden" name="config" value="htdig-domain1"/>

Default Apache web server configuration: /etc/httpd/conf.d/htdig.conf
```
Alias /htdig /usr/share/htdig
Alias /htdig-domain1 /usr/share/htdig-domain1
```
Restart the Apache web server to pick up the new configuration:
- Red Hat: /etc/init.d/httpd restart
- Ubuntu: /etc/init.d/apache2 restart
Test in browser: http://www.domain1.com/cgi-bin/htsearch?config=htdig-domain1&words=testword
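
The same test can be run from the command line, which is convenient when checking several domains. The sketch below only builds the test URL, percent-encoding the search words so that spaces and characters such as "&" survive in the query string. The helper names urlencode and htsearch_url are hypothetical, and the encoder handles ASCII input only.

```sh
#!/bin/sh
# Sketch: build an htsearch test URL with the search words percent-encoded.
# urlencode and htsearch_url are hypothetical helpers for illustration.

# Percent-encode every character outside A-Z a-z 0-9 . _ ~ - (ASCII only).
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}            # string without its first character
        c=${s%"$rest"}         # the first character itself
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Usage: htsearch_url HOST CONFIG WORDS
htsearch_url() {
    printf 'http://%s/cgi-bin/htsearch?config=%s&words=%s\n' \
        "$1" "$2" "$(urlencode "$3")"
}
```

With curl installed, curl -s "$(htsearch_url www.domain1.com htdig-domain1 testword)" fetches the results page for inspection.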

Customizing the ht://Dig results page:

The default page presentation is compiled into the CGI. To use your own header and footer files instead, set the header and footer directives (or the template directives) in the config file: /etc/htdig/htdig-domain1.conf

search_results_header: /usr/share/htdig-domain1/header.html
search_results_footer: /usr/share/htdig-domain1/footer.html
search_results_wrapper: /usr/share/htdig-domain1/wrapper.html
nothing_found_file: /usr/share/htdig-domain1/nomatch.html
syntax_error_file: /usr/share/htdig-domain1/syntax.html

Definitely specify nomatch.html, as a blank page is uninformative.
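
A mistyped path in these directives is easy to miss, so it can help to check that every file named in the config actually exists before reloading the search page. A minimal sketch follows, assuming simple "attribute: /path" lines with no spaces in the path. The helper name check_templates is hypothetical, and it does not follow include files or continuation lines.

```sh
#!/bin/sh
# Sketch: report template files named in an ht://Dig config that do not exist.
# check_templates is a hypothetical helper, not part of ht://Dig.
# Usage: check_templates CONFIG_FILE   (returns 1 if any file is missing)
check_templates() {
    conf=$1
    status=0
    for attr in search_results_header search_results_footer \
                search_results_wrapper nothing_found_file syntax_error_file; do
        # Take the last value given for this attribute, if any.
        path=$(awk -v a="$attr:" '$1 == a { print $2 }' "$conf" | tail -n 1)
        [ -n "$path" ] || continue      # attribute not set in this file
        if [ ! -f "$path" ]; then
            echo "missing: $attr -> $path"
            status=1
        fi
    done
    return $status
}
```

For example, check_templates /etc/htdig/htdig-domain1.conf prints one line per missing file and returns a non-zero status if any were found.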

Custom HTML Files:

File	Description
`COMMON_DIR/header.html`	The default search results header file.
`COMMON_DIR/footer.html`	The default search results footer file.
`COMMON_DIR/wrapper.html`	The default search results wrapper file, which contains the header and footer together in one file.
`COMMON_DIR/nomatch.html`	Page stating that "No matches" were found for the search terms.
`COMMON_DIR/syntax.html`	The default file that explains boolean expression syntax errors to the user.

Where COMMON_DIR is:

Red Hat: /usr/share/htdig/
Ubuntu: /etc/htdig/

ht://Dig notes:

Exclude a single content page from the search:
<meta name="robots" content="noindex, follow">
Place this tag in the "head" section of the page to be excluded.
List of words ignored by spider: /usr/share/htdig/bad_words
These are words like "the, and, for, with, that, this", etc.
Example cron job to re-index each week:
File: /etc/cron.weekly/htdig
```
#!/bin/sh
/usr/bin/rundig -c /etc/htdig/htdig-domainX.conf -a
```
Also see the YoLinux.com cron sysadmin tutorial
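
With several virtual domains, the weekly job can loop over every per-domain config instead of listing each one. This is a sketch, assuming the htdig-*.conf naming used in this tutorial. The function names list_domain_confs and reindex_all are hypothetical.

```sh
#!/bin/sh
# Sketch: re-index every per-domain config found in a directory.
# The htdig-*.conf naming follows this tutorial. list_domain_confs and
# reindex_all are hypothetical helper names.

# Print one config path per line; prints nothing if none are found.
list_domain_confs() {
    conf_dir=${1:-/etc/htdig}
    for f in "$conf_dir"/htdig-*.conf; do
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Run rundig with -a for each config; report failures but keep going.
reindex_all() {
    list_domain_confs "$1" | while read -r conf; do
        /usr/bin/rundig -c "$conf" -a || echo "rundig failed for $conf" >&2
    done
}
```

Adding the line reindex_all /etc/htdig at the end of /etc/cron.weekly/htdig would then cover all domains. Reporting a failure and continuing is a design choice here, so that one broken config does not block the others.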

Apache Web Server Configuration:

Search results pages produced by HtDig use graphics provided by HtDig. To enable web server access, add the following:

...

    Alias /htdig/ "/usr/share/htdig/"
    <Directory "/usr/share/htdig">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

...

Apache httpd 2.4 configuration snippet

ht://Dig Man Pages:

htdig: retrieve HTML documents for ht://Dig search engine
htsearch: CGI program that searches the ht://Dig database and returns the results page
htdump: write out an ASCII-text version of the document database
htdigconfig: script to create fuzzy databases for ht://Dig
htfuzzy: fuzzy command-line search utility for ht://Dig
htload: reads in an ASCII-text version of the document database
htmerge: create document index and word database from files that were created by htdig.
htnotify: sends email notifications about out-dated web pages discovered by htmerge
htpurge: remove unused documents from the database
htstat: returns statistics on the document and word databases
rundig: sample script to create a search database for ht://Dig

Links:

ht://Dig Home Page