ht://Dig is available with most Linux distributions and is intended for a single web site or domain.
Unlike WAIS or Pearlsearch, which index a single server, ht://Dig can span several web servers. It is not an internet search engine like Yahoo or Google.
ht://Dig is a free, GPL-licensed, open source indexing and search engine that one can install on a web server.
It first generates a database by "indexing" the web content.
ht://Dig provides a CGI program that searches the database and generates a web page of results pointing to the content on the website.
ht://Dig indexes HTML and text file content to build a keyword search database.
It can also email you when documents have "expired".
- Red Hat/CentOS: yum install htdig htdig-web
- Ubuntu: sudo apt-get install htdig
- Install from source:
- tar xzf htdig-3.2.0.tar.gz
- cd htdig-3.2.0
- ./configure --prefix=/opt
- make depend
- make
- make install
- Config file: /etc/htdig/htdig.conf
-
# Specify where the database files need to go. Needs lots of disk space.
database_dir: /var/lib/htdig
# This specifies the URL where the robot (htdig) will start.
# You can specify multiple URLs here (separate with whitespace).
start_url: http://www.yourdomain.com
...
...
|
If supporting multiple virtual domains, you may want to create one config file per domain: /etc/htdig/htdig-domain1.conf, /etc/htdig/htdig-domain2.conf, etc.
File: /etc/htdig/htdig-domain1.conf
database_dir: /var/lib/htdig/domain1
start_url: http://www.domain1.com
common_dir: /usr/share/htdig-domain1
exclude_urls: /cgi-bin/ .cgi images
...
...
|
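When there are more than a couple of virtual domains, the per-domain files can be generated rather than written by hand. The following is a minimal sketch, not part of ht://Dig: write_domain_conf is a hypothetical helper, it writes only the four attributes shown above (a real config usually carries more), and it assumes the /usr/share/htdig-NAME layout for common_dir.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ht://Dig): write a minimal
# per-domain config file and print its path.
# Usage: write_domain_conf NAME START_URL [CONF_DIR]
write_domain_conf() {
    name=$1
    start_url=$2
    conf_dir=${3:-/etc/htdig}          # default location used in this tutorial
    conf_file="$conf_dir/htdig-$name.conf"

    # Only the attributes discussed above; everything else keeps its default.
    cat > "$conf_file" <<EOF
database_dir: /var/lib/htdig/$name
start_url: $start_url
common_dir: /usr/share/htdig-$name
exclude_urls: /cgi-bin/ .cgi images
EOF
    echo "$conf_file"
}
```

Example: write_domain_conf domain1 http://www.domain1.com creates /etc/htdig/htdig-domain1.conf (run as root, since /etc/htdig is normally not writable by ordinary users).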
- HtDig search index database directory: /var/lib/htdig
If supporting multiple domains, create: mkdir /var/lib/htdig/domain1, etc
- Generate the database: rundig -c /etc/htdig/htdig-domain1.conf
To avoid downtime, add the "-a" command line option: rundig -c /etc/htdig/htdig-domain1.conf -a
With "-a" the new index is built in alternate work files, so users can keep searching the site while you are spidering the content.
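With several domain configs, the rundig call above can be repeated in a loop. This is a sketch under assumptions: reindex_all and the RUNDIG variable are hypothetical names introduced here, and the config files are assumed to follow the htdig-*.conf naming used in this tutorial.

```shell
#!/bin/sh
# RUNDIG may be overridden (for example RUNDIG=echo for a dry run).
RUNDIG=${RUNDIG:-rundig}

# Hypothetical helper: rebuild the index for every htdig-*.conf file
# in CONF_DIR, using "-a" so searches keep working during the run.
# Usage: reindex_all [CONF_DIR]
reindex_all() {
    conf_dir=${1:-/etc/htdig}
    for conf in "$conf_dir"/htdig-*.conf; do
        [ -f "$conf" ] || continue     # the glob matched nothing
        "$RUNDIG" -c "$conf" -a || echo "rundig failed for $conf" >&2
    done
}
```

A failure for one domain is reported on stderr and the loop moves on to the next config, so one broken site does not block the others.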
- Header and Footer pages: (used to display htDig search results)
- Red Hat:
-
/usr/share/htdig/footer.html
/usr/share/htdig/header.html
/usr/share/htdig/nomatch.html
etc
Virtual/multiple domains: /usr/share/htdig-domain1/, /usr/share/htdig-domain2/, etc
- mkdir /usr/share/htdig-domain1
- mkdir /usr/share/htdig-domain2
- cp /usr/share/htdig/* /usr/share/htdig-domain1/
- cp /usr/share/htdig/* /usr/share/htdig-domain2/
- chcon -Rt httpd_sys_content_t /usr/share/htdig-domain1/
- Ubuntu:
-
/etc/htdig/nomatch.html
/etc/htdig/footer.html
/etc/htdig/header.html
etc
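The mkdir/cp/chcon steps for per-domain result templates can be wrapped in one function. This is only a sketch: make_common_dir is a hypothetical helper, the default paths are the Red Hat ones listed above, and the chcon step is skipped quietly on systems without SELinux.

```shell
#!/bin/sh
# Hypothetical helper: clone the stock result templates for one domain
# and print the new directory.
# Usage: make_common_dir NAME [SRC_DIR] [DEST_DIR]
make_common_dir() {
    name=$1
    src=${2:-/usr/share/htdig}
    dest=${3:-/usr/share/htdig-$name}

    mkdir -p "$dest" || return 1
    cp "$src"/* "$dest"/ || return 1

    # On SELinux systems (Red Hat/CentOS) label the copy so Apache can
    # read it; errors are ignored where SELinux is not in use.
    if command -v chcon >/dev/null 2>&1; then
        chcon -Rt httpd_sys_content_t "$dest" 2>/dev/null || true
    fi
    echo "$dest"
}
```

Example: make_common_dir domain1 copies /usr/share/htdig/* to /usr/share/htdig-domain1/. On Ubuntu, pass /etc/htdig as the second argument.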
- Add HTML form to web page:
-
...
...
<form method="post" action="/cgi-bin/htsearch">
<font size="-1">
Match: <select name="method">
<option value="and">All</option>
<option value="or">Any</option>
<option value="boolean">Boolean</option>
</select>
Format: <select name="format">
<option value="builtin-long">Long</option>
<option value="builtin-short">Short</option>
</select>
Sort by: <select name="sort">
<option value="score">Score</option>
<option value="time">Time</option>
<option value="title">Title</option>
<option value="revscore">Reverse Score</option>
<option value="revtime">Reverse Time</option>
<option value="revtitle">Reverse Title</option>
</select>
</font>
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/>
<br />
Search:
<input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</form>
...
...
|
Note: for multiple domains, reference the configuration for that domain:
<input type="hidden" name="config" value="htdig-domain1"/>
For a simple single search box, hard-code the previous options as hidden fields:
-
...
...
<form method="post" action="/cgi-bin/htsearch">
<input type="hidden" name="method" value="and"/>
<input type="hidden" name="format" value="builtin-long"/>
<input type="hidden" name="sort" value="score"/>
<input type="hidden" name="config" value="htdig"/>
<input type="hidden" name="restrict" value=""/>
<input type="hidden" name="exclude" value=""/>
Search:
<input type="text" size="30" name="words" value=""/>
<input type="submit" value="Search"/>
</form>
...
...
|
Note: for multiple domains, reference the configuration for that domain:
<input type="hidden" name="config" value="htdig-domain1"/>
- Default Apache web server configuration: /etc/httpd/conf.d/htdig.conf
-
Alias /htdig /usr/share/htdig
Alias /htdig-domain1 /usr/share/htdig-domain1
|
Restart the Apache web server to pick up the new configuration:
- Red Hat: /etc/init.d/httpd restart
- Ubuntu: /etc/init.d/apache2 restart
- Test in browser: http://www.domain1.com/cgi-bin/htsearch?config=htdig-domain1&words=testword
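The same test can be run from the command line. The sketch below only builds the URL; search_url is a hypothetical helper, and it assumes a single plain search word because it does no URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: print the htsearch GET URL for a host, a config
# name and one plain search word (no URL encoding is performed).
# Usage: search_url HOST CONFIG WORD
search_url() {
    host=$1
    config=$2
    words=$3
    printf 'http://%s/cgi-bin/htsearch?config=%s&words=%s\n' \
        "$host" "$config" "$words"
}
```

Example: curl -s "$(search_url www.domain1.com htdig-domain1 testword)" fetches the results page. For queries containing spaces or punctuation, curl's -G and --data-urlencode options can do the encoding instead.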
Customizing the ht://Dig results page:
The default page presentation is compiled into the CGI.
To use the header and footer files instead, enable the header and footer directives (or the template directives) in the config file: /etc/htdig/htdig-domain1.conf
-
Definitely specify nomatch.html, as a blank page is uninformative.
Custom HTML Files:
-
| File | Description |
| COMMON_DIR/header.html | The default search results header file. |
| COMMON_DIR/footer.html | The default search results footer file. |
| COMMON_DIR/wrapper.html | The default search results wrapper file, that contains the header and footer together in one file. |
| COMMON_DIR/nomatch.html | Page stating that "No matches" were found for the search terms. |
| COMMON_DIR/syntax.html | The default file that explains boolean expression syntax errors to the user. |
Where COMMON_DIR is:
- Red Hat: /usr/share/htdig/
- Ubuntu: /etc/htdig/
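As an illustration of pointing a config at the files in the table, the sketch below appends four result-page attributes to a config file. The attribute names are given as recalled from the ht://Dig 3.x configuration attribute reference and should be checked against the installed version; enable_custom_pages is a hypothetical helper, and ${common_dir} is left literal so that ht://Dig expands it from the config's own common_dir setting.

```shell
#!/bin/sh
# Hypothetical helper: append result-page attributes to an ht://Dig
# config file. Attribute names are assumptions to verify against the
# attribute reference for your ht://Dig version.
# Usage: enable_custom_pages CONF_FILE
enable_custom_pages() {
    conf_file=$1
    # Quoted 'EOF' keeps ${common_dir} literal (no shell expansion).
    cat >> "$conf_file" <<'EOF'
search_results_header: ${common_dir}/header.html
search_results_footer: ${common_dir}/footer.html
nothing_found_file: ${common_dir}/nomatch.html
syntax_error_file: ${common_dir}/syntax.html
EOF
}
```

Example: enable_custom_pages /etc/htdig/htdig-domain1.conf. The function appends every time it is called, so run it once per config file.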
- htdig: retrieve HTML documents for ht://Dig search engine
- htsearch: CGI program that searches the database and generates the search results page
- htdump: write out an ASCII-text version of the document database
- htdigconfig: script to create fuzzy databases for ht://Dig
- htfuzzy: creates the indexes used by the ht://Dig fuzzy search algorithms
- htload: reads in an ASCII-text version of the document database
- htmerge: create document index and word database from files that were created by htdig.
- htnotify: sends email notifications about out-dated web pages discovered by htmerge
- htpurge: remove unused documents from the database
- htstat: returns statistics on the document and word databases
- rundig: sample script to create a search database for ht://Dig