[CCP14 Home: (Frames | No Frames)]
CCP14 Mirrors: [UK] | [CA] | [US] | [AU]

(This Webpage Page in No Frames Mode)

Collaborative Computational Project Number 14

for Single Crystal and Powder Diffraction

CCP14

Installing and Configuring the SWISH-E "Free Text" Web Search Engine Software

The CCP14 Homepage is at http://www.ccp14.ac.uk

[Back to CCP14 Web/Config Main Page] | [Installing Isearch/Iindex (previous search engine)]

[Why use SWISH-E | Where to get SWISH-E | Compiling SWISH-E]
[Deciding What to Index | Config Files | Daily Auto Database-Creation]
[Creating Default HTML Forms Files | SWISH-E Documentation Files]


Why Use SWISH-E

  • Works on a large variety of systems; MS-Windows executables are also available.
  • Is a free, open source text search engine.
  • Can index via file system indexing and web spidering.
  • Can handle a relatively large site: hundreds of megabytes of text.
  • Can be made to recognise a number of data formats (HTML and text by default).
  • Has flexible configuration.
  • If you have a C compiler such as gcc installed, is relatively easy to compile, install and set up (no big problems encountered).
  • Provided you update the SWISH-E database after you update your web files/site, seems to be quite accurate and good at finding relevant hits.
  • Has "Relevancy Scoring" to help get relevant hits high up on the score list.


Where to get SWISH-E 2

Because the -e option was needed (it makes the indexer use temporary files rather than holding everything in memory while generating the index), the latest development beta version (2.1.8) of SWISH-E 2 was used. Without "-e", indexing a large site with the latest SWISH-E may overwhelm your server's memory and slow it to a crawl. The sample Perl module built into SWISH-E 2.1.X was not used for the CGI; the older, more established Perl scripts were used instead, as it seemed better at the time to avoid the "bleeding-edge" beta features. Upgrading to the module at a later date would be a good idea for enhanced CGI script security.


Compiling and Installing SWISH-E

This assumes you are using a UNIX machine and have a C compiler such as GCC installed. Refer to Compiling the GNU GCC C compiler for information on how to do this. On a decent workstation, gcc should be there by default. Compiling should therefore not be a problem: the ./configure script should detect what it needs, making the build relatively trivial.
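
Before running ./configure, it can help to confirm that a compiler and make are actually on the PATH. The following is a minimal sketch in POSIX sh; the function names have_tool and check_build_tools are illustrative and are not part of SWISH-E.

```shell
#!/bin/sh
# have_tool NAME: succeed if NAME is an executable found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# check_build_tools: report which of the usual build tools are missing.
# Succeeds only when make and at least one C compiler (gcc or cc) are found.
check_build_tools() {
    status=0
    if ! have_tool gcc && ! have_tool cc; then
        echo "no C compiler (gcc or cc) found on PATH" >&2
        status=1
    fi
    if ! have_tool make; then
        echo "make not found on PATH" >&2
        status=1
    fi
    return $status
}
```

If check_build_tools succeeds, the usual next steps from the unpacked source directory are ./configure followed by make. The exact directory and install step may differ between SWISH-E releases, so check the installation notes shipped with the source.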


Deciding What to Index

In the case of the CCP14, all html and htm files are indexed for each relevant virtual domain (Crystallographic www, alife, programming, etc). The domains are kept separate, partly to allow for possible regional mirroring and partly to limit irrelevant hits caused by mirroring different subject areas.
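
To get a feel for how much would be indexed in one domain, the candidate files can be listed first. This is a small sketch in POSIX sh; list_indexable is a hypothetical helper and the directory passed to it is whatever holds that domain's files on your system.

```shell
#!/bin/sh
# list_indexable ROOT: print every .html or .htm file below ROOT, one per line.
list_indexable() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) -print
}
```

For example, list_indexable /path/to/one/domain | wc -l gives a quick count for a single virtual domain before any index is built.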


Config Files

For SWISH-E, the important thing is to set up config files that are optimised for the locally held information. Sometimes the best settings can only be found by experimenting with the options and seeing what effect they have.

Issues found that are relevant to the CCP14 (there may be better ways to handle these):

  • Stemming interferes with searches for programs such as crystals, where the trailing "s" is important.
  • Some parts of mirrored webpages that use frames can cause problems.
  • Povray and Ghostscript mirrored areas are best excluded as they perturb the crystallographic searching.
  • Any webstats pages grabbed in the mirroring are best excluded.
  • Skipping words that contain numbers speeds up the indexing of the mirrored pages.
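
Some of the issues above might translate into a config file along the following lines. This is only a sketch, wrapped in a POSIX sh function so it can be written out and edited: the paths and excluded names are placeholders rather than the CCP14 settings, and the directive names should be checked against the documentation for the SWISH-E version in use.

```shell
#!/bin/sh
# write_example_config OUT: write a hypothetical SWISH-E config file to OUT.
# Paths and excluded directory names are placeholders, not the CCP14 settings.
write_example_config() {
    cat > "$1" <<'EOF'
# Directory tree to index and the index file to write
IndexDir  /path/to/web_live
IndexFile /path/to/search-databases/swish.index.example.new

# Only index HTML files
IndexOnly .html .htm

# Stemming off, so that a trailing "s" in a program name is kept
UseStemming no

# Skip mirrored areas that swamp the crystallographic hits
FileRules pathname contains povray ghostscript webstats
EOF
}
```

Calling write_example_config ./config.example produces a starting point to edit. Excluding words that contain numbers is not shown here; the SWISH-E documentation describes the word-character settings for each version.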

SWISH-E Config Files (more options are available with SWISH-E - check the documentation)

Example live search files:


Daily Auto Indexing-Creation of the SWISH-E Search Database

As automirroring of webpages runs between 1am and 5am each morning using WGET and rsync, the SWISH-E database has to be regenerated after the auto-mirroring session so that it reflects the changes.

In the .crontab file (which can then be loaded into cron using the command crontab .crontab), add an entry for the script that is to be run after the automirroring. In this case, the script will run each morning at 5:07 am.

07 05 * * * ./swish.searchsindex.script
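
The order of the crontab time fields is easy to get backwards: the first field is the minute and the second is the hour. As a minimal sketch in POSIX sh, a hypothetical helper cron_entry takes the hour and minute in the order people usually say them and prints the line in the order cron expects.

```shell
#!/bin/sh
# cron_entry HOUR MINUTE COMMAND: print a daily crontab line.
# crontab fields are: minute hour day-of-month month day-of-week command.
# Pass HOUR and MINUTE without leading zeros (printf reads a leading 0 as octal).
cron_entry() {
    printf '%02d %02d * * * %s\n' "$2" "$1" "$3"
}

cron_entry 5 7 ./swish.searchsindex.script
# prints: 07 05 * * * ./swish.searchsindex.script
```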

This calls a script file that regenerates each index file with swish-e, writing it under a new name, and then moves the new file over the old one so as to limit the time the index is unavailable to a fraction of a second. The last lines send an email to [email protected] confirming the script has run and the time completed.

#!/bin/csh

# You should CHANGE THE NEXT 3 LINES to suit your local setup
setenv  LOGDIR   ./web_area/mirrorbin/logs    # directory for storing logs
setenv  PROGDIR  ./web_area/mirrorbin         # location of executable
setenv  PUTDIR   ./web_area/web_live/ccp      # relative directory for mirroring
# PUTDIR is relative due to a possible kludge in wget.
# You can change it to absolute if you wish - some internal links may not work.

set DATE=(`date`)
sed "/START_Iindex/s/NOT_FINISHED/Regeneration_STARTED $DATE/" ./report-template.txt  > ./report.txt.new
mv report.txt.new report.txt

# Some strange things have been happening that have been accumulating Iindex jobs - thus make
# sure they are dead.  Lachlan 3rd June 1999
/etc/killall -9 swish-e

wait

#  CCP14 ONLY SEARCH

/usr/local/bin/swish-e -e -c /web_disc/ccp14/search-databases/config.ccp14only > \
   /web_disc/ccp14/search-databases/config.ccp14only.log

mv /web_disc/ccp14/search-databases/swish.index.ccp14only.new \
   /web_disc/ccp14/search-databases/swish.index.ccp14only

#wait

#  In sh, appending 2>&1 would put standard error in the file as well (csh uses >& for this).

set DATE=(`date`)
sed "/CCP14only_Iindex/s/NOT_FINISHED/Regeneration_COMPLETED $DATE/" report.txt  > report.txt.new
mv   report.txt.new   report.txt

#  CCP14 ALL CRYSTALLOGRAPHIC PAGES

/usr/local/bin/swish-e -e -c /web_disc/ccp14/search-databases/config.ccp14all > \
   /web_disc/ccp14/search-databases/config.ccp14all.log

mv /web_disc/ccp14/search-databases/swish.index.ccp14all.new \
   /web_disc/ccp14/search-databases/swish.index.ccp14all

set DATE=(`date`)
sed "/CCP14ALL_index/s/NOT_FINISHED/Regeneration_COMPLETED $DATE/" report.txt  > report.txt.new
mv   report.txt.new   report.txt

#  ALIFE   SEARCH

/usr/local/bin/swish-e -e -c /web_disc/ccp14/search-databases/config.alife > \
   /web_disc/ccp14/search-databases/config.alife.log

mv /web_disc/ccp14/search-databases/swish.index.alife.new \
   /web_disc/ccp14/search-databases/swish.index.alife


set DATE=(`date`)
sed "/ALIFE__Iindex/s/NOT_FINISHED/Regeneration_COMPLETED $DATE/" report.txt  > report.txt.new
mv   report.txt.new   report.txt 

#  PROGRAMMING SEARCH

/usr/local/bin/swish-e -e -c /web_disc/ccp14/search-databases/config.programming > \
   /web_disc/ccp14/search-databases/config.programming.log

mv /web_disc/ccp14/search-databases/swish.index.programming.new \
   /web_disc/ccp14/search-databases/swish.index.programming


set DATE=(`date`)
sed "/PROGRAMMING_Iindex/s/NOT_FINISHED/Regeneration_COMPLETED $DATE/" report.txt  > report.txt.new
mv   report.txt.new   report.txt 


# set DATE=(`date`)
# sed "/NETLIB_Iindex/s/NOT_FINISHED/Regeneration_COMPLETED $DATE/" report.txt  > report.txt.new
# mv   report.txt.new   report.txt

# Some strange things have been happening that have been accumulating Iindex jobs - thus make
# sure they are dead.  Lachlan 3rd June 1999
/etc/killall -9 swish-e
wait



/usr/sbin/Mail -s "Isite_Isearch_Creation_Results `date`" [email protected] < ./report.txt
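
Two steps of the script above, stamping the report and swapping the new index into place, can be sketched as small functions. This is written in POSIX sh rather than csh, and it adds a check that the original does not make: the old index is only replaced when the new one exists and is not empty. The function names and the layout of the report file (one line per index, carrying a tag and the word NOT_FINISHED) are assumptions for illustration.

```shell
#!/bin/sh
# stamp_report TAG FILE: on the line of FILE containing TAG, replace
# NOT_FINISHED with a completion note and the current date.
stamp_report() {
    now=$(date)
    sed "/$1/s|NOT_FINISHED|Regeneration_COMPLETED $now|" "$2" > "$2.new" &&
        mv "$2.new" "$2"
}

# swap_if_ok NEW LIVE: move NEW over LIVE, but only when NEW exists and
# is non-empty, so a failed indexing run leaves the old index in place.
swap_if_ok() {
    if [ -s "$1" ]; then
        mv -f "$1" "$2"
    else
        echo "skipping swap: $1 is missing or empty" >&2
        return 1
    fi
}
```

A run for one index might then look like swish-e -e -c config.example > config.example.log && swap_if_ok swish.index.example.new swish.index.example, followed by a stamp_report call for the matching tag. The file names here are placeholders.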

Creating Default HTML Forms Files

Copy and Edit seems to be the name of the game here as per the following:


SWISH-E Documentation

SWISH-E



If you have any queries or comments, please feel free to contact the CCP14