RiSearch v.1.0 Manual

© S. Tarasov

Indexing

      RiSearch is an index-based search script. This means that before you can search, it reads all your files and stores information about them in a special format that makes searching faster.

      To start indexing, run the script "index.pl". You can run it from a Unix shell (if your provider allows it), from the admin panel, or directly in a browser window (the script will ask for a password, which can be created in the admin panel). During indexing the script creates several files with information about your site (0_hash, 0_wordind and others) and stores them in the "db_N" directory, where "N" is a number.

      Another way to index your site is over HTTP. Run "spider.pl" and it will crawl through your pages and parse out all the links (spider.pl requires the LWP module). This is useful for indexing dynamic sites (such as web boards).

      When the script requests a page from the server, it identifies itself as "RiSpider/1.0". You can change the user-agent name in the file "lib/common_lib.pm", in the line:

$ua->agent("RiSpider/1.0");

      You may pass several parameters to the scripts. For example:

 perl index.pl -base_dir=../ -base_url=http://www.server.com/ -rules=filename 

If no parameters are passed, the script uses the parameters from the configuration file.

  1.  -base_dir=path/to/dir  - path to the directory where your HTML files are located. Note that in all cases you should use either a relative path or an absolute path starting from the file system root (not from the web server's root directory).

  2.  -base_url=http://www.server.com/  - URL of your site.

  3.  -rules=filter_filename  - file with filter rules (if no file is specified, the default rules are used).

  4.  -login=login  - login for access to closed sections of your site (used only with spider.pl).

  5.  -password=password  - password for access to closed sections of your site (used only with spider.pl).
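
The parameters listed above can be combined in a small wrapper. The following is only a minimal sketch, assuming a POSIX shell; the function name build_index_cmd and the variables BASE_DIR, BASE_URL and RULES are hypothetical and not part of RiSearch. It prints an index.pl command line and adds only the options that are set, so anything left empty falls back to the configuration file.

```shell
#!/bin/sh
# Hypothetical helper (not part of RiSearch): print an index.pl command line.
# Only options whose variables are non-empty are added; everything else is
# left to the configuration file.
build_index_cmd() {
    cmd="perl index.pl"
    if [ -n "${BASE_DIR:-}" ]; then cmd="$cmd -base_dir=$BASE_DIR"; fi
    if [ -n "${BASE_URL:-}" ]; then cmd="$cmd -base_url=$BASE_URL"; fi
    if [ -n "${RULES:-}" ]; then cmd="$cmd -rules=$RULES"; fi
    printf '%s\n' "$cmd"
}

# Example values taken from the command line shown in this manual;
# RULES is left empty, so no -rules option is printed.
BASE_DIR=../
BASE_URL=http://www.server.com/
RULES=
build_index_cmd
```

The command is only printed, not executed, so it can be inspected first. Values containing spaces would need extra quoting before the line is run.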

      Indexing requires a lot of system resources. It is probably better to index a local copy of your site and then copy the created database files to the server (please use "BIN" mode). The amount of RAM required for indexing depends on the "temp_db_size" variable in the configuration file and on the size of the documents you want to index. The new version of the script has much smaller memory requirements, but it may still require 100-200 MB of memory during indexing if your documents are bigger than 1 MB.
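
One way to check that database files were not altered in transit is to compare checksums of the original and of a copy fetched back from the server. This is a minimal sketch, assuming a POSIX shell with the standard cksum utility and mktemp; the function name same_file and the demo file names are hypothetical and not part of RiSearch. A transfer in text mode may rewrite line-ending bytes, which changes the checksum; a binary transfer leaves the bytes as they are.

```shell
#!/bin/sh
# Hypothetical helper (not part of RiSearch): succeed if two files have the
# same CRC and size as reported by the POSIX cksum utility.
same_file() {
    a=$(cksum < "$1")
    b=$(cksum < "$2")
    [ "$a" = "$b" ]
}

# Demo with throwaway files: the third file has its line ending rewritten
# the way a text-mode transfer might do it.
tmp=$(mktemp -d)
printf 'word\n' > "$tmp/original"
printf 'word\n' > "$tmp/binary_copy"
printf 'word\r\n' > "$tmp/text_copy"

if same_file "$tmp/original" "$tmp/binary_copy"; then
    echo "binary copy: identical"
fi
if ! same_file "$tmp/original" "$tmp/text_copy"; then
    echo "text copy: altered"
fi
rm -rf "$tmp"
```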

      Please note that most web servers will not allow a script to run for too long. After 30-60 seconds the web server will kill your script if it has not finished indexing by then. Therefore, you will not be able to index more than several megabytes when running "index.pl" as a CGI script. To index large sites, you have to run the script from a Unix shell or index a local copy of your site.
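
When indexing from a Unix shell, a long run can also be started in the background so that it does not depend on the login session staying open. This is a minimal sketch, assuming a POSIX shell with the standard nohup utility; the function name run_detached and the log file name are hypothetical and not part of RiSearch.

```shell
#!/bin/sh
# Hypothetical helper (not part of RiSearch): start a long-running command
# with nohup so it is not stopped when the shell session ends, send its
# output (stdout and stderr) to a log file, and print the process ID.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}

# Intended use, with the parameters from this manual's example:
#   run_detached index.log perl index.pl -base_dir=../ -base_url=http://www.server.com/
```

The printed process ID can be used to check whether indexing is still running, and the log file keeps whatever the script writes while it works.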



http://risearch.org S.Tarasov, © 2000-2003