
RiSearch v.1.0 Manual

© S. Tarasov

Configuration

      Edit the file riconfig.pm to set the parameters. Most of them are self-documenting and do not require explanation.

  1.  base_dir => "../../",  - path to the directory where your HTML files are located. If index.pl is located in the same directory, leave this variable as is. Note that in all cases you should use either a relative path or an absolute path starting from the file system root (not from the web server's root directory).

  2.  base_url => "http://www.server.com/",  - URL of your site.

  3.  site_size => 2,  - this variable controls database size and searching speed.

  4.  compact_index => 1,  - in compact mode the index takes less space, but it is limited to 65535 documents.

  5.  indexing_speed => 1,  - This parameter defines indexing speed and memory usage: 0 - slow indexing, but less memory required; 1 - fast indexing, more memory required.

  6.  non_parse_ext => 'txt',  - list of extensions for which the script should not remove HTML tags.

  7.  bin_ext => 'ppt xls',  - the content of files with these extensions will not be indexed, but their URLs will be.

  8.  numbers => '0-9',  - during indexing the script removes all non-alphabetic characters from the page and indexes what is left. The script treats Latin characters and characters of the regional alphabet (discussed later) as alphabetic. Here you may add other characters that should be indexed (such as digits, the underscore sign and so on).

  9.  use_selective_indexing => "NO",  - this option is useful for big sites with complex navigation, news postings and other elements that appear on every page and probably should not be indexed. It lets you tell the script which parts of a page should be cut before indexing. Turn this option on ("YES") and uncomment the next lines in file "config.pl".

     no_index_strings => {
      q[<!-- No index start 1 -->] => q[<!-- No index end 1 -->],
      q[<!-- No index start 2 -->] => q[<!-- No index end 2 -->],
     },

    Inside the square brackets you need to write two strings: a start marker and an end marker. Everything placed between them will be cut (note that if these strings occur several times in a file, each occurrence will be processed). For this purpose you may use special marks that separate different elements of the design.

  10.  cut_default_filenames => 'YES',  - this variable lets you cut default filenames (such as index.html) from URLs in search results.

  11.  INDEXING_SCHEME => 2,  - word indexing scheme. If the indexing scheme is "1", the index is built on whole words. This is the fastest method, but the script will find only words equal to the keyword.

    When the indexing scheme is "2", the index is based on the beginning of each word. The script will find all words that begin with the given keyword. For example, for the query "port*" the words "portrait" and "portion" will also be found.

  12.  use_stop_words => "YES",  - use the list of common words (stop words) that should not be indexed.

  13.  verbose_output => 1,  - during indexing the script prints information about every indexed file. Change the value to "0" to print information only about every 100th file.

  14.  min_length => 3,  - minimal word length for indexing.

  15.  max_length => 32,  - maximal word length for indexing (longer words will be truncated).

  16.  max_doc_size => 1000000,  - maximal document size (bigger files will be truncated).

  17.  res_num => 10,  - number of results per page.

  18.  max_res_found => 0,  - maximal number of found documents (0 - no limit).

  19.  del_descr_chars => "",  - characters listed here will be removed from the document description.

  20.  url_length_limit => 0,  - URL length limit in results output (0 - no limit).

  21.  CAP_LETTERS => '\xC0-\xDF\xA8',  - put here the list of capital letters of your language (those that differ from Latin). Do the same for small letters.

  22.  def_search_type => 1,  - Default search type. Possible values: 0 - substring search (can be used only with INDEXING_SCHEME => 2), 1 - exact word search.

  23.  def_search_mode => "AND",  - Default search mode. Possible values: "AND" or "OR".
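
      The interplay of no_index_strings, numbers, min_length, max_length and INDEXING_SCHEME can be illustrated with a small sketch. It is written in Python purely for illustration and is not RiSearch's own Perl code: the function names, the sample page and the lowercasing step are assumptions, and HTML tag removal and stop words are left out.

```python
import re

# Marker pairs in the spirit of no_index_strings (start marker -> end marker).
NO_INDEX_STRINGS = {
    "<!-- No index start 1 -->": "<!-- No index end 1 -->",
    "<!-- No index start 2 -->": "<!-- No index end 2 -->",
}

def cut_no_index(text, no_index_strings):
    """Cut every region enclosed by a marker pair; all occurrences are processed."""
    for start, end in no_index_strings.items():
        pattern = re.escape(start) + r".*?" + re.escape(end)
        text = re.sub(pattern, " ", text, flags=re.DOTALL)
    return text

def extract_words(text, extra_chars="0-9", min_length=3, max_length=32):
    """Keep Latin letters plus extra_chars, drop short words, truncate long ones."""
    words = re.findall("[A-Za-z" + extra_chars + "]+", text)
    # Lowercasing is an assumption of this sketch, not a documented behaviour.
    return [w.lower()[:max_length] for w in words if len(w) >= min_length]

def find_matches(words, keyword, indexing_scheme=2):
    """Scheme 1: whole-word match only. Scheme 2: match on the beginning of a word."""
    keyword = keyword.lower()
    if indexing_scheme == 1:
        return [w for w in words if w == keyword]
    return [w for w in words if w.startswith(keyword)]

# Hypothetical page: the navigation block is wrapped in no-index markers.
PAGE = ("Latest news <!-- No index start 1 -->menu portal links"
        "<!-- No index end 1 --> A portrait of the port in 2003")

words = extract_words(cut_no_index(PAGE, NO_INDEX_STRINGS))
print(words)                           # ['latest', 'news', 'portrait', 'the', 'port', '2003']
print(find_matches(words, "port", 1))  # ['port']
print(find_matches(words, "port", 2))  # ['portrait', 'port']
```

      In this sketch the word "portal" never reaches the word list because it sits between a pair of markers, and the words "A", "of" and "in" are dropped because they are shorter than min_length.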

Spidering

      The spidering script uses all parameters described above (except base_dir and base_url). The additional spider-specific variables are listed below.

  •  start_url  - List of starting URLs.

  •  spider_delay => 0, - delay in seconds between requests.

  •  max_depth => 20, - maximal spidering depth (number of "clicks" from start page to current page).

  •  login => "",  - login for access to closed sections of your site (used only with spider.pl).

  •  password => "",  - password for access to closed sections of your site (used only with spider.pl).

  •  proxy => "http://user:password@server.com:port/",  - proxy settings for spider.

  •  use_robots_txt_rules => 0,  - whether or not to follow ROBOTS.TXT rules during indexing.

  •  e_mail => 'foo@bar.com',  - webmaster's e-mail.
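
      The meaning of max_depth and spider_delay can be sketched with a breadth-first walk over a link graph. This is a Python illustration only, not the code of spider.pl: the crawl function, the in-memory SITE dictionary that stands in for real HTTP requests, and the choice to still visit pages exactly at max_depth are all assumptions.

```python
import time
from collections import deque

def crawl(start_urls, get_links, max_depth=20, spider_delay=0):
    """Breadth-first walk; returns {url: depth in clicks from a start page}."""
    depth_of = {url: 0 for url in start_urls}
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if spider_delay:
            time.sleep(spider_delay)  # pause between "requests"
        if depth_of[url] >= max_depth:
            continue  # page is visited, but its links are not followed
        for link in get_links(url):
            if link not in depth_of:  # skip pages that were already seen
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of

# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": ["/"],
}

visited = crawl(["/"], lambda url: SITE.get(url, []), max_depth=2)
print(visited)  # {'/': 0, '/a': 1, '/b': 1, '/a/1': 2}
```

      With max_depth set to 2 the page "/a/1/x" is never reached, because it is three clicks away from the start page.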

URL filter rules

      Filter rules define which URLs should be indexed by the spider. A rule consists of commands (Index, NoIndex, Follow, NoFollow, Allow, Disallow), optional modifiers (Match, NoMatch, NoCase, Case, String, Regex) and a string (or regular expression) that will be matched against the URL. Two actions are possible for each URL: indexing and link extraction. Index Follow means that the URL will be indexed and all links in the file will be extracted for further indexing. NoIndex NoFollow means that both actions are forbidden for this type of URL. Use Follow NoIndex to allow link extraction without indexing the file. The command Allow is a synonym for Index Follow, and the command Disallow is a synonym for NoIndex NoFollow.

 Allow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ] 

      Use this to allow URLs that match (or, with NoMatch, do not match) the given argument. The first three optional parameters describe the type of comparison. The default values are Match, NoCase, String. Use "NoCase" or "Case" to choose case-insensitive or case-sensitive comparison. Use "Regex" to choose regular expression comparison. Use "String" to choose string comparison with wildcards. One wildcard can be used: "*", which stands for any number of any characters.

 Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] [ ... ] 

      Use this to disallow URLs that match (or, with NoMatch, do not match) the given argument. The first three optional parameters have exactly the same meaning as for the "Allow" command.

      The indexer compares each URL against all these command arguments in the order of their appearance in the config file. The last command whose rule matches the URL takes effect.

      Some examples are presented below:

 Disallow * 
 Allow http://risearch.org/* 
 Disallow */cgi-bin/* */img/* */temp/* 
 Disallow NoMatch *.htm *.html *.txt */ 
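
      How the example rules above combine can be traced with a short sketch. It is a Python illustration, not the spider's actual implementation, and it rests on several assumptions: only Allow and Disallow are handled, a String pattern must match the whole URL, a rule with several arguments matches when any of them matches, and a URL that no rule touches is allowed by default.

```python
import re

COMMANDS = {"Allow": True, "Disallow": False}
MODIFIERS = {"Match", "NoMatch", "NoCase", "Case", "String", "Regex"}

def parse_rule(line):
    """Return (allowed, negate, use_regex, compiled patterns) for one rule line."""
    tokens = line.split()
    allowed = COMMANDS[tokens[0]]
    # Defaults from the manual: Match, NoCase, String.
    negate, case_sensitive, use_regex = False, False, False
    args = tokens[1:]
    while args and args[0] in MODIFIERS:
        modifier = args.pop(0)
        if modifier in ("Match", "NoMatch"):
            negate = modifier == "NoMatch"
        elif modifier in ("Case", "NoCase"):
            case_sensitive = modifier == "Case"
        else:
            use_regex = modifier == "Regex"
    flags = 0 if case_sensitive else re.IGNORECASE
    patterns = []
    for arg in args:
        # In String mode "*" stands for any number of any characters.
        source = arg if use_regex else ".*".join(re.escape(p) for p in arg.split("*"))
        patterns.append(re.compile(source, flags))
    return allowed, negate, use_regex, patterns

def is_allowed(url, rule_lines, default=True):
    """Apply rules in order; the last rule that applies to the URL wins."""
    verdict = default
    for line in rule_lines:
        allowed, negate, use_regex, patterns = parse_rule(line)
        if use_regex:
            hit = any(p.search(url) for p in patterns)
        else:
            hit = any(p.fullmatch(url) for p in patterns)
        if hit != negate:  # Match: rule applies on a hit; NoMatch: on a miss
            verdict = allowed
    return verdict

RULES = [
    "Disallow *",
    "Allow http://risearch.org/*",
    "Disallow */cgi-bin/* */img/* */temp/*",
    "Disallow NoMatch *.htm *.html *.txt */",
]

print(is_allowed("http://risearch.org/index.html", RULES))         # True
print(is_allowed("http://risearch.org/cgi-bin/search.pl", RULES))  # False
print(is_allowed("http://risearch.org/logo.gif", RULES))           # False
```

      Under these assumptions the four rules restrict indexing to pages on risearch.org that end in .htm, .html, .txt or a slash and that do not live under cgi-bin, img or temp.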



http://risearch.org S.Tarasov, © 2000-2003