.. mg 1 \*(Dt CITRI

mg 1 \*(Dt CITRI

NAME
DESCRIPTION
AVAILABILITY
TYPICAL USE OF mg
EXTENDING mg_get
SEE ALSO

NAME

mg - full-text inverted index support

DESCRIPTION

mg is a suite of programs that can be used to create and query full-text inverted indexes for a collection of documents.

An inverted index is essentially a list of pointers to occurrences of every word in a collection of documents, so that, for example, in a collection of Shakespeare's works, one can pose questions like How often is a fool mentioned in the plays? and Are Romeo and Juliet ever mentioned in the same sentence? While search utilities like the UNIX grep(1) family can be used for simple queries, they suffer from several limitations:

Each invocation requires a separate pass over the documents, which becomes impossibly slow if the document collection is very large.
It is difficult to compose Boolean queries like Caesar AND Brutus AND NOT Antony.
They provide no mechanism for ranking matches according to importance, which is extremely important when queries are complex and numerous matches are found.
They provide no mechanism for displaying surrounding context (although the Free Software Foundation implementations for the GNU Project remedy this with their -A num (after) and -B num (before) switches).
They provide no mechanism for searching for occurrences in the same paragraph, unless preprocessing is done on the files to wrap paragraphs into long lines. Even that will often fail, because input buffer sizes may be limited to a few hundred characters.
They provide no easy way to deal with grammatical word-ending variants, except by explicit enumeration, such as searching for compress, compressed, compresses, compressing, and compression.

Most computer users have been faced with the problem of finding information from a large collection of files, such as electronic mail, on-line documentation, or source code. Inverted indices provide a highly-effective solution to this problem.

The mg software is described in Appendix A of the book

Ian H. Witten, Alistair Moffat, and Timothy C. Bell
Managing Gigabytes: Compressing and Indexing Documents and Images
Van Nostrand Reinhold
1994
xiv + 429 pages
US$54.95
ISBN 0-442-01863-0
Library of Congress catalog number TA1637 .W58 1994

Many of the algorithms implemented in mg represent significant advances over previous work, both in speed, and in storage requirements. On a fast workstation, in tens of minutes, or a few hours, mgbuild(1) can create an index to all of the words in hundreds of thousands of documents occupying hundreds of megabytes, or even more than a gigabyte, of disk space. mgquery(1) can then be used to answer complex queries, with responses often returned in a second or less. mg also contains algorithms to deal with images, so that with a small amount of descriptive text for each image, it is possible to do searches in collections of images, and to have retrievals display the images using a viewer like xv(1).

mg can deal with compressed text and image files and surprisingly, it usually runs faster than it would if the files were not compressed! Thus, the considerable disk space savings possible from file compression are not lost because of the need for fast document search and retrieval.

The Free Software Foundation GNU Project compression utilities gzip(1) and gunzip(1) are recommended for general use over older alternatives, like compress(1), because of their speed and high compression ratios.

AVAILABILITY

The mg software can be obtained via anonymous ftp to the Australian archive host munnari.oz.au [128.250.1.21] from the directory /pub/mg.

TYPICAL USE OF mg

Although mg consists of more than 20 separate programs, many of which have complicated command-line options, take heart: most users require only two or three of these programs, and nothing more than a document name on the command line.

A document for mg is a fragment of text suitable for retrieval as a unit when it is found to contain a requested word, or words. In a collection of poetry, a document might be a stanza, while in a novel, it could be a paragraph. In an index of first lines of poems, a document would likely be just a single line.

Just what constitutes a document is decided by a user-modifiable UNIX shell script, mg_get(1). The default script provided with the mg source distribution knows about these named document collections:

alice: Lewis Carroll's Alice in Wonderland book.
allfiles: all mail files in the directory tree $HOME/Mail, including all of its nested subdirectories.
mailfiles: individual mail messages in $HOME/mbox and $HOME/.sentmail.
davinci: A small collection of text and images from the work of Leonardo da Vinci.

A document collection name is used by mg as a csh(1) case statement selector, and as a subdirectory name in the $MGDATA directory, or the current directory, if MGDATA is not defined.

EXTENDING mg_get

This section describes how to extend mg_get(1) to handle a new document collection.

Let us take two examples: all BibTeX .bib files, and all TeX files, contained in subdirectories under the login directory.

For BibTeX, each bibliographic entry will be considered a separate document. In order to facilitate easy identification of entries, we shall require them to begin at the start of a line; the bibclean(1) utility can be used to standardize the format of .bib files, and to validate their string values, so that this requirement is met.

For TeX, each paragraph will be a separate document, and we assume that paragraphs are separated by blank lines. We assume that files with extensions .atx, .ltx, .stx, and .tex contain input to common TeX macro package variants.

Make a personal copy of the mg_get script, using the one in the mg source distribution (mg-1.0/mg/mg_get.sh), or the one in the local binary program directory, at many sites called /usr/local/bin/mg_get.

Examination of the mg_get script shows that each document collection name is used in a csh(1) case statement selector, and that most of work is done by very simple awk(1) programs that extract documents from files. In your private copy of the mg_get file, after the line

  breaksw #davinci

and before the line

  default:

insert this new code:

  case bibfiles:
# Takes a list of files that contain BibTeX entries, and splits them up
#   by putting ^B after each entry. Assumes that each entry
#   begins with a line '^@'.
  switch ($flag)
    case '-init':
    breaksw

    case '-text':
      find $HOME -name '*.bib' -print | \
        sort | \
        xargs -l100 awk \
          '/^@/&&NR!=1{print "^B"} {print $0} END{print "^B"}'
    breaksw #-text

    case '-cleanup':
    breaksw #-cleanup

  endsw #flag
  breaksw #bibfiles

  case texfiles:
# Takes a list of TeX files and split them up
#   by putting ^B after each paragraph. Assumes that each entry
#   begins with a line '^@'.
  switch ($flag)
    case '-init':
    breaksw

    case '-text':
      find $HOME -name '*x' -print | \
        egrep '[.]tex$|[.]ltx$|[.]atx$|[.]stx$' | \
        sort | \
        xargs -l100 nawk '   /^ *$/ {if (b!=1) printf "^B";b=1} \
            \!/^ *$/ {print;b=0} \
            END {printf "^B"}'
    breaksw #-text

    case '-cleanup':
    breaksw #-cleanup

  endsw #flag
  breaksw #texfiles

The ^B characters here are Control-B characters, not caret-B pairs.

If you have a large number of BibTeX or TeX files, it is likely that a list of them would be too long for the UNIX shell to hold in a single variable, or on a single command line. Thus, instead of storing the output of find(1) in a variable, we proceed more cautiously, and employ it to produce a list of the required files, then pipe them to xargs(1), which in turn passes up to 100 filenames at a time to nawk(1) for document selection.

Install this modified mg_get script in your private directory for executable programs (e.g. $HOME/bin), create a directory $HOME/mgdata to hold the index, issue a rehash command if you are using csh(1) or tcsh(1), ensure that mg_get occurs in your search path before any system-wide one (the command which mg_get will tell you which version will be selected), then create the inverted indexes by mgbuild bibfiles and mgbuild texfiles. These commands may take several minutes to run if you have a lot of BibTeX or TeX files, or a large home directory tree. Once they are complete, you can then query the index with the commands mgquery bibfiles and mgquery texfiles. These should respond very rapidly.

In order to keep your index up-to-date, you should arrange for it to be recreated automatically and regularly, probably every night. You can do this with cron(1). Use the command crontab -e to edit your crontab file and add two lines like this:

00 04 * * * mgbuild bibfiles >$HOME/mgdata/bibfiles.log 2>&1
15 04 * * * mgbuild texfiles >$HOME/mgdata/texfiles.log 2>&1

Save the file and exit the editor. Now, every night at 4am and 4:15am, mgbuild(1) will reconstruct your inverted indexes, and the results of the builds will be saved in log files in your $HOME/mgdata directory.