Next: Page number indexing Up: Automated indexing of BibTeX Previous: Author indexing

Title indexing

All words in every title are candidates for the title word index, with these transformations and reductions:

One- and two-letter words (before plural suffix stripping) are normally not indexed, although this restriction can be overridden at index-generation time.
Words containing only digits and punctuation are not indexed, unless they occur in mathematics material (see below).
Possessive suffixes 's or 'es are dropped.
Index entries for words containing accents are sorted without regard to the accents, so Salát is sorted like Salat.
Non-English letters, such as Scandinavian æ, ø, and å, and French œ, are sorted as if they were replaced by ae, o, a, and oe, respectively. [More precisely, TeX control sequences for accents, and for these letters, are reduced by dropping the non-letters when forming the sorting key.]
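The bracketed rule above can be sketched in a few lines of Python. This is a minimal illustration, not the indexing software's actual code: the function name `sort_key` is hypothetical, and the sketch assumes that "non-letters" means everything outside the ASCII letters A-Z and a-z.

```python
import re

def sort_key(tex_word: str) -> str:
    """Form a sorting key by dropping every character that is not an
    ASCII letter (backslashes, braces, accent marks, and so on)."""
    return re.sub(r"[^A-Za-z]", "", tex_word)

print(sort_key(r"Sal\'at"))      # Salat
print(sort_key(r"{\oe}uvre"))    # oeuvre
print(sort_key(r"S{\o}rensen"))  # Sorensen
```

Under this naive rule, an accent control sequence whose name is itself a letter, such as \c for the cedilla, would leave that letter in the key, so a real implementation would presumably need extra care for those cases.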
Certain very common words, mostly conjunctions, prepositions, and pronouns, are not indexed. Here is the list of excluded words:
a about above after also am among an and are as at be before beside between but by can do for from go he her hers him his i if in into is it its me my no of on or our out over she so some that the their them these they this those to under up us we with within without you your

Words in languages other than English are not considered for membership in this list, even though titles in at least several Western European languages may be present in the bibliographic data.
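The filtering rules described so far (short words, digit-only words, possessive suffixes, and the excluded-word list) might be combined as in the following sketch. The function name `title_candidates`, the `min_length` parameter, and the order in which the checks are applied are assumptions made for illustration; the exception for mathematics material is not modeled.

```python
import re

# Excluded words, copied from the list above.
STOP_WORDS = frozenset("""
a about above after also am among an and are as at be before beside between
but by can do for from go he her hers him his i if in into is it its me my no
of on or our out over she so some that the their them these they this those
to under up us we with within without you your
""".split())

def title_candidates(title, min_length=3):
    """Return the words of a title that survive the basic filters.

    min_length=3 drops one- and two-letter words; a smaller value
    stands in for the override available at index-generation time.
    """
    kept = []
    for raw in title.split():
        word = raw.strip(',;:!?()[]"')       # surrounding punctuation (assumed set)
        word = re.sub(r"'e?s$", "", word)    # possessive 's or 'es
        if len(word) < min_length:
            continue                         # too short
        if not any(ch.isalpha() for ch in word):
            continue                         # only digits and punctuation
        if word.lower() in STOP_WORDS:
            continue                         # excluded common word
        kept.append(word)
    return kept

print(title_candidates("A Survey of 1997 Results for Knuth's Algorithm"))
# ['Survey', 'Results', 'Knuth', 'Algorithm']
```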
Compound words are indexed under all rotations: Euler-Gergonne-Soddy is also indexed under Gergonne-Soddy, Euler- and under Soddy, Euler-Gergonne-.
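As a sketch of how such rotations could be generated, the hypothetical helper below splits a compound at its hyphens and moves the leading parts to the end, after a comma. The entry format is inferred from the example above and is otherwise an assumption.

```python
def rotations(compound):
    """Return the index entries for a hyphenated compound word:
    the word itself, then one rotated entry per later component."""
    parts = compound.split("-")
    entries = [compound]
    for i in range(1, len(parts)):
        head = "-".join(parts[i:])        # components that lead the entry
        tail = "-".join(parts[:i]) + "-"  # displaced components, trailing hyphen
        entries.append(f"{head}, {tail}")
    return entries

for entry in rotations("Euler-Gergonne-Soddy"):
    print(entry)
# Euler-Gergonne-Soddy
# Gergonne-Soddy, Euler-
# Soddy, Euler-Gergonne-
```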
When two words differ only in initial lettercase, then both are indexed under the lowercase form. Thus, Equation and equation appear under the latter.
Words that contain embedded uppercase letters are not lowercased in the index, so there would be separate entries for the acronym DES (Data Encryption Standard) and the French word des.
If a lowercase form of a capitalized word is not found in any other title, then the capitalized form is indexed.
Lowercasing such words would require the software to distinguish between a valid lowercasing of Transitive to transitive, and an invalid lowercasing of Weierstrass to weierstrass. This is an impossible task for a computer, because it requires context-sensitive analysis, and human understanding of the text and subject area, to handle ambiguous cases: consider Green functions and green beans!
In some cases, this processing can lead to incorrect lowercasing, such as a capitalized German noun Software being indexed under the English software, but this minor error is acceptable, because it will never prevent a human from finding the word in the index.
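The lettercase rules above can be illustrated with the following sketch. The function name `case_fold_entries` and the choice to return a mapping from each word to its index form are assumptions, not a description of the actual software.

```python
def case_fold_entries(words):
    """Map each distinct word to the form under which it is indexed.

    A capitalized word is lowercased only if it has no embedded
    uppercase letters and its lowercase form also occurs in the data.
    """
    vocabulary = {w for w in words if w}   # ignore empty strings
    indexed_as = {}
    for word in vocabulary:
        lowered = word[0].lower() + word[1:]
        embedded_upper = any(ch.isupper() for ch in word[1:])
        if not embedded_upper and lowered in vocabulary:
            indexed_as[word] = lowered     # Equation -> equation
        else:
            indexed_as[word] = word        # DES, Weierstrass stay as they are
    return indexed_as

forms = case_fold_entries(
    ["Equation", "equation", "DES", "des", "Weierstrass", "Green", "green"])
print(forms["Equation"], forms["DES"], forms["Weierstrass"], forms["Green"])
# equation DES Weierstrass green
```

The last result shows the ambiguity discussed above: once green occurs anywhere in the data, the proper name Green is folded into it.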
English plural formation is irregular, and the associated grammar rules are riddled with exceptions. In hand-prepared indexes, it is conventional to index only the singular form of a word, even if the plural form occurred at that location. Because of the grammatical irregularities, algorithmic reduction of plurals in a computer program must be supplemented by a substantial exception dictionary.
In the indexing software used here, a simpler approach is taken. A plural form is reduced to a singular form by stripping a final s or es, reducing a final ies to y, or reducing an ices ending to ex or ix. However, the resulting word is rejected unless it contains only letters, and the word is already present in the list of words to be indexed. That list is thus treated as a dictionary.
If only a plural form is found, then that form is indexed.
This algorithm can produce false reductions and ambiguities, such as cubes to cube or cub, but doing a better job would require a more sophisticated algorithm for plural-to-singular conversion, and an exception dictionary. Furthermore, even that algorithm would fail completely when confronted with a non-English word, or a highly technical word that is absent from English dictionaries, both of which are very likely to occur in scientific bibliography data.
The indexing software therefore takes a conservative approach: it permits the user to supply a supplemental dictionary containing singular words, and one or more plural forms for each of them (e.g., index indexes indices, and symposium symposia symposiums). This dictionary need not be a comprehensive list for the English language, but only for the few hundred plurals that might occur in the journal index. Such a list can be constructed by filtering the index word list to extract all of those with plural endings, and then manually augmenting them with corresponding singular forms. To avoid errors, the resulting list should itself be checked with spelling programs, such as UNIX spell or GNU ispell.
Any candidate word that is found in the list of singular forms from the supplemental dictionary will not be stripped of plural suffixes, so that, e.g., news can be prevented from reducing to new, even if the latter word occurs elsewhere in the index.
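The plural handling described above might look roughly like the following sketch. The names `singular_candidates` and `reduce_plural`, the representation of the supplemental dictionary as a mapping from each singular to its plural forms, and the order in which suffix rules are tried are all assumptions; the text does not say how ambiguities such as cube versus cub are resolved.

```python
def singular_candidates(word):
    """Candidate singulars from the suffix rules, in an assumed order."""
    candidates = []
    if word.endswith("ices"):
        candidates += [word[:-4] + "ex", word[:-4] + "ix"]
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("s"):
        candidates.append(word[:-1])
    if word.endswith("es"):
        candidates.append(word[:-2])
    return candidates

def reduce_plural(word, index_words, supplemental=None):
    """Return the form under which `word` is indexed.

    index_words:  set of all words to be indexed (used as a dictionary)
    supplemental: hypothetical user dictionary {singular: [plural, ...]}
    """
    supplemental = supplemental or {}
    if word in supplemental:
        return word                      # listed singular: never stripped
    for singular, plurals in supplemental.items():
        if word in plurals:
            return singular              # explicit user-supplied mapping
    for candidate in singular_candidates(word):
        if candidate.isalpha() and candidate in index_words:
            return candidate             # accepted: letters only, already indexed
    return word                          # only the plural form was found

words = {"cube", "cubes", "matrix", "matrices", "new", "news", "analysis"}
print(reduce_plural("matrices", words))              # matrix
print(reduce_plural("news", words))                  # new
print(reduce_plural("news", words, {"news": []}))    # news
print(reduce_plural("analysis", words))              # analysis
```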
Words that end in a period (dot, full stop) arise from abbreviations, initials, and end of sentence. If, after stripping the final punctuation, the word is found in the index word list, it is indexed without the period. Otherwise, the period is retained.
To eliminate remaining unwanted final periods, it is sufficient to make suitable entries in the plural dictionary manually, then regenerate the index.
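The basic rule for final periods is simple enough to state as a small sketch. The function name `strip_final_period` is hypothetical, and the manual override through the plural dictionary is not modeled.

```python
def strip_final_period(word, index_words):
    """Drop a trailing period only if the bare word is already indexed."""
    if word.endswith("."):
        bare = word[:-1]
        if bare in index_words:
            return bare
    return word

index_words = {"algorithm", "equation"}
print(strip_final_period("algorithm.", index_words))  # algorithm
print(strip_final_period("Proc.", index_words))       # Proc.
```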
Mathematical material in TeX markup is indexed under a separate section at the start of the index, with the alphabetization based on a sorting key formed by eliminating all characters that are not letters, digits, hyphen, or space.
While this procedure can produce a few sorting irregularities, the order is readily discernible to anyone with even limited exposure to TeX mathematics markup, which generally labels mathematical symbols by their English names. In any event, the number of index entries with mathematical material is usually fairly small, so a human reader can easily do a linear scan through that section.
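The sorting key for mathematical entries can be sketched as a single substitution. The function name `math_sort_key` is hypothetical, and the sketch assumes that "letters" and "digits" mean the ASCII ranges.

```python
import re

def math_sort_key(tex: str) -> str:
    """Keep only letters, digits, hyphens, and spaces."""
    return re.sub(r"[^A-Za-z0-9 -]", "", tex)

print(math_sort_key(r"$O(n \log n)$"))   # On log n
print(math_sort_key(r"$\epsilon$-net"))  # epsilon-net
```

Because control sequences such as \epsilon and \log reduce to their English names, entries built this way sort in a place a reader familiar with TeX would expect.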


Nelson H. F. Beebe
12/30/1997