% title page % Here is TeX material that gets inserted after \input webhdr \font\ninerm=cmr9 \let\mc=\ninerm % medium caps for names like PASCAL \def\PASCAL{{\mc PASCAL}} \def\title{PATGEN} \def\contentspagenumber{45} % should be odd \def\topofcontents{ \line{\tenit Appendix\hfil \mainfont\contentspagenumber} \vfill \null\vskip 40pt \centerline{\titlefont {\ttitlefont PAT}tern {\ttitlefont GEN}eration program} \vskip 8pt \centerline{\titlefont for the \TeX 82 hyphenator} \vskip 20pt\null \vfill} \pageno=\contentspagenumber \advance\pageno by 1 @* Introduction. This program takes a list of hyphenated words and generates a set of patterns that can be used by the \TeX 82 hyphenation algorithm. The patterns consist of strings of letters and digits, where a digit indicates a `hyphenation value' for some intercharacter position. For example, the pattern \.{3t2ion} specifies that if the string \.{tion} occurs in a word, we should assign a hyphenation value of 3 to the position immediately before the \.{t}, and a value of 2 to the position between the \.{t} and the \.{i}. To hyphenate a word, we find all patterns that match within the word and determine the hyphenation values for each intercharacter position. If more than one pattern applies to a given position, we take the maximum of the values specified (i.e.\ the higher value takes priority). If the resulting hyphenation value is odd, this position is a feasible breakpoint; if the value is even or if no value has been specified, we are not allowed to break at this position. In order to find quickly the patterns that match in a given word and to compute the associated hyphenation values, the patterns generated by this program are compiled by \.{INITEX} into a compact version of a finite state machine. For further details, see the \TeX 82 source. @p @@/ program PATGEN(@!dictionary,@!patterns,@!output); label @@/ const @@/ type @@/ var @@/ procedure initialize; {this procedure gets things started properly} var @@; begin @@/ end; @ The patterns are generated in a series of sequential passes through the dictionary. In each pass, we collect count statistics for a particular type of pattern, taking into account the effect of patterns chosen in previous passes. At the end of a pass, the counts are examined and new patterns are selected. Patterns are chosen one level at a time, in order of increasing hyphenation value. In the sample run shown below, the parameters |hyph_start| and |hyph_finish| specify the first and last levels, respectively, to be generated. Patterns at each level are chosen in order of increasing pattern length (usually starting with length~2). This is controlled by the parameters |pat_start| and |pat_finish| specified at the beginning of each level. Furthermore patterns of the same length applying to different intercharacter positions are chosen in separate passes through the dictionary. Since patterns of length $n$ may apply to $n+1$ different positions, choosing a set of patterns of lengths $2$ through $n$ for a given level requires $(n+1)(n+2)/2-3$ passes through the word list. At each level, the selection of patterns is controlled by the three parameters |good_wt|, |bad_wt|, and |thresh|. A hyphenating pattern will be selected if |good*good_wt-bad*bad_wt>=thresh|, where |good| and |bad| are the number of times the pattern could and could not be hyphenated, respectively, at a particular point. For inhibiting patterns, |good| is the number of errors inhibited, and |bad| is the number of previously found hyphens inhibited. @= @!pat_start, @!pat_finish: dot_type; @!hyph_start, @!hyph_finish: val_type; @!good_wt, @!bad_wt, @!thresh: integer; @ The proper choice of the parameters to achieve a desired degree of hyphenation is discussed in Chapter~4. Below we show part of a sample run of \.{PATGEN}, with the user's inputs underlined. $$\vbox{\halign{\.{#\hfil}\cr $\underline{\smash{\.{ex patgen}}}$\cr DICTIONARY : $\underline{\smash{\.{murray.hyf}}}$\cr PATTERNS : $\underline{\smash{\.{nul:}}}$\cr OUTPUT : $\underline{\smash{\.{murray.pat}}}$\cr 0 patterns read in\cr pattern trie has 128 nodes, trie\_max = 128, 0 outputs\cr hyph\_start = $\underline{\.{1}}$\cr hyph\_finish = $\underline{\.{1}}$\cr pat\_start = $\underline{\.{2}}$\cr pat\_finish = $\underline{\.{3}}$\cr good weight, bad weight, threshold: $\underline{\.{1 3 3}}$\cr processing dictionary with pat\_len = 2, pat\_dot = 1\cr \cr 0 good, 0 bad, 3265 missed\cr 0.00 \%, 0.00 \%, 100.00 \%\cr 338 patterns, 466 nodes in count trie, triec\_max = 983\cr 46 good and 152 bad patterns added (more to come)\cr finding 715 good and 62 bad hyphens, efficiency = 10.72\cr pattern trie has 326 nodes, trie\_max = 509, 2 outputs\cr processing dictionary with pat\_len = 2, pat\_dot = 0\cr \cr \hskip 1.5em ...\cr \cr 1592 nodes and 39 outputs deleted\cr total of 220 patterns at hyph\_level 1\cr hyphenate word list? $\underline{\smash{\.{y}}}$\cr writing pattmp.1\cr \cr 2529 good, 243 bad, 736 missed\cr 77.46 \%, 7.44 \%, 22.54 \%\cr}}$$ @ Note that before beginning a pattern selection run, a file of existing patterns may be read in. In order for pattern selection to work properly, this file should only contain patterns with hyphenation values less than |hyph_start|. Each word in the dictionary is hyphenated according to the existing set of patterns (including those chosen on previous passes of the current run) before pattern statistics are collected. Also, a hyphenated word list may be written out at the end of a run. This list can be read back in as the `dictionary' to continue pattern selection from this point. The new list will contain two additional kinds of ``hyphens'' between letters, namely hyphens that have been found by previously generated patterns, as well as erroneous hyphens that have been inserted by those patterns. These are represented by the symbols |"*"| and |"."|, respectively. In addition, a word list can include hyphen weights, both for entire words and for individual hyphen positions. (The syntax for this is explained in the dictionary processing routines.) Thus common words can be weighted more heavily, or, more generally, words can be weighted according to their frequency of occurrence, if such information is available. The use of hyphen weights combined with an appropriate setting of the pattern selection threshold can be used to guarantee hyphenation of certain words or certain hyphen positions within a word. @= @!dictionary, @!patterns, @!pattmp: file of text_char; @ Below we show the first few lines of a typical word list, before and after generating a level of patterns. $$\vbox{\halign{\tabskip 1in\.{#\hfil}&\.{#\hfil}\cr abil-i-ty& abil*i*ty\cr ab-sence& ab*sence\cr ab-stract& ab*stract\cr ac-a-dem-ic& ac-a-d.em-ic\cr ac-cept& ac*cept\cr ac-cept-able& ac*cept-able\cr ac-cept-ed& ac*cept*ed\cr \hskip 1.5em ...&\hskip 1.5em ...\cr }}$$ @ We augment \PASCAL 's control structures a bit using |goto|\unskip's and the following symbolic labels. @d exit=10 {go here to leave a procedure} @d continue=22 {go here to resume a loop} @d done=30 {go here to exit a loop} @d found=40 {go here when you've found it} @d not_found=41 {go here when you've found something else} @d end_of_PATGEN=9999 @ @= continue, end_of_PATGEN; @ Here are some macros for common programming idioms. @d incr(#)==#:=#+1 {increase a variable by unity} @d decr(#)==#:=#-1 {decrease a variable by unity} @d return==goto exit {terminate a procedure call} @f return==nil @ The following definitions are used for terminal input/output. @d print(#)==write(tty,#) @d print_ln(#)==writeln(tty,#) @d input(#)==read(tty,#) @d input_ln(#)==begin if eoln(tty) then readln(tty); read(tty,#) end @# @d error(#)==begin print_ln(#); goto end_of_PATGEN; end; @ @= @{@&$C-,A+,D-@} {no range check, catch arithmetic overflow, no debug overhead} @^system dependencies@> @* The character set. Since different \PASCAL\ systems may use different character sets, we use the name |text_char| to stand for the data type of characters appearing in external text files. We also assume that |text_char| consists of the elements |chr(first_text_char)| through |chr(last_text_char)|, inclusive. The definitions below should be adjusted if necessary. Internally, characters will be represented using the type |ascii_code|. @^system dependencies@> @^character set dependencies@> @d first_text_char=0 {ordinal number of the smallest element of |text_char|} @d last_text_char=127 {ordinal number of the largest element of |text_char|} @= @!text_char=char; {the data type of characters in text files} @!ascii_code=0..127; {internal representation of input characters} @ We convert between |ascii_code| and the user's external character set by means of arrays |xord| and |xchr| that are analogous to \PASCAL's |ord| and |chr| functions. @= @!xord: array [text_char] of ascii_code; {specifies conversion of input characters} @!xchr: array [ascii_code] of text_char; {specifies conversion of output characters} @ @= i:text_char; j:ascii_code; @ For English, our dictionary and pattern files use the letters \.A through \.Z, the digits \.0 through \.9, and the symbols \.*, \.-, and \.. (period). Lowercase letters are also accepted; they are mapped internally to upper case. @d invalid_code=@'177 {ascii code that should not appear} @= for i:=chr(first_text_char) to chr(last_text_char) do xord[i]:=invalid_code; xord['*']:="*"; xord['-']:="-"; xord['.']:=".";@/ xord['0']:="0"; xord['1']:="1"; xord['2']:="2"; xord['3']:="3"; xord['4']:="4"; xord['5']:="5"; xord['6']:="6"; xord['7']:="7"; xord['8']:="8"; xord['9']:="9";@/ xord['A']:="A"; xord['B']:="B"; xord['C']:="C"; xord['D']:="D"; xord['E']:="E"; xord['F']:="F"; xord['G']:="G"; xord['H']:="H"; xord['I']:="I"; xord['J']:="J"; xord['K']:="K"; xord['L']:="L"; xord['M']:="M"; xord['N']:="N"; xord['O']:="O"; xord['P']:="P"; xord['Q']:="Q"; xord['R']:="R"; xord['S']:="S"; xord['T']:="T"; xord['U']:="U"; xord['V']:="V"; xord['W']:="W"; xord['X']:="X"; xord['Y']:="Y"; xord['Z']:="Z";@/ xord['a']:="A"; xord['b']:="B"; xord['c']:="C"; xord['d']:="D"; xord['e']:="E"; xord['f']:="F"; xord['g']:="G"; xord['h']:="H"; xord['i']:="I"; xord['j']:="J"; xord['k']:="K"; xord['l']:="L"; xord['m']:="M"; xord['n']:="N"; xord['o']:="O"; xord['p']:="P"; xord['q']:="Q"; xord['r']:="R"; xord['s']:="S"; xord['t']:="T"; xord['u']:="U"; xord['v']:="V"; xord['w']:="W"; xord['x']:="X"; xord['y']:="Y"; xord['z']:="Z"; @ On output, all letters are mapped to lower case. @= for j:=0 to 127 do xchr[j]:=' '; xchr["*"]:='*'; xchr["-"]:='-'; xchr["."]:='.';@/ xchr["0"]:='0'; xchr["1"]:='1'; xchr["2"]:='2'; xchr["3"]:='3'; xchr["4"]:='4'; xchr["5"]:='5'; xchr["6"]:='6'; xchr["7"]:='7'; xchr["8"]:='8'; xchr["9"]:='9';@/ xchr["A"]:='a'; xchr["B"]:='b'; xchr["C"]:='c'; xchr["D"]:='d'; xchr["E"]:='e'; xchr["F"]:='f'; xchr["G"]:='g'; xchr["H"]:='h'; xchr["I"]:='i'; xchr["J"]:='j'; xchr["K"]:='k'; xchr["L"]:='l'; xchr["M"]:='m'; xchr["N"]:='n'; xchr["O"]:='o'; xchr["P"]:='p'; xchr["Q"]:='q'; xchr["R"]:='r'; xchr["S"]:='s'; xchr["T"]:='t'; xchr["U"]:='u'; xchr["V"]:='v'; xchr["W"]:='w'; xchr["X"]:='x'; xchr["Y"]:='y'; xchr["Z"]:='z'; @ We assume that words use only the letters |cmin+1| through |cmax|. This allows us to save some time on trie operations that involve searching for packed transitions belonging to a particular state. @d cmin="@@" @d cmax="Z" @d edge_of_word="@@" @* Data structures. The main data structure used in this program is a dynamic packed trie. In fact we use two of them, one for the set of patterns selected so far, and one for the patterns being considered in the current pass. For a pattern $p_1\ldots p_k$, the information associated with that pattern is accessed by setting |@t$t_1$@>:=trie_root+@t$p_1$@>| and then, for |1:=trie_link(@t$t_{i-1}$@>)+ @t$p_i$@>|; the pattern information is then stored in a location addressed by |@t$t_k$@>|. Since all trie nodes are packed into a single array, in order to distinguish nodes belonging to different trie families, a special field is provided such that |trie_char@t$(t_i)=p_i$@>| for all |i|. In addition the trie must support dynamic insertions and deletions. This is done by maintaining a doubly linked list of unoccupied cells and repacking trie families as necessary when insertions are made. Each trie node consists of three fields: the character |trie_char|, and the two link fields |trie_link| and |trie_back|. In addition there is a separate boolean array |trie_base_used|. When a node is unoccupied, |trie_char| is zero and the link fields point to the next and previous unoccupied nodes, respectively, in the doubly linked list. When a node is occupied, |trie_link| points to the next trie family, and |trie_back| (renamed |trie_outp|) contains the output associated with this transition. The |trie_base_used| bit indicates that some family has been packed at this base location, and is used to prevent two families from being packed at the same location. @ The sizes of the pattern tries may have to be adjusted depending on the particular application (i.e.\ the parameter settings and the size of the dictionary). The sizes below were sufficient to generate the current set of \TeX 82 hyphenation patterns. @= @!trie_size=55000; {space for pattern trie} @!triec_size=26000; {space for pattern count trie, should be less than |trie_size| and greater than the number of occurrences of any pattern in the dictionary} @!max_ops=4080; {size of output hash table, should be a multiple of 510} @!max_val=7; {maximum number of levels$+1$, also used to denote bad patterns} @!max_dot=15; {maximum pattern length} @!max_len=50; {maximum word length} @ @= @!q_index=0..128; {number of transitions in a state} @!val_type=0..max_val; {hyphenation values} @!dot_type=0..max_dot; {dot positions} @!op_type=0..max_ops; {index into output hash table} @!word_index=0..max_len; {index into |word|} @!trie_pointer=0..trie_size; @!triec_pointer=0..triec_size;@/ @!trie_node=record ch: ascii_code; lh,rh: trie_pointer end; @!op_word=packed record dot: dot_type; val: val_type; op: op_type end; @ Instead of using the |trie_node| record defined above, the trie is actually stored with its components in separate packed arrays, in order to save space (although this depends on the computer's word size and the size of the trie pointers). @= @!trie_c: packed array[trie_pointer] of ascii_code; @!trie_l, @!trie_r: packed array[trie_pointer] of trie_pointer; @!trie_taken: packed array[trie_pointer] of boolean; @!triec_c: packed array[triec_pointer] of ascii_code; @!triec_l, @!triec_r: packed array[triec_pointer] of triec_pointer; @!triec_taken: packed array[triec_pointer] of boolean; @!ops: array[op_type] of op_word; {output hash table} @ When some trie state is being worked on, an unpacked version of the state is kept in positions |1..qmax| of the global array |trieq|. The character fields need not be in any particular order. @= @!trieq: array[1..128] of trie_node; {holds a single trie state} @!qmax: q_index; @!qmax_thresh: q_index; {controls density of first-fit packing} @ Trie fields are accessed using the following macros. @d trie_char(#)==trie_c[#] @d trie_link(#)==trie_l[#] @d trie_back(#)==trie_r[#] @d trie_outp(#)==trie_r[#] @d trie_base_used(#)==trie_taken[#] @d triec_char(#)==triec_c[#] @d triec_link(#)==triec_l[#] @d triec_back(#)==triec_r[#] @d triec_good(#)==triec_l[#] @d triec_bad(#)==triec_r[#] @d triec_base_used(#)==triec_taken[#] @d q_char(#)==trieq[#].ch @d q_link(#)==trieq[#].rh @d q_back(#)==trieq[#].lh @d q_outp(#)==trieq[#].lh @d hyf_val(#)==ops[#].val @d hyf_dot(#)==ops[#].dot @d hyf_nxt(#)==ops[#].op @* Routines for pattern trie. The pattern trie holds the set of patterns chosen prior to the current pass, including bad or ``hopeless'' patterns at the current level that occur too few times in the dictionary to be of use. Each transition of the trie includes an output field pointing to the hyphenation information associated with this transition. @= @!trie_max: trie_pointer; {maximum occupied trie node} @!trie_bmax: trie_pointer; {maximum base of trie family} @!trie_count: trie_pointer; {number of occupied trie nodes, for space usage statistics} @!op_count: op_type; {number of outputs in hash table} @ Initially, the dynamic packed trie has just one state, namely the root, with all transitions present (but with null links). This is convenient because the root will never need to be repacked and also we won't have to check that the base is nonnegative when packing other states. @d trie_root=1 @p procedure init_pattern_trie; var c: ascii_code; @!h: op_type; begin for c:=0 to 127 do begin trie_char(trie_root+c):=c; {indicates node occupied} trie_link(trie_root+c):=0; trie_outp(trie_root+c):=0; end; trie_base_used(trie_root):=true; trie_bmax:=trie_root; trie_max:=trie_root+127; trie_count:=128;@/ qmax_thresh:=5;@/ trie_link(0):=trie_max+1; trie_back(trie_max+1):=0;@/ {|trie_link(0)| is used as the head of the doubly linked list of unoccupied cells} for h:=1 to max_ops do hyf_val(h):=0; {clear output hash table} op_count:=0; end; @ The |first_fit| procedure finds a hole in the packed trie into which the state in |trieq| will fit. This is normally done by going through the linked list of unoccupied cells and testing if the state will fit at each position. However if a state has too many transitions (and is therefore unlikely to fit among existing transitions) we don't bother and instead just pack it immediately to the right of the occupied region (starting at |trie_max+1|). @p function first_fit: trie_pointer; label found, not_found; var s, @!t: trie_pointer; @!q: q_index; begin @; for q:=1 to qmax do {pack it} begin t:=s+q_char(q);@/ trie_link(trie_back(t)):=trie_link(t); trie_back(trie_link(t)):=trie_back(t); {link around filled cell} trie_char(t):=q_char(q); trie_link(t):=q_link(q); trie_outp(t):=q_outp(q); if t>trie_max then trie_max:=t; end; trie_base_used(s):=true; first_fit:=s end; @ The threshold for large states is initially 5 transitions. If more than one level of patterns is being generated, the threshold is set to 7 on subsequent levels because the pattern trie will be sparser after bad patterns are deleted (see |delete_bad_patterns|). @= if qmax>qmax_thresh then t:=trie_back(trie_max+1) @+else t:=0; while true do begin t:=trie_link(t); s:=t-q_char(1); {get next unoccupied cell} @; if trie_base_used(s) then goto not_found; for q:=qmax downto 2 do {check if state fits here} if trie_char(s+q_char(q))>0 then goto not_found; goto found; not_found: end; found: @ The trie is only initialized (as a doubly linked list of empty cells) as far as necessary. Here we extend the initialization if necessary, and check for overflow. @= if s+128>trie_size then error('Pattern trie too full!'); @.Pattern trie too full@> while trie_bmax hyf_val(h):=v; hyf_dot(h):=d; hyf_nxt(h):=n; new_trie_op:=h; return; end; if (hyf_val(h)=v) and (hyf_dot(h)=d) and (hyf_nxt(h)=n) then {already in hash table} begin new_trie_op:=h; return; end; if h>1 then decr(h) @+else h:=max_ops; {try again} end; exit: end; @ @= @!pat: array[dot_type] of ascii_code; {current pattern} @!pat_len: dot_type; {pattern length} @ Now that we have provided the necessary routines for manipulating the dynamic packed trie, here is a procedure that inserts a pattern of length |pat_len|, stored in the |pat| array, into the pattern trie. It also adds a new output. @p procedure insert_pattern(@!val: val_type; @!dot: dot_type); var i: dot_type; @!s, @!t: trie_pointer; begin i:=1; s:=trie_root+pat[i]; t:=trie_link(s); while (t>0) and (ipat[i] then @; s:=t; t:=trie_link(s); end; q_link(1):=0; q_outp(1):=0; qmax:=1; while i= begin if trie_char(t)=0 then begin {we're lucky, no repacking needed} trie_link(trie_back(t)):=trie_link(t); trie_back(trie_link(t)):=trie_back(t);@/ trie_char(t):=pat[i]; trie_link(t):=0; trie_outp(t):=0; if t>trie_max then trie_max:=t; end else begin {whoops, have to repack} t:=t-pat[i]; unpack(t);@/ incr(qmax); q_char(qmax):=pat[i]; q_link(qmax):=0; q_outp(qmax):=0;@/ t:=first_fit; trie_link(s):=t; t:=t+pat[i]; end; incr(trie_count); end; @* Routines for pattern count trie. The pattern count trie is used to store the set of patterns considered in the current pass, along with the counts of good and bad instances. The fields of this trie are the same as the pattern trie, except that there is no output field, and leaf nodes are also used to store counts (|triec_good| and |triec_bad|). Except where noted, the following routines are analogous to the pattern trie routines. @= @!triec_max, @!triec_bmax, @!triec_count: triec_pointer; {same as for pattern trie} @!triec_kmax: triec_pointer; {shows growth of trie during pass} @!pat_count: integer; {number of patterns in count trie} @ [See |init_pattern_trie|.] The variable |triec_kmax| always contains the size of the count trie rounded up to the next multiple of 4096, and is used to show the growth of the trie during each pass. @d triec_root=1 @p procedure init_count_trie; var c: ascii_code; begin for c:=0 to 127 do begin triec_char(triec_root+c):=c;@/ triec_link(triec_root+c):=0; triec_back(triec_root+c):=0; end; triec_base_used(triec_root):=true; triec_bmax:=triec_root; triec_max:=triec_root+127; triec_count:=128; triec_kmax:=4096;@/ triec_link(0):=triec_max+1; triec_back(triec_max+1):=0;@/ pat_count:=0; end; @ [See |first_fit|.] @p function firstc_fit: triec_pointer; label found, not_found; var a, @!b: triec_pointer; @!q: q_index; begin @; for q:=1 to qmax do {pack it} begin a:=b+q_char(q);@/ triec_link(triec_back(a)):=triec_link(a); triec_back(triec_link(a)):=triec_back(a);@/ triec_char(a):=q_char(q); triec_link(a):=q_link(q); triec_back(a):=q_back(q); if a>triec_max then triec_max:=a; end; triec_base_used(b):=true; firstc_fit:=b end; @ The threshold for attempting a first-fit packing is 3 transitions, which is lower than for the pattern trie because speed is more important here. @= if qmax>3 then a:=triec_back(triec_max+1) @+else a:=0; while true do begin a:=triec_link(a); b:=a-q_char(1);@/ @; if triec_base_used(b) then goto not_found; for q:=qmax downto 2 do if triec_char(b+q_char(q))>0 then goto not_found; goto found; not_found: end; found: @ @= if b+128>triec_kmax then begin if triec_kmax=triec_size then error('Count trie too full!'); @.Count trie too full@> print(triec_kmax div 1024:1, 'K '); if triec_kmax+4096>triec_size then triec_kmax:=triec_size else triec_kmax:=triec_kmax+4096; end; while triec_bmax0) and (sposword[spos] then @; b:=a; a:=triec_link(a); end; q_link(1):=0; q_back(1):=0; qmax:=1; while spos= begin if triec_char(a)=0 then {lucky} begin triec_link(triec_back(a)):=triec_link(a); triec_back(triec_link(a)):=triec_back(a); triec_char(a):=word[spos]; triec_link(a):=0; triec_back(a):=0; if a>triec_max then triec_max:=a; end else begin {have to repack} a:=a-word[spos]; unpackc(a);@/ incr(qmax); q_char(qmax):=word[spos]; q_link(qmax):=0; q_back(qmax):=0; a:=firstc_fit; triec_link(b):=a; a:=a+word[spos]; end; incr(triec_count); end; @* Routines for traversing pattern tries. At the end of a pass, we traverse the count trie using the following recursive procedure, selecting good and bad patterns and inserting them into the pattern trie. @p procedure traverse_count_trie(@!b: triec_pointer; @!i: dot_type);@/ {traverse subtries of family |b|; |i| is current depth in trie} var c: ascii_code; {a local variable that must be saved on recursive calls} @!a: triec_pointer; {does not need to be saved} begin incr(i); for c:=cmin to cmax do {find transitions belonging to this family} begin a:=b+c; if triec_char(a)=c then {found one} begin pat[i]:=c; if i; end; end; end; @ When we have come to the end of a pattern, |triec_good(a)| and |triec_bad(a)| contain the number of times this pattern helps or hinders the cause. We use the counts to determine if this pattern should be selected, or if it is hopeless, or if we can't decide yet. In the latter case, we set |more_to_come| true to indicate that there might still be good patterns extending the current type of patterns. @= if good_wt*triec_good(a)=thresh then {good pattern} begin insert_pattern(hyph_level,pat_dot); incr(good_pat_count); good_count:=good_count+triec_good(a); bad_count:=bad_count+triec_bad(a); end else more_to_come:=true @ Some global variables are used to accumulate statistics about the performance of a pass. @= @!good_pat_count, @!bad_pat_count: integer; {number of patterns added at end of pass} @!good_count, @!bad_count, @!miss_count: integer; {hyphen counts} @!level_pattern_count: integer; {number of good patterns at level} @!more_to_come: boolean; @ The recursion in |traverse_count_trie| is initiated by the following procedure, which also prints some statistics about the patterns chosen. The ``efficiency'' is an estimate of pattern effectiveness. @d bad_eff==(thresh/good_wt) @p procedure collect_count_trie; begin good_pat_count:=0; bad_pat_count:=0; good_count:=0; bad_count:=0; more_to_come:=false; traverse_count_trie(triec_root,0); @/ print(good_pat_count:1,' good and ', bad_pat_count:1,' bad patterns added'); level_pattern_count:=level_pattern_count+good_pat_count; if more_to_come then print_ln(' (more to come)') @+else print_ln(' '); print('finding ',good_count:1,' good and ',bad_count:1,' bad hyphens'); if good_pat_count>0 then print_ln(', efficiency = ', good_count/(good_pat_count+bad_count/bad_eff):1:2) else print_ln(' '); print_ln('pattern trie has ',trie_count:1,' nodes, ',@| 'trie_max = ',trie_max:1,', ',op_count:1,' outputs'); end; @ At the end of a level, we traverse the pattern trie and delete bad patterns by removing their outputs. If no output remains, the node is also deleted. @p function delete_patterns(@!s: trie_pointer): trie_pointer;@/ {delete bad patterns in subtrie |s|, return 0 if entire subtrie freed, otherwise |s|} var c: ascii_code; @!t: trie_pointer; @!all_freed: boolean; {must be saved on recursive calls} @!h, @!n: op_type; {do not need to be saved} begin all_freed:=true; for c:=cmin to cmax do {find transitions belonging to this family} begin t:=s+c; if trie_char(t)=c then begin @; if trie_link(t)>0 then trie_link(t):=delete_patterns(trie_link(t)); if (trie_link(t)>0) or (trie_outp(t)>0) or ((t<=trie_root+127) and (t>=trie_root)) then all_freed:=false else @; end; end; if all_freed then {entire state is freed} begin trie_base_used(s):=false; s:=0; end; delete_patterns:=s; end; @ @= begin h:=0; hyf_nxt(0):=trie_outp(t); n:=hyf_nxt(0); while n>0 do begin if hyf_val(n)=max_val then hyf_nxt(h):=hyf_nxt(n) else h:=n; n:=hyf_nxt(h); end; trie_outp(t):=hyf_nxt(0); end @ Cells freed by |delete_patterns| are put at the end of the free list. @= begin trie_link(trie_back(trie_max+1)):=t; trie_back(t):=trie_back(trie_max+1); trie_link(t):=trie_max+1; trie_back(trie_max+1):=t; trie_char(t):=0;@/ decr(trie_count); end @ The recursion in |delete_patterns| is initiated by the following procedure, which also prints statistics about the number of nodes deleted, and zeros bad outputs in the hash table. Note that the hash table may become somewhat disorganized when more levels are added, but this defect isn't serious. @p procedure delete_bad_patterns; var old_op_count: op_type; @!old_trie_count: trie_pointer; @!t: trie_pointer; @!h: op_type; begin old_op_count:=op_count; old_trie_count:=trie_count;@/ t:=delete_patterns(trie_root); for h:=1 to max_ops do if hyf_val(h)=max_val then begin hyf_val(h):=0; decr(op_count); end; print_ln(old_trie_count-trie_count:1,' nodes and ',@| old_op_count-op_count:1,' outputs deleted'); qmax_thresh:=7; {pattern trie will be sparser because of deleted patterns} end; @ After all patterns have been generated, we will traverse the pattern trie and output all patterns. Note that if a pattern appears more than once, only the maximum value at each position will be output. @p procedure output_patterns(@!s: trie_pointer; @!pat_len: dot_type);@/ {output patterns in subtrie |s|; |pat_len| is current depth in trie} var c: ascii_code; {must be saved on recursive calls} @!t: trie_pointer; @!h: op_type; @!d: dot_type; begin incr(pat_len); for c:=cmin to cmax do begin t:=s+c; if trie_char(t)=c then begin pat[pat_len]:=c; h:=trie_outp(t); if h>0 then @; if trie_link(t)>0 then output_patterns(trie_link(t),pat_len); end; end; end; @ @= begin for d:=0 to pat_len do hval[d]:=0; repeat d:=hyf_dot(h); if hval[d]0 then write(hval[0]:1); for d:=1 to pat_len do begin if pat[d]=edge_of_word then write('.') else write(xchr[pat[d]]); if hval[d]>0 then write(hval[d]:1); end; writeln; end @* Dictionary processing routines. The procedures in this section are the ``inner loop'' of the pattern generation process. To speed the program up, key parts of these routines should be coded in machine language. @= @!word: array[word_index] of ascii_code; {current word} @!dots: array[word_index] of ascii_code; {current hyphens} @!dotw: array[word_index] of integer; {dot weights} @!hval: array[word_index] of val_type; {hyphenation values} @!no_more: array[word_index] of boolean; {positions `knocked out'} @!wlen: word_index; {length of current word} @!word_wt: integer; {global word weight} @!wt_chg: boolean; {indicates |word_wt| has changed} @!buf: array[1..80] of text_char; {array to hold lines of input} @!buf_ptr: 1..80; {index into |buf|} @ We assume the word list uses only the letters \.A through \.Z. ``Dots'' between letters can be one of four possibilities: |"-"| indicating a hyphen, |"*"| indicating a found hyphen, |"."| indicating an error, or nothing. Furthermore single-digit word weights are allowed. A digit at the beginning of a word indicates a global word weight that is to be applied to all following words (until the next global word weight). A digit at some intercharacter position indicates a weight for that position only. @^character set dependencies@> The |read_word| procedure scans a line of input representing a word, and places the letters into the array |word|, with |word[1]=word[wlen]= edge_of_word|. The dot appearing between |word[dpos]| and |word[dpos+1]| is placed in |dots[dpos]|, and the corresponding dot weight in |dotw[dpos]|. @p procedure read_word; var c: ascii_code; begin readln(dictionary,buf); word[1]:=edge_of_word; wlen:=1; buf_ptr:=1; c:=xord[buf[buf_ptr]]; if (c<="9") and (c>="0") then begin word_wt:=(c-"0"); {global word weight} wt_chg:=true; incr(buf_ptr); end; repeat c:=xord[buf[buf_ptr]]; if c>="A" then {record the letter |c|} begin incr(wlen); word[wlen]:=c; dots[wlen]:=0; dotw[wlen]:=word_wt; end else if c>="0" then dotw[wlen]:=(c-"0") {dot weight} else dots[wlen]:=c; {record the dot |c|} incr(buf_ptr); until buf[buf_ptr]=' '; incr(wlen); word[wlen]:=edge_of_word; dotw[wlen-3]:=1; {dot position two letters from end is not weighted} end; @ Here is a procedure that uses the existing patterns to hyphenate the current word. The hyphenation value applying between the characters |word[dpos]| and |word[dpos+1]| is stored in |hval[dpos]|. In addition, |no_more[dpos]| is set to |true| if this position is ``knocked out'' by either a good or bad pattern at this level. That is, if the pattern with current length and hyphen position is a superstring of either a good or bad pattern at this level, then we don't need to collect count statistics for the pattern because it can't possibly be chosen in this pass. Thus we don't even need to insert such patterns into the count trie, which saves a good deal of space. @p procedure hyphenate; label done; var spos, @!dpos, @!fpos: word_index; @!t: trie_pointer; @!h: op_type; @!v: val_type; begin for spos:=wlen-3 downto 0 do begin no_more[spos]:=false; hval[spos]:=0; fpos:=spos+1; t:=trie_root+word[fpos]; repeat h:=trie_outp(t); while h>0 do @; t:=trie_link(t); if t=0 then goto done; incr(fpos); t:=t+word[fpos]; until trie_char(t)<>word[fpos]; done: end; end; @ @= begin dpos:=spos+hyf_dot(h); v:=hyf_val(h); if (v=hyph_level) then {check if position knocked out} if ((fpos-pat_len)<=(dpos-pat_dot))and((dpos-pat_dot)<=spos) then no_more[dpos]:=true; h:=hyf_nxt(h); end @ The |change_dots| procedure updates the |dots| array representing the printing values of the hyphens. Initially, hyphens in the word list are represented by |"-"| and non-hyphen positions are represented by |0|. A hyphen that is correctly found by some pattern is changed to |"*"|, while erroneous hyphens are represented by |"."|. Similary, erroneous hyphens that are correctly inhibited are changed back to |0|, and found hyphens that are inhibited are changed to |"-"|. The routine also collects statistics about the number of good, bad, and missed hyphens. @^character set dependencies@> @d incr_wt(#)==#:=#+dotw[dpos] @p procedure change_dots; var dpos: word_index; begin for dpos:=wlen-3 downto 3 do begin if hval[dpos]>0 then if odd(hval[dpos]) then begin if dots[dpos]="-" then dots[dpos]:="*" else if dots[dpos]=0 then dots[dpos]:="."; end else begin if dots[dpos]="." then dots[dpos]:= 0 else if dots[dpos]="*" then dots[dpos]:="-"; end; if dots[dpos]="*" then incr_wt(good_count) else if dots[dpos]="." then incr_wt(bad_count) else if dots[dpos]="-" then incr_wt(miss_count); end; end; @ The following procedure outputs the word as hyphenated by the current patterns, including any word weights. @p procedure output_hyphenated_word; var dpos: word_index; begin if wt_chg then begin {output global word weight} write(pattmp,word_wt:1); wt_chg:=false end; if wlen<6 then @ else begin write(pattmp,xchr[word[2]]); for dpos:=3 to wlen-3 do begin write(pattmp,xchr[word[dpos]]); if dots[dpos]>0 then write(pattmp,xchr[dots[dpos]]); if (dotw[dpos]<>word_wt) and (dpos= begin for dpos:=2 to wlen-1 do write(pattmp,xchr[word[dpos]]); writeln(pattmp); end @ For each dot position in the current word, the |do_word| routine first checks to see if we need to consider it. It might be knocked out or a dot we don't care about. That is, when considering hyphenating patterns, for example, we don't need to count hyphens already found. If a relevant dot is found, we increment the count in the count trie for the corresponding pattern, inserting it first if necessary. @p procedure do_word; label continue, done; var spos, @!dpos, @!fpos: word_index; @!a: triec_pointer; @!goodp: boolean; begin for dpos:=wlen-3 downto 3 do begin spos:=dpos-pat_dot; fpos:=spos+pat_len; @; incr(spos); a:=triec_root+word[spos]; while sposword[spos] then begin {insert new count pattern} a:=insertc_pat(fpos); goto done; end; end; done: if goodp then incr_wt(triec_good(a)) @+else incr_wt(triec_bad(a)); continue: end; end; @ The globals |good_dot| and |bad_dot| will be set to |"-"| and |0|, or |"."| and |"*"|, depending on whether the current level is odd or even, respectively. @= good_dot, bad_dot: ascii_code; @ If the dot position |dpos| is out of bounds, knocked out, or a ``don't care'', we skip this position. Otherwise we set the flag |goodp| indicating whether this is a good or bad dot. @= if (spos<0) or (fpos>wlen) or no_more[dpos] then goto continue; if dots[dpos]=good_dot then goodp:=true else if dots[dpos]=bad_dot then goodp:=false else goto continue; @ If |hyphp| is set to |true|, |do_dictionary| will write out a copy of the dictionary as hyphenated by the current set of patterns. If |procesp| is set to |true|, |do_dictionary| will collect pattern statistics for patterns with length |pat_len| and hyphen position |pat_dot|, at level |hyph_level|. @= @!procesp, @!hyphp: boolean; @!pat_dot: dot_type; {hyphen position, measured from beginning of pattern} @!hyph_level: val_type; {hyphenation level} @!filnam: packed array[1..8] of char; {for |pattmp|} @ The following procedure makes a pass through the word list, and also prints out statistics about number of hyphens found and storage used by the count trie. @p procedure do_dictionary; begin good_count:=0; bad_count:=0; miss_count:=0; word_wt:=1; wt_chg:=false; reset(dictionary);@/ @; if procesp then begin init_count_trie; print_ln('processing dictionary with pat_len = ',pat_len:1, ', pat_dot = ',pat_dot:1); end; if hyphp then begin @; filnam[8]:=chr(hyph_level+ord('0')); rewrite(pattmp,filnam); print_ln('writing ',filnam); end; @;@/ print_ln(' '); print_ln(good_count:1,' good, ',bad_count:1,' bad, ', miss_count:1,' missed'); print_ln((100*good_count/(good_count+miss_count)):1:2,' %, ', (100*bad_count/(good_count+miss_count)):1:2,' %, ', (100*miss_count/(good_count+miss_count)):1:2,' %'); if procesp then print_ln(pat_count:1,' patterns, ',triec_count:1, ' nodes in count trie, ','triec_max = ',triec_max:1); if hyphp then close(pattmp); end; @ @= if odd(hyph_level) then begin good_dot:="-"; bad_dot:=0 ; end else begin good_dot:="."; bad_dot :="*"; end @^character set dependencies@> @ @= begin filnam[1]:='p'; filnam[2]:='a'; filnam[3]:='t'; filnam[4]:='t'; filnam[5]:='m'; filnam[6]:='p'; filnam[7]:='.'; end @ @= while not eof(dictionary) do begin read_word; hyphenate; change_dots; if hyphp then output_hyphenated_word; if procesp then do_word; end @* Reading patterns. Before beginning a run, we can read in a file of existing patterns. This is useful for extending a previous pattern selection run to get some more levels. (Since these runs are quite time-consuming, it is convenient to choose patterns one level at a time, pausing to look at the results of the previous level, and possibly amending the dictionary.) @p procedure read_patterns; var c: ascii_code; begin level_pattern_count:=0; reset(patterns); while not eof(patterns) do begin readln(patterns,buf); incr(level_pattern_count);@/ @; @; end; print_ln(level_pattern_count:1,' patterns read in');@/ print_ln('pattern trie has ',trie_count:1,' nodes, ',@| 'trie_max = ',trie_max:1,', ',op_count:1,' outputs'); end; @ When a new pattern has been input into |buf|, we extract the letters of the pattern, and also convert |"."| to the |edge_of_word| symbol. @^character set dependencies@> @= pat_len:=0; buf_ptr:=1; while buf[buf_ptr]<>' ' do begin c:=xord[buf[buf_ptr]]; if c>="A" then begin incr(pat_len); pat[pat_len]:=c; end else if c="." then begin incr(pat_len); pat[pat_len]:=edge_of_word; end; incr(buf_ptr); end @ Then we scan |buf| again looking for hyphenation values (digits), and insert the pattern each time a value is found. @^character set dependencies@> @= pat_dot:=0; buf_ptr:=1; while buf[buf_ptr]<>' ' do begin c:=xord[buf[buf_ptr]]; if (c<="9") and (c>="0") then insert_pattern(c-"0",pat_dot) else incr(pat_dot); incr(buf_ptr); end @* The main program. This is where \.{PATGEN} actually starts. We initialize the pattern trie, get |hyph_level| and |pat_len| limits from the terminal, and generate patterns. @p begin initialize; {set up input/output translation tables} init_pattern_trie; read_patterns; procesp:=true; hyphp:=false;@/ print('hyph_start = '); input(hyph_start);@/ print('hyph_finish = '); input(hyph_finish); for hyph_level:=hyph_start to hyph_finish do begin level_pattern_count:=0; for pat_dot:=0 to max_dot do level_no_more[pat_dot]:=false; if hyph_level>hyph_start then print_ln(' '); print('pat_start = '); input(pat_start);@/ print('pat_finish = '); input(pat_finish);@/ print('good weight, bad weight, threshold: '); input(good_wt,bad_wt,thresh); @; delete_bad_patterns; print_ln('total of ',level_pattern_count:1, ' patterns at hyph_level ',hyph_level:1); end; output_patterns(trie_root,0);@/ @; end_of_PATGEN: end. @ The patterns of a given length (at a given level) are chosen with dot positions ordered in an ``organ-pipe'' fashion. For example, for |pat_len=4| we choose patterns for different dot positions in the order 2, 1, 3, 0, 4. The variables |dot1| and |dot2| control this iteration in a clever manner. @= @!dot1, @!dot2: dot_type; @!level_no_more: array[dot_type] of boolean; @ The array |level_no_more| remembers which positions are permanently ``knocked out''. That is, if there aren't any possible good patterns remaining at a certain dot position, we don't need to consider longer patterns at this level containing that position. @= for pat_len:=pat_start to pat_finish do begin pat_dot:=pat_len div 2; dot1:=pat_dot*2; dot2:=pat_len*2-1; repeat pat_dot:=dot1-pat_dot; dot1:=dot2-dot1; if level_no_more[pat_dot] then goto continue; do_dictionary; collect_count_trie; if not more_to_come then level_no_more[pat_dot]:=true; continue: until pat_dot=pat_len; for pat_dot:=pat_len downto 0 do if level_no_more[pat_dot] then level_no_more[pat_dot+1]:=true; end; @ When all patterns have been found, the user has a chance to see what they do. The resulting \.{pattmp} file can be used as the new `dictionary' if we want to continue pattern generation from this point. @= procesp:=false; hyphp:=true; hyph_level:=hyph_finish;@/ print('hyphenate word list? '); input_ln(buf[1]); if (buf[1]='Y') or (buf[1]='y') then do_dictionary @* Index.