











                     OCCASIONAL PUBLICATIONS IN ACADEMIC COMPUTING

                                        Number 7














                   DOCUMENT PREPARATION AIDS FOR NON-MAJOR LANGUAGES


                                           by


                                Andy Black, David Weber,
                               Fred Kuhl, and Kathy Kuhl















                         Summer Institute of Linguistics, Inc.
                                       Dallas, TX

                                          1987










          Occasional  Publications  in  Academic  Computing  is   devoted   to
          publishing  computer  software  and  documentation  deemed  to be of
          potential  usefulness  to  members  of  the  Summer   Institute   of
          Linguistics  for  carrying  out their field projects in linguistics,
          literacy, anthropology, and translation.  The software published  in
          the  series  may  represent  work  in  progress.  In publishing this
          software, the Summer Institute of Linguistics,  Inc.  is  making  no
          commitment   to   maintenance,  but  is  committed  to  making  full
          disclosure of source code in cases where maintenance requests cannot
          be serviced.

          EDITOR: Gary F. Simons
          ASSISTANT EDITOR: Linda L. Simons

          This  manual  documents  the WRDCHG, SYLCHK, SYLCOR, SPLCOR, HYPHEN,
          and DELIM programs.  These programs are written in the C programming
          language  for  on-the-field  application using personal computers or
          small time-sharing  systems.   They  run  under  the  RT-11,  MS-DOS
          (including Sharp PC5000), TSX, and UNIX operating systems.







              Copyright (c) 1987, by Summer Institute of Linguistics, Inc.







          Editorial correspondence or reports of program bugs should be
          addressed to:
          
               Academic Computing
               Summer Institute of Linguistics
               7500 West Camp Wisdom Road
               Dallas, TX  75236
          
          
          Requests for further copies, standing orders, or accompanying
               software diskettes should be addressed to:
          
               Bookstore
               Summer Institute of Linguistics
               7500 West Camp Wisdom Road
               Dallas, TX  75236





                                        CONTENTS


          1. INTRODUCTION                                                4
          1.1 Overview of program functions                              4
          1.2 Overview of program structure                              5
          1.3 Some lessons from history                                  6
             
          2. WORD CHANGE (WRDCHG)                                        8
          2.1 Introduction                                               8
          2.2 Making a change table                                      8
          2.3 The default mode                                          11
          2.4 Making a standard format marker field file                11
          2.5 Running the program                                       12
          
          
          3. SYLLABLE-BASED SPELLING CHECKING (SYLCHK)                  14
          3.1 Introduction                                              14
          3.2 Running the program                                       14
          3.3 The form of the output                                    16
          3.4 How to write the ONC file                                 16
          3.5 How to write an orthography change table                  17
          
          4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)                18
          4.1 Introduction                                              18
          4.2 Initiating a session with SYLCOR                          19
          4.3 Screen layout                                             23
          4.4 Handling possible errors: word edit mode                  24
          4.5 Making the auto-correction and exceptions files           25
          4.6 Ending a session with SYLCOR                              26
          4.7 Writing your own auto-correction and exception files      26
                                                                  
          5. SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)             27
          
          6. HYPHENATION (HYPHEN)                                       27
          6.1 Introduction                                              27
          6.2 Data files                                                28
          6.3 Running the program                                       32
          6.4 Examples                                                  34
          6.5 Miscellaneous                                             40
          
          7. DELIMITER CHECKING AND NESTING CHECK (DELIM)               42
          7.1 Introduction                                              42
          7.2 Running the program                                       42
          7.3 The form of the output                                    43
          7.4 How to write a delimiter file                             44
          7.5 Program limitations                                       44



          DOCUMENT PREPARATION AIDS                                          4

                                    1. INTRODUCTION

               The programs described in this booklet are  aids  to  producing
          documents.   They  are  useful  for a wide range of languages.  Each
          arose in response to a need felt  by  field  linguists  involved  in
          producing documents in non-English languages.

          1.1 Overview of program functions

               WRDCHG  makes  changes to the words of a text, while preserving
          capitalization,  punctuation  and  formatting.   It  is  useful  for
          correcting  spelling  and  typographic  errors, or even for adapting
          text between closely related dialects.  It is simple to use, it
          allows changes to be conditioned on word boundaries, and it stays
          fast when hundreds of changes are involved because it stores the
          changes in a dense form.

               SYLCHK  identifies  potential  spelling  errors  in text, using
          decomposition into syllables as the method for identifying  possible
          errors,  and returns these as a list.  The user supplies information
          about the syllable structure of the language.

               SYLCHK and  WRDCHG  work  together  to  correct  many  spelling
          errors.   SYLCHK  is  first  run  on  the  text to collect potential
          errors.  This list is then (optionally) sorted  and  duplicates  are
          eliminated,  and then it is edited to make a list of changes.  These
          changes are then made to the text with WRDCHG.

               However, this method has a weakness: without context, the  user
          may  not know how to correct some errors.  For example, if the error
          were ther, one would not know whether it should be corrected to the,
          their, there, other, or something else.  This sort of case motivated
          the next program.

               SYLCOR  is  an  interactive  editor  for  correcting  potential
          errors.    SYLCOR   identifies  potential  errors  by  the  syllable
          decomposition algorithm used in SYLCHK, using the same data files as
          SYLCHK.   When  a  potential  error is found, it is displayed in the
          upper portion of the screen with the surrounding text and in a  work
          area  in the lower portion of the screen, where it can be corrected.
          If the word is modified, the user may make the change  an  automatic
          correction.   If  it  is not modified, the user may add it to one of
          various  lists  of  exceptions  (for  example,  names,  loan  words,
          acronyms, and so on).

               SPLCOR  is  like SYLCOR except that, rather than using syllable
          decomposition for detecting errors, it assumes that  a  word  is  an
          error  unless  it is found on one of the exceptions lists.  This may
          be a useful approach for languages where the writing system  or  the
          phonology  (or  both!) make syllable decomposition ineffectual as an
          error detection algorithm.  The user simply accumulates  a  list  of
          all words which are to be passed without further attention.
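
               The table lookup described above can be sketched as follows.
          This is only an illustration, not the SPLCOR source: the function
          names are hypothetical, and it assumes that the exceptions list is
          held in memory as an array of strings sorted in strcmp() order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for bsearch(): key is a word, elem is a list entry. */
static int cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if word is on the exceptions list (pass it without further
 * attention), 0 if it should be shown to the user as a potential error.
 * The list must already be sorted in strcmp() order. */
int is_exception(const char *word, const char *const *list, size_t n)
{
    assert(word != NULL && list != NULL);
    return bsearch(word, list, n, sizeof list[0], cmp_word) != NULL;
}
```

          With the list { "dallas", "lima", "peru" }, the word "lima" is
          passed and "limma" is flagged.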

               This  brings  up  an  interesting question: What are some other
          useful error detection methods?  Syllable decomposition  has  proven
          to  be  useful  in  many languages, particularly where syllables are
          fairly restricted and the writing system  represents  the  phonology
          closely.  But it will not yield the same results for every language;
          for example, it is  less  effective  for  Spanish  than  it  is  for
          Quechua.

               Another    possibility    is    morphological   parsing,   i.e.
          decomposition  into  morphemes  rather  than  into  syllables.   For
          Quechua,  morphological  parsing  is  a  more  effective method than
          syllable decomposition, but it is also more costly in terms  of  the
          complexity  of  the  program, the data which must be provided by the
          user, and the data which must be loaded each  time  the  program  is
          run.

               There are other schemes that have been used for other languages.
          One algorithm for English passes a three-character window  over  the
          word,  looking  up  the  probability  for  the  occurrence  of  each
          character triple in a table. (These probabilities are established by
          running  the program in a training mode on large portions of correct
          text.)  The word  is  rejected  or  passed  as  a  function  of  the
          probabilities of its character triples.
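
               A minimal sketch of the character-triple idea follows.  It is
          not the algorithm referred to above: instead of probabilities it
          keeps raw counts, and the pass/reject rule (every triple must have
          been seen at least min_count times) is an assumption made to keep
          the example short.  All names are hypothetical, and the code
          assumes a character set in which 'a' to 'z' are contiguous.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define NSYM 27   /* 'a'..'z' plus one code for any other character */

static unsigned tri_count[NSYM * NSYM * NSYM];

/* Maps a character to a small code: 0..25 for letters, 26 otherwise. */
static int sym(char c)
{
    int l = tolower((unsigned char)c);
    return (l >= 'a' && l <= 'z') ? l - 'a' : 26;
}

/* Returns the counter for the triple starting at p. */
static unsigned *slot(const char *p)
{
    return &tri_count[(sym(p[0]) * NSYM + sym(p[1])) * NSYM + sym(p[2])];
}

/* Training mode: count every character triple of a known-correct word. */
void train_word(const char *w)
{
    size_t i, n;
    assert(w != NULL);
    n = strlen(w);
    for (i = 0; i + 3 <= n; i++)
        ++*slot(w + i);
}

/* Checking mode: pass (1) only if every triple of the word was seen at
 * least min_count times in training; otherwise reject (0). */
int word_passes(const char *w, unsigned min_count)
{
    size_t i, n;
    assert(w != NULL);
    n = strlen(w);
    for (i = 0; i + 3 <= n; i++)
        if (*slot(w + i) < min_count)
            return 0;
    return 1;
}
```

          Words shorter than three characters contain no triple and so always
          pass in this sketch.  After training on "the", "their", "there" and
          "other", the word "thier" is rejected, but "ther" passes because
          both of its triples occur in correct words.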

               I  leave  the  following  question  with  the  reader:  for the
          language to which you wish to apply spelling error  detection,  what
          would  be the best method of detecting possible errors?  If you come
          up with a new idea, perhaps  we  can  prepare  alternative  programs
          which  are  like  SYLCOR  and SPLCOR, but which have different error
          detection algorithms.  SPLCOR provides the skeleton into which other
          algorithms  for  error detection -- ones that you devise -- could be
          inserted; the program source code is available for those who wish to
          give it a try.

               HYPHEN  introduces  a  user-determined  character  at  syllable
          boundaries.  This can  be  used  as  a  "discretionary  hyphen"  for
          formatting with a program like Manuscripter.  The user provides data
          in terms of which the program recognizes syllable  boundaries.   The
          user  can control how close to the word boundaries the discretionary
          hyphen may occur, so as to avoid stranding parts of words which  are
          too small.

               DELIM checks text to see that delimiters (characters like quote
          marks, brackets, braces, parentheses, and  so  on)  are  paired  and
          properly  nested.   This  is  useful  for  technical  papers and for
          computer  programs,  both  of  which  often  contain  a  great  many
          delimiters.   The  user  has  control  over what DELIM regards as an
          opening delimiter character and what is  the  corresponding  closing
          delimiter.   DELIM  reports  errors  by  giving the line number, the
          line, and indicating the offending delimiter.
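
               The kind of check DELIM performs can be sketched with a stack
          of pending closing delimiters.  This is an illustration under
          stated assumptions, not the DELIM source: the function name, the
          nesting limit, and the choice to report a character offset (rather
          than a line number and line) are all hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64   /* hypothetical nesting limit for this sketch */

/* Checks that the delimiters in text are paired and properly nested.
 * opens[i] is closed by closes[i].  Returns -1 if the text is balanced;
 * otherwise returns the offset of the offending delimiter: the first
 * closer that does not match, or the innermost opener left unclosed. */
long check_delims(const char *text, const char *opens, const char *closes)
{
    long open_pos[MAX_DEPTH];   /* where each pending opener was seen */
    char want[MAX_DEPTH];       /* the closer each pending opener needs */
    int depth = 0;
    long i;

    assert(text != NULL && opens != NULL && closes != NULL);
    assert(strlen(opens) == strlen(closes));
    for (i = 0; text[i] != '\0'; i++) {
        char c = text[i];
        const char *p;
        if (depth > 0 && c == want[depth - 1]) {
            depth--;                        /* expected closer: pop */
        } else if ((p = strchr(opens, c)) != NULL) {
            if (depth == MAX_DEPTH)
                return i;                   /* nested too deeply */
            open_pos[depth] = i;
            want[depth] = closes[p - opens];
            depth++;
        } else if (strchr(closes, c) != NULL) {
            return i;                       /* closer with no matching opener */
        }
    }
    return depth > 0 ? open_pos[depth - 1] : -1;
}
```

          Testing for the expected closer before testing for an opener lets
          the same character (such as a double quote) serve as both the
          opening and the closing delimiter of a pair.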

          1.2 Overview of program structure

               WRDCHG, SYLCOR, SPLCOR, and HYPHEN share the same basic
          program structure, as proposed in Weber and Kasper, "Getting at the
          Words in Text," Notes on Linguistics 2:17-22 (1983).  The module
          which performs the particular action on a word is lodged between a
          module TXTIN, which separates the word from other characteristics of
          the text (capitalization, punctuation, formatting), and a module
          TXTOUT, which recomposes the text with the possibly-modified word in
          place of the original word.  See the following diagram:




          
                                   +--------+
                     words ------- | ACTION | ---- (modified) words
                       |           +--------+             |
                       |                                  |
                  +---------+   punct,capit,format   +---------+
                  |  TXTIN  | ---------------------- | TXTOUT  |
                  +---------+                        +---------+
                       |                                  |
                   input text                        output text
          

          (SYLCHK uses the TXTIN module, but since  it  does  not  produce  an
          output  text, it does not use TXTOUT.)  Because these programs share
          this  structure,  they  share  a  lot  of  code,  facilitating  both
          development and maintenance.  I suspect that other, future programs
          could benefit from this architecture, and perhaps even from the
          TXTIN and TXTOUT modules themselves.
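
               The structure in the diagram can be sketched in a few lines of
          C.  This is not the TXTIN or TXTOUT code: the names, the fixed
          word-length limit, and the use of isalpha() to decide what belongs
          to a word are assumptions, and the handling of capitalization is
          left out.  The sketch only shows an ACTION being called on each
          word while punctuation and spacing pass around it unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORD 64   /* hypothetical word-length limit for this sketch */

/* ACTION: rewrites the word in buf (capacity MAX_WORD) in place. */
typedef void (*word_action)(char *buf);

/* Copies in to out (capacity outsz), passing every alphabetic run through
 * act and everything else through unchanged.  Returns 0 on success, or
 * -1 if a word or the output text does not fit. */
int process_text(const char *in, char *out, size_t outsz, word_action act)
{
    size_t o = 0;

    assert(in != NULL && out != NULL && act != NULL && outsz > 0);
    while (*in != '\0') {
        if (isalpha((unsigned char)*in)) {
            char word[MAX_WORD];
            size_t n = 0, len;
            while (isalpha((unsigned char)*in)) {   /* input side: isolate word */
                if (n + 1 >= MAX_WORD)
                    return -1;
                word[n++] = *in++;
            }
            word[n] = '\0';
            act(word);                              /* ACTION */
            len = strlen(word);                     /* output side: put it back */
            if (o + len >= outsz)
                return -1;
            memcpy(out + o, word, len);
            o += len;
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *in++;                       /* punctuation, spacing */
        }
    }
    out[o] = '\0';
    return 0;
}

/* Example ACTION (hypothetical): replace the word "teh" with "the". */
void fix_teh(char *buf)
{
    if (strcmp(buf, "teh") == 0)
        strcpy(buf, "the");
}
```

          Calling process_text() with fix_teh on "Say teh word, teh end."
          yields "Say the word, the end."  A different program is obtained by
          supplying a different ACTION; the code around it does not change.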

          1.3 Some lessons from history

               A  bit  of  history  is  in  order,  particularly  since  it is
          instructive as to how programs such as these can arise  in  response
          to needs felt by field linguists.

               My  involvement in the development of these programs (exclusive
          of HYPHEN) has been to see  the  need  for  a  program,  to  get  an
          approximate  conceptualization  of  the  program,  to write out some
          elementary design, to  interact  with  the  implementors  (answering
          questions about how I think it should work, providing test data, and
          so on), and to help write documentation.

               The programming expertise was virtually all contributed by
          volunteers.   The  first volunteer was Bob Kasper.  Bob came to Peru
          upon finishing his B.S.  at  Cornell  University  to  implement  the
          Computer  Assisted  Dialect  Adaptation program.  As part of this he
          wrote the TXTIN and TXTOUT functions.  The CADA program  required  a
          change  module,  so  after  that was developed, I suggested that Bob
          make the WRDCHG program by putting that  module  between  TXTIN  and
          TXTOUT.   Since  all  the pieces were there, it was not a major job,
          and the first version of WRDCHG was born.  About the  same  time,  I
          began  learning  the  C  programming  language,  and wrote the first
          version of SYLCHK and DELIM with Bob's help.

               During Bob's stay in Peru, Alex Waibel (who  worked  in  speech
          research at Carnegie-Mellon University) came to Peru for a two-week
          "working" vacation.  Bob and I had a design document ready for Alex,
          and  about  a  week  and  a  half after arriving, Alex had a working
          editor, called CADAED, for application to CADA output text.

               About two years later, Fred and Kathy Kuhl came to Peru for a
          six-week period.  Fred had just finished his doctorate in Computer
          Science and Kathy had taken several courses in programming.   I  had
          written  a  design  of SYLCOR based on my experience with a spelling
          corrector on another system, and  on  Bob's  TXTIN  and  TXTOUT,  my
          SYLCHK,  and  Alex's  CADAED.  I also had some ideas for how WRDCHG,
          SYLCHK and DELIM could be improved.  Fred and Kathy  went  right  to
          work,  Fred  on  WRDCHG  and  SYLCOR, and Kathy on SYLCHK and DELIM.
          When Fred and Kathy left six weeks later, the programs were as  they
          now are.

               SYLCOR  incorporates  work  which  Bob,  Alex, Kathy and I did,
          combined masterfully by Fred.  Thus, for me, SYLCOR is a monument to
          cooperation,  volunteerism,  and professionalism.  Bob, Alex, Kathy,
          and Fred contributed their skills, writing code which  others  could
          build  upon  or  building  on  the  work of the former.  My role was
          simply to orchestrate this development.

               My experience with these programs  has  confirmed  something  I
          first  learned by working with Bill Mann: that "linguistic" software
          is probably best developed as a collaboration between  the  linguist
          and  the  computer  professionals.   The  linguist must identify the
          problem(s) for which software is  needed,  conceptualize  a  program
          (which must be computationally tractable), and then communicate this
          to the computer professional, whose responsibility is to refine  the
          linguist's  conceptualization  and  produce the code.  And, computer
          professionals who are willing to go  to  the  field  (to  where  the
          linguist  confronts  the situation for which he feels the need for a
          program) can make a large contribution, even if  they  only  stay  a
          short while.

               The  development of the HYPHEN program suggests another lesson.
          HYPHEN was written by Andy Black in response to an obvious  need  to
          introduce  discretionary  hyphens for the text formatting demands in
          the SIL computer center he manages.  Andy could  have  started  from
          scratch  and  written  the  program  entirely  himself.   But, being
          familiar with the architecture and code used  for  WRDCHG,  he  used
          TXTIN and TXTOUT.  This accelerated his development effort, and will
          save program maintenance time in the future.

               Andy's example makes me optimistic  about  the  development  of
          other programs -- as yet unanticipated -- which can be built without
          exorbitant effort from program parts which are already in hand.   If
          we  can  make our software development cooperative in this way, each
          building as much as possible on  the  work  of  others  rather  than
          starting  from  scratch  for  every  program,  and  if, as discussed
          earlier, we  can  bring  together  the  linguist  and  the  computer
          professional, then perhaps we might be able to fulfill -- to a large
          measure -- our need for linguistic software.

               There are other people whose names do not appear as authors but
          who   have   contributed   considerable   effort  in  bringing  this
          publication to reality.  Steve McConnel ported the programs  to  the
          other   operating  systems  and  in  doing  so  cleaned  up  several
          inconsistencies  within  and  between  the  programs.   Gary  Simons
          provided  general  editorial  advice and offered suggestions to make
          the programs more general so they could be used in language families
          quite  different  from the one they were originally designed to work
          for.  Linda Simons tested the ported versions along with Steve and
          took  the  documentation  through several updates to keep it in line
          with the program improvements.




                                2. WORD CHANGE (WRDCHG)

          2.1 Introduction

               Word Change (WRDCHG) passes over  a  text,  changing  words  as
          specified  by  the  user  in a change table.  WRDCHG can only change
          words;   it   cannot   change   punctuation,   format   marking   or
          capitalization.   (Each  output word will have the capitalization of
          the corresponding input word.)  It is possible to condition  changes
          as  applying  only  at word boundaries.  The speed of application is
          not substantially affected by the number of changes  in  the  change
          table; a large number (perhaps as many as 1500) can be made quickly.
          It also can apply the changes  only  to  specified  standard  format
          fields.   This  gives  the  ability  to  make  changes  to  only the
          vernacular entries of a dictionary, for example.

          2.2 Making a change table

               A change table is a list of paired strings, each string bounded
          by  double  quotes  (").   The  first string of a pair is called the
          "match string"; it specifies some pattern to be matched in  a  text.
          The  second string, called the "substitution string," specifies what
          is to be substituted for each  occurrence  of  the  matched  string.
          Observe the following in writing a change table:

               1. The  changes  in  a  table may occur in any order (i.e., the
                  order in which changes occur in a table makes no  difference
                  in  the  effect upon any text).  Therefore changes cannot be
                  "ordered."  That  is,  a  second  change  dependent  upon  a
                  condition  created  by  a  first  change will not work.  For
                  example, if the following two changes are in a  table,  only
                  the  first  will  occur  since the program will not scan the
                  input text a second time to find "bi?u".

                         "'"      "?"
                         "bi?u"   "bi?o"

               2. All changes should be  given  in  lower  case.   It  is  not
                  necessary  to give a change with various capitalizations, as
                  the result of any change will be  capitalized  just  as  the
                  original word.  For example, the change

                          "yeild" "yield"

                  will change "yeild" to "yield", "Yeild" to "Yield" and
                  "YEILD" to "YIELD".  (WRDCHG recognizes only three
                  possibilities: all lower case, all upper case, and first
                  character capitalized.)

               3. If a character (other than space or tab) appears on  a  line
                  before  the  first  double  quote  mark,  then  that line is
                  regarded as a comment, and any change on that  line  is  not
                  applied.   This  provides a simple mechanism for disabling a
                  change: simply put some character ahead of the first string.
                  For example, the following line would not make any change:

                          off "this" "that"





               4. Any character(s) may be placed between the left and right
                  strings.  This allows whatever notation you like to
                  symbolize the change.  The following lines have the same
                  effect:

                          "mispelled" becomes "misspelled"
                          "mispelled" --> "misspelled"
                          "mispelled" > "misspelled"
                          "mispelled" "misspelled"

               5. Anything  following the right string is ignored, so comments
                  may follow the pair of strings; for example,  the  following
                  three changes are effective:

                          "kachaka"       "alliya"        `get well'
                          "qo"            "qara"          `give'
                          "fiyupa"        "aliska"        `very much'

               6. Changes  may  be  specified  as  applying  (a) only  at  the
                  beginning of a word, (b) only at  the  end  of  a  word,  or
                  (c) only if the complete word is matched.  To specify that a
                  change applies only at the beginning of a  word,  include  a
                  space  between  the  leading  double  quote  and  the  first
                  character of the match string; for  example,  the  following
                  change affects only the first "ka" in "kaykan":

                          " ka" "ke"

                  To  specify that a change applies only at the end of a word,
                  include a space between the final  character  of  the  match
                  string  and  the  following  double  quote; for example, the
                  following   change   affects   only   the   last   "na"   of
                  "nakananpaqna":

                          "na " "nya"

                  To specify that a change applies only when the complete word
                  is matched, include spaces both at the beginning and end  of
                  the  match  string;  for  example, the following changes the
                  word "na" when it stands  alone,  but  would  not  make  any
                  change to "nakananpaqna":

                          " na " "nya"

               7. A  change table may have multiple changes whose match string
                  has the same character string but which differ in  terms  of
                  boundary  conditions.  The order of priority for application
                  of changes whose match  strings  are  the  same  except  for
                  boundary conditions is 3 > 2 > 1 > 0 where

                     (0) anywhere within a word
                     (1) only at the end of a word
                     (2) only at the beginning of a word
                     (3) only when the entire word is matched

                  That  is,  3  applies  in  preference  to  0-2, 2 applies in
                  preference to 1 and 0, and 1 applies in preference to 0.  (A
                  way  to  think  of  this  is  that  the change with the most
                  restricted conditions is applied in preference to  a  change
                  with  a  less  restricted condition.)  For example, consider
                  Change Table I:

                          TABLE I
                  
                          "na"    "naa"   (0) anywhere
                          "na "   "nac"   (1) only at the end
                          " na"   "nab"   (2) only at the beginning
                          " na "  "nad"   (3) if complete word

                  Change Table I changes "Nakamaananpaqna" to
                  "Nabkamaanaanpaqnac".  The first instance of "na" is changed
                  to "nab" because the change with the "word-initial"
                  condition (2) applies in preference to the change with the
                  "anywhere" condition (0).  Likewise, the last instance of
                  "na" becomes "nac" by the change with the "word-final"
                  condition (1) because it applies in preference to the change
                  with the "anywhere" condition (0).  The second instance
                  of "na" is changed by the "anywhere" change because that is
                  the only change whose conditions are met.

                  Change Table I changes the isolated word "na" to "nad".  In
                  this case, all of the changes are, in principle, applicable,
                  but the one with the "complete word" condition (3) applies
                  in preference to the others (0-2).

                  Further, consider Change Table II:

                          TABLE II
                          "na "   "nac"   (1) only at the end
                          " na"   "nab"   (2) only at the beginning

                  In  this case the change which applies only at the beginning
                  of a word (2) applies in  preference  to  the  change  which
                  applies at the end of the word (1).

                  If  the  same  match  string (including boundary conditions)
                  occurs in more than one change in a table,  the  last  given
                  will  prevail.   Thus,  if  a  table contained the following
                  lines, "number" would be changed to "last".

                          "number" "first"
                          "number" "last"

               8. When a change table contains a change for "a" and also one
                  for "ab", the "ab" change will be made wherever "ab" occurs,
                  but the "a" change will not also be made there.  For
                  instance, given the table

                          "'"     "?"
                          "'u"    "'o"

                  all occurrences of "'u" will be changed to "'o"; they will
                  not be changed to "?u".  All other occurrences of "'" will
                  go to "?".  If "?o" is what is wanted, the second line of
                  the change table should read: "'u" "?o".
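
               The rules above for reading one line of a change table
          (comment lines, free notation between the strings, trailing
          comments, and boundary spaces) can be sketched as follows.  This is
          an illustration only, not the WRDCHG parser: the structure, the
          names and the length limit are hypothetical, and no provision is
          made for a double quote inside a string.  The boundary condition is
          stored using the same numbers as the priority list in rule 7.

```c
#include <assert.h>
#include <string.h>

#define MAX_STR 64   /* hypothetical limit on string length */

struct change {
    char match[MAX_STR];   /* match string with boundary spaces removed */
    char subst[MAX_STR];   /* substitution string */
    int  cond;             /* 0 anywhere, 1 end, 2 beginning, 3 whole word */
};

/* Copies the quoted string that starts at the '"' pointed to by p into
 * dst.  Returns a pointer just past the closing quote, or NULL on error. */
static const char *read_quoted(const char *p, char *dst)
{
    const char *end = strchr(p + 1, '"');
    size_t n;

    if (end == NULL)
        return NULL;
    n = (size_t)(end - (p + 1));
    if (n >= MAX_STR)
        return NULL;
    memcpy(dst, p + 1, n);
    dst[n] = '\0';
    return end + 1;
}

/* Parses one line of a change table.  Returns 1 and fills *c if the line
 * holds a change; returns 0 for a comment, blank or malformed line. */
int parse_change(const char *line, struct change *c)
{
    const char *p = line;
    size_t n;

    assert(line != NULL && c != NULL);
    while (*p == ' ' || *p == '\t')
        p++;
    if (*p != '"')
        return 0;                   /* rule 3: something precedes the quote */
    p = read_quoted(p, c->match);
    if (p == NULL)
        return 0;
    p = strchr(p, '"');             /* rule 4: skip any notation in between */
    if (p == NULL || read_quoted(p, c->subst) == NULL)
        return 0;                   /* rule 5: the rest of the line is ignored */

    c->cond = 0;                    /* rule 6: spaces mark word boundaries */
    n = strlen(c->match);
    if (n > 0 && c->match[n - 1] == ' ') {
        c->cond |= 1;
        c->match[--n] = '\0';
    }
    if (n > 0 && c->match[0] == ' ') {
        c->cond |= 2;
        memmove(c->match, c->match + 1, n);
    }
    return 1;
}
```

          With this numbering, when several changes have the same match
          string and all are applicable, the one with the largest cond value
          is the one that rule 7 prefers.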




          2.3 The default mode

               In many cases, virtually all the changes in a table  will  have
          the  same condition.  For example, suppose that you are working in a
          language which does not have prefixes, and you wish to make a number
          of changes to roots.  It would be possible to ensure that the
          changes apply only to roots by including a space at the beginning of
          each match string.  However, this has been made unnecessary by
          providing the appropriate "default mode" at the time of running the
          program.  WRDCHG gives the following prompt:

                  Should changes be made
                     (0) anywhere within a word
                     (1) only at the end of a word
                     (2) only at the beginning of a word
                     (3) only when the entire word is matched
                  Type 0, 1, 2, or 3 :

          The effect of answering 0 (or RETURN) is that all changes will occur
          exactly as you have specified them in the change table, including
          the leading and/or following spaces you have included.  The effect
          of answering "1" is as though a space were included at the end of
          each match string; the effect of answering "2" is as though a
          space were included at the beginning of each match string; and the
          effect of answering "3" is as though spaces were added at both the
          beginning and the end of each match string.  Note that the
          appropriate response can make it unnecessary (though not incorrect)
          to include a space in the actual change table.  For example, Change
          Table III applied with default mode "3" is equivalent to applying
          Change Table IV:

                   TABLE III        TABLE IV

                  "yee" "yey"     " yee " "yey"
                  "kyo" "kiw"     " kyo " "kiw"
                  "pok" "puk"     " pok " "puk"

          2.4  Making a standard format marker field file

               This  file gives you the ability to pick and choose which parts
          of a standard format file the changes are to apply to.  To do  this,
          merely  create  a  file  listing  the markers indicating the desired
          fields.  If you want all fields or if  the  file  does  not  contain
          standard format data, then this file should be empty.  The layout of
          this file is very free.  Thus the following are all equivalent:

               (1)   The following markers indicate which fields
                        are to be changed:
                     \w
                     \i
          
               (2)   \w
                     \i
          
               (3)   \w\i




          2.5 Running the program

               When WRDCHG starts it prints the following:

                  WORD CHANGE  Version 2.3 (12-Dec-86)

          You are first informed of how much memory is available for a  change
          table by a message like the following:

                  SETUP-ALLOC 22832 bytes for records

          You  are  then  asked  to indicate characters which you wish to have
          treated as alphabetic characters along with the standard ones.  Note
          that  all  other characters will be regarded as occurring outside of
          words.  For example, if one wished to change "didn't" to "did  not",
          the  apostrophe  (')  would  have  to  be  treated  as an alphabetic
          character; otherwise WRDCHG will treat "didn't" as two words, "didn"
          and "t".

                  Type RETURN to include these as alphabetic characters: ~'
                  Otherwise type the characters desired:

          After  you respond, WRDCHG will inform you of the characters it will
          treat as alphabetic.  For example, if  you  responded  by  typing  a
          tilde (~), you will then see the following:

                  Using the following as alphabetics:
                  ~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
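
               The effect of the alphabetic set on word boundaries can be
          illustrated with a short Python sketch.  It is not taken from
          WRDCHG; the function name split_words and its extra parameter are
          hypothetical.

```python
import string


def split_words(text, extra=""):
    """Split text into words made of letters plus any extra characters."""
    alphabetics = set(string.ascii_letters) | set(extra)
    words, current = [], []
    for ch in text:
        if ch in alphabetics:
            current.append(ch)
        elif current:
            # Any other character ends the word in progress.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

          Without the apostrophe in the set, "didn't" comes out as "didn" and
          "t"; with it, the word stays whole.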

               Before  you  are asked for the name of the change table, WRDCHG
          needs to know two things, the "trie level" and the  "default  mode."
          We  will  now  discuss  each  of these in turn.  The change table is
          stored in the computer's memory as a type of tree structure,  called
          a  "trie."   Tries are more efficient than simple lists in two ways:
          (a) it is possible to find entries much more quickly, and  (b)   for
          large  tables, more changes can be stored.  The degree to which this
          efficiency is attempted is set by the number you give in response to
          the prompt:

                  Maximum number of levels in the trie:  [99]

          If  there  were  nothing to pay for the efficiency, one would simply
          strive for the maximum, responding always with  a  carriage  return.
          But  that is not the case.  If the dictionary is not large enough to
          take advantage of the density you hope to  achieve,  more  space  is
          used than necessary.  (It's something like packaging soap in economy
          sized boxes: if you don't fill them the result takes  up  more  room
          than necessary.)

               As  a  rule  of  thumb,  use  2 or 3 for tables with up to 1000
          entries.  You will probably  develop  a  feel  for  what  number  is
          appropriate;  you might even experiment, loading the same table with
          different numbers and seeing which number leaves the most space  (as
          reported  by  messages  concerning free space given before and after
          the change table is loaded).  By the way, if you set the number  too
          low (say 0 or 1 for over 500 entries) the time it takes to find each
          change will increase considerably.
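
               The trade-off can be pictured with a small Python sketch.  The
          manual does not describe WRDCHG's internal layout, so the following
          rests on an assumption: the trie branches on one character per
          level up to the maximum, and whatever remains of each match string
          is kept in a simple list at the last level.  The class name
          LimitedTrie is hypothetical, and only whole-string lookup is shown.

```python
class LimitedTrie:
    """A trie that branches for at most max_levels characters."""

    def __init__(self, max_levels):
        self.max_levels = max_levels
        self.root = {"children": {}, "entries": []}

    def insert(self, match, replacement):
        node, depth = self.root, 0
        # Branch on one character per level until the limit or the
        # end of the match string is reached.
        while depth < self.max_levels and depth < len(match):
            node = node["children"].setdefault(
                match[depth], {"children": {}, "entries": []})
            depth += 1
        # The rest of the match string goes into a plain list.
        node["entries"].append((match[depth:], replacement))

    def lookup(self, word):
        node, depth = self.root, 0
        while depth < self.max_levels and depth < len(word):
            node = node["children"].get(word[depth])
            if node is None:
                return None
            depth += 1
        rest = word[depth:]
        for tail, replacement in node["entries"]:   # linear scan
            if tail == rest:
                return replacement
        return None
```

          With a level of 0 every entry lands in one list and each lookup
          scans all of it, which matches the warning about low settings.
          With a high level and few entries, many nodes hold a single entry
          each, which is the half-filled box of the soap analogy.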




               Next you  will  be  asked  about  the  "default  mode"  by  the
          following prompt:

                  Changes should be made:
                     0) anywhere within a word
                     1) only at the end of a word
                     2) only at the beginning of a word
                     3) only when the entire word is matched
                  Type 0, 1, 2, or 3 : [0]

          This has been discussed above in section 2.3.

               Now  that  you  have provided the "trie level" and the "default
          mode," WRDCHG is prepared to load a change table.  It  asks  for  it
          with the following prompt: 

                  Change table file:

          When it is finished loading, it informs you of the number of changes
          loaded and the amount of storage left.  For instance,

                  235 changes loaded.
                  24733 bytes left, largest space is 14733 bytes.

          Now you are asked for the name of  the  file  that  indicates  which
          specific  standard format fields the changes apply to.  This is done
          by the following prompt:

               Standard format marker field file: (<RETURN> for all fields)

          See section 2.4 for a discussion of this  file.   If  you  want  all
          fields,  then merely press the <RETURN> key.  You are next asked for
          the name of the file to be changed:

                  Input file:

          You are also asked to give a name for the  output  file  (i.e.,  the
          changed  file).   WRDCHG  makes up a default file name which you can
          use by simply  responding with a carriage return.  For  example,  if
          your  input  file  name is abcdef.sfm, then the prompt for an output
          file will appear as:

                  Output file [abcdef.chg]:

          and by simply typing a carriage return you  can  create  the  output
          file on the default device with the name abcdef.chg.  After the file
          is processed, you will be informed at the terminal of the number  of
          words  which  were  read  and  the  number which were altered with a
          message like the following:

                  INPUT: 234 words
          
                  234 words read, 7 altered.
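
               The default output name described above amounts to replacing
          the extension of the input name.  A Python sketch follows; the
          function name default_output_name is hypothetical, and the handling
          of names without an extension is an assumption.

```python
import os


def default_output_name(input_name, extension=".chg"):
    """Replace the extension of input_name with the given extension."""
    root, _ = os.path.splitext(input_name)
    return root + extension
```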

               WRDCHG allows multiple input files (all to  be  processed  with
          the  same  change  file,  the  same trie level, and the same default
          mode).  You are asked:

                  Next input file: (<RETURN> if no more)

          If you respond with a file name, you will be  asked  for  an  output
          file  name  as  before,  and  that  file  will be processed.  If you
          respond with a carriage return, you terminate WRDCHG and  return  to
          the monitor.



                     3. SYLLABLE-BASED SPELLING CHECKING  (SYLCHK)

          3.1 Introduction

               SYLCHK    identifies    possible   typographical   errors   and
          misspellings in texts by judging the phonological well-formedness of
          each  word:  a  word  is a possible error if it cannot be decomposed
          into one or more  well-formed  syllables.   SYLCHK  assumes  that  a
          syllable  is made up of an optional onset, a vocalic nucleus, and an
          optional coda; the user  must  supply  a  table  of  these  for  the
          language  to  which he is applying SYLCHK.  (Obviously SYLCHK cannot
          be applied in a language whose writing system does  not  approximate
          phonological form.)

               SYLCHK  never alters the text to which it is applied.  However,
          it may be used to correct text files in the following way:

               1. SYLCHK is applied to one or more  text  files,  accumulating
                  the possible errors in a single output file.

               2. This  error  file  is  sorted  and edited to create a change
                  table for correcting the errors.

               3. The change table is applied to the text files with a program
                  like WRDCHG (in this package) or CC (Consistent Changes).

          3.2 Running the program

               When SYLCHK is run the following will appear on the screen:

                  SYLLABLE BASED SPELLING CHECK  Version 3.0 (15-Dec-86)

          You  are  then informed of how much memory is available by a message
          like the following:

                  SETUP-ALLOC-10904 bytes for records

               You are then asked to indicate characters  which  you  wish  to
          have  treated as alphabetic characters along with the standard ones.
          Note that all other characters will be regarded as occurring outside
          of words.  For example, for "didn't" to be checked as a single word,
          the apostrophe (') would have to be treated as  an  alphabetic
          character; otherwise SYLCHK will treat "didn't" as two words, "didn"
          and "t".

               Press <RETURN> to include these as alphabetic characters: ~'
               Otherwise type the characters desired:

          After you respond, SYLCHK will inform you of the characters it  will
          treat as alphabetic.  For example, if  you  responded  by  typing  a
          tilde (~), you will then see the following:

                  Using the following as alphabetics:
                  ~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

          The next thing SYLCHK does is ask for two things relating to the ONC
          or Onset-Nucleus-Coda file.  Details about the form of this file are
          given  below  in  section 3.4.  The ONC file tells the program which
          characters (and character sequences) are  allowed  to  form  correct
          syllables.   First  it  asks  for  the  character  you  have used to
          separate ONC distribution classes in the ONC file, and then asks for
          the name of this file:

                  Character which separates ONC distribution classes: [\]

                  ONC file:

          You  will  then be asked for an orthography change table.  If you do
          not  want  to  use  a  change  table,  simply  press  <RETURN>.   An
          orthography  change  table  allows  you to normalize the spelling of
          words before they are  checked,  which  may  be  very  useful.   For
          example,  in  the practical orthography for Quechua, long vowels are
          represented as two vowels, (e.g.  long /a/ is represented as  "aa").
          However,  in the phonological system, long vowels pattern as a vowel
          followed by a consonant, so long /a/ patterns as an /a/ followed  by
          a  consonant  [length].   (For a justification of this analysis, see
          David Weber and Peter Landerman "The Interpretation of  Long  Vowels
          in Quechua" IJAL, January 1985, pages 94-108.)  In order that SYLCHK
          treat long vowels in this way, the words are normalized by  changing
          "aa"  to  "a:"; "ee" to "e:"; and so on, and ":" is listed as a coda
          in the ONC file.  The format  of  an  orthography  change  table  is
          described  below  in  section 3.5. If an orthography change table is
          specified, the program will respond with a message:

                  Orthography change table file: [None]

                  5 changes loaded.
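
               The long-vowel normalization described above can be sketched
          in Python.  This is not SYLCHK's code: the function name normalize
          is hypothetical, and the choice to scan left to right and to try
          longer match strings first is an assumption.

```python
def normalize(word, changes):
    """Apply orthography changes to a word in one left-to-right pass."""
    # Assumption: longer match strings win, so the outcome does not
    # depend on the order in which the changes are listed.
    ordered = sorted(changes, key=lambda pair: len(pair[0]), reverse=True)
    out, pos = [], 0
    while pos < len(word):
        for match, replacement in ordered:
            if match and word.startswith(match, pos):
                out.append(replacement)
                pos += len(match)
                break
        else:
            # No change applies here; copy the character unchanged.
            out.append(word[pos])
            pos += 1
    return "".join(out)


LONG_VOWELS = [("aa", "a:"), ("ee", "e:"), ("ii", "i:"),
               ("oo", "o:"), ("uu", "u:")]
```

          After normalization a doubled vowel is a vowel followed by ":",
          which the ONC file can then list as a coda.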

          Now you are asked for the name of  the  file  that  indicates  which
          specific  standard  format  fields  the program will check.  This is
          done with the following prompt:

               Standard format marker field file: (<RETURN> for all fields)

          See section 2.4 for a discussion of this  file.   If  you  want  all
          fields,  simply  press the <RETURN> key.  Next you will be asked for
          an output file:

                  Output file: [con]

          If  you  simply  type  a  carriage  return,  the  list  of  possibly
          misspelled words will be displayed on the terminal.  If you type the
          appropriate device name to refer to your printer it will be  printed
          (without creating a file).  If you type a file name, the result will
          be written to that file.  Next, you are asked for  the  file  to  be
          checked with the prompt

                  Input file:

               The program will then begin processing  the  text.   Each  time
          SYLCHK  successfully decomposes a word into well-formed syllables, a
          period will appear on the screen, enabling you to watch its rate  of
          progress.  At the end you will see a summary like:

                  INPUT: 386 words.

                  73 possible errors in abcdef.ghi

          SYLCHK  allows multiple input files to be checked (with the same ONC
          specifications, etc.).  You are asked:

                  Next input file (RETURN if no more):

          If you respond with a <RETURN> the program will  terminate  and  you
          will return to the monitor.

          3.3 The form of the output

               The  output file will contain, for each file being checked, its
          name, the potential errors found in that file  (with  each  possibly
          misspelled word on a separate line), and following the last possible
          error, the number of possible errors found in that file.

                  Possible errors in HGMK01.SFM
                  akrarkran
                  hanunn
                  wais
                  
                  3 possible errors
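
               Because the error file is meant to be sorted and edited into a
          change table (section 3.1), the first part of that step can be
          sketched in Python.  The function name change_table_skeleton is
          hypothetical, and the sketch assumes that the suspect words contain
          no spaces, so that any line with more than one token is a heading
          or a count.

```python
def change_table_skeleton(error_report):
    """Turn SYLCHK-style output into change table lines ready for editing."""
    words = set()
    for line in error_report.splitlines():
        tokens = line.split()
        # Keep only one-word lines; headings, counts and blank
        # lines are skipped.
        if len(tokens) == 1:
            words.add(tokens[0])
    # Each word is paired with itself; the right-hand string is
    # then corrected by hand.
    return ['"%s" "%s"' % (w, w) for w in sorted(words)]
```

          Words that turn out not to be errors are simply deleted from the
          resulting table.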

          3.4 How to write the ONC file

               This file  informs  SYLCHK  of  the  characters  and  character
          strings  that  are  acceptable  syllable  onsets,  nuclei and codas.
          These  appear  in  five  sets,  corresponding   to   the   following
          distribution classes:

                  first  = only in syllable onset (e.g., kw, sy, n~)
                  second = only in syllable coda (e.g., length)
                  third  = in either the onset or coda; if ambiguous,
                           will be interpreted as onset (e.g., k, ch)
                  fourth = in either the coda or onset; if ambiguous,
                           will be interpreted as coda 
                  fifth  = in the vocalic nucleus (e.g., a, e, i, o, u)

          The five sets are mutually exclusive; that is, no phoneme can
          occur in more than one distribution class.  The
          third and fourth classes are listed as they are to solve the problem
          of ambiguity:  how does one divide  words  that  are  of  the  CVCVC
          pattern?   In the third set, onset or coda, phonemes are listed that
          can occur as either onsets or codae.  If a member of this set occurs
          as  the  middle C in a CVCVC pattern, the  program will interpret it
          as an onset, that is, CV.CVC.   Likewise,  phonemes  listed  in  the
          fourth  set,  coda  or  onset, will be interpreted as a coda if they
          occur as the middle C in a CVCVC pattern, that is CVC.VC.



               The beginning and ending of each  class  is  marked  by  a  "\"
          (backslash).    (Thus,   the  file  should  contain  10  \'s.)   Any
          characters outside of these five regions are treated as comment
          (i.e.,  everything  before  the  first  "\",  between the second and
          third, the fourth and fifth, the sixth and seventh, the  eighth  and
          ninth, or following the last "\" is comment).  Within each class,
          characters and character strings should be separated by one or
          more whitespace characters (tab, blank or carriage return).
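
               The layout just described (ten separators, with comment text
          in between) can be read with a simple split.  The Python sketch
          below is not SYLCHK's own reader; the function name
          parse_onc_classes is hypothetical, and it deliberately leaves
          anything after the last separator uninterpreted.

```python
def parse_onc_classes(text, separator="\\"):
    """Return the five distribution classes as lists of strings."""
    pieces = text.split(separator)
    if len(pieces) != 11:
        raise ValueError("expected exactly 10 separator characters")
    # Pieces 0, 2, 4, ... lie outside the classes and are comment;
    # pieces 1, 3, 5, 7 and 9 are the classes themselves.
    return [piece.split() for piece in pieces[1::2]]


SAMPLE = "\n".join([
    "ONC.TOB  by Linda Simons  December 1986",
    "   ONSET ONLY \\ b d f g gw k kw ng ' l m n r s t th w \\",
    "    CODA ONLY \\ \\",
    "ONSET OR CODA \\ \\        considered onset if ambiguous",
    "CODA OR ONSET \\ \\        considered coda if ambiguous",
    "      NUCLEUS \\ a e i o u \\",
])
onset_only, coda_only, onset_or_coda, coda_or_onset, nucleus = \
    parse_onc_classes(SAMPLE)
```

          A class with nothing between its two separators comes back as an
          empty list.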

               The  ONC  file  also  tells SYLCHK what are acceptable syllable
          patterns  within  words.   Three  patterns  are  given.   The  first
          describes only initial syllables, the second describes all medial
          syllables, and the third describes only final syllables.
          Parentheses indicate a syllable; square brackets indicate an
          optional phoneme.  Be certain that no matching parenthesis or
          square bracket is missing.

               Here is a sample ONC file (used for Quechua):

                  NJSYL.ONC modified for SYLCHK v. 3. by Steve McConnel,
                  12-Dec-86
                     ONSET ONLY \ dy br pr by b d dr f fw fy gy h hy kl j
                                  hw ky kw py pw rr sy ty n~ kr bl n~w   \
                      CODA ONLY \ :                                      \
                  ONSET OR CODA \ ch g k l ll m n p q r s sh t tr ts w y \
                  CODA OR ONSET \                                        \
                        NUCLEUS \ a e i o u  a' e' i' o' u'              \
                  
                  SYLLABLE PATTERNS ([O]N[C]) (ON[C]) (ON[C])

               Here is a second example of an ONC file, describing the
          syllable pattern for To'abaita (Solomon Islands) where the only  two
          syllable shapes are V and CV.

                  ONC.TOB  by Linda Simons  December 1986
                     ONSET ONLY \ b d f g gw k kw ng ' l m n r s t th w \
                      CODA ONLY \ \
                  ONSET OR CODA \ \        considered onset if ambiguous
                  CODA OR ONSET \ \        considered coda if ambiguous
                        NUCLEUS \ a e i o u \
                  
                  SYLLABLE PATTERN ([O]N) ([O]N) ([O]N)

          You should not be unduly concerned about making this table complete.
          Create a first approximation with  those  characters  that  come  to
          mind, and try it out on a text.  It will then quickly become obvious
          which characters and character strings you need to add to the table.
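
               The well-formedness test itself can be sketched in Python.
          This is not SYLCHK's algorithm: it is a plain backtracking
          recognizer written under several assumptions, namely that the
          first syllable always follows the initial pattern, that a later
          syllable follows the final pattern when it ends the word and the
          medial pattern otherwise, and that words are already in lower
          case.  The names slot and well_formed are hypothetical.

```python
from functools import lru_cache


def slot(pattern, letter):
    """Say whether O, N or C is optional, required or absent in a pattern."""
    if "[" + letter + "]" in pattern:
        return "optional"
    return "required" if letter in pattern else "absent"


def well_formed(word, onsets, nuclei, codas,
                patterns=("[O]N[C]", "ON[C]", "ON[C]")):
    """Return True if word can be divided into permitted syllables."""
    initial, medial, final = patterns

    def take(units, pos, need):
        # Positions reachable from pos after filling or skipping a slot.
        ends = set()
        if need != "required":
            ends.add(pos)
        if need != "absent":
            ends.update(pos + len(u) for u in units
                        if word.startswith(u, pos))
        return ends

    def syllable_ends(pos, pattern):
        ends = set()
        for after_onset in take(onsets, pos, slot(pattern, "O")):
            for after_nucleus in take(nuclei, after_onset, "required"):
                ends |= take(codas, after_nucleus, slot(pattern, "C"))
        return ends

    @lru_cache(maxsize=None)
    def rest_ok(pos):
        # The remainder is one final syllable, or a medial one plus more.
        if len(word) in syllable_ends(pos, final):
            return True
        return any(end < len(word) and rest_ok(end)
                   for end in syllable_ends(pos, medial))

    if not word:
        return False
    return any(end == len(word) or rest_ok(end)
               for end in syllable_ends(0, initial))


# Classes copied from the Quechua sample file above.
EITHER = "ch g k l ll m n p q r s sh t tr ts w y".split()
ONSETS = ("dy br pr by b d dr f fw fy gy h hy kl j hw ky kw py pw rr "
          "sy ty n~ kr bl n~w").split() + EITHER
CODAS = [":"] + EITHER
NUCLEI = "a e i o u a' e' i' o' u'".split()
```

          Since only acceptance or rejection is needed here, the preference
          between onset and coda for ambiguous consonants plays no part in
          this sketch.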

          3.5 How to write an orthography change table    

               An orthography change table is a list of paired  strings,  each
          string  bounded  by  double  quotes (").  The first string of a pair
          specifies some pattern to be matched  in  a  text,  and  the  second
          string  specifies  what  is to be substituted for each occurrence of
          the matched string.  For example, the following is  the  table  used
          for Quechua mentioned above:

                  LNGVWL.TAB  D. Weber  May-30-82
                  "aa" "a:"
                  "ee" "e:"
                  "ii" "i:"
                  "oo" "o:"
                  "uu" "u:"

          Observe the following in writing a change table:

               1. The changes may occur in any order,  that  is,  their  order
                  makes no difference in the effect on the text.

               2. All  changes  should  be  given  in  lower  case;  it is not
                  necessary to give a change with various capitalizations.

               3. Any line whose first printing  character  is  not  a  double
                  quote is treated as a comment.  (Note: a space or tab may
                  precede an effective change, since these are not printing
                  characters.)

               4. Any  characters  may  be  placed  between the left and right
                  strings.   This  allows  whatever  notation  you   like   to
                  symbolize  the  change;  the  following  lines have the same
                  effect:

                          "mispelled" becomes "misspelled"
                          "mispelled" --> "misspelled"
                          "mispelled" > "misspelled"
                          "mispelled" "misspelled"

               5. Anything following the right string is ignored, so  comments
                  may  follow  the pair of strings; for example, the following
                  three changes are effective:

                          "kachaka"       "alliya"        `get well'
                          "qo"            "qara"          `give'
                          "fiyupa"        "aliska"        `very much'

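               The five rules above can be summarized in a short Python
          sketch of a table reader.  It is not the program's own parser; the
          name parse_change_table is hypothetical, and the sketch assumes
          that the strings themselves contain no double quotes.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')


def parse_change_table(text):
    """Return (match, replacement) pairs from the text of a change table."""
    changes = []
    for line in text.splitlines():
        stripped = line.lstrip(" \t")
        # A line whose first printing character is not a double
        # quote is a comment.
        if not stripped.startswith('"'):
            continue
        strings = QUOTED.findall(stripped)
        # Only the first two quoted strings count; text between
        # them and after them is ignored.
        if len(strings) >= 2:
            changes.append((strings[0], strings[1]))
    return changes


SAMPLE = '''LNGVWL.TAB  D. Weber  May-30-82
"aa" "a:"
"mispelled" --> "misspelled"
    "qo"            "qara"          `give'
'''
```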


                     4. SYLLABLE-BASED SPELLING CORRECTION (SYLCOR)

          4.1 Introduction

               SYLCOR  is  a   program   for   correcting   misspellings   and
          typographical  errors in text.  SYLCOR identifies possible errors by
          judging the  phonological  well-formedness  of  each  word:  a  word
          possibly  has  an  error if it cannot be decomposed into one or more
          well-formed syllables.  SYLCOR assumes that a syllable is made up of
          an optional onset, a vocalic nucleus, and an optional coda; the user
          must supply a table of  these  for  the  language  to  which  he  is
          applying  SYLCOR.   (SYLCOR  cannot  be  applied in a language whose
          writing system does not approximate phonological form, for  example,
          Chinese.)

               Potential  errors  in text may be exceptions to whatever method
          is used to discover them.  For example, if error detection  for  the
          Quechua language is based on phonological well-formedness, then many
          words borrowed from  Spanish  are  exceptions.   SYLCOR  uses  lists
          (which  you  create  as  you correct text) to skip such  exceptional
          words.   You  might  have  a  list  of  loan  words,   a   list   of
          abbreviations, a list of Biblical names, or something else.

               Potential  errors  in  text  may be real errors.  SYLCOR allows
          these to be corrected.  Context is  sometimes  needed  to  determine
          what  the  correct  word  should  be.   For  example, if you were to
          encounter the misspelling "ther" out of context, you would not  know
          whether  it should be corrected to "their", "there", "other", "the",
          etc.  Therefore, each time an error is suspected, SYLCOR displays  a
          region of text surrounding the suspect word.

               For  many errors, you will simply want to correct the error and
          continue through the text.  For common errors, you may want to  have
          all  subsequent instances corrected automatically.  For example, you
          might  want  all  instances  of  "recieve"   to   become   "receive"
          automatically.   SYLCOR  allows  you  to  create  (in the process of
          correcting text) lists of automatic changes.  You may choose to have
          each  automatic  correction  presented  for  your approval before it
          modifies the text.

               When you begin a session  with  SYLCOR,  the  files  containing
          exceptions and auto-corrections are loaded.  At the end of each text
          corrected, for each file to which there have been additions, you are
          asked  if you would like to update the file or backup the additions.
          In this way,  the  files  may  be  enlarged  by  each  session,  and
          consequently you do less and less work in subsequent sessions.

               SYLCOR  may  be applied to many texts in one session.  For each
          input text file, a corresponding output file will be created.

               SYLCOR deals only with the words of the text,  and  deals  with
          them  only  one  at a time.  All the format marking, punctuation and
          capitalization are passed unchanged  from  the  input  text  to  the
          output text.

               As  mentioned  above,  SYLCOR uses phonological well-formedness
          for  detecting  potential  errors.   SYLCOR's  error   detector   is
          precisely  that  of  SYLCHK.  Both use the same data files, i.e. the
          same orthography normalization table and the same file of acceptable
          onsets,  nuclei and codae.  Before running SYLCOR, you might find it
          helpful to run SYLCHK on some text; this will help  you  to  develop
          the data you need in the tables.

               If  you  intend  to  put  words  into a new auto-corrections or
          exceptions file during a SYLCOR session, you must create these files
          before  you  run  SYLCOR.   The  files  may  be  empty,  but you are
          encouraged to place identifying comments in them, according  to  the
          syntax given below (see section 4.9).

          4.2 Initiating a session with SYLCOR

               After giving the command to run SYLCOR you will first see a
          line showing you the amount of  available  memory.   Then  you  must
          respond  to  some  questions so that some files can be loaded and so
          that certain options may be set.  You are first asked  for  a  setup
          file with the prompt:

                  Setup file: [none]

          If you do not have a  setup  file,  you  must  answer  a  series  of
          questions interactively at the terminal.  If you provide the name of
          a setup file, many of the subsequent questions will be answered from
          the  file,  and you will be free to seek the beverage of your choice
          while the files load.  The following is a sample setup file:

                  Setup file for using SYLCOR with To'abaita texts
                  '
                  2
                  1
                  autoco.tob
                  y
                  loan.tob
                  biblic.tob
                  
                  \
                  onc.tob
                  
                  fields.tob

          The first line will always be skipped; this allows you to provide an
          identifying  comment.   Subsequent  lines  provide  responses to the
          questions in the order the program asks  them  as  discussed  below.
          There may be from zero to four names of exceptions lists, and after
          the last exceptions file is given there must be a carriage return
          (that is, an empty line).  If
          some  file  cannot be found, setting up becomes interactive, and you
          must provide the correct responses from  the  terminal  (unless  you
          want to abort SYLCOR, edit the setup file and try again). 
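
               The structure of the setup file can be made explicit with a
          Python sketch that reads one answer per line.  It is not SYLCOR's
          own reader.  The function name parse_setup and the dictionary keys
          are hypothetical, and the meaning given to the last two lines (an
          orthography change table and a marker field file, by analogy with
          the SYLCHK prompts) is an assumption.

```python
def parse_setup(text):
    """Read the answers in a SYLCOR-style setup file into a dictionary."""
    lines = iter(text.splitlines()[1:])   # the first line is a comment

    def answer():
        # A missing line counts as an empty answer.
        return next(lines, "").strip()

    setup = {}
    setup["alphabetics"] = answer()
    setup["trie_level"] = answer()
    setup["min_length"] = answer()
    setup["auto_corrections"] = answer()
    setup["query"] = answer()
    setup["exceptions"] = []
    while True:
        name = answer()
        if not name:                       # empty line ends the list
            break
        setup["exceptions"].append(name)
    if len(setup["exceptions"]) > 4:
        raise ValueError("at most four exceptions files are allowed")
    setup["separator"] = answer()
    setup["onc_file"] = answer()
    setup["change_table"] = answer()       # assumed meaning
    setup["marker_file"] = answer()        # assumed meaning
    return setup


SAMPLE = "\n".join([
    "Setup file for using SYLCOR with To'abaita texts",
    "'", "2", "1", "autoco.tob", "y",
    "loan.tob", "biblic.tob", "",
    "\\", "onc.tob", "", "fields.tob",
])
```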

               After  being  asked  for  a  setup file, you will then be asked
          which characters  you  want  treated  as  alphabetic  characters  in
          addition to the standard ones:

               Press <RETURN> to include these as alphabetic characters: ~'
               Otherwise type the characters desired:

          All other characters will be regarded as occurring outside of words.
          For example if you wish to treat "oyo't"  as  a  word,  include  the
          apostrophe  (')  as  an  alphabetic character; otherwise SYLCOR will
          treat "oyo't" as the two words "oyo" and "t".   After  you  respond,
          SYLCOR  will  inform  you  of  the  characters  it  is  treating  as
          alphabetic.  For example, if you responded by typing a tilde (~) and
          an apostrophe ('), you will then see the following:

                  Using the following as alphabetics:
                  ~'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

               The  auto-corrections  and exceptions you are about to be asked
          for are stored in the computer's memory as a type of tree structure,
          called  a "trie."  Tries are more efficient than simple lists in two
          ways: (i) it is possible to find  entries  much  more  quickly,  and
          (ii)   for  large tables, more changes can be stored.  The degree to
          which this efficiency is attempted is set by the number you give  in
          response to the prompt:

                  Maximum level for trie [no limit]?

          If there were nothing to pay for the efficiency,  one  would  simply
          strive  for  the  maximum, responding always with a carriage return.
          But efficiency isn't free.  If the dictionary is not large enough to
          take  advantage  of  the  density you hope to achieve, more space is
          used than necessary.  (It's something like packaging soap in economy
          sized  boxes:  if  you don't fill them the result takes up more room
          than necessary.)

               As a rule of thumb, respond with 2 or 3 for tables with  up  to
          1000  entries.   You will probably develop a feel for what number is
          appropriate; you might even experiment, loading the same table  with
          different  numbers of levels and seeing which number leaves the most
          space (as reported by messages concerning free  space  given  before
          and  after  the  change table is loaded).  If you set the number too
          low (say 0 or 1 for over 500 entries) the time it takes to find each
          change will increase considerably.

               The next question the program asks is:

                  Minimum length of words to check: [1]

          Here  you should indicate the number of characters in the language's
          shortest well-formed words.

               Next you are asked:

                  Auto-corrections file: [none]

          If you do not want any  auto-corrections,  simply  type  a  carriage
          return.  If you have an auto-corrections table to load, respond with
          the appropriate file name.  If you do not have  an  auto-corrections
          file  and  you expect to put automatic corrections into such a file,
          go no further!  Use ^C to get back to the monitor and create a  file
          (using  a  text  editor).   (The structure of this file is described
          below in section 4.7.)  Run SYLCOR again, and when you get  to  this
          point,  respond  with  its  name.  The auto-corrections will then be
          added to this file.  After an auto-correction file  is  loaded,  you
          are told how many corrections were loaded.  Next comes the question:

                  Query before any auto-corrections? [y]

          If  you  respond  with  "n" or "N", auto-corrections are carried out
          automatically, without asking you to verify them.  The only evidence
          you  will  see  of  a  change is the incrementing of the "auto-corr"
          counter on the screen.  If you answer "y",  "Y"  or  simply  respond
          with  a  carriage  return,  then  each  time  an  auto-correction is
          discovered the surrounding text is displayed in the  upper  part  of
          the screen and your approval is sought.  For example:

                  "ther" > "their" ? [y]

          This   forces  you  to  decide  case-by-case  whether  a  change  is
          appropriate.   You  will   probably   want   to   be   queried   for
          auto-corrections  at  first;  if  you  find  that  you always answer
          positively, then you may feel comfortable about dispensing with  the
          warning.

               After the auto-corrections query you  will  again  be  informed
          about  how much free memory is available.  Next you are asked for an
          exceptions file:

                  Exceptions file 1:

          If you respond to this with a carriage return,  SYLCOR  will  assume
          you  do  not  want  to  use  any  exceptions  files.   As  with  the
          auto-correction file, any exceptions files you wish to use  must  be
          created before reaching this point.  They need not have any entries,
          but  you  must  respond  to  this  prompt  with  the   name   of   a
          previously-created  file.   An  exceptions  file is simply a list of
          words, all lower case.  Its order is not significant.  It is good to
          have  it  begin  with  an  identifying  comment line.  (If this line
          begins with a backslash  ("\")  then  the  exceptions  file  can  be
          periodically sorted and the comment line will stay at the top.)

               After  Exceptions  file 1 has loaded, you will be told how many
          exceptions were loaded and informed about the amount of free storage
          by a message such as the following:

                  12963 bytes left, largest space is 6568 bytes.

          If  either  of  the numbers gets below 100, SYLCOR may have problems
          adding to the exceptions lists or auto-corrections table.

               If you have loaded an exceptions file, you will  be  asked  for
          another:

                  Exceptions file 2:

          You  can  use  up to four exceptions files during any run of SYLCOR.
          This allows you to keep, for example, Biblical names  in  one  file,
          linguistic  jargon  in another, unassimilated loan words in another,
          and so on.  Some applications will not need all  of  the  exceptions
          files;  for  instance,  correcting  Scripture  would  not  need  the
          linguistic jargon and correcting a linguistic paper would  not  need
          the  Biblical names.  When the question appears again, press <RETURN>
          if you have no more exceptions files.

               Next you are asked:

                  Character which separates ONC distribution classes: [\]

          Next you are asked for an "ONC" table:

                  ONC file?

          The ONC file defines the possible syllable onsets, nuclei and codae.
          You  must  write  an  ONC  file  for  the  language to which you are
          applying SYLCOR; how to do so is discussed in the  previous  section
          on SYLCHK, section 3.4.

               Next you are asked:

                  Orthography change table file: [none]

          If  you do not want to use a change table, simply press <RETURN>.  A
          change table allows you to normalize the spelling  of  words  before
          they are checked, which may be very useful.  For example, in the
          practical orthography for Quechua, long vowels are represented as
          two vowels; long /a/, for instance, is represented as "aa".  However,
          in the phonological system, long vowels pattern as a vowel  followed
          by  a  consonant,  so  long  /a/  patterns  as  an /a/ followed by a
          consonant [length].  In order that SYLCOR treat long vowels in  this
          way,  the  words  are  normalized  by changing "aa" to "a:"; "ee" to
          "e:"; and so on, and ":" is listed as a coda in the ONC  file.   The
          format  of an orthography change table is described in detail in the
          preceding section on SYLCHK, section 3.5.  The next question is:

               Standard format marker field file: (<RETURN> for all fields)

               This file will list the specific standard  format  fields  that
          you  want  SYLCOR  to  read.   See the preceding section on WRDCHG,
          section 2.4, for details of how this file should look.  Simply press
          <RETURN>  if  you want SYLCOR to read all fields.  Remember that all
          the answers to these questions  can  be  put  in  a  setup  file  as
          discussed already.
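
               As a rough illustration of the normalization described above
          for Quechua, the following Python sketch rewrites doubled vowels
          before a word would be checked.  The function name, the list of
          pairs beyond "aa" and "ee", and the sample word are all invented
          for this illustration, and simple ordered replacement is only an
          assumption; the real behaviour of a change table is the one
          defined in section 3.5.

```python
# Hypothetical change pairs modelled on the Quechua example ("aa" to "a:").
LONG_VOWEL_CHANGES = [("aa", "a:"), ("ee", "e:"), ("ii", "i:"),
                      ("oo", "o:"), ("uu", "u:")]


def normalize(word, changes):
    """Apply each (match, substitution) pair, in order, to the whole word."""
    for match, substitution in changes:
        word = word.replace(match, substitution)
    return word


# "paata" is an invented word, used only to show the effect.
print(normalize("paata", LONG_VOWEL_CHANGES))
```

          With ":" listed as a coda in the ONC file, the normalized form
          can then be checked as a vowel followed by a consonant.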

               At this point you will see a message on the screen and,
          finally, you are asked for an input file:

                  Input file:

          To this you must respond with the name  of  the  file  you  wish  to
          correct.   If the file is not found, you will be asked to try again.
          When the file is found, you are asked for the  name  of  the  output
          file.   SYLCOR makes a default file name which you can use by simply
          typing a carriage return; the default writes to the default device
          and replaces the extension of the input file name with .SPL.  Thus,
          if you were editing FUNNEY.SFM, the next prompt would be:

                  Output file: [FUNNEY.SPL]

          Of course, you are free to respond with whatever file name you wish.
          (On  a  two-tape  system, you will definitely want to have the input
          and output files on different tapes,  as  otherwise  there  will  be
          considerable tape spinning in the course of correcting a file.)
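
               The default output name can be pictured with a short Python
          sketch.  It is not SYLCOR's own code; the function name is
          hypothetical, and device prefixes such as DD1: or b: are not
          handled here.

```python
import os


def default_output_name(input_name, extension=".SPL"):
    """Swap the extension of the input file name for the given one."""
    stem, _old_extension = os.path.splitext(input_name)
    return stem + extension


print(default_output_name("FUNNEY.SFM"))
```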

          4.3 Screen layout

               Suppose  you  initiate  a  session with SYLCOR in which you are
          correcting a file TEXT.SFM from the default device and  putting  the
          corrected version onto a specified device (such as DD1: or b:) under
          the name TEXT.SPL (where the new extension  indicates  that  it  has
          gone  through a spelling corrector).  The following appears slightly
          above the middle of the screen in reverse video:

          +-----------------------------------------------------------------+
          |SYLCOR TEXT.SFM > DD1:TEXT.SPL 0 words 0 Errors 0 Auto-corr 0 Exc|
          +-----------------------------------------------------------------+

          The region above these lines is for the display of text.  The region
          below  is  the  area  in which all your interactions with SYLCOR are
          displayed, that is, prompts and your  responses,  as  well  as  word
          editing.



               As words pass through SYLCOR, the appropriate  counts  will  be
          incremented.   If you finish working on TEXT.SFM and correct another
          file, the new file names will be displayed and the counters will  be
          reset  to  zero.   Every  time  a  word passes from the input to the
          output file, the "words" counter gets incremented.  If the  word  is
          phonologically  anomalous, but is already on an exceptions list, the
          "Exc" counter is incremented.  If it is phonologically anomalous but
          there  is  an  auto-correction  for  it,  the "Auto-corr" counter is
          incremented.  If it is phonologically anomalous and there is neither
          an exception nor an auto-correction for it, then the "Errors" counter
          is incremented.
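
               The counting rules above can be summarized in a small Python
          sketch.  The function and its arguments are hypothetical:
          is_well_formed stands in for the phonological check, and the
          choice to test the exceptions lists before the auto-corrections
          is an assumption, since the text does not say which takes
          precedence.

```python
def counters_for(word, is_well_formed, exceptions, auto_corrections):
    """Return the names of the status-line counters that a word increments."""
    counters = ["words"]              # every word written to the output
    if is_well_formed(word):
        return counters
    if word in exceptions:            # anomalous, but a known exception
        counters.append("Exc")
    elif word in auto_corrections:    # anomalous, with a stored correction
        counters.append("Auto-corr")
    else:                             # anomalous and unknown
        counters.append("Errors")
    return counters
```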

          4.4 Handling possible errors: word edit mode

               When SYLCOR suspects an error, you  are  put  into  "word  edit
          mode."   The  word  is displayed in reverse video in the top part of
          the screen with surrounding text.  The following line  appears  just
          below the middle of the screen:

          WORD EDIT: <-,->, DEL, CTRL/U, CTRL/R, RETRN when done, ? for help

          Below  this  appears  the  word  you  are  editing,  with the cursor
          positioned directly after it.  You may  now  edit  this  word.   Any
          character you type will be entered to the left of the cursor, except
          for the following, which have the effect indicated:

          <- or CTRL/B moves the cursor back (to the left) one character
          -> or CTRL/F moves the cursor forward (to the right) one character
          DELETE or BACKSPACE  deletes one character to the left of the cursor
          CTRL/U or CTRL/W deletes the entire word being edited
          CTRL/R restores the original word, undoing all the editing
          RETURN closes the editing on this word
          ? prints this message

          If you hold down one of the arrow keys, it will move left  or  right
          until  you  release  the key.  If you are at the end of the word and
          move right, the cursor will cycle around to  the  beginning  of  the
          word.   If  you  are at the beginning of the word and move left, the
          cursor will cycle around to the end of the word.
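
               The wrap-around of the cursor amounts to modular arithmetic.
          In the following Python sketch (an illustration only, with a
          hypothetical function name) the cursor positions are assumed to
          run from 0, before the first character, to the length of the
          word, directly after the last character.

```python
def move_cursor(position, step, word_length):
    """Move left (step -1) or right (step +1), cycling at either end."""
    return (position + step) % (word_length + 1)


print(move_cursor(5, +1, 5))   # from the end of a 5-letter word, moving right
```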

               When you have finished  editing  a  word,  press  the  carriage
          return.   If  you  have  changed the word, the original word and the
          corrected form are displayed as a change, and you are asked  if  you
          want  to  make  this change automatic (by adding this to the list of
          automatic changes).  For example, if you  have  changed  "yeild"  to
          "yield", the following is displayed:

                  "yeild" > "yield" ? [n]

          The  "[n]"  at  the end of this line specifies the default value; if
          you respond simply with a carriage return, the change  will  not  be
          added  to the auto-corrections.  If you want to add this correction,
          respond with "y" or "Y".  After you respond to  this  question,  the
          program again resumes searching for the next possible error.

               Suppose that, instead of correcting the word, you want to leave
          it just as it is.  To do so, simply respond with a carriage  return.
          The  word  will then be unchanged, and you will be asked if you want
          to add it to one of the exceptions lists.  For example, if you  have
          two  exceptions  files,  LOANS.LST  and  BIBNAM.LST  (for  loans and
          Biblical names, respectively), you will see the following:

                  Add "xxxxx" to exceptions file?
                  1 - loans.lst
                  2 - bibnam.lst
                  <RETURN> to not save this exception
                  Type 1, 2, or <RETURN>

          To this you must respond with a "1", in which case the word will  be
          added  to LOANS.LST; a "2", in which case it is added to BIBNAM.LST;
          or a carriage  return,  in  which  case  it  is  not  added  to  any
          exceptions  list,  and  the  program  resumes searching for the next
          possible  error.   (The  program  will  complain  about  any   other
          response.)

          4.5 Making the auto-correction and exception files

               When you finish correcting a text file, and the output file has
          been written, you are then asked if you would like  to  protect  the
          additions  made  to  the auto-correction and exceptions files.  Only
          the files to which there have been  additions  will  be  considered.
          You are asked:

          Update auto-corr & all exceptions files to their current names? [n]

          If you respond with "y" or "Y", all files to which there have been
          additions are updated under the same name and onto the device from
          which  they  were  read.  Since this involves copying the original
          file  and  then  writing  out  the  additions,   this   can   take
          considerable time on a tape-based system.

               If you respond negatively, you are given the option to update
          file by file.  You will see a prompt like the following:

                  For auto-corr file NJAUTO.TAB
                          1 - save both new and old auto-corrections
                          2 - save only new auto-corrections
                          <RETURN> to forget new auto-corrections
                          Type 1, 2, or <RETURN>:

          This gives you the option to (1) rewrite the file with the
          additions (which, again, takes a while on a tape-based system),
          (2) write out a temporary backup file consisting of only the
          entries you have added since your last update, or (3) do nothing
          about backing up additions.  The second alternative takes less
          time,  but  in  the event of a problem (e.g., a power failure) you
          must later do a separate operation to append the additions to your
          original file.

               If  you  are  making  many  additions to the auto-correct and
          exceptions files, SYLCOR may ask you to  protect  these  additions
          before  it  gets  to  the end of the text file you are correcting.
          This is because SYLCOR has a limited ability to keep track of  all
          the  new  additions.   When  it gets to the limit, it wants you to
          rewrite the file with the additions (i.e.,  option  1,  above)  so
          that it can start afresh remembering new additions.  (Note: option
          2 above will not do here, as it does not cause SYLCOR to  "forget"
          the old additions and start a new list.)

          4.6 Ending a session with SYLCOR

               SYLCOR  begins  the process of terminating a session when you
          respond with a carriage return to the following prompt:

                  Next input file (<RETURN> if no more):

          Since you may have done only temporary backup to this  point,  and
          would now like to do a full backup, you are again asked:

          Update auto-corr & all exceptions files to their current names? [n]

          When the matter of backup is settled, you are asked to replace the
          system tape or disk if necessary and then type a carriage return
          before control returns to the operating system:

                  Reinsert system disk if necessary, then press <RETURN>:

          You will then be returned to the system prompt.

          4.7 Writing your own auto-correction and exceptions files

               It was said above that you must create the files used to hold
          auto-corrections and exceptions before you run  SYLCOR,  but  that
          when  you  create  them,  you need not put in any entries.  If you
          know beforehand some words you wish to include in these files, you
          might  as  well put them in with your editor.  Here we discuss the
          syntax of the auto-corrections and exceptions files.

               An auto-correction file has the same syntax as an orthography
          change  table  (as  defined  in  section 3.5).   Each  line should
          contain at most one correction.  The match string comes  first  on
          the  line,  followed  by the substitution.  Both are surrounded by
          double quotes.  Anything on a line outside the quotes is  ignored.
          Any  line  beginning  with any printing character besides a double
          quote is a comment line and is ignored.  Do  not  use  upper  case
          characters (except, perhaps, in comments)!  It is good to start it
          with an identifying comment line.  It can be  sorted  periodically
          with a line sort, and it can be used with WRDCHG.
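
               To make the syntax concrete, here is a small Python sketch of
          a reader for such a file.  It is not SYLCOR's own code; the
          function names and the sample text are invented, and the handling
          of leading blanks and of lines with fewer than two quoted strings
          is an assumption.

```python
import re

QUOTED = re.compile(r'"([^"]*)"')


def parse_change_line(line):
    """Return (match, substitution), or None for a comment or blank line."""
    stripped = line.lstrip(" \t")
    if not stripped.startswith('"'):
        return None                    # comment line: does not start with "
    strings = QUOTED.findall(stripped)
    if len(strings) < 2:
        return None                    # assumption: incomplete lines are skipped
    return strings[0], strings[1]      # text outside the quotes is ignored


def parse_auto_corrections(text):
    """Collect the corrections of a whole file into a dictionary."""
    corrections = {}
    for line in text.splitlines():
        pair = parse_change_line(line)
        if pair is not None:
            corrections[pair[0]] = pair[1]
    return corrections


SAMPLE = r'''\ sample auto-corrections (comment line)
"ther" > "their"
"recieve" "receive"   a common slip
off "yeild" > "yield"
'''

print(parse_auto_corrections(SAMPLE))
```

          The last sample line is read as a comment because it does not
          begin with a double quote, so its change is not loaded.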

               An auto-correction file does not need to have anything in it.
          Auto-corrections  can  be  added  to  it  by  using  it   as   the
          auto-corrections file of a SYLCOR session.  Thus, you can start an
          auto-corrections file simply by creating (with an editor) an empty
          file  or a file which simply contains an identifying comment line.
          Then you can add all the corrections in SYLCOR sessions.

               An exceptions file contains words,  one  per  line,  with  no
          quote  marks  or blanks.  Any line beginning with a non-alphabetic
          character is ignored and may be used for comments.  Again, do  not
          use upper case characters (except, perhaps, in comments)!
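
               A reader for an exceptions file is simpler still.  The Python
          sketch below is illustrative only: the function name and the
          sample entries are invented, and Python's own notion of an
          alphabetic character is assumed to match SYLCOR's.

```python
def parse_exceptions(text):
    """Return the set of words listed in an exceptions file."""
    words = set()
    for line in text.splitlines():
        if line[:1].isalpha():         # other lines are comments
            words.add(line.strip())
    return words


SAMPLE = "\\ Biblical names (comment line)\nmoses\naaron\n\n# another comment\n"
print(sorted(parse_exceptions(SAMPLE)))
```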




                  5. SPELLING CORRECTION WITH TABLE LOOKUP (SPLCOR)

               SPLCOR is a program for correcting potential misspellings and
          typographical errors in text.  SPLCOR may be applied to many texts
          in one session: for each input text file, a  corresponding  output
          file will be created.

               SPLCOR  deals only with the words of the text, and deals with
          them only one at a  time;  all  format  marking,  punctuation  and
          capitalization  are  passed  unchanged  from the input text to the
          output text.  It treats every word as a potential error unless the
          word  has been previously entered into an "exception" list.  It is
          possible to have up to four exceptions  lists;  for  example,  you
          might  have  a list of loan words, a list of abbreviations, a list
          of Biblical names, etc.

               SPLCOR allows real errors to be corrected.  Since context  is
          sometimes  needed  to determine what the correct word should be, a
          region of text surrounding the error is displayed.   For  example,
          if  you  were  to encounter the misspelling "ther" out of context,
          you would not know whether it  should  be  corrected  to  "their",
          "there", "other", "the", etc.

               For  many  errors,  you will simply want to correct the error
          and continue on through the text.  For common errors  though,  you
          may want to have all subsequent instances corrected automatically.
          For example, one might want all instances of "recieve"  to  become
          "receive"  automatically.   SPLCOR  allows  you  to create (in the
          process of correcting text) a list of automatic changes.  You  may
          choose to approve each automatic correction before it modifies the
          text or to have it applied without your approval.

               When a session with SPLCOR is initiated, the files containing
          exceptions  and  auto-corrections  are loaded.  At the end of each
          text corrected, you can refresh the tape or disk copies  of  these
          files.   In this way, they are enlarged by each session, so you do
          less and less work in subsequent sessions.

               A variant of SPLCOR, called SYLCOR, detects potential  errors
          on the basis of phonological well-formedness.  It is expected that
          in the future other spelling correctors will  be  available  which
          use the SPLCOR shell but have other error detection methods.  In
          SPLCOR itself, once you have entered a certain word into an
          exceptions list (in the process of correcting text), it will be
          passed as acceptable.

               For  the  details of running SPLCOR, see the documentation of
          SYLCOR (section 4).  Ignore  all  references  to  the  orthography
          normalization and ONC tables.  All other aspects of SPLCOR are
          exactly as in SYLCOR.



                               6. HYPHENATION (HYPHEN)

          6.1 Introduction

               Discretionary  hyphens  are  symbols  in  a  text  file  that
          indicate  places  where  word  hyphenation at the end of a line is
          allowed.  Just as in English we have rules about where  words  can
          be divided, vernacular languages do also.  Having these symbols in
          a text as we were working with it would  be  a  nuisance,  so  the
          HYPHEN  program  can be used to put them in just prior to printing
          or typesetting.  The discretionary hyphen character is read by the
          formatting  program  Manuscripter  (MS)  and signals that the word
          could be hyphenated there if it occurs at the end of a  line  when
          printing  takes  place.   This  feature  is  especially helpful in
          languages that contain many long words.  If hyphenation  were  not
          allowed, a lot of space would be wasted at the end of each line of
          print.

               The HYPHEN program is basically  language  independent.   The
          user  defines which segments or sequences of segments constitute a
          given syllabification class and then defines the hyphenation rules
          in  terms of these classes.  The user also defines which character
          sequences constitute overstrike units.

               In Spanish, for example, the class of consonants contains the
          segments  b,  l,  and r and the sequences br and bl.  The class of
          vowels contains the segments a, , and i and the sequences ai  and
          i.   One  hyphenation  rule in Spanish is VCV becomes V-CV.  Thus
          the sequence abri would be hyphenated as a-bri.

               The program also allows the user to specify where in the word
          hyphenation  is  to  begin  and  end.  Thus one can tell it to not
          start hyphenating until there are at least  4  characters  at  the
          beginning and to stop hyphenating when there are 3 characters left
          at the end.  This would override any hyphenation rules that  might
          apply near the word boundaries.
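
               One way to picture these limits is as a filter on candidate
          break positions.  The Python sketch below is only an
          interpretation: the function name is hypothetical, a position is
          taken to be the number of characters that precede the break, and
          a break is assumed to be kept when at least the stated number of
          characters lies on each side of it.

```python
def allowed_breaks(word_length, breaks, start_after=2, stop_before=2):
    """Keep only the break positions that respect both limits."""
    return [position for position in breaks
            if position >= start_after
            and word_length - position >= stop_before]


# A 10-character word with candidate breaks after 1, 3, 4, 7 and 8 characters.
print(allowed_breaks(10, [1, 3, 4, 7, 8], start_after=4, stop_before=3))
```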

               HYPHEN  also  allows  one to specify to which standard format
          fields the hyphenation process is  to  apply.   In  a  dictionary,
          then,  one  can  have  separate  classes  and rules for the source
          language fields (such as \w and \i) and for  the  target  language
          fields (such as \d and \t).

               If  HYPHEN  finds  a  word that has any sequence that has not
          been defined, it will display an  error  message  on  the  screen.
          This message will show what the sequence is, what the word is, and
          will also state that the word will not be hyphenated.

          6.2 Data files

               HYPHEN uses four user-defined data files  which  need  to  be
          created with a text editor before running the program.

          6.2.1 Segment definition file

               This  file  contains  the  information  about  which segments
          and/or sequences belong to which classes.  The information  is  to
          be entered in a specified format.

               1. All  text up to the first occurrence of the word CLASS (or
                  class) at the beginning of a  line  is  considered  to  be
                  comment.




               2. The word CLASS (or class)  at  the  beginning  of  a  line
                  indicates  that  a  new class is about to be defined.  The
                  one letter abbreviation for the class  should  follow  the
                  key  word  CLASS.   Any  other  text  after  that  will be
                  considered comment.

               3. From the next line to either the end of the file or to the
                  next  occurrence  of  the word CLASS at the beginning of a
                  line, all characters are considered to be either  segments
                  or sequences that belong to that class.

               4. Please note that no sequence can belong to more than one
                  class.  Thus "a" cannot belong to both the class A and the
                  class V.

               5. Also  note  that  HYPHEN  will  always  take  the  longest
                  possible sequence and assign its associated class  to  it.
                  As  an  example,  let's suppose that the following classes
                  are defined:

                          CLASS V
                            a ai i
                          CLASS C
                            n r t tr
                          CLASS M
                            ain
                  
                  Then the word "train" will be treated as  a  "CM"  pattern
                  and the word "trait" would be treated as a "CVC" pattern.
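
               The longest-match behaviour of item 5 can be reproduced with
          a short Python sketch.  This is an illustration rather than
          HYPHEN's own code: the function name is hypothetical, and raising
          an error for an undefined sequence merely stands in for the
          message HYPHEN displays.

```python
def class_pattern(word, classes):
    """Return the class pattern of a word, always taking the longest match."""
    segment_class = {}
    for letter, segments in classes.items():
        for segment in segments:
            segment_class[segment] = letter
    longest = max(len(segment) for segment in segment_class)
    pattern = []
    position = 0
    while position < len(word):
        # Try the longest possible sequence first, then shorter ones.
        for length in range(min(longest, len(word) - position), 0, -1):
            candidate = word[position:position + length]
            if candidate in segment_class:
                pattern.append(segment_class[candidate])
                position += length
                break
        else:
            raise ValueError("undefined sequence at %r" % word[position:])
    return "".join(pattern)


CLASSES = {"V": ["a", "ai", "i"], "C": ["n", "r", "t", "tr"], "M": ["ain"]}
print(class_pattern("train", CLASSES), class_pattern("trait", CLASSES))
```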

               The following shows an example from Campa Pajonal (a language
          of the Peruvian jungle). (Note that the front slash (/) and double
          quote  (")  preceding a vowel as well as the tilde (~) before an n
          represent overstrikes that are discussed in section 6.2.2.)

              Campa Pajonal segment definition file    hab  17-May-85
          
              CLASS V  Vowels
          
                  a  e  i  o  u
                  aa ee ii oo uu
                  ae oe
                  /a /e /i /o /u "u
          
              CLASS C  Consonants
          
                  c ch g j  jy m  my n ~n p py qu qy r ry
                  s sh t th ts ty tz v vy y
          
              CLASS N  Word medial nasal consonant clusters
          
                  mp  nqu nth ntz
                  mpy nqy nts
                  nc  nt  nty
                  nch
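
               A file in this format could be read along the following
          lines.  The Python sketch is an assumption-laden illustration,
          not HYPHEN's own reader: the function name is invented, leading
          blanks before the word CLASS are tolerated, and the sample is
          only an excerpt of the file above.

```python
def parse_segment_definitions(text):
    """Return a dictionary mapping each class letter to its segments."""
    classes = {}
    current = None                        # no class yet: leading comment
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("CLASS", "class"):
            if len(fields) < 2 or len(fields[1]) != 1:
                raise ValueError("CLASS needs a one-letter abbreviation")
            current = fields[1]           # the rest of the line is comment
            classes[current] = []
        elif current is not None:
            classes[current].extend(fields)
    return classes


SAMPLE = '''Campa Pajonal segment definition file (excerpt)

CLASS V  Vowels
    a  e  i  o  u
    /a /e /i /o /u "u

CLASS N  Word medial nasal consonant clusters
    mp  nqu nth ntz
'''

print(parse_segment_definitions(SAMPLE))
```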
          




          6.2.2 Overstrike unit file

               This file  lists  the  character  sequences  that  constitute
          overstrike  units.   That  is, it lists all sequences that will be
          printed  as  one  character  as  the  text  is  passed  through  a
          Consistent  Changes  print  table.   This  information  is used by
          HYPHEN to count correctly where to  begin  or  end  hyphenating  a
          word.  The sequences are to be entered in a specified format.

               1. The first line is treated as comment.

               2. All  following  text  is  considered  to  be a list of the
                  overstrike units.  Units should be separated by "white
                  space"  (i.e.,  a  space, a tab, or a new line).  Capitals
                  and lower case letters do not need to be distinguished.

          The following shows an example from Spanish.

               Overstrike definition file for Spanish  05-Jul-85 hab
          
               'a 'e 'i 'o 'u "u ~n
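
               The effect of overstrike units on character counts can be
          sketched in Python as follows.  The function names are
          hypothetical and the counting method (longest unit first, case
          ignored) is an assumption about how such a count might be made,
          not a description of HYPHEN's internals.

```python
def parse_overstrikes(text):
    """Read the units of an overstrike file; the first line is a comment."""
    units = set()
    for line in text.splitlines()[1:]:
        units.update(unit.lower() for unit in line.split())
    return units


def printed_length(word, units):
    """Count printed characters, each overstrike unit counting as one."""
    longest = max((len(unit) for unit in units), default=1)
    lowered = word.lower()
    count = 0
    position = 0
    while position < len(lowered):
        step = 1
        for length in range(min(longest, len(lowered) - position), 1, -1):
            if lowered[position:position + length] in units:
                step = length             # a whole unit prints as one character
                break
        position += step
        count += 1
    return count


SAMPLE = """Overstrike definition file for Spanish

'a 'e 'i 'o 'u "u ~n
"""

UNITS = parse_overstrikes(SAMPLE)
print(printed_length("ni~no", UNITS))
```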
          

          6.2.3 Hyphenation change table

               This file contains  the  hyphenation  rules.   It  is  to  be
          written  in the form of a "change table," although it is different
          from a Consistent Changes table in several ways.

               A change table is a  list  of  paired  strings,  each  string
          bounded  by  double  quotes  (").   The  first string of a pair is
          called the  "match  string";  it  specifies  some  pattern  to  be
          matched.   The  second  string,  called the "substitution string,"
          specifies what is to be substituted for  each  occurrence  of  the
          matched   string.   Please  note  the  following  when  writing  a
          hyphenation change table:

               1. Any character(s) may be placed between the left and  right
                  strings.   This  allows  whatever  notation  you  like  to
                  symbolize the change.  The following lines have  the  same
                  effect:

                          "VCV" becomes "V-CV"
                          "VCV" --> "V-CV"
                          "VCV" > "V-CV"
                          "VCV" "V-CV"
                  

               2. Anything   following  the  right  string  is  ignored,  so
                  comments may follow the pair of strings.

               3. If a character other than space or tab appears on  a  line
                  before  the  first  double  quote  mark, then that line is
                  regarded as a comment, and any change on  that  line  will
                  not  be  applied.   This  provides  a simple mechanism for
                  disabling a change: simply put some character ahead of the
                  first  string.   For example, the following line would not
                  make any change:



                          off "VCV" > "V-CV"

               4. The hyphenation rules are ordered and will be  applied  as
                  many  times as possible.  That is, the first change in the
                  table will be made until it cannot be made anymore.   Then
                  the  second  change  will be made and so on.  This feature
                  has great  advantages,  but  can  cause  problems  if  not
                  properly  used.  It is possible to create an infinite loop
                  with this table!  Consider the following changes, where  C
                  is  the class of consonants, V is the class of vowels, and
                  G is the class of the single segment glottal.

                          "CCC" > "Cc-C"
                          "CC"  > "C-C"
                          "VG"  > "Vg-"
                  
                  Note the order of the changes.  If the double consonant
                  change were put first, the triple consonant change would
                  never apply (CCC would become C-CC and then become
                  C-C-C).  Note that the first change converts the second C
                  to a lower case c.  This is  so  that  after  CCC  becomes
                  Cc-C, the second rule will not then convert the CC to C-C.
                  Also note that this same "trick" was applied  for  the  VG
                  change.   Without  it,  we would have an infinite loop: VG
                  would become VG- which then becomes VG--, and so on.

               5. The special symbol #  indicates  a  word  boundary.   Thus
                  "#CV"   indicates  word-initial  CV  and  "CV#"  indicates
                  word-final CV.
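
               The ordered, repeated application described in item 4 can be
          tried out with the following Python sketch.  It is a simplified
          model and not HYPHEN's own code: the function names and the
          iteration limit are invented, rules are applied to the class
          pattern as a plain string, and the check on the number of
          non-hyphen characters reflects the restriction stated in this
          section.

```python
def balanced(match, substitution):
    """True if both strings have the same number of non-hyphen characters."""
    return len(match.replace("-", "")) == len(substitution.replace("-", ""))


def apply_rules(pattern, rules, limit=1000):
    """Apply each rule in order, repeating it until it no longer matches."""
    work = "#" + pattern + "#"              # mark the word boundaries
    for match, substitution in rules:
        if not balanced(match, substitution):
            raise ValueError("unbalanced rule: %r > %r" % (match, substitution))
        steps = 0
        while match in work:
            work = work.replace(match, substitution, 1)
            steps += 1
            if steps > limit:               # guard against an endless rule
                raise RuntimeError("rule %r never stops applying" % match)
    return work.strip("#")


def insert_hyphens(segments, rewritten, hyphen="-"):
    """Copy the hyphens of a rewritten pattern into the word's segments."""
    remaining = iter(segments)
    pieces = []
    for symbol in rewritten:
        pieces.append(hyphen if symbol == "-" else next(remaining))
    return "".join(pieces)


RULES = [("CCC", "Cc-C"), ("CC", "C-C"), ("VG", "Vg-")]
print(apply_rules("VCCCV", RULES))
print(insert_hyphens(["a", "br", "i"], apply_rules("VCV", [("VCV", "V-CV")])))
```

          Because matching is case-sensitive in this sketch, the lower
          case letters written by a rule keep later rules (and the rule
          itself) from matching the same segments again, which is the
          "trick" described in item 4.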

               Please note the following special restrictions on the above:

               1. There must be  a  one-to-one  correspondence  between  the
                  number  of  non-hyphen  characters in the match string and
                  the substitution string.  Thus the following will  produce
                  unpredictable results:

                          "AI" > "V"  (too few  char's in sub. string)
                          "C"  > "TR" (too many char's in sub. string)
                  

               2. When  word  boundary conditions are indicated in the match
                  string, the substitution string should  also  include  the
                  word boundary symbol (#):

                          "#VCV" > "#V-CV"
                          "VCV#" > "VC-V#"
                  

          6.2.4 Standard format marker field file

               This  file  allows  the user to specify which standard format
          fields (in a text containing several fields) are to be hyphenated.
          Merely list the format markers which indicate the fields to which
          the hyphenation rules are to apply.  They can be laid out in any
          way.  Any text that is not preceded by a backslash character (\) is
          considered to be a comment.  The following could be an example for
          a dictionary:



              Pajonl.sfm  Campa Pajonal std format marker field file
          
                  \w words
                  \i illustrative sentences
          
          Please note that this file is optional.  If no file  is  specified
          when the program is run, all fields will be used.
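
               Reading such a file amounts to collecting every word that
          starts with a backslash.  The Python sketch below illustrates
          this; the function names are hypothetical, and treating a marker
          as a backslash followed by any run of non-blank characters is an
          assumption.

```python
import re


def parse_marker_file(text):
    """Return the markers listed in the file; all other text is comment."""
    return re.findall(r"\\\S+", text)


def field_selected(marker, markers):
    """An empty list (no file given) means that all fields are used."""
    return not markers or marker in markers


SAMPLE = r"""Pajonl.sfm  Campa Pajonal std format marker field file

    \w words
    \i illustrative sentences
"""

print(parse_marker_file(SAMPLE))
```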

          6.3 Running the program

               When  HYPHEN is first run, it begins by indicating the amount
          of free memory available with a message like:

                  HYPHENATION Version 1.3 (12-Dec-86)
                    
                          SETUP-ALLOC-112832 bytes for records

          You are then asked to specify which non-alphabetic characters
          (i.e., anything other than a-z) are to be treated as parts of words.

               Press <RETURN> to include these as alphabetic characters: ~'
               Otherwise type the characters desired:

          If, for example, you were using ' for accent, ~n for an enyee, and
          "u for a dieresis u, you would want to type:

                  '"~
          
          and then press the <RETURN> key.  HYPHEN will then inform you of
          the characters it will treat as alphabetic (i.e., those used in
          forming a word).  Any other characters will be considered to be
          punctuation.  For example, if you used the example above, the
          following will be displayed:

                  Using the following as alphabetics:
                  '"~ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
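
               The effect of the alphabetic set can be pictured with a short
          Python sketch.  It is only an illustration under simple
          assumptions: the names alphabetic_set and split_words are
          invented here, and anything outside the set is simply treated as
          a word separator.

```python
import string


def alphabetic_set(extra="~'"):
    """Letters A-Z and a-z plus the extra characters the user typed."""
    return set(extra) | set(string.ascii_letters)


def split_words(line, alphabetics):
    """Split a line into words; any non-alphabetic character ends a word."""
    words, current = [], ""
    for ch in line:
        if ch in alphabetics:
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


alphabetics = alphabetic_set("'\"~")
print(split_words("la cig\"ue~na vol'o.", alphabetics))
```

          With the characters '"~ included, sequences such as "u and ~n stay
          inside the word instead of splitting it.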

          It now asks a series of three questions about  how  to  hyphenate.
          The first is:

                  Discretionary hyphen character: [&]
          
          Type  the  character  you wish to use for the discretionary hyphen
          and press the <RETURN> key.  It will assume that you want  to  use
          an  ampersand  (&)  if you just press the <RETURN> key.  Note that
          you will also have to inform Manuscripter of the character you use
          for your discretionary hyphen symbol (with the .dh command).  Note
          that a two-character sequence may also be used (e.g., [-  as  used
          by SIL's Printing Arts Department in Dallas for typesetting).  The
          second question is:

                  Hyphenation starts after this many characters: [2]
          
          Enter the minimum number of characters that must precede the
          first hyphen in a word and press the <RETURN> key.  It will assume
          you want hyphenation to begin after at least two characters if you
          just press the <RETURN> key.  The third question is:




               Hyphenation stops at this many characters from the end: [2]
          
          Enter the number desired and press  the  <RETURN>  key.   It  will
          assume  that  you  want  2 characters if you just hit the <RETURN>
          key.  Now it asks for the files that you have created as discussed
          in section 6.2.  The first one is:

                  Segment definition file:
          
          Enter the name of your file and press the <RETURN> key.  Secondly,
          it asks:

                  Overstrike unit file: (<RETURN> for no overstrike units)
          
          If you have a file specifying which character sequences constitute
          one  printing  segment, enter its name and press the <RETURN> key.
          If there are no such sequences, merely  press  the  <RETURN>  key.
          Note that if you have overstrike characters in your text but do not
          specify them here, HYPHEN may not correctly  delete  discretionary
          hyphens too near the front or too near the end of a word.  Then it
          will ask:

                  Hyphenation change table:
          
          Enter the file name of your change table and  press  the  <RETURN>
          key.   After  it  has  loaded  the  file, it will display how many
          changes it found.  It will now ask:

               Standard format marker field file: (<RETURN> for all fields)
          
          If you have a file specifying which standard format fields are  to
          be  hyphenated, enter its name and press the <RETURN> key.  If you
          want to hyphenate the entire text, merely press the <RETURN>  key.
          It now asks:

                  Input file:
          
          Enter  the  name  of  the  text file you wish to be hyphenated and
          press the <RETURN> key.  Then it asks:

                  Output file: [xxxxxx.hyp]
          
          where xxxxxx represents the name given for the  input  text  file.
          Enter  the  name  you  want  for the hyphenated file and press the
          <RETURN> key.  If you just press the  <RETURN>  key,  HYPHEN  will
          write  your file on the default device using an extension of .hyp.
          After it has processed the file, it will  display  the  number  of
          words it processed and then ask:

                  Next input file: (<RETURN> if no more)
          
          Enter  the  name of any additional files to be hyphenated or press
          the <RETURN> key.
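
               One way to picture the two limits asked for above is the
          following Python sketch.  It is an interpretation, not HYPHEN's
          code: it assumes that a discretionary hyphen is kept only if at
          least "start" characters precede it and at least "stop"
          characters follow it, and it counts raw characters, so an
          overstrike unit such as 'a counts as two.  The name trim_hyphens
          is invented for the example.

```python
def trim_hyphens(word, hyphen="&", start=2, stop=2):
    """Drop discretionary hyphens that fall too near either end of a word."""
    pieces = word.split(hyphen)
    total = sum(len(piece) for piece in pieces)
    result, seen = pieces[0], len(pieces[0])
    for piece in pieces[1:]:
        # Keep the hyphen only if enough characters lie on both sides of it.
        keep = seen >= start and total - seen >= stop
        result += (hyphen if keep else "") + piece
        seen += len(piece)
    return result


print(trim_hyphens("ca&be&za"))
print(trim_hyphens("i&de&a"))
```

          Counting printing segments rather than raw characters is what the
          overstrike unit file makes possible.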

          6.4 Examples

               Following are three examples from Peruvian languages.  The
          explanation of the rules, the hyphenation change table, and the
          segment definition file are shown for each.

          6.4.1 Spanish

          6.4.1.1  Hyphenation   rules.  These   are   from  The  New  World
          SPANISH-ENGLISH  and   ENGLISH-SPANISH   Dictionary,   edited   by
          Salvatore Ramondino (Signet Books, 1969), pp. 553-554.

                                      Consonants

               1. ch,   ll,  rr  count  as  single  letters  and  are  never
                  separated:

                          pe-cho  o-lla  pe-rro

               2. Single consonants between vowels go with the second vowel:

                          ca-be-za  pa-re-cer

               3. The groups pr, pl, br, bl, fr, fl, tr, dr, cr, cl, gr,  gl
                  go with the following vowel and are never separated:

                          re-pri-mir  co-pla  te-cla

               4. In other groups of two consonants, whether identical or
                  different, the consonants are divided between the
                  preceding and the following vowel:

                          res-pi-ro  hon-ra  ac-ción  in-no-ble  at-las

               5. In  groups  of three consonants, the first two go with the
                  preceding vowel and the third with the following vowel:

                          ins-tin-to  obs-tá-cu-lo

                  Exception: the groups listed in 3 above are not separated:

                          en-tre  com-pra  tem-plo  ins-tru-men-to

                                        Vowels

               6. In any combination of two of a, e, or o, the  syllable  is
                  divided between the two vowels:

                          ca-o-ba  i-de-a-ción

               7. In  any combination of two vowels in which one is a, e, or
                  o and the other is i or u, and there is no accent mark  on
                  the  i  or  u,  the  vowels  form  a diphthong and are not
                  separated:

                          jo-fai-na  vian-da  em-bau-car  men-guan-te
                          vi-rrei-na  con-tien-da  en-deu-dar-se  con-sue-lo
                          co-loi-dal  na-cio-nal  duo-de-no

                  If there is an accent mark on the a, e, or o of the group,
                  the two vowels still form a diphthong and are not
                  separated:

                          es-táis  es-co-géis  cuán-do

                  If the accent mark falls on the i or u of the group, the
                  two vowels do not form a diphthong and are separated:

                          ca-í-da  pen-sa-rí-a-mos  a-ta-úd  re-ú-ne

               8. In any combination of i and u,  that  is,  ui  or  iu,  no
                  division  of  syllables  is made between these two vowels.
                  This holds whether there is an accent mark or not:

                          ciu-dad  rui-do  ca-suís-ti-co

               9. In any combination of three vowels in which the first  one
                  is i, u, or ü (more than three do not occur), there is no
                  division of syllables between any two vowels of the group.
                  This  holds  whether there is an accent mark on any of the
                  vowels or not:

                          a-pre-ciis

               These rules can be simplified to  the  following  hyphenation
          rules and segment definitions.

          6.4.1.2  Segment  definition  file.   Table  1  shows  the segment
          definition file needed for Spanish.

               Spanish segment definition file    hab/sp  08-Jul-85
               
                       This data is from The New World SPANISH-ENGLISH and
                       ENGLISH-SPANISH Dictionary, ed. by Salvatore Ramondino,
                       1969, pp. 553-4 (V. Division of Syllables in Spanish).
               
               CLASS C  Consonants
               
                       b  bl br c  ch cl cr d  dr f  fl fr g  gl gr h  j  k
                       l  ll m  n  ~n p  pl pr qu r  rr s  t  tr v  x  z  y
               
               CLASS V  Vowels
               
                        a   e   i   o   u
                       'a  'e  'i  'o  'u
               
                       ai  ia  ei  ie  oi  io  ui  iu  "u'e
                       au  ua  eu  ue  ou  uo  "ue "ui "u'i
               
                       'ai i'a 'ei i'e 'oi i'o 'ui i'u
                       'au u'a 'eu u'e 'ou u'o
               
                       'iu u'i
               
                       i'ai i'ei u'ai u'ei "u'ei

                       uia ui'a uio ui'o uie ui'e
               

                              Table 1 - Spanish segments


          6.4.1.3  Overstrike unit file.  Table 2 shows the overstrike unit
          file needed for Spanish.  An accented vowel is preceded by a
          single quote ('), a dieresis on a u is indicated by a double quote
          ("), and an enye is indicated by a tilde (~n).

               Overstrike definition file for Spanish  05-Jul-85 hab
               
               'a 'e 'i 'o 'u "u ~n

                            Table 2 - Spanish overstrikes

          6.4.1.4  Hyphenation  change table.  Table 3 shows the hyphenation
          change table needed for Spanish.

               Spanish hyphenation rules    hab   17-May-85
               
               "VCV"   > "V-CV"
               "CCC"   > "Cc-C"
               "CC"    > "C-C"
               "VV"    > "V-V"

                          Table 3 - Spanish hyphenation rules
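
               To see how Tables 1 and 3 work together, the following
          Python sketch applies them to a few of the example words.  It is
          a simplified model, not HYPHEN's source: it assumes that a word
          is cut into the longest segments that match, that each change is
          reapplied until it no longer matches, and that matching is
          case-sensitive, so a lowercase letter in a substitution (as in
          "Cc-C") is left alone by later changes.  It ignores the
          overstrike file and the start and stop limits.  The function
          names are invented for the example.

```python
# Segments copied from Table 1 ('a = accented a, ~n = enye, "u = dieresis u).
CLASSES = {
    "C": """b bl br c ch cl cr d dr f fl fr g gl gr h j k
            l ll m n ~n p pl pr qu r rr s t tr v x z y""".split(),
    "V": """a e i o u 'a 'e 'i 'o 'u
            ai ia ei ie oi io ui iu "u'e au ua eu ue ou uo "ue "ui "u'i
            'ai i'a 'ei i'e 'oi i'o 'ui i'u 'au u'a 'eu u'e 'ou u'o
            'iu u'i i'ai i'ei u'ai u'ei "u'ei uia ui'a uio ui'o uie ui'e""".split(),
}

# Changes copied from Table 3, in the order given there.
RULES = [("VCV", "V-CV"), ("CCC", "Cc-C"), ("CC", "C-C"), ("VV", "V-V")]


def segment(word):
    """Split a word into (segment, class) pairs, longest segment first."""
    lookup = {seg: cls for cls, segs in CLASSES.items() for seg in segs}
    longest = max(map(len, lookup))
    pairs, i = [], 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in lookup:
                pairs.append((piece, lookup[piece]))
                i += size
                break
        else:
            raise ValueError(f"undefined sequence at {word[i:]!r}")
    return pairs


def apply_rules(class_string):
    """Apply each change in order, rescanning so overlapping matches are found."""
    for pattern, replacement in RULES:
        for _ in range(len(class_string)):
            if pattern not in class_string:
                break
            class_string = class_string.replace(pattern, replacement)
    return class_string


def hyphenate(word):
    """Return the word with a hyphen at every break the changes produce."""
    pairs = segment(word)
    marked = apply_rules("".join(cls for _, cls in pairs))
    segments = iter(seg for seg, _ in pairs)
    return "".join("-" if ch == "-" else next(segments) for ch in marked)


for word in ["cabeza", "respiro", "obst'aculo", "instrumento"]:
    print(hyphenate(word))
```

          Because pr, tr, and the diphthongs are single segments in
          Table 1, four short changes are enough to cover the nine rules
          quoted above.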


          6.4.2  Amarakaeri

               This is a Peruvian  jungle  language  which  belongs  to  the
          Harakmbet language family.

          6.4.2.1  Hyphenation    rules.  Amarakaeri   has   the   following
          hyphenation rules (as provided by  Bob Tripp):

               1. When a sequence of vowel-consonant-vowel occurs,  a  break
                  may  be  made  following  the first vowel, except when the
                  consonant is d, g, or y.

                       ya-ti-huad
                  
                  When a vowel is followed by a glottal, the break  is  made
                  after the glottal.

                       o'-hua'-po
                  
                  When  a  sequence of vowel-consonant-glottal-vowel occurs,
                  the break is made between the consonant and the glottal.

                       mo'-en-'uy-ne  on'-haudiay-'uya-te
                  

               2. A break may be made between two consonants.




                       arat-but  yan-nig-pee'
                  
                  The digraph hu should not be broken.

                       hua-hue'    pak-hue'
                       huey-pa     jo-nan-hua-hua-hue'
                  
                  When a glottal occurs between two  consonants,  the  break
                  should be made after the glottal.

                       On'-ka'-a-po   on'-no-kie'-uy
                       on'-tia-huay-po
                  

               3. A break may be made between two vowels.

                       o'-e-a-po  hua-e'-e-ri
                  
                  However, the vowel clusters }o}e, oe, }e}e, }a}e, ia, }i}e,
                  io, and }i}o (written here with } before each underscored
                  vowel, as in Table 4) should not be broken.

                       no-poe'-dik  on'-no-po'-toe-po
                       tia-huay-hued   be-tio-ka'
                  
                  When a cluster of three vowels occurs, break following the
                  second vowel.

                       a'-nig-pei-a'-po  mo'-ma-noe-an-hua-hui-ka'-a-po-ne
                  
                  In  any  vowel cluster including a glottal, a break may be
                  made after the glottal.

                       ij-no-poe-a'-a-po'i  hua-e'-e-ri  aro'-en

               4. Do not hyphenate leaving a single letter at the  beginning
                  or end of a word.

          6.4.2.2  Segment  definition  file.  Table  4  shows  the  segment
          definition file needed for Amarakaeri.  An  underscored  vowel  is
          indicated by a closing brace (}) preceding the vowel.

               
               Amarakaeri segment definition file   hab 15-May-85
               
               CLASS C  Consonants
                       b c f h hu j k l m n p q r s t v w x z
               
               CLASS X  Exception consonants
                       d g y
               
               CLASS G  Glottal
                       '
               
               CLASS V  Vowels
                        a  e  i  o  u
                       }a }e }i }o }u
                       }o}e }e}e }a}e ia }i}e io }i}o
                       oe



                            Table 4 - Amarakaeri segments


          6.4.2.3  Overstrike unit file.  Table 5 shows the overstrike  unit
          file  needed for Amarakaeri.  An underscored vowel is indicated by
          a closing brace (}) preceding the vowel.

               Overstrike definition file for Amarakaeri   06-Jul-85 hab
          
               }a }e }i }o }u
          

                           Table 5 - Amarakaeri overstrikes


          6.4.2.4  Hyphenation change table.  Table 6 shows the  hyphenation
          change table needed for Amarakaeri.

               
               Amarakaeri Hyphenation Change Table  hab  15-May-85
               
               "VCV"   >       "V-CV"
               "VCGV"  >       "VC-GV"
               "VXGV"  >       "VX-GV"
               "CC"    >       "C-C"
               "XC"    >       "X-C"
               "CX"    >       "C-X"
               "XX"    >       "X-X"
               "CGC"   >       "CG-C"
               "CGX"   >       "CG-X"
               "XGC"   >       "XG-C"
               "XGX"   >       "XG-X"
               "VGV"   >       "Vg-V"
               "VG"    >       "Vg-"
               "VVV"   >       "Vv-V"
               "VVGV"  >       "Vvg-V"
               "VV"    >       "V-V"

                        Table 6 - Amarakaeri hyphenation rules


          6.4.3  Campa Pajonal

               Campa  Pajonal is a Peruvian jungle language which belongs to
          the Arawakan language family.

          6.4.3.1  Hyphenation rules.  These rules were provided  by  Allene
          Heitzman.

               1. The vowels are a, e, i, and o; length, written as a
                  geminate vowel; and the vowel clusters ae and oe.

               2. The consonants are: c, ch, g, j, jy, m, my, n, ñ, p, py,
                  qu, qy, r, ry, s, sh, t, th, ts, ty, tz, v, vy, y.

               3. The  consonant  clusters  are (word medial only): mp, mpy,
                  nc, nch, nqu, nqy, nt, nth, nts, nty, ntz.




               4. Break after any vowel preceding a consonant, except before
                  an m or n in a consonant cluster.

               5. Do not break off fewer than four letters.

          6.4.3.2  Segment  definition  file.  Table  7  shows  the  segment
          definition file needed for Campa Pajonal.

               
               Campa Pajonal segment definition file    hab  17-May-85
               
               CLASS V  Vowels
               
                       a  e  i  o 
                       aa ee ii oo
                       ae oe
                       /a /e /i /o 
               
               CLASS C  Consonants
               
                       c ch g j  jy m  my n ~n p py qu qy r ry
                       s sh t th ts ty tz v vy y
               
               CLASS N  Word medial nasal consonant clusters
               
                       mp  nqu nth ntz
                       mpy nqy nts
                       nc  nt  nty
                       nch

                           Table 7 - Campa Pajonal segments


          6.4.3.3  Overstrike unit file.  Table 8 shows the overstrike unit
          file needed for Campa Pajonal.  An accented vowel is preceded by a
          single slash (/), and an enye is indicated by a tilde (~n).

                  
                  Overstrike definition file for Campa Pajonal  06-Jul-85 hab
                  
                  /a /e /i /o ~n
                  
                         Table 8 - Campa Pajonal overstrikes


          6.4.3.4  Hyphenation change table.  Table 9 shows the  hyphenation
          change table needed for Campa Pajonal.




               
               Campa Pajonal hyphenation rules    hab   17-May-85
               
               "VC" > "V-C"    c break after any vowel preceding
                                 a consonant
                               c do not break if it is an m or n
                                 in a consonant cluster
               

                      Table 9 - Campa Pajonal hyphenation rules

          6.5 Miscellaneous

          6.5.1 Program limitations

               While HYPHEN is quite general, it does have some limitations.

               1. If a text has a mixture  of  vernacular  and  loan  words,
                  HYPHEN  will  try to hyphenate the loan words according to
                  the rules of the vernacular.  If the  loan  word  contains
                  some   undefined  sequence,  then  HYPHEN  will  ring  the
                  terminal bell and display an error message  for  the  word
                  and   will   not   hyphenate  it.   (This  is  actually  a
                  fundamental problem of identifying  loan  words  within  a
                  text).

               2. Since version 1.2, HYPHEN has correctly handled text
                  containing Manuscripter bar commands (such as |b or |u).
                  Earlier versions treated the b or u as part of the word to
                  be hyphenated and lost any capitalization of a word
                  preceded by a bar command.

               3. HYPHEN assumes that the orthography consists only of
                  lowercase alphabetics.  Thus it cannot tell the difference
                  between upper and lower case letters, even if, say,
                  capital letters were used to represent unvoiced vowels.
                  Both will be treated as if they were lower case.  For
                  HYPHEN to handle this situation correctly, one will need
                  to represent the unvoiced sound by some other unique
                  sequence.

          6.5.2 Testing method

               The following is a method one can use to test  one's  segment
          definition file and hyphenation change table.

               1. Create a file that consists of the example words listed in
                  the hyphenation rules.  Put each word on a separate line.

               2. Duplicate each word so that it appears twice, each copy on
                  a separate line.

               3. Place   a  backslash  character  in  front  of  the  first
                  occurrence  and  insert  hyphens  where  they  should  go.
                  HYPHEN  will  then  treat this as a standard format marker
                  and not as a word.




               4. Insert a space in front of the second word.

               5. Run the file through the HYPHEN program  and  examine  the
                  results.   If  hyphenation has occurred correctly, the two
                  occurrences of the word will line up exactly.

          Here is an example of part of such a test file for Spanish.

                  \o-lla
                   olla
                  \ca-be-za
                   cabeza
                  \re-pri-mir
                   reprimir
                  \co-pla
                   copla
                  \te-cla
                   tecla
                  \res-pi-ro
                   respiro
                  \obs-t'a-cu-lo
                   obst'aculo

          The output would then look like this:

                  \o-lla
                   o-lla
                  \ca-be-za
                   ca-be-za
                  \re-pri-mir
                   re-pri-mir
                  \co-pla
                   co-pla
                  \te-cla
                   te-cla
                  \res-pi-ro
                   res-pi-ro
                  \obs-t'a-cu-lo
                   obs-t'a-cu-lo
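
               Steps 1 through 5 can be mechanized.  The following Python
          sketch builds the test file from a list of correctly hyphenated
          words and then compares each pair of lines in HYPHEN's output.
          The function names are invented for the example, and the sketch
          assumes that the hyphen character in the output is "-", as in
          the listing above.

```python
def make_test_lines(hyphenated_words):
    """Build the test file: expected form as a marker, then the bare word."""
    lines = []
    for word in hyphenated_words:
        lines.append("\\" + word)                  # skipped by HYPHEN as a marker
        lines.append(" " + word.replace("-", ""))  # the word HYPHEN will process
    return lines


def mismatches(output_lines):
    """Return (expected, actual) pairs that do not line up in the output."""
    bad = []
    for expected, actual in zip(output_lines[0::2], output_lines[1::2]):
        # Ignore the first column (backslash or space) when comparing.
        if expected[1:] != actual[1:]:
            bad.append((expected[1:], actual[1:]))
    return bad


print("\n".join(make_test_lines(["o-lla", "ca-be-za"])))
print(mismatches(["\\o-lla", " o-lla", "\\ca-be-za", " cab-e-za"]))
```

          An empty list from mismatches means that every pair lined up.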

          6.5.3 Some change table techniques

               One can use the fact that the hyphenation rules  are  ordered
          to  one's  advantage.  Consider an example from Ticuna, a Peruvian
          jungle language.  The sequence arj needs to be hyphenated as  -arj
          word  finally  and  a-rj  elsewhere  (j  is a vowel).  The segment
          definition file includes the following classes:

                  TIPHYP.SEG Character classes for Ticuna (Peru) hyphenation
          
                  CLASS V
                    e i o u
                  CLASS C
                    b c ch d f g l m n ~n ng p q s t w y
                  CLASS A
                    a
                  CLASS J
                    j
                  CLASS R
                    r
          
          Notice that a, r, and j are in  separate  classes  by  themselves.
          The hyphenation rules include the following changes:

                  TIPHYP.CHG  changes for hyphenation of Ticuna (Peru)
          
                  "ARJ#"  > "-arj#"
                  "A"     > "V"
                  "J"     > "V"
                  "R"     > "C"
                  "VCV"   > "V-CV"
          
          Note  here that the word final exception is treated first.  If the
          sequence arj is not word final, then  the  second  through  fourth
          changes  will  convert the "ARJ" class sequence into a "VCV" class
          sequence.  This allows  the  final  change  to  make  the  correct
          hyphenation.
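
               The effect of the ordering can be traced with a small Python
          sketch that works on class strings only.  It is a simplified
          model resting on two assumptions: matching is case-sensitive, so
          the lowercase "-arj#" is not touched by the later changes, and
          each change is reapplied until it no longer matches.  The class
          strings "#CARJ#" and "#CARJC#" are made-up inputs for a word
          ending in arj and a word with arj followed by a consonant.

```python
# Changes copied from TIPHYP.CHG above, in order.
RULES = [("ARJ#", "-arj#"), ("A", "V"), ("J", "V"), ("R", "C"), ("VCV", "V-CV")]


def apply_rules(class_string):
    """Apply each change in order; earlier changes shield their output."""
    for pattern, replacement in RULES:
        for _ in range(len(class_string)):
            if pattern not in class_string:
                break
            class_string = class_string.replace(pattern, replacement)
    return class_string


print(apply_rules("#CARJ#"))   # word-final arj: only the first change applies
print(apply_rules("#CARJC#"))  # elsewhere: A, J, R become V, V, C, then VCV splits
```

          If the general changes came first, the word-final case would
          already have been rewritten as VCV before the exception could
          match.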



                   7. DELIMITER CHECKING AND NESTING CHECK (DELIM)

          7.1 Introduction

               Delimiters  are  symbols  used  in  pairs to enclose specific
          information.  The  most  common  delimiter  pair  is  parentheses.
          Others are square brackets or curly braces.  DELIM tests whether
          delimiters are paired and properly nested.  The user  may  specify
          the  delimiters  to  be checked; for example, he may wish to check
          the following:

                  ( ) { } " " ` ' [ ] < >

          DELIM reports the errors in such a way that they are easy to find.
          Multiple  files may be checked.  DELIM never changes the file that
          is being checked.

               DELIM is useful for the preparation of any text  which  makes
          use  of  delimiters.   For  example,  many  linguistic papers have
          frequent parentheses, phonetic and  phonemic  bracketing  ([]  and
          //),  and  glosses (`') all of which must be balanced and properly
          nested, for example, [atox] /atuq/  `fox'.   Sometimes  formatting
          programs  (e.g.,  SCRIBE)  and  often programming languages (e.g.,
          PTP, C) require heavy use of delimiters.  (While errors  in  these
          can  sometimes  be  discovered  by  running  the  program, it will
          generally be much quicker to discover the errors  with  DELIM  and
          correct them before running the program.)

          7.2 Running the program

               DELIM begins to run with the following message:

               DELIMITER PAIRING AND NESTING CHECK Version 2.1 (12-Dec-86)
          
                Press <RETURN> to use these delimiters:
                ({["
                )}]"
                Otherwise type delimiter file name:

          If you are satisfied with this list of delimiters, simply  type  a
          carriage return.  Otherwise specify the name of the delimiter file
          that includes the delimiters you want to check.  The form of  such
          a file is discussed in section 7.4.  Next you will be asked for an
          output file:

                  Output file: [con]

          If you simply type a carriage return, the output will  be  put  to
          the  terminal.   If  you  wish to have the output printed directly
          (i.e., without first creating a file on some device), respond with
          prn  (or  however  you refer to your printer).  If you type a file
          name, the result will be written to that file.  Next, by means  of
          the prompt

                  Input file:

          you  are  asked  for  the  file  to  be checked.  Respond with the
          appropriate file name.  When DELIM  finishes  checking  the  first
          file,  it asks for another file to be checked:

                  Next input file: (<RETURN> if no more)

          When there are no more files to be checked, simply type a carriage
          return to return to the monitor.


          7.3 The form of the output

               The output file will contain, for each  file  being  checked,
          its  name, the potential errors found in that file, and the number
          of potential errors found in that file.

               There are two sorts of errors.  First, there might be a right
          delimiter  for  which  there  was  no  previous corresponding left
          delimiter.  For example, if a file started with the line

                  This is a file ] which has an error.

          the error would be reported as follows:

                  unmatched right ] on line 1
                  This is a file ] which has an error.
                                 ^

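          Second, there might be a left delimiter for which there is no
          subsequent corresponding right delimiter.  For example, if a
          15-line file ended with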

                  This is a file { which has an error.

          the error would be reported as follows:

                  unmatched left { on line 15
                  This is a file {
                                 ^
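
               Both kinds of error can be found with a stack of the left
          delimiters that are still open.  The following Python sketch
          shows the idea; it is not DELIM's source, and the function name
          check_delimiters is invented here.  It assumes that a character
          closing the innermost open delimiter is always taken as a right
          delimiter, which is how a pair such as " " (same character on
          both sides) is handled.  It prints only the message lines, not
          the text line and pointer.

```python
def check_delimiters(lines, lefts='({["', rights=')}]"'):
    """Return DELIM-style messages for unpaired delimiters in lines."""
    closer = dict(zip(lefts, rights))
    messages, stack = [], []  # stack holds (left delimiter, line number)
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if stack and ch == closer[stack[-1][0]]:
                stack.pop()  # closes the innermost open delimiter
            elif ch in closer:
                stack.append((ch, number))
            elif ch in rights:
                messages.append(f"unmatched right {ch} on line {number}")
    for ch, number in stack:
        messages.append(f"unmatched left {ch} on line {number}")
    return messages


print(check_delimiters(["This is a file ] which has an error."]))
print(check_delimiters(['[atox] (a "fox")']))
```

          A wrongly nested pair leaves its left delimiter on the stack, so
          one mistake can produce several messages.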





          7.4 How to write a delimiter file

               To  specify  delimiters  other  than  the  defaults,  it   is
          necessary  to  create a delimiter file.  This file contains two or
          three lines.  The optional first line  is  reserved  for  comments
          such  as  "Delim  file  for  XYZ."   The  second  line should list
          (without  intervening  spaces,  commas,   etc.)   all   the   left
          delimiters.   The  third  line should list the corresponding right
          delimiters,  with  each  right  delimiter   directly   below   the
          corresponding  left  delimiter.   For example, the following is an
          acceptable delimiter file (where there is nothing on lines two and
          three other than the delimiters, and all lines end with a carriage
          return):

                  This is a DELIM file for XYZ
                  [{(
                  ]})

               Any character can be  given  as  a  delimiter,  but  note,  a
          delimiter can only be a single character.

               If the last two lines of the delimiter file are not the same
          length, you will be informed when the program runs with the
          message:

                  Delimiter lists are not the same length.
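
               The layout of the delimiter file can be summarized in a few
          lines of Python.  This is a sketch of the format as described
          here, not DELIM's own reader: the function name
          read_delimiter_file is invented, and treating the last two lines
          as the left and right lists is an assumption drawn from the
          wording "the last two lines" above.

```python
def read_delimiter_file(text):
    """Return (left, right) delimiter pairs from the text of a delimiter file."""
    lines = text.splitlines()
    if len(lines) not in (2, 3):
        raise ValueError("expected two or three lines")
    # With three lines the first is a comment; the last two hold the lists.
    lefts, rights = lines[-2], lines[-1]
    if len(lefts) != len(rights):
        raise ValueError("Delimiter lists are not the same length.")
    return list(zip(lefts, rights))


print(read_delimiter_file("This is a DELIM file for XYZ\n[{(\n]})\n"))
```

          Pairing is by column, which is why each right delimiter must sit
          directly below its left delimiter.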

               It   is  possible  (and  sometimes  desirable)  to  give  the
          delimiters to be checked directly from the terminal.  This can  be
          done  by giving the terminal device name in response to the prompt
          (tt: for RT-11, con for MS-DOS) for a delimiter file name,  typing
          the  two  lines of left and right delimiters, and then closing the
          file with a ^Z (control Z).  For example, if one wished  to  check
          only  the  delimiter  pairs  ( )  and [ ], he could respond to the
          prompt for a delimiter file with tt: (on RT-11 systems), then type
          the sequence:

                  ( [ <RETURN> ) ] <RETURN> ^Z


          7.5 Program limitations

               One actual error sometimes causes DELIM to report many errors
          (i.e., errors are  said  to  "cascade").   Thus,  sometimes  error
          messages  subsequent to a real error should simply be disregarded.
          If the real error  is  fixed,  the  subsequent  (erroneous)  error
          messages go away.

               Too  many  unmatched left delimiters (more than approximately
          15) will cause DELIM to terminate with a message beginning  "Stack
          overflow..."  If this happens, control is returned to the monitor.
          Try checking the file with fewer delimiter pairs or  correct  what
          errors you can and rerun the program.

               Delimiters   cannot   span   files,  that  is,  corresponding
          delimiters must be in  the  same  file.   DELIM  does  not  ignore
          delimiters in comments or in quoted strings.  DELIM can check only
          99 pairs of delimiters.

