                        THE STRING-EXTENSIONS LIBRARY
                        =============================

Designed by the Gwydion Project

------------------------------------------------------------------------------

Table of Contents
-----------------

1.  Introduction

2.  The Conversions Module

3.  The Character-type Module

4.  String-hacking

5.  Regular-expressions

6.  Substring-search

7.  Known bugs

------------------------------------------------------------------------------

1.  Introduction
----------------
String-extensions is a library of routines for working with characters and
strings.  String-extensions exports these modules:

    Conversions

    This module consists of various useful conversions involving strings.

    Character-type

    This module is a Dylanized version of the C library ctype.h.

    String-hacking

    This module exports miscellaneous functions and data structures that are
    useful when working with strings and characters.

    Regular-expressions

    This module contains various functions that deal with regular
    expressions (regexps).

    Substring-search

    This module contains methods for searching for fixed substrings rather
    than general regular expressions.

2.  The Conversions Module
--------------------------
The Conversions module consists of various useful conversions involving
strings.  They are:

string-to-integer(string, #key base) => integer   [Function]
integer-to-string(integer, #key base) => string   [Function]
digit-to-integer(character) => integer  [Function]
integer-to-digit(integer) => character  [Function]

        Base defaults to 10, and is the radix for the number system to
        convert from/to.  Bases below 2 are errors, as are bases above 36.
        When converting from a string, the string must exactly describe a
        number, with no excess characters.  Digit-to-integer will signal an
        error if the digit is non-alphanumeric.  Errors will be signalled
        for all invalid input.

as(<string>, character)  [G.F. Method]

        Turns a character into the appropriate string of length one.
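
As an illustration, the conversions above might be used as follows.  This is
a minimal sketch that assumes the enclosing module uses the Conversions
module.  The method name conversions-demo is made up for the example, and the
values in the comments follow from the descriptions above rather than from a
test run.

```dylan
// Hypothetical demo method; assumes the Conversions module is in use.
define method conversions-demo () => ()
  let ten = string-to-integer("1010", base: 2);   // binary "1010" is 10
  let bits = integer-to-string(ten, base: 2);     // back to "1010"
  let seven = digit-to-integer('7');              // 7
  let digit = integer-to-digit(seven);            // '7'
  let single = as(<string>, 'x');                 // "x"
  // string-to-integer("12 apples") would signal an error, because the
  // string must describe a number with no excess characters.
  values()
end method conversions-demo;
```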

3.  The Character-type Module
-----------------------------
Character-type is a Dylanized version of the C library ctype.h.  It contains
the following functions:

------------------------------------------------------------------------
Function and Argument     Returns #t for these characters                 
Type                                                                      
------------------------------------------------------------------------
alpha?(character)         a-zA-Z                                          
digit?(character)         0-9                                             
alphanumeric?(character)  a-zA-Z0-9                                       
whitespace?(character)    Space, tab, newline, formfeed, carriage return  
uppercase?(character)     A-Z                                             
lowercase?(character)     a-z                                             
hex-digit?(character)     0-9a-f                                          
punctuation?(character)   ,./<>?;\:"|'[]{}!@#$%^&*()-=_+`~
graphic?(character)       alphanumeric or punctuation                     
printable?(character)     graphic or whitespace                           
control?(character)       not printable                                   
------------------------------------------------------------------------
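
The predicates in the table can be combined with ordinary iteration.  The
following is a minimal sketch that assumes the Character-type module is in
use; count-digits is a hypothetical helper, not part of the library.

```dylan
// Hypothetical helper: counts the characters of s for which digit? is true.
define method count-digits (s :: <string>) => (count :: <integer>)
  let count = 0;
  for (c in s)
    if (digit?(c))
      count := count + 1;
    end if;
  end for;
  count
end method count-digits;

// count-digits("a1b22") returns 3
```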

4.  String-hacking
------------------
The String-hacking module exports miscellaneous functions and data structures
that are useful when working with strings and characters.

add-last(stretchy-sequence, object) => stretchy-sequence  [Generic Function]
add-last(string, character) => string  [G.F. Method]

        Like add except it's guaranteed to add the character to the end of
        the string.

predecessor(character) => character  [Function]

        Get the character before this character.  Equivalent to

            as(<character>, -1 + as(<integer>, character))

successor(character) => character  [Function]

        Get the character after this character.  Equivalent to

            as(<character>, 1 + as(<integer>, character))

case-insensitive-equal(object1, object2)  [Generic Function]
case-insensitive-equal(string1, string2)  [G.F. Method]
case-insensitive-equal(character1, character2)  [G.F. Method]

        Does a case insensitive equality test.  Methods are provided only
        for strings and characters, not general collections.
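
        A short sketch of the character and comparison functions above,
        assuming the String-hacking module is in use.  The method name is
        made up for the example, and the results in the comments follow
        from the descriptions above.

```dylan
// Hypothetical demo method; assumes the String-hacking module is in use.
define method string-hacking-demo () => ()
  let next = successor('a');                                   // 'b'
  let prev = predecessor('b');                                 // 'a'
  let same-word? = case-insensitive-equal("Dylan", "DYLAN");   // #t
  let same-char? = case-insensitive-equal('q', 'Q');           // #t
  values()
end method string-hacking-demo;
```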

<character-set>  [Sealed Abstract Class]
<case-sensitive-character-set>  [Class]
<case-insensitive-character-set>  [Class]

        A <character-set> is a non-mutable subclass of <collection>, and is
        conceptually an unordered set of characters.  Dylan collection
        elements always have keys, so to fit sets into Dylan, the key of an
        element of a character set is the element itself.  There are two
        instantiable subclasses of <character-set>,
        <case-sensitive-character-set> and <case-insensitive-character-set>.
        <character-set> is not instantiable; one must always specify one of
        the instantiable subclasses when creating a character set.

        There are two ways of making a character set.  The first is a method
        for make using the description: keyword.  The value that follows the
        description: keyword is a string that describes the set using a
        notation like a regular expression character set, except without the
        `[' and `]' delimiters.  For example,

            make(<case-sensitive-character-set>, description: "a-z")

        would be the set of all lowercase alphabetic characters.

        A second way to create character sets is to use an as method.  The as
        method basically takes a collection of characters and discards the
        keys of these characters.  Example:

            as(<case-insensitive-character-set>,
               "abcdefghijklmnopqrstuvwxyz")

        is again the set of all lowercase alphabetic characters.  It is
        important to realize that the as method does not take a description:

            as(<case-sensitive-character-set>, "a-z")

        returns the set of `a', `-', and `z', not the set of all alphabetic
        characters.

        The most useful operation on character sets is member?, which does
        what one would expect.  Another useful operation is the
        forward-iteration-protocol.  This basically calls member? on every
        possible character until it finds a character that is a member of
        the set.  This means that in a <case-insensitive-character-set>,
        both `a' and `A' will come up.
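
        The difference between the two instantiable classes shows up in
        member? tests.  A minimal sketch, assuming the String-hacking
        module is in use (the method name is hypothetical):

```dylan
// Hypothetical demo method; assumes the String-hacking module is in use.
define method character-set-demo () => ()
  let vowels = make(<case-insensitive-character-set>, description: "aeiou");
  let lower = make(<case-sensitive-character-set>, description: "a-z");
  let has-upper-e? = member?('E', vowels);   // #t: case is ignored
  let has-x? = member?('x', vowels);         // #f: not a vowel
  let has-upper-q? = member?('Q', lower);    // #f: case matters here
  values()
end method character-set-demo;
```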

<byte-character-table>   [Class]

        A byte-character-table is a vector that uses byte characters as
        indices instead of integers.  The following are equivalent:

            regular-vector[as(<integer>, character)]
            byte-character-table[character]

        <byte-character-table> has absolutely no relation to <table>.  It is
        simply a <mutable-explicit-key-collection>.

5.  Regular-expressions
-----------------------
The Regular-expressions module contains various functions that deal with
regular expressions (regexps).  The module is based on Perl (version 4), and
has the same semantics unless otherwise noted.  The syntax for Perl-style
regular expressions can be found on page 103 of Programming Perl by Larry
Wall and Randal L. Schwartz.  There are some differences in the way
String-extensions handles regular expressions. The biggest difference is
that regular expressions in Dylan are case insensitive by default.  Also,
when given an invalid regexp, String-extensions will produce undefined
behavior while Perl would give an error message.

There is some work involved in analyzing a regular expression, and if the
same regexp is used repeatedly with different target strings, that analysis
is repeated needlessly.  For this reason, each basic function in the
Regular-expressions module comes with a companion function that makes using
a regular expression more efficient when it is used more than once.  For
example, the regexp-replace function has the make-regexp-replacer companion
function.  There is one exception: the join function has no make-joiner
function.  The "make-fooer" function analyzes the regular expression exactly
once and returns a function that makes use of this pre-analyzed regular
expression.  For example, the following two pieces of code yield the same
result:

            regexp-position("This is a string", "is");

            let is-finder = make-regexp-positioner("is");
            is-finder("This is a string");

However, the second form is more efficient if is-finder is called multiple
times.
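
The saving matters most inside a loop.  The sketch below builds the
positioner once and reuses it for every line; count-matching-lines is a
hypothetical helper, and the code assumes the Regular-expressions module is
in use.

```dylan
// Hypothetical helper: counts the strings in lines that contain "is".
define method count-matching-lines
    (lines :: <sequence>) => (count :: <integer>)
  // The regexp is analyzed once, here, rather than once per line.
  let is-finder = make-regexp-positioner("is");
  let count = 0;
  for (line in lines)
    // Only the first return value (a start index or #f) is tested.
    if (is-finder(line))
      count := count + 1;
    end if;
  end for;
  count
end method count-matching-lines;

// count-matching-lines(#("This is a string", "nothing here")) returns 1
```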

regexp-position  [Generic Function]

            (big-string, regexp, #key start, end, case-sensitive)
            => variable-number-of-marks-or-#f

        This function returns the index of the start of the match of the
        regular expression in the big-string, or #f if the regular
        expression is not found.  As a second value, it returns the index
        of the end of the match in the big-string (assuming it was found;
        otherwise there is no second value).  If there are groups in the
        regular expression, regexp-position will return two more values (a
        start and an end) for each group.  If the group is matched, these
        will be integers; if it is not matched, both the start and the end
        will be #f.  So

            regexp-position("This is a string", "is");

        returns values(2, 4), and

            regexp-position("This is a string", "(is)(.*)ing");

        returns values(2, 16, 2, 4, 4, 13), while

            regexp-position("This is a string", "(not found)(.*)ing");

        returns #f.

        The marks are always given relative to the start of big-string, and
        not relative to the start: keyword.  Start: and end: specify what
        part of big-string to look at, and they default to the beginning
        and end of the string, respectively.  Case-sensitive defaults to
        false.
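
        The marks can be bound with a multiple-value let and then used to
        extract the matched text.  This is a minimal sketch using the second
        example above; group-demo is a hypothetical name, and copy-sequence
        is the standard Dylan function.

```dylan
// Hypothetical demo method; assumes the Regular-expressions module is in use.
define method group-demo () => (middle :: <string>)
  let text = "This is a string";
  // Six values come back: the whole match, then one pair per group.
  let (match-start, match-end, is-start, is-end, rest-start, rest-end)
    = regexp-position(text, "(is)(.*)ing");
  // rest-start is 4 and rest-end is 13, so the result is " is a str".
  copy-sequence(text, start: rest-start, end: rest-end)
end method group-demo;
```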

make-regexp-positioner  [Generic Function]

            (regexp, #key byte-characters-only, need-marks, maximum-compile,
            case-sensitive)
            => an anonymous positioner
            method (big-string, #key start, end)

        Make-regexp-positioner can return several different kinds of
        positioners, and it is up to the caller to specify which kind is
        wanted.  By default, it returns a positioner that works like
        regexp-position.  However, if need-marks is #f, it may return a
        positioner that returns only #t or #f, with no marks (although it
        may still return marks).  If byte-characters-only is specified, the
        positioner may work only on big-strings that consist entirely of
        byte characters (characters whose numerical value is between 0 and
        255, inclusive).  And if maximum-compile is #t, it will take a long
        time to return a positioner, but the positioner itself will run
        very fast.
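
        When only a yes-or-no answer is needed, the marks can be declined.
        The sketch below is an assumption-laden example: $digit-finder and
        contains-digit? are hypothetical names, and the code tests only the
        first return value, since the positioner may or may not return
        marks.

```dylan
// Hypothetical constant: built once, when the definition is evaluated.
define constant $digit-finder
  = make-regexp-positioner("[0-9]", need-marks: #f);

// Hypothetical helper: true if s contains at least one decimal digit.
define method contains-digit? (s :: <string>) => (found? :: <boolean>)
  if ($digit-finder(s))
    #t
  else
    #f
  end if
end method contains-digit?;
```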

regexp-replace  [Generic Function]

            (big-string, search-for-regexp, replace-with-string, #key count,
            case-sensitive, start, end)
            => new-string

        This replaces all occurrences of regexp in big-string with
        replace-with-string.  If count: is specified, it replaces only the
        first count occurrences of regexp.  (This is different from Perl,
        which replaces only the first occurrence unless /g is specified.)
        Replace-with-string can contain backreferences to the regexp.  For
        instance,

            regexp-replace("The rain in spain and some other text",
                           "the (.*) in (\\w*\\b)", "\\2 has its \\1")

        returns "spain has its rain and some other text".  If the subgroup
        referred to by the backreference was not matched, the reference is
        interpreted as the null string.  For instance,

            regexp-replace("Hi there", "Hi there(, Bert)?",
                           "What do you think\\1?")

        returns "What do you think?" because ", Bert" wasn't found.
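
        The effect of count: can be seen by running the same replacement
        with and without it.  A minimal sketch (the method name is
        hypothetical; the results in the comments follow from the
        description above):

```dylan
// Hypothetical demo method; assumes the Regular-expressions module is in use.
define method replace-demo () => ()
  // Without count:, every "-" is replaced.
  let replaced-all = regexp-replace("a-b-c-d", "-", "+");   // "a+b+c+d"
  // With count: 2, only the first two are replaced.
  let replaced-two
    = regexp-replace("a-b-c-d", "-", "+", count: 2);        // "a+b+c-d"
  values()
end method replace-demo;
```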

make-regexp-replacer  [Generic Function]

            (regexp, #key replace-with, case-sensitive)
            => an anonymous replacer function that is either
            method (big-string, #key count, start, end)
            or
            method (big-string, replace-string, #key count, start, end)

        The first form is returned if the replace-with: keyword is
        supplied; otherwise the second form, which takes the replacement
        string as an argument, is returned.  (There is no efficiency gained
        by supplying the replace-with string, but it might be convenient.)

translate  [Generic Function]

            (big-string, from-string, to-string, #key delete, start, end)
            => new-string

        This is equivalent to Perl's tr/// construct.  From-string is a
        string specification of a character set, and to-string is another
        character set.  Translate converts big-string character by
        character, according to the sets.  For instance,

            translate("any string", "a-z", "A-Z")

        will convert "any string" to all uppercase: "ANY STRING".

        As in Perl, character ranges are not allowed to be "backwards".
        The following is not legal:

            translate("any string", "a-z", "z-a")

        (This restriction may be removed in future releases.)  Unlike Perl's
        tr///, translate doesn't return the number of characters translated.

        If delete: is true, any characters in the from-string that don't
        have matching characters in the to-string are deleted.  The
        following will remove all vowels from a string and convert periods
        to commas:

            translate("any string", ".aeiou", ",", delete: #t)

        Delete: is false by default.  If delete: is false and there aren't
        enough characters in the to-string, the last character in the
        to-string is reused as many times as necessary.  The following
        converts several punctuation characters into spaces:

            translate("any string", ",./:;[]{}()", " ");

        Start: and end: indicate which part of the string to translate.
        They default to the entire string.

        Caveats:  Translate is always case sensitive.

translate  [G.F. Method]

            (big-byte-string, from-byte-string, to-byte-string, #key delete,
            start, end)
            => new-string

        The only method of translate operates only on byte strings.

make-translator  [Generic Function]

            (from-string, to-string, #key delete)
            => an anonymous translator
            method (big-string, #key start, end) => new-string

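        Returns a function that translates its big-string argument the way
        translate does, with from-string, to-string, and delete: fixed in
        advance.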

make-translator  [G.F. Method]

            (from-byte-string, to-byte-string, #key delete)
            => an anonymous translator
            method (big-string, #key start, end) => new-byte-string

        Again, the existing method on make-translator only handles byte
        strings.
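
        A translator is handy when the same mapping is applied to many
        strings.  A minimal sketch, reusing the uppercase mapping from the
        translate example (the method and variable names are hypothetical):

```dylan
// Hypothetical demo method; assumes the Regular-expressions module is in use.
define method translator-demo () => ()
  // The character sets are processed once, here.
  let upcase-ascii = make-translator("a-z", "A-Z");
  let first = upcase-ascii("any string");     // "ANY STRING"
  let second = upcase-ascii("another one");   // "ANOTHER ONE"
  values()
end method translator-demo;
```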

split  [Generic Function]

            (regexp, big-string, #key count, remove-empty-items,
            case-sensitive, start, end)
            => a variable number of strings

        This is like Perl's split function.  It searches big-string for
        occurrences of regexp, and returns substrings that were delimited by
        that regexp.  For instance,

            split("-", "long-dylan-identifier")

        returns values("long", "dylan", "identifier").  Note that what
        matched the regexp is left out.  Remove-empty-items, which defaults
        to true, skips over empty items, so that

            split("-", "long--with--multiple-dashes")

        returns values("long", "with", "multiple", "dashes").

        Count is the maximum number of strings to return.  If there are n
        strings and count is specified, the first count - 1 strings are
        returned as usual, and the count'th string is the remainder,
        unsplit.  So

            split("-", "really-long-dylan-identifier", count: 3)

        returns values("really", "long", "dylan-identifier").  If
        remove-empty-items is true, empty items aren't counted.

        Case-sensitive determines whether the regexp for the delimiter
        should be considered case sensitive or not; it defaults to
        case-insensitive.

        Start: and end: indicate what part of the big string should be
        looked at for delimiters.  They default to the entire string.  For
        instance,

            split("-", "really-long-dylan-identifier", start: 8)

        returns values("really-long", "dylan", "identifier").

        Caveat: Unlike Perl, empty regular expressions are never legal
        regular expressions, so there is no way to split a string into a
        #rest sequence-of-characters.  Of course, in Dylan this is not a
        useful thing to do, so this is not really a problem.
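
        Because split returns multiple values rather than a sequence, the
        results are usually bound with a multiple-value let.  A minimal
        sketch (the method and variable names are hypothetical):

```dylan
// Hypothetical demo method; assumes the Regular-expressions module is in use.
define method split-demo () => ()
  // Bind the first two pieces to names...
  let (first-part, second-part) = split(":", "key:value");
  // first-part is "key" and second-part is "value".

  // ...or gather every piece into one sequence with #rest.
  let (#rest pieces) = split("-", "long-dylan-identifier");
  // pieces holds "long", "dylan", "identifier".
  values()
end method split-demo;
```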

make-splitter  [Generic Function]

            (pattern :: <string>, #key case-sensitive)
            => an anonymous splitter
            method (big-string, #key count, remove-empty-items, start, end)
            => buncha-strings

        Returns a function that splits its big-string argument the way
        split does, using a pattern that has been analyzed in advance.
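
        For example, a splitter for colon-separated records might be built
        once and then applied to each record.  This is a sketch; the names
        are hypothetical and the result follows from the description of
        split.

```dylan
// Hypothetical demo method; assumes the Regular-expressions module is in use.
define method splitter-demo () => (fields :: <sequence>)
  let split-on-colon = make-splitter(":");
  let (#rest fields) = split-on-colon("root:x:0:0");
  // fields holds "root", "x", "0", "0".
  fields
end method splitter-demo;
```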

join  [Generic Function]

            (delimiter :: <string>, #rest strings) => big-string

        Does the opposite of a split.

            join(":", word1, word2, word3)

        is equivalent to

            concatenate(word1, ":", word2, ":", word3)

        (and no more efficient).  Note that there is no make-joiner.
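
        Since join takes its strings as #rest arguments, a sequence of
        strings has to be passed with apply.  The sketch below splits on
        one delimiter and rejoins with another; dashes-to-underscores is a
        hypothetical helper.

```dylan
// Hypothetical helper: replaces "-" separators with "_" via split and join.
define method dashes-to-underscores
    (name :: <string>) => (result :: <string>)
  let (#rest parts) = split("-", name);
  // apply spreads the collected parts out as join's #rest arguments.
  apply(join, "_", parts)
end method dashes-to-underscores;

// dashes-to-underscores("long-dylan-identifier")
// returns "long_dylan_identifier"
```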

6.  Substring-search
--------------------
Substring-search contains methods for searching for fixed substrings rather
than general regular expressions.  It is as similar to the regular
expression module as we could make it.  Substring functions work only on
byte strings, and are always case sensitive.  These functions were taken
from the Collection-extensions library shipped in Mindy 1.1, but the
parameters, keywords, and return values have changed significantly since
then.

substring-position  [Generic Function]

            (big-string, search-for-string, #key start, end)
            => position-or-false;

        Returns the position of the search-for-string in the big-string (or
        that portion of the big-string specified by start: and end:).  This
        search is always case sensitive.
        This function uses the Boyer-Moore algorithm for long strings, and a
        simple dumb search for short strings.  It should yield good
        performance under all circumstances.
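
        A short sketch of the return values, assuming the Substring-search
        module is in use (the method name is hypothetical):

```dylan
// Hypothetical demo method; assumes the Substring-search module is in use.
define method substring-demo () => ()
  let found = substring-position("This is a string", "is");     // 2
  // The search is case sensitive, so an uppercase pattern is not found.
  let missing = substring-position("This is a string", "IS");   // #f
  values()
end method substring-demo;
```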

make-substring-positioner  [Generic Function]

            (search-for-string) => an anonymous positioner
            method (big-string, #key start, end) => position-or-false

        Returns a function that searches its big-string argument for
        search-for-string the way substring-position does.

substring-replace  [Generic Function]

            (big-string, search-for-string, replace-with-string, #key count,
            start, end)
            => replaced-string

        Replaces the substring, or the first count instances of it if count:
        is specified.  Note that, although they appear in the parameter
        list, start: and end: are not supported by this function.
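
        Because the search string is taken literally, characters that are
        special in regular expressions need no escaping.  A minimal sketch
        (the method name is hypothetical; the result follows from the
        description of count: above):

```dylan
// Hypothetical demo method; assumes the Substring-search module is in use.
define method substring-replace-demo () => (result :: <string>)
  // "." is an ordinary character here, not "any character" as in a regexp.
  // With count: 1, only the first "." is replaced, giving "a-b.c".
  substring-replace("a.b.c", ".", "-", count: 1)
end method substring-replace-demo;
```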

make-substring-replacer   [Generic Function]

            (search-for :: <byte-string>, #key replace-with)
            => an anonymous replacer function that is either
            method (big-string, #key count, start, end) => new-string
            or
            method (big-string, replace-with-string, #key count, start, end)

        Returns a function that performs replacements the way
        substring-replace does, with search-for fixed in advance.

7.  Known bugs
--------------
Regular-expressions will do unpredictable things if given bad arguments
(i.e., a string that isn't a legal regular expression).  Sometimes it'll
crash, and sometimes it'll merrily chug away and return crazy answers.

The regexp parser will happily accept a "quantified assertion," which isn't
technically a legal regexp.  However, both regular and compiled matching
will handle it as one intuitively thinks it should be handled.  (An example
of a quantified assertion would be "^*", which matches "any number of
beginning of line".  Since "*" means "0 or more", "^*" is interpreted to
mean "", which is how one would intuitively believe it should be
interpreted.)

