Projects

John C. Paolillo

Active

Readers of my recent book (Analyzing Linguistic Variation: Statistical Models and Methods, CSLI Publications, Stanford, California) may want to know where to find more information about VARBRUL or about the analyses in the book. I have tried to place as much of that information as possible in this location.

R-Varb

This project seeks to re-implement VARBRUL, or something much like it, using R, a programming language and environment for statistical computation. The advantages of this implementation are: (i) many maintenance issues of VARBRUL become moot, since R (and hence R-Varb) runs on a much broader range of platforms than any prior version of VARBRUL, is regularly updated, and has a large user community contributing new code and bug fixes; (ii) R provides many other statistical functions, making additional statistical software packages, with their high ratio of price to power, unnecessary; (iii) R-Varb is easier to update and improve than existing versions of VARBRUL.

The R Language is released under the terms of the GNU General Public License, meaning that everyone who uses it has full access to its source code (this has not been true of VARBRUL, sensu stricto, although its developers have generally been willing to share their source code when asked).

R is patterned after S, a similar language/programming environment first developed by AT&T. For information about programming in the S (or R) programming language, several books are available; I got mine from Amazon.com. There is a page here describing these books and what they are useful for.

Indexing using R

Reduced vector-space analyses for information retrieval (IR), such as Latent Semantic Analysis, have become quite popular and now even find their way into commercial applications, such as email filtering (e.g. Apple's new emailer for OS X). The best ways to construct and use such indices are not yet well understood. This project assembles a set of resources for constructing and exploring reduced vector-space models using R, a programming language and environment for statistical computation. These resources can be applied to a number of other corpus linguistic problems that employ similar analysis techniques.
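To give a sense of what a reduced vector-space index involves, here is a minimal sketch. It is written in Python rather than R, it is not part of the project's own resources, and every function name and the toy corpus are invented for illustration. It builds a term-by-document count matrix, finds the top k eigenvectors of the document Gram matrix by simple subspace iteration (a stand-in for a library singular value decomposition), and compares documents in the resulting k-dimensional space.

```python
import math
import random


def term_document_matrix(docs):
    """Return a count matrix with one row per term and one column per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.split():
            matrix[index[w]][j] += 1.0
    return matrix


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram(matrix):
    """Document-by-document matrix of dot products (A transposed times A)."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def orthonormalize(vectors):
    """Modified Gram-Schmidt: return an orthonormal basis for the vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            proj = dot(v, b)
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([x / norm for x in v])
    return basis


def top_eigenvectors(sym, k, iterations=200, seed=0):
    """Approximate the k leading eigenpairs of a symmetric matrix
    by subspace (orthogonal) iteration from a seeded random start."""
    rng = random.Random(seed)
    n = len(sym)
    vecs = orthonormalize([[rng.random() for _ in range(n)] for _ in range(k)])
    for _ in range(iterations):
        vecs = orthonormalize([[dot(row, v) for row in sym] for v in vecs])
    # Rayleigh quotients estimate the eigenvalues (squared singular values).
    values = [dot(v, [dot(row, v) for row in sym]) for v in vecs]
    return values, vecs


def lsa_document_vectors(docs, k):
    """Coordinates of each document in the k-dimensional reduced space."""
    values, vecs = top_eigenvectors(gram(term_document_matrix(docs)), k)
    return [[math.sqrt(max(val, 0.0)) * vec[j] for val, vec in zip(values, vecs)]
            for j in range(len(docs))]


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Invented toy corpus: documents 0 and 2 share no words, but both overlap
# with document 1; documents 3 to 5 form a second, unrelated chain.
docs = ["cat pet", "pet dog", "dog puppy",
        "stock market", "market trade", "trade price"]
vectors = lsa_document_vectors(docs, k=2)
print(cosine(vectors[0], vectors[2]))  # should be close to 1
print(cosine(vectors[0], vectors[3]))  # should be close to 0
```

In the raw count space, documents 0 and 2 have a cosine similarity of zero because they have no words in common; in the reduced space they come out as near neighbours. A real index would use weighted counts and a proper sparse decomposition routine, and choosing k well is exactly the kind of question that is not yet well understood.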

South Asian Text Encodings

South Asian languages use a number of different alphabetic writing systems, most of which are based on Brahmi script. Nonetheless, the diversity of traditions associated with these languages, along with the opportunistic conversion of scripts to font-based encodings for web use, has led to a large number of different encodings for South Asian languages. Sometimes as many as four different encodings exist for the same language, in competition with one another. While these encodings can happily exist side by side in different display environments, they engender numerous problems for indexing and searching text. This project seeks to develop a protocol for declaring and using different encodings for the same or related languages. This protocol may be implemented in browsers and in search engines, to provide a uniform mechanism for searching and working with South Asian text, regardless of its original encoding.
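The general idea can be sketched as follows, in Python. This is only an illustration under stated assumptions, not the protocol itself: the encoding names "font-a" and "font-b", their mapping tables, and the function names are all hypothetical. Each document declares its encoding, each declared encoding supplies a table that converts its codes to Unicode, and searching is done on the converted text.

```python
import unicodedata

# Hypothetical encoding declarations. Each table maps the character codes
# used by an invented font-based encoding to Unicode Devanagari letters.
# Real font encodings are messier (glyph reordering, ligatures, and so on).
ENCODING_DECLARATIONS = {
    "unicode": {},  # already Unicode, so nothing to map
    "font-a": {"k": "\u0915", "m": "\u092E", "l": "\u0932"},
    "font-b": {"K": "\u0915", "M": "\u092E", "L": "\u0932"},
}


def to_unicode(text, encoding):
    """Convert text in a declared encoding to normalized Unicode."""
    table = ENCODING_DECLARATIONS[encoding]  # KeyError if never declared
    converted = "".join(table.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", converted)


def search(query, query_encoding, documents):
    """Return ids of documents containing the query, whatever their encoding.

    documents is a list of (doc_id, declared_encoding, text) tuples.
    """
    needle = to_unicode(query, query_encoding)
    return [doc_id for doc_id, encoding, text in documents
            if needle in to_unicode(text, encoding)]


# The same three-letter word stored three different ways, plus a non-match.
documents = [
    ("d1", "font-a", "kml"),
    ("d2", "font-b", "KML"),
    ("d3", "unicode", "\u0915\u092E\u0932"),
    ("d4", "font-a", "lmk"),
]
print(search("kml", "font-a", documents))
```

The design choice worth noting is that the declaration travels with the text: the search code never guesses an encoding, it only looks up what each document declares.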

Lexical Resources

I'm working on developing some lexical resources for natural language processing in South Asian Languages, such as Sinhala, Pali, Hindi, etc., using publicly available texts and resources. More on this as the project progresses.

Inactive

The remaining projects here are more or less on the back burner, or are work I undertook on the way to some other goal. As such, most of them are unfinished, although I will probably get around to doing something with some of them eventually.

I'm posting these things here partly to have a place to put things that I know will be useful to the one or two people who need them, and partly on the off-chance that someone else might benefit from them too.

There are several different types of things here:

Programming projects using Prograph, a fully visual programming language

Programming projects for teaching in Prolog and Icon, two irascibly different programming languages

And a few other miscellaneous things.

Prograph

SuperConc, a concordancing program inspired by the Macintosh program Conc, but more flexible and editable. Basically, SuperConc is intended to be a linguist's workbench, maybe even a lightweight LinguaLinks. I work on it whenever I can, because I still don't see anything like it, because I think it is needed, and because I know it can be done. Once ordinary linguists have a tool like this to use, things will not be so ordinary any more.

DrawMD, a Macintosh program written using Prograph Classic, for drawing in n-dimensional binary-valued spaces, projected onto a two-dimensional plane. I wrote this program specifically to draw some diagrams for an article I wrote, so the program has some bugs.

Prolog

bupparse.pro, an example program for teaching the principles of bottom-up parsing using BUP.

Uther.pro, another example program for teaching principles of Unification Grammar. This version allows the use of path specifications as well as a prioritized form of unification (for lexical templates, not yet used in the example grammar).

Icon

Featbox is a program I wrote to handle the typesetting of attribute-value matrices, such as those used in Head-Driven Phrase Structure Grammar. The program was designed to work as a clipboard filter for MS Word versions 4 and 5 on Macintosh computers, which supported a mathematical formula typesetting feature that used in-line escape sequences to control the display of special characters, braces, and ordinary text. Since I used this feature fairly heavily in my dissertation and in certain papers I wrote, I needed a way to produce the appropriate escape-code syntax starting from something a bit more readable. The program uses a context-free grammar to parse a selection of text into the appropriately structured escape codes. The resulting text, when pasted into MS Word, including current versions, will be automatically typeset in the appropriate way.

Miscellaneous

I have an Excel spreadsheet that goes with an article I often use to teach about connectionism, namely John Goldsmith (1992) "Local modeling in Phonology". In Steven Davis, ed. Connectionism: Theory and Practice, 229-246. Oxford: Oxford University Press.