When the variationist research paradigm was first developed, logistic regression was not commonly used in statistical analyses, even though the technique had been developed substantially earlier. Since a somewhat complex iterative series of calculations must be performed to estimate a logistic regression model, special-purpose software had to be written to accomplish this. The result was the VARBRUL program, originally written in FORTRAN by David Sankoff. Later versions were implemented by Sankoff and Rousseau (1978), Pintzuk (1982), Guy and Lipa (1986), Sankoff and Rand (1988, 1992), and Robinson, Lawrence and Tagliamonte (2001).

Statistical software packages for logistic regression emerged later, but it is often not obvious how to conduct VARBRUL-style analyses using them. For this reason, software is an important consideration in conducting variationist analysis and there is a strong tendency to favor VARBRUL in linguistic work.

Fortunately, most variationist analyses can be run with software that is freely available, although the data input requirements, which are somewhat particular for VARBRUL, need to be observed carefully. VARBRUL itself has a somewhat arcane feel and takes some getting used to. Other programs make use of a more familiar spreadsheet interface.

Versions of VARBRUL

Versions are listed roughly from most recent to least recent.

  • GoldVarb X. Sankoff D., Tagliamonte S.A., & Smith E. (2005). This is the most recent version of VARBRUL as of this writing. It is mainly a re-implementation of Rand and Sankoff's 1992 version (GoldVarb 2). It adds some minor new features, but retains all earlier functionality. For the first time, parallel versions of GoldVarb run under both Macintosh and Windows environments. To the extent that GoldVarb X employs the same cell file structures as earlier versions of GoldVarb, it will share their data size and accuracy limitations.
  • R-Varb. John C. Paolillo (2002). The R-Varb project aims to address some of the maintenance problems of VARBRUL by re-implementing its functions in the R programming language and statistical computing environment. Currently, cell and token files can be read and analyzed, and stepwise and one-level regression are supported. Conditions file support is in the process of being implemented (this is the most difficult part). R is implemented on many popular platforms, so cross-platform compatibility is addressed. R-Varb will eventually be released as a package for R.
  • Goldvarb 2001 (Windows). J.S. Robinson, H.R. Lawrence & S.A. Tagliamonte (2001). This is the first version of Varbrul written for Windows (as opposed to DOS), for which reason it is still used to some extent. It is mainly a re-implementation of Rand and Sankoff's 1992 version of GoldVarb.
  • GoldVarb 2.1 (Macintosh). David Rand and David Sankoff (1992). This version provides a GUI for the traditional VARBRUL routines, which are treated as text filters (input → filter → output). Since all data must go through a cell file for an analysis to be run, the format of the cell file limits the kind of data that can be used: the count of any variant cannot exceed 9999 (it must fit in a field four characters wide). Larger values are truncated. Another accuracy issue concerns the size of integers: if any total exceeds the largest signed 16-bit integer (>32767), an overflow results and a negative value is obtained. This also produces incorrect results, and the output of the analysis is meaningless. Version 1.6, the first release, appeared in 1988. A French version of the information about GoldVarb 2.1 is also available. The name GoldVarb is apparently a pun on the name of computer scientist (Lev) Goldfarb.
  • PC-VARB (MS-DOS). Susan Pintzuk, David Sankoff (1982). This version was the standard version used on the PC and compatible computers from 1982-2000. It is a set of command-line programs that run under MS-DOS; similar versions ran on the VAX system at the University of Pennsylvania. The different programs separately perform functions that were later integrated into GoldVarb as menu commands. For those with older computers or other limitations, it may still be serviceable.
  • VARBRUL 3M. Pascale Rousseau (1978). This program was written to extend the functions of VARBRUL 2S. It is highly sophisticated, and a number of publications by Rousseau and Sankoff in the 1970s detail many of its features. It is not widely available or used at present. I will post a copy of the source once I obtain permission.
  • MacVarb. Greg Guy and William Lipa (1987). This early MacOS version is really a GUI over VARBRUL 2S, which was hacked to compile in a FORTRAN then available for the Mac. As a GUI, it had several advantages over GoldVarb, particularly a simplified recoding interface, but it aged badly with changes in the OS. The GUI was programmed in Think Pascal, and as of MacOS System 7 (e.g. 7.1.2) MacVarb would no longer run reliably, as the FORTRAN compiler apparently used Toolbox calls that were eventually phased out. Data file formats, particularly cell files, changed between VARBRUL 2S and the later GoldVarb/PC-VARB versions.
  • VARBRUL 2S. David Sankoff (1972). A commented version of David Sankoff's VARBRUL 2S, a program for conducting logistic regression analysis using iterative proportional fitting, written in the mid 1970s and used by many variationist sociolinguists. This version was provided by Gregory Guy, and was used in his MacVarb program. It is identical in most of its characteristics to the version that appears in the appendix of Shana Poplack's 1978 University of Pennsylvania dissertation. I added my comments in February 1999. They are clearly distinguished notationally from the earlier comments. Modern FORTRAN compilers will not accept this source code, and I have not found what is causing the error(s) (it appears to be syntactically correct). As my comments make clear, the body of the program is coherent, and can be used to understand what the program does and how it does it.
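The cell-file limits described above for GoldVarb 2.1 can be checked before an analysis is run. The following is a minimal sketch in Python and is not part of any VARBRUL release. The function names and the example counts are hypothetical, and the wraparound shown assumes ordinary two's-complement 16-bit arithmetic. It only flags counts that do not fit, since the exact way a given version truncates them is not documented here.

```python
FIELD_WIDTH = 4                      # characters available per count in a cell file
MAX_COUNT = 10 ** FIELD_WIDTH - 1    # 9999, the largest count that fits
INT16_MAX = 2 ** 15 - 1              # 32767, the largest signed 16-bit integer


def wrap_int16(value):
    """Return value as a signed 16-bit integer would store it (two's complement)."""
    return (value + 2 ** 15) % 2 ** 16 - 2 ** 15


def check_cell(counts):
    """Return a list of warnings for one cell's variant counts (hypothetical helper)."""
    warnings = []
    for i, n in enumerate(counts):
        if n > MAX_COUNT:
            warnings.append(
                f"variant {i}: count {n} does not fit in {FIELD_WIDTH} characters"
            )
    total = sum(counts)
    if total > INT16_MAX:
        warnings.append(
            f"total {total} overflows 16 bits and would be stored as {wrap_int16(total)}"
        )
    return warnings


if __name__ == "__main__":
    print(check_cell([120, 45]))         # small cell: no warnings expected
    for message in check_cell([12000, 21000]):
        print(message)                   # both limits are exceeded here
```

Running a check of this kind over every cell before building the cell file makes silent truncation or overflow visible, rather than leaving it to be discovered in a meaningless analysis.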

Problems with Varbrul

  • Data Management. All versions of Varbrul lack certain features for data management. My experience is that GoldVarb is best used in conjunction with database programs (e.g. Claris Filemaker or Microsoft Access) or spreadsheet programs (e.g. Microsoft Excel) for data management. For greatest convenience, the database or spreadsheet should be able to export the data as a text-only file without delimiters, formatted one token (observation) per line. This can be accomplished if each observation is on a separate row in MS Excel, or by using a database view that shows one observation per record. GoldSearch is a command-line program authored by David Boas, Miriam Meyerhoff, and Naomi Nagy (2002) that permits modularization of some of the data in VARBRUL in a manner similar to relational databases.
  • Accuracy problems arise from the formats of the data files and number representations used. While newer versions of GoldVarb solve most of the numerical accuracy problems by using modern compilers, there are still problems with some data formats. Most important of these is that of cell files, which permit only four single-byte digit characters in the value fields for each variant of a variable. This places an upper limit of 9999 tokens of each variant, with larger values being truncated.
  • Iterative Proportional Fitting is used for estimation, placing limitations on the nature of the data types used. Continuous variables are not provided for. In addition, IPF converges badly when interactions are present in the data, making testing of some interactions difficult.
  • Multinomial dependent variables can only be used in VARBRUL 3M and PC-VARB, which are not integrated into current program releases.
  • Factor and factor group names are limited to a single character. Provisions for documentation of analyses are weak.
  • Database integration is not provided for by current versions. Complex data manipulations are required to move data from databases, as needed for large-scale studies, into the forms required for VARBRUL programs. Moreover, none of VARBRUL's data formats resembles an ordinary spreadsheet. Hence, off-the-shelf software is difficult to use with VARBRUL, where it would clearly make sense for data entry and editing.
  • File formats: conditions files and cell file formats are difficult to understand. Hand entry into these forms is sometimes necessary, but difficult.
  • Continued development of VARBRUL is uncertain. Changes in operating systems often make it impossible for researchers to use VARBRUL on current platforms. Few in the VARBRUL user community are programmers of sufficient skill to maintain VARBRUL and steadily improve its functionality.
  • Improvements in logistic regression analysis, diagnostics, etc. have been made since VARBRUL was designed. Some of these pertain to "core" functions of VARBRUL such as step-up/step-down analysis. It is unlikely that VARBRUL development will ever keep pace with such improvements.
  • Logistic regression is the only analysis performed by VARBRUL, but many variationist analyses could require other statistical techniques.
  • VARBRUL analysis is not customizable. Sometimes it would be useful to automate other routines for analysis. Currently there is no way to do this.
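The export step described under Data Management can be sketched as follows: rows from a spreadsheet saved as CSV are turned into an undelimited text file with one token per line. The column names, the code tables, the column order, and the sample data are all hypothetical. The one-character codes reflect the single-character limit on factor names noted in the list above, and the exact token layout a given VARBRUL version expects should be checked against its own documentation.

```python
import csv
import io

# Hypothetical mapping from spreadsheet values to one-character factor codes.
CODES = {
    "variant": {"deleted": "1", "retained": "0"},
    "following": {"consonant": "c", "vowel": "v", "pause": "p"},
    "style": {"casual": "k", "formal": "f"},
}
# Assumed column order for the token string (dependent variable first).
COLUMNS = ["variant", "following", "style"]


def row_to_token(row):
    """Concatenate the one-character codes for a single observation."""
    return "".join(CODES[col][row[col]] for col in COLUMNS)


def csv_to_tokens(csv_text):
    """Convert CSV text (one observation per row) to a list of token lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_token(row) for row in reader]


if __name__ == "__main__":
    sample = (
        "variant,following,style\n"
        "deleted,consonant,casual\n"
        "retained,vowel,formal\n"
        "deleted,pause,casual\n"
    )
    # One undelimited token per line, ready to be written to a text file.
    print("\n".join(csv_to_tokens(sample)))
```

Keeping the code tables in one place also documents the recoding, which partly compensates for the weak documentation facilities mentioned above. A value missing from the tables raises a KeyError, so data-entry errors surface at export time.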

To address these problems, it is necessary to leave the VARBRUL family of programs, in favor of other software that performs logistic regression.
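As a rough illustration of what such software does, the sketch below fits a logistic regression by Newton-Raphson iteration (also known as iteratively reweighted least squares), a method commonly used by general-purpose packages. It is not the iterative proportional fitting used by VARBRUL. The data, function names, iteration limit, and tolerance are hypothetical, and the example includes a continuous predictor (age in decades), which VARBRUL does not provide for.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_logistic(rows, outcomes, max_iter=25, tol=1e-8):
    """Estimate coefficients by Newton-Raphson; each row starts with 1.0 for the intercept."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k                                # score vector
        hess = [[0.0] * k for _ in range(k)]            # information matrix
        for x, y in zip(rows, outcomes):
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:             # stop once updates are negligible
            break
    return beta


if __name__ == "__main__":
    # Hypothetical tokens: (formal style 0/1, age in decades, variant applied 0/1).
    data = [(0, 2, 1), (0, 3, 1), (0, 4, 0), (0, 5, 1), (0, 6, 0), (0, 7, 0),
            (1, 2, 1), (1, 3, 0), (1, 4, 1), (1, 5, 0), (1, 6, 0), (1, 7, 0)]
    rows = [[1.0, float(style), float(age)] for style, age, _ in data]
    outcomes = [y for _, _, y in data]
    for name, b in zip(["intercept", "formal", "age"], fit_logistic(rows, outcomes)):
        print(f"{name}: {b:.3f}")
```

The coefficients are on the log-odds scale. In practice one would rely on an established package rather than hand-written code, which is shown here only to make the iterative calculation concrete.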

Other Software

Researchers conducting variationist linguistic research should seriously consider alternatives to VARBRUL. Considerations that weigh in one direction or the other include individual expertise, the form in which existing data is available, the available computing platforms, the nature of the other analyses to be conducted, and the intended audience of the research, among others. Most technical considerations are trivial in nature, in spite of what we may be led to think. For reasons of audience

  • GLMStat. K. J. Beath. GLMStat is a very easy-to-use program for running Generalized Linear Models, including log-linear and logistic regression models. It uses a standard spreadsheet interface and dialog boxes for specifying the model etc. It is intuitive and easy to learn, especially if you know a spreadsheet program already. It is a Macintosh-only program, but it is frequently maintained. It is also quite modestly priced.
  • R. R Core Team. R is a statistical programming language and environment. What this means is that R provides a language (syntax, semantics) for writing programs in which it is easy to state the mathematical and combinatoric operations needed for statistical analysis, alongside an interactive (command-line) environment in which numerous statistical programs already reside. Hence, whatever R provides can be easily extended. Existing packages cover a broader range of statistical analyses than nearly any other statistical package, commercial or free. GLMs including logistic regression and log-linear modeling are well-supported. There is a growing number of books on R, and users may benefit from the large number of books on S as well.
  • SPSS
  • SAS
