Automated Learning of Phonology and Morphology
Adam Albright and Bruce Hayes
Department of Linguistics, UCLA
Phonologists seek to model human phonological competence, specifically the ability of humans to develop elaborate rule systems on the basis of extended--but haphazard--exposure to input data.
Most such modeling has adopted a rather idealized approach: our theories are used to analyze languages, under the view that the theoretical devices that make possible cogent language-particular analyses of difficult data patterns would, in principle, also serve human learners well in their task.
But perhaps one could also proceed much more directly: incorporate a model of phonological competence into an automated learner, which would discover the principles of the phonology on its own, as the child must. Here, we can test phonological principles in a way that is perhaps more rigorous: when a learner is equipped with a particular theoretical principle in phonology, does its ability improve (or, for that matter, get worse)?
Adam Albright and I spent several years developing an automated learner. Our learner is presented with paradigms, and develops a rule system that permits it to take the "Wug" test (Berko 1958): given one member of an inflectional paradigm, it guesses another member. Thus: "What is the past tense of spling?" yields three answers, in descending order of confidence: splinged, splung, splang. Our learner currently is able to mimic human well-formedness intuitions in a variety of simulations.
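The ranked-guess behavior described above can be illustrated with a small sketch. This is not the learner itself. It assumes ordinary spelling instead of phonetic transcription, three hand-written rules instead of learned ones, a made-up toy lexicon, and raw reliability (hits divided by scope) as a stand-in for confidence; the confidence computation in the published model may differ. All names (`LEXICON`, `Rule`, `wug_test`) are hypothetical.

```python
import re
from typing import NamedTuple

# Toy training pairs (present, past) in ordinary spelling.
# Hypothetical data for illustration only; not one of the databases on this page.
LEXICON = [
    ("walk", "walked"), ("jump", "jumped"), ("talk", "talked"),
    ("play", "played"), ("look", "looked"), ("call", "called"),
    ("ping", "pinged"), ("wing", "winged"),
    ("cling", "clung"), ("fling", "flung"), ("sting", "stung"), ("swing", "swung"),
    ("sing", "sang"), ("ring", "rang"), ("spring", "sprang"),
]


class Rule(NamedTuple):
    context: str  # regex that the whole present-tense form must match
    output: str   # template for the past-tense form (\1 is the captured stem)


# Hand-written rules (an assumption; the real learner discovers its rules).
RULES = [
    Rule(r"(.*)", r"\1ed"),      # add -ed to anything
    Rule(r"(.*)ing", r"\1ung"),  # -ing becomes -ung
    Rule(r"(.*)ing", r"\1ang"),  # -ing becomes -ang
]


def apply(rule, word):
    """Return the rule's output for word, or None if the rule does not apply."""
    m = re.fullmatch(rule.context, word)
    return m.expand(rule.output) if m else None


def reliability(rule, lexicon):
    """Hits / scope: of the forms the rule applies to, how many does it get right?"""
    outputs = [(apply(rule, pres), past) for pres, past in lexicon]
    scope = sum(1 for out, _ in outputs if out is not None)
    hits = sum(1 for out, past in outputs if out == past)
    return hits / scope if scope else 0.0


def wug_test(word, rules=RULES, lexicon=LEXICON):
    """Return (candidate, score) pairs for a novel word, best guess first."""
    candidates = {}
    for rule in rules:
        form = apply(rule, word)
        if form is not None:
            score = reliability(rule, lexicon)
            candidates[form] = max(score, candidates.get(form, 0.0))
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # On this toy lexicon the guesses come out in the order
    # splinged, splung, splang.
    for form, score in wug_test("spling"):
        print(f"{form}\t{score:.2f}")
```

The ordering here depends entirely on the toy lexicon. With a different set of training verbs, or with rules restricted to narrower contexts, the ranking could change.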
The work was funded by NSF grant BCS-991-0686.
|Adam Albright and Bruce Hayes (2004) "Modeling Productivity with the Gradual Learning Algorithm: The Problem of Accidentally Exceptionless Generalizations," in submission to a conference proceedings volume.||This was the second version of our learner; it learns nonlocal environments. The Gradual Learning Algorithm is used to filter out learned environments that hold true only by accident of the data that happen to be in the learning set.|
|Adam Albright and Bruce Hayes (2003) "Rules vs. Analogy in English Past Tenses: A Computational/Experimental Study". Preprint version; appeared in Cognition 90: 119-161.||Comparison of our model with two alternatives: the Dual Mechanism approach of Steven Pinker and others, and a purely-analogical approach based on the Generalized Context Model of Robert Nosofsky.||Click here to go to download page|
|Adam Albright and Bruce Hayes (2002) "Modeling English Past Tense Intuitions with Minimal Generalization". Preprint version. Appeared in Mike Maxwell, ed., Proceedings of the 2002 Workshop on Morphological Learning, Association for Computational Linguistics. Philadelphia: Association for Computational Linguistics.||A succinct account of how our learner works.|
|Albright, Adam, Argelia Andrade and Bruce Hayes (2001) Segmental Environments of Spanish Diphthongization. UCLA Working Papers in Linguistics 7 (Papers in Phonology 5), 117-151.||Our learner applied to a problem in Spanish phonology, with Wug test data.||Click here to go to download page.|
|Adam Albright and Bruce Hayes (2000) "Distributional encroachment and its consequences for phonological learning". UCLA Working Papers in Linguistics 4 (Papers in Phonology 4), 179-190.||A problem in phonological learning: one ending shows up occasionally in the environment for another. How we avoid overgeneralization.|
|Adam Albright and Bruce Hayes (ms., 1999), "An Automated Learner for Phonology and Morphology".||How our learner works. Somewhat out of date, but with some interesting problems not discussed elsewhere.|
Learning Databases. If you have your own learning model and would like to try it out on our databases, feel free to download the following. Most use the SILDoulosIPA font for phonetic symbols. Some have feature charts to cover the phonemes of the languages, as they are transcribed in the data files.
|English (Excel spreadsheet)||English features||Present and past tense forms of 2181 English verbs. Based on an original from the Web site of Brian MacWhinney. The first file is raw data in Excel format; the second file is a text file, in a format usable by the learning software.|
|English||Another database, derived from the CELEX corpus, used in the preparation of "Rules vs. Analogy in English Past Tenses: A Computational/Experimental Study"|
|Spanish||Spanish features||For studying diphthongization. 4000 verbs in the 1st plural present (non-diphthongized) and 1 sg. present (many diphthongized). In format usable by learning software. More Spanish verbs, used in our joint paper "Segmental Environments of Spanish Diphthongization" (with Argelia Andrade), are available here.|
|Latin||(features bundled with data package)||Full paradigms of about 3000 nouns. In format usable by learning software.|
|Italian||(features bundled with data package)||About 3000 Italian verbs, in 1 sg. present and infinitive. Used by Albright 1998 for projecting conjugation class. In format usable by learning software.|
|German||(no feature chart)||Present and past tense forms of about 1650 verbs. Based on The Oxford Book of German Verbs. Excel format. No prefixed forms, except where stem is bound. No frequency statistics.|
The Learning Program. This is the local learner, reported in Albright and Hayes (2002, 2003). It is written in Java and in principle will run on various machines. To run this program, you need to obtain the free "Java run-time environment", which is already present on most machines. It can be downloaded here.
The program itself sits in an "executable jar" file: just download it and then click on it to run. If you have Java on your machine, this will probably work.
Download the program
To learn how to run the program, I recommend doing a little problem I wrote for my morphology class (Linguistics 205). All the files needed are in this zipped bundle.
Last modified October 3, 2011