Center for the Study of Language and Society (CSLS)

Building historical corpora: From morphological tagging to grapho-phonological parsing

Thursday, 05.09.2019, 15:00

Doctoral students of the GSAH can have their participation in the workshop credited with 1 ECTS. For MA students in Sociolinguistics, participation counts as attendance at a guest lecture.

Organiser: Forum Language and Society
Speaker: Benjamin Molineaux (The University of Edinburgh)
Date: 05.09.2019
Time: 15:00 - 16:30
Location: F-106
Lerchenweg 36
3012 Bern
Attributes: Public

Orthographic variation, which is endemic to non-standard spelling systems, is seen by many researchers as a fatal stumbling block for building morpho-syntactically parsed historical corpora. Graphemic alternations, however, have long been a treasure trove for historical phonologists, who attempt to piece together bygone sound systems by close examination of spelling practices. This tug-of-war between morpho-syntactic and grapho-phonological approaches to corpus-building has resulted in independent traditions of spelling standardisation, on the one hand, and of diplomatic transcription with minimal tagging, on the other. A third route, however, is increasingly feasible: producing lemmatised and part-of-speech tagged texts while preserving fine-grained spelling variation. In this workshop I will give an overview of this approach, based on the construction of the Corpus of Historical Mapudungun (CHM), a project at Edinburgh’s Angus McIntosh Centre for Historical Linguistics (AMC).

The main focus will be on the methods and tools used to go from the printed or manuscript texts to a lemmatised, morphologically tagged and grapho-phonologically parsed corpus. I will survey the process of optical character recognition, and the principles and conventions of XML tagging used for lemma and morpheme parsing. Since the CHM is still under development, the final stage of the process – grapho-phonological parsing – will be illustrated with data from the From Inglis To Scots corpus (FITS – also developed at the Angus McIntosh Centre), which maps spellings to sounds in the early history of Scots (1380–1500). Here, I will showcase our bespoke tool – Medusa – which creates dynamic visualisations of the grapho-phonological relations in the corpus. I will conclude with some examples of the usefulness of reconciling the core objectives of corpus methods with a level of linguistic analysis often dismissed as cumbersome and uninformative.