Bilingualism on the ground: The Canberra-Vietnamese code-switching corpus
Community language corpora are essential for linguistic research and provide valuable insights into a community’s social and cultural history. Many of these corpora, however, are difficult to locate and access because they lack detailed descriptions, which restricts their reuse and analysis. This talk therefore aims to place linguistic data centre stage and highlight the crucial contributions such data can make.
I will begin by introducing the Canberra Vietnamese-English Corpus (CanVEC), outlining its design and composition. CanVEC is a pioneering corpus of natural speech from 45 Vietnamese-English bilingual speakers living in Canberra, Australia. This is a well-established community that has, until now, received very limited documentation. The corpus comprises 23 conversations, totalling approximately 10 hours of speech, from speakers across two generations (aged 12 to 67 at the time of recording). It features detailed demographic annotations and is semi-automatically annotated with language identification, part-of-speech (POS) tags, and English translations (Nguyen & Bryant, 2020). It represents the first annotated corpus of its kind for this community.
I will then highlight some significant findings from analyses of the data, on both linguistic and computational fronts. Linguistically, the corpus facilitates an in-depth exploration of language variation and change in heritage Vietnamese, Australian English, and code-switching within the community. I specifically show how CanVEC data challenge the long-standing notion of ‘matrix-language’ in code-switching research (Nguyen 2024) and reveal patterns of cross-generational variation in subject and object expression in heritage Vietnamese. Computationally, I demonstrate how CanVEC has made a unique contribution to interdisciplinary research on multilingual model development and evaluation, specifically on the effects of natural code-switching on machine performance (Sterner and Teufel 2023; Nguyen et al., 2023a; Nguyen et al., 2023b; Chan et al., 2024).
Li Nguyen is a linguist specialising in language variation and change, contact linguistics, and computational sociolinguistics. Much of her work is corpus-based and focuses on contact phenomena in bilingual settings. Her current interests include language use in ethnically diverse communities, sociolinguistically informed NLP, and innovative technologies that can accommodate multilingual speakers. Li’s recent work centres on interdisciplinary projects in which she collaborates with computer scientists, industry partners, and policy makers to evaluate and develop educational technologies for code-switching. Before joining NTU as an Assistant Professor, she worked as a Research Associate at the ALTA Institute within the Cambridge Computer Lab, and as a Research Fellow and Lecturer at the Australian National University. She holds a PhD in Linguistics from the University of Cambridge.