Published: April 22, 2013
2012 International Conference on Asian Language Processing

John Lee
Halliday Centre for Intelligent Applications of Language Studies
Department of Chinese, Translation and Linguistics
City University of Hong Kong
jsylee@cityu.edu.hk

Abstract—We present a corpus-based analysis of the use of mixed code in Hong Kong speech. From transcriptions of Cantonese television programs, we identify English words embedded within Cantonese utterances, and investigate the motivations for such code-switching. Among the many motivations observed in previous research, we found that four alone account for more than 95% of the use of English words in our speech data across genres, genders, and age groups. We performed analyses over more than 60 hours of transcribed speech, resulting in one of the largest empirical studies to date on this linguistic phenomenon.

Keywords-code-switching; code-mixing; Cantonese; English; corpus linguistics





While Cantonese is the mother tongue of the vast majority of people in Hong Kong, English is also spoken by 43% of the population [1], reflecting the city’s heritage as a British colony. A well-known feature of speech in Hong Kong is code-switching, i.e., “the juxtaposition of passages of speech belonging to two different grammatical systems or sub-systems, within the same exchange” [2]. In the case of Hong Kong, the two grammatical systems are Cantonese and English. The former serves as the ‘matrix language’ and the latter as the ‘embedded language’, resulting in Cantonese sentences with English segments such as the following (example taken from [3]):

heoi3 canteen jam2 caa4
‘let’s go to the canteen for lunch’

Here, the English segment contains only one word (‘canteen’), but in general it can be a whole clause. We use the general term ‘code-switching’ rather than the more specific term ‘code-mixing’, which refers to switching below the clause level, even though most English segments in our corpus contain only one or two words (see Table 3).

There is already a large body of literature devoted to the study of Cantonese-English code-switching from a theoretical linguistic point of view [3,4,5]. This paper investigates the motivations behind the use of mixed code, on the basis of a large dataset of speech transcribed from television programs. In Section II, we outline previous research on the motivations for code-switching and discuss how our investigation complements it. In Section III, we describe our methodology for corpus construction, in particular the design of the taxonomy of code-switching motivations. In Section IV, we present an analysis of these motivations according to genre, gender, and age.

The first major framework for classifying code-switching motivations in Hong Kong consists of two categories: ‘expedient’ and ‘orientational’ [6]. Central to this framework is the distinction between words in ‘high Cantonese’ and ‘low Cantonese’. In everyday conversation, a speaker sometimes cannot find any word from ‘low Cantonese’ to describe an object, institution, or idea (e.g., ‘application form’). Using a word from ‘high Cantonese’ (e.g., biu2 gaak3), however, would sound too formal and therefore stylistically inappropriate. In expedient mixing, the speaker resorts to an English word; the mixing is pragmatically motivated.

In contrast, orientational mixing is socially motivated. The speaker chooses to use English (e.g., ‘barbecue’) despite the availability of equivalent words in both ‘low Cantonese’ (e.g., siu2 je5 sik6) and ‘high Cantonese’ (e.g., siu1 haau1), because the speaker perceives the subject matter to be inherently more ‘western’.

This dichotomy has been criticized as overly simplistic, because of the ambiguity in defining lexical and stylistic equivalents among ‘low Cantonese’, ‘high Cantonese’, and English. Instead, a four-way taxonomy is proposed:...
