English World-Wide 36:1 (2015), 41–44. doi 10.1075/eww.36.1.02pet issn 0172–8865 / e-issn 1569–9730 © John Benjamins Publishing Company
The Global Web-based English Corpus (GloWbE), introduced to EWW readers in Davies and Fuchs’s focus article, is an outstanding addition to the corpora of the English-speaking/English-using world – by its sheer size and up-to-dateness, and especially its inclusion of 14 “new Englishes” and six “core Englishes” (i.e. both indigenized and settler varieties, in Schneider’s (2007) terminology).¹ Its 1.9 billion words come from 340,000 websites and blogs with the relevant regional suffixes, which were also carefully checked for other indicators of regionality. The validity of using internet data to represent regional varieties of English has been demonstrated by Cook and Hirst (2012), in relation to Canadian and British English, and the new GloWbE corpus takes this a quantum leap further. It will be an invaluable reference resource for mapping new varieties of English whose norms are still evolving (Hundt and Gut 2012).
GloWbE’s concentration on web-based discourse serves a number of purposes. It creates a repository of an English-language medium not represented in any of the existing standard corpora (Corpus of Contemporary American English, COCA; British National Corpus, BNC). It provides a vast body of English language data from the second decade of the 21st century. Because this web-based material consists largely of unrestricted websites and blogs, it is not entangled by the copyright law (Androutsopoulos 2014) operative in some English-speaking countries, and GloWbE makes it freely available to bona fide researchers. This includes researchers outside linguistics, e.g. psychologists and speech pathologists seeking English norms for literacy and language competence. The Brown Corpus from the 1960s has for too long served as the benchmark for assessment and diagnostic purposes (Brysbaert and New 2009), quite apart from its limitations in terms of the medium/genres it represents (published writing only).
The GloWbE corpus consists of blogs (60 per cent) and websites (40 per cent), housing a mix of internet genres and text-types. In fact, both source types contain a mix of material, since the blogs written by individuals may contain or link to texts from institutional websites elsewhere. Blogs would nevertheless contain more material which has not undergone professional editing, and thus reflect the less generically constrained frontiers of the local variety. Meanwhile websites managed by institutions naturally contain more professionally edited material, reinforcing generic constraints on style and innovation. Ideally the two different source types could be independently searched in GloWbE, though this is not currently possible.
1. Davies and Fuchs’s term “core” Englishes reflects Kachru’s (1985) three-circles model of World English, though not his terminology (“inner circle” Englishes).
42 Pam Peters
GloWbE’s inclusion of blogging discourse is a corpus-building strategy for side-stepping the previous dependence on written discourse as the benchmark in corpus-based research. How far blogs really represent speech is a different issue.
The writing of blogs is predicated on more direct interaction with readers than through the print medium, and multivariate statistical research on internet registers and text-types (Biber and Kurjian 2007) has found significant factorial dimensions such as personal/involved narrative, persuasive discourse, and addressee-focused discourse, all of which serve to simulate spoken interaction. Blogging and the responses to it can be analysed in terms of social practice (Bolander 2013).
Yet the discourse of blogs is not contextualized like face-to-face conversations or even distanced conversation by phone or radio. As CMC data, it is more likely to be framed as “text” than as “place” by linguistic researchers (Androutsopoulos 2014). Blogging bears some comparison with the register of letter-writing, although it is probably less stylized than the letters collected in historical corpora such as ARCHER (A Representative Corpus of Historical English Registers) and the Australian COOEE (Corpus of Oz Early English) (Fritz 2007). The extent to which writing represents the speaking voice is problematic, as discussed by Hickey (2010) in his introduction to research on varieties of English in writing. Written language reflects the author’s communicative intent rather than spontaneous narrative, and the audience constructed through it is always a fiction (Ong 1975). The receptive audience constructed by the blogger is not autonomous, as in natural conversation. For all these reasons, the GloWbE collection could not replace the samples of carefully transcribed dialogue in the regional International Corpus of English (ICE) corpora, the Santa Barbara Corpus, and other sociolinguistic collections accessible through the Australian National Corpus (AusNC, see <ausnc.org.au>) – despite their limitations in terms of time, place and participation. Ideally they would be enlarged and updated, though the expense involved in their collection is formidable. Even making them interoperable (as for those included in AusNC) is hard won.
Responses to Davies and Fuchs 43
GloWbE undoubtedly offers great opportunities for large-scale research on the more elusive idioms and constructions of English, on low-frequency lexicogrammatical items and their variants, and alternative syntactic structures and the linguistic contexts in which they vary. It supports synchronic studies of grammaticalization and emergent grammar, as in Smith’s (2014) paper on complex subordinators. It complements diachronic studies of grammaticalization, which can be based on historical corpora such as the Helsinki corpora and ARCHER, or well-documented secondary sources on the English language such as the Oxford English Dictionary (3rd edition, online). GloWbE is grammatically tagged for the major word classes. The platform allows searches that combine words and word class tags, so as to discriminate word senses associated with particular grammatical roles, e.g. medium as adjective versus medium as noun, as well as wider research questions of colligation.
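The logic of combining a word form with a word class tag can be illustrated with a minimal sketch in Python. It does not use the GloWbE interface or its query syntax: the toy sample, the tag labels (ADJ, NOUN, DET, PREP) and the function names `count_by_tag` and `following_tags` are all assumptions made purely for illustration. The first function separates medium as adjective from medium as noun; the second gives a crude view of colligation by counting the word class that immediately follows each use.

```python
from collections import Counter

# Toy tagged sample of (word, tag) pairs. The tag labels are illustrative
# placeholders, not the tagset actually used in GloWbE.
TAGGED = [
    ("a", "DET"), ("medium", "ADJ"), ("coffee", "NOUN"),
    ("the", "DET"), ("medium", "NOUN"), ("of", "PREP"), ("print", "NOUN"),
    ("medium", "ADJ"), ("heat", "NOUN"),
]


def count_by_tag(tokens, word):
    """Count how often `word` occurs under each word class tag."""
    target = word.lower()
    return Counter(tag for w, tag in tokens if w.lower() == target)


def following_tags(tokens, word, tag):
    """Count the tags of tokens that immediately follow `word` used as `tag`."""
    target = word.lower()
    return Counter(
        tokens[i + 1][1]
        for i, (w, t) in enumerate(tokens[:-1])  # last token has no successor
        if w.lower() == target and t == tag
    )


if __name__ == "__main__":
    print(count_by_tag(TAGGED, "medium"))
    print(following_tags(TAGGED, "medium", "ADJ"))
    print(following_tags(TAGGED, "medium", "NOUN"))
```

In this toy sample the adjectival uses are followed by nouns and the nominal use by a preposition, which is the kind of contrast a combined word-plus-tag search is designed to bring out at corpus scale.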