Welcome to our corpus linguistics page. While this is just an introductory look at the world of corpus linguistics, I hope it will provide some tools for you to consider using as your interests and time allow.

 Corpora 

At the heart of this strand of linguistics is, of course, the notion of corpora. These data sets can come in various sizes, and some are quite extensive. Let's have a look at some, folks.

At some point you might find yourself in need of a specialized corpus. Of course, you could simply compile a large text file all by yourself; this is, as you will have noted, what we will be looking at next week in our 9th session. One example I will share with you only in class because it is not publicly available, and making it available online would result in great gnashing of teeth.

Here we have a useful document by John Sinclair titled Developing Linguistic Corpora: A Guide to Good Practice.

Should you need to convert a text format from, say, PDF to simple text, one option is a free conversion website called Zamzar. A second option is to simply use the 'save as' routine. Should you need to do a batch conversion, however, the quickest and least painful method is to use AntFileConverter, which is simply a batch file converter. Wow.

As you'll know from class, there are two broad categories of corpora. Generalized corpora are, as the name suggests, for general use. Specialized corpora, by contrast, are essentially niche corpora intended for very specific research questions. Examples follow:

 Online 

 Corpus @ BYU

Here at English-Corpora.org you'll find a variety of corpora from different languages, genres, and areas. A recent addition to the pantheon is the Coronavirus Corpus, which makes sense. As you'll see on the webpage, Mark Davies is the eminent gentleman behind this massive undertaking.

As you might have suspected, YouTube again has several very informative tutorials. You might begin here, with the COCA 01: Introduction to Using the Corpus.

 AntConc 

Developed by Laurence Anthony just down the street from us at Waseda, this is a very useful set of tools. Here is his webpage, on which you'll find lots of information in addition to the various types of software (including AntConc) that he has developed. For our purposes in this course, here is the AntConc webpage.

Lest this all seem beyond comprehension, Dr. Anthony has provided a series of tutorials available on YouTube.

I will readily admit that the Keyword List tool was a mystery the first time that I tried it. Thankfully, this helpful video explains it nicely.

I'll leave it up to you to search for more helpful tutorials.

 The Compleat Lexical Tutor 

Another option is Tom Cobb's website called The Compleat Lexical Tutor, which you'll recall is my chosen example for the website most likely to cause eye damage. (Was that site designed intentionally?)

 Statistics and Such Cruel Things 

A quick and dirty statistics note here: you will encounter the term log-likelihood when assessing whether the difference in frequency between your target text and a given corpus is statistically significant. As usual, the two significance levels of note in our field are p < .05 and p < .01. When pondering log-likelihood results, values in excess of 6.63 indicate statistically significant results at p < .01 and those in excess of 3.84 hit the p < .05 mark.
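For the curious, the arithmetic behind those thresholds can be sketched in a few lines of Python. This is only a minimal sketch, assuming the common two-corpus log-likelihood calculation that compares observed with expected frequencies; the function names, variable names, and example counts are made up for illustration, and the tools linked on this page may compute or report things a bit differently.

```python
import math

# Critical values quoted in the note above (chi-square, 1 degree of freedom).
CRITICAL_P05 = 3.84
CRITICAL_P01 = 6.63


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood for one word in a target text versus a reference corpus.

    freq_* are raw counts of the word; size_* are total word counts.
    All names here are hypothetical, chosen for this illustration.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref

    # Expected counts if the word were equally common in both data sets.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def significance(ll):
    """Map a log-likelihood value onto the two levels used in our field."""
    if ll > CRITICAL_P01:
        return "p < .01"
    if ll > CRITICAL_P05:
        return "p < .05"
    return "not significant"


if __name__ == "__main__":
    # Invented counts: a word occurring 30 times in a 1,000-word target text
    # and 10 times in a 1,000-word reference corpus.
    ll = log_likelihood(30, 1000, 10, 1000)
    print(round(ll, 2), significance(ll))
```

With these invented counts the value comes out at roughly 10.46, comfortably past the 6.63 mark. Halve the gap (20 versus 10 occurrences) and the value drops to about 3.40, which misses even the p < .05 threshold.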

Here, Good People, is a chapter on stats for corpus linguistics, which I'm sure must fall under the 'fair use' category. Maybe. Here as well is an online log-likelihood and effect size calculator, which really falls under the "simple webpage" heading. However, it can be useful.

Courtesy of Lancaster University, here is the Lancaster Stats Tools Online webpage.

URL: www.jimelwood.net/students/temple/techined21/corpus.html

The logo was created on Cool Text.

Date last updated: June 1, 2021 * Copyright 2021 by Midas, Cyrus, and all the other lunatics.