By Dalat TESOL
Helping novice researchers explore language through real-world data
📌 Introduction
Corpus linguistics is an approach to language study that focuses on analyzing real-world language use through large collections of texts, known as corpora. Instead of relying solely on intuition or textbook examples, it investigates actual patterns in naturally occurring language — making it especially powerful for TESOL researchers and educators.
Rooted in usage-based linguistics and championed by scholars like John Sinclair, corpus linguistics offers a more empirical view of how language functions across different contexts, genres, and speakers. For TESOL students and novice researchers, it provides a gateway to evidence-based language teaching and rich linguistic insights — even with limited technical skills.
This article introduces:
- What corpus linguistics is
- Why it matters in TESOL research and teaching
- Key tools and concepts
- Sample research applications
- Corpus design basics
- Pedagogical possibilities
- Recommended readings to continue your journey
📚 1. What Is Corpus Linguistics?
Corpus linguistics involves the systematic study of language through digital collections of spoken or written texts. Researchers use software tools to search for patterns, frequencies, and relationships in authentic language data.
🧠 In short: Corpus linguistics helps us understand how language is really used — not just how we think it’s used.
✅ Key Concepts
Term | Meaning |
---|---|
Corpus (pl. corpora) | A structured set of real language texts (spoken/written) stored electronically |
Concordance | A list of all occurrences of a word/phrase in its surrounding context |
Collocation | Words that tend to co-occur (e.g., “heavy rain”, “strong coffee”) |
Frequency list | A ranking of the most common words or structures in a corpus |
Keyword analysis | Words that appear unusually often in one corpus compared to a reference corpus |
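For readers who are curious about what a concordancer does behind the scenes, here is a minimal Python sketch of two of the concepts above: a frequency list and a concordance (keyword-in-context lines). It is optional, since tools like AntConc do this through menus. The sample sentence, the simple regex tokenizer, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def frequency_list(tokens, top_n=10):
    """Return the top_n most frequent tokens with their raw counts."""
    return Counter(tokens).most_common(top_n)

def concordance(tokens, node, width=4):
    """Return keyword-in-context lines: up to `width` tokens on each side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Invented sample text, standing in for a real corpus file
sample = (
    "Heavy rain fell all night. The heavy rain caused flooding, "
    "and more rain is expected tomorrow."
)
tokens = tokenize(sample)

print(frequency_list(tokens, top_n=3))
for line in concordance(tokens, "rain"):
    print(line)
```

Real tools add sorting, wildcard search, and better tokenization, but the core idea is the same: count tokens, then show each hit with its neighbors.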
✅ Types of Corpora
Type | Example | Use |
---|---|---|
General corpora | British National Corpus (BNC), COCA | Language norms, grammar trends |
Learner corpora | ICLE, LINDSEI | EFL learner errors and interlanguage |
Specialized corpora | Teacher talk, textbook corpora | Focused topic or context |
DIY corpora | Your own dataset (e.g., student writing) | Classroom-based or thesis projects |
🏗️ 2. How to Design or Choose a Corpus
Before beginning any analysis, it’s crucial to define or select your corpus carefully:
Key Design Principles:
- Representativeness: Does the corpus reflect the kind of language you’re studying (e.g., academic writing, classroom talk)?
- Balance: Is there a mix of topics, authors, or genres?
- Size: Larger is often better, but small corpora can work for focused questions
- Cleanliness: Are the texts well-formatted and free from irrelevant characters?
🎯 Example: If you want to study metadiscourse in Vietnamese EFL essays, aim for at least 30 essays across proficiency levels and tasks, and organize them clearly by source and genre.
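The "cleanliness" principle can be partly automated. Below is a minimal sketch, assuming your essays are already available as plain-text strings; the function name `clean_text` and the specific cleaning rules are illustrative choices, not a standard. Always keep the untouched originals alongside the cleaned copies.

```python
import re
import unicodedata

def clean_text(raw):
    """Normalize one text: unify Unicode forms, drop invisible characters, tidy whitespace."""
    text = unicodedata.normalize("NFC", raw)
    # Drop control (Cc) and format (Cf) characters such as zero-width spaces,
    # but keep newlines and tabs. Note: this also removes joiners used in some emoji.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line in a row
    return text.strip()

# Invented example with a zero-width space, extra spaces, and stray line breaks
messy = "Hello\u200b   world \r\n\n\n\nNext\x00 line "
print(repr(clean_text(messy)))
```

Which characters count as "irrelevant" depends on your research question, so treat these rules as a starting point and document whatever cleaning you apply.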
🎯 3. Why Use Corpus Linguistics in TESOL?
Corpus linguistics is relevant to both research and teaching:
🧪 For Research:
- Examine learners’ lexical or grammatical patterns
- Investigate genre-specific language use
- Compare native and non-native discourse features
- Analyze textbooks, tests, or classroom language
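Comparisons between two corpora, such as learner versus reference texts, are often supported by a keyness statistic. One widely used option is log-likelihood, sketched below in Python. The function name and the example counts are hypothetical; the calculation follows the common two-corpus formulation, where expected frequencies are derived from the combined totals.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness for one word in a study corpus vs. a reference corpus."""
    total_size = size_study + size_ref
    combined_freq = freq_study + freq_ref
    # Expected frequencies if the word were used at the same rate in both corpora
    expected_study = size_study * combined_freq / total_size
    expected_ref = size_ref * combined_freq / total_size
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: a word occurs 30 times in 1,000 learner tokens
# and 10 times in 1,000 reference tokens
score = log_likelihood(30, 1000, 10, 1000)
# Values above 3.84 correspond to p < 0.05 (chi-square, 1 degree of freedom)
print(round(score, 2), score > 3.84)
```

A high score only says the difference in frequency is unlikely to be chance. You still need concordance lines to interpret why the word is over- or underused.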
🧑‍🏫 For Teaching:
- Create data-driven learning (DDL) activities
- Develop vocabulary lists grounded in real usage
- Teach grammar using real-life contexts
- Use authentic examples for genre awareness
💡 Instead of telling students “we say ‘make a decision’,” you can show them corpus evidence: in the BNC, “make a decision” occurs many times, while “do a decision” is essentially unattested. That’s data, not opinion.
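When you report how often a phrase occurs, raw counts are usually converted to a normalized rate such as "per million words", so that corpora of different sizes can be compared. A minimal sketch of that arithmetic follows; the function name and the counts are made up for illustration and are not real BNC figures.

```python
def per_million(raw_count, corpus_size):
    """Convert a raw count into a frequency per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of tokens")
    return raw_count * 1_000_000 / corpus_size

# Hypothetical numbers: 50 hits in a 2-million-word corpus
# versus 50 hits in a 100,000-word classroom corpus
print(per_million(50, 2_000_000))
print(per_million(50, 100_000))
```

The same raw count of 50 means very different things in the two corpora, which is why normalized frequencies belong in any comparison.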
🛠️ 4. Tools for Corpus Linguistics (Beginner-Friendly)
You don’t need to know coding to start analyzing corpora. The tools below are widely used and free to use, though some online corpora require registration:
Tool | Description |
---|---|
AntConc | Concordancer and collocation tool for DIY corpora (Windows/Mac/Linux) |
LancsBox | More advanced visual and statistical features (Windows/Mac/Linux) |
COCA | Corpus of Contemporary American English – large online corpus |
SKELL | Ideal for teaching collocations with simple interface |
WordAndPhrase | Word profiles and frequency by genre (COCA-based) |
BNCWeb | Web access to British National Corpus (spoken/written texts) |
🧪 5. Sample Research Applications
✅ Lexical Bundles in Student Writing
RQ: What multi-word expressions do Vietnamese EFL students use in argumentative essays compared to native speaker essays?
- Build or access two corpora
- Use AntConc to extract lexical bundles (e.g., “on the other hand”)
- Analyze structural and functional differences
🎯 Contribution: Highlights formulaic language gaps in learner writing.
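If you want to see the logic behind bundle extraction, here is a minimal Python sketch. It counts every 4-word sequence and keeps those that meet a minimum frequency and appear in a minimum number of different essays. The three essays are invented, and the thresholds (`min_freq`, `min_texts`) are placeholder assumptions; published studies use much stricter cut-offs on much larger corpora.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return (bundle, frequency, number_of_texts) for n-grams meeting both thresholds."""
    freq = Counter()
    text_range = defaultdict(set)
    for text_id, text in enumerate(texts):
        tokens = tokenize(text)
        # Note: this simple version lets n-grams run across sentence boundaries
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            text_range[gram].add(text_id)
    rows = [
        (gram, count, len(text_range[gram]))
        for gram, count in freq.items()
        if count >= min_freq and len(text_range[gram]) >= min_texts
    ]
    # Most frequent first, then alphabetical
    return sorted(rows, key=lambda row: (-row[1], row[0]))

# Invented mini-corpus of three "essays"
essays = [
    "On the other hand, online learning is flexible.",
    "Online classes are cheap. On the other hand, they can be lonely.",
    "On the other hand, many students prefer face-to-face classes.",
]
for bundle, count, n_texts in lexical_bundles(essays):
    print(bundle, count, n_texts)
```

Requiring a bundle to appear in several texts guards against one writer's pet phrase being mistaken for a pattern across learners.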
✅ Collocations of “Take” in IELTS Materials
RQ: What collocations with “take” appear in IELTS reading texts?
- Compile IELTS passages as corpus
- Use LancsBox or SKELL to find collocates (e.g., “take notes”, “take care”)
- Analyze patterns and teaching implications
🎯 Contribution: Informs vocabulary instruction and exam prep materials.
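The collocate search in this design can be sketched in a few lines of Python. The example counts words within three tokens of “take” and adds a Mutual Information (MI) score, using one common formulation in which the expected frequency is scaled by the window size. The sample text, function names, window, and threshold are all illustrative assumptions, and the scores from such a tiny sample are not meaningful in themselves.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def collocates(tokens, node, window=3, min_count=2):
    """Return (collocate, co-occurrence count, MI score) for words near `node`."""
    freq = Counter(tokens)
    co_counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            co_counts.update(left + right)
    corpus_size = len(tokens)
    span = 2 * window  # number of positions searched around each node
    rows = []
    for word, observed in co_counts.items():
        if observed < min_count:
            continue
        # Expected co-occurrences if node and word were unrelated (assumed formulation)
        expected = freq[node] * freq[word] * span / corpus_size
        rows.append((word, observed, round(math.log2(observed / expected), 2)))
    # Strongest association first, then alphabetical
    return sorted(rows, key=lambda row: (-row[2], row[0]))

# Invented sample standing in for compiled reading passages
sample = (
    "Students take notes in class. Good students take notes and take care "
    "with spelling. Teachers take care to speak slowly."
)
tokens = tokenize(sample)
for word, observed, mi in collocates(tokens, "take"):
    print(word, observed, mi)
```

Different tools define the window and the expected frequency slightly differently, so report which measure and settings you used.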
✅ Teacher Talk in EMI Classrooms
RQ: How do Vietnamese lecturers modify language when teaching science in English?
- Transcribe 3–5 hours of lecture audio
- Search for features like hedging (“kind of”), clarification, code-switching
- Use concordance data to analyze discourse strategies
🎯 Contribution: Provides evidence for EMI policy and teacher training.
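A first pass over the transcripts can be scripted as a simple phrase search. The sketch below counts a short list of candidate hedges in one transcript. The list `HEDGES`, the function name, and the sample utterances are assumptions for illustration; a real study would justify its marker list from the literature.

```python
import re
from collections import Counter

# Illustrative marker list only, not an established inventory of hedges
HEDGES = ["kind of", "sort of", "maybe", "i think", "probably"]

def count_markers(transcript, markers=HEDGES):
    """Count whole-phrase matches of each marker, ignoring case and extra spaces."""
    text = transcript.lower()
    counts = Counter()
    for marker in markers:
        # Allow any amount of whitespace between the words of a multi-word marker
        body = r"\s+".join(re.escape(word) for word in marker.split())
        counts[marker] = len(re.findall(r"\b" + body + r"\b", text))
    return counts

# Invented lecture excerpt
transcript = (
    "So this is kind of like a force. Maybe we can, I think, call it pressure. "
    "It is kind  of tricky."
)
for marker, count in count_markers(transcript).items():
    print(marker, count)
```

Counts like these are only candidates: “kind of” in “a kind of metal” is not a hedge, so each hit still needs to be checked in its concordance line.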
🧑‍🏫 6. Corpus Linguistics in the TESOL Classroom
Data-Driven Learning (DDL) involves giving students access to concordance lines or corpus snippets so they can:
- Discover patterns (e.g., how “due to” is used in academic writing)
- Reflect on differences between spoken and written forms
- Build genre awareness (e.g., introductions in journal articles)
👩‍🏫 In a classroom activity, students might compare concordance lines of “affect” vs. “effect” to learn usage through observation, not just definitions.
This approach turns students into language detectives and promotes deeper language awareness.
⚠️ 7. Common Pitfalls (and How to Avoid Them)
Pitfall | Solution |
---|---|
Counting words without interpretation | Always analyze concordance lines contextually |
Using too small a dataset | Justify your corpus size and acknowledge limitations |
Assuming frequency equals importance | Check dispersion (how evenly an item is spread across texts) and collocates before drawing conclusions |
Copy-pasting tool output without commentary | Always connect results back to your research question |
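To make the dispersion check concrete, here is a minimal Python sketch of two simple measures: range (the share of texts containing the word) and Gries's DP (deviation of proportions), where 0 means perfectly even spread and values near 1 mean the word is concentrated in a small part of the corpus. The function name and the toy data are assumptions for illustration.

```python
def dispersion(word, texts_tokens):
    """Return (range, DP) for `word` across a list of tokenized texts."""
    sizes = [len(tokens) for tokens in texts_tokens]
    counts = [tokens.count(word) for tokens in texts_tokens]
    total_size = sum(sizes)
    total_count = sum(counts)
    if total_count == 0:
        return 0.0, None  # word never occurs, so DP is undefined
    # Range: proportion of texts in which the word appears at least once
    text_range = sum(1 for c in counts if c > 0) / len(texts_tokens)
    # DP: half the summed gap between each text's share of the word
    # and its share of the whole corpus
    dp = 0.5 * sum(
        abs(c / total_count - s / total_size) for c, s in zip(counts, sizes)
    )
    return text_range, dp

# Toy corpora: the word is evenly spread in the first, clustered in the second
even = [["a", "b"], ["a", "b"]]
clustered = [["a", "a"], ["b", "b"]]
print(dispersion("a", even))
print(dispersion("a", clustered))
```

A word with a high frequency but a low range may simply reflect one essay topic or one speaker, which is exactly the pitfall described in the table.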
✍️ 8. How to Write About Corpus Linguistics in Your Thesis or Article
When writing your methods section:
- State your RQs clearly
- Describe the corpus (size, genre, source, design choices)
- Justify your tool (Why AntConc? Why this corpus?)
- Explain your steps (frequency → concordance → interpretation)
- Include data excerpts to support your findings
📑 Example:
“Using AntConc 3.5.9, I extracted all 4-word bundles occurring at least 10 times in a corpus of 80 Vietnamese EFL essays. I then categorized these bundles based on Biber et al.’s (2004) structural classification.”
📚 Further Reading
- TBC
🧠 Final Thoughts
Corpus linguistics gives TESOL researchers and teachers evidence-based insights into how language works. You don’t need advanced programming skills — only:
- A clear question
- A well-constructed or selected corpus
- And curiosity to explore language patterns
Strong TESOL research isn’t built on intuition alone — it’s grounded in the real language of real users.
Corpus linguistics helps us get there.
🌿 Dalat TESOL – Sharing knowledge on teaching, scientific research, and publication opportunities