By Dalat TESOL
Helping TESOL researchers confidently analyze classroom-based intervention data
📌 Introduction
You’ve planned an intervention. You’ve collected your pre- and post-test scores. Now what?
This guide introduces key statistical tools for analyzing data from experimental and quasi-experimental TESOL studies. These designs are common in classroom research where teachers compare teaching methods, materials, or technologies.
We’ll walk you through:
- What kind of data you need
- Which statistical tests to use
- What assumptions to check
- How to interpret and report your results
- Real TESOL research examples
🧠 1. What Kind of Data Do You Have?
Most classroom-based TESOL research uses interval or ratio data (e.g., test scores, rubric scores, survey totals).
Start with descriptive statistics:
- Mean (average performance)
- Standard deviation (spread of scores)
- Range (lowest to highest)
📌 Always report descriptive statistics first — they offer a snapshot before inferential testing.
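As a minimal sketch of that first step, the snippet below computes the three descriptives with Python's built-in `statistics` module. The scores and the `describe` helper are hypothetical, made up purely for illustration.

```python
import statistics

# Hypothetical post-test scores for one class (illustrative only).
scores = [62, 70, 75, 68, 81, 77, 59, 73, 66, 79]

def describe(values):
    """Return n, mean, sample standard deviation, and range for a list of scores."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }

summary = describe(scores)
print(f"n = {summary['n']}, M = {summary['mean']:.2f}, SD = {summary['sd']:.2f}, "
      f"range = {summary['min']}-{summary['max']}")
```

JASP, SPSS, and Jamovi report the same figures in their descriptives tables, so this is mainly useful for seeing what those numbers are made of.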
🔍 2. Before Running Tests: Check Assumptions
Most statistical tests below are parametric — they assume your data meets certain conditions.
Assumption | How to Check | What to Do if Violated |
---|---|---|
Normality (bell-shaped distribution; for paired designs, of the difference scores) | Shapiro-Wilk test, histograms | Use non-parametric tests (e.g., Wilcoxon, Mann–Whitney) |
Equal variances across groups | Levene’s Test | Use Welch’s t-test or non-parametric |
Interval/ratio scale | Test/survey design | Convert ordinal scales cautiously |
🧪 Use JASP, SPSS, or Jamovi to easily check assumptions.
📊 3. Choosing the Right Statistical Test
Test | Use When | Example in TESOL |
---|---|---|
Paired-sample t-test | Comparing pre/post in the same group | Did Class A improve after using Quizlet? |
Independent-sample t-test | Comparing two groups at one time point | Did Class A score higher than Class B after the intervention? |
ANCOVA | Comparing post-test scores while adjusting for pre-test | Did Class A outperform B even after accounting for pre-test differences? |
Repeated Measures ANOVA | Comparing the same group at multiple time points | How did students’ fluency change across three speeches? |
Effect Size (Cohen’s d) | Measuring the magnitude of change | Was the gain meaningful or just statistically significant? |
🧪 4. Scenario 1: One Group Pre/Post Design
Example:
Does using ChatGPT for planning improve writing fluency?
Group | Pre-test | 4-week treatment | Post-test |
---|---|---|---|
Class A | Writing sample | Brainstorm with ChatGPT | Writing sample |
Test: Paired-sample t-test
Result Example:
t(29) = 4.21, p < .001, d = 0.77
✅ Interpretation: Students wrote significantly more words after the intervention, with a moderate-to-large effect size.
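For readers who want to see the arithmetic, the sketch below computes a paired-sample t statistic by hand. The word counts and the `paired_t` helper are hypothetical. The effect size here is the difference-score version of Cohen's d (often written d_z, equal to t divided by the square root of n); assuming that convention, the reported d = 0.77 follows from t(29) = 4.21. The p-value is left to statistical software, since Python's standard library has no t distribution.

```python
import math
import statistics

# Hypothetical word counts for the same eight students (illustrative only).
pre = [120, 135, 110, 150, 128, 140, 115, 132]
post = [138, 150, 118, 171, 135, 158, 130, 140]

def paired_t(before, after):
    """Return (t, df, d_z) for a paired-samples t-test."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # SD of the difference scores
    t = mean_diff / (sd_diff / math.sqrt(n))   # mean difference over its standard error
    d_z = mean_diff / sd_diff                  # equivalent to t / sqrt(n)
    return t, n - 1, d_z

t, df, d_z = paired_t(pre, post)
print(f"t({df}) = {t:.2f}, d_z = {d_z:.2f}")

# Cross-check against the reported result: t(29) = 4.21 implies n = 30.
d_reported = 4.21 / math.sqrt(30)
print(f"d implied by t(29) = 4.21: {d_reported:.2f}")
```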
🧪 5. Scenario 2: Two Groups, Post-Test Only
Example:
Which feedback type leads to better writing — peer or teacher?
Group | Feedback | Post-test Score |
---|---|---|
Class A | Peer feedback | 78.5 |
Class B | Teacher feedback | 70.3 |
Test: Independent-sample t-test
Result Example:
t(38) = 2.03, p = .048, d = 0.63
✅ Interpretation: The peer feedback group outperformed the teacher feedback group. The effect was moderate in size.
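The comparison can be sketched in a few lines, assuming hypothetical scores for two classes of eight. `independent_t` uses the pooled variance (the classic Student's t-test) and returns Cohen's d; `welch_t` is the unequal-variance alternative mentioned in the assumptions table. Both function names are illustrative, and p-values are again left to statistical software.

```python
import math
import statistics

# Hypothetical post-test scores (illustrative only).
peer = [82, 75, 79, 88, 71, 80, 77, 84]
teacher = [70, 68, 74, 65, 79, 72, 66, 69]

def independent_t(x, y):
    """Student's t-test with pooled variance. Returns (t, df, Cohen's d)."""
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    d = (m1 - m2) / math.sqrt(pooled_var)
    return t, n1 + n2 - 2, d

def welch_t(x, y):
    """Welch's t-test (no equal-variance assumption). Returns (t, df)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    se_sq = v1 / n1 + v2 / n2
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se_sq)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

t, df, d = independent_t(peer, teacher)
tw, dfw = welch_t(peer, teacher)
print(f"Student: t({df}) = {t:.2f}, d = {d:.2f}")
print(f"Welch:   t({dfw:.1f}) = {tw:.2f}")
```

With equal group sizes the two t values coincide; only the degrees of freedom differ, which is why Welch's version is a low-cost default when variances look unequal.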
🧪 6. Scenario 3: Control for Pre-Test Differences
Problem:
Your two groups had different pre-test scores — can you still compare them fairly?
Solution:
Use ANCOVA to statistically control for these differences.
📌 ANCOVA adjusts post-test scores by removing the influence of pre-test variation — like comparing two runners by subtracting their head-start.
Caution: Only use ANCOVA if the pre-test and post-test are linearly related, the relationship has a similar slope in each group (homogeneity of regression slopes), and group variances are homogeneous.
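The "head-start" adjustment can be illustrated with a short sketch. It assumes a single pre-test covariate and a common within-group slope, and it computes only the adjusted post-test means, not the full ANCOVA F test (use JASP, SPSS, or Jamovi for that). The `adjusted_means` function and all scores are hypothetical.

```python
import statistics

def adjusted_means(groups):
    """groups maps a name to (pre_scores, post_scores).

    Returns the pooled within-group slope and each group's post-test mean
    adjusted to the grand pre-test mean.
    """
    sxy = sxx = 0.0
    all_pre = []
    for pre, post in groups.values():
        mx, my = statistics.mean(pre), statistics.mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
        all_pre.extend(pre)
    slope = sxy / sxx                      # pooled within-group regression slope
    grand_pre = statistics.mean(all_pre)
    adjusted = {
        name: statistics.mean(post) - slope * (statistics.mean(pre) - grand_pre)
        for name, (pre, post) in groups.items()
    }
    return slope, adjusted

# Hypothetical data: Class B starts 10 points ahead on the pre-test.
data = {
    "Class A": ([50, 55, 60, 65, 70], [60, 66, 69, 76, 79]),
    "Class B": ([60, 65, 70, 75, 80], [64, 70, 75, 79, 87]),
}
slope, adj = adjusted_means(data)
for name, value in adj.items():
    print(f"{name}: adjusted post-test mean = {value:.1f}")
```

In this made-up example Class B has the higher raw post-test mean, but once its pre-test advantage is removed, Class A comes out ahead.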
📏 7. What About Effect Sizes?
Statistical significance (p < .05) tells you whether a difference this large would be unlikely if the intervention had no real effect. Effect size tells you how large, and therefore how meaningful, the difference is.
Cohen’s d | Interpretation |
---|---|
0.2 | Small effect |
0.5 | Medium effect |
0.8+ | Large effect |
💡 Example: A d of 0.73 falls between the medium and large benchmarks, suggesting your ChatGPT intervention had a moderate-to-large impact on writing fluency.
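Here is a small sketch of how Cohen's d for two independent groups can be computed from summary statistics and matched against the benchmarks in the table. The function names, the summary values, and the "below small" label for values under 0.2 are assumptions made for illustration.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def interpret_d(d):
    """Label an effect size using Cohen's conventional benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"

# Hypothetical summary statistics for two classes of 25 students.
d = cohens_d(74.0, 8.0, 25, 70.0, 8.0, 25)
print(f"d = {d:.2f} ({interpret_d(d)})")
```

The benchmarks are rules of thumb, so treat the label as a starting point and interpret the size of the gain in light of your own teaching context.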
📋 8. Sample Reporting (APA Style)
Example 1: Paired-sample t-test
Students showed significant improvement in writing fluency from pre-test (M = 62.5, SD = 8.4) to post-test (M = 75.3, SD = 7.9), t(29) = 5.21, p < .001, d = 0.73.
Example 2: Independent t-test
The Quizlet group (M = 75.3, SD = 7.9) outperformed the flashcard group (M = 68.4, SD = 9.2) on the vocabulary post-test, t(58) = 2.57, p = .013, d = 0.66.
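If you report many results, a tiny formatter helps keep the statistics string consistent. The sketch below follows the conventions visible in the examples above: two decimals for t and d, no leading zero on p, and "p < .001" for very small values. The `format_p` and `apa_t` helpers are hypothetical, and the italics APA requires for t, p, and d are omitted because this is plain text.

```python
def format_p(p):
    """Format a p-value without a leading zero; very small values become 'p < .001'."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def apa_t(t, df, p, d):
    """Build a t-test results string such as 't(58) = 2.57, p = .013, d = 0.66'."""
    return f"t({df}) = {t:.2f}, {format_p(p)}, d = {d:.2f}"

print(apa_t(2.57, 58, 0.013, 0.66))
print(apa_t(5.21, 29, 0.0001, 0.73))
```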
🔀 9. If Your Data Isn’t Normal…
Use non-parametric alternatives:
Parametric Test | Non-parametric Version | Use When |
---|---|---|
Paired t-test | Wilcoxon Signed-Rank | Non-normal pre-post scores |
Independent t-test | Mann–Whitney U | Small or skewed group scores |
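To see why these alternatives cope with skewed data, it helps to notice that they work on ranks rather than raw scores. The sketch below computes the Mann–Whitney U statistic by hand, with tied scores sharing the average of their ranks. The helper names and scores are hypothetical, and the p-value is left to statistical software.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, U for y); the smaller of the two is usually reported."""
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:len(x)])               # rank sum of the first group
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return u1, u2

# Hypothetical skewed scores from two small classes.
group_a = [55, 58, 60, 61, 95]
group_b = [62, 64, 66, 70, 72]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(f"U = {min(u_a, u_b)}")
```

Because the outlying 95 counts only as "the highest rank", it cannot dominate the result the way it would in a t-test.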
✅ 10. Statistical Checklist for TESOL Researchers
Before, during, and after analysis:
- Identify IV and DV
- Check data type (interval, ratio)
- Check normality (Shapiro-Wilk)
- Check variances (Levene’s Test)
- Choose test: t-test, ANCOVA, etc.
- Compute and report effect size
- Interpret in light of RQ, context, and limitations
📚 Further Reading
- Plonsky, L. (2015). Quantitative Research Methods in Applied Linguistics.
- Larson-Hall, J. (2016). A Guide to Doing Statistics in Second Language Research Using SPSS and R.
- Dörnyei, Z. (2007). Research Methods in Applied Linguistics.
🌱 Final Thoughts
Statistics in TESOL aren’t just about numbers — they’re tools to answer important teaching and learning questions. When chosen and interpreted carefully, even simple tests can lead to powerful insights about what works in your classroom or study.
“The goal is not complexity — it’s clarity and credibility.”