📊 Statistics for Experimental and Quasi-Experimental Research in TESOL

By Dalat TESOL
Helping TESOL researchers confidently analyze classroom-based intervention data


📌 Introduction

You’ve planned an intervention. You’ve collected your pre- and post-test scores. Now what?

This guide introduces key statistical tools for analyzing data from experimental and quasi-experimental TESOL studies. These designs are common in classroom research where teachers compare teaching methods, materials, or technologies.

We’ll walk you through:

  • What kind of data you need
  • Which statistical tests to use
  • What assumptions to check
  • How to interpret and report your results
  • Real TESOL research examples

🧠 1. What Kind of Data Do You Have?

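Most classroom-based TESOL research analyzes scores that are treated as interval or ratio data (e.g., test scores, rubric scores, survey totals). Rubric scores and Likert-type survey totals are strictly ordinal, so treat them as interval with care.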

Start with descriptive statistics:

  • Mean (average performance)
  • Standard deviation (spread of scores)
  • Range (lowest to highest)

📌 Always report descriptive statistics first — they offer a snapshot before inferential testing.
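
As a rough illustration of the three descriptive statistics listed above, here is a minimal Python sketch that uses only the standard library. The scores and the `describe` helper are hypothetical, invented for this example.

```python
import statistics

# Hypothetical post-test scores for one class (illustrative only)
scores = [62, 70, 75, 68, 81, 77, 59, 73, 66, 79]

def describe(values):
    """Return the descriptive statistics named above for one set of scores."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),   # sample SD (n - 1 in the denominator)
        "min": min(values),
        "max": max(values),
    }

summary = describe(scores)
print(f"M = {summary['mean']:.2f}, SD = {summary['sd']:.2f}, "
      f"range = {summary['min']}-{summary['max']} (n = {summary['n']})")
```

Note that `statistics.stdev` is the sample standard deviation, which is the one normally reported for classroom data.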


🔍 2. Before Running Tests: Check Assumptions

Most statistical tests below are parametric — they assume your data meets certain conditions.

| Assumption | How to Check | What to Do if Violated |
|---|---|---|
| Normality (bell-shaped distribution) | Shapiro-Wilk test, histograms | Use non-parametric tests (e.g., Wilcoxon, Mann–Whitney) |
| Equal variances across groups | Levene’s Test | Use Welch’s t-test or non-parametric |
| Interval/ratio scale | Test/survey design | Convert ordinal scales cautiously |

🧪 Use JASP, SPSS, or Jamovi to easily check assumptions.
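
If you work in Python rather than a point-and-click package, the formal tests are available in SciPy as `scipy.stats.shapiro` and `scipy.stats.levene`. The sketch below is not those tests. It is a quick standard-library screen, using skewness as a rough symmetry check and a variance ratio as a rough spread check, that can flag obvious problems before you run them. The data are hypothetical, and the cut-offs mentioned in the comments are common rules of thumb rather than fixed standards.

```python
import statistics

def skewness(values):
    """Sample skewness (g1): near 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    m = statistics.mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def variance_ratio(group_a, group_b):
    """Larger sample variance divided by the smaller one (1.0 means identical spread)."""
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    return max(va, vb) / min(va, vb)

# Hypothetical post-test scores for two classes
class_a = [62, 70, 75, 68, 81, 77, 59, 73, 66, 79]
class_b = [55, 72, 90, 64, 85, 48, 77, 93, 60, 70]

# Rule-of-thumb reading (an assumption, not a formal test):
# |skewness| above about 1, or a variance ratio above about 4, deserves a closer look.
print(f"skewness A = {skewness(class_a):.2f}, B = {skewness(class_b):.2f}")
print(f"variance ratio = {variance_ratio(class_a, class_b):.2f}")
```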


📊 3. Choosing the Right Statistical Test

| Test | Use When | Example in TESOL |
|---|---|---|
| Paired-sample t-test | Comparing pre/post in the same group | Did Class A improve after using Quizlet? |
| Independent-sample t-test | Comparing two groups at one time point | Did Class A score higher than Class B after the intervention? |
| ANCOVA | Comparing post-test scores while adjusting for pre-test | Did Class A outperform B even after accounting for pre-test differences? |
| Repeated Measures ANOVA | Comparing the same group at multiple time points | How did students’ fluency change across three speeches? |
| Effect Size (Cohen’s d) | Measuring the magnitude of change | Was the gain meaningful or just statistically significant? |
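
The table can also be read as a small decision rule. The sketch below encodes it in Python. The function name `choose_test` and its three parameters are hypothetical, and it covers only the designs listed above.

```python
def choose_test(groups, time_points, adjust_for_pretest=False):
    """Map a simple design onto the tests in the table above.

    groups: number of groups compared (1 or 2)
    time_points: number of measurements per learner
    adjust_for_pretest: True when post-test scores should be adjusted for pre-test
    """
    if groups == 2 and adjust_for_pretest:
        return "ANCOVA"
    if groups == 2 and time_points == 1:
        return "Independent-sample t-test"
    if groups == 1 and time_points == 2:
        return "Paired-sample t-test"
    if groups == 1 and time_points >= 3:
        return "Repeated Measures ANOVA"
    return "Not covered by this table"

print(choose_test(groups=1, time_points=2))
print(choose_test(groups=2, time_points=2, adjust_for_pretest=True))
```

Whichever test the rule returns, report an effect size alongside it.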

🧪 4. Scenario 1: One Group Pre/Post Design

Example:
Does using ChatGPT for planning improve writing fluency?

| Group | Pre-test | 4-week treatment | Post-test |
|---|---|---|---|
| Class A | Writing sample | Brainstorm with ChatGPT | Writing sample |

Test: Paired-sample t-test

Result Example:
t(29) = 4.21, p < .001, d = 0.77

✅ Interpretation: Students wrote significantly more words after the intervention, with a moderate-to-large effect size.
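
To show where numbers like these come from, here is a minimal standard-library sketch of a paired-sample t-test. It works on the post-minus-pre differences and reports t, df, a two-sided p-value, and Cohen's d based on the SD of the differences. The word counts are hypothetical, not the data behind the result above. The p-value helper `t_two_sided_p` is a hand-rolled numerical integration written for this example. In practice a package function such as `scipy.stats.ttest_rel` would normally be used instead.

```python
import math
import statistics

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for t with df degrees of freedom.

    Integrates the t density from 0 to |t| with Simpson's rule
    (steps must be even), then doubles the remaining tail area.
    """
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    density = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    upper = abs(t)
    h = upper / steps
    total = density(0.0) + density(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    area = total * h / 3           # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)

def paired_t_test(pre, post):
    """Paired-sample t-test on post - pre differences; returns (t, df, p, d_z)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff      # effect size based on the SD of the differences
    return t, n - 1, t_two_sided_p(t, n - 1), d_z

# Hypothetical word counts for eight students (not data from the study above)
pre = [110, 95, 130, 120, 102, 98, 125, 115]
post = [128, 101, 141, 135, 100, 112, 140, 122]
t, df, p, d_z = paired_t_test(pre, post)
print(f"t({df}) = {t:.2f}, p = {p:.3f}, d = {d_z:.2f}")
```

The test is run on the differences, so the normality assumption applies to the difference scores, not to the pre-test and post-test scores separately.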


🧪 5. Scenario 2: Two Groups, Post-Test Only

Example:
Which feedback type leads to better writing — peer or teacher?

| Group | Feedback | Post-test Score |
|---|---|---|
| Class A | Peer feedback | 78.5 |
| Class B | Teacher feedback | 70.3 |

Test: Independent-sample t-test

Result Example:
t(38) = 2.03, p = .049, d = 0.64

✅ Interpretation: The peer feedback group outperformed the teacher feedback group. The effect was moderate in size.
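
The sketch below shows how the t statistic and Cohen's d for two independent groups can be computed by hand, using only the Python standard library. It gives both Student's version (pooled variance) and Welch's version (no equal-variance assumption). The scores are hypothetical, and the p-value lookup is left to your statistics package, which derives it from t and df.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return Student's t (pooled variance), its df, and Cohen's d."""
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = statistics.mean(group_a), statistics.mean(group_b)
    v1, v2 = statistics.variance(group_a), statistics.variance(group_b)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    d = (m1 - m2) / math.sqrt(pooled_var)
    return t, n1 + n2 - 2, d

def welch_t(group_a, group_b):
    """Return Welch's t and its (usually fractional) df."""
    n1, n2 = len(group_a), len(group_b)
    se1 = statistics.variance(group_a) / n1
    se2 = statistics.variance(group_b) / n2
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical post-test scores (not the class means in the table above)
peer = [82, 75, 79, 88, 71, 80, 77, 84]
teacher = [70, 74, 65, 78, 69, 72, 66, 75]

t, df, d = independent_t(peer, teacher)
t_w, df_w = welch_t(peer, teacher)
print(f"Student: t({df}) = {t:.2f}, d = {d:.2f}")
print(f"Welch:   t({df_w:.1f}) = {t_w:.2f}")
```

With equal group sizes the two t values coincide and only the degrees of freedom differ. With unequal sizes and unequal variances they can diverge, which is when Welch's version matters.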


🧪 6. Scenario 3: Control for Pre-Test Differences

Problem:
Your two groups had different pre-test scores — can you still compare them fairly?

Solution:
Use ANCOVA to statistically control for these differences.

📌 ANCOVA adjusts post-test scores by removing the influence of pre-test variation — like comparing two runners by subtracting their head-start.

Caution: Only use ANCOVA if the pre-test and post-test are linearly related, the pre-test/post-test slope is similar in both groups (homogeneity of regression slopes), and group variances are homogeneous.
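
To make the adjustment concrete, the following standard-library sketch computes ANCOVA-adjusted post-test means: each group's post-test mean is shifted to where it would sit if the group had started at the grand pre-test mean, using the pooled within-group slope. It does not compute the ANCOVA F test itself, which is best left to a statistics package. The function name and the data are hypothetical. The per-group slopes are returned so you can see whether they look similar, since the adjustment only makes sense when they do.

```python
import statistics

def ancova_adjusted_means(groups):
    """Adjusted post-test means for a one-covariate ANCOVA.

    groups maps a group name to a list of (pre, post) pairs.
    Returns (adjusted means, pooled within-group slope, per-group slopes).
    """
    grand_pre = statistics.mean([pre for pairs in groups.values() for pre, _ in pairs])
    summary, sxy_total, sxx_total = {}, 0.0, 0.0
    for name, pairs in groups.items():
        pre_mean = statistics.mean([pre for pre, _ in pairs])
        post_mean = statistics.mean([post for _, post in pairs])
        sxy = sum((pre - pre_mean) * (post - post_mean) for pre, post in pairs)
        sxx = sum((pre - pre_mean) ** 2 for pre, _ in pairs)
        summary[name] = (pre_mean, post_mean, sxy / sxx)   # last item: this group's own slope
        sxy_total += sxy
        sxx_total += sxx
    pooled_slope = sxy_total / sxx_total
    adjusted = {name: post_mean - pooled_slope * (pre_mean - grand_pre)
                for name, (pre_mean, post_mean, _) in summary.items()}
    slopes = {name: slope for name, (_, _, slope) in summary.items()}
    return adjusted, pooled_slope, slopes

# Hypothetical (pre, post) scores: Class B starts higher, Class A gains more
data = {
    "Class A": [(50, 60), (60, 70), (70, 80)],
    "Class B": [(60, 64), (70, 74), (80, 84)],
}
adjusted, pooled_slope, slopes = ancova_adjusted_means(data)
print(adjusted)   # {'Class A': 75.0, 'Class B': 69.0}
```

In this toy data the raw post-test means favor Class B (74 vs. 70), but once the pre-test head start is removed the adjusted means favor Class A (75 vs. 69).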


📏 7. What About Effect Sizes?

Statistical significance (p < .05) tells you that a difference this large would be unlikely if the intervention had no effect. It does not tell you how large or how meaningful the difference is; that is the job of effect size.

| Cohen’s d | Interpretation |
|---|---|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8+ | Large effect |

💡 Example: d = 0.77 (the Scenario 1 result) falls just below the 0.8 benchmark, suggesting the ChatGPT intervention had a moderate-to-large impact on writing fluency.
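
If a paper reports only t and the sample size, Cohen's d can usually be recovered. The sketch below shows the two standard conversions and a helper that bands d using the benchmarks in the table. The function names and the "negligible" label for values under 0.2 are assumptions made for this example. For the Scenario 1 result, t(29) = 4.21 with n = 30, the paired conversion gives the d = 0.77 reported there.

```python
import math

def d_from_paired_t(t, n):
    """Cohen's d for a one-group pre/post design (d_z): t divided by sqrt(n)."""
    return t / math.sqrt(n)

def d_from_independent_t(t, n1, n2):
    """Cohen's d for two independent groups, recovered from Student's t."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def label(d):
    """Band |d| using the benchmarks in the table above."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"   # assumption: the table does not name this band

d = d_from_paired_t(4.21, 30)
print(f"d = {d:.2f} ({label(d)})")   # d = 0.77 (medium)
```

The benchmarks are cut-offs, so a value such as 0.77 is labeled "medium" here even though it sits close to the "large" threshold. Describe such borderline values in words rather than relying on the label alone.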


📋 8. Sample Reporting (APA Style)

Example 1: Paired-sample t-test

Students showed significant improvement in writing fluency from pre-test (M = 62.5, SD = 8.4) to post-test (M = 75.3, SD = 7.9), t(29) = 5.21, p < .001, d = 0.95.

Example 2: Independent t-test

The Quizlet group (M = 75.3, SD = 7.9) outperformed the flashcard group (M = 68.4, SD = 9.2) on the vocabulary post-test, t(58) = 3.12, p = .003, d = 0.80.
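
Formatting these strings by hand invites small slips, such as a leading zero on p or "p = .000". The sketch below is one possible helper, with hypothetical function names, that follows the conventions used in the examples above: two decimals for t and d, three for p, no leading zero on p, and "p < .001" for very small values.

```python
def apa_p(p):
    """Format a p-value APA style: no leading zero, and '< .001' for tiny values."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def apa_t_report(t, df, p, d):
    """Build the 't(df) = ..., p ..., d = ...' string used in the examples above."""
    return f"t({df}) = {t:.2f}, {apa_p(p)}, d = {d:.2f}"

# The p-value passed in here is a hypothetical input below .001
print(apa_t_report(4.21, 29, 0.0002, 0.77))   # t(29) = 4.21, p < .001, d = 0.77
print(apa_p(0.048))                            # p = .048
```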


🔀 9. If Your Data Isn’t Normal…

Use non-parametric alternatives:

| Parametric Test | Non-parametric Version | Use When |
|---|---|---|
| Paired t-test | Wilcoxon Signed-Rank | Non-normal pre-post scores |
| Independent t-test | Mann–Whitney U | Small or skewed group scores |
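
Both alternatives replace raw scores with ranks. The standard-library sketch below illustrates the idea. It uses the normal approximation for the p-value, with no tie or continuity correction, which is a simplification: with small classes, statistics packages (for example `scipy.stats.mannwhitneyu` and `scipy.stats.wilcoxon`) use exact methods and may give somewhat different p-values. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided p) for two independent groups."""
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, 2 * NormalDist().cdf(z)      # z <= 0 because u is the smaller U

def wilcoxon_signed_rank(pre, post):
    """Return (W, two-sided p) for paired scores; zero differences are dropped."""
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, n * (n + 1) / 2 - w_plus)
    z = (w - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w, 2 * NormalDist().cdf(z)

# Hypothetical scores
u, p_u = mann_whitney_u([82, 75, 79, 88, 71], [70, 74, 65, 78, 69])
w, p_w = wilcoxon_signed_rank([60, 55, 70, 64, 58, 62], [66, 59, 69, 72, 65, 70])
print(f"U = {u}, p = {p_u:.3f}")
print(f"W = {w}, p = {p_w:.3f}")
```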

✅ 10. Statistical Checklist for TESOL Researchers

Before and during analysis:

  • Identify IV and DV
  • Check data type (interval, ratio)
  • Check normality (Shapiro-Wilk)
  • Check variances (Levene’s Test)
  • Choose test: t-test, ANCOVA, etc.
  • Compute and report effect size
  • Interpret in light of RQ, context, and limitations

📚 Further Reading

  • Plonsky, L. (Ed.). (2015). Advancing Quantitative Methods in Second Language Research.
  • Larson-Hall, J. (2016). A Guide to Doing Statistics in Second Language Research Using SPSS and R.
  • Dörnyei, Z. (2007). Research Methods in Applied Linguistics.

🌱 Final Thoughts

Statistics in TESOL aren’t just about numbers — they’re tools to answer important teaching and learning questions. When chosen and interpreted carefully, even simple tests can lead to powerful insights about what works in your classroom or study.

“The goal is not complexity — it’s clarity and credibility.”
