📊 Statistics for Experimental and Quasi-Experimental Research in TESOL

By Dalat TESOL
Helping TESOL researchers confidently analyze classroom-based intervention data

📌 Introduction

You’ve planned an intervention. You’ve collected your pre- and post-test scores. Now what?

This guide introduces key statistical tools for analyzing data from experimental and quasi-experimental TESOL studies. These designs are common in classroom research where teachers compare teaching methods, materials, or technologies.

We’ll walk you through:

What kind of data you need
Which statistical tests to use
What assumptions to check
How to interpret and report your results
Real TESOL research examples

🧠 1. What Kind of Data Do You Have?

Most classroom-based TESOL research works with scores that are treated as interval or ratio data (e.g., test scores, rubric scores, survey totals).

Start with descriptive statistics:

Mean (average performance)
Standard deviation (spread of scores)
Range (lowest to highest)

📌 Always report descriptive statistics first — they offer a snapshot before inferential testing.
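If you want to see how these three numbers are computed, here is a minimal Python sketch using only the standard library. The scores are hypothetical, made up purely for illustration.

```python
import statistics

# Hypothetical post-test scores for one class (illustrative values only).
scores = [62, 70, 75, 68, 81, 77, 59, 73]

mean_score = statistics.mean(scores)   # average performance
sd_score = statistics.stdev(scores)    # sample standard deviation (divides by n - 1)
low, high = min(scores), max(scores)   # endpoints of the range

print(f"M = {mean_score:.2f}, SD = {sd_score:.2f}, range = {low}-{high}")
```

Note that `statistics.stdev` gives the sample standard deviation, which is the version normally reported for classroom data.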

🔍 2. Before Running Tests: Check Assumptions

Most statistical tests below are parametric — they assume your data meets certain conditions.

Assumption	How to Check	What to Do if Violated
Normality (bell-shaped distribution of scores in each group, or of gain scores in a pre/post design)	Shapiro-Wilk test, histograms	Use non-parametric tests (e.g., Wilcoxon signed-rank, Mann–Whitney U)
Equal variances across groups	Levene’s Test	Use Welch’s t-test or a non-parametric test
Interval/ratio scale	Test/survey design	Treat ordinal scales as interval only with caution, or use non-parametric tests

🧪 Use JASP, SPSS, or Jamovi to easily check assumptions.
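To see what Levene’s Test actually measures, here is a minimal Python sketch of the mean-centred version of the statistic, using only the standard library. The function name `levene_w` and the class scores are hypothetical. It returns only the W statistic; the p-value comes from comparing W to an F distribution with (k - 1, N - k) degrees of freedom, which your statistics package will do for you. Some programs centre on the median instead of the mean, so their W may differ slightly.

```python
import statistics

def levene_w(*groups):
    """Mean-centred Levene statistic W for two or more groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each score from its own group mean.
    devs = [[abs(x - statistics.mean(g)) for x in g] for g in groups]
    dev_means = [statistics.mean(d) for d in devs]
    grand_mean = sum(sum(d) for d in devs) / n_total
    # Variation of the group-level deviation means around the grand mean.
    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(devs, dev_means))
    # Variation of individual deviations around their own group mean.
    within = sum((z - m) ** 2 for d, m in zip(devs, dev_means) for z in d)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical scores: similar means, very different spread.
class_a = [70, 72, 74, 76, 78]
class_b = [50, 62, 74, 86, 98]
w = levene_w(class_a, class_b)
print(f"Levene W = {w:.2f}")
```

A W near zero means the groups are about equally spread out; a larger W means their spreads differ.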

📊 3. Choosing the Right Statistical Test

Test	Use When	Example in TESOL
Paired-sample t-test	Comparing pre/post in the same group	Did Class A improve after using Quizlet?
Independent-sample t-test	Comparing two groups at one time point	Did Class A score higher than Class B after the intervention?
ANCOVA	Comparing post-test scores while adjusting for pre-test	Did Class A outperform B even after accounting for pre-test differences?
Repeated Measures ANOVA	Comparing the same group at multiple time points	How did students’ fluency change across three speeches?
Effect Size (Cohen’s d)	Measuring the magnitude of change	Was the gain meaningful or just statistically significant?

🧪 4. Scenario 1: One Group Pre/Post Design

Example:
Does using ChatGPT for planning improve writing fluency?

Group	Pre-test	4-week treatment	Post-test
Class A	Writing sample	Brainstorm with ChatGPT	Writing sample

Test: Paired-sample t-test

Result Example:
t(29) = 4.21, p < .001, d = 0.77

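✅ Interpretation: Students wrote significantly more words after the intervention, with a moderate-to-large effect size. Because there is no comparison group, the gain cannot be attributed to the intervention alone; practice and test familiarity may also have contributed.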
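For readers who want to see the arithmetic behind a paired-sample t-test, here is a minimal Python sketch using only the standard library. The function name `paired_t` and the word counts are hypothetical. The effect size shown is d_z (mean gain divided by the SD of the gains), which is one of several variants of Cohen’s d for paired designs; your software may report a different one. The p-value is left to your statistics package.

```python
import math
import statistics

def paired_t(pre, post):
    """Return (t, df, d_z) for pre/post scores from the same students."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # SD of the gain scores
    t = mean_diff / (sd_diff / math.sqrt(n))   # mean gain over its standard error
    d_z = mean_diff / sd_diff                  # standardized mean gain
    return t, n - 1, d_z

# Hypothetical word counts for six students before and after the treatment.
pre = [120, 135, 110, 150, 128, 140]
post = [138, 150, 118, 171, 139, 162]

t, df, d_z = paired_t(pre, post)
print(f"t({df}) = {t:.2f}, d_z = {d_z:.2f}")
```

Because d_z equals t divided by the square root of n, the result example above is internally consistent: 4.21 / √30 is approximately 0.77.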

🧪 5. Scenario 2: Two Groups, Post-Test Only

Example:
Which feedback type leads to better writing — peer or teacher?

Group	Feedback	Post-test Score
Class A	Peer feedback	78.5
Class B	Teacher feedback	70.3

Test: Independent-sample t-test

Result Example:
t(38) = 2.03, p = .048, d = 0.63

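✅ Interpretation: The peer feedback group scored significantly higher than the teacher feedback group, with a medium effect size. Since p = .048 is close to the .05 threshold, interpret the result with some caution.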
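The calculation behind an independent-sample t-test can be sketched in a few lines of Python with the standard library. This is the classic Student’s version, which assumes equal variances in the two groups. The function name `independent_t` and the scores are hypothetical, and the p-value is again left to your statistics package.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Return (t, df, d) for two separate groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    # Pooled SD: weighted average of the two sample variances.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    t = (mean_a - mean_b) / (pooled_sd * math.sqrt(1 / n_a + 1 / n_b))
    d = (mean_a - mean_b) / pooled_sd          # Cohen's d with pooled SD
    return t, n_a + n_b - 2, d

# Hypothetical post-test scores for two small classes.
peer = [82, 75, 79, 88, 71, 76]
teacher = [70, 68, 74, 65, 77, 66]

t, df, d = independent_t(peer, teacher)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

If Levene’s Test suggests unequal variances, Welch’s t-test replaces the pooled SD with separate variance estimates and adjusts the degrees of freedom.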

🧪 6. Scenario 3: Control for Pre-Test Differences

Problem:
Your two groups had different pre-test scores — can you still compare them fairly?

Solution:
Use ANCOVA to statistically control for these differences.

📌 ANCOVA adjusts post-test scores for pre-test variation: it estimates what each group’s post-test mean would be if both groups had started from the same pre-test mean, like comparing two runners after accounting for a head start.

Caution: Only use ANCOVA if the pre-test and post-test are linearly related, the pre-test/post-test slope is similar in each group (homogeneity of regression slopes), and group variances are homogeneous.
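The adjustment itself can be illustrated with a short Python sketch using only the standard library. It computes ANCOVA-adjusted post-test means from the pooled within-group slope of post-test on pre-test. It does not run the F test, which your statistics package handles. The function name `adjusted_means` and the scores are hypothetical, chosen so the arithmetic is easy to follow.

```python
import statistics

def adjusted_means(groups):
    """groups maps a name to (pre_scores, post_scores); returns adjusted post-test means."""
    all_pre = [x for pre, _ in groups.values() for x in pre]
    grand_pre_mean = statistics.mean(all_pre)
    # Pooled within-group slope of post-test on pre-test.
    sxy = sxx = 0.0
    for pre, post in groups.values():
        mean_pre, mean_post = statistics.mean(pre), statistics.mean(post)
        sxy += sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
        sxx += sum((x - mean_pre) ** 2 for x in pre)
    slope = sxy / sxx
    # Shift each group's post-test mean to where it would be at the grand pre-test mean.
    return {
        name: statistics.mean(post) - slope * (statistics.mean(pre) - grand_pre_mean)
        for name, (pre, post) in groups.items()
    }

# Hypothetical data: Class B starts 10 points ahead on the pre-test.
data = {
    "Class A": ([50, 60, 70], [60, 70, 80]),
    "Class B": ([60, 70, 80], [65, 75, 85]),
}
adjusted = adjusted_means(data)
print(adjusted)
```

In this made-up data, Class B has the higher raw post-test mean (75 vs. 70), but after adjusting for its pre-test advantage the order reverses: Class A’s adjusted mean is 75 and Class B’s is 70.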

📏 7. What About Effect Sizes?

Statistical significance (p < .05) tells you that a difference this large would be unlikely if there were no true effect. Effect size tells you how large the difference is, which helps you judge whether it matters in practice.

Cohen’s d	Interpretation
0.2	Small effect
0.5	Medium effect
0.8+	Large effect

💡 Example: A d = 0.73 falls between the medium and large benchmarks, suggesting the ChatGPT intervention had a moderately large impact on writing fluency.
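The benchmarks in the table can be written as a small Python helper. The function name `label_effect` is hypothetical, and the "negligible" label for values below 0.2 is an assumption added here, since the table does not name that range.

```python
def label_effect(d):
    """Map Cohen's d to the benchmark labels in the table above."""
    size = abs(d)              # direction of the difference does not affect its size
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"        # assumed label; not part of the table

# Effect sizes from Scenario 1 and Scenario 2.
print(label_effect(0.77), label_effect(0.63))
```

Treat these labels as rough conventions and not strict cut-offs: a d of 0.77 is labelled "medium" here, but it sits close to the large benchmark.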

📋 8. Sample Reporting (APA Style)

Example 1: Paired-sample t-test

Students showed significant improvement in writing fluency from pre-test (M = 62.5, SD = 8.4) to post-test (M = 75.3, SD = 7.9), t(29) = 5.21, p < .001, d = 0.73.

Example 2: Independent t-test

The Quizlet group (M = 75.3, SD = 7.9) outperformed the flashcard group (M = 68.4, SD = 9.2) on the vocabulary post-test, t(58) = 2.57, p = .013, d = 0.66.
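If you report many results, a small formatting helper keeps the statistics string consistent. The Python sketch below follows the conventions visible in the examples above: two decimals for t and d, no leading zero on p, and "p < .001" for very small values. The function name `format_apa_t` is hypothetical.

```python
def format_apa_t(t, df, p, d):
    """Format a t-test result in the style of the examples above."""
    if p < 0.001:
        p_text = "p < .001"
    else:
        # Three decimals, with the leading zero removed (0.013 -> .013).
        p_text = "p = " + f"{p:.3f}".lstrip("0")
    return f"t({df}) = {t:.2f}, {p_text}, d = {d:.2f}"

print(format_apa_t(2.57, 58, 0.013, 0.66))
# prints: t(58) = 2.57, p = .013, d = 0.66
```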

🔀 9. If Your Data Isn’t Normal…

Use non-parametric alternatives:

Parametric Test	Non-parametric Version	Use When
Paired t-test	Wilcoxon Signed-Rank	Non-normal pre-post scores
Independent t-test	Mann–Whitney U	Small or skewed group scores
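To show why these tests do not depend on normality, here is a minimal Python sketch of the Mann–Whitney U statistic. It simply counts, across every cross-group pair of students, how often one group’s score beats the other’s, so only the ordering of scores matters. The function name `mann_whitney_u` and the scores are hypothetical; the p-value is left to your statistics package.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b): cross-group pairs won by each group, with ties counted as 0.5."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # Every pair is won by one side or split, so the two U values sum to n_a * n_b.
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Hypothetical scores for two very small groups.
u_a, u_b = mann_whitney_u([3, 5, 8], [1, 2, 5])
print(u_a, u_b)
# prints: 7.5 1.5
```

The smaller of the two values is usually the one reported as U.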

✅ 10. Statistical Checklist for TESOL Researchers

From planning to write-up:

Identify your independent variable (IV) and dependent variable (DV)
Check data type (interval, ratio)
Check normality (Shapiro-Wilk)
Check variances (Levene’s Test)
Choose test: t-test, ANCOVA, etc.
Compute and report effect size
Interpret in light of your research question (RQ), context, and limitations

📚 Further Reading

Plonsky, L. (Ed.). (2015). Advancing Quantitative Methods in Second Language Research.
Larson-Hall, J. (2016). A Guide to Doing Statistics in Second Language Research Using SPSS and R.
Dörnyei, Z. (2007). Research Methods in Applied Linguistics.

🌱 Final Thoughts

Statistics in TESOL aren’t just about numbers — they’re tools to answer important teaching and learning questions. When chosen and interpreted carefully, even simple tests can lead to powerful insights about what works in your classroom or study.

“The goal is not complexity — it’s clarity and credibility.”