Test-Retest Reliability

Simulating consistency, practice effects, and what correlation does (and doesn’t) tell you

Published March 25, 2026

What is Test-Retest Reliability?

Test-retest reliability refers to the consistency of a measure when the same individuals are tested more than once under identical (or near-identical) conditions. It answers a deceptively simple question:

If I measure the same thing twice, do I get the same answer?

A high test-retest correlation tells you the instrument is consistent — not that the construct itself is stable, or that the scores are accurate. This distinction matters more than it might seem, and we’ll return to it at the end.


Interactive Sandbox

Use the sliders below to explore how sample size, true correlation, and practice effects interact. All plots and statistics update live in your browser — no server needed.

Scatter plot

Legend
  • 🔵 Blue filled circles — original retest scores
  • 🔴 Red open circles — retest + practice effect (visible when practice effect ≠ 0)
  • Dashed line — perfect agreement (test = retest)

Statistics summary

Things to try with the sliders
  • Drag n from 10 → 300 — watch the sample r stabilise toward the true ρ. Small samples are noisy!
  • Set ρ = 0, practice effect = 0 — the t-test should be non-significant and r ≈ 0.
  • Set practice effect = 5 — the mean difference jumps, the t-test turns significant, but r doesn’t move at all.
  • Change the seed — same parameters, different sample. How much does r jump around at n = 20 vs n = 200?
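The sandbox itself runs in JavaScript, but the simulation behind the sliders can be sketched in a few lines. The following Python sketch (assuming bivariate-normal scores with mean 100 and SD 15 — illustrative choices, not necessarily the sandbox's exact defaults) mirrors the same logic: draw correlated test/retest pairs, add a constant practice effect, then compute the sample \(r\) and a paired t-test.

```python
import numpy as np
from scipy import stats

def simulate_retest(n=50, rho=0.8, practice=0.0, seed=1,
                    mean=100.0, sd=15.0):
    """Draw n (test, retest) pairs with true correlation rho,
    then add a constant practice effect to the retest scores."""
    rng = np.random.default_rng(seed)
    cov = (sd ** 2) * np.array([[1.0, rho], [rho, 1.0]])
    test, retest = rng.multivariate_normal([mean, mean], cov, size=n).T
    retest = retest + practice          # simulate a practice effect
    r, _ = stats.pearsonr(test, retest)
    t, p = stats.ttest_rel(test, retest)
    return r, t, p

# Same pattern as the slider experiment: with a practice effect of 5,
# the t-test fires while r stays close to the true rho.
r, t, p = simulate_retest(n=200, rho=0.8, practice=5.0, seed=42)
```

Re-running this with different seeds at `n=20` versus `n=200` reproduces the "r jumps around" behaviour from the last bullet above.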

Explanation

Why does correlation ignore practice effects?

Pearson’s \(r\) measures the linear relationship between deviations from the mean. When you add a constant \(\Delta\) to every retest score:

\[r_{X,\, Y+\Delta} = \frac{\text{Cov}(X,\, Y+\Delta)}{\text{SD}(X)\cdot\text{SD}(Y+\Delta)} = \frac{\text{Cov}(X,Y)}{\text{SD}(X)\cdot\text{SD}(Y)} = r_{X,Y}\]

The constant cancels out entirely. The paired t-test, by contrast, works on the raw differences \(d_i = X_i - Y_i\), so it picks up the shift immediately.
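The identity is easy to confirm numerically. This small sketch (with arbitrary illustrative parameters) checks that adding a constant \(\Delta\) leaves both the covariance and the SD untouched, while the paired differences shift by exactly \(-\Delta\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100, 15, size=500)          # test scores
y = x + rng.normal(0, 8, size=500)         # retest scores
delta = 5.0                                # constant practice effect

cov_plain = np.cov(x, y)[0, 1]
cov_shift = np.cov(x, y + delta)[0, 1]     # Cov(X, Y + Δ) = Cov(X, Y)
sd_plain = np.std(y, ddof=1)
sd_shift = np.std(y + delta, ddof=1)       # SD(Y + Δ) = SD(Y)

d_plain = x - y
d_shift = x - (y + delta)                  # differences move by exactly -Δ
```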

This is why reliability researchers often complement \(r\) with:

  • Bland-Altman plots — visualise agreement and systematic bias simultaneously
  • Intraclass Correlation Coefficient (ICC) — penalises both poor correlation and mean-level differences
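To see the ICC's penalty in action, here is a hand-rolled sketch of ICC(A,1) — two-way random effects, absolute agreement, single measurement, i.e. Shrout–Fleiss ICC(2,1) — built from the standard ANOVA mean squares rather than any library function. The data-generating parameters are illustrative. A constant shift leaves Pearson's \(r\) untouched but drags the ICC down:

```python
import numpy as np

def icc_a1(x, y):
    """ICC(A,1): two-way random effects, absolute agreement,
    single measurement (Shrout & Fleiss ICC(2,1))."""
    data = np.column_stack([x, y])
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)           # one mean per subject
    col_means = data.mean(axis=0)           # one mean per occasion
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sse = np.sum((data - row_means[:, None]
                  - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

rng = np.random.default_rng(1)
test = rng.normal(100, 15, size=200)
retest_clean = test + rng.normal(0, 5, size=200)
retest_shift = retest_clean + 10.0          # same scores + practice effect

r = np.corrcoef(test, retest_shift)[0, 1]   # identical to r without the shift
icc = icc_a1(test, retest_shift)            # lower than without the shift
```

The Bland-Altman bias is even simpler to recover from the same data: `(retest_shift - test).mean()` estimates the systematic shift directly.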

Key Takeaway

Don’t confuse reliability with stability

A high test-retest correlation tells you that individuals who score high on the first occasion tend to score high on the second — it reflects consistency of rank ordering.

It does not tell you:

  • That the absolute scores are accurate
  • That the construct hasn’t changed over time
  • That systematic biases (like a practice effect) are absent

For a fuller picture, pair the correlation with a check on mean-level differences — which is exactly what Bland-Altman plots and ICC are designed for.


Session Info

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Ventura 13.0

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.9   otel_0.2.0        rstudioapi_0.18.0
 [9] yaml_2.3.12       rmarkdown_2.30    knitr_1.51        jsonlite_2.0.0   
[13] xfun_0.56         digest_0.6.39     rlang_1.1.7       evaluate_1.0.5