“Test Reliability:
Should You Care?”
by Steven B. Just
If you were a lab scientist doing an experiment,
you would run the experiment and record your measurements.
And then, if your measurements were important, what’s
the first thing you would do? You would repeat the experiment,
perhaps several times, to ensure that your results were accurate.
If you then published your results in an academic journal,
other scientists would again try to replicate your results
before accepting them. If your results could be replicated,
there would then be a general consensus that your measurements
were reliable.
A test, of course, is a form of measurement, sometimes
used for important corporate personnel decisions. Yet most
corporate trainers accept their measurements based on a single
administration of the test, without questioning the reliability
of their results. This is not true for standardized tests,
which are rigorously analyzed, but it is certainly true for
most of the types of testing done by our clients: a single
summative result that measures the outcome of a one-time learning
experience (instructor-led, print, eLearning, etc.).
The reasons for this are clear: We work under
budget and time constraints and we don’t have the luxury
of administering a test multiple times to comparable groups
over a period of time to assure ourselves of the reliability
of our results.
Fortunately, there are statistical methods that
allow us to measure test reliability based on a single administration
of a test. These are called internal consistency
reliability measures, and the two most common statistics are
Kuder-Richardson Formula 20 (K-R 20) and Cronbach’s
alpha. For typical knowledge-based assessments where items
are scored dichotomously (i.e., right or wrong), these two measures
are equivalent.
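For readers who want to see the statistics themselves, the standard textbook forms are sketched below. The symbols are not from this article and are introduced here only for illustration: k is the number of items, p_i is the proportion of test takers answering item i correctly, q_i = 1 - p_i, sigma_i^2 is the variance of item i, and sigma_X^2 is the variance of the total scores.

```latex
\[
r_{\mathrm{KR20}} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right),
\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
\]
```

When every item is scored 0 or 1, the variance of item i is exactly p_i q_i, so the two expressions give the same number.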
How do these reliability measures work? Imagine
that we arbitrarily divide a single test in half (say odd
questions and even questions), score each half independently
and correlate one half with the other. In theory, if the test
is internally consistent, the scores should correlate. Then
we divide the test in half in a different way (say first half
of the test and second half of the test) and correlate these
two halves. And we keep doing this. In effect, these two reliability
measures take all possible split-half correlations of the
test and average them to give one reliability estimate. The
reliability estimate is read like a correlation and normally falls between
0 and 1; the closer to 1, the better.
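In practice nobody computes every split by hand. The minimal Python sketch below uses the usual closed-form formulas for K-R 20 and Cronbach’s alpha rather than literally averaging split halves. The function names (kr20, cronbach_alpha) and the small score table are hypothetical and made up purely for illustration; this is one way to do the calculation under those assumptions, not a reference implementation.

```python
from statistics import pvariance


def _check(scores):
    """Return (test takers, items) after basic sanity checks on a 0/1 score table."""
    n, k = len(scores), len(scores[0])
    if k < 2:
        raise ValueError("need at least two items")
    return n, k


def kr20(scores):
    """K-R 20 for rows of 0/1 item scores (one row per test taker)."""
    n, k = _check(scores)
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        # Everyone earned the same total score, so the statistic is undefined.
        raise ValueError("total scores have zero variance")
    # Sum of p * q across items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_var)


def cronbach_alpha(scores):
    """Cronbach's alpha; items may be 0/1 or carry partial credit."""
    _, k = _check(scores)
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: six test takers, four right/wrong items.
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

print(round(kr20(example_scores), 3))            # 0.833
print(round(cronbach_alpha(example_scores), 3))  # 0.833
```

Both functions use the population variance throughout, which is what makes the two results agree for right/wrong items. Note the guard for zero variance: if every test taker earns the same total score, neither statistic can be computed.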
An Important Caveat
These reliability estimates were developed for
norm-referenced tests (the type that give a nice wide distribution
of scores). The type of testing most corporations do is criterion-referenced
(passing is set at a high cut score, typically 90, and most
students pass, so the grades tend to bunch up at the high
end of the curve). For statistical reasons beyond the scope
of this article, reliability scores for criterion-referenced
tests tend to be low. Many psychometricians feel that these
reliability measures are therefore not meaningful for criterion-referenced
tests.
The Bottom Line
If you are doing criterion-referenced testing,
run reliability statistics, but view the results critically.
