Reliability
the consistency of test scores over different test administrations, multiple raters, different test questions. How likely is it that a student would obtain the same score?
![]()

Because more selection than supply items can be administered in the same amount of time, we can usually get a better sample of the domain -- there are more pieces of the puzzle to fill in the picture. Research suggests that 9 or more performance items within a particular domain should be administered if individual-level decisions are going to be made.

Source: Gao, Shavelson, Baxter. (1994).
Generalizability of large-scale performance assessments in science: Promises and problems.
Back to the beginning of What is a Test?