Measures of Reliability
Reliability means that a scale should consistently reflect the construct it is measuring. One way to think of reliability is that, other things being equal, a person should get the same score on a questionnaire if they complete it at two different points in time (test-retest reliability). Another way to look at reliability is to say that two people who are the same in terms of the construct being measured should get the same score. In statistical terms, the usual way to look at reliability is based on the idea that individual items (or sets of items) should produce results consistent with the overall questionnaire.

The simplest way to do this in practice is to use split-half reliability. This method randomly splits the items of the scale into two halves, and a score for each participant is then calculated on each half. If a scale is very reliable, a person’s score on one half of the scale should be the same as (or similar to) their score on the other half; therefore, across several participants, scores from the two halves of the questionnaire should correlate perfectly (well, very highly). The correlation between the two halves is the statistic computed in the split-half method, with large correlations being a sign of reliability. The problem with this method is that there are several ways in which a set of items can be split into two, so the results could be a product of the way in which the data were split.

To overcome this problem, Cronbach (1951) came up with a measure that is loosely equivalent to splitting the data in two in every possible way and computing the correlation coefficient for each split. The average of these values is equivalent to Cronbach’s alpha, α, which is the most common measure of scale reliability (this is a convenient way to think of Cronbach’s alpha, but see Field, 2005, for a more technically correct explanation).
There are two versions of alpha: the normal and the standardized versions. The normal alpha is appropriate when items on a scale are summed to produce a single score for that scale (the standardized α is not appropriate in these cases). The standardized alpha is useful, however, when items on a scale are standardized before being summed.
Interpreting Cronbach’s α (some cautionary tales …)
You’ll often see in books and journal articles, or be told by people, that a value of 0.7 to 0.8 is an acceptable value for Cronbach’s alpha, and that values substantially lower indicate an unreliable scale. Kline (1999) notes that although the generally accepted value of 0.8 is appropriate for cognitive tests such as intelligence tests, a cut-off point of 0.7 is more suitable for ability tests. He goes on to say that when dealing with psychological constructs, values below even 0.7 can, realistically, be expected because of the diversity of the constructs being measured. However, Cortina (1993) notes that such general guidelines need to be used with caution because the value of alpha depends on the number of items on the scale (see Field, 2005, for details).
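To illustrate Cortina’s point, the Python sketch below uses the standardized-alpha formula kr̄ / (1 + (k − 1)r̄). It holds a hypothetical average inter-item correlation of r̄ = 0.3 fixed and changes only the number of items k. The function name is illustrative.

```python
def alpha_from_rbar(k, rbar):
    """Standardized alpha for k items whose average inter-item correlation is rbar."""
    return k * rbar / (1 + (k - 1) * rbar)


RBAR = 0.3  # hypothetical average inter-item correlation, held fixed

for k in (5, 10, 20, 40):
    print(k, round(alpha_from_rbar(k, RBAR), 3))
# 5 0.682
# 10 0.811
# 20 0.896
# 40 0.945
```

The items are no more closely related in the 40-item version than in the 5-item version. Yet alpha moves from below 0.7 to well above 0.8, so a fixed cut-off partly reflects scale length.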
Alpha is also affected by reverse-scored items. For example, in our SAQ from last week we had one item (question 3) that was phrased the opposite way around to all the other items: ‘standard deviations excite me’. Compare this to any other item and you’ll see it requires the opposite response. For example, item 1 is ‘statistics make me cry’. If you don’t like statistics, you’ll strongly agree with this statement and so will get a score of 5 on our scale. For item 3, if you hate statistics then standard deviations are unlikely to excite you, so you’ll strongly disagree and get a score of 1 on the scale. These reverse-phrased items are important for reducing response bias: participants will actually have to read the items in case they are phrased the other way around. In reliability analysis these reverse-scored items make a difference: in the extreme...
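Reverse-scoring an item before the reliability analysis can be sketched in Python as follows. The responses are made up. The third item (index 2) is assumed to be reverse-phrased, like ‘standard deviations excite me’. The `reverse_score` helper is hypothetical. On a 1-5 scale, a reversed score is 6 minus the original score, or more generally lowest + highest − score.

```python
import statistics

# Hypothetical responses (made up): 6 participants x 4 items scored 1-5.
# The third item (index 2) is assumed to be reverse-phrased.
DATA = [
    [5, 4, 1, 5],
    [4, 4, 2, 4],
    [2, 1, 4, 2],
    [3, 3, 3, 2],
    [1, 2, 5, 1],
    [4, 5, 2, 4],
]


def cronbach_alpha(data):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)."""
    k = len(data[0])
    item_vars = [statistics.variance(col) for col in zip(*data)]
    total_var = statistics.variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def reverse_score(data, item, low=1, high=5):
    """Return a copy of data with one item reversed: score -> low + high - score."""
    return [[low + high - x if i == item else x for i, x in enumerate(row)]
            for row in data]


before = cronbach_alpha(DATA)
after = cronbach_alpha(reverse_score(DATA, item=2))
print("alpha, item 3 left as is:    ", round(before, 3))
print("alpha, item 3 reverse-scored:", round(after, 3))
```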