PSY 498-8 PSYCHOLOGICAL ASSESSMENT
ASSIGNMENT 1 CATHERINE SCOTT 3315 459 7
DEVELOPMENT OF A PSYCHOLOGICAL MEASURE
INTRODUCTION

The process of developing a reliable and valid psychological measure is a complex task. In this paper, the author explains how to construct a test which can be considered both valid and reliable.

Steps to developing a psychological measure

The steps to follow when developing a psychological measure can seem daunting and complex. There are nine basic steps which need to be followed:

1. THE PLANNING PHASE

This is where the aim of the measure needs to be decided on and stated. The characteristic or construct to be measured, what the measure will be used for, and the target group (population) for the measure also need to be defined. Once this has been clarified, one can decide what decisions can be made on the basis of test scores. An important planning decision is whether performance will be compared to a criterion or to a group norm. In order to define the content of a measure, one needs a clearly defined purpose; the purpose is vital, as it serves as the basis for constructing the measure. The construct needs to be operationally defined by undertaking a literature review of the main theoretical viewpoints on the construct. In this phase, 'keying' is used, where information is gathered about the 'aspects of the construct on which these groups usually differ' (Foxcroft & Roodt, 2001, p. 72). For example, items are needed that discriminate between individuals, so as to allow the assessor to identify the various 'risk' groups. Deciding on the format and number of each type of item is the next step in the planning phase. The format of the test will vary according to the construct being measured.
There are open-ended items (no limits placed on the test-taker), forced-choice items (such as multiple-choice items, where the test-taker must choose carefully between set alternatives), sentence-completion items, essay items (which test logical thinking and organisational ability) and performance items (where the test-taker generally manipulates apparatus). As mentioned previously, the construct being measured will directly influence the item type, as will practicalities such as time constraints; if time is limited, essay-type questions may not be appropriate. The number of items also plays a role in the format, as it too influences testing time. Because many items are discarded during the refinement process, most developers create more items than needed in the planning phase. Writing the items is usually a task completed by a team of experts (Foxcroft & Roodt, 2001). The purpose of the measure helps to keep the focus on content validity. Existing measures, theories, textbooks and similar sources provide ideas for test items. When writing items, the wording must be clear and concise, so that the test-taker can easily comprehend what is required. Negative expressions and ambiguity should never be used, as they create confusion and may invalidate results. The position of the correct answer in multiple-choice items should be varied, and the ratio of true to false items should be balanced. The nature of the content should also be relevant, which is discussed further under content validity. When measuring constructs in children, the stimulus material should be attractive and varied, as children tire easily of mundane, repetitive tasks. Once the items have been developed, they are reviewed and evaluated by a panel of experts (Foxcroft & Roodt, 2001, p. 75).
They will judge the items on relevance, appropriateness for the target group, wording, and the nature of the stimuli. Based on their findings, certain items may need to be rewritten or discarded (refer to the earlier mention of refinement).

2. ASSEMBLY AND PRETESTING OF A MEASURE

Items need to be arranged logically; this includes grouping similar item types together. Once the measure is in a logical format, its length needs to be reassessed. The time taken by test-takers to read and comprehend the instructions plays a major role in this phase. After consideration, the time allocated for completing the measure may need to be increased or decreased, or certain items may need to be discarded. The answer protocol should be predetermined, as this allows for ease of administration. The booklet or answer sheet should be developed in such a way that scoring the measure, and reproducing the booklet or answer sheet, is an uncomplicated process. Administration instructions need to be developed and should be clear and unambiguous. It is advisable to pretest the instructions on a sample of people from the target population (Foxcroft & Roodt, 2001, p. 76). Training on administering the measure may need to be provided, as misunderstanding of instructions may lead to poor performance on an item. The measure is then ready to administer to a large sample of the target population. Benefits of the pretest phase include feedback from the test-takers on the level of difficulty, ease of comprehension, sequencing of items, and the length of the measure.

3. ITEM ANALYSIS PHASE

The main aim of this stage is to validate the purpose of each item (i.e. does each item measure the construct?). The difficulty value of an item, represented as 'p', is the proportion of the individuals who answer it correctly.
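As a numerical sketch of this item-analysis phase, the difficulty value p introduced above, together with the discrimination index D and item-total correlation discussed below, can be computed as follows. The response data are made up for illustration and are not taken from Foxcroft and Roodt.

```python
import numpy as np

# Hypothetical scored responses: rows = test-takers, columns = items (1 = correct).
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

# Difficulty value p: proportion of test-takers answering each item correctly
# (a higher p means an easier item).
p = responses.mean(axis=0)

# Discrimination index D: proportion correct in the upper-scoring half
# minus the proportion correct in the lower-scoring half.
totals = responses.sum(axis=1)
order = np.argsort(totals)
lower, upper = order[:len(order) // 2], order[-(len(order) // 2):]
D = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

# Item-total correlation: correlation between each item and the total score
# (items below about 0.20 are candidates for removal).
item_total = np.array([np.corrcoef(responses[:, i], totals)[0, 1]
                       for i in range(responses.shape[1])])
```

A positive D and an item-total correlation above 0.20 would, on this view, mark an item as a good discriminator.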
Clearly, the higher the percentage of correct responses, the easier the item. This 'p' value tells us about the frequency of correct responses, but nothing else about the characteristics of an item; different samples may yield different values for 'p', but it does provide a generalised view of difficulty. Good test items measure the 'same thing as the total test is measuring' (Foxcroft & Roodt, 2001, p. 78): a good score on an item should predict a good score on the total test, and vice versa. The discriminating power of an item is expressed by a discrimination index (D), calculated as the proportion of the upper-scoring group answering the item correctly minus the proportion of the lower-scoring group doing so. A positive 'D' value indicates an item that discriminates between the upper and lower groups and has good discriminating power. The item-total correlation is computed between the score on the item and the total performance on the measure (Foxcroft & Roodt, 2001, p. 79). A correlation of 0.20 is the minimum acceptable discrimination value when considering item selection.

4. ADMINISTRATION TO STANDARDISATION SAMPLE

Once feedback has allowed for modifications, the measure is ready to be administered to a large representative sample, so as to establish reliability, validity and norms (norms are used only if performance is not compared to a criterion). 'Reliability of a measure refers to the consistency with which it measures whatever it measures' (Foxcroft & Roodt, 2001, p. 41). There are five types of reliability, each suited to different types of measures: a. Test-retest reliability: the same test is administered on two different occasions to the same group of people. This technique is not entirely reliable in itself, as personal and environmental factors may change for the test-takers, and they may recall answers, which invalidates the second set of results. b.
Alternate-form reliability: two similar tests on the same content are administered, and the correlation between the two sets of scores gives the reliability coefficient. The limitations of this form are expense, time and the difficulty of creating an equivalent yet different test. According to Foxcroft and Roodt, and the author of this paper, this type is not easy to construct and is therefore not recommended. c. Split-half reliability: one test is administered in one session and split equally to obtain two scores. The most common method of splitting a measure is to calculate the correlation between the odd- and even-numbered items. Because this gives a result for only half the measure, it is taken to be an 'underestimation of the reliability coefficient' (Foxcroft & Roodt, 2001, p. 43); the Spearman-Brown formula yields the corrected reliability coefficient (rtt). d. Kuder-Richardson and Coefficient Alpha: these are based on the responses to the total measure. A score of '1' is given for a correct answer and '0' for an incorrect answer, and the KR-20 formula is used. This type of reliability can only be applied to performance measures, as responses on personality scales cannot be correct or incorrect; Cronbach developed the Coefficient Alpha as a reliability measure for such scales. e. Inter-scorer reliability: two assessment practitioners score the tests separately, and the correlation between the two sets of scores is the inter-scorer reliability coefficient. This method carries a high risk of contamination if the practitioners know either the test-taker or the previous scores, and a 'halo error' may occur. Certain measures are affected by time (the absence or presence of a time limit), age, gender and ability levels; correlations between scores therefore need to be computed separately for homogeneous groups to obtain a more reliable coefficient. Validity determines how well a measure measures the construct it is intended to measure. There are three types of validity: a.
Content validity: determines whether the content of the measure covers a representative sample of the behaviour being measured. It is non-statistical: a panel of subject experts evaluates the items during the assembly phase (Foxcroft & Roodt, 2001). This form of validation is best applied to achievement and occupational measures. Within content validity, face validity is an important aspect for the test-taker; face validity has nothing to do with what the measure actually assesses, but with whether the measure appears valid. b. Criterion-prediction validity: a quantitative procedure that involves calculating the correlation coefficient between a predictor and a set criterion. Huysamen and Smit defined two types of criterion-related validity. Concurrent validity is the accuracy with which a measure can identify an individual's current behaviour or skills; predictive validity is the accuracy with which a measure can predict an individual's future behaviour. The purpose of the measure will dictate which type is used. Validity generalisation, where specific skills are assessed, and meta-analysis (a review of the research literature on a specific topic) are popular techniques (Anastasi & Urbina, 1997). After each administration of a measure, cross-validation of scores should be carried out, so as to obtain a realistic sense of the validity coefficients. c. Construct-identification validity: the extent to which a measure 'measures the theoretical construct or trait it is supposed to measure' (Foxcroft & Roodt, 2001, p. 53). Correlations with similar tests are expected to be high, but not too high, as the measure would then merely duplicate existing measures. Factor analysis is used to analyse the correlations among variables; the factorial validity of a measure refers to the factors the measure assesses. It is best to measure a small number of variables, so as to create specialised measures.
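The coefficients computed in this standardisation step can be sketched numerically. The following is a minimal illustration with made-up 0/1 item scores and made-up criterion ratings (all values hypothetical), showing split-half reliability with the Spearman-Brown correction, KR-20 and Coefficient Alpha, and a criterion-prediction validity coefficient.

```python
import numpy as np

# Hypothetical scored responses: rows = test-takers, columns = items (1 = correct).
X = np.array([
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
])
totals = X.sum(axis=1)

# Split-half reliability: correlate odd- and even-numbered item scores,
# then apply the Spearman-Brown correction rtt = 2r / (1 + r).
r_half = np.corrcoef(X[:, 0::2].sum(axis=1), X[:, 1::2].sum(axis=1))[0, 1]
r_tt = 2 * r_half / (1 + r_half)

# KR-20 for dichotomous (0/1) items; Coefficient Alpha is the general form
# and reduces to KR-20 here (population variances used throughout).
k = X.shape[1]
p = X.mean(axis=0)
kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / totals.var())
alpha = (k / (k - 1)) * (1 - X.var(axis=0).sum() / totals.var())

# Criterion-prediction validity: correlation between test scores (predictor)
# and an external criterion, e.g. later performance ratings (made-up values).
criterion = np.array([4.6, 4.1, 3.0, 2.4, 2.8, 1.9])
validity = np.corrcoef(totals, criterion)[0, 1]
```

In practice these coefficients would be computed on the large standardisation sample described above, not on a handful of cases as here.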
A measure should correlate highly with other similar variables, and minimally with irrelevant variables. According to Huysamen, a validity coefficient (r) should be considered significant at the 0.05 and 0.01 levels (Foxcroft & Roodt, 2001). Many variables affect validity coefficients: the nature of the group, the relationship between the predictor and the criterion, the relation of validity to the measure's reliability, criterion contamination, and moderator variables (e.g. personality). These allow for error in validity coefficients; there is a formula for correcting and estimating the standard error, which allows minor outliers to be accommodated. Norms are measurements of a group, against which an individual's score can be evaluated. The choice of norm scale for a new measure will depend on the developer's preference. The most commonly used norm scales are: a. Mental age scales: the highest age level at which a measure is passed is calculated and called the basal age. A child's mental age is the basal age plus any additional months of credit earned at higher age levels; chronological age is irrelevant (e.g. reading age is calculated according to mental age). b. Grade equivalents: scores are interpreted in terms of the average scores for a grade in a specific learning area. A child may have a reading level equivalent to grade 3 while currently in grade 7, and their mathematical ability could likewise differ from their current grade. c. Percentiles: scores are divided into quartiles, e.g. Q1, the first quartile, represents the lower 25 per cent of scores. This is not the lowest 25 marks, but the bottom quarter of the score distribution (i.e. a test-taker may score 75% on a test and still fall in the first quartile). d. Standard scores: these are further divided into several types. z-scores represent an individual's distance from the mean in standard deviation units.
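A brief numerical sketch of the percentile and z-score conversions just described, using made-up norm-group scores (the values are illustrative only):

```python
import numpy as np

# Hypothetical raw scores from a normative sample.
norms = np.array([40, 45, 47, 50, 52, 55, 58, 60, 63, 70])

raw = 58  # an individual's raw score

# z-score: distance from the norm-group mean in standard deviation units.
z = (raw - norms.mean()) / norms.std()

# Percentile rank: percentage of the norm group scoring below the raw score.
percentile = (norms < raw).mean() * 100
```

Here the individual scores above the mean (a positive z) and outranks 60 per cent of the norm group, so the same raw score is interpreted relative to the group rather than in isolation.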
A raw score equal to the mean has a z-score of '0'; positive z-scores indicate above-average performance. Linearly transformed standard scores are transformations of z-scores that eliminate the limited range of the z-score distribution. Normalised standard scores are standard scores that have been transformed to fit a normal distribution (Foxcroft & Roodt, 2001, p. 60); this is done when it can be assumed that the attribute in question is normally distributed. McCall's T-scores eliminate negative values and provide a standard scale. Stanine and sten scales are also used to reflect a person's position in relation to a normative sample. The deviation IQ scale, used by most intelligence measures, is a normalised standard score with a mean of 100 and a standard deviation of 15; it is easily comprehensible, but can only be used for certain measures.

5. TECHNICAL EVALUATION

The properties of the measure need to be established in this phase. Validity, reliability and norms are stated, so that an individual's score can be meaningfully interpreted.

6. COMPILING A TEST MANUAL

According to Foxcroft and Roodt, the test manual must include the following: a. the purpose of the measure; b. the target group for which it is intended; c. the time allocation (even if there is no time limit, this must be stated); d. reading or grade-level requirements; e. whether the administrator requires training, and where this is available; f. administration, scoring and interpretation instructions; g. the developmental process of the test; h. the types of validity and reliability established; i. whether test bias exists; j. how and when the norms were established, along with the normative sample's characteristics (in raw data tables in an appendix); and k. any cut-off scores which may apply. Any revised editions of the manual and tests should be published.

7. SUBMITTING THE MEASURE FOR CLASSIFICATION

The Psychometrics Committee will determine whether the measure can be classified as a psychological measure.
8. PUBLISHING THE MEASURE

Marketing material should be concise and accurate. It should state any additional requirements for the administrator, and should be worded appropriately for the target group.

9. ONGOING REFINEMENT

If item content becomes dated, revisions of the measure are needed. All new findings should be published, and revisions should go through the same process as the original measure, so as to ensure that validity and reliability still apply.

CONCLUSION

As this paper has shown, there are many methods of validating a measure, each applicable to specific constructs. The factors included in validity testing comprise tests for content, criterion-prediction and construct identification. The end result is a completed measure and test manual, as well as the classification of the measure.

BIBLIOGRAPHY
Foxcroft, C. & Roodt, G. (2001). An Introduction to Psychological Assessment. Oxford University Press.
UNISA Press (2005). PSY498-8, Tutorial Letter 102/2005.