Psychological assessment is a process-orientated activity aimed at gathering a wide array of information by using psychological assessment measures. Developing a psychological measure consists of a few phases and steps that need to be followed. The measure needs to be planned carefully, items need to be written, and the initial version of the measure needs to be administered so that the effectiveness of the items can be determined. The final items are then chosen and the measure is standardized on a representative sample of the target group, so that the validity, reliability and norms of the measure can be established.
After the above, the test manual is compiled. Phases and steps. Planning: This phase consists of specifying the aim, defining the content of the measure and developing the test plan. Specifying the aim of the measure involves stating a few points. These include the purpose of the measure, what attributes, characteristics or constructs it will measure, and what types of decisions could be made on the basis of the test scores. It should state whether the measure can be used for screening, competency-based selection and training purposes, or rather for in-depth diagnostic assessment.
It should also state which population (including cultural or multicultural groups) the measure is intended for, whether the measure is intended to be administered individually and/or in a group context, whether it is paper- or computer-based, and whether it is a normative-, ipsative- or criterion-referenced measure. Defining the content of the measure is important because the content is directly related to the purpose. The construct needs to be operationally defined. In clinical settings, the rational method can be used. In educational settings, learning outcomes in specific learning areas form the basis.
In organizational settings, the operational definition is based on a job analysis that identifies the competencies that are needed to perform a job successfully. The construct needs to be comprehensively defined, and a construct map can be used to ensure that it is well defined. The purpose for which the measure is developed must also be considered; where test scores must distinguish between criterion groups, empirical criterion keying can be used. The more rational and analytical approach is often combined with the empirical criterion keying approach to ensure that the result is theoretically grounded.
When the items are tried out in the subsequent phases, factor analysis can be used to refine the underlying dimensions. When a measure is developed for multicultural or multilingual groups, it is critical that the construct is explored in terms of each cultural and language group's understanding and valuing of it. Developing the test plan is where the format needs to be considered. The format consists of two aspects: a stimulus to which the test taker responds and a mechanism for response. Common item formats are forced-choice items, sentence-completion items and performance-based items.
When developing the format, bear practical considerations in mind. The method of responding also has various formats, such as the objective format and the subjective format. In an objective format there is only one answer, which is either correct or is perceived to provide evidence of a specific construct. In a subjective format, on the other hand, test takers can respond to a question verbally or in writing, and the interpretation of the response as providing evidence of the construct depends on the judgment of the assessment practitioner.
Alongside the item stimulus, it should be considered whether or not a time limit should be imposed. This should be carefully considered for multicultural groups, as test performance in such groups is often enhanced when time limits are removed. Bias can unintentionally be introduced, and it is therefore critical to pay attention to potential method and response bias, especially when multicultural groups are assessed. Test takers can also respond according to a response set. The possibility of response sets should be minimized, because they might result in false or misleading information.
The length of the measure should be decided with the time limit (if any) and the purpose of the measure in mind. Only once all of the above has been taken into consideration can the specifications for the measure be clearly conceptualized. These are formalized in a test plan, which specifies the content domains to be included and the number of items in each one. Item writing: Writing the items is usually done by a group of professionals. The purpose and specifications of the measure provide the guidance for the item developers.
There is a variety of sources that can be consulted for ideas on items. Depending on the type of measure being developed, there are a few important pointers for item writing. The wording must be clear and concise. Appropriate vocabulary should be used for the target audience. Avoid negative expressions and double negatives in particular. Cover only one central theme in an item. Avoid ambiguous items. Vary the positioning of the correct answer in multiple-choice measures, and all the distractors in these questions should be plausible.
True or false statements should be approximately the same length, and the number of false statements should be approximately equal to the number of true statements. Finally, the nature of the content covered should be relevant to the purpose of the measure. When writing items for children, the stimulus material should be as colorful and attractive as possible. Include varied types of tasks, and manipulative tasks should be easy to handle. Some items will be discarded during later phases, so provision needs to be made for including more items in the experimental version than will be required in the final version.
In general, test developers usually develop double the number of items that they require in their final item pool. When item analysis is performed, normally a third of the items are discarded. The next step is reviewing the items. After all the items have been developed, they are submitted to a panel of experts to review and evaluate. The panel judges whether the items sufficiently tap the content domain or dimensions of the construct being assessed. They also comment on the cultural, linguistic and gender appropriateness of the items. The wording of the items, as well as the nature of the stimulus materials, will be closely evaluated.
The items could also be administered to a small group of people from the intended target population to obtain qualitative information regarding items that they find difficult to understand. Based on the panel's recommendations and the data gathered from trying out the items, some of the items may need revising or have to be rewritten. Assembling and pre-testing the experimental version of the measure: When getting the measure ready for its first experimental administration, there are a few practical considerations that require attention. The first step is arranging the items.
Items need to be arranged in a logical way in terms of the construct being measured. In a paper-based test, items need to be grouped or arranged on the appropriate pages in the test booklet. The second step is finalizing the length. Now that the experimental version of the measure is available, the length needs to be revisited. Although sufficient items have to be included to sample the construct, the time that test takers will need to read the questions also has to be taken into account. The more they have to read, the longer it will take them to complete the measure.
If the measure has time constraints and there is a lot to read, some items will have to be discarded or the amount that has to be read needs to be reduced. The third step is designing the answer protocols. For paper-based tests, decisions need to be made as to whether items will be completed in the test booklet or whether a separate answer sheet needs to be developed. Care should be taken to design the answer protocol in such a way that it aids the scoring of the measure and is easy to reproduce. The next step is to develop administration instructions.
These instructions need to be developed in such a way that they are clear and unambiguous for the experimental try-out of the items. It is advisable to pre-test the instructions on a sample of people from the target population. The assessment practitioners also need to be comprehensively trained. If care is not taken with this step of the test development process, it could have negative consequences for performance on the items later, during the experimental pre-testing stage. The last step is to pre-test the experimental version of the measure.
The measure should be administered to a large sample from the target population, and qualitative as well as quantitative information should be gathered. Information about how the test takers found the questions could be invaluable during the item refinement and final item selection phase. Information about how test takers responded to the stimulus materials, the ordering of the items and the length of the measure could also be helpful for the test developer. Item analysis phase: This phase adds value to item development and the development of the measure in general.
The purpose of item analysis is to examine each item to see whether it serves the purpose of the measure. Item analysis helps us to determine how difficult an item is, whether it discriminates between good and poor performers, whether it is biased, and what its shortcomings are. Statistics are computed to evaluate the characteristics of the items. Classical test theory and item response theory are two statistical approaches that can be followed to analyse the characteristics of items. The results are then used to guide the final item selection as well as the organization of the items in the measure.
Classical test theory item analysis is used to determine item difficulty. The difficulty of an item is the proportion or percentage of individuals who answer the item correctly. Classical item analysis also determines discriminating power. One of the purposes of item analysis is to determine which items best measure the construct or content domain that the measure aims to assess. The discriminating power of an item can be determined by means of the discrimination index and item-total correlations. To compute the discrimination index, the method of extreme groups is used.
Performance on an item is compared between the upper 25% of the sample and the bottom 25% of the sample. If the item is a good discriminator, more people in the upper group will answer the item correctly. To compute the discrimination index, the upper and lower 25% of the test takers need to be identified. Then the percentage of test takers in the upper and lower groups who passed the item is computed and the information is substituted into the following formula:

D = U/nU − L/nL

where U = number of people in the upper group who answered the item correctly, nU = number of people in the upper group, L = number of people in the lower group who answered the item correctly, and nL = number of people in the lower group. An item-total correlation can also be computed between the score on an item and performance on the total measure. There is a direct relationship between item difficulty and item discrimination: the difficulty level of an item restricts its discriminatory power. Item response theory (IRT) item analysis can determine an item's difficulty level and discriminatory power more accurately. In IRT the item parameters are sample-invariant, so the parameters are not dependent on the ability level of the test takers responding to the item.
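Before moving on to IRT, here is a minimal Python sketch of the classical statistics described above: item difficulty, the extreme-groups discrimination index and item-total correlations. The simulated response matrix, function names and the 25% group split are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def item_difficulty(responses):
    """Item difficulty (p-value): proportion answering each item correctly."""
    return responses.mean(axis=0)

def discrimination_index(responses, fraction=0.25):
    """Extreme-groups discrimination index D = U/nU - L/nL for each item."""
    total = responses.sum(axis=1)              # total score per test taker
    order = np.argsort(total)                  # rank test takers by total score
    n = int(len(total) * fraction)             # size of the upper/lower groups
    lower, upper = responses[order[:n]], responses[order[-n:]]
    return upper.mean(axis=0) - lower.mean(axis=0)

def item_total_correlations(responses):
    """Correlation between each item score and the total test score."""
    total = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], total)[0, 1]
                     for j in range(responses.shape[1])])

# Simulated data: 200 test takers, 10 dichotomous (0/1) items with evenly
# spread difficulties; real item analysis would use observed responses.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
p_correct = 1 / (1 + np.exp(-(ability - np.linspace(-1, 1, 10))))
responses = (rng.random((200, 10)) < p_correct).astype(int)

print("difficulty:    ", item_difficulty(responses).round(2))
print("discrimination:", discrimination_index(responses).round(2))
print("item-total r:  ", item_total_correlations(responses).round(2))
```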
In IRT an item response curve is constructed by plotting the proportions of test takers who gave the correct response against estimates of their true standing on a latent trait (Keen, 2002). The estimates are derived through logistic equations, with the specific equations varying according to the estimation procedure used. In the two-parameter model, an item response curve is constructed by estimating both the difficulty and the discrimination parameters. This model is more popular since it allows the difficulty level as well as the discriminating power of two items to be compared.
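As an illustration, the following minimal sketch evaluates the two-parameter logistic (2PL) item response function; the item parameters are invented purely to show how two items can be compared on both difficulty (b) and discrimination (a).

```python
import numpy as np

def two_pl(theta, a, b):
    """2PL model: probability of a correct response given latent trait
    theta, discrimination parameter a and difficulty parameter b."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)  # a range of latent-trait estimates

# Hypothetical items: item 2 is harder (larger b) and more
# discriminating (larger a, i.e. a steeper curve) than item 1.
for name, a, b in [("item 1", 0.8, -0.5), ("item 2", 1.6, 0.5)]:
    print(name, two_pl(theta, a, b).round(2))
```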
In a multicultural country it is useful to investigate item bias in the early stages of developing a measure. With IRT, differential item functioning (DIF) can be determined in a more specific way to identify items that may be biased or unfair for certain test-takers. Applying criteria for item selection is the last step in this phase. Classical test theory, IRT and DIF analyses can be used to determine the item parameters on the basis of which items should be included in the final version of the measure or discarded.
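DIF can be computed in several ways; one widely used option is the Mantel-Haenszel procedure, sketched below under simplifying assumptions (test takers are stratified on total score, and all data and names are illustrative).

```python
import numpy as np

def mantel_haenszel_odds_ratio(item, group, total):
    """Mantel-Haenszel common odds ratio for one studied item.

    item:  0/1 responses to the studied item
    group: 0 = reference group, 1 = focal group
    total: total test score, used to stratify test takers of similar ability
    A value near 1 suggests little DIF; values far from 1 flag the item.
    """
    num = den = 0.0
    for s in np.unique(total):                      # one 2x2 table per stratum
        m = total == s
        A = np.sum(m & (group == 0) & (item == 1))  # reference, correct
        B = np.sum(m & (group == 0) & (item == 0))  # reference, incorrect
        C = np.sum(m & (group == 1) & (item == 1))  # focal, correct
        D = np.sum(m & (group == 1) & (item == 0))  # focal, incorrect
        T = A + B + C + D
        if T > 0:
            num += A * D / T
            den += B * C / T
    return num / den if den else float("nan")

# Illustrative data only: 400 test takers, two groups, total scores 0-10.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 400)
total = rng.integers(0, 11, 400)
item = rng.integers(0, 2, 400)
print(round(mantel_haenszel_odds_ratio(item, group, total), 2))
```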
Item analysis statistics shouldn't simply replace the skill and expertise of the test developer, as performance on items is affected by many variables. The test developer may feel, on closer inspection, that an item with poor statistics should be included, or that an item with good statistics should be excluded, for logical or theoretical reasons. Revising and standardizing the final version of the measure: After collecting quantitative and qualitative information on the items and the experimental version, the fifth phase focuses on revising the items and the test, and administering the final version to a large sample for standardization purposes.
Step one is revising the items and the test. Items that seemed problematic during the item analysis phase need to be considered, and a decision needs to be made for each one regarding whether it should be discarded or revised. When items are revised, they should go through the same qualitative review process and experimental try-outs again. Selecting items for the final version is step two in this phase. At this point there is a pool of items that has been reviewed and on which empirical information regarding item difficulty, discrimination and bias has been obtained.
On the basis of this information, items are now selected to go into the final measure. After selecting the items, the existing database can be used to check the reliability and validity coefficients of the final measure, to see whether they appear acceptable. The third step consists of refining the administration instructions and scoring procedures. According to the feedback and experience gained during the pre-testing, the administration and scoring instructions might need to be modified.
Administering the final version is the final step, where the final version is administered to a large, representative group of individuals for the purposes of establishing the psychometric properties (validity and reliability) as well as the norms. Technical evaluation and establishing norms: Firstly, validity and reliability need to be evaluated. The psychometric properties of the measure need to be established, and various types of validity and reliability coefficients can be computed, depending on the nature and purpose of the measure.
Reliability is closely tied to measurement, the process of assigning numbers to objects according to clearly specified rules. This measurement process entails that we have a clear conception of at least three aspects: what the entity is that we want to measure, what exactly the nature of the measure is that we want to use, and how the rules for measuring the object are applied. If we want to obtain reliable measures of these objects, we therefore have to apply these three aspects in a consistent manner.
There are different types of reliability, namely test-retest reliability, alternate-form reliability, split-half reliability, inter-item consistency, inter-scorer (rater) reliability and intra-scorer (rater) reliability. The validity of a measure concerns what the test measures and how well it does so. Validity is not a fixed property of a measure; it is always the interaction of both the purpose of the instrument and the sample. Types of validation procedures include content-description, construct-identification and criterion-prediction procedures.
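As an illustration of how two of these reliability coefficients might be computed, here is a minimal sketch using simulated 0/1 item scores; the data and function names are assumptions for demonstration only.

```python
import numpy as np

def cronbach_alpha(scores):
    """Inter-item consistency (Cronbach's alpha) for an items matrix."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

def split_half_reliability(scores):
    """Odd-even split-half correlation, stepped up with Spearman-Brown."""
    odd_half = scores[:, 0::2].sum(axis=1)
    even_half = scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r / (1 + r)

# Simulated data: 150 test takers answering 12 dichotomous items.
rng = np.random.default_rng(2)
ability = rng.normal(size=(150, 1))
scores = (rng.random((150, 12)) < 1 / (1 + np.exp(-ability))).astype(int)

print("Cronbach's alpha:", round(cronbach_alpha(scores), 2))
print("split-half r:    ", round(split_half_reliability(scores), 2))
```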
Next, norms should be established, or performance standards or cut-scores set. If a norm-referenced measure is developed, appropriate norms need to be established. This is the final step in standardizing the measure. An individual's test score has little to no meaning on its own; it needs to be compared to that of a similar group of people so that the individual's score can be interpreted more meaningfully. If a criterion-referenced measure is used, cut-scores or performance standards need to be set to interpret test performance and guide decision-making.
When a measure is norm-referenced, each test taker's performance is interpreted with reference to a relevant standardization sample or norm group. In organizational settings the norm group relates to two sub-groups, namely the applicant pool and the incumbent population. The choice of such a norm group has to be representative of both the applicant pool and the incumbent population, as well as appropriate for the position for which the assessment is conducted. The similarity between the norm group on the one hand and the applicant and incumbent groups on the other is established by means of comparative aspects such as gender and age.
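A minimal sketch of how raw scores might be converted to standard scores against such a norm group follows; the norm-group mean, standard deviation and score scale are invented for illustration.

```python
import numpy as np

def standard_scores(raw, norm_mean, norm_sd):
    """Convert raw scores to z-scores and to deviation-style standard
    scores (mean 100, SD 15) relative to the norm group."""
    z = (raw - norm_mean) / norm_sd
    return z, 100 + 15 * z

# Hypothetical statistics established on the standardization sample.
norm_mean, norm_sd = 42.0, 8.0

applicant_raw = np.array([38, 42, 55])
z, standard = standard_scores(applicant_raw, norm_mean, norm_sd)
print("z-scores:       ", z.round(2))
print("standard scores:", standard.round(1))
```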
Standardized or normal scores are calculated for each test, as sketched above. New applicants' test scores can now be compared to these newly established standard scores. Publishing and ongoing refinement: Before publishing a measure, a test manual must be compiled and the measure should be submitted for classification. Once a measure is published and ongoing research into its efficacy and psychometric properties is conducted, a revised edition of the manual should be published to provide updated information to assessment practitioners.
When compiling the test manual, it should specify the purpose of the measure, indicate to whom the measure can be administered and provide practical information. It should specify the administration and scoring instructions, and the test development process that was followed should be outlined in detail. Detailed information on the types of validity and reliability evidence established should be provided, as well as how this was done and what the findings were. Information about the cultural appropriateness of the measure should be provided, together with the extent to which test and item bias have been investigated.
The manual should also provide detailed information about when and how norms were established and how norm groups were selected. It should give appropriate information about how local norms and cut-off scores could be established or augmented by assessment practitioners to enhance the criterion-related or predictive validity of the measure, and indicate how performance on the measure should be interpreted. The measure should be submitted to the Psychometrics Committee of the Professional Board for Psychology so that it can be determined whether the measure should be classified as a psychological measure or not.
The measure can now be published and marketed. A clear distinction should be made between marketing and promotional material on the one hand and the test manual on the other. The test manual should contain factual information and not act as a selling device that tries to put the measure in a favorable light. The measure should be marketed to an appropriate target market. Standards, such as training for test users and follow-up activities, should be set for those who purchase and use the measures. Lastly, the measure should undergo ongoing revision and refinement. If content dates quickly, more frequent revisions may be necessary.
The popularity of the measure can also mean that it should be revised more often. Usually developers wait until a substantial amount of information has been gathered regarding how the measure needs to be revised before they undertake a full-scale revision and re-norming process. The adaptation of measures must be viewed within the overall context of the use of assessment in a cross-cultural setting, and should not focus narrowly on the translation process alone. Besides the technical methods, other aspects of the adaptation process include test administration, the item formats used and the time limits of the measure.
Administration. The validity of a measure can be seriously compromised if there are any communication problems between the assessment practitioner and the test-takers. Practitioners should be familiar with the culture, language and dialect of the test-takers. They should have adequate administration skills and possess some measurement expertise.
Item format. This refers to the type of items used in the measure. In a multicultural context, it cannot be assumed that all test-takers will be equally familiar with the specific item format in the measure. Include a balance of different item formats in the measure, or include practice items that allow test-takers some opportunity to familiarize themselves with an unfamiliar format.
Time limits. The concept of speed as a function of intellectual ability is a common foundation of many measures. In other cultures, answering a question quickly and blurting out a response is often regarded as rude or impolite; there, intellectual ability is associated with thoughtfulness and careful consideration of one's response. In such cultures, speed is therefore not seen to be a significant factor in cognitive ability.
Measures that have time restrictions can put test-takers from different cultures at a disadvantage. Minimize test speed as a factor when assessing test-takers by ensuring that there is adequate time for completion of the measure. Designs for adapting measures. Equivalence in cross-cultural comparison: This is the central issue in cross-cultural comparative research. For measures to be equivalent, individuals with the same or similar standing on a construct should obtain the same or similar scores on the different language versions of the items or measure.
Measures have to be equivalent if comparisons are made between individuals belonging to different subgroups. To ensure that measures are equivalent, they are adapted using judgmental and/or statistical designs. It must be noted that even if scores can be compared, it cannot simply be assumed that the measure is free of bias. Judgmental designs for adapting measures: These are based on a decision by an individual, or a group of individuals, on the degree to which the two measures are similar. Common designs to achieve this are forward-translation and back-translation.
Forward-translation: Here the source version of a measure, which is referred to as the original-language source, is translated into the target language. This translated measure goes through an experimental phase in which test-takers answer the items and judges decide whether the responses reflect a reasonable representation of the test items. In a more common variation of this design, a group of bilingual experts compares the source and target versions of the measure to determine whether the two versions are equivalent.
Back-translation: The original measure is first translated into the target language by one set of translators, and then translated back into the original language by a different set of translators. Equivalence is assessed by having source-language judges check for errors between the original and back-translated versions of the measure. The main advantage of this design is that it allows researchers who are not familiar with the target language to gain some insight into the quality of the translation. This insight is, however, dependent on the characteristics of the participants as well as on the version of the translated instrument.
Bilingual test-takers: Both the source and target versions of the measure are administered to test-takers who speak both the source and target languages, before the two sets of scores are compared. Since the same test-takers take both versions of the measure, differences in the abilities of test-takers that could confound the evaluation of translation equivalence are controlled. Source and target language monolinguals: Here source-language monolinguals take the source version and target-language monolinguals take the target version of a measure.
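To illustrate how equivalence might be checked statistically in the bilingual design, the following sketch correlates the same (simulated) test-takers' scores on the two language versions; all numbers are invented for demonstration.

```python
import numpy as np

# Simulated scores for 60 bilingual test-takers on the source- and
# target-language versions of the same measure.
rng = np.random.default_rng(3)
true_standing = rng.normal(50, 10, 60)           # shared latent standing
source_scores = true_standing + rng.normal(0, 3, 60)
target_scores = true_standing + rng.normal(0, 3, 60)

# A high correlation and a near-zero mean difference would support
# translation equivalence; systematic differences would flag problems.
r = np.corrcoef(source_scores, target_scores)[0, 1]
mean_diff = (source_scores - target_scores).mean()
print(f"correlation: {r:.2f}, mean difference: {mean_diff:.2f}")
```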