Human Resources

Development Of Rater Training Programs

Hire Mubeena M. , a brainy Consultant from Dubai, United Arab Emirates | Expertbase

By Mubeena M.

expertbase logo Articles by Mubeena M. Human Resources Development Of Rater Training Programs

Development Of Rater Training Programs

329 Claps
6207 Words
32 min

How organizations may train their raters to decrease biases, increase accuracy of evaluations, increase behavioral accuracy to improve observational skills, and finally, increase rater confidence.

Performance appraisals are utilized by organizations for a variety of reasons. For example, some organizations may use the performance evaluation process for developmental purposes, to assist employees enhance their careers by obtaining appropriate training. A second reason for utilizing performance appraisals may be for administrative decisions, such as evaluating employees for promotions, demotions, merit pay increases, etc. and finally performance appraisal may be used for providing simple feedback to employees.

Studies have found that employees are constantly seeking feedback from their employers on how they perform on the job. A component of the performance appraisal process is training raters to increase the accuracy of providing accurate feedback to employees. Therefore, the following areas will be discussed in more detail on how organizations may train their raters to decrease biases, increase accuracy of evaluations, increase behavioral accuracy to improve observational skills, and finally, increase rater confidence. Specifically, the five rater training programs that will be described based on past research studies are: Rater Error Training (RET), Frame-of-Reference Training (FOR), Rater Variability Training (RVT), Behavioral Observation Training (BOT) and Self-Leadership Training (SLT).

Rater Error Training

RET is based on the supposition that raters possess certain biases that decrease the accuracy of the rating. Effective rater training should improve the accuracy of the ratings by decreasing common “rater biases” or “rater errors.” There are five main types of errors made when evaluating others. The first type of rater error is known as distributional errors. Frequently raters fail to discriminate between ratees.

Distributional errors include the following errors: severity error, leniency error and the central tendency error. The severity error- occurs when a rater tends to give all ratees poor ratings. The leniency error- takes place when the rater has a propensity to rate all the ratees well; and the central tendency error- has most likely occurred if the rater gives all the ratees close to or the same score. The leniency error occurs most often in performance appraisals.

There are many reasons behind this, however most notably, performance appraisals are used for raises and managers are often apprehensive to give negative feedback. The second main type of rater bias is known as the halo effect- the halo effect occurs when a rater fails to distinguish between the different performance dimensions. Usually during a performance appraisal an employee is being rated across many performance dimensions.

For example, if an employee does well on one performance dimension such as customer service he or she will likely be rated high on another performance dimensions such as leadership skills. The third type of error is known as the similar-to-me effect- which is the tendency of a rater to judge more favorably those whom the rater perceives as similar to themselves. The contrast effect- is the fourth main type of error is the tendency for a rater to evaluate individuals relative to others rather then in relation to the job requirements. The fifth error is what is known as the first-impression error- which is an initial judgment about an employee that ignores (or perceptually distorts) subsequent information so as to support the initial impression.

Over the years a number of strategies have been developed to control for these errors/biases. Employee comparison methods include rank order, paired comparison, and forced distribution. The advantages of these methods are that they prevent distributional errors. However, the halo effect may still occur. The disadvantage in utilizing these comparison methods are that they are cumbersome and time consuming and utilize merely ordinal measurements.

Other strategies for controlling errors/biases are known as behavioral scales. All behavioral scales are based on critical incidents. Examples of behavioral scales are Behaviorally Anchored Rating Scales (BARS) and Mixed-Standard Scale. The advantages in employing theses types of scales are that they overcome most rating errors associated with graphic rating scales. Additionally, they provide internal measurement. The disadvantages in utilizing these types of scales are that they are costly and time consuming and are not at all useful if the criteria changes frequently.

Another technique to prevent error in performance evaluations is to train the raters. Most RET programs involve instruction on the different types of errors that are possible. Sensitivity training to the different types of errors/biases that may occur can in itself help reduce the possibility that raters will commit these errors. Training helps evaluators recognize and avoid the five most common errors in performance evaluations. “Training to reduce rater error is most effective when participants engage actively in the process, are given the opportunity to apply the principals to actual (or simulated) situations, and receive feedback on their performance. Education on the importance of an accurate appraisal process for executives also is needed, given that accurate appraisals increase performance and prevents lawsuits (Latham & Wexley).”

Law and Sherman (in press) found that, “In general, the use of trained raters is associated with improved rating quality. While training raters leads to reductions in common rating errors, the question still remains as to whether trained raters show adequate agreement in their ratings of the same stimulus.” A study by Garman (1991) evaluated inter-judge agreement and relationships between performance categories and final ratings. Garman (1991) found that “some type of adjudicator training should be developed to ensure that adjudicators have a common understanding of the terms, the categories, and their use in arriving at the final ratings” (p. 23).

There have been many strategies offered in order to control for and reduce rater error. RET is the most widely accepted form of reducing errors and biases. As shown here research suggests that training raters is the best way to ensure that your performance appraisal results are accurate assessments of the true performance of an employee.

Frame-of-Reference Training

FOR training is one of the latest attempts at increasing the accuracy of performance ratings. It is largely an outcome of the impression that traditional forms of rater training, such as RET, have not been successful at increasing rater accuracy even though it minimizes scaling rater errors (Stamoulis & Hauenstein, 1993; Hedge & Kavanagh, 1988; Woehr & Huffcut, 1994). Thus, merely controlling for errors of leniency, halo, and central tendency is not enough to ensure increased accuracy. Defining accuracy remains the key question and has determined subsequent research on FOR training.

The premise behind FOR training is that raters be trained to maintain specific standards of performance across job dimensions with their corresponding levels. They must have a reference point so that there would be a match between the rater’s scores and the ratees’ true scores. Instead of relying on the process of rating, which is more the focus of RET, FOR training has a more content-oriented approach. Traditional rater training methods have not primarily considered cognitive elements of how rater’s generate ratee impressions. FOR training directs raters to alter and think about their own implicit theories of performance.

Research on FOR training has significantly supported its use for the purposes of increasing rater accuracy (Stamoulis & Hauenstein, 1993; Hedge & Kavanagh, 1988; Woehr & Huffcut, 1994). This method usually involves the following steps: (1) participants were told that performance should be viewed as consisting of multiple dimensions, (2) participants were told to evaluate ratee performance on separate and specific job performance dimensions, (3) BARS were used to rate the managers and trainers would review each dimension in terms of specific behaviors, also going into levels of performance for each dimension, (4) participants viewed a set of practice videotape vignettes of manager-subordinate interactions, in which managers always served as ratees, and (5) trainers would then provide feedback to indicate which rating the participant should have given for a particular of levels within a dimension (Stamoulis & Hauenstein, 1993; Hedge & Kavanagh, 1988; Woehr & Huffcut, 1994; Sulsky & Day, 1992; Sulsky & Day, 1994; Woehr, 1994).

The goal of this training was to achieve high inter-rater reliability that reflected a common performance theory as was taught by the trainers. Virtually every study provided training on four main dimensions: motivating employees, developing employees, establishing and maintaining rapport, and resolving conflicts. The ‘resolving conflicts’ dimension items served as distracters and were not addressed during training (Stamoulis & Hauenstein, 1993; Woehr & Huffcut, 1994; Sulsky & Day, 1992; Sulsky & Day, 1994; Woehr, 1994). True scores were determined using ‘experts’ who had extensive training in the relevant performance theory espoused by the trainers.

As mentioned earlier, a key driver of the FOR method is the issue of what constitutes an accurate rating. The research focuses heavily toward the Cronbach’s (1955) component indexes and Borman’s (1977) differential accuracy (BDA) measures (Stamoulis & Hauenstein, 1993; Sulsky & Day, 1992; Sulsky & Day, 1994; Woehr, 1994; Sulsky & Day, 1995). Cronbach’s (1955) indexes are four types of accuracy that are decompositions of overall distance accuracy, i.e. the correlation between ratings and true scores: (1) Elevation (E): Accuracy of the mean rating over all ratees and dimensions (2) Differential Elevation (DE): Accuracy of the mean rating of each ratee across all dimensions (3) Differential Accuracy (DA): Accuracy with which ratees are ranked on a given dimension (4) Stereotype Accuracy (SA): Accuracy of the mean rating of each dimension across all ratees (Woehr, 1994). All studies agree that DA is the most important accuracy measure in terms of FOR training as it assesses whether raters effectively distinguished among individual ratees across individual job dimensions (Stamoulis & Hauenstein, 1993; Sulsky & Day, 1992; Sulsky & Day, 1994; Woehr, 1994; Sulsky & Day, 1995).

The trend in FOR training literature has witnessed a shift from confirming that the method works toward why the method works, specifically outlining the cognitive mechanisms utilized by raters that underlie its success. The major general finding in the research is that FOR-trained participants have demonstrated a significant increase in rating accuracy compared with control-trained participants. However, FOR-trained participants have shown lesser memory recall and a general decrease in accuracy when identifying whether specific behavioral incidents occurred (Sulsky & Day, 1992; Sulsky & Day, 1994; Sulsky & Day, 1995). These contradictory results have attempted to be explained by various cognitive impression-formation and categorization models.

Hedge and Kavanagh (1988) conducted a study to assess rater training on accuracy by enhancing observational or decision-making skills. Results showed that RET resulted in a drop in leniency, but an increase in a halo effect. It was suggested that raters be trained to use better integration strategies to overcome incorrect cognitive schemas (Hedge & Kavanagh, 1988).

Sulsky and Day (1992) posit the Cognitive Categorization theory, where the primary goal of training is to lessen the impact of time on memory decay of observed behaviors. By providing raters with prototypes of performance they are likely to correctly categorize ratees for each job dimension. The lack of behavioral accuracy may have resulted from misrecognizing behaviors that appeared consistent with raters’ performance theory after FOR training (Sulsky & Day, 1992). According to Sulsky and Day (1992) this showed the tendency of FOR-trained raters to possess increased bias toward prototypical characteristics of a category to the extent of admitting to behaviors that in reality never occurred.

Stamoulis and Hauenstein (1993) set out to investigate whether the effectiveness of FOR training was due to its structural similarities with RET. In addition to a normal control group, they had a structure-of-training control group. They found that although FOR-trained raters showed increased dimensional accuracy (SA+DA), they were only marginally better than the structure-of-control group. The RET group showed greater elevation accuracy but all groups showed increased differential elevation. What may have contributed to the RET group’s success is increased variability in the observed ratings leading to greater ratee differentiation (Stamoulis & Hauenstein, 1993).

David Woehr (1994) hypothesized that FOR-trained raters would be better at recalling specific behaviors, contrary to other research hypotheses. Results revealed that FOR-trained raters had more accurate ratings based on the differential elevation and differential accuracy indexes. Interestingly, Woehr (1994) found that FOR-trained participants recalled significantly more behaviors than participants in the control groups. An explanation stems from an elaboration theory rather than an organization process. It is postulated that behaviors of the same trait may well remain independent rather than linked together as clusters (Woehr, 1994).

Sulsky and Day (1994) revisited the Cognitive Categorization theory to assess whether memory loss for the performance theory would adversely affect rating accuracy. They found that even under conditions of a time delay between observing behaviors and making ratings, FOR-trained raters made more accurate ratings than controls but also showed lower behavioral accuracy. They attribute rating accuracy to a cognitive process of recalling impressions and effectively using the performance theory instead of relying on memorizing specific behaviors (Sulsky & Day, 1994). Overall impressions based on a theory are more chronically accessible than individual behaviors.

A general framework for evaluating the effects of rater training was presented by Woehr and Huffcutt (1994). They assessed individual and combined effects of RET, Performance Dimension Training, FOR training, and BOT. Four dependent measures were halo, leniency, rating accuracy, and observational accuracy. Results showed that each training method led to an increase in all the dependent measures with the exception of RET. RET showed a decrease in observational accuracy and increased leniency. One important finding was that the effects of RET on rating accuracy seemed to be moderated by the nature of the training approach. If the training focused on identifying ‘correct’ and ‘incorrect’ rating distributions, accuracy was decreased. If the training focused on avoiding errors themselves without focusing on distributions, accuracy was increased. FOR training led to the highest increase in accuracy. Combining RET and FOR led to moderate positive effect sizes for decreasing halo (d=.43), leniency (d=.67) and increase in accuracy (d=.52) (Woehr & Huffcutt, 1994).

Sulsky and Day (1995) proposed the idea that as a result of FOR training, raters are better able to organize ratee behavior in memory. This effect was thought to be either enhanced or diminished depending on if performance was presented around individual ratees (‘blocked’), or around unstandardized, unpredictable individual job dimensions (‘mixed’). Organization of information in memory was assessed using the adjusted ratio of clustering (ARC) index, which measures the degree to which participants recall incidents in the same category in succession. Results indicated no Training (FOR or control) X Information Configuration (blocked or mixed) interaction. Results confirmed Borman’s (1977) Differential Accuracy and Cronbach’s (1955) Differential Accuracy as distinguishing factors between FOR and control training. Participants in the blocked condition showed more accuracy than the mixed condition. The implication is that blocked configuration led to effective organization in terms of person-clustering. However, FOR training in the mixed condition was not found to interfere with its beneficial effects. Thus, FOR-trained raters were generally more accurate regardless of how training material was presented (Sulsky & Day, 1995). FOR training encourages stronger trait-behavior clusters resulting in correct impressions for dimensions while observing behavior (Sulsky & Day, 1995).

One of the most recent studies conducted by Schleicher and Day (1998) utilize a multivariate cognitive evaluation model to assess two content-related issues: 1) the extent to which raters agreed with the content of the training, and 2) the content of rater impressions. The basic assumption is that a rater’s theory of performance may not be in line with the theory that the organization espouses. Results indicated that FOR training was associated with increased agreement with an espoused theory, which in turn is associated with accuracy. However, FOR training may be required even for those who have high agreement levels with the espoused performance theory. The reason is that raters tend to base their judgments on individual experiences with ratees (self-referent) instead of relying on the more objective observations of behaviors (target-referent).

The literature in general asks that future researchers focus more on cognitive processes and content-related issues with FOR training. The effect of combined strategies must be further investigated (Woehr & Huffcutt, 1994). Woehr (1994) suggests that a comparison of organization and elaboration models to interpret the effects of FOR training would be helpful. Stamoulis & Hauenstein (1993) want ‘Rater Error Training’ to be changed to ‘Rater Variability Training’ to emphasize the focus of how this method increases variability in ratings to distinguish among ratees.

They state that FOR training is time consuming, and there is no definite understanding of its cognitive mechanisms, it probably leads to less behavioral accuracy, and there are doubts concerning FOR training’s long-term effectiveness (Stamoulis & Hauenstein, 1994). Sulsky and Day (1994) raise a major weak point of FOR training in that it adversely affects the feedback process in a performance appraisal system. Since there is low behavioral recall, managers may not be able to cite specific incidences when giving a subordinate feedback. Schleicher and Day (1998) strongly suggest that the following process issues be examined in the future: the specific purpose of the appraisal, rater goals and motivation, content of the rating format, and specific behavioral anchors used.

FOR training shows promise with its effects on rating accuracy. However, lower behavioral recall presents a drawback when feedback is provided. This fact warrants the method to be used mostly for developmental purposes. For FOR training to be most effective, the performance theory should depend on each organization’s mission and culture.

Rater Variability Training

RVT is a type of performance appraisal training that hopes to increase the accuracy of ratings on measures of performance appraisals. It is a hybrid of FOR training and RET. It has been found that RET training while increasing raters knowledge of rating errors (i.e., halo, leniency) it does not increase the overall accuracy of the ratings (Murphy & Balzer, 1989). On the other hand, FOR training, while it helps to increase the accuracy of ratings it also hinders the rater’s memory of the specific ratee behaviors being appraised (Sulsky & Day, 1992). FOR training tends to be more appropriate for training when performance appraisals are being used for developmental purposes, however, RVT may be more appropriate when appraisals are used for administrative purposes (Hauenstein, 1998).

First proposed by Stamoulis and Hauenstein (1993) RVT attempts to help the rater differentiate between ratees. While attempting to investigate FOR and RET, discussed above, Stamoulis and Hauenstein discovered that the utilization of target scores in performance examples and accuracy feedback on practice ratings in FOR training allowed raters to learn through direct experience on how to use the different rating standards. They also found that in spite of FOR’s perceived superiority, a RET type of training coupled with an opportunity to practice ratings were the best strategies to improve the quality of appraisal data that serve as input for most personnel decisions. They proposed that the assumed general superiority of FOR’s accuracy is not warranted and that a new format for rater-training should be researched. Further, they proposed that training should first attempt to increase the variability of observed ratings to correspond to the variability in actual ratee performance and should provide practice in differentiating among ratees, and suggested the new name Rater Variability Training (Stamoulis & Hauenstein, 1993).

The main difference between the different types of rater-training reviewed above and RVT is the feedback session. In FOR training the trainer in the situation would compare the trainees’ ratings with a target scores, emphasizing the differences between each rater’s ratings and the targets, helping the rater to quantify the targets in relation to their own standards. This would be done for all performance dimensions. In RET training, the feedback session would include the trainer discussing the occurrence of any rating errors by the ratees, after which the raters would be encouraged to increase the variability of their ratings. On the other hand, in RVT training, the feedback session would include the trainer comparing the trainees’ scores with a standard, emphasizing the difference between the each of the ratees compared to the target scores.

Prior to Stamoulis and Hauenstein (1993) stumbling upon the idea that variability of ratings might increase the accuracy of said ratings, Rothstein (1990) noticed that the reliability and variance of ratings in a longitudinal study of first line managers were related. This study looked at inter-rater reliability over time and found that poor observation produced unreliable ratings that lead to low variances. However, Rothstein felt that this may be due to systematic rater errors and suggested additional research looking into the causal linkages between changes in performance over time and systematic rater errors.

Limited research is available on the effectiveness of RVT as compared to FOR and RET training however, two papers were found that specifically discuss and test RVT. The first was completed in 1999 by Hauenstein et al. but it went unpublished. The second paper was written by Hauenstein’s student, Andrea Sinclair. In her discussion of the 1999 Hauenstein et al. study, Sinclair (2000) noted that Hauenstein et al. used performance levels that were readily distinguishable and asked raters to make distinctions between performances that were also easily distinguishable.

However, when there are overt differences in the rating stimuli raters have very little difficulty in differentiating between performances (Sinclair, 2000). Therefore, Sinclair added two dimensions, below average and above average, to the poor, average and good dimensions generally used in research on rater training. Sinclair used a similar format as that of Stamoulis and Hauenstein (1993) in setting up her experiment, but she used three examples of behavior in the pretest measure and five in the posttest. This was done in an attempt to make the post-training more difficult, and to distinguish if it really had an impact on the trainees. The three in the pretest were easily distinguishable, as those used in previous studies, however she added the above average and below average vignettes to the posttest. In addition, to test the idea that RVT is more appropriate for administrative decisions, the training conditions were further broken down into purpose conditions (developmental and administrative).

For all groups ratee three in the posttest represented average performance. What was interesting about Sinclair’s findings was that all groups across purposes of the ratings inaccurately rated ratee three, for the exception of the RVT administrative condition. This was found across almost all types of accuracy identified. These results have implications for the utilization of RVT in the organizational setting, due to the fact that in many organizations appraisals are done for administrative purposes. While it is fairly easy to distinguish between the best and worst performers in a group, those in the middle seem to be the most difficult to rank. The fact that RVT increased the accuracy of ratings for administrative purposes for average employees indicates that this type of training may be the best option for organizations.

The idea of increasing the rater’s ability to differentiate between ratees should be attractive to organizations. In many organizations a manager must differentiate between subordinates when conducting annual reviews. For a manager who can make all ratings for all employees at one time, the task of differentiating between subordinates is a bit easier than in organizations where annual reviews are done on the employee’s anniversary date with the organization. In the former case one is forced to rank the subordinates all at once, but in the latter case the manager must refer back to older reviews for the other employees, hoping that the performance has not changed too drastically to change the overall standing of each. However, in either case it is difficult for any manager to rank their subordinates that lie in the middle (i.e., fairly average employees). Based on the research above, RVT tends to help out in this context, however the research presented above is quite limited and additional research is desperately needed in the use of RVT, especially in field settings.

Behavioral Observation Training

According to Hauenstein (1998), “BOT is designed to improve the detection, perception and recall of performance behaviors.” When conducting performance appraisals it is necessary to ensure that the raters are accurately observing work behaviors and maintaining specific memories of employees’ work behaviors. This plays a significant part in the performance appraisal process. Observation involves noting specific facts and behaviors that affect work performance as well as the end result. Such data is the basis for a performance appraisal (retrieved via worldwide web:

Thus, training on behavioral observation techniques should be imperative in the training of raters. Unfortunately, limited research has been conducted on the topic of BOT in the performance appraisal process, and ways to improve the process. This may or may not suggest that BOT is being conducted by organizations. However, Thorton and Zorich (1980) were able to identify the significance between observation and judgment during the rating process. In their work they focus on a training program that will bare significance on how a rater gathers information on particular behaviors. The two dependent variables measured in the study were recall and recognition of specific behavioral events. In general, they were able to identify a relationship between increased accuracy (in behavioral observation) and performance ratings indicating that the accuracy of the observation process can greatly affect the outcome of the performance appraisal itself.

In a survey conducted on 149 managers, in the manufacturing and service industry it was found that a key competency identified by managers as an effective measure in performance appraisal system included, observational and work sampling skills. This is a strong indicator that any process to improve rater accuracy should include training in objective observation techniques. In addition, raters should be taught how to document behaviors appropriately. This can be done as simply as a “how to” class on record keeping. The training should also include identifying critical performance dimensions, behaviors as well as, the criteria for these dimensions. Inevitably, BOT training should improve recall for the rater and increase the perception of fairness by the ratee (Fink & Longenecker, 1998).

As mentioned previously, there is limited research available on the topic of BOT. The research that does exist suggests that BOT may effectively increase both rating accuracy and observational accuracy indicating that observational accuracy may be just as important as evaluative rating accuracy, particularly during the feedback process (Woehr & Huffcutt, 1994).

Self-Leadership Training

The fifth and final goal in rater training is SLT. The objective of SLT is to improve the self-confidence of the raters in their capability to conduct evaluations. Self-leadership can be defined as “the process of influencing oneself to establish the self-direction and self-motivation needed to perform (Neck, 1996, p. 202).” Neck continued to say that SLT or thought self-leadership can be seen as a process of influencing or leading oneself through the purposeful control of one’s thoughts. Self-leadership is a derivative from Social Cognitive Theory. This theory states that behavior is a function of a triadic reciprocity between the person, the behavior, and the environment. The majority of research on this aspect of management has been derived from the social cognitive literature and related work in self-control. This related research has primarily focused on the related process of self-management.

Self-leadership and self-management are related processes, but they are distinctly different. Self-management was introduced as a specific substitute for self-leadership by drawing upon psychological research to identify specific methods for personal self-control (Manz & Sims, 1980). One of self-management’s primarily concerns is regulating one’s behavior to reduce discrepancies from externally set standards (Neck, 1996). Self-leadership however, goes beyond reducing discrepancies in the standards of one’s behavior. Self-leadership addresses the utility of and the rationale for the standards themselves. The self-leader is often looked upon as the ultimate source of standards that govern his or her behavior (Manz, 1986). Therefore, individuals are seen as capable of monitoring their own actions and determining which actions and consequent outcomes are most desirable (Neck, 1996).

Neck, Stewart, and Manz (1995) suggested a model to explain how thought self-leadership could influence the performance appraisal process (see Appendix). This model uses numerous cognitive constructs, including self-dialogue, mental imagery, affective states, beliefs and assumptions, thought patterns, and perceived self-efficacy. The first construct, rater performance can be conceptualized in numerous ways. One aspect of rater performance is accuracy. Accuracy is measured by comparing an individual’s rating against an established standard. Therefore, accuracy can be seen as a measure of how faithfully the ratings correspond to actual performance. Another aspect of rater performance is perceived fairness. Fairness is important because if employees reject the feedback, resulting in dissatisfaction, if the performance ratings are seen as unfair. Neck, Stewart, and Manz (1995) as a result, suggested that rater performance could be conceptualized as the degree to which a rater is capable to provide accurate ratings and the degree to which the ratings are perceived as fair.

According to the model by Neck, Stewart, and Manz (1995), rater performance leads to rater self-dialogue and rater mental imagery. Self-dialogue, or self-talk is what we covertly tell ourselves. It has been suggested that self-talk could be used as a self-influencing tool to improve the personal effectiveness of employees and managers (Neck, 1996). It has also been suggested that self-generated feedback, a form of self-talk, might improve employee productivity. Raters who bring internal self-statements to a level of awareness, and who rethink and reverbalize these inner dialogues, have the possibility to enhance their performance. Self-talk is predicted to affect both accuracy and perceptions of fairness. Beliefs and assumptions come into play when individuals have dysfunctional thinking. Individuals should recognize and confront their dysfunctional beliefs. Then substitute them with more rational beliefs.

Rater mental imagery is the imaging of successful performance of a task before it is actually completed. The mental visualization of something should enhance the employee’s perception of change because the employee has already experienced the change in his/her mind. Mental imagery has been shown to be an effective tool for improving performance on complex higher order skills such as decision making and strategy formulation, both of which has significant similarities to the performance appraisal process. Also, appraisal accuracy should be directly correlated with the use of mental imagery. However, it was found that mental imagery would improve performance only if raters specifically focused on the performance appraisal process itself (Neck, et al, 1995).

According to the model, the impact of rater self-dialogue on performance is mediated by affective responses. It was suggested that beliefs are related to the type of internal dialogue that is executed, corresponding to a resulting emotional state. Therefore, raters who have functual self-dialogue, can better control their affective state and should as a result, have greater accuracy. In addition, having control of one’s emotions may also reduce confrontation, which in turn, leads to improved perceptions of ratings fairness (Neck et al, 1995).

The next step in the model is the thought patterns of the rater. It has a direct relationship with mental imagery, but self-dialogue is mediated by a corresponding mental state. When these processes are combined they contribute to the creation of constructive thought patterns. Thought patterns have been described as certain ways of thinking about experiences and as habitual ways of thinking (Manz, 1992). Individuals tend to engage in both positive and negative chains of thought that affect emotional and behavioral reactions. These thoughts run in fairly consistently frequent patterns when activated by specific circumstances (Neck, 1996).

The final step in the comprehensive model is the perceived self-efficacy of the rater. This suggests that the type of thought pattern a person enacts, directly affects perceived self-efficacy. This means that the thought patterns of raters determine the strength of conviction they have. This leads raters to they can successfully conduct ratings and appraisal interviews. Self-efficacy expectations should be enhanced if raters enact constructive thought patterns. The self-efficacy step also suggests that the rater’s self-efficacy perceptions directly influence rater performance.

The foundation for thought self-leadership is consistent with the assumptions of organizational change theory. Therefore, the basis for thought self-leadership is not purely psychological or sociological. It is a combination of both beliefs (Neck, 1996). Thought self-leadership has been shown to be a mechanism to enhance team performance, work spirituality, employee effectiveness, and performance appraisal outcomes (Neck, Stewart, & Manz, 1995).

Smither (1998) noted that Bernardin and Buckley (1981), who are widely popular for the introduction of FOR training concepts, discussed improving rater confidence as a means to increase efficiency of performance appraisals. Bandura (1977, as cited by Bernardin & Buckley, 1981) stated four main sources of information that encourage personal efficacy. These methods could be used to establish and strengthen efficacy expectations in performance appraisal. The first and most influential source is performance accomplishment. An example of this is the failure of a supervisor to make fair but negative performance appraisals early in his or her career. This may explain to a large extent, low efficacy expectations. Another source is vicarious experience. This is when one person observes someone else coping with problems, and eventually succeeding. The third source of information for efficacy expectations is verbal persuasion. This is when people are coached into believing that they can cope. The final source is emotional arousal, which can change efficacy expectations in intimidating situations. When there are possible repercussions to a negative rating, a rater’s ability or motivation to give accurate ratings could be reduced.


In conclusion, Hauenstein (1998) suggests to increase rating accuracy the raters training program should include, performance dimensions training, practice ratings and feedback on practice ratings. It is important to note at this point that training raters to provide feedback to ratees is another area that is limited in research however, is a major focus of most organizations. Finally, Hauenstein (1998) recommends that it is desirable to promote greater utilization of rater accuracy training and behavioral observations programs and to ensure that rater training programs are consistent with the rating format and purpose of the appraisal. It is extremely important that organizations understand the reasons why they are utilizing performance appraisals and base the structure of the raters training program on those reasons. With this in mind hopefully it will help to improve accuracy in the ratings.



Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205-212.
Fink, L. & Longenecker, C. (1998). Training as a performance appraisal improvement strategy. Career Development International, 3(6), 243-251.
Garman, B.R., et. al. (1991). Orchestra festival evaluations: Interjudge agreement and relationships between performance categories and final ratings. Research Perspectives in Music Education, (2), 19-24.
Guide to Performance Management. Retrieved from the World Wide Web, April 10, 2003:
Hauenstein, N.M.A. (1998). Training raters to increase the accuracy and usefulness of appraisals. In J. Smither (Ed.), Performance Appraisal: State of the Art in Practice, pp. 404-444. San Francisco: Josey Bass.
Hedge, J.W. & Kavanagh, M.J. (1988). Improving the Accuracy of Performance Evaluations: Comparison of Three Methods of Performance Appraiser Training. Journal of Applied Psychology, 73(1), 68-73.
Latham, G.P., & Wexley, K.N. (1994). Increasing productivity through performance appraisals (2nd ed.). Reading, MA: Addison-Wesley Publishing Company.
Law, J.R., & Sherman, P.J. (in press). Do rates agree? Assessing inter-rater agreement in the evaluation of air crew resource management skills. In Proceedings of the Eighth International Symposium on Aviation Psychology. Columbus, OH: Ohio State University.
Manz, C. C. (1986). Self-leadership: Toward an expanded theory of self-influence processes in organizations. Academy of Management Review, 11, 58-60.
Manz, C. C. (1992). Mastering self-leadership: Empowering yourself for personal excellence. Englewood Cliffs, NJ: Prentice-Hall.
Manz, C. C., & Sims, H. P., Jr. (1980). Self-management as a substitute for leadership: A social learning theory perspective. Academy of Management Review, 5, 361-367.
Murphy, K.R. & Balzer, W.K. (1989). Rater errors and rating accuracy. Journal of Applied Psychology, 74, 619-624.
Neck, C. P. (1996). Thought self-leadership: A regulatory approach towards overcoming resistance to organizational change. International Journal of Organizational Analysis, 4, 202-216.
Neck, C. P., Stewart, G. L., & Manz, C. C. (1995). Thought self-leadership as a framework for enhancing the performance of performance appraisers. The Journal of Applied Behavioral Science, 31, 278-294.
Rothstein, H.R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.
Schleicher, D.J. & Day, D.V. (1998). A Cognitive Evaluation of Frame-of-Reference Rater Training: Content and Process Issues. Organizational Behavior and Human Decision Processes, 73(1), 76-101.
Sinclair, A.L. (2000). Differentiating rater accuracy training programs. Unpublished Master’s Thesis, Virginia Polytechnic Institute and State University, Blacksburg.
Smither, J. W. (1998). Performance Appraisal: State of the Art in Practice. San Francisco, CA: Jossey-Bass.
Stamoulis, D.T. & Hauenstein, N.M.A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78(6), 994-1003.
Sulsky , L.M. & Day, D.V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77(4), 501-510.
Sulsky, L.M. & Day, D.V. (1994) Effects of frame-of-reference training on rater accuracy under alternative time delays. Journal of Applied Psychology, 79(4), 535-543.
Sulsky, L.M. & Day, D.V. (1995). Effects of frame-of-reference training and information configuration on memory organization and rating accuracy. Journal of Applied Psychology, 80(1), 158-167.
Thorton, G. & Zorich, S. (1980). Training to improve observer accuracy. Journal of Applied Psychology, 65(3), 351-354.
Woehr, D.J. (1994). Understanding frame-of-reference training: The impact of training on the recall of performance information. Journal of Applied Psychology, 79(4), 525-534.
Woehr, D.J. & Huffcutt, A.I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189-205.

This Article is authored / contributed by ▸ Mubeena M. who travels from Dubai, United Arab Emirates. Mubeena is available for Professional Consulting Work both Virtually and In-Person. ▸ Enquire Now.

What's your opinion?

Get Fees