Critique of Trait Psychology

The fundamental problem in the psychology of personality is to take account of individual differences, and to determine how they can be conceptualized and described. Allport argued that personality could be best construed in terms of a set of dispositions to behave in a particular way across a wide variety of situations and over extended periods of time. Within the psychometric tradition, the factor analysts have employed a body of sophisticated statistical techniques for the purpose of delineating a relatively small set of dimensions that would encompass these dispositions.

Others have focused their research on specific traits, or on the development of refined instruments for measuring traits. No matter what their emphasis, trait psychologists agree that an individual's social behavior can be accounted for largely by the presence of generalized behavioral dispositions. The basic assumptions of trait psychology are that behavior is fairly consistent across different situations; that personality traits can be abstracted from observing an individual's behavior; and that these traits somehow cause behavioral consistency.

Despite the popularity and longevity of the trait approach to personality, there has always existed some evidence challenging these claims and assumptions. This evidence was widely scattered in the professional literature, and largely ignored (both because it was scattered and because it so contradicted the prevailing Weltanschauung). However, in 1968 a pair of books (Mischel, 1968; Peterson, 1968) independently marshaled the evidence to the contrary, and precipitated a kind of crisis within trait psychology. Mischel has avidly pursued the critique since then in an important series of papers (Mischel, 1969, 1972, 1973a, 1973b, 1977a, 1977b, 1979, 1981). By challenging common beliefs, he has become one of the most controversial figures in psychology; but his arguments have led directly to a progressive reorientation of the whole field of personality. This chapter expands on Mischel's critique.

Problems with the Search for a Universal Scheme for Personality Structure

As noted, a major theme in the work of trait psychologists has been the attempt to discover the universal structure of personality traits. That is, they wish to discover that multidimensional space which best represents individual differences in personality. The search has proceeded apace, employing a battery of highly sophisticated techniques for multivariate analysis. Almost 100 years after Wundt, and almost 50 years after Cattell, Guilford, and Eysenck began their work, we might reasonably expect some degree of consensus to have emerged about such fundamental questions as the number of traits, their names, and their interrelationships. Nevertheless, the structure of personality traits remains as elusive as it ever was.

All of those who have sought the universal structure of personality traits have employed factor analysis, or some variant of the technique. However, as became apparent in Chapter 2, different methods of factor analysis, in the hands of different investigators, have yielded different results. Who is right? The problem is that each of these solutions is acceptable on purely mathematical grounds -- no one made a gross error in collecting the data, or in applying the formulae of factor analysis. However, each solution is quite different in terms of the psychological meaning of the structure obtained.

Consider, for example, the various attempts to reduce Allport and Odbert's (19XX) list of traits to manageable size (for a review see Goldberg, 19XX). Norman was able to reduce it to five traits, Hogan (REF) to six; Wiggins (1979), however, found eight traits within the interpersonal domain alone. Similar problems crop up when the factor analysts shift from the trait adjectives of ordinary language to specific social behaviors, as assessed (for example) by self-report questionnaires. As noted, Guilford (19XX) extracted 42 hormetic and temperamental traits; Cattell (19XX), 30 traits in the temperamental and dynamic domains; and Eysenck (19XX), three superordinate traits consisting of an unspecified number of subordinate dimensions.

In many ways, Cattell's approach seems the best for this kind of material. His emphasis on primary traits preserves the richness of personality, and oblique factors impose the least structure on nature. But as it happens, his structure of primary traits has been difficult for others to replicate. For example, Soueif et al. (1969) administered the 16PF, Cattell's own questionnaire tapping his best-documented traits, to a representative sample of subjects. They obtained 10 primary factors, with poor correspondence between these and the ones obtained by Cattell. The only apparent difference between Soueif's study and Cattell's is the population sampled: the former was English, the latter American. If minor changes in population result in big changes in structure, then that structure can hardly be called universal.

Moreover, different data bases lead to different conclusions about the structure of personality. Continuing our example, consider Cattell's analyses of temperamental traits. As noted, factor analyses of L-data, ratings based on observations of individuals in real-world life situations, gave 15 factors. However, parallel analyses of Q-data (self-reports of behavior collected on questionnaires) gave these 15 factors (more or less) plus 8 more that were unique to Q-data. Then again, factor analyses of T-data yielded 18 factors, none of which were common to those derived from L- or Q-data. The structure of personality apparently depends on the kind of data that is submitted to factor analysis: this, too, is inconsistent with the notion that factor analysis can discover a universal structure of personality.

Even when the same population is being tested, the same type of data collected, and the same type of analysis applied, still the structure of personality varies from study to study. An important demonstration of this was provided by Fiske (1973), who analyzed the relationships among various human social motives (Murray, 1938) as assessed by objective self-report questionnaires. Three such questionnaires have been published: the Edwards Personal Preference Schedule (EPPS; Edwards, 19XX); the Adjective Check List (ACL; Gough, 19XX); and the Personality Research Form (PRF; Jackson, 19XX). It is also possible to measure these needs simply by asking subjects to rate themselves on Murray's dimensions, or by asking their acquaintances to rate them. Fiske found four studies in which at least two of these five methods (EPPS, ACL, PRF, self-ratings, and peer-ratings) had been administered to the same subjects, and he looked at the obtained relationships among the 12 needs measured in common. Two of the samples received the EPPS, and yielded the same patterns of correlations among the various scales; this was also the case for the two samples which received the PRF. However -- and this is the important point -- the different instruments yielded largely different patterns of correlations among the needs being measured. For example, endurance, as measured by the PRF, correlated .68 and .66 with achievement in two samples which received the PRF, but .07 and .21 with the same dimension measured by the EPPS. The PRF and EPPS, then, yielded different interrelationships among their subscales, despite the fact that they were ostensibly measuring the same dimensions of personality.

In a later analysis of this same data set, Huba and Hamilton (1977; see also Fiske, 1977) qualified Fiske's assertion by showing that the ACL, EPPS, and PRF had highly similar factor structures. There were three to four factors in each instrument, and three of these -- achievement motivation, dominance or extraversion, and caring for others -- were found in each separate inventory. As was the case with the studies of the multidimensional structure of personality discussed in Chapter 2, the interrelationships among traits are clearest and most stable at the level of secondary and higher-order dimensions. This finding of relative internal stability at a more abstract level, however, does not obviate the problem of establishing a stable constellation of empirical relationships with external criteria, forming a proper nomological net around the various individual constructs.

Fiske's analysis represents an extension of the logic of the multitrait-multimethod matrix (Campbell & Fiske, 1959). If a construct is valid, different methods of measurement should yield comparable results. In this case, however, different methods of measurement yield different results. It seems likely, given Fiske's results, that the empirical correlates of a trait will be substantially determined by the method by which that trait is measured. In the absence of empirical reasons to prefer one method of measurement over another (Ashton & Goldberg, 1972; Hase & Goldberg, 1967; Jackson, 1975), it seems unlikely that the questionnaire methodology will ever converge on such correlates.

Empirical Problems with Research on Specific Traits

In addition to problems associated with the attempt to document a universally applicable structure of personality, there are also problems that beset research pertaining to other assumptions of trait theory. In this discussion, we ignore problems with research on specific traits, such as extraversion or social responsibility, and focus only on problems with the trait paradigm as a whole.

The Accuracy of Self-Reports

Trait-based assessment appears to depend on the assumption that questionnaires elicit accurate self-reports of past behavior. However, there are reasons for thinking that this assumption is unwarranted. One such reason has already been discussed under the rubric of the realism-idealism issue. As we have already seen, the judgments of "objective" observers concerning a target's behavior may be contaminated by preconceptions concerning the relationships among various aspects of behavior. There is no reason to think that people are immune from these biases when making reports concerning their own behavior. Put simply, an individual's self-reports, insofar as they represent reconstructions from memory, may reflect what he or she thinks he or she did rather than what he or she actually did. In addition, a number of factors, including the person's intelligence, educational level, and cultural background, may affect his or her understanding of, and response to, questionnaire items. Independent of these problems, a number of factors can affect the degree to which individuals endorse (i.e., say "Yes" to) questionnaire items. Two of these were introduced in Chapter 4: social desirability and acquiescence tendency. In the case of social desirability, subjects tend to endorse items that reflect well upon them, even if the statements are not true. Acquiescence reflects the tendency of subjects to say "yes" to any item, regardless of its substantive content. To the extent that questionnaires are contaminated with social desirability -- or respondents show high need for approval or acquiescence tendency -- these tests will be inaccurate measures of generalized behavioral tendencies.

In most research settings, where the anonymity of the subject is preserved and there are no lasting consequences of his or her behavior or the investigator's evaluation of his or her response, it is arguable that subjects reflect honestly on their past behavior and experience. Of course, candor does not provide any defense against the problem of idealism in personality ratings. Nor does the assumption of honesty relieve the investigator of anxiety concerning the criterion which the person applies in making self-reports. Consider a typical questionnaire, in which a subject has to respond "Yes" or "No", "True" or "False", to statements of the following kind:

I have been disappointed in love.

The instructions typically ask the subject to respond "Yes", or "True", if the statement is generally true, and "No", or "False", if the statement is generally false. But except for certain outrageous statements included on questionnaires, mostly to determine whether the respondents were paying attention to what they were doing (e.g., "Sometimes my ideas turn into insects"), such statements as these could be endorsed by almost everyone. All of us have been disappointed in love at one time or another. Some of us have been disappointed more than others, and indeed that is what the question is all about. The question does not specify what criterion the subject should use in making his/her self-report (such criteria are often specified when observers make judgments of other people, however). How many disappointments must accumulate before the subject should answer "yes" rather than "no"?

The problem is not solved by switching from a dichotomous (two-choice) "True-False" response format to one involving a continuous rating scale. True, the subjects are permitted to make finer discriminations between yes and no, but the criterion problem remains unsolved. If subjects have different criteria for endorsing an item, whether in relative or absolute terms, then a great deal of unreliability or noise has been injected into the data. This unreliability will limit the correlation between the predictor variable, assessed by the questionnaire, and the criterion behavior.
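The way unreliability caps a predictor-criterion correlation can be sketched with the classical attenuation formula from test theory, in which the observed correlation equals the correlation between error-free scores multiplied by the square root of the product of the two reliabilities. The function name and all numerical values below are hypothetical, chosen only for illustration; none of them comes from the studies discussed here.

```python
import math


def attenuated_r(true_r, rel_predictor, rel_criterion):
    """Correlation expected between two imperfectly reliable measures.

    Classical test theory: r_observed = r_true * sqrt(rel_x * rel_y).
    """
    for rel in (rel_predictor, rel_criterion):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(rel_predictor * rel_criterion)


# Hypothetical figures, chosen only for illustration.
TRUE_R = 0.60          # assumed correlation between error-free scores
REL_CRITERION = 0.80   # assumed reliability of the criterion measure

# As the questionnaire becomes noisier, the obtainable correlation shrinks.
for rel_predictor in (0.90, 0.70, 0.50):
    observed = attenuated_r(TRUE_R, rel_predictor, REL_CRITERION)
    print(f"predictor reliability {rel_predictor:.2f} -> observed r = {observed:.2f}")
```

Under these assumptions, every loss of reliability in the questionnaire lowers the correlation that can be observed, even when the underlying relationship is unchanged.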

Suppose that by careful selection of subjects all were of normal or better intelligence, had achieved the same educational level, and came from the same cultural background; and all were motivated to be candid in their self-reports and instructed to employ the same criteria in making their self-ratings. Would that solve our problems and make trait-based assessment interpretable? Not necessarily. The problem would remain that we would not know what the behavior being reported means to the subject.

Consider a garden-variety questionnaire item, such as:

I have very few quarrels with members of my family;

and suppose that a respondent's answer is "no". What does this mean? The assumption underlying trait-based assessment is that this response reflects a tendency on the part of the subject to behave in a particular way. But, of course, this is not necessarily the case. A variety of historical antecedents could have led to this response. Perhaps the respondent is a sullen, hostile person; or perhaps his/her family is wracked by marital discord, the generation gap, and other sources of internal tension. Without knowing which was the case, we have no way of assigning meaning to the response.

Or, take another common question:

I lead a busier life than most people.

Does this respondent behave in this way because he wants to, or are there factors in his social environment that constrain his behavior? Again, we have no way of knowing from the response; and without such interpretation, we have no way of knowing what to make of it. Finally, consider a third questionnaire item:

I find it very relaxing to travel by myself.

Is the person a loner, or does she simply like to get away by herself once in a while? Unless we know her goals and intentions at the time of the act, we have no way of interpreting it.

The problem may be put simply: an instance of behavior, experience, thought, or action, cannot be interpreted in isolation; it acquires meaning by virtue of the context in which it takes place. But questionnaires do not leave room for respondents to explicate the context. Accordingly, the significance of the behavior is unclear. As will be discussed later, vague and fragmentary information is especially vulnerable to transformation according to the observer's hypotheses and expectations. Bits of behavior -- even large bits of behavior -- viewed in isolation may rather easily be fit into the investigator's conceptual system; they can come to mean whatever he or she wants them to mean. As Mischel (1968) put it, "There is a Procrustean tendency to claim that a thing means whatever the investigator wants it to mean" (p. XXX).

Prediction of Criterion Behaviors from Questionnaire Responses

In trait psychology, self-report questionnaires are used as convenient instruments for the assessment of past behavior. The idea is that the respondent's answers to the questions reveal his or her behavioral tendencies, or dispositions. Since these generalized behavioral dispositions are assumed to control future behavior, it follows that questionnaire scores should predict the individual's behavior in some future specific situation. Unfortunately, this does not seem to be the case. Across a wide variety of studies, the typical correlation between questionnaire scores and some trait-relevant criterion observed either in laboratory settings or in the real world is only about .30. This state of affairs is fairly represented by the concurrent validity coefficients of the various types of personality questionnaires, as described in Chapter 3 (Ashton & Goldberg, 1972; Hase & Goldberg, 1967; Jackson, 1975).

Mischel (1968) dubbed this figure the "personality coefficient", implying that by and large it represents the upper limit on prediction of criterion behavior from questionnaires. A decade and a half later, there is little evidence that would contradict this general impression. There are two ways to view this outcome of research. On the one hand, the correlations are usually statistically significant, so that there is some relation between the past behavior assessed with questionnaires and the future behavior as observed directly. However, it should be remembered that the square of the correlation between two variables gives the proportion of variance in one variable that may be accounted for by variance in the other. If we take Mischel's personality coefficient as representative, and there is no reason not to do so, then we have an estimate that personality traits, as measured by standard questionnaires, account for no more than 10% of the variance in actual behavior in specific trait-relevant situations.
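The arithmetic behind this estimate is simply the squaring of the correlation coefficient. The short sketch below works it through for the personality coefficient of .30; the function names are ours, introduced only for illustration.

```python
import math


def variance_explained(r):
    """Proportion of variance in one variable accounted for by the other."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


def r_required(proportion):
    """Size of correlation needed to account for a given share of variance."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("a proportion must lie between 0 and 1")
    return math.sqrt(proportion)


share = variance_explained(0.30)  # Mischel's "personality coefficient"
print(f"r = .30 accounts for {share:.1%} of the variance")
print(f"variance left unexplained: {1 - share:.1%}")
print(f"r needed to account for half the variance: {r_required(0.50):.2f}")
```

The same function can be applied to any validity coefficient cited in this chapter to see how much criterion variance it leaves unexplained.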

As a specific example, consider a study of delay of gratification in school children reported by Funder, Block, and Block (JPSP 1983; for a more detailed analysis, see Kihlstrom, Am. Psychol 1986).  The children's personalities were assessed by their teachers using the California Q-Sort.  These assessments were completed when the children were 3, 4, 7, and 11 years of age.  In addition, at age 4, they were administered formal tests of delay of gratification and resistance to temptation.  The Q-Sort was scored to yield scores on Block's "Big Two" dimensions of personality, ego-control (roughly, introversion-extraversion) and ego-resiliency (roughly, neuroticism), as discussed in Chapter 3.  For purposes of the present discussion, the most relevant analyses concern the correlations between personality as assessed at ages 3 and 4 and actual behavior as tested at age 4.

All of these correlations are lower than Mischel's "personality coefficient" of r = .30.  Of course, those subjects were very young children, and it might be too much to expect them to behave as predicted by underlying personality traits.  So let's take another example, this one of college students.

Consider a study by Chaplin and Goldberg (JPSP 1984) specifically intended to address the issue of the predictability of behavior from personality traits.  A large group of college-student subjects completed self-ratings on eight personality traits, such as friendliness and conscientiousness; these subjects were also rated by their peers, recruited from their dormitories, fraternities, and sororities.  Finally, their behavior in a laboratory "getting acquainted" situation was rated by independent observers who were blind to the subjects' self- and peer-ratings (unfortunately, it wasn't possible to get objective ratings of honesty and activity level in their laboratory situation).  As indicated in the table below (derived from their Table 5), the average correlation between self-rated personality traits and objective behavior was r = .08; the average correlation between peer-rated personality and behavior was roughly the same, r = .09.  Both values lie far below Mischel's personality coefficient, and indicate that personality traits accounted for about 1% of the variance in observed behavior.

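[Table: correlations between rated personality traits and observed behavior, derived from Chaplin and Goldberg's Table 5. Labels: Personality Trait; Activity Level; Emotional Stability; Cultural Sophistication; Average Across Traits.]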

Thus, while there is some lawful relationship between generalized traits and behavior in specific situations, there is a vast amount of behavioral variance left unaccounted for.

Typically, the highest validity coefficients obtained in trait research are between two questionnaires ostensibly measuring the same trait. Given the argument of Campbell and Fiske (1959), this should alert us to the preponderance of method variance over trait variance in these sorts of studies. In fact, many questionnaires seem interchangeable despite the different labels which they carry. They correlate highly with each other, and equally well with specific behaviors. The personality coefficient indicates that questionnaire assessments have poor predictive validity; the interchangeability of various questionnaires means that they have poor discriminant validity. Either condition alone would spell trouble for the assumptions that underlie trait psychology.

Improvements in Technology

As noted earlier, trait psychology fostered important developments in psychometric theory, and the corresponding evolution of a sophisticated technology of personality assessment. As discussed in Chapter 3, there have been essentially four stages in the evolution of self-report questionnaires. The first generation, rational-intuitive instruments like the Woodworth Personal Data Sheet, were successively replaced by factor-analytically derived inventories such as the Guilford-Zimmerman Temperament Survey and Cattell's Sixteen Personality Factor Questionnaire and empirically derived inventories such as the Minnesota Multiphasic Personality Inventory and the California Psychological Inventory. The rational-intuitive questionnaires appeared to have some degree of content validity, but they lacked statistical refinement and demonstrated empirical validity. The factor-analytic strategy ensured statistical refinement but not external validity, while the empirical strategy promised external validity but ignored internal statistical properties; both tended to ignore theoretical considerations in scale construction.

The latest development has been the introduction of a set of inventories developed according to the principles of construct validity, principally by Jackson (refs) -- the Personality Research Form, Jackson Personality Inventory, and similar instruments. One would naturally expect that such developments would lead to corresponding improvements in prediction. However, this has not proved to be the case. As noted in Chapter 3, the four substantive types of scales tested by Hase and Goldberg (1967) (rational, theoretical, factor-analytic, and empirical) showed higher validity coefficients (derivation or cross-validation) than the content-free (stylistic and random) scales, but there were no differences among them. Moreover, the sequential method, which is supposed to combine the virtues of the other four types, does not yield substantially higher validity coefficients than the more primitive methods.

The kinds of global, dispositional assessments provided by personality questionnaires do not have much power to predict specific trait-relevant behaviors. However, it may be unreasonable to ask them to do so. Single-act criteria are, in effect, single-item tests. And single-item tests are inherently unreliable. This unreliability, as suggested in Chapter 3, sets upper limits on the validity that can be achieved. Personality ratings, however, are typically made by judges who know their targets well. Thus their evaluations, while expressed in terms of a single number, more closely approximate the average of a large number of observations. On sheer statistical grounds, then, predictive validity should be very much improved. It seems obvious that the judges providing the peer-ratings have had the opportunity to observe the subject's behavior over many different occasions and in many different situations. If so, then their ratings are aggregate scores, based as it were on many specific items, and thus statistically reliable. However, the enhanced reliability of the criterion does not improve the predictive situation much: the validity coefficients in the studies described here are not very much higher than those characterized by Mischel (1968) in terms of the "personality coefficient". The highest average validity coefficient derived from a self-report behavioral questionnaire, the .51 obtained by Jackson (1975) with three scales of the JPI, accounts for only 26% of the criterion variance; the more frequently obtained Rs, in the high .20's, are not distinguishable from those criticized by Mischel.
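The statistical expectation that aggregate criteria should be more reliable than single acts can be sketched with the Spearman-Brown formula, which gives the reliability of a composite of k parallel observations from the reliability of one. The single-act reliability of .10 used below is a hypothetical value assumed purely for illustration; it is not an estimate from any of the studies cited.

```python
def spearman_brown(single_reliability, k):
    """Reliability of an aggregate of k parallel observations."""
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_reliability / (1 + (k - 1) * single_reliability)


# Hypothetical reliability of a single observed act (an assumption).
SINGLE_ACT_RELIABILITY = 0.10

# Reliability of the aggregate as more observations are pooled.
for k in (1, 5, 20, 50):
    pooled = spearman_brown(SINGLE_ACT_RELIABILITY, k)
    print(f"{k:2d} observations -> reliability {pooled:.2f}")
```

On these assumptions a rating that summarizes many occasions is far more reliable than any single act, which is why the failure of validity coefficients to rise accordingly is so telling.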

This evidence should not be taken to mean that there are no generalized behavioral tendencies. Obviously there are. Past behavior, as recorded in self-reports on personality questionnaires, shows a statistically significant, if low, relationship to future behavior in some laboratory or real-world situation. Moreover, the things that subjects say about themselves are significantly related to what others say about them. This point was made effectively by Funder (ref), who asked subjects to rate themselves on a variety of trait dimensions and then compared the results with peer-ratings.

However, the relationships between self-reported past behavior and future behavior are by no means strong, and people's self-appraisals do not agree totally with what their acquaintances say about them. At the level of measurement, it appears that there is no unambiguous evidence favoring the utility of any high-technology method of personality scale construction, be it based on factor-analytic, empirical, or sequential strategies. Intelligent but naive item-writers are just as good as sophisticated psychometric technique in generating valid personality tests. But no method of eliciting behavioral self-reports gives predictions that are superior to subjects' self-ratings on the trait dimensions themselves. The message of these studies seems clear: if you want to know what an individual can tell you about his or her behavioral dispositions, just ask. Instead of trying to contrive behavioral reports that tap various trait dimensions, we can simply ask subjects to provide self-ratings on the trait dimensions themselves, expressed in ordinary language.

The fact that self-ratings on trait dimensions provide better concurrent validity than self-reports on sophisticated assessments of past behavior again raises the issue of utility as a desirable property of a psychological test. Consider, for example, a study by Holland and Nichols, which attempted to predict pupils' accomplishments in school. Two sets of predictors were available: a broad-gauged personality questionnaire of the usual type, and a checklist of preferred school activities. When the scales of the personality questionnaire were entered into a multiple-regression equation, the resulting validity coefficient (R) was .31; when the scales of the activity preference questionnaire were similarly treated, the validity coefficient was R = .37. In another study by Mischel (1965, 1969) of Peace Corps volunteers, three types of personality assessments were correlated with on-the-job performance as rated by field supervisors: a set of predictors derived from a personal interview; the pooled judgment of the assessment staff; and the individual's own self-prediction of success, gathered under conditions that ensured confidentiality. The first two predictors gave validity coefficients of .13 and .20, respectively; the self-predictions correlated .39 with the criterion. While none of these validity coefficients is large by absolute standards (a correlation of .13 accounts for less than 2% of the variance in the criterion, while the correlation of .39 raises this figure to about 15%), self-ratings are clearly the best. In both cases, and in many other studies, subjects' self-ratings, arguably the cheapest possible personality assessments, proved to be more valid than either self-reports of past behavior or the assessments of objective observers. Personality questionnaires may possess some level of validity, but they typically lack utility, in that they fail to provide better (more accurate) assessments than a cheaper method.

Cross-Situational Consistency

At the heart of the trait concept is the notion of a coherence to behavior which cuts across contexts, so that behavior in one situation predicts behavior in other, similar situations. The wider the variety of situations eliciting equivalent behavior (aggression, friendliness, conscientiousness, etc.), the better the evidence for the kind of behavioral disposition implied by the "trait" concept. While this kind of evidence is of crucial importance to documenting the existence and power of personality traits, unfortunately the relevant evidence has been hard to come by. It is not possible to gain it by means of self-report questionnaires because, for reasons already noted, people may perceive more consistency in their behavior than actually exists. Nor, for the same reasons, is it enough to ask informants who have been acquainted with a subject for a long period of time to provide these assessments. What is required is direct assessment of the individual's behavior in a wide variety of situations which are theoretically relevant to the trait in question. Such studies, involving trained observers and many hours of contact with the subjects, are difficult and expensive to execute, and for that reason there are not many of them. In this section, we review all of those studies available to date which have examined the degree to which individuals behave similarly across different but related situations.

Character: The Hartshorne & May Study (1928-1930)

The first, and most widely cited, study of behavioral consistency was the "Character Education Inquiry" (CEI) of Hartshorne and May (1928; Hartshorne, May, & Waller, 1929; Hartshorne, May, & Shuttleworth, 1930). In this series of experiments, a group of schoolchildren were administered a large number of tests to measure their honesty, altruism, and self-control, and to determine the relationships among these features of personality. In this remarkable study, more than 11,000 children were administered a battery of formal tests of these traits. Not all subjects received all of the tests. Some tests were open, but most were conducted surreptitiously; many were conducted in widely divergent situations as well. Thus, the study provided information on behavioral consistency both within a particular type of situation, and (often) between different situations as well. Begun in 1924 with funding from the Religious Education Association, and continuing for five years, the CEI still ranks as the most extensive observational study of behavioral consistency ever undertaken.

Hartshorne and May were primarily interested in a variety of questions concerning moral character broadly defined, including its relation to age, sex, measured intelligence, socioeconomic status and cultural background, race, national origin, and religious affiliation. They also investigated the effects of moral and religious education on these characteristics, in an anticipation of the kinds of programs developed later by Kohlberg (ref). However, they also observed appreciable inconsistency from one test to another, and among the different situations with respect to the behavior of the children. In what follows, we focus solely on these findings.


In the major portion of their study (Hartshorne & May, 1928), the subjects were given several opportunities to lie, cheat, and steal in a variety of settings: in the classroom, on the athletic field, and in the home.

Altogether, more than two dozen different tests were administered in eight different categories of deceit.

  • Scholastic tests: The subjects were administered tests of scholastic achievement (e.g., arithmetic, vocabulary, grammar, and general information) in their classroom, and then allowed to grade their own papers. Obviously, this situation gave the children the opportunity to cheat by changing their answers; however, unbeknownst to them a carbon-paper duplicate was employed to record their original answers, so that any cheating could be easily detected when the two forms were compared.

  • Speed: In another case, the subjects were administered equivalent forms of a speeded performance test, the first time under strict supervision and the second time under conditions that permitted them to cheat by continuing to work after the allotted time was up. No opportunity was given in the inter-test interval to acquire the skills or knowledge assessed by the test. Therefore, subjects showing large differences in test scores favoring the second test were classified as probable cheaters.

  • Coordination: Some paper-and-pencil tests of hand-eye coordination were intentionally designed to be so difficult that the pupils would not be able to complete them correctly within the allotted time. A child who reported correctly solving the problems, then, would be very rare indeed: therefore those who claimed to have correctly completed the tasks were classified as probable cheaters. In these and similar tests to be described later, careful pretesting established normative standards of performance for the age group which served as subjects in the formal experiments.

  • Puzzles: A similar procedure was followed with some puzzles. In all these cases cheating was noted only if a subject departed so far from the standards (in most cases 3 S.D. above the mean) that the results were unlikely to have been produced by chance.

  • Home: The subjects were given a standard vocabulary test to complete at home, and were admonished not to seek help from any other source. Later, an alternate form was administered in the classroom, during one of the scholastic tests described earlier. As in other cases of this sort, a large difference in scores between the supervised classroom test and the unsupervised test done as homework was taken as indicative of cheating.

  • Contests: In the context of a staged athletic contest, complete with awards for top performance, the children performed broad jumps and chin-ups, and completed tests of hand grip and lung capacity. The subjects were tested privately. In the first phase, the examiner instructed them in the use of the test equipment, surreptitiously recording their scores as they practiced. Then the subjects were left alone in the testing room, instructed to test themselves, and report their scores to the examiner. Any deviation between the score recorded during practice and that reported after the self-testing, favoring the latter, was taken as an index of cheating.

  • Party: In another test, children attending a party played a version of "pin the tail on the donkey", in which a blindfolded child attempts to find a target. The subject was blindfolded, as usual, but the blindfold was such as to permit the child to see the target if s/he should open his or her eyes. While the child could get relatively close to the target without peeking, it was extremely unlikely that s/he would hit it precisely. Children who were more accurate than could reasonably be expected were considered to be probable cheaters. Another occasion for cheating was provided by a relay race. The game was to transfer beans from three boxes to a goal box which was originally empty, one at a time. The game obviously affords the opportunity for the child to transfer more than one bean at a time, in which case the number of beans accumulated in the goal box at the end of the game would exceed the number of times the child ran between the start boxes and the goal box. If this occurred, the child was classified as a cheater.

  • Stealing: Various party games, arithmetic problems, and puzzles were constructed to make use of small coins. After completing the assigned task, the subjects were instructed to return the materials to a storage bin, under circumstances which gave them the opportunity to pocket the coins. Those who did so were classified as stealers.

  • Lying: The children completed a self-report questionnaire about their honesty. In one phase they were asked to indicate whether they had ever lied, or cheated on any tests. Of course, by this time the evidence of actual cheating from the tests described previously had already been gathered, so the measure of lying was derived from a comparison of the children's self-reports with their actual behavior during the experiment. In the other test the children were asked whether they had engaged in certain forms of behavior which, while highly desirable from a social point of view, are unfortunately all too rare. Subjects who responded affirmatively to a large number of these questions were considered to be probable liars.
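The cutoff rule described above for the coordination and puzzle tests (a score counts as probable cheating only if it lies well above the pretest norms, in most cases 3 S.D. above the mean) can be sketched as follows. This is only an illustration: the function name, the use of the sample standard deviation, and all of the scores are assumptions, not data or procedures taken from Hartshorne and May.

```python
from statistics import mean, stdev

def flag_probable_cheaters(norm_scores, test_scores, z_cutoff=3.0):
    """Return the positions of test scores lying more than z_cutoff
    standard deviations above the mean of the normative (pretest) scores."""
    threshold = mean(norm_scores) + z_cutoff * stdev(norm_scores)
    return [i for i, score in enumerate(test_scores) if score > threshold]

# Hypothetical pretest norms gathered under supervised (honest) conditions
norms = [10, 12, 11, 9, 13, 10, 11, 12, 10, 12]
# Hypothetical scores reported by five children under unsupervised conditions
scores = [11, 12, 25, 10, 14]

flagged = flag_probable_cheaters(norms, scores)  # positions of probable cheaters
```

A strict cutoff of this kind errs on the side of missing some cheaters rather than misclassifying honest children who simply performed well.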

The consistency of honest or deceptive behavior may be revealed in a number of analyses.

  • All the school, speed, and coordination tests were administered on two occasions, showing test-retest correlations averaging .68, .57, and .57, respectively.

  • More to the point are the intercorrelations among different tests. These can be arranged in a matrix, with the average intercorrelations among different tests within a category on the diagonal, and the average intercorrelations among tests in different categories in the triangle. The former were quite substantial, averaging .57. The latter were quite low, averaging .20.

  • Four of the tests (school, speed, coordination, and puzzles) were administered in the classroom situation: their average intercorrelation was .26.

  • The remaining tests were administered during athletic contests or at parties: the average correlation of these tests with the classroom exercises was .17.

  • Lying, which was measured in the classroom, correlated .23 with the other classroom tests on average, but only .06 with the non-classroom tests.

Hartshorne and May (1928, p. 385) conclude, "Thus as we progressively change the situation we progressively lower the correlations between the tests".
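The contrast between within-category and between-category consistency amounts to averaging two different sets of correlations. A minimal sketch of that bookkeeping is given below; the test names, category labels, and correlation values are all hypothetical and were chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations

def mean_within_between(corr, category):
    """Average the pairwise correlations separately for pairs of tests in the
    same category (within) and pairs in different categories (between).

    corr     -- dict mapping frozenset({test_a, test_b}) to a correlation
    category -- dict mapping each test name to its category label
    """
    within, between = [], []
    for a, b in combinations(sorted(category), 2):
        r = corr[frozenset((a, b))]
        if category[a] == category[b]:
            within.append(r)
        else:
            between.append(r)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical example: two classroom tests and two party tests
category = {"school_A": "school", "school_B": "school",
            "party_A": "party", "party_B": "party"}
corr = {
    frozenset(("school_A", "school_B")): 0.60,  # same category
    frozenset(("party_A", "party_B")): 0.50,    # same category
    frozenset(("school_A", "party_A")): 0.20,   # different categories
    frozenset(("school_A", "party_B")): 0.15,
    frozenset(("school_B", "party_A")): 0.25,
    frozenset(("school_B", "party_B")): 0.20,
}

within_mean, between_mean = mean_within_between(corr, category)
```

With these made-up values the within-category average is about .55 and the between-category average about .20, the same qualitative pattern as the one reported above.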


Another phase of the research gave the children several opportunities to display qualities of charity, altruism, self-sacrifice, and cooperation. All of these tests were administered in a classroom situation.

  • Efficiency cooperation: The children were given sheets containing a large number of simple arithmetic problems. In one phase prizes were offered to the individual pupils who completed the most problems within an allotted time (self motive). In the other, prizes were offered to the classroom which completed the most problems (group motive). Cooperation was assessed as the ratio of problems completed under the self-motive to those completed under the group motive, with positive scores indicating cooperation.

  • Free choice: After the Efficiency Cooperation test the children were allowed to work on more problems, which would be added to their own or their group's contest score, as they wished. The children were permitted to divide their efforts in any way they desired. This test contains an element of self-sacrifice that was missing in the previous phase.

  • Money vote: Each class was informed that it was likely to win a school-wide learning contest, and asked to determine how it would disburse the monetary award. There were five choices: give it to the pupil in the class who earned the highest score; purchase something for the school; purchase something for the classroom; divide it equally among the class members; or buy something for a needy child or family. Each child ranked his/her choices in the order of preference, and these ranks determined an altruism score.

  • School kits: A pencil case containing pencils, a pencil sharpener, erasers, rulers, etc., was given to each child as a present. Then the children were asked to contribute all or part of their gifts to help make up school kits for poor children in other schools. The articles were assigned values on the basis of cost, and the total of each child's contribution was a score of charity.

  • Envelopes: The children were given envelopes and asked to contribute jokes, puzzles, stories, and pictures that might amuse other youngsters who were sick and confined to hospitals. Their offerings were rated according to the trouble apparently taken to find and prepare the materials.

The intercorrelations of these five tests were all quite low, averaging only r = .20.


In the final phase of the project, the children were confronted with situations that either required them to persist at a difficult or frustrating task, or to inhibit various responses. Most of these tests were conducted in a classroom context, though a few inhibition tests were administered in the party situation described earlier. Some of the tests were run with the subjects in groups, some in individual sessions. Unfortunately, Hartshorne et al. (1929) did not provide figures for these major cross-situational comparisons. We must be satisfied with self-control as displayed in each of the test situations within the classroom context.

The persistence tasks were as follows:

  • Story resistance: The children were read exciting mystery and adventure stories, each of which was cut off just before the climax. They were then given printed copies of the story endings, distorted in such a way as to make them very difficult to read. The children were instructed to make marks on the printed page as they decoded each word, so that it was easy to tell whether the child struggled through to the end, or gave up before completing the story.

  • Cross puzzle: The children were given a puzzle consisting of six irregular pieces which could be put together to form a cross. The puzzle was so difficult, however, that only 2 out of 850 5th- to 8th-graders completed it. A score was computed by the amount of time spent before giving up.

  • Magic square: Each child was given a 3x3 matrix printed on a piece of paper, and instructed to fill each of the nine cells with a digit so that each of the rows and columns added to 15. This is harder than it looks: no child solved the problem in 75 minutes, and scores were derived from the amount of time spent before giving up.

  • Continuous performance: The Efficiency Cooperation test described earlier lasted for 24 minutes. Not surprisingly, there was a tendency for work rate to drop off towards the end of this boring and monotonous task. Two persistence scores, one for the self-motive and one for the group-motive, were derived by subtracting the number of problems completed during the first two minutes of each phase from the number completed in the last two minutes. A score of 0 indicated work at a continuous rate; a negative score indicated a dropoff.
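The persistence score for the continuous performance task is a simple difference: problems completed in the last two minutes minus problems completed in the first two minutes. A short sketch, using hypothetical minute-by-minute counts for the 24-minute task (the function name and the data are illustrative assumptions):

```python
def persistence_score(problems_per_minute):
    """Problems completed in the last two minutes minus problems completed
    in the first two minutes; 0 means a steady rate, negative means a dropoff."""
    first_two = sum(problems_per_minute[:2])
    last_two = sum(problems_per_minute[-2:])
    return last_two - first_two

# Hypothetical minute-by-minute counts over the 24-minute task
steady_child = [5] * 24
fading_child = [6, 6] + [5] * 20 + [3, 4]

steady_score = persistence_score(steady_child)   # 0: continuous work rate
fading_score = persistence_score(fading_child)   # -5: work rate dropped off
```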

The intercorrelations of the group-administered tests are again quite low, averaging only r = .24. For the individually administered tests, the average correlation was .04.

In addition to these tests of persistence, another series assessed the child's ability to inhibit common response tendencies. Actually, the scores were calculated in such a way as to represent a failure of inhibition.

  • Stories: A story was read aloud, omitting the climax. The subjects were instructed to write their own endings. Then they were given the choice of completing a "test" or finding out how the story actually ended. The social demands of the situation clearly favored the former choice, and instances of the latter choice, summed across trials, were counted as failures of inhibition.

  • Safes: A toy safe, with a penny locked inside, was placed on each child's desk in preparation for a later activity. The children were firmly instructed not to touch the safes until the appropriate time. The number of times the child manipulated the combination lock during some paper-and-pencil tests determined the inhibition score.

  • Puzzles: The same procedure was applied to a puzzle, with movements of the pegs and other objects entering into the score.

  • Pictures: A series of arithmetic problems was printed alongside an interesting news story, riddles, puzzles, and the like. The children were instructed to complete the problems and ignore the distractions. Failure to do so was assessed by comparing the number of problems solved in the distracting condition to those solved in a nondistracting baseline test.

The intercorrelations averaged r = .16. Six further tests, similar to these in format, were administered individually, also yielding an average intercorrelation of .16.

Some common children's games were also adapted for test purposes. Each game was repeated for several trials to yield a continuous score.

  • Hopping race: The children were lined up and the number of false starts (i.e., before the "go" signal) was counted as the measure of inhibition failure.

  • Crows and cranes: In this modified game of tag (otherwise known as "fairies and brownies") the children were divided into groups, and the groups were randomly assigned to be "it". The designation was called out at the start of each trial, and the number of times each child ran in the wrong direction (i.e., toward the group who was "it" or away from the goal when being chased) was counted.

  • Funny story: The children sat in a circle and were regaled with jokes and amusing tales under instructions to keep a straight face. Smiles and laughter were counted as failures of inhibition.

  • Simon says: In this familiar game, inhibition failure was counted as the number of times the child obeyed a command that had not been preceded by "Simon says ... ".

  • Noise response: The children were subjected to a variety of strange and startling (though not harmful) whistles, yells, breaking glass, etc., under instructions to respond only to the whistles. False responses were counted as inhibition failures.

The intercorrelations among these five "party" tests of inhibition averaged r = .05.

Character and Consistency

Honesty, service, and self-control (persistence and inhibition) were all construed as components of a single broad trait, moral character. When total scores derived from the individual tests of these qualities were intercorrelated, the resulting average r was .23.

Burton (1940) conducted a factor-analysis of what he deemed to be the "most reliable" measures of character. This analysis yielded a first factor which accounted for 40% of the variance among the tests. This suggests that there may be a general factor of "character" after all -- but of course it obscures the important question of cross-situational consistency.

Negativism: The Reynolds Study (1928)

Reynolds (1928) assessed "negativism" in 229 children aged 2 to 5 years.

First, Reynolds examined teachers' and mothers' ratings of the children. Then Reynolds administered 13 formal tests of negativism, conducted over an 18-minute period.

  • Approach to the experimenter

  • Rapport with experimenter (tested 4 times)

  • Surrender of personal liberty (don't ask)

  • Imitation (tested 2 times)

  • Play period

  • Don't (what?)

  • Blocks in-box and out-of-box

  • Request: Knock down tower

  • Prohibit: Knock down tower

  • Neglect

  • Taking test

The average correlation between teachers' and mothers' ratings of negativism ranged from r = .17 to .35. The average correlation between teachers' ratings and test behavior ranged from r = .26 to .35. The average correlation between mothers' ratings and test behavior ranged from r = .06 to .19. The test-retest reliability was .46.

It's not clear what to make of this study, mostly because it's not clear how firmly established any trait might be expected to be in preschool children.

Extraversion: The Newcomb Study (1929)

Newcomb (1929) studied the trait of extraversion-introversion employing behavioral observations of delinquent boys attending a summer camp. His study complements that of Hartshorne and May in that it employed natural situations, rather than experimentally contrived ones. The camp could accommodate 30 boys at a time, and provided its residents with a wide variety of recreational, learning, and work experiences. The boys were divided into 6 tents, and each group had its own resident counselor. With such an arrangement, the counselors were able to make detailed continuous observations of their charges' behavior in a wide variety of situations. For his study, Newcomb selected 30 situations which seemed related to nine characteristics associated with the dimension of extraversion-introversion: volubility vs. taciturnity; seeking limelight vs. seeking background; large energy output vs. sluggishness; ascendancy vs. submission; interest in environment vs. indifference; impetuousness vs. caution; social forwardness vs. diffidence; ease of distraction vs. perseverance; and preference for group vs. solitude.

  1. Did he show confidence in his own abilities?

  2. Did he take the initiative in organizing games?

  3. Did he submit to criticism or discipline from counselors?

  4. Did he speak before the group at campfire?

  5. How well did he help in getting ready for inspection?

  6. How was he about getting up in the morning?

  7. Did he willingly fall in with what others in his group wanted to do?

  8. Did he speak or laugh aloud when the group was supposed to be quiet?

  9. Did he propose doing something new after spending a short time on a given activity?

  10. Was he eager to be first when the group took turns, as in batting, handling the oars, getting in line for food, etc.?

  11. Did he engage in group misdemeanor?

  12. What percent of the time did he play or work alone?

  13. Did he take the initiative in approaching or speaking to a stranger?

  14. How skillful was he at making outdoor fires?

  15. Did he tell of his own past, or of exploits he had accomplished?

  16. Did he get into scraps with other boys?

  17. Did he attempt hand projects demanding detailed work or skill?

  18. What was his attitude toward serving at the table?

  19. How carefully did he make his bed?

  20. How much did he read?

  21. Was he fond of swimming?

  22. How well did he cooperate in after-meal work?

  23. Did he give loud and spontaneous expressions of delight or disapproval?

  24. Did he get into trouble of a mischievous or adventurous nature?

  25. Did his conversation with counselors confine itself to asking and answering necessary questions?

  26. Did he use exaggerated gestures, antics, or show-off activities?

  27. How did he spend quiet hour?

  28. How long did he continue each activity in the morning?

  29. How much of the time did he talk at the table?

  30. How much of the day did he spend doing things that required little or no action?

Not all of the items recorded by Newcomb may correspond with the reader's intuitive notion of extraversion, but each is related to some aspect of the trait discussed in the theories reviewed by Newcomb:

  • Volubility

  • Seeking Limelight

  • Energy Output

  • Ascendancy

  • Interest in Environment

  • Impetuousness

  • Social Forwardness

  • Ease of Distraction

  • Preference for Group Activities

At the end of each day, the counselor recorded the number of "distinctly remembered incidents" (p. 20) in which each boy in his tent displayed specific trait-relevant behaviors in each situation. Wherever possible, he also provided detailed accounts of the incidents themselves. In addition to these direct behavioral observations, at the end of the month the counselors also made overall impressionistic judgments of each boy's typical response to each of the 30 situations. These ratings were made by each of the six counselors, inasmuch as over the course of the month they had had ample opportunity to get to know boys from other tents as well as their own charges. Data were collected at the camp over two months, on a total of 51 boys (some campers dropped out, and some were in attendance for both months).

Newcomb addressed the consistency issue in several related ways. His principal analysis concerned the problem of "trait consistency", by which he meant the degree to which behaviors selected as representative of extraversion-introversion actually cohered together. The best data for this purpose, of course, is provided by the daily records kept by each counselor of his charges. While not recorded precisely "on the spot", they are not as susceptible to the kinds of biases that beset the global ratings made at the end of the month-long session.

  • Four situations did not yield enough responses from the subjects to permit analysis.

  • For the remaining 26 situations, the average intercorrelation was only .15 for the daily observations (p. 109).

By and large, extraversion in one situation was not accompanied by extraversion in the other relevant situations, and similarly for expressions of introversion.
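The statistic at issue here, the average intercorrelation across situations, is the mean of the Pearson correlations computed for every pair of situations over the same subjects. A minimal sketch is given below. The situation names and scores are hypothetical and are not Newcomb's data; the sketch also assumes every situation shows some variation across subjects, since a constant column would make the correlation undefined.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_intercorrelation(scores_by_situation):
    """Mean Pearson r over all pairs of situations.

    scores_by_situation -- dict mapping a situation name to a list of scores,
    one per subject, with subjects in the same order in every list.
    """
    pairs = combinations(sorted(scores_by_situation), 2)
    rs = [pearson_r(scores_by_situation[a], scores_by_situation[b])
          for a, b in pairs]
    return sum(rs) / len(rs)

# Hypothetical incident counts for five boys in three situations
observations = {
    "talk_at_table":     [3, 1, 4, 2, 5],
    "campfire_speaking": [2, 3, 1, 5, 4],
    "organizing_games":  [1, 4, 2, 3, 5],
}
mean_r = average_intercorrelation(observations)
```

A value near 1 would mean that the boys kept the same rank order from one situation to the next; a value near 0 means that standing in one situation says little about standing in another.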

Newcomb's analysis of the available literature indicated that extraversion-introversion is a very broad dimension of personality, consisting of a number of narrower, but still related, dimensions (this is congruent with Eysenck's analysis, as discussed in Chapter 2). Accordingly, it may be that a higher degree of behavioral consistency would be shown when the target behaviors are considered to be representative of these more circumscribed dimensions, rather than taken individually.

  • For example, the characteristic of volubility was represented by Items #15, 23, 25, 27, and 29 -- all of which obviously have to do with talkativeness.

  • Similarly, Items #1, 8, 18, and 26 all have to do with seeking the limelight.

For ostensibly related behaviors (such as #15 and 23, or #1 and 8), the average correlations were only r = .297 (N = 27) and .211 (N = 30). For comparison purposes, Newcomb also drew a random sample of correlations among ostensibly unrelated behaviors (such as #15 and 1, or #23 and 8), obtaining average correlations of .226 (N = 27) and .247 (N = 30).

While there is some improvement in consistency when these dimensions are considered, there is not very much improvement. The intercorrelations among ostensibly related behaviors were not significantly higher than those found among theoretically unrelated behaviors, leading Newcomb to conclude either that "The basis on which the related behaviors are chosen is faulty" or "These behaviors have no peculiar relation to each other, so that the traits really do not exist as units". As we noted earlier, one may quarrel with some of the behaviors chosen by Newcomb as representative of extraversion-introversion or its subsidiary components; because so many of his choices are intuitively appealing, however, we favor the latter explanation.

Interestingly, the retrospective global judgments gave a higher index of trait consistency than did the daily records of specific behavior: an average r of .45 (p. 109). Apparently, after a month the counselors perceived the campers as having been more consistent in their behavior than they actually were. This is an early example of a problem alluded to earlier: the contamination of memory-based ratings by certain cognitive biases, in which the observer confuses what actually happened with what s/he expects to have happened. The daily records of specific behavior, because of the short retention intervals involved, are less susceptible to this problem, but they are not free of it entirely. It seems likely that even the paltry level of cross-situational consistency shown by the specific behavior records (r = .15 for all situations, and .XX for situations clustered together according to the nine sub-traits of extraversion-introversion) is itself inflated by these same sorts of biases, so that the actual correlation is probably closer to zero.

The reason for Newcomb's failure to find any particular consistencies in behavior among trait-relevant situations becomes more apparent when we examine the consistency of behavior within each of these situations. Of the 26 situations yielding sufficient data for analysis, only 21 were listed on the on-the-spot behavior ratings in a manner which permitted the recording of both positive and negative responses. For example, for the first item listed above ("Did he show confidence in his own abilities?"), two responses ("Boasted loudly of greater abilities than he had" and "Spoke confidently of ability he really had") were coded as expressions of extraversion, while two other responses ("Expressed lack of confidence in his own abilities" and "Hesitated even to try his ability") were coded as expressions of introversion. Newcomb concluded that "A few individual behaviors are highly consistent, with most of them hovering closer to complete inconsistency rather than perfect consistency". If there is little or no consistency to the specific behaviors, there can hardly be any consistency to the traits which they ostensibly represent.

Because of the wealth of data collected about specific incidents, Newcomb (1929) was able to uncover the reason behind the low levels of consistency displayed by his subjects. As he summarized his findings:

There are always slight differences in both internal and external stimuli which are important in determining behavior, yet are not recordable . . . . Situations are necessarily so different that large measurable consistency is not to be expected (p. 77). To cite an obvious example, "Whether or not Johnny engages in a fight may depend on whether or not he thinks he can 'lick' his opponent" (p. 39).

Although Newcomb hoped to cancel out these purely situational effects by observing his subjects on a number of different occasions over a long period of time, apparently this did not take place. As Newcomb noted, "The factor of specific behavior consistency, if there is such, resides less in the boys than in the situations" (p. 39).

Punctuality: The Dudycha Study (1936-1938)

In a study which is remarkable for the effort involved, Dudycha (1936, 1937a, 1937b, 1938) studied punctuality in 307 students at Ripon College. The study was conducted at a time when the requirements placed on college students were much more stringent than they are today. Attendance at classes and at chapel was strictly required, and students lived in on-campus dormitories and took their meals in the college dining hall. Dudycha took advantage of this situation and recorded the arrival times of students at various obligatory and optional appointments. Students with 8:00 classes were of special interest. Because the vast majority lived on campus, and no student had any problems with transportation, Dudycha reasoned that arrival time at one's 8:00 class would be maximally influenced by one's desire to be punctual, and minimally affected by extraneous variables. Similarly, breakfast was chosen as the representative meal, as the college dining hall was open only from 7:00-7:30 A.M. Late afternoon appointments were represented by rehearsals of extracurricular musical organizations, and by compulsory chapel services which were held monthly. Individual student appointments with faculty (e.g., conferences with the instructor of the freshman composition course, or with the Dean of Students) were taken as representative of the rest of the day. Evening appointments were represented by various sports and cultural events. In some instances the students' arrival times, to the nearest minute, were recorded by the instructors involved. In most instances, however, the recordings were made by Dudycha himself, with the aid of his spouse. For the better part of a year, they checked off names at the door to the college dining hall, lurked in the waiting rooms outside faculty offices, and collected tickets at college events. All of this was done quite surreptitiously, so that none of the subjects ever knew that they were being studied.

The situations employed were as follows:

  1. 8:00 Classes: 7232 observations of 211 subjects over 3 college terms.

  2. Breakfast at the College Commons: 3427 observations of 110 subjects over 45 consecutive days.

  3. Faculty Conference Appointments: 984 observations of 132 subjects during the academic year.

  4. Extracurricular Activities: College Band and College Singers; 1230 observations of 38 subjects over two quarters.

  5. Vesper Services: monthly; 1025 observations of 228 subjects over five months.

  6. Athletic Contests and Entertainments: basketball games, college plays, and concerts; 1462 observations of 283 subjects in 12 events over one academic year.

Later, Dudycha collected other information relevant to punctuality. For example, all of his subjects were rated on punctuality by three of their dormitory-mates. He even went so far as to poll the student body as a whole to determine who was considered to be the most and least punctual students. Finally, each subject completed a 47-item questionnaire regarding his or her attitudes toward punctuality, and related behavior. A subgroup of his subjects also participated in a number of formal experiments concerned with the accuracy with which they could estimate the time required to perform various tasks such as copying geometric figures and running various errands.

Dudycha's results were not discrepant from those of Hartshorne et al. and of Newcomb. Using the number of minutes early or late as a measure of punctuality, the average correlation across different situations was .19.

Against the objection that what matters is whether a subject is early or late, not how much s/he is early or late, Dudycha also used the minute-by-minute arrival time data to classify his subjects simply as generally early, on time, or late in each situation. The average contingency coefficient is .XX, about the same as that observed in the earlier analysis.
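The categorical analysis can be illustrated with a short sketch. It assumes that the statistic in question is Pearson's contingency coefficient, C = sqrt(chi-square / (chi-square + N)); the source does not say which coefficient Dudycha computed. The sign convention for arrival offsets (negative for early, positive for late), the function names, and the data are likewise hypothetical.

```python
from collections import Counter
from math import sqrt

def classify(minutes_offset, tolerance=0):
    """Classify an arrival as early, on time, or late.
    Assumed convention: negative offsets are early, positive offsets are late."""
    if minutes_offset < -tolerance:
        return "early"
    if minutes_offset > tolerance:
        return "late"
    return "on time"

def contingency_coefficient(labels_a, labels_b):
    """Pearson's contingency coefficient C for two paired lists of labels."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    row_totals = Counter(labels_a)
    col_totals = Counter(labels_b)
    chi2 = 0.0
    for row, row_n in row_totals.items():
        for col, col_n in col_totals.items():
            expected = row_n * col_n / n       # count expected under independence
            observed = joint.get((row, col), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (chi2 + n))

# Hypothetical typical arrival offsets (in minutes) for eight students
class_offsets = [-3, 0, 4, -1, 2, 0, -5, 6]
breakfast_offsets = [2, -2, 5, 0, -1, 3, -4, 0]

c_value = contingency_coefficient(
    [classify(m) for m in class_offsets],
    [classify(m) for m in breakfast_offsets],
)
```

C is 0 when the two classifications are unrelated, but its upper limit depends on the size of the table and falls short of 1, so it is not directly comparable to a product-moment correlation.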

Summarizing his results, Dudycha wrote:

We can not conclude that all subjects either possess or do not possess a general trait of punctuality; punctuality is not an all-or-none characteristic . . . . [M]any students are consistently punctual in certain situations; but there are none who are equally consistent in their tardiness.

Dudycha was able to find small groups of students who were consistently early or consistently late, and the generally positive correlations among the various situations indicated some small degree of cross-situational consistency. However, the level of consistency observed was by no means large.

Some of the reason for the low level of cross-situational consistency in arrival time is indicated by the comments made by the subjects about punctuality and their attitude toward it (Dudycha, 1938). A majority of the subjects claimed to be concerned about punctuality, out of consideration for others, because they felt that punctuality was essential for social and professional success or correlated with other desirable personality characteristics, and because they found latecomers annoying and the experience of being late embarrassing. Dudycha noted, however, that an individual's arrival time for an appointment also varied considerably depending on a number of factors. Obviously, mistakes in budgeting time, or in keeping track of its passage, will affect punctuality; so will the behavior of others -- as when, for example, several individuals go to a meeting as a group. Unanticipated delays also prevented subjects from being on time even when they wanted to be. Given the inconsistency in arrival time, it appears that these kinds of events are more the rule than the exception.

Just as influential were the importance of the appointment and the consequences of being late. As one of the subjects observed,

For instance, it is one thing to be late to church when you have to climb over four people in the second row in order to get a seat, and a more trifling thing to be late when you can just walk in and sit in the back row . . . . I don't believe that getting to classes on time, ordinarily, is worth missing half your breakfast and running four blocks and up three flights of stairs for. I believe in being punctual if it isn't too inconvenient (p. 216).

Dependency: The Sears Study

Sears (1963) studied dependency behavior in a group of 40 four-year-olds enrolled in a campus nursery school. By means of a time-sampling procedure, the children were observed one by one by four different judges on 20 different occasions randomly distributed with respect to the time of day and day of the week. A total of 7-10 hours of observation was devoted to each individual. In addition to this relatively unstructured situation, the children were also observed for one hour in the presence of their mothers.

The following behaviors were rated.

  1. Negative Attention Seeking: getting attention by disruption, aggressive activity with minimal provocation, defiance, or oppositional behavior.

  2. Positive Attention Seeking: seeking praise, seeking to join an in-group by inviting cooperative activity, actual interruption of an ongoing group activity.

  3. Touching or Holding: nonaggressive touching, holding, clasping onto others.

  4. Being Near: follows or stands near a particular child or group of children or a teacher.

  5. Seeking Reassurance, Comfort, or Consolation: apologizing, asking unnecessary permission or for protection or for help or guidance.

  6. Questionnaire: bids for attention when the mother was busy filling in a questionnaire.

  7. Puzzles: bids for attention when the mother was trying to get the child to perform skillfully on a puzzle.

While all of these categories probably strike the reader as instances of dependency behavior, in fact Sears found that dependency was not highly correlated across the various indices. The average correlation between indices was only about .05, with many (12 of the 50 correlations produced by examining the seven indices for boys and girls separately) actually negative.

Most revealing in the present context are the correlations between total dependency (summing across the five categories) manifested in the free-play situation and total dependency (summing across the two remaining categories) manifested in the structured situations with the mother present. The correlations were -.02 for boys and .06 for girls, clearly not very substantial. Sears concluded (1963, p. 36) that

These findings suggest that the use of such a term as Allport's common trait is not warranted in describing the structure or organization of dependent behavior. Rather, each of the five categories [of free-play behavior] needs to be considered separately with respect to its origin.

Sears might have added, on the basis of the correlations between behavior in the free-play and structured situations, that the findings did not support a notion of cross-situational consistency in dependent behavior either.

The Story So Far...

None of these studies seems to have attracted a great deal of attention at the time it was published, but interest in them was revived after 1968, when Mischel (1968) and others cited them -- especially the work of Hartshorne and May (1928-1930) -- as evidence that the Doctrine of Traits was seriously lacking in empirical validity. Thereafter, a number of investigators -- many of them more favorable to the Doctrine than Mischel was -- revived the study of cross-situational consistency with a new spate of studies inspired by Hartshorne and May, but offering ostensibly better methodologies.

Consistency as a Moderator Variable

One notable trend was a rethinking of the concept of consistency. Bem and Allen (Psych. Rev. 1974), paraphrasing Abraham Lincoln, suggested that, while we can't predict all of the people all of the time, we might still be able to predict some of the people some of the time. They offered a new take on the consistency issue by arguing that some people are just more consistent in their behavior than others, and that consistency could serve as a moderator variable in the trait-behavior relationship. In technical terms, a moderator variable is one that affects -- moderates -- the relationship between a predictor (or independent) variable and some criterion (or dependent) variable. (Moderator variables are frequently confused with mediator variables, which are third variables that explain the relationship between two other variables; for a thorough discussion, see Baron & Kenny, JPSP, 1986.) On B&A's account, only people who are consistent on a dimension such as extraversion can be said to possess the trait, and predictions of behavior in specific situations hold only for those who show high levels of behavioral consistency. While that might strike a reader as somewhat circular -- only consistent people have a trait, and only people who have a trait behave consistently -- the more important idea is that consistency is itself something like a trait, except that, in their view, consistency is trait-specific. Only consistently friendly people possess the trait of friendliness, and a consistently friendly person may not necessarily be consistently conscientious (or whatever).
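The logic of a moderator variable can be illustrated by computing the predictor-criterion correlation separately within each level of the moderator. What follows is only a sketch under stated assumptions: the scores are invented for illustration, the group labels and function names are ours, and it is not a reconstruction of B&A's actual analysis.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_group(records):
    """Trait-behavior correlation within each level of the moderator.

    `records` is a list of (group_label, trait_score, behavior_score).
    """
    groups = {}
    for label, trait, behavior in records:
        traits, behaviors = groups.setdefault(label, ([], []))
        traits.append(trait)
        behaviors.append(behavior)
    return {label: pearson_r(t, b) for label, (t, b) in groups.items()}

# Invented scores, for illustration only (NOT data from any study).
records = [
    ("consistent", 1, 2), ("consistent", 2, 3), ("consistent", 3, 5),
    ("consistent", 4, 6), ("consistent", 5, 9),
    ("variable", 1, 4), ("variable", 2, 1), ("variable", 3, 5),
    ("variable", 4, 2), ("variable", 5, 4),
]
result = correlation_by_group(records)
for label, r in result.items():
    print(label, round(r, 2))
```

If self-rated consistency really moderates the trait-behavior relationship, the correlation in the "consistent" group should be reliably higher than in the "variable" group; if the two correlations are about equal, there is no moderation.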

The B&A study focused on two traits, friendliness and conscientiousness.  The subjects first completed global self-ratings of each trait, as well as a global measure of how consistently they behaved with respect to each trait.  They then completed a "Cross-Situational Behavior Survey" (CSBS) which described 24 different common situations, and rated how friendly or conscientious they were likely to be in each of them.  Then, of course with the permission of the subjects, these same instruments -- global ratings of friendliness and conscientiousness and CSBS for both traits -- were completed by the subjects' mothers, fathers, and roommates (or other close peer); the global and CSBS ratings for each trait were then combined to yield total scores for friendliness and conscientiousness.  They also collected objective, behavioral measures of friendliness in two situations -- participation in a group discussion and initiating conversation with a stranger; and objective, behavioral measures of conscientiousness in three situations -- returning course evaluations, completing course readings, and neatness of their dormitory room. 

If you think of the self-ratings, mother-ratings, father-ratings, and peer-ratings as representing four different settings in which the subjects could display friendliness and conscientiousness, plus the two objective behavioral assessments of friendliness and three of conscientiousness, that yields a total of 13 different sessions in which these traits were assessed.  As predicted, subjects who rated themselves as more consistent in their behavior showed higher average cross-situational correlations (mean rs = .57 for friendliness and .45 for conscientiousness) than those who rated themselves as more variable (mean rs = .27 and .09, respectively). 

It would appear, on the basis of these results, that B&A had broken the "ceiling" on predictability and cross-situational consistency implied by Mischel's "personality coefficient".  In fact, they claimed that "the magic +.30 barrier appears to have been penetrated" (p. 514).  However, their results are actually not so clear (see a detailed critique by Chaplin & Goldberg, JPSP 1984).

As a side note: The B&A study also bears on the question of predictability. It follows from their argument about cross-situational consistency, as implied by the title of their paper, that consistent subjects should show higher trait-behavior correlations than inconsistent subjects. Although B&A did not highlight these particular results, it turns out that the same difference between consistent and variable subjects appeared with respect to predictability as well as cross-situational consistency. For purposes of this analysis, we can consider subjects' self-ratings as the measures of friendliness and conscientiousness, and see how these trait ratings correlate with behavior in the trait-relevant situations.

Despite these anomalies, the idea that "you can predict some of the people some of the time", and that self-rated consistency could moderate both cross-situational consistency and trait-behavior predictability, caught on, inspiring many attempts to replicate B&A's results. Most of these attempts succeeded, although the later studies generally found much less difference between consistent and variable subjects than B&A did. This is actually quite a common result in psychological research, and it even has a name: the "decline effect" (see Jonah Lehrer, "The Truth Wears Off", New Yorker, 2010).

Among these was a paper by Mischel and Peake (1981), which studied a wide variety of behaviors relevant to friendliness and conscientiousness in a group of college students (that Mischel led this study is particularly appropriate, as B&A's study was initially directed at his notion of the "personality coefficient"; did I mention that, at the time, Mischel and Bem were colleagues at Stanford?).

The most extensive of these replication attempts was by Chaplin and Goldberg (1984), described earlier in the discussion of predictability, and it failed utterly. As noted earlier, the C&G study had more to do with predictability than with cross-situational consistency, but -- as noted in the discussion of B&A -- the same logic holds. When the subjects were divided on the basis of their self-reported consistency, the average trait-behavior correlation, aggregated across three different measures of consistency, was r = .08 for the consistent subjects and r = .08 for the variable subjects.

So, no matter how you look at it, the prospects of "consistency" serving as a moderator of either predictability or cross-situational consistency are pretty dim.

The Story Even Further...

Except for Bem, these investigators are unanimous: behavior, at least in these domains, doesn't seem to be determined by some abstract, highly generalized, underlying trait. To be sure, some people behave consistently from one situation to another. But there is not a great deal of consistency across subjects. At first blush, it appears that Newcomb was right: "The factor of consistency, if there is such, resides less in the boys than in the situation". But it's the perceived situation that matters, not the objective situation. As Newcomb noted, "Whether or not Johnny engages in a fight may depend on whether or not he thinks he can lick his opponent".

Realism versus Idealism

A central assumption in trait-based personality assessment is that self-reports and peer-ratings are isomorphic with actual behavior. This is necessary, if the relations among individual items in these ratings are to reflect the actual co-occurrence of behaviors in the real world, and thus reveal underlying traits. However, there are substantial reasons for thinking that this assumption is overstated, if not simply invalid.

One of the factor-analytic solutions to the structure of personality traits discussed in Chapter 2 was the five-factor solution offered by Norman (1963). In a long series of investigations, the same structure consistently emerged across many different samples, prompting Norman (1963) to claim that he had found a universally applicable structure, one representing "a highly stable structure of personal characteristics" (1963, p. 581). This conclusion was reinforced by Goldberg (1982), who summarized a large number of multivariate studies of ratings on trait adjectives. Norman's factor structure has been consistently replicated by all who have worked with Cattell's 171-item reduction of the Allport-Odbert list (Borgatta, 1964; Digman & Takemoto-Chock, 1981; Fiske, 1949; Goldberg, 1981; Tupes & Christal, 1961). Moreover, Goldberg (1980) has shown that the same five factors emerge regardless of the particular type of factor analysis employed (e.g., orthogonal or oblique), or whether the analysis is performed on self- or peer-ratings. As Goldberg (1982, pp. 160-161) notes,

Clearly, there is something to this structure . . . . These are data speaking for themselves . . . . Whether the data come from self reports or from descriptions of other people, whether based on one kind of rating scale or another, no matter what the method for factor extraction or rotation, the results are much the same. Clearly, these five individual differences are . . . compelling places to start our search for a universal order of emergence of personality terms.

It appears, then, that Norman's set of five dimensions provides the replicability that we sought in Cattell's dimensions, plus a compromise between structures such as Eysenck's that do not seem to be differentiated enough to capture the richness of human individuality, and others such as Guilford's which seem too highly differentiated to be economical.

However, a variant of Norman's experiment, conducted by Passini and Norman (1966), yielded a startling result. In this study, subjects provided ratings of complete strangers -- individuals whom they had never met before the rating session, and with whom they were not allowed to interact. The judges could not possibly know where the target of their ratings stood on the 20 scale items that comprise Norman's five factors. Thus, they were required to rate their targets as they "imagined" them to be. Nevertheless, the study yielded the same five factors that had appeared in the earlier investigations. When the judges made their ratings, they apparently capitalized on their intuitive knowledge of the relationships among personality traits. Thus, having decided (for whatever reason) that their target was talkative, they also judged him or her to be relatively open, adventurous, and sociable. The result: the factor of extraversion.

While admittedly the situation for Passini and Norman's judges was a little out of the ordinary, a little reflection suggests that the same intuitive processes are also operative in the studies where judges rated people they know well. Suppose you were to be asked whether your friend were adventurous. You don't necessarily know, having perhaps never observed him or her in a situation where he or she could display this quality. However, you know something about adventurous people: they tend to be talkative, and sociable, and they aren't very secretive. Your friend has these qualities, so you make the inference that s/he also is adventurous. It is your best guess, given the information available. If everybody in a sample does this (and there is no reason to think that they do not), and people share similar ideas about what qualities go together in personality, then the actual relations among behaviors -- what is supposed to be rated -- will get inextricably confused with the assumed relations among them. In the present context, an ostensible "universal personality structure" such as Norman's may be a universal conception of personality structure, but no more.

Implicit Personality Theory

The proposal, then, is that each of us possesses what Bruner and Tagiuri (1954, p. 649) referred to as a "naive, implicit 'theory' of personality [that] people work with when they form an impression of others" (for other early statements of this concept, see Cronbach, 1955; Jones, 1954; Kelly, 1955; Steiner, 1955; for recent reviews, see Schneider, 1973; Schneider, Hastorf, & Ellsworth, 1979). Such a theory is described as naive in that it is the product of informal and uncontrolled observation; it is implicit in that it is rarely articulated explicitly by those who possess it, and may even be unconscious. In every other respect, however, implicit personality theory is a theory, just like those formulated by professional personologists.

As such, implicit personality theory consists of some assumptions concerning human nature (e.g., "all men are beasts"), the causes of human behavior (e.g., "people are innately aggressive"), and the origins of individual differences (e.g., "the child is father to the man"). As we will show later (Chapter 10), a common assumption in implicit personality theory is that individual differences in behavior may be attributed to people's internal behavioral predispositions -- in short, to their traits. Accordingly, investigations of implicit personality theory also attempt to assess generalized beliefs concerning the number of basic traits and their names, the average standing of the population on each of these dimensions, the amount of variance within the population, and the relationships among basic traits. It is this latter aspect, people's beliefs about the relationships among personality traits, that appears to account for the findings of Passini and Norman (1966).

The Asch experiment. A classic study of the effects of these beliefs on personality ratings was provided by Asch (1946). He presented his subjects with an ensemble of traits ostensibly characterizing an individual, and then asked them to write a paragraph describing the target and to complete an adjective checklist. One group of subjects received the following list of traits:

intelligent, skillful, industrious, warm, determined, practical, and cautious.

Another group received the identical list except that cold was substituted for warm. The resulting impressions were strikingly different. Targets characterized as "warm" were described more positively than those characterized as "cold"; but this was not the case for all of the traits rated by the subjects. These effects were replicated by Mensh and Wishner (1947) and Kelley (1950). Apparently, the effect of the warm-cold variable was not merely to increase or decrease the target's perceived social desirability in general. The subjects seemed to base their judgments on fairly differentiated intuitions about which traits go together, and which do not.

Not all such variations have these effects. For example, when Asch (1946) substituted polite and blunt for the warm-cold dimension in the ensemble given above, the difference between the two groups was substantially reduced. For this reason, Asch argued that while the warm-cold dimension was central to the final impression formed by the subjects, polite-blunt was more peripheral. In a later study, Wishner (1960) attempted to determine what made some traits central and other traits peripheral. He found that central traits -- those that, when altered within the context of a particular stimulus ensemble, changed responses on the checklist to the greatest degree -- were those which were most highly correlated with the stimulus and response terms. If subjects believe that a trait is correlated with others on the stimulus and response lists, they will be influenced by that trait; and the perceived strength of the correlation will determine the degree to which they are influenced, one way or another. Wishner's analysis proposed an elegant solution to a longstanding puzzle within social cognition and personality assessment. In the present context, it also offers powerful and dramatic testimony to the richness and differentiation of the matrix of trait relationships which people possess as part of their intuitive social knowledge.

The matrix of trait relationships. Since the work of Asch and Wishner, a number of investigators have attempted to determine precisely what this matrix is. These studies have employed the same factor analysis and related multivariate statistical procedures employed in the search for a universally applicable structure for representing individual differences (Chapter 2). There is an important difference, however: while the earlier studies attempted to discover the actual relationships among various behaviors and traits, as they occur in nature, these newer studies focused on perceived relationships, as they were represented in the minds of observers. Typically, they ask people to rate the degree to which two behaviors go together, or the degree to which two adjectives are similar in meaning. No reference is made to any particular stimulus person, real or imagined.

Some early hints about the cognitive representation of personality traits were provided by the work of Osgood and his associates on the basic dimensions of meaning in language (Osgood, Suci, & Tannenbaum, 1957). They asked subjects to describe an arbitrary assortment of objects -- animate and inanimate, human and nonhuman -- in terms of a representative set of adjectives. Analysis of the interrelations among these ratings consistently revealed three major independent dimensions:

evaluation (e.g., good-bad, optimistic-pessimistic, complete-incomplete);

potency (e.g., strong-weak, hard-soft, severe-lenient); and

activity (e.g., active-passive, hot-cold, excitable-calm).

In fact, these dimensions have appeared in a number of studies of implicit personality theory (for a review see Rosenberg & Sedlak, 1972). However, closer inspection revealed that the evaluation dimension could be split into two separate dimensions, named social evaluation (e.g., warm-cold) and intellectual evaluation (e.g., intelligent-stupid).  Note, first, that adjectives and their antonyms generally appear opposite each other in this space, as they should. Note also that the "pure" evaluation dimension lines up more closely with social evaluation than with intellectual evaluation, which makes some sense. Finally, the dimensions are not at right angles to each other (hard-soft vs. good-bad comes closest), meaning that the structure can be reduced still further -- ultimately, it appears, to a single dimension of evaluation.

As with the literature on the structure of personality, there is no need to choose between a conceptual structure which involves one or two dimensions and one which involves three or five (or more). The cognitive representation of personality traits is probably hierarchically organized. Probably all the subordinate dimensions resolve to a single superordinate one of (social) evaluation (Rosenberg, 1977). This prepotent dimension is readily apparent in the "halo" effect (Thorndike, 1920), in which raters assume that high standing on one socially desirable trait implies high standing on all other socially desirable traits. The negative pole, perhaps, is reflected in stereotypes of social outgroups, which typically involve clusters of socially undesirable features. But at the same time, people can make finer distinctions if they find it necessary or useful to do so.

The Systematic Distortion Hypothesis

Everyone agrees on the existence of implicit theories of personality, but there is considerable disagreement on their nature and the effects they may have on the usual sorts of personality ratings. One possible conclusion is that the conceptual relationships among behaviors and traits are derived from observations of the ways in which these variables actually co-occur and covary in the real world. Thus, the mental representation of trait relationships is an accurate reflection of reality (Block, Weiss, & Thorne, 1979; Jackson, Chan, & Stricker, 1979; Passini & Norman, 1966). Another is that the mental representation of behavior and trait relationships is based on preconceived notions of what goes together. These notions, which are generally inaccurate, thus introduce systematic distortion into the impressions which we form of other people (Shweder, 1981; Shweder & D'Andrade, 1979, 1980).

These kinds of distortions are readily observed in classic research by Chapman and Chapman (1971) on the illusory correlation. The Chapmans were concerned with a paradox which has long plagued the field of psychopathology and psychotherapy: practicing clinicians often rely on psychological tests whose validity has not been demonstrated by formal research. The problem is not simply that the tests are too new to have been validated, or that clinicians are unfamiliar with the relevant scientific literature. Rather, the claim is that the tests are valid, despite scientific evidence to the contrary. Clinicians insist that the lore which surrounds the tests in question provides valid assessments and predictions. When asked for evidence, they often point to the consensus that exists among their colleagues about these tests, and ask how such agreement could have evolved in the absence of repeated successful experience in applying the tests. It is this question that the Chapmans sought to answer.

In the first of their experiments, Chapman (1967) presented subjects with words in the form of paired-associates. Each of four stimulus terms was paired with each of three response terms, making for 12 combinations in all. On 1/3 of the combinations the stimulus word was paired with a response which was a close associate (e.g., tiger-lion); on the remaining 2/3 of the combinations the two terms were unrelated (e.g., boat-eggs). Each of the 12 combinations was presented an equal number of times, either 4, 10, or 20 presentations per pair. When the subjects were subsequently asked to estimate the frequency with which each stimulus term was paired with each of the three response terms, they consistently overestimated the co-occurrence of the closely associated terms. In this case, then, the subjects perceived a correlation between two events (words) that was wholly illusory -- an effect mediated by the linguistic relationships between stimulus and response.
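The force of this design is that the objective co-occurrence rates are fixed by construction, so any departure in the subjects' estimates must come from the subjects. A minimal sketch of the presentation schedule makes the point; the labels S1-S4 and R1-R3 are placeholders rather than Chapman's actual words, and the variable and function names are ours.

```python
from collections import Counter
from itertools import product

stimuli = ["S1", "S2", "S3", "S4"]   # placeholder labels for the four stimulus terms
responses = ["R1", "R2", "R3"]       # placeholder labels for the three response terms
presentations_per_pair = 10          # the text mentions 4, 10, or 20 per pair

# Every stimulus-response combination appears equally often in the series.
series = [pair
          for pair in product(stimuli, responses)
          for _ in range(presentations_per_pair)]
counts = Counter(series)

def objective_share(stimulus, response):
    """Proportion of a stimulus's presentations on which it appeared with `response`."""
    total = sum(counts[(stimulus, r)] for r in responses)
    return counts[(stimulus, response)] / total

for s in stimuli:
    print(s, [round(objective_share(s, r), 3) for r in responses])
```

Every stimulus is paired with every response on exactly one third of its presentations, whether or not the two words are close associates. An accurate observer should therefore report equal frequencies for all 12 pairs.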

The same kind of effect was observed in two series of studies involving the perceived correlates of various responses on psychological tests. Chapman and Chapman (1967) presented line drawings of people ostensibly made by psychiatric patients, each accompanied by a statement of that "patient's" alleged symptom. Each drawing possessed a distinctive feature such as broad shoulders, atypical eyes, or an overemphasized mouth. The symptoms were statements about concern for masculinity, intelligence, impotence, etc. Each picture type was paired with each statement an equal number of times, and the pairs were presented in a long series consisting of many repetitions. When asked how often each quality of drawing had been paired with each statement, the subjects consistently overestimated the frequency of certain combinations. The perceived associations are intuitively appealing, for the most part, but they simply aren't true. Similar findings were obtained when alleged symptoms were paired to ostensible responses to an inkblot test. In all these experiments, perceptions of correlation or co-occurrence were strongly influenced by pre-existing semantic relationships between the items. Where the items were strongly related in conceptual terms, judgments of correlation were inflated.

We shall have more to say about the illusory correlation later (Chapter 10), when we turn our attention to the processes involved in social cognition. For the present, we wish to entertain the hypothesis that similar illusions and distortions affect perceptions of behaviors and traits which form the basis of analyses of the structure of personality. The following treatment is derived principally from Shweder (1981).

At the heart of Shweder's argument is a distinction between behavior recorded on line -- that is, at the moment at which it happens -- and behavior recorded from memory -- that is, where a judge or rater is asked to form an impression of an individual's behavior sometime after it has occurred. It should be clear that the kinds of data typically subjected to factor analysis in the search for a universal structure of personality traits have been memory-based: self-reports of past behavior, summaries provided by informants who have had some contact with the object of their ratings, and the like. The basic argument is that memory-based ratings are contaminated by the rater's pre-existing ideas of what behaviors should go together, and thus present a distorted picture of what behaviors actually go together. This contamination is not the same as the kind of random inaccuracy that we would naturally expect from all but a perfectly reliable recording device; rather, the distortion is systematic, following lines dictated by the conceptual associations among behaviors. Shweder offers three types of evidence favoring his systematic distortion hypothesis.

Conceptual Association versus Rated Behavior

A large number of studies have examined the relationship between the conceptual similarity of various behaviors, and the correlations among these same behaviors that emerge from memory-based ratings. Consider, for example, a set of findings which emerged from Bales's (1970) study of small group behavior. In the study, participants in a number of small groups were asked to rate themselves on a variety of questionnaires, including the MMPI and 16PF; moreover, they were also rated on a number of dimensions by other group participants, and by a group of "objective" observers. The intercorrelations among these items yielded three dimensions of power, likeability, and task-orientation (note the similarity to Osgood's dimensions of potency, evaluation, and activity). Based on these findings, Bales then constructed a 26-item rating scale to tap these three factors. The 26 items, when subjected to factor analysis, did in fact yield these three factors. Later, Shweder (1975) asked a group of judges simply to sort these 26 items into categories based on "similarity of meaning". The similarity judgments of a number of raters were subjected to a form of multivariate analysis known as multidimensional scaling. The result was that the three-factor structure derived from conceptual similarity ratings was almost identical to the three-factor structure derived from memory-based behavior ratings.

This is not an isolated result. Shweder (1981) has summarized a number of studies, in each of which the structure revealed by multivariate analysis of observers' memory-based judgments has been substantially replicated by corresponding analyses of conceptual similarity. Recall that the identical five-dimensional structure obtained by Norman (1963) from ratings made by close acquaintances was also obtained by Passini and Norman (1966) from ratings made by complete strangers. Likewise, the symptom-clusters comprising the major psychiatric syndromes such as schizophrenia and depression have been found both when psychiatric patients are rated (e.g., Overall, Hollister, & Pichot, 1967) and when judges are simply asked to rate the degree to which the symptoms go together conceptually (Shweder & D'Andrade, 1980). Similarly, the three principal factors obtained in questionnaire assessments of Murray's human motives (Huba & Hamilton, 1976), discussed earlier, have been replicated in conceptual similarity ratings (Ebbesen & Allen, 1977). And Jackson and Helmes (1979) replicated Wiggins's (1979) circumplex structure of interpersonal traits in a computer simulation in which hypothetical subjects were programmed to respond solely on the basis of social desirability and acquiescence tendency. Similar findings have been obtained in several other domains as well (Shweder, 1981; Shweder & D'Andrade, 1979).

Where do these conceptualizations of association come from? From one point of view, the mental representations of behavior manifested in the conceptual association matrices are accurate reflections of the associations among these behaviors in the real world (Block, Weiss, & Thorne, 1979; Jackson, Chan, & Stricker, 1979; Passini & Norman, 1966). According to the "accurate reflection" hypothesis of conceptual association, people abstract these relationships from their observations of themselves and others in everyday social interactions -- in other words, these behaviors really are correlated with each other. The evidence in this respect comes from a small number of studies which have examined the relationships among various behaviors as they were recorded "on line", as they occurred.

Conceptual Association versus Actual Behavior

If implicit personality theory reflects the way behaviors and traits are organized in the real world, then we should observe a high degree of correspondence between the structure of behavior as it emerges from "on line" observation and that which is revealed by armchair speculation. Unfortunately, very few data sets exist that permit these comparisons. In part this state of affairs reflects the scarcity of "on line" observational studies -- almost all of them were reviewed in the previous section on trans-situational consistency. And very few of these studies have been subjected to conceptual association analyses.

However, some hints about the outcome of such research are provided by these "on line" studies themselves. All of them employed multiple behavioral indices of traits, on the quite reasonable assumption that they would show some degree of coherence. Thus, Hartshorne, May, and their colleagues (Hartshorne & May, 1928; Hartshorne et al., 1929, 1930) thought that cheating in class, false reporting in the gym, and stealing at parties would covary as aspects of a single trait of honesty-dishonesty; and that deceit, service, persistence, and altruism would covary as elements of moral character. Similarly, Newcomb, Dudycha, and Sears thought that their variables would covary as components of extraversion, punctuality, and dependency, respectively. Bem and Allen (1974) thought that their variables would come together to represent friendliness and honesty. The point is that their predictions must have struck the readers of this chapter -- as they did the original investigators, and the authors of this book -- as intuitively appealing. Yet all of the above-named investigators were surprised, and chagrined, to find that their dependent variables did not covary as expected. The conceptual associations existing in the minds of these investigators were not replicated in the correlations among the actual behaviors. Again, this is not an isolated finding. To date, a total of XX studies have compared the associations among behaviors recorded "on line" (or virtually so) with the conceptual associations among these same behaviors, in the abstract. In no case is there a substantial relationship between the actual behavior relations, on the one hand, and the conceptual associations, on the other. Given this kind of evidence, it is difficult to conclude that the conceptual association matrices are an accurate reflection of reality. Rather, it seems more likely that the conceptual associations present a distorted view of the relations among various behaviors.

Before this assertion can be made with any confidence, however, a third analysis is needed -- this time, of the relationship between behavior recorded "on line" and behavior rated from memory.

Actual Behavior versus Rated Behavior

In order to pin the issue down completely, it is necessary to consider the only remaining comparison -- between actual behavior and rated behavior. We have already introduced one such comparison in Newcomb's study of extraversion. There, observers rated 30 trait-relevant behaviors at the end of each day ("on line", for all intents and purposes), and then made retrospective judgments on the same scales at the end of one month's contact with their charges. As noted earlier, considerably more consistency was found in the memory-based judgments than in the more immediate ratings.

Putting It All Together

If few studies have directly compared conceptual association with actual behavior, or rated behavior with actual behavior, fewer still have made all three comparisons at the same time. One such study has been reported by Shweder and D'Andrade (1980). Subjects viewed a half-hour-long videotape depicting a family interaction. Some viewers provided "on line" ratings of the family members' behavior in terms of 11 categories; others rated the interaction retrospectively on the same dimensions; still others made judgments of the degree of similarity between the rating scales. The patterns found in conceptual associations and rated behavior were very similar (r = .59); however, the organization of actual behavior was only weakly related to that of rated behavior (r = .22), while if anything it was contrary to the structure of the conceptual associations (r = -.29).
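
The logic of a three-way comparison of this kind can be made concrete with a short computational sketch. The Python code below is an illustration only, not the procedure actually used by Shweder and D'Andrade: it assumes that each source of data (conceptual similarity judgments, memory-based ratings, and "on line" records) has been reduced to a symmetric matrix of associations among the same behavior categories, and it indexes the similarity of two such matrices by the Pearson correlation between their corresponding off-diagonal entries. The function names and the three small matrices are hypothetical, invented for the example; they are not data from any study cited here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Build a category-by-category correlation matrix.

    `columns` holds one list of scores per behavior category, with one
    score per person (or per observation period) in each list.
    """
    k = len(columns)
    return [[1.0 if i == j else pearson(columns[i], columns[j])
             for j in range(k)] for i in range(k)]


def off_diagonal(matrix):
    """Return the entries above the diagonal, in a fixed order."""
    return [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]


def structural_similarity(matrix_a, matrix_b):
    """Correlate the off-diagonal entries of two association matrices."""
    return pearson(off_diagonal(matrix_a), off_diagonal(matrix_b))


# Hypothetical 4-category matrices, invented purely for illustration.
conceptual = [[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.2, 0.1, 0.7, 1.0]]
rated = [[1.0, 0.6, 0.2, 0.1],
         [0.6, 1.0, 0.1, 0.2],
         [0.2, 0.1, 1.0, 0.5],
         [0.1, 0.2, 0.5, 1.0]]
actual = [[1.0, 0.10, 0.20, 0.15],
          [0.10, 1.0, 0.20, 0.10],
          [0.20, 0.20, 1.0, 0.05],
          [0.15, 0.10, 0.05, 1.0]]

print("conceptual vs. rated: ", round(structural_similarity(conceptual, rated), 2))
print("rated vs. actual:     ", round(structural_similarity(rated, actual), 2))
print("conceptual vs. actual:", round(structural_similarity(conceptual, actual), 2))
```

In this made-up example the "actual" matrix contains uniformly low positive correlations, while the conceptual and rated matrices share two tight clusters; the resulting pattern of coefficients has the same general form as the one reported above, but the numbers themselves carry no empirical weight.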

None of this is to say that social behavior is random, unorganized, or haphazard; or that our intuitive knowledge of the organization of social behavior is totally illusory. The matrices representing actual behavior in the domains that have been examined contain correlations that are overwhelmingly positive, even if they are somewhat low. This indicates that there is some low-level consistency of the type hypothesized by the Doctrine of Traits. However, it is clear that our preconceptions of "what is like what" guide and distort our perceptions of "what goes with what" (Shweder, 1981). If social behavior is organized the way trait theory asserts it is organized, this structure is not likely to be revealed by memory-based ratings of people's own or others' behavior. It must be sought in observations of actual behavior, recorded on line.

Where, then, do these traits reside -- in the behavior of the subjects, or in the minds of the observers? The clear implication of these studies is that the structure of personality traits revealed in behavioral self-reports may be an ideal structure, existing in the minds of those making the judgments, rather than a real structure existing in the actual relationships among the behaviors being rated. At least, the real structure is inextricably bound up with the ideal one. Our preconceptions about "what is like what" inevitably distort our observations of "what goes with what". We may never be able to remove this systematic distortion, so as to reveal the "actual" structure of personality traits.

Reaction to the Critique

The empirical difficulties with trait psychology have been widely acknowledged, even by those who continue to work in the domain (e.g., Block, 1977; Epstein, 1977; Hogan, DeSoto, & Solano, ref). By and large, however, these investigators attribute the problems to inadequate methodology. The general argument is that increasing technical refinements in personality assessment will yield better predictions of behavior in specific situations and more compelling evidence of cross-situational consistency. Some support for this point of view is provided by the work of Block (1971, 1977), who through careful assessment was able to amass evidence of substantial consistencies in some traits over long periods of time.

Under the influence of the concept of construct validity, there developed a sophisticated critique "from within" of certain approaches to personality assessment and a correspondingly superior technology of personality assessment.