Trait Research

In the Doctrine of Traits, Allport argued that personality could be best construed in terms of a set of dispositions to behave in a particular way across a wide variety of situations and over extended periods of time. A large part of trait psychology has been devoted to uncovering the "grand design" of the structure of personality. Using factor analysis and related multivariate methods, a number of investigators have attempted to determine the number of trait dimensions needed to account for individual differences in personality. These efforts constitute the classic trait theories of Guilford, Cattell, and Eysenck -- and their followers. Other personality psychologists, however, have not been so heavily involved in the enterprise of establishing a universally applicable scheme for describing the structure of personality. While heavily influenced by Allport and the other classic trait theorists, and certainly adhering to the doctrine of traits, they have pursued somewhat narrower topics within the field. In this chapter we sample some of these approaches.

The Assessment of Individual Differences

From the biophysical point of view, traits are possessed by people as part of their neurological constitution. Accordingly, it makes sense to measure people in terms of their standing on various trait dimensions -- just as we would measure their height or weight. One of the major outgrowths of trait theory has been the development of a sophisticated technology for personality assessment by means of self-report questionnaires. For an introduction to this area, see Lanyon and Goodstein (1971) and Wiggins (1973). Anastasi (ref) gives concise summaries of most of the major questionnaires employed in personality assessment today. These "psychological tests" represent the application of formal psychometric principles to the kinds of informal personality judgments which we make every day about people.

The Nature of Personality Tests

In a manner directly analogous to a ruler or a scale, personality tests are supposed to assign numbers to people in accordance with their standing on one or more trait dimensions. As with any measuring device, these instruments must possess certain properties if they are to be useful: reliability, validity, and utility (American Psychological Association, ref).


Reliability

An assessment device is useful only to the degree that it yields a precise measurement of the attribute in question: a rubber ruler is not as reliable as one made of steel because it can be bent, stretched, and compressed. The easiest way to determine reliability is to see whether multiple measurements of the same object by the same instrument yield the same scores.

  • When two or more observers use a test simultaneously to make judgments about a person, the agreement among them is called interrater (or interjudge) reliability.

  • When a single observer makes trait ratings about a person on two or more different occasions, this is test-retest reliability.

In both cases, reliability is indicated by high correlations (typically above .75) between the different observations. Note that the requirement of test-retest reliability is based on the assumption that personality traits are stable over time. For certain more transitory features of personality such as emotional or motivational states, of course, retest reliability will not necessarily be high (Atkinson, 1980; McClelland, 1980).
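To make the idea of a reliability coefficient concrete, here is a minimal sketch that computes the Pearson correlation between two sets of scores. The scores are invented for illustration, and the helper name pearson_r is our own; the .75 threshold is the rule of thumb mentioned above, not a formal standard.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical trait scores for six people tested on two occasions.
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 10]

retest_r = pearson_r(time_1, time_2)
is_reliable = retest_r > 0.75  # rule-of-thumb threshold from the text
print(round(retest_r, 3), is_reliable)
```

The same function serves for interrater reliability: simply pass the ratings made by two different observers instead of the scores from two occasions.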

A more complicated form of reliability is known as internal consistency. With certain attributes of people, such as height and weight, a single measurement -- in inches or centimeters, pounds or kilograms -- is all that is required. Investigators of personality traits, however, have not been able to settle on any single, unambiguous test for the dimensions that interest them. Rather, following the principle of convergent validity discussed earlier (in "Types and Traits"), they must rely on multiple measurements, each an imperfect manifestation of the trait.

Consider, for example, some of the items in a scale measuring responsibility, from the Jackson Personality Inventory. While most of us would agree that each of these items has something to do with responsibility, most of us would be reluctant to choose just one item for purposes of assessment. (If you need convincing on this point, poll five different people about which single item they would pick. We wager that you will get at least four different answers.)

I contribute to charity regularly.
If I accidentally scratched a parked car, I would try to find the owner to pay for the repairs.
Under no circumstances would I give incorrect testimony in evidence in court.
If the conductor on a train forgot to take my ticket, I would tell him.
I am very careful not to litter public places.

For this reason, we would expect that each of these items would correlate with the other four, as well as with the total score on the test calculated by summing over the items (technically, the total score is corrected by dropping the item in question, so that the correlations are not inflated by the correlation between the item and itself). In factor-analytic terms, these results would mean that a single common factor -- in this case, responsibility -- runs throughout the test items. Internal consistency is often indexed by Kuder-Richardson Formula 20 (or 21), which yields a statistic similar to the correlation coefficient. Internal consistency is an important part of Loevinger's (1957) structural component of construct validity.
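The two statistics just described can be sketched in a few lines. The example below assumes True-False items scored 1 or 0, uses invented responses from six people to a five-item scale, and applies the usual textbook form of Kuder-Richardson Formula 20 with the population variance of total scores. The function names are ours, not part of any standard package.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def corrected_item_total(responses):
    """Correlate each item with the total score after dropping that item."""
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    correlations = []
    for i in range(k):
        item = [person[i] for person in responses]
        rest = [t - v for t, v in zip(totals, item)]  # total minus the item
        correlations.append(pearson_r(item, rest))
    return correlations

def kr20(responses):
    """Kuder-Richardson Formula 20 for items scored 0 or 1."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n  # endorsement rate
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses: one row per person, one column per item.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

item_rest = corrected_item_total(responses)
consistency = kr20(responses)
print([round(r, 2) for r in item_rest], round(consistency, 3))
```

With these made-up data every corrected item-total correlation is positive, which is what we would expect if a single common factor runs through the items.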


Validity

Does the test measure what it is supposed to measure? Reliable tests are not necessarily valid -- a steel ruler gives a consistent measurement, but it is not very good as a measure of weight.

In general, there are three forms of validity considered in psychometric theory.

  • First of all, a test should possess content validity -- its items should be representative samples of the domain being measured. Reading ability is not validly assessed by a test which contains only mathematics problems. Aside from considerations of reliability, there is another reason why most tests of personality traits contain many items: traits manifest themselves in many different ways, and each of these should be included in the test. The notion of content validity is part of Loevinger's (1957) substantive component of construct validity.

  • The items of a test often make intuitive sense -- that is, they appear to have something to do with the trait in question. If so, the test is said to possess face validity. This characteristic is not essential, however, and later we will see different types of tests that measure personality characteristics in a very subtle and indirect manner.

  • External validity is the most important characteristic of a personality test (or any other measurement, for that matter). Test scores are of no interest in themselves, and must relate to something else the person does. Scores on the test must be shown to relate to some criterion, such as group membership or some other attribute of the person (concurrent validity), or to his or her performance in some future situation (predictive validity). Sometimes, investigators try to establish validity by correlating scores on one trait test with scores on another test of the same trait (or a closely related one). However, the spirit of Loevinger's argument concerning the external component of construct validity, and Campbell and Fiske's (1959) concept of method variance, both argue against this practice. The test and the criterion must employ different methods of measurement.

There is a complex relationship between reliability and validity. Reliability does not ensure validity, as we have seen in the case of a ruler employed as a measure of weight. However, it does set upper limits on validity, because for logical and mathematical reasons a test cannot correlate with something else (the essence of validity) higher than it correlates with itself (the essence of reliability). A test with poor reliability cannot be very valid. Moreover, a valid test must be reliable, even if -- as is the case with measurements of emotional and motivational variables (McClelland, 1980) -- that reliability is hard to demonstrate.
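The upper limit can be stated a little more exactly. In classical test theory, the ceiling on the correlation between a test and a criterion is usually given as the square root of the product of their two reliabilities. The sketch below assumes that formula; the function name max_validity and the reliability values are ours, chosen only for illustration.

```python
from math import sqrt

def max_validity(test_reliability, criterion_reliability=1.0):
    """Ceiling on a validity coefficient under classical test theory."""
    for r in (test_reliability, criterion_reliability):
        if not 0.0 <= r <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return sqrt(test_reliability * criterion_reliability)

# Hypothetical tests measured against a perfectly reliable criterion.
ceiling_good = max_validity(0.81)  # a fairly reliable test
ceiling_poor = max_validity(0.25)  # an unreliable test
# The same good test against an imperfectly measured criterion.
ceiling_both = max_validity(0.81, 0.64)
print(ceiling_good, ceiling_poor, ceiling_both)
```

Note that the ceiling falls further when the criterion itself is measured unreliably, which is one reason why validity coefficients in personality research are often modest.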

Samples and Signs

All personality tests are based on one of two assumptions. The first is that the test items represent samples of behavior -- that is, that there is some kind of isomorphism or one-to-one correspondence between the person's performance on the test and what he or she thinks and does in the real world outside the testing situation. Here, face validity is very high. For example, a popular test of conformity (extracted from the Jackson Personality Inventory) contains the items listed in the following table, and assumes that the person's self-reports on this test reflect the way the person actually does (or would) behave when confronted with these situations. Most tests of intellectual abilities (e.g., the Wechsler Adult Intelligence Scale and the Scholastic Aptitude Test), and many tests of personality variables (e.g., the GZTS and the 16PF) make this assumption.

I am very sensitive to what people think of me.
In most situations, I usually agree with the opinion of the group.
Before making a decision, I often worry whether others will approve of it.
It makes me feel uncomfortable to be dressed differently from those around me.
I often wonder why some people get pleasure out of doing unconventional things.

For other tests, however, there is no assumed isomorphism between the items of the test and the person's actual behavior. From this point of view, test items are intrinsically interesting units of behavior which are signs of some underlying disposition. Consider the items listed in the following table, which are drawn from the Depression scale of another popular personality questionnaire, the Minnesota Multiphasic Personality Inventory (MMPI -- the original edition). On the surface, none of these items seems to relate to the trait in question. Yet research indicates that people who are depressed tend to answer them in ways that non-depressed people do not (the keying of the items is not necessarily obvious, so don't try to test yourself). It doesn't matter, from the point of view of the test, whether a person actually likes to flirt in real life. All that matters is that he or she says so on the test. Some personality questionnaires (the MMPI and many of the scales on its offspring, the California Psychological Inventory) as well as many projective techniques such as the Rorschach inkblot technique, make the sign assumption, and thus contain many subtle items (Goldberg & Slovic, 1976; Seeman, 1952).

It takes a lot of argument to convince most people of the truth.
I am neither gaining nor losing weight.
I do not have spells of hay fever or asthma.
I like to flirt.
I sweat very easily even on cool days.

Content versus Style

When questionnaire items are considered to represent samples of the subject's actual behavior and experience, it becomes a matter of some importance whether the subject's responses to the questions are influenced by factors irrelevant to the content of the scales (Cronbach, 1946, 1950). Obviously, a person's intelligence, education, and cultural background will affect how he or she understands the questionnaire items. One popular personality inventory, for example, contains the item, "I used to play 'drop the handkerchief'", to be answered True or False -- referring to a children's game which is all but lost to living memory. Another item is "I believe in the Second Coming of Christ". Cultural factors are especially important, as we shall soon see, when a test intended for use with one group of people is administered to persons with a quite different ethnic heritage.

Even when these problems have been considered, there are still a number of "nuisance variables" that affect the extent to which subjects will endorse various test items. For example, there is a well-documented acquiescence tendency -- a tendency for people to say "yes" to any item, regardless of content (Jackson & Messick, 1958). (The opposite tendency of negativism, or the tendency to say "no" to everything, is also a problem with occasional individuals.) If a personality scale consists of all positively worded items (where saying "True" would count as evidence that a person possesses the trait), a strong acquiescence tendency would lead a person to high scores no matter what actual standing he or she has on the trait dimension in question. Both acquiescence and negativism can be taken into account easily by ensuring that half the items are worded positively and half negatively, so that "yea-saying" and "nay-saying" will not artificially inflate or deflate subjects' scores.
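A small sketch shows why balanced keying works. It assumes a hypothetical ten-item True-False scale and two extreme respondents, one who answers "True" to everything and one who answers "False" to everything; the scoring function is our own illustration rather than the procedure of any particular inventory.

```python
def score_scale(answers, keyed_true):
    """Count the answers that match the keyed direction of each item."""
    if len(answers) != len(keyed_true):
        raise ValueError("one answer is required per item")
    return sum(1 for answer, key in zip(answers, keyed_true) if answer == key)

# Hypothetical keys: True means that answering "True" earns a point.
all_true_key = [True] * 10                # every item positively worded
balanced_key = [True] * 5 + [False] * 5   # half positive, half negative

yea_sayer = [True] * 10    # acquiesces to every item
nay_sayer = [False] * 10   # rejects every item

print(score_scale(yea_sayer, all_true_key))  # maximum score from style alone
print(score_scale(yea_sayer, balanced_key))  # pulled back to the midpoint
print(score_scale(nay_sayer, balanced_key))  # also at the midpoint
```

On the unbalanced scale the yea-sayer earns the highest possible score without possessing any of the trait; on the balanced scale both response styles land at the middle of the range, where they cannot masquerade as a high or low standing.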

Personality tests by their very nature often ask questions about threatening or otherwise unpleasant topics, and accordingly there is a tendency towards social desirability in responding. That is, people often tend to answer these questions in a manner that is consistent with sociocultural standards -- again, regardless of their actual behavior or experience. This is not so easily controlled as acquiescence tendency, but there are several strategies available. One control procedure advocated by Edwards (1957) involves forcing the subject to choose between two alternative items known to be equivalent in terms of social desirability. This forced-choice method was employed in the Edwards Personal Preference Schedule (ref). Another tack is simply to correlate each item with a measure of social desirability, and discard any items where the correlation is too high.

Acquiescence and social desirability are called response styles because they seem to represent ways of approaching personality questionnaire items irrespective of their content.

A related concept is response set, which refers to the subject's attempt to convey a particular impression of him- or herself as likeable, aggressive, and the like.

A number of other response styles have been identified, and these are summarized in the following table, based on crossing acquiescence (measured by the normative frequency with which an item is endorsed) and the social desirability of the item (Wiggins & Lovell, 1965).



                          Social Desirability
Endorsement Frequency     Low                        Medium            High
Low                       Deviance                   Non-conformity    Deviant Favorability
Moderate                  Unfavorability             Acquiescence      Favorability
High                      Unfavorable Communality    Communality       Hyper-Communality

Setting aside considerations of the actual content of questionnaire items:

  • A person who endorses low-frequency, low-desirability items is presenting himself as deviant with respect to conventional social norms -- that is, he's really bad.

  • Someone who endorses moderate-frequency, low-desirability items is presenting himself in a realistically unfavorable light.

  • If someone endorses high-frequency, low-desirability items, he is saying that he does the same sorts of negative things that everyone else does.

  • A person who endorses low-frequency, medium-desirability items is presenting himself as non-conforming.

  • Someone who endorses moderate-frequency, moderately desirable items is showing an acquiescent response style.

  • Someone who endorses high-frequency, moderately desirable items is saying that he's just like the average person.

  • A person who endorses low-frequency, highly desirable items is also presenting himself as deviant with respect to conventional social norms, but in a positive way -- he's better than the rest of us.

  • Someone who endorses medium-frequency, highly desirable items is trying to present himself in a realistically favorable light.

  • If someone endorses high-frequency, highly desirable items, he is saying that he's really above average -- pace Garrison Keillor, you might think of this as displaying the "Lake Wobegon Effect", after the Minnesota town "where all the children are above average".

Identification of acquiescence tendency and social desirability led to a thorough examination of the personality questionnaires in use at the time. The outcome was not encouraging. Many questionnaires were found to be so heavily contaminated by response styles that it seemed doubtful that they measured any content at all (Jackson & Messick, 1958). From one point of view, of course, acquiescence, social desirability, and the rest may be considered to be important trait dispositions in their own right (Couch & Keniston, 1960; Crowne & Marlowe, 1960). But this is a different matter from intending to assess sociability, for example, and finding out that you've measured acquiescence tendency instead.

The issue was addressed directly by Jack Block (1965) in a monograph appropriately entitled The Challenge of Response Sets (see also Rorer, 1965), and in a continuing exchange with Jackson and Messick (Bentler, Jackson, & Messick, 1971, 1972; Block, 1967, 1971; Jackson, 1967a, 1967b). Block was particularly concerned with the claim that the two major dimensions which appear in factor analyses of the Minnesota Multiphasic Personality Inventory scales, commonly identified with Eysenck's traits of extraversion-introversion and neuroticism, in fact boil down to acquiescence and social desirability, respectively (e.g., Wiggins & Lovell, 1965). To make his point, Block constructed MMPI scales that were free of acquiescence tendency (by ensuring that equal numbers of items were keyed True and False) and social desirability (by eliminating items with extreme social desirability values). A new factor analysis continued to reveal the same two dimensions. Block argued that because response styles had been eliminated, the resulting factor structure reflected the substantive content of the scales, which he labeled ego control (extraversion-introversion) and ego resiliency (neuroticism). However, while Block has shown that personality questionnaires can be decontaminated in this manner, it is the original, confounded form of the MMPI (and similar scales) that remains in use. While this does not necessarily render these instruments invalid, it does render them less than optimal for purposes of personality assessment. Most recently, a new generation of personality inventories has been developed, which are relatively free of these problems.


Utility

From a strictly scientific point of view, all personality tests must be both reliable and valid. From the point of view of practical economics, however, it is highly desirable that they be efficient as well (Mischel, 1968; Sechrest, 1963). The only justification for introducing a new method of measuring a trait (or any other aspect of personality) is that it represents an improvement in measurement over some other method that is already available, or that it is less expensive than that method. Consider the Scholastic Aptitude Test (SAT), which attempts to predict a student's academic standing after the first year of college. If the test were valid, it would certainly be cheaper to base college selection on SAT scores than to let everyone enroll and see who flunks out as freshmen. The possible economic advantage is clear, too, in the case of personality tests whose items are considered to be samples of real-world behavior and experience. It is certainly cheaper to ask people to fill out a 20-item self-report scale than to deploy a gang of observers to see how they behave in 20 different situations.

A test possesses utility to the extent that it optimizes the ratio of costs to benefits of its use. It may improve prediction over a less expensive instrument, or it may equal the predictive power of a more expensive instrument. The optimal ratio, of course, will depend on the cost to society of a mistaken judgment. A college may be willing to take the risk that some students will eventually drop out, but psychiatrists may be unwilling to take the risk that some patients will commit suicide. There are many varieties of personality tests, but the self-report questionnaire is generally touted as the most reliable, valid, and efficient means of collecting information on individual differences in personality traits.


Standardization

If we are going to use tests to compare one individual with another, or to compare one individual's responses at two different times, then it is important that the test be administered and scored in the same way for both people, or on both occasions. For that reason, the developers of psychological tests provide detailed instructions for how they are to be administered and scored.


Norms

What does a score on a test mean? One way of finding out is to compare an individual's score on a test to the scores of other people who have taken the same test. In this way, we can determine whether the person scores above or below average, and by how much. For this reason, the developers of psychological tests usually provide norms of test performance -- such as the mean score and standard deviation -- based on the administration of the test to a representative sample of the population. Sometimes, as with the WAIS, the representative sample is an actual probability or stratified sample of the population as a whole. On other occasions, the "representative sample" consists of students enrolled in the university where the test developer is on the faculty, or some other sample of convenience. The original version of the MMPI was normed on a sample of largely white, rural residents of Minnesota, while the CPI was normed on a sample of college students enrolled at the University of California, Berkeley. Most test developers caution that users should take into account the normative sample -- in an important Federal court case (Griggs vs. Duke Power Company), a firm got in trouble for comparing black and Hispanic workers to the Minnesota-based norms for the MMPI.
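As a rough illustration of how norms give meaning to a raw score, the sketch below converts a score into a standard (z) score and then into an approximate percentile rank. The normative mean of 50 and standard deviation of 10 are invented, and the percentile step assumes that scores in the normative sample are roughly normally distributed, which will not hold for every test.

```python
from statistics import NormalDist

def standard_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the normative mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("the normative standard deviation must be positive")
    return (raw - norm_mean) / norm_sd

def percentile_rank(z):
    """Approximate percentile, assuming normally distributed norms."""
    return 100 * NormalDist().cdf(z)

# Hypothetical norms and a hypothetical individual's raw score.
z = standard_score(raw=65, norm_mean=50, norm_sd=10)
pct = percentile_rank(z)
print(z, round(pct, 1))
```

The same raw score would yield a different standard score against a different normative sample, which is why the composition of that sample matters so much.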

Methods of Questionnaire Construction

On the surface, the procedure for constructing a personality questionnaire seems quite straightforward. After developing a pool of items relevant to the trait in question, the investigator devises a standard procedure for administering and scoring the test (standardization). Then, the test is administered to a representative sample of the population, in order to collect norms that will translate an individual's test score into relative standing compared to the group as a whole. Finally, he or she collects external validity evidence, in order to confirm that the test is able to predict some criterion. There are many varieties of personality questionnaires, differing chiefly in the way that the item pool is developed and refined (for fuller treatments, see Edwards, ref; Jackson, refs; and Lanyon & Goodstein, ref).

The Intuitive Method

In this procedure, the test is derived from either common sense (often called the rational method) or from some formal theory (the theoretical method). This boils down essentially to a written interview, in which face-valid questions are posed and responses are collected in the form of True-False statements or numerical self-ratings. In any case, the method involved is simple: the investigator defines the construct which interests him or her explicitly and then writes items (or chooses them from some pool already available) that seem to bear on the construct.

In principle, these items should be refined according to some statistical criteria, based on the results of normative testing. For example, for a test k items in length, good rules of thumb are:

M = k/2;

M - 2SD > 0; and

M + 2SD < k,

where

M = the mean score on the test, derived from a representative sample of the population;

SD = the standard deviation of those scores.

With plenty of room above and below the mean, the test will be sensitive to individual differences along the entire trait dimension. Good test-construction practice also calls for some assessment of internal consistency, eliminating any items that seem redundant (i.e., show high interitem correlations) or irrelevant to the construct (i.e., low item-to-total correlations), and checking the internal consistency of the refined scale on a new normative sample.
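The rules of thumb given above for the mean and standard deviation are easy to check once normative data are in hand. The following sketch applies them to two invented sets of total scores on a hypothetical 20-item scale. Because an observed mean will rarely equal exactly half the number of items, the check allows a tolerance of 10% of the scale length; that tolerance is our own assumption, not part of the rule.

```python
from statistics import mean, pstdev

def check_score_distribution(total_scores, k, tolerance=0.10):
    """Apply the three rules of thumb to total scores on a k-item test."""
    m = mean(total_scores)
    sd = pstdev(total_scores)
    return {
        "mean_near_midpoint": abs(m - k / 2) <= tolerance * k,
        "room_below": m - 2 * sd > 0,
        "room_above": m + 2 * sd < k,
    }

# Hypothetical normative samples for a 20-item scale.
well_centered = [10, 8, 12, 9, 11, 13, 7, 10, 10, 10]
ceiling_bound = [18, 19, 20, 17, 20, 19, 18, 20, 19, 20]

print(check_score_distribution(well_centered, k=20))
print(check_score_distribution(ceiling_bound, k=20))
```

The second sample fails two of the three checks: nearly everyone scores near the top of the scale, so the test cannot discriminate among people who are high on the trait.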

The intuitive method was employed in the construction of the very first personality questionnaire, the Personal Data Sheet devised by Woodworth (ref) for the psychiatric screening of army recruits during World War I, as well as the Study of Values (Allport, Vernon, & Lindzey, ref) and the Edwards Personal Preference Schedule (Edwards, 1959). Many of the "projective" personality tests such as the Rorschach and the Thematic Apperception Test also have been devised by intuitive methods. Because the method is so convenient, it has also been employed by many other investigators who needed a personality scale for their research. In fact, we venture to guess that the intuitive method accounts for the vast majority of all personality questionnaires ever published. Unfortunately, most of these investigators never seem to have advanced from the stage of generating lists of plausible test items to the stage of statistical refinement.

For details on the use of the Personal Data Sheet, see "The First Personality Test Was Developed During World War I" by Lila Thulin, published on the website of Smithsonian Magazine, 09/23/2019.

The chief problem with even refined intuitive questionnaires, of course, is the absence of any empirical evidence for the validity of their inventors' intuitions. There is no reason to believe that the traits selected as important by one investigator will match those deemed by another to be crucial to understanding personality; or that there will be agreement about the degree to which the items included in the test represent the trait in question. Moreover, rarely is there any attempt to validate the questionnaire against some external criterion of the traits in question. The questionnaire may be thoughtfully constructed, and its rationale may appeal to common sense or consensus among those investigators who share a particular point of view, but in the final analysis intuition is no substitute for empirical evidence.

The Factor-Analytic Method

Instead of relying on intuition in constructing personality questionnaires, many investigators have preferred to let nature itself dictate the dimensions to be measured, and the items which will appear on the instruments. This is accomplished by means of factor analysis and other related multivariate techniques discussed in the chapter on "Types and Traits". These procedures, of course, represent extensions of the sort of "statistical refinement" applied to questionnaires constructed by means of the intuitive method. By letting nature speak for itself (Cattell's words again), test-constructors of the factor-analytic school hoped to solve the problem of the apparently arbitrary selection of traits and scale items favored by the intuitive school.

The factor-analytic test-developer begins by amassing a wide-ranging variety of questionnaire items. Some of these items will be original with the investigator. However, in order to ensure comprehensive coverage of the many domains of human behavior and experience, he or she may "borrow" items from already-available questionnaires devised by others. These items are assembled into a single long questionnaire and administered to a large group of subjects (technically, there must be several times as many subjects as items), and the item intercorrelations are submitted to factor analysis.

Typically, such a procedure yields a number of different factors, indicating how many individual-difference dimensions are represented in the questionnaire. Each factor is given a name after considering the items that show significant factor loadings (e.g., item-to-factor correlations of at least .40), and these items are then used to constitute a refined scale measuring the particular trait dimension revealed by the factor analysis. This method of construction ensures that each scale of the questionnaire will have high internal consistency.
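The item-selection step can be sketched as follows. The example does not perform a factor analysis; it assumes that the loadings have already been obtained elsewhere, and the loadings shown are invented. The .40 cutoff is the example value given above, and we treat a large negative loading as qualifying too, on the assumption that such an item would simply be keyed in the reverse direction.

```python
def assign_items_to_factors(loadings, cutoff=0.40):
    """Group items by the factors on which they load at or above cutoff."""
    scales = {}
    for item, row in loadings.items():
        for factor, loading in enumerate(row):
            if abs(loading) >= cutoff:
                scales.setdefault(factor, []).append(item)
    return scales

# Hypothetical loadings of five items on two factors.
loadings = {
    "item_1": [0.72, 0.10],
    "item_2": [0.55, -0.05],
    "item_3": [0.08, 0.61],
    "item_4": [0.12, -0.48],  # reverse-keyed on the second factor
    "item_5": [0.30, 0.25],   # loads on neither factor and is dropped
}

scales = assign_items_to_factors(loadings)
print(scales)
```

An item that exceeded the cutoff on more than one factor would be assigned to both scales by this sketch; a real test constructor would have to decide whether to keep or discard such items.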

The factor-analytic approach has been used to produce many popular personality questionnaires, including Cattell's 16 Personality Factor Questionnaire, the Guilford-Zimmerman Temperament Survey, and the Eysenck Personality Inventory. All of these questionnaires may be called personality inventories because they attempt to provide a comprehensive survey of individual differences in personality, and to permit each individual to be located as a point in a multidimensional space representing these trait dimensions. In each case, the underlying assumption is that factor analysis can be relied upon to reveal the structure of personality. As we saw earlier, however, this assumption cannot be taken for granted. Each investigator's factor analysis reveals a somewhat different structure, and there are no grounds (other than taste) for choosing between them. The chief problem with the intuitive method, moreover, remains unsolved. There is little evidence that the traits measured by the questionnaires relate to any external criterion.

There is, of course, some evidence that the questionnaire factors predict something, as in Cattell's use of the 16PF scales in his specification equation. But finding external correlates after the scale has been developed and published is quite different from introducing empirical considerations into the construction process itself. Eysenck, it should be noted, is a partial exception to this generalization. In developing questionnaire measures of introversion-extraversion, neuroticism, and psychoticism he has often employed a method called criterion analysis. In this variant of the conventional factor-analytic technique, external (non-questionnaire) criterion variables are also entered into the correlation matrix, and the factor analysis is adjusted so that the emerging factors line up with the criteria. This practice ensures that the factors extracted from the questionnaire will relate to something besides a paper-and-pencil test.

The Empirical Method

In an attempt to put personality assessment on a solid empirical basis, some investigators have insisted that the sole standard for including an item on a test should be its relation to some external criterion. The theoretical foundation for this method was laid by Meehl (1945), who argued that test items themselves were meaningless, and gained meaning only by virtue of their association with other variables (Meehl later retreated from this position after he and Cronbach introduced the notion of construct validity). From this point of view, there is no assumption that the person's responses to test items are accurate reflections of his or her actual behavior. The responses themselves are "interesting and significant bits of behavior" (Meehl, 1945, p. XXX). Clearly in this case test items are considered to be signs rather than samples. Nor is there any concern with the face validity of the items on the resulting scales. The relation between the test item and the underlying disposition may be obvious or it may be quite subtle. This point was made most forcefully by Berg (1955, 1967) in his "deviation hypothesis", which held that even preferences for random drawings could be used as personality scale items, so long as people with different dispositions expressed different preferences.

Empirically derived questionnaires are constructed by the method of contrasting groups. The investigator begins, as in the factor-analytic method, by assembling a large and heterogeneous group of items. These items are then administered to two groups of people: a criterion group, whose members possess the target attribute, and a comparison group, whose members do not. For example, in the construction of a depression scale, the criterion group might be composed of hospitalized psychiatric patients carrying a diagnosis of depression; the comparison group might be nonhospitalized individuals drawn from the same community. The rate at which individuals in each group respond "yes" is calculated, and those items with the largest significant differences in endorsement rate are retained as scale items. Checks for internal consistency are also made, and highly redundant items are eliminated in order to achieve economy. Of course, some group differences are likely to occur merely by chance. Accordingly, proper practice requires that the entire scale-derivation procedure be repeated in a new sample -- a technique called cross-validation. Only items which show significant differences in both samples go into the final version of the scale. In an extension known as double cross-validation, the subjects are split in half, and items are selected independently in each half: only those items that significantly differentiate the two groups in both halves are retained. Norms for the scales are then calculated based on the responses of a representative sample of the population.
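A bare-bones sketch of the contrasting-groups procedure, with cross-validation, might look like the following. The text does not specify which significance test is used, so we assume a simple two-proportion z test with a criterion of 1.96; the item names and endorsement counts are invented, with 50 people in each group in each sample.

```python
from math import sqrt

def endorsement_z(criterion_yes, criterion_n, comparison_yes, comparison_n):
    """Two-proportion z statistic for a difference in endorsement rates."""
    p_criterion = criterion_yes / criterion_n
    p_comparison = comparison_yes / comparison_n
    pooled = (criterion_yes + comparison_yes) / (criterion_n + comparison_n)
    se = sqrt(pooled * (1 - pooled) * (1 / criterion_n + 1 / comparison_n))
    if se == 0:
        return 0.0  # everyone (or no one) said "yes": no discrimination
    return (p_criterion - p_comparison) / se

def select_items(counts, criterion_n, comparison_n, z_crit=1.96):
    """Return the items whose endorsement rates differ significantly."""
    return {
        item
        for item, (crit_yes, comp_yes) in counts.items()
        if abs(endorsement_z(crit_yes, criterion_n, comp_yes, comparison_n)) >= z_crit
    }

# Hypothetical "yes" counts: item -> (criterion group, comparison group).
derivation = {"item_1": (40, 15), "item_2": (30, 28), "item_3": (35, 20)}
replication = {"item_1": (38, 17), "item_2": (26, 29), "item_3": (27, 24)}

first_pass = select_items(derivation, 50, 50)
second_pass = select_items(replication, 50, 50)
final_scale = first_pass & second_pass  # keep items significant in both
print(sorted(first_pass), sorted(final_scale))
```

In these made-up data, item_3 separates the groups in the derivation sample but not in the replication sample, which is just the sort of chance difference that cross-validation is meant to catch. A fuller version would also confirm that the difference runs in the same direction in both samples.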

Examples of personality inventories constructed by the empirical method include the Strong Vocational Interest Blank commonly employed in career counseling (ref); the Minnesota Multiphasic Personality Inventory employed in clinical settings (Dahlstrom, Welsh, & Dahlstrom, 1972, 1975; Hathaway & McKinley, 1940); and the California Psychological Inventory, designed for the assessment of normal personality (Gough, 1964). These three questionnaires are among the most frequently used instruments in both applied and research settings (Sundberg, 1961). The latest edition of the Mental Measurements Year Book (Buros, ref), an authoritative guide to standardized psychological tests, lists a total of XXXXX papers published in the professional literature reporting research employing these three inventories.

Note: Since this chapter was originally drafted, the MMPI has been extensively revised, employing a different, more theoretically-based method of scale construction.

The introduction of the empirical method of test construction represented an important advance in personality measurement. It restated the purpose of assessment: to use test scores to predict something about the person's behavior and experience in the outside world. However, because there is no concern with face validity and little interest in the internal structure of empirical tests, the method has problems of its own. Despite the argument by Meehl (1945) and Berg (ref) that people who score highly on empirically derived scales may be assumed to be similar to the criterion group employed in constructing the scale (or, put another way, that "birds of a feather flock together"), many psychologists are uncomfortable with the frequently vague relationships between the content of test items and the disposition ostensibly measured by them. In addition, it turns out that individual items often contribute to scores on more than one scale in the inventory. Therefore, any correlation obtained between these scales is ambiguous as it could reflect either a substantive relationship or an artifact of item overlap. Moreover, even if each of the items is related to the criterion, they may not be related to each other -- thus internal consistency may be lacking.

The biggest problem with the empirical method has to do with the representativeness of the criterion and comparison groups used to select scale items. The MMPI, for example, was standardized in 1935 on an adult sample drawn from the largely rural-agricultural, northern European, Protestant population of the American Midwest. And the "Masculinity-Femininity" scale of the MMPI was derived by comparing the responses of homosexual and heterosexual men, "masculine" and "feminine" men, and, finally, just men and women airline cabin attendants. (It was that hard, apparently, to find items on which men and women responded differently.) It is not at all clear that the responses of these individuals should be used as the standard for evaluating those of other kinds of people. The cultural differences between 1935 and the present, farmers and factory workers, Midwesterners and southerners, Lutherans and Jews may well lead to erroneous conclusions. This is clearly the case for blacks (refs) and adolescents (refs), who often appear disturbed on the MMPI even though they show no evidence of maladaptive behavior in real life. In this case, these individuals are simply penalized for not sharing some of the attitudes and experiences of the group on whom the test was originally standardized. Empirically derived tests are only valid for the kinds of people on whom their norms are based.

The Methods Compared

In the final analysis, of course, comparison of the different methods of test construction boils down to an empirical question: which kind of test permits the best prediction of some criterion behavior? This question was addressed by Hase and Goldberg (1967). Beginning with the items of the CPI, they proceeded to construct six sets of 11 scales for use as predictor variables.

  • Empirical: the 11 standard CPI scales constructed by the method of contrasting groups.

  • Factor-analytic: A total of 11 orthogonal factors were extracted from the CPI item-intercorrelation matrix, and corresponding scales were constructed from the pattern of factor loadings.

  • Intuitive-rational: The four standard CPI scales constructed by the rational method were supplemented by seven new scales, intended to parallel some of the 11 CPI empirical scales.

  • Intuitive-theoretical: Items were selected from the CPI to correspond to 11 of the "human motives" discussed in Murray's theory of personality.

  • Stylistic: The nine content-free CPI scales developed by Lovell (1964) to measure such tendencies as acquiescence and negativism, supplemented by two new scales measuring communality and social desirability.

  • Random: The CPI items were randomly assigned to one of 11 scales, rendering them completely meaningless.

Hase and Goldberg also employed a number of different criterion variables in their prediction task.

  • Social Conformity: whether the subject joined a sorority (all the subjects were college women), and peer ratings of her tendency to yield to group pressure.

  • Peer Ratings of dominance, sociability, social responsibility, femininity, and psychological-mindedness.

  • Academic Achievement, as measured by grade-point average.

  • Academic Interest: whether the student chose to major in the liberal arts or some applied field, and whether the student dropped out of school at the end of the first year.

The method used by Hase and Goldberg to assess the validity of the various kinds of scales in predicting the several criteria is called double cross-validation. They divided their sample in half, and correlated each of the predictor scales with each of the criteria. Then, for each of the six types of scales they used a correlational technique called multiple regression to construct an equation representing that combination of the 11 scales which best predicted each criterion. The multiple regression equations were derived separately in each half of the sample, and each was cross-validated on the other half.
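
The logic of double cross-validation can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions, not a reconstruction of Hase and Goldberg's analysis: the function names are hypothetical, the sample is simply split at its midpoint, and the regression weights are obtained by ordinary least squares.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_weights(X, y):
    """Least-squares regression weights (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(weights, X):
    """Apply regression weights to each row of predictor scores."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in X]

def double_cross_validation(X, y):
    """Return (mean derivation R, mean cross-validation R) over both half-samples."""
    half = len(X) // 2
    halves = [(X[:half], y[:half]), (X[half:], y[half:])]
    derivation, cross = [], []
    for (Xd, yd), (Xc, yc) in (halves, halves[::-1]):
        w = fit_weights(Xd, yd)                            # derive weights in one half
        derivation.append(pearson(predict(w, Xd), yd))     # fit in the same half
        cross.append(pearson(predict(w, Xc), yc))          # test in the other half
    return sum(derivation) / 2, sum(cross) / 2
```

Because the weights are tuned to one half-sample, the derivation R capitalizes on chance; the cross-validation R, computed in the half that played no part in fitting, is the more honest estimate of predictive validity.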

The table shows the average derivation and cross-validation multiple correlations (averaging over the 10 criteria and two half-samples). Somewhat surprisingly, the stylistic and random scales are able to predict the criteria, as indicated by statistically significant multiple correlations in the derivation column. However, these multiple-regression equations don't hold up on cross-validation (this should be a warning to those test-developers, or anyone else working with correlations, who point proudly to significant validity coefficients but neglect to confirm them through cross-validation). More important, there are no differences in the predictive validity of the four types of substantive scales. (There is some shrinkage in validity from derivation to cross-validation for these scales as well as for the stylistic and random scales, but the validity coefficients remain statistically significant.) The conclusion is that none of the methods discussed so far is truly superior to the others, from the point of view of external validity. From the point of view of utility, of course, the rational scales were the easiest to construct in the first place.

Scale Type          Derivation R     Cross-Validation R
                    .48              .26
                    .45              .26
                    .45              .26
                    .48              .27
                    .39              .15
                    .36              .10

The Deviation Hypothesis

As noted earlier, a radically empirical approach to personality test construction takes absolutely no account of the content of test items. All that is needed is for the items to discriminate between two comparison groups. For this reason, empirically derived personality scales typically contain a mix of face-valid items which are obviously related to the substantive domain under consideration, and other, "subtle" items, which do not appear to relate to the domain. The assumption is that obvious and subtle items are equally good predictors of criterion behavior. However, Duff (1965) found that obvious MMPI items performed better than subtle items in discriminating between psychiatric patients in different diagnostic categories. This may account for Hase and Goldberg's failure to find empirically derived personality scales to be superior to other types. The low validity of the subset of subtle items pulls down the validity for the scale as a whole.

A direct comparison of subtle and obvious items was performed by Goldberg and Slovic (1967). College women were classified on six criteria: grade-point average, academic achievement, sorority joining, social conformity, rated sociability, and rated dominance. On each criterion, the subjects were separated into high- and low-scoring groups. From a large pool of available items, four sets of six scales were constructed to serve as predictors.

  • Verbal, high face valid: Questionnaire items judged highly relevant to the criterion.

  • Verbal, medium face valid: Questionnaire items judged moderately relevant to the criterion.

  • Verbal, low face valid: Questionnaire items judged irrelevant to the criterion.

  • Nonverbal: Berg's (1961) Perceptual Reaction Test, consisting of sets of simple geometrical figures, for which the subject is required to state a preference.

Following the principles of empirical item validation, questions that differentiated between high- and low-scoring groups on a particular criterion were included in the corresponding prediction scales. Thus all of the scales were empirically valid, but differed in face validity, obviousness, or subtlety. Employing the strategy of double cross-validation, prediction scales derived from each half of the sample were tested in the other half. The table gives the average derivation and cross-validation validity coefficients for each of the four types of items. As in the study by Hase and Goldberg (1967), all the derivation validities were moderately high and roughly equal. While this would seem to support the deviation hypothesis, the extent of shrinkage was not the same for all types of scales. Only the scales with high face validity showed significant validity coefficients at cross-validation; validity for the subtle scales shrank to zero.

Average Validity Coefficient

Item Type           Derivation     Cross-Validation
Verbal - High       .62            .24
Verbal - Medium     .61            .01
Verbal - Low        .59            .07
Nonverbal           .56            .02

Scales consisting entirely of subtle items have been advocated in some corners on the ground that subjects who approach personality questionnaires with a defensive attitude will not see through them, and thus be tricked into self-disclosure. However, it appears that personality tests are disguised in this way at the expense of validity. On the basis of the evidence so far, the most valid questionnaire items appear to be those which are chosen on the basis of both rational and empirical considerations.

The Sequential Method: State of the Art? (as of 1994)

The most recent development in personality testing has been the fusion of the best features of the intuitive, factor-analytic, and empirical strategies in test construction. The theoretical rationale for this approach has been set out by Jackson (1970, 1971), and is based on the principles outlined earlier for establishing construct validity.

Essentially, Jackson calls for a substantial injection of theory into the process of test development. Like those who advocate the intuitive method, he insists that we must possess an explicit definition of a construct before we attempt to measure it. It is not enough to blindly conduct factor analyses or compare the item-endorsement rates of various criterion groups. Such a priori definitions are rarely found in the factor-analytic and empirical methods. Again like advocates of the rational method, Jackson argues that theoretical considerations should enter into the selection of test items and the development of scoring schemes. Items should be chosen in order to adequately sample the entire range of the construct as theoretically defined. Face validity is important: if the relationship between test item and theoretical construct is subtle rather than obvious, the content of the item must still be theoretically relevant. The theory must also determine how test items are to be combined into a measure of the construct, whether that combination is linear, class, dynamic, or something else. Test development must also consider such issues as the contamination of content by response styles, convergent and discriminant validity, and method variance.

Steps in test construction. Jackson calls his method "sequential" because it involves a number of steps, each designed to lead to a progressively more refined questionnaire. The process begins by selecting dimensions on the basis of either theoretical or practical considerations. Obviously, the purpose to which the questionnaire will be put -- vocational counseling, psychiatric screening, or research on normal personality, for example -- will dictate which dimensions will be selected. After each pole (positive and negative, or high and low) of the trait dimension has been carefully defined, a team of consultants is engaged to generate items which will form a provisional pool for the scale (Jackson proposes that item-writing continue until the investigator has 5-10 times the questions that will be actually needed). These items are then edited for relevance, redundancy, and clarity of expression. If the test-developer feels that certain aspects of the construct are underrepresented in this provisional pool, new items are written to fill the gaps and enhance content validity. Half the items are worded positively and half negatively, in order to eliminate the effects of acquiescence and negativism.

The provisional item pool is then administered to a group of subjects, along with a scale measuring social desirability. Items with extreme endorsement and rejection rates are eliminated on the grounds that they are not very sensitive to individual differences. A Differential Reliability Index (DRI) is calculated for each item, based on its correlations with both the corrected total scale score and the social desirability scale. Items where the correlation with social desirability is higher than the item-to-total correlation are eliminated from the pool. An Item Efficiency Index (IEI) is also calculated for the items in the reduced pool, based on the correlations between each item and each of the scales represented on the questionnaire. Items showing higher correlations with scales other than the one they are intended for are transferred to those scales or eliminated from the pool. Next, pairs of items are matched for DRI and IEI, and then divided between alternate forms of the inventory. Finally, a factor analysis of the items is carried out. Given the laborious process of generating and selecting items, the factor structure of the inventory should parallel the dimensions selected at the outset, and the relationships obtained between the factors should be those expected by the underlying theory as well. All of these analyses, which are intended to refine the item pool, parallel those employed in the intuitive and factor-analytic methods.
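
The desirability screen in this step can be illustrated with a small Python sketch. The text above says only that the Differential Reliability Index is based on two correlations; the specific formula used here (the square root of the squared item-total correlation minus the squared item-desirability correlation) is the form usually attributed to Jackson and should be treated as an assumption, as should the function names and the use of Pearson correlations.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def differential_reliability(item, total, desirability):
    """Return (index, keep) for one item.

    item         -- each subject's score on the item
    total        -- each subject's total score on the item's own scale
    desirability -- each subject's score on the social desirability scale
    """
    # "Corrected" total: remove the item's own contribution to its scale score.
    corrected_total = [t - i for t, i in zip(total, item)]
    r_total = pearson(item, corrected_total)
    r_desirability = pearson(item, desirability)
    # Drop the item if it tracks desirability at least as strongly as content.
    keep = abs(r_total) > abs(r_desirability)
    index = sqrt(r_total ** 2 - r_desirability ** 2) if keep else 0.0
    return index, keep
```

An item with a high index relates to its own scale largely independently of social desirability; an item that fails the comparison is a candidate for removal from the pool.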

The final step in the procedure involves getting empirical evidence on the external validity of the questionnaire scales, through an analysis of convergent and discriminant validity using the multitrait-multimethod matrix. The refined questionnaire is administered to a new sample of subjects. For external criteria, Jackson typically relies on self- and peer-ratings of the subjects on the same trait dimensions ostensibly measured by the scales on the questionnaire. Following the arguments of Campbell and Fiske (1959), there should be no significant correlation between measurements that share neither trait nor method variance; correlations between various measures of the same trait should exceed correlations among different traits measured by a single method; and the patterns of correlation among the traits should be comparable, regardless of the method employed. If these conditions do not obtain, it is back to the drawing board. If they do, the empirical analysis of the questionnaire is completed, and the scale is ready for use.
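
One of the Campbell and Fiske conditions, that measures of the same trait by different methods should correlate more highly than measures of different traits by the same method, lends itself to a simple check. The Python sketch below is a hypothetical illustration: the function name, the dictionary layout, and the example correlations are all invented for the purpose, and the other conditions described above are not tested.

```python
def validity_exceeds_monomethod(correlations):
    """Check one convergent-discriminant condition.

    correlations maps ((trait_a, method_a), (trait_b, method_b)) -> r.
    Returns True when every same-trait, different-method correlation exceeds
    every different-trait, same-method correlation.
    """
    same_trait, same_method = [], []
    for ((trait_a, method_a), (trait_b, method_b)), r in correlations.items():
        if trait_a == trait_b and method_a != method_b:
            same_trait.append(r)      # convergent validity coefficients
        elif trait_a != trait_b and method_a == method_b:
            same_method.append(r)     # different traits sharing one method
    return min(same_trait) > max(same_method)

# Invented example: two traits, each measured by questionnaire and peer rating.
example = {
    (("dominance", "questionnaire"), ("dominance", "peer")): 0.50,
    (("sociability", "questionnaire"), ("sociability", "peer")): 0.45,
    (("dominance", "questionnaire"), ("sociability", "questionnaire")): 0.30,
    (("dominance", "peer"), ("sociability", "peer")): 0.20,
    (("dominance", "questionnaire"), ("sociability", "peer")): 0.10,
}
passes = validity_exceeds_monomethod(example)
```

A fuller analysis would also test each correlation for significance and compare the pattern of trait intercorrelations across methods.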

The sequential method clearly involves a great deal of work, but in the past decade or so Jackson and his colleagues have employed it to produce a series of personality inventories: the Personality Research Form (PRF; measuring 15 of Murray's "human motives"); the Jackson Personality Inventory (JPI; assessing a miscellany of traits that have received substantial attention in the research literature); the Jackson Vocational Interest Survey (JVIS; for career counseling); and the Psychological Differentiation Inventory (PDI; for use with psychiatric patients). This generation of questionnaires, with items that are faithful to some underlying theory, scales that are highly saturated with content and relatively free of response style bias, and proven external validity, seems to represent the "state of the art" in the construction of personality tests.

The Methods Compared: Reprise

The next question, of course, concerns the practical utility of such instruments. Is the extra expense and effort involved in the sequential method justified by a corresponding increase in the predictive power of the tests? Ashton and Goldberg (1972) were the first to provide an answer. Following the procedure employed by Hase and Goldberg (1967), these investigators employed six sets of three scales (measuring sociability, achievement, and dominance) to predict five criterion variables measured by peer ratings: sociability, achievement, dominance, how well the subject was known, and how well she was liked (again, all the subjects were college women). The six sets of scales represented different methods of test construction:

  • Empirical: the sociability, achievement, and dominance scales of the CPI, which were originally derived by the method of contrasting groups;

  • Rational-CPI: scales of the same name, as constructed from the CPI item pool by Hase and Goldberg (1967);

  • Theoretical: Similar scales, also constructed from the CPI by Hase and Goldberg (1967);

  • Sequential: Similar scales, taken from the PRF;

  • Rational-Psychologists: Scales consisting of items written by psychology graduate students who had been given the definition of the trait constructs in question, with no statistical refinement of the items;

  • Rational-Nonpsychologists: Another set of unrefined intuitive scales, consisting of items contributed by writers with no formal training in psychology;

  • Self-Rankings: The subjects were asked to rate themselves on numerical scales of sociability, achievement, and dominance.

For each type of questionnaire, the three scales were entered into multiple-regression equations predicting each of the five criteria. The table presents the validity coefficients, averaged across the five criteria (there were no cross-validation coefficients, because none of the scales were derived empirically). As can be seen, the scales contributed by the nonpsychologists did the most poorly, and the sequential method showed a small increase in predictive power over the empirical, theoretical, rational-CPI, and rational-psychologists questionnaires.

Scale Type                       Validation R
Intuitive - Psychologists
Intuitive - Nonpsychologists

In a later study, Jackson (1975) obtained comparable results comparing three measures of social participation, tolerance of others, and self-esteem. These scales were entered into multiple-regression equations predicting self- and peer-ratings on the same dimensions, with the results depicted in the following table.

Scale Type           Validation R
                     .44     .45
CPI (Empirical)      .31     .09
JPI (Sequential)     .51     .29

In both studies, the various types of scales were all valid, in the sense that they significantly related to the criteria. Thus they may be appropriately used in personality research. The investigator need only select the type s/he wishes to employ. Because it lacks the item-refinement and convergent-discriminant validation steps of the sequential method, the intuitive method is cheaper in terms of time, money, and effort expended. In both studies, the sequential method showed only a small advantage over less costly techniques. Note, however, that the largest validity coefficients in the Ashton and Goldberg (1972) study were shown by the subjects' self-rankings on the traits in question. Clearly this is the cheapest assessment of personality available. Therefore, we conclude that while the sequential method does indeed improve the validity of assessment to some degree, it lacks utility when compared with cheaper methods. Intelligent item-writers do very well at constructing personality questionnaires on their own, without need of internal or external checks on their work. No method of questionnaire construction, however, is superior to the subjects' self-rankings (for another demonstration of this, see Mischel, 1969). If we want to know what subjects can tell us about themselves, it is not necessary to employ a highly developed technology of personality assessment. We need only ask them.

The Genetic Basis of Individual Differences in Personality

Often lurking behind the concept of a personality trait is the notion that these dimensions of personality are genetically determined, part of our hereditary makeup just like eye color and skin pigmentation. As we have noted earlier, this is not a necessary element in the doctrine of traits. Allport (1937) asserted that traits could reflect either nature, or nurture, or both. However, some theorists, like Eysenck (1976), have gone so far as to assert that personality traits must be inherited.

Informal observations of similarities between parents and their children, among siblings, and the appearance of substantial individual differences in temperament (e.g., activity level, crying, etc.) among infants shortly after birth (ref) lend some support to the proposition that certain features of personality may be genetically determined. This problem may be pursued more rigorously by a number of different methods, all of which assess the degree of similarity (concordance) among members of a family in terms of some attribute (Rosenthal, 1970).

There are two kinds of twins: monozygotic (MZ), who are formed from a single fertilized ovum; and dizygotic (DZ), who are formed from two separate ova. Monozygotic twins have exactly the same genetic endowment, so that they are of the same sex and usually present striking physical resemblances (they are "identical" twins). Dizygotic ("fraternal") twins are no more alike than any two other siblings from a genetic point of view: on average they have only 50% of their genes in common; they may or may not be of the same sex, and they show no more physical resemblance than any other pair of siblings. The core of the twin study method is to compare the concordance (usually measured by the correlation between siblings) of MZ and DZ twins on some attribute. MZ twins share both genetic endowment and social environment, whereas DZ twins share their environment but, on average, only half of their genes. If the degree of MZ concordance is statistically significant, and also significantly greater than the degree of DZ concordance, this is evidence favoring a genetic contribution to the trait in question.

This evidence may be supplemented by other comparisons. Occasionally, usually because they are born into a family which lacks the economic resources to feed two new mouths, twins are separated at birth and raised in different households. To the extent that their home environments differ, identical twins reared apart (that is, sharing genes but not environment) provide an especially strong test of the genetic hypothesis. In addition, it is possible that by virtue of their physical resemblance, similarity in age, and other factors, twins are treated more alike by their parents and other significant people, compared to nontwins (and that this is especially the case for MZ twins). Any increased concordance in twins, then, might be a product of this increase in environmental similarity. For this reason, pairs of nontwin siblings (having 50% of their genes in common, on average, like DZ twins) and nonsiblings (having no genes in common) are often used to assess the impact of shared environments. In general, the hypothesis of a genetic contribution to personality would be supported by evidence of progressively increasing concordance, as follows:

Nonsiblings (same sex) < Nontwin Siblings (same sex) < DZ Twins (same sex) < MZ Twins, reared apart < MZ Twins, reared together

Even such a pattern of results would not be unambiguous with respect to a genetic contribution to personality, however. For example, comparisons between twins (MZ or DZ) and nontwin siblings are confounded with age. Twins are obviously the same age, while other brothers and sisters are obviously different in age. In the same way, MZ twins are necessarily of the same gender, while DZ twins need not be. Therefore, any variable that is correlated with age will show higher twin than nontwin concordance, and any variable that is correlated with sex will show higher MZ than DZ concordance. For this reason, genetic studies should compare MZ twins only with DZ twins of the same sex.

A highly touted strategy in these studies is to compare MZ twins who have been reared together with those who have been reared apart. Theoretically, twins reared apart share the same genetic endowment but have different environments; therefore, it is argued, if MZ twins reared apart are more alike than DZ twins, then there is a definite genetic contribution to the individual difference variable being examined. In fact, however, separate rearing is more apparent than real. Most twins who have been separated actually have been reared in quite similar environments. Most were separated for economic reasons, and were brought up by aunts and uncles, grandparents, or neighbors (Kamin, 1974). No matter where they slept and ate, then, they shared very much the same environment.

It has also been noted that MZ twins may be treated more alike than DZ twins, because of their similar physical appearance, and that therefore their environments are more alike as well. Identical twins are certainly often dressed alike, and they are often confused with each other by friends, teachers, and even their parents. The last word on differential treatment of MZ and DZ twins has not been written yet. But to the extent that MZ twins are actually more alike environmentally than DZs, any increased concordance between them on some personality characteristic could be due to either nature or nurture.

In studies of genetic factors in individual differences, it is common to compute a heritability ratio representing the proportion of total variance on a trait that is attributable to genetic causes. However, there are two important limitations to such a statistic. First, it assumes that the factors causing individual differences are either genetic or environmental in nature. Thus, it is commonly stated that the heritability of IQ is .80 (Jensen, 1969), meaning that 80% of individual differences in intelligence are attributable to genetic factors and the remaining 20% to environmental factors. This estimate, by the way, is probably grossly overstated. A better figure is .30 or 30% (Jencks, ref) and maybe even .00 (Kamin, 1974). Be that as it may, if the interaction of nature and nurture is taken into account the estimate of true heritability might drop to 10% genetic factors, rise to 90%, or be any other figure at all.
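
For readers who want to see what such a ratio looks like in practice, one widely used approximation is Falconer's formula, which estimates heritability as twice the difference between the MZ and DZ correlations. The Python sketch below assumes that formula; the text above does not say which estimator the cited authors used, and the function name is hypothetical.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's approximation: twice the MZ-DZ difference in twin correlations.

    Assumes purely additive genetic effects and equally similar environments
    for MZ and DZ pairs; both assumptions are open to challenge.
    """
    return 2 * (r_mz - r_dz)

# Hypothetical twin correlations, for illustration only.
estimate = falconer_heritability(0.50, 0.30)
```

Nothing in the arithmetic keeps the result between 0 and 1: with an MZ correlation of .42 and a DZ correlation of -.17, the formula yields 1.18, an impossible "proportion of variance". That is one practical sign of how fragile such estimates are.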

Second, heritability ratios only have meaning within a particular environment. If the environment changes, heritability must change as well. Heritability is not a property of an attribute, but it reflects the attribute in its environmental context. To the extent that environment is the same for everyone, heritability will be relatively high; to the extent that environments differ, heritability will be relatively low. For this reason, it seems most sensible not to calculate heritability ratios as numbers to be engraved in stone. It would seem better simply to compare MZ and DZ concordance. If the first is significantly larger than the second, that finding is prima facie (though not absolutely compelling) evidence that there is some genetic contribution to the trait. Just how much cannot be determined with any precision.
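
A comparison of MZ and DZ concordance can be carried out with the standard Fisher r-to-z test for two independent correlations. The Python sketch below is one way to do it; the function name and the choice of a one-tailed test (MZ greater than DZ) are assumptions made for the example, and the sample sizes are the numbers of twin pairs.

```python
from math import atanh, erfc, sqrt

def compare_twin_correlations(r_mz, n_mz, r_dz, n_dz):
    """Test whether the MZ correlation exceeds the DZ correlation.

    Uses Fisher's r-to-z transformation for two independent samples.
    Returns (z, one-tailed p) for the hypothesis r_mz > r_dz.
    """
    z_diff = atanh(r_mz) - atanh(r_dz)                  # difference in Fisher z units
    std_error = sqrt(1 / (n_mz - 3) + 1 / (n_dz - 3))   # standard error of that difference
    z = z_diff / std_error
    p_one_tailed = 0.5 * erfc(z / sqrt(2))              # upper-tail normal probability
    return z, p_one_tailed
```

A significant result says only that MZ pairs are more alike than DZ pairs; for the reasons given above, it does not say how much of the difference is genetic.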

With these considerations in mind, let us examine the available evidence.


One of the earliest twin studies of personality was conducted by Shields (1962), who was able to test 44 MZ twins who were reared apart. Most had been separated early in life, and were not reunited until relatively late in life. His concordance rates for various physical characteristics, IQ, and personality traits (the last measured by one of Eysenck's questionnaires) are presented in the following table, along with corresponding rates for MZ twins reared together and DZ twins, as provided by other studies. In all cases, MZ twins are more alike than DZ twins, suggesting a genetic contribution to individual differences on these dimensions. The fact that, except for weight, the MZ concordance rates are the same for twins reared together and apart lends support to the genetic hypothesis.

                    MZ Together    MZ Apart    DZ
                    .94            .82         .44
                    .81            .37         .56
                    .76            .77         .51
                    .42            .61         -.17
                    .38            .53         .11

Somewhat later, Gottesman (1963, 1965, 1966) conducted a series of similar studies intended to assess the inheritance of somewhat narrower trait dimensions than those studied by Shields (1962).

In the first of these studies, Gottesman (1963) reported twin correlations for a group of students attending high school in the Minneapolis-St. Paul metropolitan area: on a number of scales, the correlations for MZ twins were significantly higher than those for DZ twins, providing prima facie evidence for a genetic contribution to individual differences in personality. A later study (Gottesman, 1966) obtained similar results for a larger group of all high-school students living in the Boston area.

The finding that some personality traits are heritable was supported by Dworkin and his colleagues (1976), who followed up the Gottesman sample ten years later -- but with a twist. They were able to test a smaller but representative sample of twin pairs on the MMPI and CPI -- the same inventories they had completed as high-schoolers a decade earlier. Again, some of the scales gave evidence of significant heritability at adulthood. But interestingly, only a small number of scales gave evidence of significant heritability at both adolescence and adulthood.

Put another way,

  • Some traits appeared to be heritable in adolescence but not in adulthood.

  • Other traits appeared to be heritable in adulthood but not adolescence.

For comparison, here are the twin correlations for the subsample of Gottesman's Boston sample that Dworkin et al. were able to retest.




Scale                           MZ       DZ

MMPI Scales
                                .07      .47
                                <.43     <.15
                                .23      .20
Psychopathic Deviate            <.49     <.30
                                <.64     <.17
                                <.19     <-.19
                                .41      .14
                                <.47     <.24
                                .46      .31
Social Introversion             <.45     <.26
K (Defensiveness)               .45      .29
                                <.58     <-.04
                                .37      -.04
Ego Strength                    .35      .14
                                <.53     <-.21

CPI Scales
                                <.62     <.45
Capacity for Status             .64      .64
                                <.61     <.24
Social Presence                 .39      .52
                                <.66     <.44
Sense of Well-Being             .56      .35
                                .77      .61
                                .51      .31
                                <.57     <.26
                                <.75     <.29
Good Impression                 .40      .33
                                -.05     -.07
Achievement via Conformance     .49      .57
Achievement via Independence    <.70     <.16
Intellectual Efficiency         <.78     <.33
Psychological Mindedness        .37      .04
                                .40      .46
                                .42      .55

And here are the correlations for the same subjects tested 10 years later, in adulthood.




Scale                           MZ       DZ

MMPI Scales
                                .42      .30
                                .49      .56
                                .16      .40
Psychopathic Deviate            .44      .34
                                .79*     .81
                                .21      -.27
                                .28      .19
                                -.15     .5
                                <.72*    <.08
Social Introversion             .50      .25
K (Defensiveness)               <.47*    <-.27
                                <.71*    <.00
                                .40      .12
Ego Strength                    <.53*    <-.06
                                <.76*    <-.05

CPI Scales
                                <.70*    <.09
Capacity for Status             .40      .45
                                .29      .21
Social Presence                 .47      .43
                                .48      .33
Sense of Well-Being             <.65*    <-.01
                                .43      -.03
                                .69      .46
                                <.62*    <-.16
                                .42      -.06
Good Impression                 <.46*    <-.14
                                .22      .29
Achievement via Conformance     .43      .05
Achievement via Independence    .41      .16
Intellectual Efficiency         .54      .15
Psychological Mindedness        <.45*    <-.10
                                .56      .31
                                .32      .18

One way to account for the discrepancies between the Gottesman and Dworkin et al. findings is to point to the radical differences between the social environments of the subjects as adolescents and as adults. The social lives of high-school students are pretty much alike, consisting of a familiar mix of home life, school, after-school jobs, dances, and sports events. After graduation, however, people take markedly different life paths. Some go into military service while others go to college; some raise families while others establish themselves in careers outside the home; some work in factories and some in offices; etc. From a strictly biological point of view, such an increase in environmental variance will necessarily diminish the possible genetic contribution to individual differences on some attribute. When the environment changes, certain attributes will change even if these attributes have their source in the individual's genetic endowment. Genetic influences are not fixed and immutable. It should be pointed out, however, that the discrepancy suggests at least one other hypothesis -- one which is not as friendly to the genetic argument. That is, the results of both studies, taken together, may reflect random effects at both ages, and these personality traits may not be heritable after all.

The most extensive twin study of personality conducted to date involved a fairly representative sample of 850 twins who had taken the National Merit Scholarship Qualifying Test toward the end of their senior year in high school (Loehlin & Nichols, 1976). These subjects completed an extensive battery of personality questionnaires including the CPI. Even though all testing was done by mail, internal analyses of the subjects' responses showed very little evidence of random or careless responding. Zygosity was diagnosed by means of a questionnaire concerning physical similarity and confusions in identifying the twins, which had proved to correspond well with the more rigorous blood-group diagnoses. In order to control for the influence of similarity in gender on similarity in personality, only same-sex pairs of fraternal twins were employed.

Table 3.11 shows the MZ and DZ correlations for males and females combined. Note, first, that all the correlations are positive, indicating some degree of similarity between twins on the dimensions tapped by the CPI. More important, the MZ twins are more similar (median r = .50) than the DZ twins (median r = .32). There were no sex differences in the degree of MZ-DZ difference. The MZ-DZ difference was obtained on all the scales, and the scales did not differ significantly from one another in its magnitude. In other words, all the traits measured by the CPI appear to be under an equivalent degree of genetic control.

Loehlin & Nichols (1976)

CPI Scales                         MZ      DZ
(unlabeled)                        .53     .24
Capacity for Status                .57     .44
(unlabeled)                        .53     .29
Social Presence                    .59     .23
(unlabeled)                        .48     .25
Sense of Well-Being                .49     .29
(unlabeled)                        .50     .34
(unlabeled)                        .54     .32
(unlabeled)                        .56     .31
(unlabeled)                        .53     .33
Good Impression                    .47     .28
(unlabeled)                        .38     .17
Achievement via Conformance        .46     .15
Achievement via Independence       .55     .40
Intellectual Efficiency            .52     .33
Psychological Mindedness           .42     .22
(unlabeled)                        .46     .20
(unlabeled)                        .35     .21

Note: Loehlin and Nichols reported correlations separately for males and females in two halves of the total group. For purposes of constructing this table, the values above represent simple arithmetic means of the correlations given in their Table 4-1.
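To make the logic of the MZ-DZ comparison concrete, the following sketch applies Falconer's classic approximation to the median correlations quoted in the text (.50 for MZ twins, .32 for DZ twins). The formula is not something Loehlin and Nichols are described as using here; it is a textbook simplification that assumes, among other things, that MZ and DZ twins experience equally similar environments. The function name and dictionary keys are hypothetical.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> dict:
    """Rough variance decomposition from twin correlations.

    Assumption (not from the source): Falconer's approximation, which
    presumes additive genetic effects and equal environmental similarity
    for MZ and DZ pairs.
    """
    h2 = 2 * (r_mz - r_dz)   # share attributed to genetic differences
    c2 = 2 * r_dz - r_mz     # share attributed to shared environment
    e2 = 1 - r_mz            # nonshared environment plus measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Median CPI correlations quoted in the text.
estimates = falconer_estimates(r_mz=0.50, r_dz=0.32)
for name, value in estimates.items():
    print(f"{name} = {value:.2f}")
```

Under these assumptions the three components sum to 1.0 by construction. An estimate of this kind is only as trustworthy as the equal-environments assumption behind it.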

Of course, the greater similarity of MZ twins compared to DZ twins may reflect similarities in the way MZ twins are treated by their parents and others in their environment, rather than their genetic endowments. Loehlin and Nichols surveyed the subjects' parents about their child-rearing practices, and found, in fact, that the MZ twins were treated more similarly than their DZ counterparts. They were more likely to have dressed alike as children, to have spent time together during late childhood and adolescence, to have had the same teachers in school, and to have shared a bedroom. Parents of MZ twins were more likely to report trying to treat the children similarly, compared to the parents of DZ twins. However, the differences between MZ and DZ twins were not particularly great in these terms, and the degree of similarity of environments was not strongly related to the degree of similarity in personality. Doubtless the similar treatment accorded to MZ twins contributed to their similarities in personality, but apparently the effect is not entirely mediated by the environment.

Somewhat paradoxically, perhaps, Loehlin and Nichols were perplexed by the outcome of their study. It seemed unreasonable to them, as it does to us, that all personality traits are heritable, much less that they are equally heritable. Yet this is what they found. The difference between MZ and DZ twin similarity was about the same for all CPI scales. Moreover, MZ-DZ differences of about the same magnitude were found for all other dimensions studied: general and special intellectual abilities as measured by the NMSQT itself; self-esteem; ideals, goals, and vocational interests; and activity preferences. It is almost as if the evidence on genetic determination is too good to be true.

Even so, the twin data provides plenty of evidence favoring environmental influences. Identical twins were not by any means perfectly identical with respect to personality. However -- and this is yet another paradox in the data -- environmental similarity (at least as measured in this study) does not seem to be strongly related to personality. The authors summarize their findings as follows:

Our data have thus not yielded any final conclusive answers to the heredity-environment question for personality, abilities, and interests. The data are generally consistent with a substantial influence of the genes in accounting for individual differences in these domains, but they imply a substantial influence of the environment as well -- indeed, they do not altogether exclude a completely environmentalist position. The data do, it seems to us, have something important to say concerning this environment. And the upshot of what they say is that it operates in remarkably mysterious ways, given traditional views on personality and motivational development (Loehlin & Nichols, 1976, p. 94).

Given these "puzzles and paradoxes about how the genes and environment act to shape the development of personality" (Loehlin & Nichols, 1976, p. 95), it seems more appropriate to set the issue of personality development aside. When we have a clearer understanding of what personality structure is, and how personality processes operate, we may be in a position to generate some hypotheses about how individual differences develop.

Behavioral Observations

Most studies of the inheritance of personality have employed self-report questionnaires, as the most convenient means of personality assessment. Some other recent studies, however, have employed direct observations of subjects' behavior in various test settings. In the most recent study, Plomin and Foch (1980) compared 54 MZ and DZ twin pairs aged 6-9 on a variety of tasks relevant to personality and temperament:

  • Activity Level: a pedometer similar to that used by some joggers to measure the miles they have run was worn by each child for a period of one week;

  • Fidgeting: measured during a 9-minute period when the child was instructed to rest quietly in a comfortable chair;

  • Vigilance: number of errors on a continuous performance task in which the child had to indicate whether each of a series of playing cards was the same as the one which preceded it (performed under both quiet and noisy conditions);

  • Selective Attention: errors on a listening task performed against a noisy background; and

  • Aggression: number of times a Bobo clown doll was hit, the rated intensity of the hits, and (just for good measure) the number of times the doll was knocked from one end of the room to the other.

Table 3.12 shows the findings of the study, along with MZ and DZ correlations for height and weight. Only the physical measures show any significant genetic contribution -- note that the correlations almost perfectly reflect what we would expect for purely genetically determined traits. For the psychological measures, the MZ and DZ correlations are statistically equivalent, which is to say that there is no evidence for a genetic contribution to any of them.




Measure                  MZ      DZ
Height                   .95     .50
Weight                   .89     .46
Activity Level           .99     .94
Fidgeting                .43     .20
Vigilance 1              .03    -.14
Vigilance 2              .24    -.10
Selective Attention      .42     .50
Aggressiveness 1         .44     .42
Aggressiveness 2         .38     .48
Aggressiveness 3         .22     .44
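One common way to ask whether an MZ correlation differs reliably from a DZ correlation is Fisher's r-to-z test for two independent correlations. The sketch below applies it to two rows of Table 3.12: the first physical measure (.95 vs. .50) and Aggressiveness 1 (.44 vs. .42). It is offered only as an illustration; we do not know which test Plomin and Foch actually used, and the group sizes (50 pairs each) are hypothetical placeholders, since the text does not give a per-group breakdown.

```python
import math


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int):
    """Two-tailed test of the difference between two independent correlations."""
    z1 = math.atanh(r1)                      # Fisher r-to-z transform
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))     # two-tailed normal probability
    return z, p


# Hypothetical numbers of twin pairs (assumption, not from the source).
N_MZ, N_DZ = 50, 50

z_physical, p_physical = fisher_z_difference(0.95, N_MZ, 0.50, N_DZ)
z_aggression, p_aggression = fisher_z_difference(0.44, N_MZ, 0.42, N_DZ)

print(f"physical measure: z = {z_physical:.2f}, p = {p_physical:.4f}")
print(f"aggressiveness 1: z = {z_aggression:.2f}, p = {p_aggression:.4f}")
```

With group sizes in this range, the physical measure yields a large z while the aggression measure yields a z near zero, which is the pattern the text describes.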

If the study by Loehlin and Nichols gave "too much" evidence for genetic determination of personality, the Plomin-Foch results appear to give "too little". These findings, considered along with the ambiguities of the other studies mentioned, lead us to conclude that a genetic factor in individual differences in personality has not yet been convincingly demonstrated. When read in the context of critiques of studies ostensibly showing a major genetic contribution to individual differences in intelligence (Jencks, ref; Kamin, 1974), we doubt that such factors can be demonstrated -- or, if they can be demonstrated, that they will turn out to be very important.

Temporal Stability of Personality

While the Doctrine of Traits does not require a genetic basis for personality, because traits may be acquired through experience, it does demand that traits, once established, remain relatively stable over time. As it turns out, there is not a great deal of information available pertaining to this issue. At first consideration, this is somewhat surprising. It seems a relatively simple matter to administer a personality inventory to a sample of subjects, and then follow them up after a substantial interval of time has elapsed -- much as Dworkin et al. (refs) did with Gottesman's (1962, 1963) twin sample.

When this is done, considerable stability is apparent. A recent survey of published test-retest reliabilities for the major factor-analytic and empirical personality inventories (e.g., SVIB, MMPI, CPI, GZTS, 16PF) found an average reliability coefficient of .81 for an interval of one to two weeks (Schuerger, Tait, & Tavernelli, 1982). Longer intervals show reduced but significant reliabilities of .67 for one year, .65 for 7-10 years, and .60 for 20 or more years.

Test-Retest Interval     M Reliability
1-2 Weeks                .81
1-2 Months
3-11 Months
1-2 Years                .67
3-4 Years
5-6 Years
7-10 Years               .65
11-14 Years
20+ Years                .60
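A test-retest reliability coefficient of the kind summarized here is simply the Pearson correlation between scores from two administrations of the same scale. The following minimal sketch computes one; the scores are invented for illustration and are not drawn from Schuerger et al. or any other study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scale scores for eight subjects tested on two occasions.
time1 = [12, 18, 9, 22, 15, 17, 11, 20]
time2 = [14, 17, 10, 19, 16, 15, 13, 21]

reliability = pearson_r(time1, time2)
print(f"test-retest r = {reliability:.2f}")
```

Note that a high coefficient means only that subjects keep their relative standing; the whole group could drift upward or downward without lowering it.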


One problem with this approach is that the ratings are not free of certain artifactual influences. It may be, for example, that subjects remember what their responses were the first time around, and base their second set of self-reports on these memories rather than their actual current behavior and experience. Moreover (especially when very long intervals of time are involved), many personality questionnaire items are written in the past tense (e.g., "I was a slow learner in school", from the MMPI). Since the incidents referred to occurred in the past, it is difficult to show change on such items. Most important, however, is the fact that questionnaire self-reports represent subjects' perceptions of themselves, rather than their actual behavior. For reasons that we discuss later ("Critique of Trait Research"), there may be more temporal stability to these self-perceptions than there is to the actual behavior of the subjects. For this reason, adequate studies of temporal stability demand behavioral observations collected at several points in time. Accumulating such data is extraordinarily expensive in terms of time, money, and resources -- in fact, the study may well outlive the original investigator.

The Fels Study

One attempt to gather such data employed a group of 89 subjects who had been closely followed from birth to age 14 at the Fels Research Institute in Ohio (Kagan & Moss, 1962). As children, the subjects had been observed and tested extensively at home, school, and laboratory twice a year from birth to age six, and once a year from age six to fourteen. Some 30 years later 71 of these individuals returned to Fels for an intensive interview and another battery of tests. This data formed the basis for ratings on a number of variables relating to passivity and dependency, aggression, achievement and social recognition, sexuality, and social interaction. Ratings were made for five epochs: 0-3 years of age, 3-6, 6-10, 10-14, and 19-29. Different raters evaluated the data from childhood and adulthood, and these ratings were strictly independent of each other.

As might be expected, stability within the childhood epochs was greatest when measured over short intervals, and lowest over long intervals. However, as Table 3.13 shows, the best predictors of adult personality were individual differences assessed during age 6-10 (even though the shortest test-retest interval began with the 10-14 epoch). Kagan and Moss suggested that during the interval from age six to ten several important events occur which "crystallize behavioral tendencies that are maintained through young adulthood" (1962, p. 272): identification with the parents; the beginnings of schooling; and sustained encounters with peers. The influence of the environment on stability and change in personality may be clearly seen in some of the sex differences in stability. Girls and women showed more stability for dependency, while boys and men showed more for aggression. Nevertheless, overall the correlations between childhood and adult personality are quite low. Apparently there is some stability in personality traits, particularly if the environment supports early tendencies. But clearly there is appreciable change, if not discontinuity, as well.

                              Epoch of Childhood
Variable                      0-3     3-6     6-10    10-14
Passivity-Dependency          .07     .00     .03     .03
Aggression                   -.03     .03     .07     .08
Achievement & Recognition    -.07     .09     .14     .08
Sexuality                     ---     .15     .19     .03
Social Interaction            .00     .18    -.17     .22
Compulsivity-Impulsivity      .04     .09     .07     .08

Note: For simplicity of exposition, the cell values are simple arithmetic averages of a wider array of values reported by Kagan & Moss (Chapters 3-7, 9).

The Fels study was an important milestone in developmental personality research, but it had certain shortcomings that made it less than definitive. For example, all the ratings from childhood were done by a single judge. Although care was taken to perform the ratings at different times, and after a considerable amount of interpolated activity, it is possible that memory of behavior at one age contaminated ratings of behavior at another. Moreover, the ratings of childhood and adult behavior were not made on precisely the same scales, so they are not completely comparable. Finally, only a single judge rated the adult period. While the influence of memory would tend to inflate the stability coefficients, the remaining problems would tend to diminish them. These are difficult problems to correct, but an ingenious solution was offered by Block (1971, 1977) and his colleagues at the University of California, Berkeley.

The Berkeley Study

Block and his colleagues have followed two large samples of adults since they were young children. One sample, the Oakland Growth Study, consisted of 212 fifth-graders enrolled in the Oakland (California) public schools in 1932. The other sample, known as the Berkeley Guidance Study, consisted of one-third of all children born in that California city between 1929 and 1930. With the permission of their parents, a dossier on each child was assembled as they entered junior high school (JHS) and again after graduation from senior high school (SHS). These files contained a wealth of information based on naturalistic observation, including newspaper clippings, impressions of school authorities and others in the community, school essays, and the like. When these individuals were approximately 35 years old (ADULT), they were recontacted and asked to participate in a total of 12 hours of interviews with a panel of psychologists and social workers. These interviews followed a standard format, and covered all details of the person's life since graduation. It is important to note that at each point, JHS, SHS, and ADULT, the personality data was gathered in a strictly independent fashion: there were different sources of information at each time, and no one contributing material at any one of the later times had any access to information supplied earlier.

The result of this strategy was a large, heterogeneous mass of data, and a method had to be found for preparing a summary of each subject's personality at each time in a standard format that would permit the necessary statistical comparisons. For this purpose, Block and his colleagues devised a variant on Stephenson's Q-sort technique (Stephenson, ref). The California Q-Set (Block, 1961) consists of 114 general statements about personality, such as those displayed in the table below. The meaning of each item was explicitly specified by the investigators. Judges familiar with the rating system read through one of the three dossiers and then sorted the Q-set items into nine categories ranging from least to most characteristic of the person. In order to force the judges to make careful discriminations, the Q-sort was constrained to follow a normal (bell-shaped) distribution, so that only a few items could be placed in the most extreme categories. At least three judges completed Q-sorts for the subjects at each time period, and their ratings were averaged to form a "consensus" Q-sort. No judge made Q-sorts for more than one period in any subject's life, so the three sets of ratings, like the data on which they are based, are strictly independent.

Items such as the following define a broad dimension of Ego-Undercontrol:

Unable to delay gratification
Unpredictable and changeable in behavior and attitudes
Has fluctuating moods
Emotionally bland: Has flattened affect (-)
Behaves in an ethically consistent manner (-)
Has a clear-cut, internally consistent personality (-)

Items such as the following define a broad dimension of Ego Resiliency:

Has a wide range of interests
Interesting, arresting person
Feels a lack of personal meaning in life (-)
Gives up and withdraws where possible in the face of frustration and adversity (-)
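The mechanics of a forced-distribution Q-sort and a consensus sort can be sketched as follows. This is a toy illustration rather than Block's actual procedure: it uses 19 numbered items instead of the full Q-set, and the quota of items per category is a hypothetical bell-shaped allotment chosen for the example.

```python
from collections import Counter

# Hypothetical quota: number of items each of the nine categories must hold,
# from 1 (least characteristic) to 9 (most characteristic).
QUOTA = [1, 1, 2, 3, 5, 3, 2, 1, 1]


def forced_sort(ranking, quota=QUOTA):
    """Turn a judge's ranking (least to most characteristic) into categories."""
    if len(ranking) != sum(quota):
        raise ValueError("ranking length must equal the total quota")
    placements = {}
    position = 0
    for category, count in enumerate(quota, start=1):
        for item in ranking[position:position + count]:
            placements[item] = category
        position += count
    return placements


def consensus_sort(sorts):
    """Average the category placements of several judges, item by item."""
    items = sorts[0].keys()
    return {item: sum(s[item] for s in sorts) / len(sorts) for item in items}


# Three hypothetical judges ranking items 0..18 for the same subject.
judge_a = list(range(19))
judge_b = [1, 0, 2, 3, 5, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 17]
judge_c = [0, 2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 13, 15, 16, 17, 18]

sorts = [forced_sort(j) for j in (judge_a, judge_b, judge_c)]
consensus = consensus_sort(sorts)
category_counts = Counter(sorts[0].values())

print("items per category:", [category_counts[c] for c in range(1, 10)])
print(f"consensus placement of item 0: {consensus[0]:.2f}")
```

Because every judge must respect the same quota, disagreements show up as differences in which items occupy the extreme categories, not in how many items are placed there.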

The table presents the major findings of Block's study pertaining to the issue of temporal consistency. The two studies have been combined, though results for men and women are presented separately. The basic data in this part of the study is provided by correlations between JHS-SHS, SHS-ADULT, and JHS-ADULT consensus Q-sort ratings of each Q-set item. This procedure is analogous to the test-retest reliability of each item in the Q-set. As can be seen, the obtained correlations ranged widely. The average stability coefficient drops from a substantial .75-.77 over the shortest interval (JHS-SHS) to a more moderate .54-.56 for the middle interval (SHS-ADULT) and .47-.55 over the longest interval (JHS-ADULT). Similarly, the number of Q-Set items showing statistically significant correlations (p < .001) across the various time intervals drops from a high of 57-59% (JHS-SHS) to a low of 28-30% (SHS-ADULT). Block (1971, 1977) unfortunately did not report comparable figures for the JHS-ADULT interval, but the trend indicates that the percentage of consistent traits across that period of time would be still lower.










                             JHS-SHS                   SHS-ADULT                 JHS-ADULT
                             Men          Women        Men          Women        Men     Women
Psychological Adjustment     .62          .50          .33          .23          .23     .20
% Items p < .05              96           89           59           60           ---     ---
% Items p < .001             59           57           28           30           ---     ---
M Item r                     .41          .39          .25          .26          ---     ---
M Composite r                .77          .75          .56          .54          .47     .55
Range Composite r            -.05 to .85  -.05 to .85  -.30 to .80  -.25 to .75  ---     ---

Block's study provides evidence for significant temporal consistency of some personality characteristics, but it also reveals appreciable personality change over time. Many of his correlations are near zero, and some are negative, indicating that subjects actually reverse their relative standings on the dimensions in question, going from high at one time to low at another, and vice-versa. Employing a variant of Cattell's L-technique (see "Types and Traits"), Block and his colleagues have attempted to trace the course of personality development and change, using data from the 90 items common to the JHS and adult Q sorts. According to their findings, there is no single pattern of development visible in all subjects. However, some common patterns of personality change were revealed by their analyses, as shown in the table.

In Men

A - Ego Resilient
B - Belated Adjusters
C - Vulnerable Overcontrollers
D - Anomic Extraverts
E - Unsettled Undercontrollers

In Women

U - Female Prototypes
V - Cognitive Copers
W - Hyper-Feminine Repressives
X - Dominating Narcissists
Y - Vulnerable Undercontrollers
Z - Lonely Independents

Some of these patterns, especially the Ego Resilient males and the Female Prototypes (the latter also referred to by Block as the Good Housekeeping-McCall's Woman, after popular magazines concerned with women's lives as wives and mothers), conform to prevailing social stereotypes concerning the masculine and feminine sex roles (Spence & Helmreich, REF). Similarly, the Lonely Independent conforms to the common stereotype of the woman who adopts the stereotypical masculine sex role but at some sacrifice of her social and personal life. By examining other data on these individuals, Block has been able to develop a rather rich portrait of their characters at junior and senior high school and in adulthood, how their character has changed over the interval, typical demographic, occupational, and socioeconomic status, and the like. He has also been able to relate data collected on the subjects to Q-sort ratings of their parents. Thus the study speaks to issues of personality development as well as stability. "What comes through, for both sexes and without exception, is an unequivocal relationship between the family atmosphere in which the child grew up and his later character structure" (Block, 1971, p. 258).

As an exercise in personality description we should add Block's eleven patterns of personality development to the list of modern personality typologies provided in the chapter on "Types and Traits". Block's is different, however, in that it rests on systematically collected and analyzed empirical data rather than on clinical or sociological speculation. Moreover, his typology is uniquely based on an analysis of lives that have been observed over an appreciable period of time, rather than of people who have been assessed at only a single moment.

Research on Individual Trait Constructs

To some degree, the work on personality assessment, inheritance, and stability described so far in this chapter has been concerned with the search for a "grand design" for the structure of personality, as discussed in the chapter on "Types and Traits". The major personality tests are commonly called "surveys" or "inventories" precisely because they purport to provide more or less comprehensive assessments of an individual's personality (which is why, for example, they were selected for Gottesman's genetic research). The California Q-Set employed by Block was designed with the same purpose in mind. Since 1955, however, there has been a strong historical trend away from a theoretical "grand structure" toward narrower "minitheories" concerning particular individual-difference constructs. In these programs of research, the principles of construct validity have been used to guide the exploration of single dimensions of personality. But by and large the investigators involved have shown little concern for the structure of personality as a whole.

Perhaps this turn of events reflects an increasing awareness among trait theorists of the futility of "grand design" approaches to personality. There really seems to be no basis for choosing among the various descriptive frameworks offered by Guilford, Cattell, Eysenck, and others. One trait theorist, reflecting on this state of affairs, has gone so far as to propose that the field start anew with an international congress to determine just what the major dimensions of personality are (London, 1978). In any event, a great deal of contemporary personality research is focused on single trait dimensions, with each investigator pursuing his or her own favorite construct. These lines of investigation are periodically reviewed in such journals as Psychological Bulletin or the volumes of Progress in Experimental Personality Research, as well as occasional edited books (refs). The contributors to one recent volume, for example, explore the 13 traits listed in Table 3.18 (London & Exner, 1978).

Achievement Strivings




Field Dependence


Locus of Control


The Need for Approval

The Power Motive




Following the principles outlined earlier, research on individual trait constructs normally begins with an explicit theoretical conception of the trait, which then is translated into a questionnaire intended to measure individual differences on that dimension. The external component of construct validity is typically established by attempting to predict, on the basis of questionnaire scores, how subjects will respond to some situation deemed relevant to the trait dimension. Validating the questionnaire, then, simultaneously validates its underlying theory. This is the essence of construct validity. In addition, there may be investigations of the heritability of the trait, or its origins in early childhood experience. Finally, there may be studies of the impact of various experiences on the trait in question, or of its correlates elsewhere in the domain of personality. In what follows, we briefly survey four of these traits, as samples of this kind of research.


In the aftermath of World War II, there was great interest in understanding how the Nazi regime, and all its horrors, could have arisen in Germany; and how the German people, once they became aware of such atrocities as the program to exterminate Jews, gypsies, and homosexuals, could have acquiesced in these activities as they appear to have done. There was, of course, an appreciation of the consequences for Germany of the worldwide Great Depression and the concessions that had been extracted as part of the Treaty of Versailles that ended World War I; and of the enormous military and police power of the Third Reich, which might have made open resistance useless and dangerous. At the same time, however, there were countless acts of individual and collective courage, from the defiance of the Danes to the rescue efforts of Raoul Wallenberg. Accordingly, investigators began to search for explanations in terms of the psychological makeup of the German people, as well as in the economic and political context.

Some important groundwork had been laid by Erich Fromm's (1941) Escape from Freedom, in which he defined the "authoritarian character" as "the personality structure which is the basis of Fascism" (p. XX). In the years immediately following the war a research group consisting of a number of refugees from Hitler's Europe, supported by the American Jewish Committee, performed their classic study of The Authoritarian Personality (Adorno, Frenkel-Brunswik, Levinson, & Sanford, 1950). Interestingly, their investigation did not focus on Germans, but rather on Americans.

The California F Scale

The research evolved over a number of stages, beginning with the construction of a series of intuitive scales designed to measure attitudes of anti-Semitism (the A-S Scale), ethnocentrism in general (the E Scale), and conservative political and economic attitudes (the PEC Scale). The scales were administered to various groups of subjects, and high- and low-scorers were given extensive clinical interviews and further psychological testing. These procedures formed the basis for a personality assessment, and led to the conclusion that on average those who were prejudiced against Jews, blacks, women, and other readily identifiable social groups, who displayed a chauvinistic or jingoistic brand of patriotism, and who leaned toward the right (as opposed to the center or the left) of the political spectrum, also manifested a constellation of general attitudes and beliefs:

  • Conventionalism: emotionally based, rigid adherence to middle-class values;

  • Authoritarian submission: uncritical reliance on external authority rather than on internal standards for moral control;

  • Authoritarian aggression: hostility toward "outgroups" of various kinds, especially ethnic minorities;

  • Anti-intraception: inclination towards practical matters, and avoidance of self-reflection;

  • Superstition and stereotypy: belief in luck and fate, reliance on undocumented assertions and half-truths;

  • Power and toughness: admiration of these qualities in others, combined with assertion of them in oneself;

  • Destructiveness and cynicism: generalized hostility towards others;

  • Projectivity: belief that the world is dangerous and that others are hostile;

  • Sex: an avid concern with sexual topics, especially sexual deviance.

A personality questionnaire, the California F Scale (the F was for fascism), was developed to measure these dispositions. Combining intuitive and empirical strategies, items were written to represent each of the characteristics. Those which distinguished between high- and low-scorers on the A-S, E, and PEC scales were included in the final version. Table 3.19 gives a sample of these items.

Obedience and respect for authority are the most important virtues children should learn.
A person who has bad manners, habits, and breeding can hardly expect to get along with decent people.
If people would talk less and work more, everybody would be better off.
The businessman and the manufacturer are much more important to society than the artist and the professor.
Science has its place, but there are many important things that can never possibly be understood by the human mind.
Young people sometimes get rebellious ideas, but as they grow up they ought to get over them and settle down.
When a person has a problem or worry, it is best for him not to think about it, but to keep busy with more cheerful things.
An insult to our honor should always be punished.
What the youth needs is strict discipline, rugged determination, and the will to work and fight for family and country.
Human nature being what it is, there will always be war and conflict.
Nowadays more and more people are prying into matters that should remain personal and private.
The wild sex life of the old Greeks and Romans was tame compared to some of the goings-on in this country, even in places where people might least expect it.
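The empirical half of this item-selection strategy can be sketched as a simple comparison of group means: an item is kept if people who score high on a criterion scale endorse it more strongly than people who score low. The item names, ratings, and cutoff below are all hypothetical, and the index shown (difference between group means) is offered as one plausible way to operationalize "distinguishing between high- and low-scorers", not as a record of the exact computation used for the F Scale.

```python
def discriminatory_power(high_group, low_group):
    """Mean item rating among high scorers minus mean among low scorers."""
    mean_high = sum(high_group) / len(high_group)
    mean_low = sum(low_group) / len(low_group)
    return mean_high - mean_low


# Hypothetical agreement ratings (1 = strongly disagree, 7 = strongly agree)
# from subjects scoring high vs. low on a criterion scale such as the E Scale.
candidate_items = {
    "item_a": ([6, 7, 5, 6], [2, 3, 2, 1]),
    "item_b": ([4, 5, 4, 4], [4, 4, 5, 3]),
}

THRESHOLD = 1.0  # hypothetical cutoff for retaining an item

retained = [
    name
    for name, (high, low) in candidate_items.items()
    if discriminatory_power(high, low) >= THRESHOLD
]
print("retained items:", retained)
```

In this toy data set the first item separates the two groups sharply and is retained, while the second barely separates them and is dropped.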

Determinants and Correlates of Authoritarianism

With a measuring device in hand, Adorno et al. went on to conduct extensive clinical interviews with high- and low-scoring subjects, in order to determine how the authoritarian personality developed. Coming from a background in Freud's psychoanalytic theory of personality, they traced authoritarianism to difficulties which these individuals had as children in dealing with parental authority. It appeared as if childhood dependency, fear of parental reprisal for misdeeds, and hostility toward parental control were somehow transformed into a concern with power, aggressiveness, and antagonism towards people besides parents. Adorno et al. found that authoritarians -- notice how easy it is to slip from traits to types in describing people -- were raised with a heavy dose of threats, coercion, and punishment. Interestingly, there is some evidence that they raise their own children in the same way (Levinson & Huffman, 1955), creating a vicious cycle.

Other investigators have examined the demographic factors and personality variables associated with authoritarianism. It is not surprising, given the way that the F Scale was constructed, that high scorers are suspicious and hostile towards members of other ethnic groups and people who deviate from established social norms. Nor is it surprising that they vote for conservative candidates for public office, although authoritarianism is not correlated with political party affiliation. These individuals tend to possess low levels of education and socioeconomic status. The trait also shows a small negative correlation with intelligence. Authoritarians prefer well-defined as opposed to ambiguous situations, display egocentrism in believing that others share their attitudes and beliefs, and display elevated levels of anxiety.

Authoritarianism and Social Behavior

The Adorno studies were based mostly on clinical interviews conducted by investigators who knew the subjects' levels of authoritarianism in advance, and thus could have subtly and inadvertently shaped their responses to questions concerning their background and current behavior -- or the ways in which these responses were coded in later analyses. For this reason, a large number of laboratory experiments have been performed to examine the effects of authoritarianism on social interactions of various sorts under more tightly controlled conditions.

For example, Katz (1960) and his colleagues placed high- and low-scoring subjects in an attitude change paradigm, in which they were confronted with information intended to change their opinions on various topics. In general, of course, high-scoring subjects were hard to convince, while low scorers tended to shift their attitudes in the desired direction. Information which contradicted the opinion of highly authoritarian individuals actually had a "boomerang" effect, strengthening their original attitudes (Wagman, 1965). Interestingly, highly authoritarian subjects showed more attitude change if the counterattitudinal information was presented as coming from a highly authoritative source.

Another series of experiments employed the "prisoner's dilemma" game to study competitive and cooperative behavior. The procedure takes its name from a situation in which two suspects are confronted with the following choice: if both refuse to confess, they will get relatively light sentences; if one confesses and the other one doesn't, the one who confesses will be released while the other one will receive heavy punishment; if they both confess, they both will get moderately heavy sentences. In the laboratory, the game is typically played for points which translate into small amounts of cash. Table 3.20 shows both versions of the prisoner's dilemma game (from Freedman, Sears, and Carlsmith, 1981).

                        Partner's Choices
Player's Choices        A (cooperate)                         B (compete)
A (cooperate)           Both get $5                           Partner gets $10; Player loses $20
B (compete)             Player gets $10; Partner loses $20    Both lose $15

Obviously, subjects can choose either a cooperative strategy (A) or a competitive one (B). However, the cooperative strategy is based on the individual's trust that his/her partner will also cooperate. If the partner competes rather than cooperates, the cooperative player will lose. By the same token, choice of a competitive response assumes that the partner is a dupe: if both compete, both lose. As one might expect, high authoritarians tend to make competitive choices, attempting to maximize their outcomes with little concern for, and even at the expense of, those of others. This is especially the case when a high-authoritarian player is paired with a partner who is low in authoritarianism. Low authoritarians often retaliate for their losses, so the combination typically results in net losses for both players. When two low-authoritarians play, the result is cooperation and net gains for both. Two high-authoritarians, when paired, also eventually work out cooperative strategies.
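The incentive structure behind these choices can be sketched in a few lines of Python. This is only an illustration, using the dollar payoffs listed in Table 3.20 and the labels A (cooperate) and B (compete) from the text; the names PAYOFFS, best_reply, and totals are hypothetical and do not come from any of the studies cited.

```python
# Dollar payoffs indexed by (player's choice, partner's choice).
# Each value is (player's payoff, partner's payoff), as listed in Table 3.20.
# "A" is the cooperative choice and "B" is the competitive choice.
PAYOFFS = {
    ("A", "A"): (5, 5),
    ("A", "B"): (-20, 10),
    ("B", "A"): (10, -20),
    ("B", "B"): (-15, -15),
}


def best_reply(partner_choice):
    """Return the choice that maximizes the player's own payoff
    when the partner's choice is held fixed."""
    return max("AB", key=lambda choice: PAYOFFS[(choice, partner_choice)][0])


def totals(player_moves, partner_moves):
    """Sum each side's winnings over a series of rounds."""
    player_total = 0
    partner_total = 0
    for player_choice, partner_choice in zip(player_moves, partner_moves):
        player_gain, partner_gain = PAYOFFS[(player_choice, partner_choice)]
        player_total += player_gain
        partner_total += partner_gain
    return player_total, partner_total


if __name__ == "__main__":
    # Competing is the better reply whatever the partner does...
    print(best_reply("A"), best_reply("B"))
    # ...yet six rounds of mutual competition leave both players worse off
    # than six rounds of mutual cooperation.
    print(totals("BBBBBB", "BBBBBB"))
    print(totals("AAAAAA", "AAAAAA"))
```

Whatever the partner does, B pays the individual player more than A, which is why a player concerned only with his/her own outcome is drawn to it. The joint outcome of mutual competition is nevertheless worse for both players than mutual cooperation, which is the dilemma.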

Finally, authoritarianism has been investigated with respect to actual obedience to authority, as measured in Milgram's (1963) classic paradigm. In this procedure, subjects are recruited to serve as teachers in an experiment on the role of punishment in learning. Another subject (who is actually a confederate of the experimenter) is required to learn a list of words. If s/he makes a mistake, the teacher must administer increasingly severe electric shocks. Milgram found a substantial amount of obedience to the experimenter's instructions under these circumstances, even to the point of administering ostensibly life-threatening punishments. Obedience is especially high among authoritarians (Elms & Milgram, 1966), even in the absence of explicit pressure from an authority figure (Epstein, 1966). Like Adorno et al., Milgram was interested in the psychological underpinnings of Nazi Germany and the syndrome of the "good German". Although he placed primary emphasis on the presence of authority, he agreed that authoritarianism predisposed some individuals to extreme levels of obedience.

Authoritarianism has been one of the most heavily researched concepts in all of personality psychology (for reviews see refs), and it has been the origin of other constructs as well. Recognition that the A-S, E, PEC, and F scales were all keyed in a single direction, so that positive responses produced high scores on the scale, led to the notion of acquiescence as a response style. Later investigators turned acquiescence from a content-free style into a substantive personality domain in its own right, and sought to underscore the theoretical connection between authoritarianism and acquiescence.
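The keying problem can be made concrete with a small scoring sketch. It assumes a hypothetical ten-item scale answered on a 7-point agreement format (1 = strongly disagree, 7 = strongly agree); the function name score_scale, the item count, and the response format are illustrative assumptions, not the scoring rules actually used for the A-S, E, PEC, or F scales.

```python
def score_scale(responses, keys, points=7):
    """Sum agreement ratings across items.

    responses: ratings from 1 (strongly disagree) to `points` (strongly agree).
    keys: +1 if agreeing with the item indicates the trait,
          -1 if disagreeing with the item indicates the trait.
    Negatively keyed items are reversed before summing.
    """
    if len(responses) != len(keys):
        raise ValueError("responses and keys must have the same length")
    total = 0
    for rating, key in zip(responses, keys):
        total += rating if key > 0 else (points + 1 - rating)
    return total


# A hypothetical "yea-sayer" who strongly agrees with every item,
# regardless of what the item says.
yea_sayer = [7] * 10

all_positive_keys = [1] * 10      # every item keyed in the same direction
balanced_keys = [1, -1] * 5       # half of the items reverse-keyed

if __name__ == "__main__":
    print(score_scale(yea_sayer, all_positive_keys))
    print(score_scale(yea_sayer, balanced_keys))
```

On the scale keyed in a single direction, indiscriminate agreement earns the maximum possible score (70), so it cannot be told apart from a strong standing on the trait. On the balanced scale the same response pattern lands at the midpoint (40), which is why later test constructors mixed positively and negatively keyed items.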

Further consideration led investigators to begin to think of authoritarianism of the left as well as of the right, perhaps accounting for certain responses to Communism as well as to Fascism (Shils, 1954). Finally, Rokeach (1954, 1960) developed the construct of dogmatism, or the degree to which an individual is open to new ideas and permits old ones to change (for a recent review see Ehrlich, 1978). Psychological research on authoritarianism has abated considerably, although the rise of the Moral Majority in the United States, the persistence of authoritarian political regimes in much of the Third World, and the apparent attractiveness of cults such as the Unification Church, keep alive the question of the nature of authority and the kind of people who blindly obey it.


The concern with personality and politics did not focus just on the followers. There soon developed an interest in the personality characteristics of leaders, as well. It has been noted at least since the time of Machiavelli's The Prince (ref) that many leaders are opportunistic and manipulative, and believe that the ends which they seek justify any means that is employed to achieve them. It goes without saying that many ordinary people, people outside leadership positions, have these qualities too. Taking The Prince as his inspiration, Christie (Christie & Geis, 1970; Geis, 1978) began to explore individual differences in Machiavellianism.

The Mach Scale

Machiavelli's ancient essays provided Christie with a rich lode of questionnaire items, and the individual-differences measure was largely constructed by paraphrasing various statements in a way that permitted the respondent to express agreement or disagreement. This intuitive-rational generation of the initial item pool was followed by internal consistency analyses, as well as by procedures designed to eliminate the influence of acquiescence and social desirability response styles. Selected items from a late version of the Mach Scale (Mach IV) are given in Table 3.21. The final questionnaire contained items pertaining to three aspects of Machiavellianism:

  • the belief that people can be manipulated;

  • a willingness to manipulate others; and

  • an ability to manipulate others.

Never tell anyone the real reason you did something unless it is useful to do so.
The best way to handle people is to tell them what they want to hear.
It is safest to assume that all people have a vicious streak and it will come out when they are given a chance.
Generally speaking, people won't work hard unless they're forced to do so.
Anyone who completely trusts anyone else is asking for trouble.
The biggest difference between most criminals and other people is that the criminals are stupid enough to get caught.
It is wise to flatter important people.
It is hard to get ahead without cutting corners here and there.

Machiavellianism, as measured by these scales, appears to be uncorrelated with intelligence, socioeconomic status, radicalism-conservatism, and authoritarianism. However, high Machiavellians do tend to deny that people in general -- including themselves -- are trustworthy or altruistic. High Machiavellians seem to be as negativistic and cynical about themselves as they are about other people. They are able to disguise their emotions, and are relatively invulnerable to emotional appeals. Low-scorers on Machiavellianism are relatively easily persuaded and influenced, and easily slip into deep, emotional involvements with others. Both high and low Machiavellians can be induced to cheat (Exline, Thibaut, Hickey, & Gumpert, 1970) -- although they may do so for rather different reasons. When caught, high Machiavellians look their accuser straight in the eye while denying their actions -- probably in an attempt to look convincing.

Machiavellian as Con Artist

The Machiavellian person is one who can get other people to do what s/he wants, but also is one who can get other people to do what they want (provided that they and the Machiavellian want the same things). A number of laboratory studies have put high- and low-scoring subjects in such situations. For example, Braginsky (1970) administered an adaptation of the Mach Scale to a group of fifth graders. Then, posing as a marketing researcher, she recruited high and low Machiavellians to help her test a new kind of cracker. The test crackers had been laced with quinine to make them taste bitter, and because the helpers tried the goods they were aware of how unpleasant they tasted. However, when offered five cents for every cracker that they could get their schoolmates to try, high Machiavellians earned roughly twice as much as low Machiavellians.

Machiavellianism is negatively correlated with social desirability, so it is not easy to get people to engage in interpersonal manipulation out in the open where it can be pinned down and studied. One solution to this problem was to invent a game in which such manipulation was required in order to win. In an experiment by Geis (1970), subjects were run in groups of three consisting of one high, one medium, and one low Machiavellian. Advances around the board were determined by the throw of a die and the cards that each player held in his/her hand. The players were given either low-, medium-, or high-scoring hands of cards. Obviously, the only way for a person with a poor hand to win is to form a coalition with another player; but joining such a coalition would dilute the winnings of a player with a relatively good hand. Thus, in order to maximize his/her earnings, a person must be able to form and abandon coalitions as the need arises. Both the formation and dissolution of coalitions were permitted in the experiment, and the partners were allowed to negotiate their own terms for dividing the winnings. At the end of the six games, high Machiavellians had accrued the most points, and low Machiavellians the least. These results were confirmed in a similar experiment in which subjects played for $10 instead of points (Christie & Geis, 1970). Apparently the high Machiavellians used their manipulative skills at the expense of low Machiavellians in their groups.

But high Machiavellians can also share their successes if they have to. Geis (1968) randomly divided a psychology class into teams of four to work on group research projects. Each team had to select a leader, and individual members would share the grade given to the group. High Machiavellians were more likely to be elected leaders, again in testimony to their manipulative skills. Moreover, teams led by high Machiavellians got higher grades for their projects than those led by low Machiavellians, relative to the members' performance on individually administered exams. With their personal outcomes at stake, high Machiavellians took charge, mobilized their group to action, and got results.

Of course, it may have been that the high Machiavellians simply did the class project by themselves, just as their fifth-grade counterparts may have held their noses and eaten all of the crackers. However, the bulk of available evidence indicates that these individuals really are better able to manipulate others, and that they consciously devise and employ a wide variety of strategies in order to do so.

One such strategy appears to be to keep their eyes firmly fixed on outcomes, and not to allow themselves to be distracted by personal relationships and other emotional matters. In an experiment (Geis, Weinheimer, & Berger, 1970), subjects played the role of legislators trying to get bills passed by a majority of their colleagues. If they succeeded, they were promised 50,000 "votes" in the next "election" for each law enacted. Some of the bills involved emotional issues (the draft, the drinking age), while others involved relatively trivial matters. As in any legislature, bills are passed if the sponsor can line up votes. Six subjects, each scoring at a different level of Machiavellianism, were assigned to debate the issues, bargain, and then vote. High and low Machiavellians were equally successful on the trivial bills, but highs were more successful than lows on the emotional issues. Indeed, the lows did particularly badly in garnering support for bills that they personally favored. Lows and highs apparently understand the principles of bargaining equally well: highs can use them consistently, while lows can get sidetracked by irrelevant issues -- and lose.

It is not surprising that experiments on Machiavellianism typically employ analyses of business and political situations. High Machiavellians are strongly oriented toward power and achievement; and in Western culture, which shares this perspective, these individuals are likely to excel. It is by no means clear, however, that people who occupy positions of power and influence in the real world score high on the Mach Scale. Perhaps they do, but many leaders exercise their power and seek their goals openly rather than behind the scenes, and by the exercise of raw power rather than by deliberate, self-conscious bargaining.

Social Desirability and the Need for Approval

While some individuals are disposed to seek power over other people, others are equally active in seeking their approval. The notion of individual differences in the approval motive, or the need for approval, had its origin in the discovery, described earlier, that many personality items and scales are heavily contaminated by social desirability. While originally social desirability was construed as a content-free response style, the concept was later expanded to refer to an individual's desire to cultivate a favorable impression of him/herself in the minds of others and to behave in such a way as to meet the expectations and standards of his/her social reference group. Getting along with others is at least as important as manipulating them, so this characteristic has been the subject of an extensive series of investigations (for reviews see Crowne, 1979; Crowne & Marlowe, 1964; Millham & Johnson, 1978).

The Social Desirability Scale

The job of constructing a proper scale to measure social desirability as an attribute of persons is made problematic by the fact that social desirability is also an attribute of scale items. How then to construct a social desirability scale whose substantive content is free of contamination by social desirability response style? And how to infer that a person has a high need for approval if s/he says that s/he does good things and doesn't do bad things, when in fact s/he may simply be telling the truth? It is not easy, and Crowne and Marlowe were only partially successful. The final scale consisted of two types of items: the attribution to oneself of characteristics that are socially desirable but rare; and the denial in oneself of characteristics that are socially undesirable but common. Some representative items are given in Table 3.22.

Before voting I thoroughly investigate the qualifications of all the candidates.
I never hesitate to go out of my way to help someone in trouble.
I have never intensely disliked anyone.
I am always careful about my manner of dress.
My table manners at home are as good as when I eat out in a restaurant.
No matter who I'm talking to, I'm always a good listener.
I'm always willing to admit it when I make a mistake.

Conformity and Compliance

A large number of studies have examined the response of subjects classified as high and low in social desirability to various implicit and explicit pressures to conform to group standards or comply with situational demands. For example, when participants in an excruciatingly boring psychological experiment were asked to evaluate the study in the presence of the experimenter, subjects high in social desirability gave significantly more favorable replies than their low-scoring counterparts (Marlowe & Crowne, 1961). In another experiment, subjects were placed in a situation involving apparently simple perceptual judgments. In a classic experiment, Asch (1951) had shown that when experimental confederates unanimously announced the wrong judgment, roughly one-third of naive subjects followed suit, giving an answer that was clearly wrong. This tendency is enhanced in subjects who are high in need for approval. High social desirability subjects are more likely to express agreement with a statement that runs counter to their own attitudes, regardless of the credibility of the source or the quality of the argument. Subjects who are low in social desirability are more discriminating, responding chiefly to the credible or high-quality arguments and rejecting the others (Skolnick & Heslin, 1971). When required to prepare and deliver a speech on some topic that is contrary to their private beliefs, high social desirability subjects tend to change their beliefs to correspond more closely to their behavior, while low social desirability subjects are less inclined to do so (Crowne & Marlowe, 1964).

An interesting series of studies has examined the effect of social desirability on performance in a verbal conditioning procedure. In this type of experiment, subjects are asked to perform a simple verbal task, such as making up sentences or generating instances of categories. The experimenter selects some property of this response -- plural nouns, sentences beginning with "I", or whatever -- as the target of reinforcement. Whenever the target is produced by the subject, the experimenter responds with "Mmmm-hmmm" (positive reinforcement) or "uh-uh" (negative reinforcement), along with corresponding nonverbal gestures. The finding is that over many trials, subjects increase the frequency of responses that are targeted for positive reinforcement, and decrease the responses that are negatively reinforced (Greenspoon, 1955). Subjects who are high in need for approval are more sensitive to these reinforcement contingencies, whether they are experienced directly (Crowne & Strickland, 1961) or vicariously (Marlowe, Beecher, Cook, & Doob, 1964).

Defense and Aggression

In the same way that they shape their behavior so as to conform with situational demands, and strive to present themselves in the best possible light, subjects high in social desirability also avoid entering into situations where they might look bad, and resist situational pressures to act in a socially undesirable way. For example, if they go into psychotherapy they tend to drop out relatively early -- presumably rather than confront the undesirable characteristics that led them to seek treatment in the first place (Crowne & Marlowe, 1964). In verbal conditioning experiments, they show no response to reinforcement when it is contingent on saying very unfavorable things about themselves. Given a choice, they prefer not to participate in social encounters, such as therapy groups, where they will be evaluated. When frustrated, punished, or criticized, they are less likely to engage in retaliatory aggression.

In general, the whole pattern of findings suggests both a positive and a negative aspect to the need for approval. First, there is the desire to do good things, in order to gain approval and acceptance by others. However, in some experiments these behaviors were engaged in even when the subject had no knowledge that he or she was being observed. Moreover, as Crowne (1979) notes, the individual who goes to extremes of social desirability actually invites the contempt of other people. For these reasons, there may be defensive motives behind this disposition, involved in protecting one's own self-esteem. Thus, high social desirability subjects project themselves into their audience, and try to impress themselves as well as others with their goodness.

Traits, Traits, and More Traits

Authoritarianism, Machiavellianism, and social desirability are just a few of literally hundreds of different traits that have been studied by personologists working within the psychometric tradition. They are also three of the most popular traits, being both associated with long and respected research traditions, and represented in large numbers of textbooks. While one can only admire the ingenuity with which many of the experiments have been conducted, in the final analysis the research findings themselves are not important for present purposes. Rather, these three lines of research are cited as prime examples of how the validity of a trait construct is established, and of the work of trait psychologists in general. Beginning with some intuitions as to how people might differ, whether derived from common sense or formal theory, the investigator devises a test to measure these differences. Then, test scores representing generalized dispositions are related to behavior in particular situations, contrived in the laboratory. The results of such studies then serve as a basis for revising the theory and making it more precise. Observations of laboratory and real-world behavior provide evidence for the validity of the scale, but at the same time they help weave a nomological net around the theoretical construct. It is this scientific process, rather than the findings with respect to any particular trait, that we wish the reader to understand.

The task of erecting minitheories about individual trait constructs has proved much more attractive and rewarding for most investigators working in this tradition, compared to the alternative of defining a universally applicable matrix of personality traits. Having examined both approaches within the psychometric tradition, however, it is now time to take a critical look at the trait concept itself.