July 20, 2009
Psychological Science: The [Non-]Theory of Psychological Testing – Part 2
“Psychological Science: The [Non-]Theory of Psychological Testing – Part 1” can be found HERE.
Q & A
Q. If Psychological Test Theory (PTT) is not a theory but a tautology, then what should be substituted in its place?
A. How about replacing it with a scientific, or observational, theory? *
The story so far
With one exception, no modern science begins with the assumption, explicit or implicit, of the reality of Plato's World of Ideal Forms. The exception is testing and measurement in the social sciences, particularly psychological or mental testing. What is not appreciated by many, if not most, social scientists is that PTT assumptions like True Score, or Latent Trait, are not a kind of literary dramatic license that lends weight and impact to the narrative. From the point of view of the philosophy of science, they are indistinguishable from Plato's Ideal Forms, and have no place in modern science.
Mathematical argument is found in all modern science. The social sciences are no exception. Scientists use mathematical argument in three ways:
- It is used as a way to analyze, understand, and communicate data from observation;
- Mathematical argument helps one hypothesize about data not yet observed; and
- As a supplement to the first two uses, properties of mathematical inventions and constructions serve as convenient substitutes for the undetermined properties of observed or hypothesized data.
PTT, however, tends to use mathematical inventions and constructions not as a supplement to mathematical argument based on observation, but as a near-total substitute for it. This is the tradition handed down to Western civilization from Pythagoras, one that is both praised and lamented by Carl Sagan in his book and Public Broadcasting Service (PBS) video series, “Cosmos”.
“Pythagoras...developed a method of mathematical deduction.... The modern tradition of mathematical argument, essential to all of science, owes much to Pythagoras.” Pp. 149-150.
“In the recognition by Pythagoras and Plato that the Cosmos is knowable, that there is a mathematical underpinning to nature, they greatly advanced the cause of science. But in the suppression of disquieting facts, the sense that science should be kept for a small elite, the distaste for experiment [emphasis mine], the embrace of mysticism..., they set back the human enterprise.” P. 155.
The great contributions of Peter Abelard (1079 CE - 1142 CE) to philosophy, and Galileo Galilei (1564 CE - 1642 CE) to science and the process of science, freed Western man from the Plato-bound Scholastics of the Church who believed in the reality of Ideal Forms, and that knowledge was yours just for the thinking without the need to observe. Ideas had no separate existence from man's ability to create them, conceptualize them, and communicate them to others. The arbiter of all truth about nature – the world we experience – would be science.
What makes PTT a tautology?
I thought you'd never ask. A tautology is a self-consistent logical system, wherein all statements are true. It's also called circular reasoning. Here's a very good example from a text of religious instruction for a Christian faith community. The instruction is in the form of questions and answers that were memorized by believers:
Q. Who made me?
A. God made me.
Q. Who is God?
A. God is the infinitely supreme Being who made all things.
What makes this a tautology is that the logic is circular and self-consistent. God is defined in terms of creation, and creation is defined in terms of God. All statements within a tautology are always true: God made me; and, God is the infinitely supreme Being who made all things. Another example of a tautology is Freud's concepts of Eros and Thanatos, the creative life force and the destructive death force, respectively, in the psyche of the individual. Each concept, Eros and Thanatos, is defined by the absence of the other. If you are a creative artist in the throes of productive output, you have a lot of Eros and your Thanatos is low. If your latest showing at the Museum of Modern Art, in New York City, got a bad review in the New York Times, and you committed suicide, your Thanatos flared up and your Eros dropped off to nothing.
What makes modern PTT a tautology is captured in the title of one of the most influential texts on the subject, Frederic Lord's and Melvin Novick's (1968) book on Classical Test Theory (CTT), with four chapters on Item Response Theory (IRT) written by Allan Birnbaum: "Statistical Theories of Mental Test Scores." It is a statistical theory, not a scientific theory, nor an observational theory. By definition, a statistical theory of mental testing is a tautology. PTT is captured in a closed, self-consistent, logical system of assumptions and derivations. Within a statistical theory of mental testing, all statements are true.
Among the elements of PTT are various distributions, of which the most easily recognized is the Euler-Gauss Normal Distribution. There is also a good supply of parameters and assumptions about the properties of parameters. Let's take a look at the Euler-Gauss bell curve. The shape of the familiar curve comes from plotting the values of the left side of a very interesting equation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation,
and if we assume a mean of 0 and a standard deviation of 1, the equation reduces to:
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$$
and the only variable is x. There are two constants, π and e. Here e is Euler's number, the base of the natural logarithm (there are actually two constants that go by Euler's name; the other, usually written γ and also called the Euler-Mascheroni constant, is a different number). [How about a bit of trivia? What is the relationship between e and Google's IPO, Initial Public Offering?] The problem with most social scientists is that they assume the Euler-Gauss formula is derived from nature, that it is based upon data from systematic observation, and that it reflects nature. It is a purely mathematical invention and construction that has practical utility for some areas of science, BUT IT IS NOT A DESCRIPTION OF NATURE. It appears to social scientists to represent reality because some observed distributions, like height, shoe size, and SAT scores, look like a normal curve.
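To make the two formulas concrete, here is a minimal sketch in Python. The function names are hypothetical, chosen for illustration. It evaluates the general Euler-Gauss density and the reduced form with mean 0 and standard deviation 1, using `math.pi` and `math.exp` to supply the constants π and e. It then checks numerically, with a simple trapezoid rule, that the area under the standard curve is 1. Nothing in it depends on data: every value follows from the formula alone.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """General Euler-Gauss density with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))  # math.exp(y) is e**y

def standard_normal_pdf(x):
    """Reduced form with mu = 0 and sigma = 1: x is the only variable."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def area(f, lo=-10.0, hi=10.0, steps=100_000):
    """Trapezoid-rule approximation of the area under f between lo and hi."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))

print(round(standard_normal_pdf(0.0), 4))   # peak height 1/sqrt(2*pi) -> 0.3989
print(round(area(standard_normal_pdf), 6))  # total area -> 1.0
```

The integration limits of ±10 standard deviations are an arbitrary choice. The area beyond them is far too small to affect the printed result.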
The reason many sciences use the normal distribution is that its properties are already known. Why waste time determining, empirically, the properties of the distribution of scores on a test of depression? If it looks like a normal distribution, then, what the hell – let's use the properties of the mathematical invention and save ourselves a lot of time. This doesn't bother the mathematician or the mathematical statistician. Nor should it. It doesn't bother the social science researcher, either, for completely different reasons: for the most part, social scientists do not understand that PTT is a tautology, let alone appreciate the limitations this places on PTT as a science.
A reader commented on Part 1 of this article, in response to my observation that the normal curve is not a representation of nature, that I should look up the Central Limit Theorem (CLT). The CLT states that, for independent random samples drawn from a distribution with finite variance, the distribution of the sample means tends toward a normal distribution as the sample size grows. Here's the problem with that comment. The CLT is part of the larger statistical theory, a tautology, that forms the basis for PTT. Since every statement of a tautology is always true, the reader was trying to prove his assertion of the ubiquity of the normal distribution in nature by asserting another element of the same tautology. That which is tautologically true cannot be assumed to be a result of empirical discovery, no matter its common-sense appeal, its long-time use, the prestige of the authors who teach it, or the sales of their books. I am quite confident that those authors would agree with much of my discourse. Their intellectual caution and reservations, however, do not seem, in my personal opinion, to have infiltrated the graduate and undergraduate curricula of educational and psychological testing.
Here's a more comedic example from an old Abbott and Costello routine. Lou Costello, accompanied by Bud Abbott, goes into a bank to apply for a loan. The bank manager asks for identification to prove that he is Lou Costello. He doesn't have any identification with him. The bank manager says he can't begin to consider him for a loan until he can prove he is Lou Costello. Costello turns to Bud Abbott and says, “Tell the bank manager who I am.” Abbott speaks directly to the bank manager and says, with great moral force and confidence, “This man is Lou Costello.”
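What the Central Limit Theorem asserts, as the reader invoked it above, can be illustrated with a short simulation. This is a minimal sketch under stated assumptions:

- Python's `random` module supplies the draws.
- The parent distribution is an exponential one, strongly skewed and nothing like a bell curve.
- The function names are hypothetical, chosen for illustration.

The sketch measures the skewness of sample means. A normal distribution has skewness 0, so the theorem predicts the skewness of the means will shrink toward 0 as the sample size n grows. The draws come from a pseudo-random generator, itself a mathematical construction. The sketch therefore shows what the theorem claims, not an observation of nature.

```python
import random
import statistics

def sample_means(n, reps, rng):
    """Return `reps` sample means, each the mean of n exponential(1) draws."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

def skewness(xs):
    """Population skewness: 0 for a perfectly symmetric distribution such as the normal."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

rng = random.Random(2009)  # fixed seed so runs are repeatable
for n in (1, 5, 50):
    print(f"n = {n:2d}: skewness of the means = {skewness(sample_means(n, 10_000, rng)):.2f}")
```

For exponential draws, the theoretical skewness of a mean of n values is 2/√n. The printed values should therefore fall roughly from 2 toward 0.3 as n goes from 1 to 50, give or take sampling noise.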
The limitations of PTT
On the level of common sense observation and utility, the normal curve APPEARS to be in line with many distributions that are observed in nature. However, the normal curve is one of a number of distributions used by PTT. There are distributions for both continuous variables and discrete variables. Each distribution curve APPEARS to approximate at least some distributions found in nature. All of these distributions, though they may APPEAR to mimic some aspect of nature, are purely mathematical constructions, and are not derived from systematic observation. These distributions, and the tautologies they rode in on, may have utility but they do not describe reality in the natural world.
Here's an example from my own experience. It's a little bit of a shaggy dog story, but it is relevant, so please indulge me. Early in my IBM career I worked on the computer simulation of pedestrian and automobile traffic for the sprawling IBM complex in Poughkeepsie, NY. It had manufacturing facilities, research and development laboratories, administration buildings, programming centers, and monstrous computer operations. We had about 10,000 people in the main complex on three shifts. Everyone commuted by car, as there was no mass transit serving the facility. We had scores of buildings, and hundreds of entrance and exit doors for employees. There were huge parking lots, many access and exit points, several miles of intra-complex roads, and guaranteed traffic jams at the end of the workday for first shift employees. The General Manager wanted to change the staggered start-stop times for all employees on all shifts, reduce the 45-minute lunch period to 30 minutes (allowable by New York State law at the time), and have all the first shift operations ended by 3:30 pm. What I am about to report is the literal, absolute truth about why the General Manager was taking these actions, though it was known to only a few of us. He was an avid motorcyclist in good weather, and a snowmobile enthusiast in the winter. He wanted to leave his office by 3:30 on Fridays, and head to his personal retreat and playground in the mountains. The GM wanted to be sure that he was not creating traffic delays that were worse than the current situation.
One of the first things we had to do was determine the patterns of exit behaviors on the part of departing employees for all 200 plus exit doors on all three shifts for the whole friggin Poughkeepsie complex. When I say patterns of exit behaviors, I mean the frequency distributions associated with exiting employees following their stop-time. If your shift ended at 4:12 pm, along with all 127 people in your corner of the building, you would 'punch out' no earlier than 4:12 pm and then make a dash for the exit door. We trained observers to stand at the exit doors with stopwatch and clipboard. We had no prior data on this phenomenon, and no idea what the frequency distributions would be like. So we decided not to sample. Instead we collected data for every friggin door, for every friggin stop-time, for every friggin building, for every friggin shift, for every friggin day of the week. Considering the consequence to the GM, and to us (under the principle that shit flows downhill), should he make the wrong decision and produce a worse traffic problem, we had to do as complete a job of data gathering as possible.
Siméon Denis Poisson (1781-1840)
Now let's get back to the point at hand. We need to know the shapes of the frequency distributions of exit behavior of the employees. The software, General Purpose Simulation System (GPSS), could accept empirically derived data (from observation), or could use one of many idealized frequency distributions (like the normal curve) and save us a lot of programming time. When we started plotting all the friggin data, by friggin this, and by friggin that, we noticed something consistent in almost all of the distributions. The greatest frequency of exits was in the first few minutes following the formal stop-time, with only a trickle of people exiting after 6 or 7 minutes. The project manager, an industrial engineer with good training in statistics, realized they all LOOKED LIKE Poisson distributions, not normal distributions, not Chi-square distributions, not flat distributions, not bimodal distributions. They LOOKED LIKE Poisson distributions. So he said the hell with inputting empirical data, we will use the Poisson function in GPSS to generate the distributions of exit behavior for all employees under all conditions.
This example makes three important points:
- The Poisson distribution was not reflecting nature, nor was it derived from nature. We observed nature, and on a rational basis, decided to use the convenience of a known frequency distribution with known properties. This is the exact same process we use in deciding to use a normal distribution; not because the normal distribution is derived from nature, but because our observations from nature appear to us to look like a normal distribution.
- No statistical theory could have predicted the shape of the frequency distribution associated with employee exit behavior – a never before observed phenomenon.
- The use of an idealized distribution, the selection of which is rationally determined, rather than using an empirically derived distribution, can have high utility.
Our computer simulations, using far more data than exit door distributions, showed that lines of cars at intersections would be longer, but that more cars would be processed through the traffic patterns in a given time period. The conclusion from the simulation was that the overall length of time for any car to exit the complex was unchanged. Staking our next raises on our findings, we recommended the GM proceed as planned. We got our next raise and the GM left early on Fridays.
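The project manager chose an idealized distribution because the observed ones LOOKED LIKE it. That reasoning can be sketched in code. This is a minimal sketch under stated assumptions:

- The variable is the whole number of minutes between an employee's stop-time and passing through the exit door.
- The function names are hypothetical.
- Python stands in for GPSS; nothing here reproduces GPSS's own Poisson function.

The sketch does three things:

- It estimates λ from observed delays. For a Poisson distribution, the maximum-likelihood estimate is simply the sample mean.
- It computes a Pearson chi-square statistic as one way to put a number on "looks like."
- It draws Poisson variates with Knuth's multiplication method, the kind of input a simulation would use in place of the raw tables.

```python
import math
import random

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def estimate_lambda(delays):
    """Maximum-likelihood estimate of lam for Poisson data: the sample mean."""
    return sum(delays) / len(delays)

def chi_square_statistic(delays, lam, max_k):
    """Pearson chi-square comparing observed counts of k = 0 .. max_k - 1,
    plus one pooled bin for k >= max_k, against Poisson(lam) expectations."""
    n = len(delays)
    observed = [0] * (max_k + 1)
    for d in delays:
        observed[min(d, max_k)] += 1
    expected = [n * poisson_pmf(k, lam) for k in range(max_k)]
    expected.append(n - sum(expected))  # expected count in the pooled tail bin
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def sample_poisson(lam, rng):
    """Knuth's multiplication method; simple and adequate for small lam."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(1962)  # fixed seed so runs are repeatable
# Hypothetical stand-in for field data: 1,000 delays drawn from Poisson(2.0).
delays = [sample_poisson(2.0, rng) for _ in range(1_000)]
lam_hat = estimate_lambda(delays)
print(f"estimated lambda: {lam_hat:.2f}")
print(f"chi-square, bins 0..5 and 6+: {chi_square_statistic(delays, lam_hat, 6):.2f}")
```

λ = 2.0 is an arbitrary illustrative value, not a figure from the Poughkeepsie study. With seven bins and one estimated parameter, the statistic is conventionally compared with a chi-square distribution on 5 degrees of freedom, whose 5% critical value is about 11.07.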
The seeming ubiquity of the Normal Distribution may be more apparent than real; it may be more artifact than substance. Students of PTT, developers of mental tests, and practitioners learn how to create testing procedures and materials that are more likely to produce normal distributions in the data they collect. It is not a function of nature, nor an accident, that SAT scores – and like measures of aptitude, abilities, and achievement – look like the ideal Euler-Gauss bell curve. Test developers work hard to produce measurement tools that yield normal curves. Is this a dishonest manipulation? No. They are simply justifying the use of the properties of the Euler-Gauss distribution for their data. As a ready-made substitute for a time consuming and costly empirical investigation of the properties of raw score distributions, the normal curve can't be beat. It has the added benefits of keeping things standardized and predictable, of being more easily communicated to constituents, and of maintaining continuity with history.
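A score scale can be made to look normal, whatever the shape of the raw scores, by a normalizing (area) transformation. Each raw score is converted to a percentile rank, and then to the z-score with that same percentile on the standard normal curve. The sketch below is a minimal illustration of that technique, assuming Python's `statistics.NormalDist`. The function name and the 500/100 reporting scale are hypothetical choices. This is not a claim about how any particular test, the SAT included, is actually scaled.

```python
from statistics import NormalDist

def normalized_scores(raw, mean=500.0, sd=100.0):
    """Map raw scores onto a normal scale through mid-percentile ranks.

    Each distinct raw score gets the z-value whose area under the standard
    normal curve equals its mid-percentile rank, then z is rescaled to mean/sd.
    """
    n = len(raw)
    z_for = {}
    for value in set(raw):
        below = sum(1 for r in raw if r < value)
        ties = sum(1 for r in raw if r == value)
        rank = (below + 0.5 * ties) / n  # always strictly between 0 and 1
        z_for[value] = NormalDist().inv_cdf(rank)
    return [mean + sd * z_for[r] for r in raw]

# A lopsided, hypothetical set of raw scores: most people near the bottom.
raw = [2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 9, 15, 30]
print([round(s) for s in normalized_scores(raw)])
```

The mid-percentile rank (counting half of the tied scores) keeps every rank strictly between 0 and 1, where `inv_cdf` is defined. However skewed the raw list is, the output spacing follows the bell curve, not the raw data. The tie counts are recomputed for each distinct score, which is fine for a sketch but slow for large samples.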
However, in my forty years of social science research, the typical distributions of actual research data, collected in the field, have been anything but predictably normal. Our get-out-of-jail-free card was, in some cases, the Central Limit Theorem. Sometimes we copped a reprieve by using the F-test, on the grounds that it is relatively robust in the face of non-normal distributions. Where did we get the notion that the F-test is robust? You guessed it. It came from the same tautological statistical theory as the Central Limit Theorem.
The most serious limitation of a statistical theory (tautology) of mental testing is that it cannot predict the properties of the distribution of scores of a hypothesized, but not yet observed, aspect of behavior or mental functioning in animals and humans. This alone should inform the social scientist that PTT is NOT a scientific or observational theory. Let's examine what we mean by a scientific theory. First, we need to draw a distinction between the common usage of the word theory and the term scientific theory. In common usage, theory relates to an educated guess, a hypothesis, a conclusion not well supported, a hunch, and the like. In common usage, theory has little to do with facts, or, at best, with incomplete, even contradictory facts. A scientific theory is a coherent narrative of some aspect of nature that is based upon facts from systematic observation; it can accommodate new data; it can predict events that were previously unknown or never observed; from new data it can be modified, extended, limited, made more general, or even completely thrown out.
Evolutionary biology is a very good example of a scientific theory. It is a coherent story based on facts; it applies to new data; it can predict transitional forms; and it can be supported, modified, or disproved by new facts. None of this applies to a statistical theory of mental testing. None of this applies to any tautology. [Evolutionary biology is often criticized, dismissively, with the comment that it is only a theory. Such a remark betrays a lack of scientific knowledge, and no understanding of the distinction between the common usage of theory and the correct understanding of a scientific theory.]
Next time
In my third and final installment on PTT, I'm going to discuss how today's PTT is creating its own reality in the behavior and thinking of test developers, practitioners, and those who are tested; how society is being drawn into the tautology and perpetuating the illusion that we are observing nature; and how we might move toward a scientific or observational theory of mental testing. Please come back on August 17 and continue to comment to your heart's content.
Posted by Norman Costa at 12:45 AM

Comments
Ahh... Carl Sagan was a God! What have we got left now? Michio "Media Whore" Kaku...? Yay, Discovery Channel.
Posted by: hidflect | Jul 20, 2009 1:39:27 AM
Thank you, Norman -- this is fascinating. Will your detractors be back? I'll be watching...
Posted by: Elatia Harris | Jul 20, 2009 1:58:33 AM
Norman,
An interesting and provocative article. Some (hopefully constructive) questions and criticisms:
-> Is your rejection of the Central Limit Theorem objection more or less tightly restricted to psychological testing, or does it apply to all of science and social science? If the former, how so? Certainly the dismissal that CLT is "tautologous" is pretty sweeping. If the latter, my sense of the significance of your remarks is much diluted - there are always people who get too fastidious about statistical niceties, particularly when riding hobby-horses :)
-> The text makes the use of particular distributions (gaussians, F-distributions, poissons etc) seem entirely a matter of caprice and whimsy - some fellow sitting at a chair arbitrarily picks one or the other, and there it is, a model is born! You emphasize that the distributions are all purely mathematical in spirit, but those particular mathematical distributions are also valued because they each describe different kinds of data from a variety of sources. [This is, perhaps, your left-over baggage from part 1 - it's all very nice to dismiss Plato and Pythagoras as naive, but at the end of it you still must account for the observation that the math works, unaccountably (?) well from your perspective.]
The gaussian perfectly models the unbiased coin, which statement isn't tautologous; you're not "sneaking in" the gaussian by hand. We know to associate physical and geometrical features of the coin to the fact that the gaussian describes its tosses. In fact I could pretty much tell you how to make a coin to have its tosses fit a variety of different distributions! Effectively, we can say what it is about the coin that makes the gaussian work. The Poisson perfectly models the radioactive decay. We understand what it is about decays that makes the Poisson a good description - that nuclei don't "remember" not having decayed already. In fact,
-> "When I say patterns of exit behaviors, I mean the frequency distributions associated with exiting employees following their stop-time. If your shift ended at 4:12 pm, along with all 127 people in your corner of the building, you would 'punch out' no earlier than 4:12 pm and then make a dash for the exit door."
Given this description, wasn't Poisson the obvious starting point? The process sounds pretty memory free...
-> You mention that psychologists see that the data seem to fit a distribution, and from that point on use the mathematical distribution itself. Surely this is something goodness-of-fit measures are meant to characterize. I've never seen results stated that didn't state something like an R^2 - even Microsoft Excel does one! Also, this state of affairs isn't restricted to psychology (or testing) by any means. Everyone uses distributions that they have reason to think should work, and which look like they do, instead of using only tables of observed data. Nor is this mere time-saver - would you have us just ignore the fact of fluctuations and "build in" happenstances of noise into future theorizing and model-building?
Look forward to your responses!
Posted by: D | Jul 20, 2009 5:31:50 AM
Interesting. I'm not a scientist, so I may just be missing the point, but in the first sentence of the third from last paragraph, shouldn't that say "but not yet observed"?
Posted by: William | Jul 20, 2009 9:10:41 AM
You have a good point, but don't overstretch your case.
"The reason many sciences use the normal distribution, is that its properties are already known."
Many distributions in Nature _do_ turn out to be normal, and the reason comes from a unique and unexpected property of the normal distribution, captured in the Central Limit Theorem.
The Central Limit Theorem says: take any statistical distribution at all - no matter how regular or irregular its shape. Now add another distribution identical to the first, and calculate the distribution of the sum. Repeat over and over...as you add up more and more copies of your arbitrary distribution, the sum starts to look more and more like a normal distribution.
Posted by: Josh Mitteldorf | Jul 20, 2009 10:22:52 AM
^^
You forgot to mention independent wrt CLT. Sorry, but electrical engineers have the mnemonic iid (independent and identically distributed) permanently imprinted in their memories.
Posted by: Kumar | Jul 20, 2009 11:17:42 AM
This is ridiculous.
Psychologists are (mostly) well-aware that there's no intrinsic reason why most things that they measure should be normally distributed. Some of them are not even close (reaction times). Some of them are close enough that it's worth making the assumption.
All psychologists who build mathematical models are aware of two things: (1) all models are wrong, (2) some models are useful.
If assuming the gaussian distribution of noise on your measure allows you to build a model that predicts future results and build intuition pumps for the underlying theory, then that's useful.
Two more points. First, some psychological models are based on neural computations. We know a lot about the statistical properties of neurons and assemblies of neurons (even if we don't really know what they're representing, for the most part). In that case, there may be perfectly good reasons for choosing a particular distribution of the data. Second, there is an increased use of nonparametric statistics to compare arbitrary distributions using methods like the permutation test. Unlike generalized linear models, these tests don't say anything about the underlying models; they just tell you if two sets of data are probably different or not. Less useful, but no assumptions to violate.
I'll be looking forward to the third part of this series, but I still think that its reasoning is totally bogus.
Posted by: Harlan | Jul 20, 2009 1:51:37 PM
Harlan, I'm wondering if it wouldn't serve your points just as well to frame them in terms of disagreement, even strong disagreement. You are voicing a majority opinion here, regarding the matter you bring up -- that all models are both wrong and useful -- not a thunderclap of fresh insight. Nothin' wrong with that. But modifiers such as "ridiculous" and "bogus" go down better in the realm of private opinion than they do in actual discourse. Thanks!
Posted by: Elatia Harris | Jul 20, 2009 2:41:10 PM
To amplify on what Josh said, with a few more strengthening assumptions, you can do away with the requirement that the distributions added up are the same. Intuitively, you need to ensure that no one underlying probability distribution in the sum "swamps out" the rest. When that works out, you can add a large number of **distinct** distributions and still get a normal distribution out of the sum. Gaussians are like good horror movie villains...very hard to kill them dead.
Posted by: D | Jul 20, 2009 3:01:30 PM
I think I know where you're going with this and I believe there is something to what you say, but you misconstrue how most social scientists apply the normal distribution and the Central Limit Theorem. We generally use these in connection with sampling theory, where they do indeed describe the real world and can be, and have been, subjected to innumerable real empirical tests. Of course we are talking about probability, which rests uncomfortably with positivism, so actual data are never exactly normal, but it is nevertheless legitimate to predict that they will be approximately normal under many circumstances. Too long for a comment but:
The normal distribution is the limit of the binomial expansion. It describes -- and you can test this empirically -- the distribution of heads and tails, for example, in any arbitrarily large number of coin flips. The event doesn't need to have 50/50 probability to produce a normal distribution either, it just needs to be binary. Many natural phenomena are generated by some combination of binary possibilities, hence normally or approximately normally distributed. It's not just an intellectual construct, it's real.
As for the CLT, it tells us the confidence we should have in observations based on probability samples. Unlike you, we can't usually stand at every door 24/7, we have to pick a sample. So the CLT gives us the confidence intervals for our observations. Again, empirically testable, and it works. Yes, it really does describe true facts about the real world, and you can prove it experimentally.
Posted by: cervantes | Jul 20, 2009 4:11:57 PM
Elatia, it's perfectly fine to label things "ridiculous" or "bogus" when you describe why they're ridiculous or bogus. I don't understand your obsession with not hurting anyone's feelings (if feelings are even being hurt). I think we're all grown-ups here.
Posted by: billy | Jul 20, 2009 4:54:58 PM
Billy, it's not about hurting feelings -- although if it were, that would be no bad obsession. It's more about disagreeing without excessive discourtesy. When you think about it, attitude is not more powerful than facts, and if you have a fact-based point to make, it will speak for itself. A certain amount of courtesy enables it to speak more powerfully. To start your comment with "This is ridiculous..." is to tip off readers that you are coming from opinion -- not so interesting when the subject is statistics.
Posted by: Elatia Harris | Jul 20, 2009 6:52:44 PM
Sagan’s criticisms of Pythagoras and Plato could apply to the history of English composition studies (formerly Aristotle’s rhetoric) as well. Traditionally based on assumptions, ideal forms, circular reasoning, “distaste for experiment,” and “embrace of mysticism,” only in recent decades has composition research (some of it) shifted toward observation and actual controlled studies. Still, much of the discussion about writing and the teaching of writing relies on old stand-bys and new-age-y advice. (For an Abbott and Costello routine about circular reasoning in English class, check here.) Also, I see parallels between your posts on PTT and two books I just read. In Don’t Sleep, There Are Snakes, Daniel L. Everett takes on Chomsky’s “universal grammar” in a Sagan-like fashion, and Bart D. Ehrman, in Jesus Interrupted, challenges tautologies related to the Bible. In addition, Eliezer Yudkowsky writes about things like phlogiston and elan vital and how for a long time both were accepted without question as valid explanations.
Thank you for these posts and for your informative and entertaining writing style. I look forward to the next one.
Posted by: Jack Frost | Jul 20, 2009 11:08:12 PM
I have a problem with simple arithmetic: although we are always told that 1+1=2, when I eat a chicken the equation becomes:
me + chicken = me
Ok, I know there are all kinds of biochemical and energetic equations to describe my digestion, but try telling that to the chicken, who has effectively been annihilated. In maths, numbers cannot be annihilated. In the real world, though matter and energy are conserved, the things we count often disappear.
So, while it's all very well to count things, we should be aware that we constantly have to "fiddle the books" whenever events in the real world deviate from the mathematical description.
Posted by: aguy109 | Jul 21, 2009 2:04:08 AM
Colleagues, friends, family, and secret admirers:
The purpose of this post is to beg your indulgence in not responding to your comments, suggestions, and questions. I am engaged in my summer volunteering for the Summit Music Festival (summitmusicfestival.org). I need another day to finish setting up the computer network for the staff and faculty. Donated old computers, a Linux server, Win XP, routers that don't route, and ancient printers are a challenge. Almost there.
All the student concerts, competitions, and master classes are open to the public. The faculty performances require tickets at prices you can't refuse. Hearing these young people from all over the globe, and witnessing their dedication to their instruments are more than reward enough. I hope you consider dropping by over the next 3+ weeks.
Posted by: Norman Costa | Jul 21, 2009 10:27:50 PM