People are testing large language models (LLMs) on their "cognitive" abilities - theory of mind, causality, syllogistic reasoning, etc. Many (most?) of these evaluations are deeply flawed. To evaluate LLMs effectively, we need some principles from experimental psychology.🧵
Just to be clear, in this thread I'm not saying that LLMs do or don't have *any* cognitive capacity. I'm trying to discuss a few basic ground rules for *claims* about whether they do.
Why use ideas from experimental psychology? Well, ChatGPT and other chat LLMs are non-reproducible: without model versioning and random seeds, we can't rerun the exact same computation, so we have to treat them the way we treat "non-human subjects."

To make the analogy clear: a single generation from an LLM corresponds to one measurement from a single person. The prompt defines the task, and the specific question being asked is the measurement item (sometimes these are one and the same).
Flip it around: it'd be super weird to approach a person on the street, ask them a question from the computer science literature (say, about graph coloring), and then, based on the result, tweet that "people do/don't have the ability to solve NP-hard problems"!

So, principles:
1. An evaluation study must have multiple observations for each evaluation item! It must also have multiple items for each construct. With only one or a few items, you can't tell whether idiosyncrasies of those particular items drove the observed responses. See: www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/generalizability-crisis/AD38611...
Far too many claims rest on either a small number of generations (one screenshot!) or a small number of prompts/questions, or both. To draw generalizable conclusions, you need dozens of each at a bare minimum.
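To make that concrete, here's a minimal sketch of what such an evaluation loop might look like. The query_model function, the item fields (prompt, id, scoring_fn), and the default of 30 generations are all hypothetical placeholders, not any real library's API:

```python
import statistics

# Hypothetical helper: replace with a call to whatever model you are evaluating.
def query_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to the LLM under test")

def evaluate_construct(items, n_generations=30):
    """Score many items, each sampled many times, so that item-level and
    generation-level variability can both be estimated."""
    per_item_accuracy = {}
    for item in items:                      # multiple items per construct
        correct = 0
        for _ in range(n_generations):      # multiple generations per item
            response = query_model(item["prompt"])
            correct += int(item["scoring_fn"](response))
        per_item_accuracy[item["id"]] = correct / n_generations
    scores = list(per_item_accuracy.values())
    return {
        "mean_accuracy": statistics.mean(scores),
        "sd_across_items": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "per_item": per_item_accuracy,
    }
```

The key point is the two nested loops: one over items, one over repeated generations, with variability reported across both.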
2. Psychologists know that the way you frame a task shapes the answer you get. Children especially often fail or succeed based on task framing! And sometimes you get responses to a task you didn't intend to set (via "task demands").
srcd.onlinelibrary.wiley.com/doi/full/10.1111/cdev.12825
We all know LLMs are deeply prompt-sensitive too ("Let's think step by step..."). So why use a single prompt to make inferences about their cognitive abilities?
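A sketch of what prompt variation could look like, reusing the hypothetical query_model from above plus an assumed score(response, answer) function that returns 1 for a correct response and 0 otherwise; the framings and vignette structure are illustrative only:

```python
# Illustrative framings of the same (hypothetical) false-belief vignette.
FRAMINGS = [
    "{vignette}\n\nQ: {question}\nA:",
    "You are a careful reasoner.\n\n{vignette}\n\nQ: {question}\nLet's think step by step.",
    "{vignette}\n\nAnswer in one word: {question}",
]

def accuracy_by_framing(vignette, question, answer, n=30):
    """Run the same item under several framings; report accuracy per framing."""
    results = {}
    for i, template in enumerate(FRAMINGS):
        prompt = template.format(vignette=vignette, question=question)
        hits = sum(score(query_model(prompt), answer) for _ in range(n))
        results[f"framing_{i}"] = hits / n
    return results
```

If accuracy swings widely across framings, any single-prompt result says as much about the framing as about the underlying capacity.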
3. When you claim an LLM has X (ToM, causality, etc.), you are making a claim about a construct. But you are testing this construct through an operationalization. Good experimental psychology makes this link explicit, arguing for the validity of the link between measure and construct.
In contrast, many arguments about LLM capacities never explicitly justify that this particular measure (prompt, task, etc.) actually is the [right / best / most reasonable] operationalization of the construct. If the measure isn't valid, it can't tell you about the construct!
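One concrete (if partial) way to make that argument is a convergent-validity check: if two independently designed operationalizations of the same construct agree, that is some evidence the measures track the construct rather than quirks of one task. A toy sketch with made-up numbers, using statistics.correlation (Python 3.10+):

```python
from statistics import correlation  # Python 3.10+

# Made-up accuracies for several model variants on two different
# operationalizations of the same construct (e.g., two ToM task formats).
task_a = [0.92, 0.61, 0.78, 0.45, 0.70]
task_b = [0.88, 0.57, 0.74, 0.52, 0.66]

# If both tasks really measure the same thing, scores should track each other.
print(f"Pearson r between operationalizations: {correlation(task_a, task_b):.2f}")
```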
4. Experimental evaluations require a control condition! The purpose of the control condition is to hold all other aspects of the task constant while varying only the specific construct under investigation. Finding a good control is hard, but it's a key part of a good study.
For example, when researchers probe theory of mind via false-belief understanding, they often compare to control conditions like false-photograph understanding, which are argued to be similar in all respects except the presence of mental-state information: saxelab.mit.edu/use-our-efficient-false-belief-localizer
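A sketch of a matched-control analysis, again using the hypothetical query_model and item format from above; each pair holds a critical item (e.g., false belief) and a control item (e.g., false photograph) matched on everything except the construct of interest:

```python
from statistics import mean

def contrast_with_control(item_pairs, n=30):
    """Mean accuracy on critical vs. matched control items, plus the difference."""
    def acc(item):
        hits = sum(item["scoring_fn"](query_model(item["prompt"])) for _ in range(n))
        return hits / n
    critical = [acc(crit) for crit, _ in item_pairs]
    control = [acc(ctrl) for _, ctrl in item_pairs]
    # The claim of interest is the difference: performance on critical items
    # over and above matched controls, not raw accuracy on critical items alone.
    return {
        "critical": mean(critical),
        "control": mean(control),
        "difference": mean(critical) - mean(control),
    }
```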
In sum: minimal criteria for making claims about an LLM should be 1) multiple generations & test items, 2) a range of prompts (tasks), 3) evidence for task validity as a measure of the target construct, and 4) a control condition equating other aspects of the task.
All of these ideas (in the human context) are discussed more in our free textbook, Experimentology:

Addenda from good points in the replies: 5) evaluations should be novel, meaning not present in the training data, and 6) varying the parameters of the LLM matters, at least those you have access to (e.g., temperature).
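For the temperature point, a sketch of a simple parameter sweep; query_model is again hypothetical and is assumed here to accept a temperature keyword, as most chat APIs do, and the specific temperature values are arbitrary:

```python
def sweep_temperature(items, temperatures=(0.0, 0.7, 1.0), n=30):
    """Rerun the same evaluation at several temperatures to check robustness."""
    results = {}
    for t in temperatures:
        correct = sum(
            item["scoring_fn"](query_model(item["prompt"], temperature=t))
            for item in items
            for _ in range(n)
        )
        results[t] = correct / (len(items) * n)
    return results  # report whether conclusions hold across the sweep
```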