Exalting data, missing meaning
Science developed out of philosophy. Before we had Copernicus’s planetary tables or Newton’s equations of motion, we had Aristotle’s rhetoric. That medieval natural philosophy was wrong, but it still made useful predictions.
I’ve spent my career making things in the worlds of consumer products and education, and in both domains, I’ve observed a common failure mode in decision-making: an overriding obsession with data—with appearing scientific—and an associated repudiation of philosophy.
That’s a totally appropriate obsession in some fields, like manufacturing, transportation, and aviation. But in consumer products, education, and so many other domains involving the messiness of humanity, the data obsession falls prey to hidden errors and distorts our true goals. Worse, it deprives us of truly meaningful insights that are available via philosophy, intuition, and stories, but not yet fully explicable through quantitative systems.
This danger lurks in domains that have not yet been systematized. When it comes to people, we lack Newton’s equations of motion. Actually: we don’t even know what the equivalent equation should be measuring. Even if we did, we probably don’t have instruments that can measure those quantities. Can there ever be equations which can usefully describe these phenomena? I think so, but I don’t know we can even be sure of that.
Until we build more powerful explanatory theories of these domains, we must respect the role of philosophy and beware the dangers of playing scientist.
Measuring everything, measuring nothing
Let’s talk about test scores.
Have you ever gotten a fine grade in a class—say, calculus—but later felt like you didn’t really understand what was going on? That you can follow the steps you’d learned to solve problems like ones your class tackled, but you can’t explain why they worked or apply them in new contexts? This experience seems totally ubiquitous!
So: how much do you trust test scores for making decisions about educational systems?
In fields like education and design, we measure only indirect proxies: page clicks, time on site, changes in test scores, survey responses, and so on. Then we try to make decisions with those measurements.
This is like wiggling a rod, attached to a complex set of gears, attached to the thing we want to measure, which in turn is attached (in mysterious ways) to the thing we’re actually measuring.
When we don’t truly understand the mapping between what we really want to know and those proxies, we easily miss important consequences.
Hours of practice worksheets at Kumon might make a child great at arithmetic, but what does it do to their curiosity? To their desire to learn independently as an adult?
When Spotify opts you in by default to noisy push notifications (“the Beatles are now on Spotify!”), they might increase some engagement score, but they also annoy their users. That annoyance may not show up in any dashboard: maybe users keep using the service exactly as much, but when some PR fiasco blows up the following year, they’re less inclined to take Spotify’s side.
Because we don’t understand this mapping, we have to make many more guesses. And with every guess, there’s some chance that we see some result purely by chance. Statistical hypothesis testing only has meaning when you account for all the hypotheses you’ve tried.
Alternately, sometimes people don’t even make guesses, and they just go hunting in the data. If you go looking for patterns in a sufficiently large data set, you’ll certainly find some!
(”Significant” from XKCD by Randall Munroe)
We know that correlation doesn’t imply causation. Sometimes, you’ll find strong correlations by random chance—just like in the comic above: the data suggest that this shade of blue causes people to engage the most!
But it was just random noise, and if you recolor more elements in that shade of blue, you won’t actually make anyone happier, other than perhaps your own community’s navel-gazers.
Another danger is that your correlations may be masking more important underlying phenomena.
Say you want people to share their big news on your social network first. You can’t measure that directly, but you have a proxy metric: photos included with those posts have EXIF data indicating when they were taken. You decide you want to minimize the time between the photo being taken and the photo being shared.
To figure out how to proceed, you hunt for correlations in your logs of users’ behavior. Say you discover a strong correlation between how quickly a photo uploads and how likely it is that users share photos of big news immediately. You tell your engineers to focus on optimizing upload time!
You ship your optimized photo uploader… but you don’t see any benefits in the metric you were measuring. Turns out, you didn’t see this correlation by chance: you saw it because people with faster upload times can afford better cellular connections, which means they’re more likely to upload photos while they’re out and about, as opposed to waiting until they get to unmetered WiFi.
Photo upload time was itself just a proxy measure of the true root cause.
Even if we’re pretty sure we don’t have any hidden causes or consequences lurking, and we carefully account for all our hypotheses, we must remember that these are proxies we’re optimizing. As the situation varies, these proxies’ connection to your true goals may taper—or reverse!
Vitamin C can prevent disease, when taken in small quantities, if your diet already didn’t contain much of it. But that doesn’t mean you should take a hundred times as much seeking a hundred times the benefit (as double-Nobel-laureate Linus Pauling did): you’ll see no marginal benefit, and you’ll just excrete it all.
In the worst cases, fixation on these proxies can create perverse incentives. Say that you want to prepare students for a life of solving challenging problems. It’s true that minimizing missed days of school may help make that happen—but past a certain point, other factors will dominate.
If you optimize too aggressively for zero missed days of school, you may easily reverse the correlation, disrupting students’ family lives or creating an atmosphere which makes students resent their autocratic school.
If you make a product, total usage time might seem like a good proxy for customer joy. But if you take that metric too seriously, you’d be punished for making a change which would help a customer accomplish a given task in less time than before.
There’s a subtler issue with exalting data in these domains—one my research partner May-Li Khoe has patiently explained for me over and over again. If you try to design something with human meaning by steering toward maximum impact on business outcomes, you’re very likely to end up with little human meaning… which in turn will likely harm whatever business outcomes you’re measuring in the long term.
Similarly, “teaching to the test” sucks the fascination and participation out of classrooms in exactly the way you would expect.
This talk from Frank Lantz covers the issue wonderfully in game design (this quote at 33:30; my thanks to Bret Victor for the pointer):
The dilemma of quantitative, data-driven game design…. So here’s an analogy: Imagine you have a friend who has trouble forming relationships… “I don’t know what I’m doing wrong. I go on a date, and I bring a thermometer so I can measure their skin temperature. I bring calipers so I can measure their pupil, to see when it’s expanding and contracting…” The point is, it doesn’t even matter if these are the correct things to measure to predict someone’s sexual arousal. If you bring a thermometer and calipers with you on a date, you’re not going to be having sex…
So.
Imagine that two teachers have exactly the same measured impact on their class’s test scores. How likely is it that they have the same impact on creating empowered thinkers?
You decide to adjust some variable because in the past, it correlated highly with increased product usage. How likely is it that this change better solves a meaningful problem for the user?
Meaning without measures
We’ve seen that there are plenty of dangers in making decisions based primarily on indirect measures with hazy connections to our true goals. Yet clearly, great teachers and great designers do operate effectively in these unsystematized fields!
They have insight; they have intuition. These come from an internalized philosophy about the field, drawn from experience, observation, and stories. Yes, their philosophies are imperfect; and no, they can’t necessarily give you a set of calipers you can use to make your own decisions.
But if you ask about one student interaction in particular, or one product detail in particular, they can often explain in hindsight why their philosophy pushed them in one direction or another. Listen enough, and you might build some intuition of your own.
This is not just luck or some kind of confirmation bias—there’s an underlying consistency to these experts’ taste. It’s clearly visible even if neither you nor they can quantitatively describe how they’re doing what they’re doing. Great teachers sure do manage to consistently be great teachers, in a way others consistently recognize, even without a dashboard and A/B tests. Of course, we might have to watch for a while to see that an expert delivers insights consistently, as opposed to by chance—that’s why knowledge worker interviews are so hard!—but it’s clear that some experts’ ideas are more consistently successful than others.
That consistency is what meaning is.
How do you know your house exists? After all, you don’t experience it directly: your contact with it is mediated by all kinds of layers of fuzzy visual processing and your own faulty memory. It exists because it’s reliably where it was last time. It exists because when you’re inside it, you consistently see the same imagery, with shadow angles mediated as you expect by the seasons. It exists because others can talk to you about your house and say things, interpreted through a winding auditory system, that somehow match your own fuzzy perceptions. It exists because your fingers can feel the shape of the house number on your door, which matches the shape on that lease you remember signing long ago.
The same logic tells us that when an expert consistently makes decisions broadly regarded as successful, and can explain their philosophy with rhetoric that makes intuitive sense, there is probably a there there.
Your house is more systematized—we can precisely measure its height, draw blueprints, predict its mass—but society could still effectively talk about houses before we had any of those tools. Until we discover those tools (and the questions we want to ask with them!), all we’ve got is tradition, expertise, rhetoric, philosophy. If we listen with balanced skepticism and curiosity, those can be powerful tools themselves.
Pretending to measure meaning
I don’t need to preach so strongly. In practice, we usually can’t ignore domain philosophy and experts’ intuition anyway.
Meaningful philosophy is meaningful—so even if we say we’re throwing it out, our intuition often remains entangled with our decisions.
I see this all the time in product decisions. For instance, someone might believe that sign-up walls make for a bad product for a variety of philosophical reasons, but they justify this decision outwardly by pointing to some data from one product’s blog about their A/B test on the subject.
That data is not the reason they decided to ditch sign-up walls. It’s just the reason they’re giving to others (and often, themselves) about why they made the decision. This behavior represents a sort of homage to science… while simultaneously violating its core principles.
In the education space, people are very excited about growth mindset interventions. The rough idea: if you can persuade a child that intelligence can grow with practice and hard work (just like their muscles), then they’ll actually perform better in school.
The recent enthusiasm for interventions in this space follows a series of randomized controlled trials documented in studies by Carol Dweck and her team at Stanford. These interventions probably are effective! But: the field’s quantitative results are actually quite modest in effect size.
The studies alone can’t justify the magnitude of the excitement about this topic; that follows the magnitude of people’s pre-existing intuitive beliefs in these interventions. The problem is that when the education community talks about this topic, it primarily justifies growth mindset interventions with these studies.
This type of motivated reasoning corrupts the dialog around decision-making. We should use provisional data like this to support—not supplant—our philosophies.
When two people disagree philosophically about an issue in an unsystematized domain, but allow only quantitative arguments, they end up fighting a proxy war through data weaker than their own beliefs. Worse: if we ever do invent powerful predictive systems for these fields, we’ll need our scientific wits about us, unsullied by post-hoc spin.
I hope it’s clear that I’m not arguing for us to generally abandon data and systematic thought. This scientistic obsession is a reasonable defense mechanism! After all, before precise measurement, physicists used to debate with rhetoric, and we ended up with phlogiston (i.e. things burn because they contain an element called phlogiston; phlogiston is lost to the air when a thing burns; things can’t burn inside a jar because that air can’t absorb any more phlogiston).
It’s in fields without reliable systems that we can’t measure our way to understanding.
Building systems in those fields is a critical project, and progress can be made. Meta-analysis and multitrait-multimethod tests have certainly helped us lay some foundations. Yet while fields’ systems are under construction, we must be careful not to put too much weight on them. They’re not yet structurally sound.
Intuition, philosophy, and expertise deliver all kinds of useful tentative explanations. If we monitor their predictions over time, we’ll discover limitations, and our theories will evolve. All the while, we’ll spot patterns, incorporate provisional systematic concepts, and fluidly evolve our beliefs, taking the best evidence however it comes.
Joy, belonging, and empowerment may live in this figure’s “qualitative black box,” but we can still produce explanations for how they arise. Those explanations may well involve measurable inputs and outputs. But if we insist on explaining joy through, say, engagement time and Net Promoter Scores, we’ll get exactly as much joy as we deserve.