The other day I ordered a ridiculously spicy bowl of ramen[1]. It was far too hot for me to finish. I managed to eat half of it, but that was a painful experience, and I felt awful for the next few hours.
This was an unusual experience for me. I love spicy food, and it's far more common that (1) I ask for the spiciest option, (2) I'm asked "are you sure" by a server, (3) I say yes, and then (4) I find that the food isn't quite as spicy as I’d like.
The ramen went badly even though I played things somewhat conservatively: I asked for the second spiciest ramen (at most restaurants I will ask for the spiciest option), was asked if I was sure, said yes... and then found that it was too spicy for my mouth (and stomach) to handle[2].
What happened to me with this bowl of ramen happens everywhere: people make choices without a clear, well-calibrated scale to guide them. And while my mistake only cost me a few hours of discomfort, in other domains, similar choices can have much bigger consequences.
Calibration Problems
My spicy ramen experience is an example of a calibration problem. The restaurant tried to communicate how spicy each option is, but there is no common scale, and thus it's very difficult for me to know what one pepper, two peppers, three peppers, "Kilauea" (the option I chose) or "Ghost" (the spiciest option) actually mean.
Given the right data, this would not be a technically challenging problem to solve.
If an app or service had access to my spicy-food eating history, the spicy-food eating history of others like me, and the experiences of people in that restaurant, the app likely would know of at least a handful of people "like me" who had ordered the Kilauea ramen and struggled with it. It could then have guided me toward a better decision. In this case, that would have meant a slightly less aggressively spicy option, but in other cases, the spiciest option on the menu would have been fine.
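As a rough sketch of how such a recommendation might work: look up diners whose heat tolerance is close to mine, check how often they struggled with each dish, and suggest the spiciest dish whose "struggle rate" is acceptable. Everything here is an assumption made for illustration: the 0-10 tolerance score, the `Order` record, the `window` and `max_risk` thresholds, and the sample orders, which are invented rather than real data.

```python
from dataclasses import dataclass


@dataclass
class Order:
    diner: str
    dish: str
    tolerance: float  # hypothetical heat-tolerance score for the diner, 0-10
    too_spicy: bool   # did this diner struggle with the dish?


def struggle_rate(orders, dish, my_tolerance, window=1.0):
    """Share of diners with a similar tolerance who found `dish` too spicy.

    Returns None when no comparable diner has ordered the dish.
    """
    similar = [
        o for o in orders
        if o.dish == dish and abs(o.tolerance - my_tolerance) <= window
    ]
    if not similar:
        return None
    return sum(o.too_spicy for o in similar) / len(similar)


def recommend(orders, menu, my_tolerance, max_risk=0.5):
    """Return the spiciest dish on `menu` (listed mild -> hot) with an
    acceptable struggle rate among similar diners, or None if there is none."""
    choice = None
    for dish in menu:
        rate = struggle_rate(orders, dish, my_tolerance)
        if rate is not None and rate <= max_risk:
            choice = dish
    return choice


# Invented sample data, for illustration only.
menu = ["Three peppers", "Kilauea", "Ghost"]
sample_orders = [
    Order("a", "Three peppers", 8.0, False),
    Order("b", "Kilauea", 8.5, True),
    Order("c", "Kilauea", 7.5, True),
    Order("d", "Kilauea", 8.0, False),
    Order("e", "Ghost", 8.2, True),
    Order("f", "Kilauea", 3.0, True),  # too different from me to count
]

print(recommend(sample_orders, menu, my_tolerance=8.0))  # Three peppers
```

With this made-up history, two of the three comparable diners struggled with the Kilauea level, so the sketch steers a diner with a tolerance of 8.0 one step down the menu. A real service would need far more data and a better notion of "people like me" than a single tolerance number.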
It goes without saying that better modeling the level of spiciness in one's food is neither the world's most pressing problem, nor the world's largest business opportunity.
However, calibrating properly is immensely valuable across many domains. Many areas don't have a clear or obvious scale, leaving us with the equivalent of the inconvenient and inexact one pepper, two peppers, or three peppers.
Calibration Has Increased
There are efforts at calibration in many industries. Hotels were first classified by stars in Switzerland in 1979, with ★ denoting a “tourist” hotel and ★★★★★ denoting a luxury hotel. The system soon spread to many countries around the world.
On restaurant review sites like Yelp, a $ restaurant might mean the cost is under $10 per person, while a $$$$ restaurant would cost $50 or more per diner. Hotel review and booking sites have similar scales.
Crucially, a reader can look at multiple hotels or multiple reviewers and have a common, largely quantitative rubric. To one person, a ★★ hotel may be “nicer than usual”; to another person “nicer than usual” means ★★★★ or ★★★★★. A $$ meal might be a splurge for one diner and an affordable meal for someone else. With consistent calibration, this ceases to be an issue.
Calibration is also extensive in many areas of sports and other competitions. In chess, someone might have an Elo rating of 1900, indicating that she would be a clear favorite against a player with a rating of 1750; that rating is much more precise than “she’s a pretty good chess player.” Likewise, a tennis rating of 4.8 is a highly specific way to indicate someone’s skill on the tennis court. This is in large part the problem that my first startup, Team Rankings, was built to solve.
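To put a number on "clear favorite": the standard logistic Elo formula converts a rating gap into an expected score, where a win counts as 1, a draw as 0.5, and a loss as 0. The function name below is just illustrative, and individual federations layer their own details on top of this formula.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A against player B under the
    standard logistic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


favorite = elo_expected_score(1900, 1750)
underdog = elo_expected_score(1750, 1900)
print(round(favorite, 3))  # 0.703
print(round(underdog, 3))  # 0.297
```

So a 150-point gap means the 1900-rated player is expected to score roughly 70% of the available points, which is something "she's a pretty good chess player" could never tell you.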
Education is increasingly calibrated as well. Many modern standardized tests are adaptive: they start by asking “average” level questions and, based on how well the test taker does on them, adjust the level up or down to get a more accurate gauge of how well the test taker understands the material. This would be akin to adjusting a basketball team’s competition based on how they are doing: if they play an average high school team and win by 40 points, they should play their next game against a strong team to see if they’re closer to the 100th best team in the state or the very best team; if they lose by 40 points, they should play against a weak team.
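The adaptive idea can be sketched in a few lines. This is a deliberately simplified model, not how any particular test works: it assumes a 0-100 difficulty scale and a test taker who gets a question right exactly when its difficulty is at or below their ability. Real adaptive tests typically use statistical models that allow for lucky guesses and careless mistakes.

```python
def adaptive_estimate(answers_correctly, low=0.0, high=100.0, num_questions=8):
    """Estimate ability on a low..high scale by narrowing the plausible
    range after every question (a simple bisection)."""
    for _ in range(num_questions):
        difficulty = (low + high) / 2  # the first question is "average" level
        if answers_correctly(difficulty):
            low = difficulty   # right answer: ask harder questions next
        else:
            high = difficulty  # wrong answer: ask easier questions next
    return (low + high) / 2


# Hypothetical test taker whose true ability is 73 on the 0-100 scale.
estimate = adaptive_estimate(lambda difficulty: difficulty <= 73)
print(round(estimate, 2))  # 72.85
```

Eight well-chosen questions narrow the range to under half a point here, whereas eight fixed "average" questions would only reveal that this test taker is above average.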
There is a metric for the spiciness of peppers called the Scoville scale, which is (as far as I can tell) well-calibrated. However, it does not extend to individual dishes. One can imagine an equivalent for dish spiciness, for saltiness, for sweetness, for “garlickiness”, and for a hundred other things related to food.
Calibration Is Often Poor
My calibration example is, in the grand scheme of things, pretty trivial.
However, most of the world is poorly calibrated, and better calibration would create a ton of value. Some examples:
Though there are a few places where education is well-calibrated, most educational tools, resources, and processes are not at all well-calibrated. There is a huge amount of room for improvement to give students materials that are at just the right level to allow them to build knowledge and skills.
Most professional skills are not calibrated. Sports are an exception here (I can see exactly how well Steph Curry shoots three-pointers in games), and that allows teams to be quite smart about how they make decisions. There are no similar widely adopted metrics to judge software engineers, marketing managers, or designers. In some cases, such as sales, proficiency is well calibrated inside a given company but tough to compare across companies.
Our day-to-day health metrics are getting better calibrated but have room for improvement. I can see my sleep scores from my Oura and Fitbit, and can get some sense of how I slept last night and whether my month-to-month scores are shifting (for instance: I seem to sleep a little worse in the summer). I don’t fully trust these scores, but they are getting better and better.
There are examples in countless other categories, like transportation (how effectively / efficiently different systems operate, bottlenecks) and manufacturing (throughput, difficulty, accuracy).
Calibration is of course also a huge challenge and opportunity for AI. A big part of building a good AI system is designing good metrics for what works, and then optimizing to improve those metrics.
Better Calibrating Systems
A lot of calibration falls into a handful of categories:
Skill level: how good is a person / machine / algorithm at doing a specific task?
Speed: how quickly can something do what it needs to do?
Challenge-level: for a task (like a math problem), what is the difficulty level?
Effectiveness: how reliable is something at doing what it’s asked (perhaps for a specific task)?
Physical characteristics: how big or how heavy or how spicy is something?
Cost
The world is full of poorly calibrated systems. Those systems lead to struggles with measurement, comparison, and communication of key attributes.
Spicy ramen is a minor example, but in fields like education, hiring, health, and technology, better calibration could unlock enormous value. As our ability to collect and analyze data improves, we have an opportunity to build better scales, more useful benchmarks, and smarter recommendations.
Getting the right levels of spice, skill, challenge, and speed is more than just a matter of taste. It’s a matter of making the world work better.
[1] From Bario Ramen in Honolulu. See page 7 of the menu; I got the Kilauea level.
[2] The restaurant made me sign a consent form before ordering. That gave me pause, and the fact that I had to sign the form made me opt for the second spiciest option rather than the first. You can come to your own conclusions about my judgment in ordering that ramen dish in spite of the fact that the consent form was required.
I think the big issue with calibration or measurement in general is the heterogeneity problem: How spicy is a specific dish, really? How would you go about measuring that in a way that abstracts away from the details of the dish itself?
You could imagine perhaps blending a spoonful and then measuring the blended mixture on the Scoville scale, but of course, the blending itself induces an element of homogenization that doesn't necessarily map smoothly to the original dish.
However, there is a danger of spurious precision here: since the concept of spiciness is itself somewhat vague, the terms used to describe it are necessarily vague too. To quote Lord Bacon in his Novum Organum:
"The logical syllogism consists of propositions, propositions of words, and words are tokens of notions. Therefore if the notions themselves (which is the foundation of the matter) are confused and carelessly abstracted from things, nothing built upon them can be firm."
Basically, the stream cannot rise higher than the source: can a firm, consistent and objective measurement be created from a nebulous and uncertain concept?
I fell prey to taking a bet against Big Daddy that I could eat a spoonful of his fourth-hottest sauce and lost - I couldn't speak for about half an hour, and had knots in my stomach for days. He didn't make me sign a consent form, but he did tell me later that he had been required to give an "antidote" to local hospitals for other contestants who had worse reactions than I did.
https://www.bigdaddysoriginalbbq.com/