Alternative (and generally easier to understand) formulation: Once a measure becomes a target, it ceases to be a good measure.
See: grades, GDP, workplace metrics…
Yeah. OP’s overly complicated explanation doesn’t convey the reason why this happens, which is really kinda just human psychology. People are probably thinking it’s like observability in quantum mechanics or some shit.
Goodhart’s Law Example:
- App has poor testing and low quality. New bugs are introduced weekly. Customers complain.
- Management sets a test coverage KPI of 90% for the app’s codebase.
- Dev team focuses on tests that hit as many lines of code as possible; NOT on requirements, business logic, or anything that would improve quality, or prevent bugs.
- Quality does not improve because KPI was an arbitrary statistic that holds no value in isolation. Productivity drops because dev team wasted hundreds of hours writing useless tests to achieve KPI. Codebase worse, and less maintainable. Company wasted millions of dollars. Customers still not happy. Devs hate their lives.
It sounds like this is an example you chose from first hand experience. If so, I’m very sorry. That sounds incredibly frustrating.
Thanks, but I only joined when the company’s several dozen codebases were already at step 4, and that example wasn’t even in the top 5 of their worst problems. I completed a small greenfield project to high praise. They wanted me to fix other codebases. I handed them a laundry list of problems (with suggested solutions), told them their problems were due to incompetent management, and left.
Not my monkeys. Not my circus.
Actually this is a pretty common thing in software development. It’s become a bit of a trope. So much so that when management proposes things like KPIs tied to code (which they often do!) the devs are like, “can we get bonuses based on these KPIs? 🤑”
“If they’re such a good measure, clearly we should be compensated when we do a fantastic job! How about $5k extra for exceeding KPIs!”
“Oh, you don’t have enough trust in the system for that? Then why would you trust it for improving quality? I mean, you’re the one that made them…”
I never heard of this before, but I now know the word to describe exercise problems!
Body builders who are judged by how bulging their muscles are feel like garbage despite supposedly being “peak”.
Tall people, or people with high muscle mass, being screwed over by BMI targets.
People who are told weight indicates health ignore everything else in exchange for lowering calories.
Even in high school I remember how they would judge you based on like how many push ups you could do… no one who did a ton did proper push ups. Which led to them not helping at all as actual exercise, and even possibly leading to injury.
Heck, we can even use this for stupid Dog Shows, where because they measure specific things for the “goodness” of the dog, they screw over the dog in every way imaginable that isn’t being judged.
This is a good law to know. I like knowing this law. It’s sad how often it’s used, but it’s good to know.
Body builders who are judged by how bulging their muscles are feel like garbage despite supposedly being “peak”.
Plenty of good examples listed in your post, but I disagree with the bodybuilding one. The point isn’t to make you feel good. It’s to play this game where you compete against others to best accomplish a specific task. Just like any other sport, when you compete at the elite level, it’s never going to feel good, and it’s never going to be good for your health.
Hmm okay I just am not a fan of sports that destroy people’s bodies.
“Cut” diets that focus on looking great by not eating or drinking water before showing off just sound awful, and I’m not a fan of them. But you’re right, it’s not much worse than any other sport that damages athletes’ bodies.
People are probably thinking it’s like observability in quantum mechanics or some shit.
Lol I really don’t think anyone was thinking that
I was, at first
I don’t think anyone other than this guy was thinking that
Still, an interesting take, same terms mean different things to different people
A big part of it seems to be manipulation of the results? So, like, devs writing tests for more parts of the code base, but ones that are written to always pass.
Yes, of course. Fundamentally the end goal is to improve the app’s quality. However “quality” is not a measurable thing. Therefore, someone observed that as test coverage goes up, bugs tend to decrease, and as bugs decrease app quality tends to go up. So they make code coverage a KPI, and start putting pressure on developers to increase it.
The problem is that once people are pressured into optimizing a certain number, they will get very creative at doing so. And this creativity often breaks the measure’s relationship with the actual underlying quality we were trying to improve.
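To make the "tests that always pass" point concrete, here's a minimal Python sketch. Everything in it is made up for illustration (`apply_discount`, its deliberate bug, and both check functions). It only shows that a check can execute every line of a function, and so count fully toward line coverage, without ever being able to fail.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent. Contains a deliberate bug."""
    # Bug: subtracts the raw percent instead of that percentage of the price.
    return price - percent


def coverage_only_check() -> None:
    # Executes every line of apply_discount, so the line counts as covered,
    # but asserts nothing, so this check can never fail.
    apply_discount(200.0, 10.0)


def behaviour_check() -> None:
    # Checks the actual requirement: 10% off 200 should be 180.
    assert apply_discount(200.0, 10.0) == 180.0


def passes(check) -> bool:
    """Run a check function and report whether it passed."""
    try:
        check()
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("coverage-only check passes:", passes(coverage_only_check))
    print("behaviour check passes:", passes(behaviour_check))
```

Both checks contribute the same amount to the coverage number, but only the second one notices the bug.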
Could anyone explain what “test coverage” means?
Test coverage is defined as the percentage of your application’s functionality that is being covered by the automated tests.
Usually this is measured in lines of code. You run the automated tests, then for every line of code, you track whether it’s executed or not. If 20% of lines were never executed during the test run, your test coverage is 80%.
Software teams will often aspire to reach high coverage, because lines that are never executed during testing are a good place where bugs can hide. However it’s generally acknowledged that this isn’t a foolproof method to get rid of bugs, and reaching 100% coverage can be more effort than it’s worth. Often you have critical code sections that should be covered by multiple tests, and unimportant sections that are unlikely to fail.
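The arithmetic behind the line-based version is simple enough to sketch in Python. This is not how any particular coverage tool is implemented; `line_coverage` and the 10-line module are hypothetical, and treating an empty file as fully covered is my own assumption.

```python
from __future__ import annotations


def line_coverage(executable_lines: set[int], executed_lines: set[int]) -> float:
    """Percentage of executable lines that ran at least once during the tests."""
    if not executable_lines:
        return 100.0  # assumption: nothing to cover counts as fully covered
    covered = executable_lines & executed_lines
    return 100.0 * len(covered) / len(executable_lines)


if __name__ == "__main__":
    executable = set(range(1, 11))        # a hypothetical 10-line module
    executed = {1, 2, 3, 4, 5, 6, 7, 8}   # lines 9 and 10 never ran
    print(f"{line_coverage(executable, executed):.0f}% covered")
```

Note that the number only says a line ran, not that anything checked what it did.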
This is the first line of the linked article btw.
Goodhart’s law is an adage often stated as, “When a measure becomes a target, it ceases to be a good measure”.
So how does a company manage anything if they can’t use measurement targets?
Like software engineering. How do you improve productivity or code quality if setting a target value for a measurement doesn’t work?
You don’t make your measures your targets.
Example:
“Our customers hate us. We will make our employees get a 10 on their surveys for each customer or we’ll punish them” makes the measure a target.
“Our customers hate us, so we’re going to change our shitty policies to be more consumer friendly and see how our customers respond” keeps the measure as a measure.
So the difference is who decides what changes to make when interacting with the subject of the measure: workers vs management. Making the measure a target is basically a shitty management technique that abdicates responsibility.
Ok I’m going to answer my own question because I’m too curious to wait lol
Goodhart’s Law states that “when a measure becomes a target, it ceases to be a good measure.” In other words, when we use a measure to reward performance, we provide an incentive to manipulate the measure in order to receive the reward. This can sometimes result in actions that actually reduce the effectiveness of the measured system while paradoxically improving the measurement of system performance. … The manipulation of measures resulting from Goodhart’s Law is pervasive because direct measures of effectiveness (MOEs), which are more difficult to manipulate, are also more difficult to measure, and sometimes simply impossible to define and quantify. As a result, analysts must often settle for measures of performance (MOPs) that correlate to the desired effect of the MOE. … These negative effects can sometimes be avoided. When they cannot, they can be identified, mitigated, and even reversed.
- Use MOEs instead of MOPs whenever practicable and possible
- Use the scientific method to generate new measurement data, rather than harvesting existing and possibly compromised data
- Help customers establish authoritative and difficult-to-manipulate definitions for measures
- Identify and avoid the use of manipulated data and data prone to manipulation
- Use measurement data not generated by the organization being measured
- Collect data secretly or after a measurable activity has already occurred
- Measure all relevant system characteristics rather than just a representative few
- Randomize the measures used over time
- Wargame or red team potential measures
This report recommends that the organizations that employ analysts should do the following:
- Return to the roots of operational research to focus more on direct measurements in the field
- Answer the questions that should be answered, rather than the questions that can be answered simply because the required data are already available
- Train analysts on MOEs, MOPs, and Goodhart’s Law and how they are interrelated
- Make recognition of Goodhart’s Law part of the internal peer review process and part of all delivered analytical products
- Identify and share mitigation best practices
[Source]
It’s pretty well established academically that basically the only way KPIs can actually work toward their intended purpose is if they are changed often and determined by the people doing the work that is ultimately measured. Ongoing measurements should only ever be used as indicators - hence the term *key performance indicators* - and should never be used as targets. What that means in practice is that you should generally ignore all the individual metrics, and look across all of them instead to see if you can spot trends and anomalies, then investigate these qualitatively with the workers who ultimately produce those data to figure out what is happening and if any intervention is necessary.
The problem is that the higher up you get in the hierarchy, the less of that kind of work there is to do and you end up chasing the people below you for nice numbers to plot into your presentations to make it look like there’s a point to your job’s existence.
Thanks for uncovering this report, very insightful and lots of great examples!
Thanks, yes, I saw that one, too, but I liked the emphasis on relationships. The shorter version is easier to get but it does not explain why this happens. E.g., you can observe some relationship (e.g., test results and a student’s intelligence) and then you target grades. But then you have an incentive to teach to the test, which breaks down the relationship between test results and intelligence. Other people here gave great examples of relationships that can fail.
Imagine an antivirus program that looks at a piece of code and outputs either “Yes, this is malware” or “No, this is not malware.” It is not perfect, but it is pretty good.
If the malware authors have access to this program, they can test their malware with it. They can keep modifying their malware until it passes the antivirus program.
Once the antivirus people publish a function AV(code)→boolean, the malware people can use that function to make malware that the function mistakes for non-malware.
If you publish the exact metric that you promise to use to make a decision, then people who want to control your decision can use that metric to test their methods of manipulating you.
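A toy Python version of that loop, as a sketch only. Real antivirus engines are far more sophisticated than substring matching, and the signatures, the "malware" string, and the rewrite list are all invented for the example.

```python
from __future__ import annotations

SIGNATURES = ("evil_payload", "steal_passwords")  # hypothetical signatures


def av(code: str) -> bool:
    """Toy detector: True means 'yes, this is malware'."""
    return any(sig in code for sig in SIGNATURES)


def evade(code: str, rewrites: list[tuple[str, str]]) -> str:
    """Apply rewrites one at a time, stopping as soon as av() stops flagging the code."""
    for old, new in rewrites:
        if not av(code):
            break  # the published metric says "clean", so the attacker is done
        code = code.replace(old, new)
    return code


if __name__ == "__main__":
    malware = "run(evil_payload); steal_passwords()"
    rewrites = [("evil_payload", "helper_a"), ("steal_passwords", "helper_b")]
    disguised = evade(malware, rewrites)
    print(av(malware), av(disguised), disguised)
```

The attacker never needs to understand how `av` works. Being able to call it as often as they like is enough.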
That’s what’s happening with Google and Instagram search algorithms. People figure out how to manipulate them and start spamming. Then the search results deteriorate and you have to modify the algorithm.
Partly, yeah. But eventually the fake news people write a narrative that looks prosodically identical to real news; a bot can’t tell it isn’t because the bot doesn’t interact with the real world, only with text on the web.
Ultimately, fact-checkers and anti-spam systems have to touch grass too.
Apart from your mom’s weight.
God dammit almost made me spit out my coffee.
Now that I think of it, it seems to be at the core of some issues with training AI agents using reinforcement learning (e.g., if you choose a wrong metric, you’d get the behavior that makes sense for the agent but not what you want) and with any kind of planned economy (you need targets for planning, but people manipulate them, so you do not get what you want)
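For the reinforcement learning case, a deliberately tiny Python sketch of a mis-chosen metric. There is no learning here, just an agent picking whichever action scores highest, and the actions and all the numbers are made up.

```python
# action -> (proxy_reward, true_value); every number here is invented.
# Proxy: "the camera sees no mess". True goal: "the room is actually clean".
ACTIONS = {
    "clean_the_room": (0.9, 1.0),
    "cover_the_camera": (1.0, 0.0),
    "do_nothing": (0.0, 0.0),
}

PROXY, TRUE_VALUE = 0, 1  # indexes into the tuples above


def best_action(score_index: int) -> str:
    """Return the action with the highest score under the chosen measure."""
    return max(ACTIONS, key=lambda action: ACTIONS[action][score_index])


if __name__ == "__main__":
    print("optimizing the proxy:", best_action(PROXY))
    print("optimizing the real goal:", best_action(TRUE_VALUE))
```

The agent optimizing the proxy is behaving perfectly sensibly from its own point of view. The metric just isn't the thing we wanted.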
The economic calculation problem is not only Goodhart, but Goodhart certainly doesn’t help.