Imagine a hospital system that celebrated a stunning achievement: surgical mortality rates had plummeted by 23% in just two years. Leadership praised the metrics. Bonuses were paid. The system was held up as a national model.
Then an independent auditor—going beyond the dashboard metrics to examine what the numbers couldn't see—asked a quiet, uncomfortable question: Where did the high-risk patients go?
They hadn't been cured. They hadn't found better care. Surgeons, under pressure to protect their mortality scores, had quietly stopped accepting the most complex, life-threatening cases. The number looked extraordinary. The underlying reality was anything but (Muller, 2018).
Now ask yourself this: How many of your organization's "green" dashboards are telling a similar story?
This is not a failure of intention. It is not a failure of effort. It is a failure of measurement design—a metrology problem. And if you are a Chief HR Officer, an Organizational Psychologist, or a professional coach relying on assessment instruments and performance metrics to guide human development, there is an excellent chance this failure is quietly operating inside your own systems right now.
Peter Drucker's famous adage—"If you can't measure it, you can't improve it"—has become organizational gospel. And the science behind it is sound. Quantitative indicators enable feedback loops, bottleneck identification, predictive modeling, and data-driven resource allocation. In industrial and defense systems, the distinction between Measures of Effectiveness (MOEs)—which capture true outcomes—and Measures of Performance (MOPs)—which capture easier-to-observe activities—is treated as foundational engineering discipline (Stumborg et al., 2022).
The answer, therefore, is not to abandon measurement. Systems without observable states cannot be diagnosed, optimized, or sustained. The answer is to measure correctly—with the full rigor that physical scientists have demanded for centuries and that the behavioral and organizational sciences have only recently begun to take seriously.
In 1975, economist Charles Goodhart formalized what practitioners had long sensed: "When a measure becomes a target, it ceases to be a good measure" (Goodhart, 1975). Any observed statistical regularity will collapse under the pressure of incentivization.
Four years later, social psychologist Donald T. Campbell articulated a parallel principle with sharper institutional focus: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor" (Campbell, 1979/2010).
Together, these laws describe a predictable system failure mode. As Rodamar (2018) elegantly noted, they are not competing ideas—they are complementary lenses on the same phenomenon: human beings are goal-directed systems, and when you install a visible target, you should expect the system to optimize for the target, not for the underlying goal. Manheim and Garrabrant (2019) identify four distinct variants of this failure—regressional, extremal, causal, and feedback—each requiring different countermeasures.
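A minimal simulation—illustrative only, not drawn from the cited papers—makes the regressional variant concrete: select on a noisy proxy, and the selected group's true value reliably falls short of what its proxy scores suggest.

```python
import random
import statistics

random.seed(1)

# Each unit has a true value; the visible metric is that value plus noise.
population = []
for _ in range(10_000):
    true_value = random.gauss(0, 1)
    metric = true_value + random.gauss(0, 1)  # the proxy being targeted
    population.append((metric, true_value))

# Regressional Goodhart: select the top 1% on the metric...
top = sorted(population, reverse=True)[:100]
mean_metric = statistics.mean(m for m, _ in top)
mean_true = statistics.mean(t for _, t in top)

# ...and the selected group's true value regresses toward the mean.
print(f"selected on metric:  {mean_metric:.2f}")
print(f"actual true value:   {mean_true:.2f}")
```

With equal signal and noise variances, the expected true value of the selected group is only half its metric score—exactly the pattern in the hospital example, where the dashboard number outran the underlying reality.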
The organizational manifestations are immediately recognizable: sales pipelines sandbagged to protect next quarter's quota, engagement scores that rise while exit interviews darken, "green" project dashboards concealing red realities.
Here is where a hard truth must be stated plainly.
Measurement across the physical and engineering sciences has long required a set of non-negotiable properties: accuracy, precision, linearity, freedom from bias, quantified uncertainty, and traceability to a universal construct scaffold (Pendrill, 2014). When a physicist measures temperature, the result is traceable to international standards. When an engineer measures torque, uncertainty is explicitly reported alongside every value. No manufacturing operation would accept a tolerance specification that lacked an error estimate.
Until recently, psychology—and by extension, HR measurement—has quietly granted itself permission to skip this level of rigor.
Classical Test Theory (CTT) dominated applied psychometrics for much of the 20th century. Its foundational assumptions produce measurement instruments with unequal and often large uncertainty bands across the ability continuum. A score of 75 on a leadership competency assessment does not mean the same thing for a high-performer as for a mid-level one—the error attached to each is different, rarely reported, and almost never incorporated into decision logic.
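A hypothetical illustration—the conditional error values are invented for exposition, not taken from any real instrument—shows why an unreported, score-dependent error term matters:

```python
def confidence_interval(score, csem, z=1.96):
    """95% interval around an observed score, given its conditional
    standard error of measurement (CSEM) at that score level."""
    return (score - z * csem, score + z * csem)

# Under CTT, error is typically reported (if at all) as one average
# reliability figure; in reality the CSEM varies across the scale.
# Hypothetical CSEMs: tight mid-scale, wide near the instrument's ceiling.
lo_mid, hi_mid = confidence_interval(75, 3)   # mid-range score, CSEM = 3
lo_top, hi_top = confidence_interval(75, 8)   # same score near ceiling, CSEM = 8

print(f"mid-range:    75 -> [{lo_mid:.1f}, {hi_mid:.1f}]")   # ~12-point band
print(f"near ceiling: 75 -> [{lo_top:.1f}, {hi_top:.1f}]")   # ~31-point band
```

Two identical scores of 75 carry bands of very different width, yet most decision processes treat them as interchangeable.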
Rasch Measurement represented a substantial improvement, providing person-independent item calibration and a more principled link between observed responses and latent traits. But most applied psychologists do not use Rasch; they rely instead on the superficially similar Item Response Theory (IRT), which lacks the traceability and alignment with metrological-style standards that Rasch affords.
Computerized Adaptive Testing (CAT) narrows measurement uncertainty by tailoring item selection to each respondent—producing approximately equal, small standard errors across the measurement range, a genuine metrological advance. But even CAT, as currently practiced in most organizational assessment platforms, lacks the traceability infrastructure that metrology requires: a clear, theoretically grounded, empirically validated construct scaffold that anchors scores to something real and stable beyond the instrument itself (Wilson, 2005).
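The mechanism can be sketched in a few dozen lines. The following is a toy adaptive tester under the Rasch model—the item bank, the MAP update, and the fixed test length are all illustrative choices of mine, not any vendor's algorithm:

```python
import math
import random

random.seed(0)

def p_correct(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response at ability theta
    for an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def run_cat(true_theta: float, bank: list, n_items: int = 15):
    """Toy CAT loop: administer the most informative remaining item,
    simulate a response, re-estimate ability by MAP (a standard-normal
    prior keeps the Newton update stable after all-correct runs)."""
    theta = 0.0
    administered = []          # (difficulty, response) pairs
    remaining = list(bank)
    for _ in range(n_items):
        # Rasch item information p(1-p) peaks where b is closest to theta.
        b = min(remaining, key=lambda d: abs(d - theta))
        remaining.remove(b)
        response = 1 if random.random() < p_correct(true_theta, b) else 0
        administered.append((b, response))
        for _ in range(10):    # Newton-Raphson on the log-posterior
            ps = [p_correct(theta, bb) for bb, _ in administered]
            grad = sum(r - p for (_, r), p in zip(administered, ps)) - theta
            info = sum(p * (1 - p) for p in ps) + 1.0  # +1 from the prior
            theta += grad / info
    # Approximate posterior standard error from total information.
    total_info = sum(p_correct(theta, bb) * (1 - p_correct(theta, bb))
                     for bb, _ in administered) + 1.0
    return theta, 1.0 / math.sqrt(total_info)

# A bank of 25 items spanning difficulties -3..+3 in equal steps.
bank = [-3 + 0.25 * i for i in range(25)]
theta_hat, se = run_cat(true_theta=1.5, bank=bank)
print(f"estimate: {theta_hat:.2f} (SE {se:.2f})")
```

Because each item is chosen where information is highest, the standard error shrinks roughly uniformly no matter where on the scale the respondent sits—the metrological advance the paragraph describes. What the sketch does not supply is the construct scaffold: nothing anchors the logit scale to anything outside the item bank itself.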
This traceability gap is not a minor technical footnote. It is the reason your 360-degree feedback scores cannot be meaningfully compared across cohorts, why your engagement survey trends may be artifacts of item drift rather than real change, and why your high-potential identification program may be recycling the same false positives year after year.
Compounding the metrology problem is a construct coverage problem. The contemporary science of job performance has converged on a robust three-factor structure: Task Performance (execution of core job duties), Organizational Citizenship Behavior (OCB—discretionary contributions beyond formal role requirements), and Counterproductive Work Behavior (CWB—actions that harm the organization or its members).
Critically, OCB and CWB are not opposite ends of a single continuum. Dalal's (2005) landmark meta-analysis demonstrated only a modest negative correlation ($\rho = -0.32$) between them, meaning an employee can simultaneously exhibit high OCB and high CWB—the high-performing salesperson who bullies junior staff being the archetype. Marcus et al. (2016) extended this structural analysis with convergent findings. Mackey et al.'s (2019) meta-analysis of workplace deviance further documents the organizational cost of this blind spot.
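A quick simulation makes the structural point tangible. Treating OCB and CWB as standard normal dimensions with Dalal's meta-analytic correlation—an illustrative simplification, not his analysis—shows how populated the "high on both" quadrant remains:

```python
import math
import random

random.seed(7)

rho = -0.32       # Dalal's (2005) meta-analytic OCB-CWB correlation
n = 100_000
both_high = 0
for _ in range(n):
    ocb = random.gauss(0, 1)
    # Induce the target correlation between the two dimensions.
    cwb = rho * ocb + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    if ocb > 1.0 and cwb > 1.0:   # top ~16% on each dimension
        both_high += 1

print(f"{100 * both_high / n:.2f}% score high on BOTH OCB and CWB")
```

Even under the negative correlation, a measurable share of the simulated workforce sits in the high-OCB, high-CWB quadrant—the archetypal top salesperson who bullies junior staff. A single bipolar scale would make that quadrant literally unrepresentable.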
When organizations measure only Task Performance—which is what most performance management systems do—they implicitly assign a weight of zero to OCB and CWB. Under Goodhart's Law, this is an instruction to the workforce: optimize task numbers; everything else is optional. The results are visible in every organization that has ever celebrated a record sales quarter while simultaneously experiencing a spike in workplace complaints, turnover, and customer attrition.
Consider this honestly: If your organization's leadership development assessments were subjected to the same traceability and uncertainty standards your finance team applies to quarterly earnings—would they pass?
If the answer is uncertain, you are already experiencing the problem this piece diagnoses. And every month that passes with untraceable metrics and invisible error bars represents high-stakes human capital decisions made in the dark—decisions your most measurement-sophisticated competitors may have already stopped making that way.
The good news is that the engineering discipline to fix this exists. It simply needs to be demanded and applied.
Manheim (2023) provides the most actionable synthesis of the current evidence base. Paired with the defense systems analysis tradition documented by Stumborg et al. (2022) and Muller's (2018) institutional critique, a coherent mitigation architecture emerges:
1. Reduce the Stakes Attached to Any Single Metric
Decouple diagnostic measurement from high-stakes rewards wherever possible. Use satisficing thresholds rather than maximization incentives. Metrics used for internal monitoring and development are far less susceptible to corruption than those tied to promotion, bonus, or public ranking (Koretz, 2008).
2. Obscure the Specific Target
Rotate which competencies are assessed in which development cycles. Use randomization of weights rather than fixed formulas where transparency requirements prevent full secrecy. Post-hoc specification—defining specific criteria after the performance period—removes the gaming window entirely.
3. Upgrade Measurement Design: MOEs Over MOPs
Replace activity metrics (hours logged, trainings attended, calls completed) with outcome metrics wherever feasible. Pair speed indicators with quality and defect rates. Pair output metrics with behavioral integrity data. Make it genuinely difficult to improve one metric by degrading another.
4. Add Independent Verification
Source performance data from outside the measured unit—client feedback systems, field observations, third-party audits—rather than from self-report or managerially mediated channels. Before deploying any new measurement regime, conduct a structured pre-mortem in which analysts role-play adversarial optimization of the system, then design countermeasures.
5. Restore Human Judgment
Quantitative metrics are necessary but not sufficient. Calibrated expert judgment—structured narrative assessment, behavioral observation, coaching conversation—functions as an audit mechanism that catches gaming the numerical system cannot detect. This is not sentimentality; it is measurement design.
6. Treat Metrics as Dynamic, Not Permanent
Periodically audit whether your indicators still correlate with true organizational objectives. Sunset metrics that have become corrupted targets. Use logic models to map the causal chain from activity to outcome, and measure as close to the outcome end of that chain as data quality permits.
7. Demand Metrological Standards From Your Assessment Vendors
Ask your psychometric platform providers directly: What is the standard error of measurement at each point in the score distribution? To what construct theory is this instrument traceable? When was the last criterion validity study conducted on this population? If the answer is evasion or silence, that is diagnostic information.
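Point 3's pairing principle can be made mechanical. One hypothetical way to score it—the function name and the geometric-mean choice are mine, not from the cited sources—is a composite that cannot be raised by trading quality for volume:

```python
def paired_score(output: float, quality: float) -> float:
    """Composite of two metrics, each normalized to [0, 1].

    The geometric mean rewards balance: a gain on one axis bought by an
    equal proportional loss on the other leaves the score unchanged, and
    a collapse on either axis drags the composite toward zero.
    """
    if not (0.0 <= output <= 1.0 and 0.0 <= quality <= 1.0):
        raise ValueError("metrics must be normalized to [0, 1]")
    return (output * quality) ** 0.5

balanced = paired_score(0.70, 0.70)   # steady, balanced performer
gamed = paired_score(0.90, 0.40)      # more volume, degraded quality
print(f"balanced: {balanced:.2f}  gamed: {gamed:.2f}")
```

Under a single output metric the second profile wins; under the paired composite it loses—which is precisely the incentive reversal the recommendation calls for.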
This is precisely why platforms like TruMind.ai have built their measurement architecture from metrological first principles. TruMind's AI Precision Measurement (AIM) delivers 15x the measurement precision of traditional high-stakes credentialing exams—including those used in medical licensure—using transcripts from ordinary coaching and leadership conversations. It measures not just task-oriented leadership behavior, but the full three-dimensional performance spectrum: nine dimensions of leadership capability alongside all eight ICF Coaching Competencies. Its construct scaffold is explicitly traceable to established theoretical frameworks, and its uncertainty estimates are built into the reporting rather than assumed away.
For coaches, this means objective, session-by-session evidence of client development that can be shown to sponsors and organizations. For Organizational Psychologists, it means a measurement instrument that meets the traceability and precision standards the field has long demanded but rarely received. For CHROs, it means the ability to build a defensible, auditable, multi-dimensional human capital measurement system—one that cannot be easily gamed because it is measuring outcomes, not proxies, and because the data comes from natural conversational transcripts rather than self-report.
The hospital in our opening story had plenty of metrics. What it lacked was a measurement system designed to be resistant to the very gaming it was inadvertently incentivizing. That distinction is the difference between a dashboard that looks good and a system that actually works.
The costs of naive measurement are not theoretical. Every performance cycle you run with single-dimensional metrics implicitly deprioritizes OCB and CWB. Every competency assessment you deploy without quantified uncertainty produces talent decisions with invisible error bars. Every coaching engagement evaluated by self-report alone creates a gaming surface that sophisticated clients—consciously or not—will optimize.
Meanwhile, the organizations that have already adopted metrologically rigorous, multi-dimensional measurement are making better promotion decisions, identifying developmental needs earlier, and building human capital portfolios they can actually manage. The gap between measurement-sophisticated and measurement-naive organizations compounds over time, just as any performance gap does.
The question is not whether rigorous human performance measurement is possible. It is. The question is how long your organization will continue making consequential decisions without it.
For Professional Coaches: Every competency framework your clients are assessed against embeds assumptions about what is measured, what is not, and what the stakes are. Understanding those assumptions is not optional technical knowledge—it is the foundation of informed developmental practice.
For Chief HR Officers: Your performance architecture is a measurement system. Apply the same engineering discipline to it that your operations colleagues apply to supply chain. Map true objectives first. Design MOEs before MOPs. Build uncertainty estimates into every consequential decision. Demand traceability from every vendor.
For Organizational Psychologists: The psychometric infrastructure of most organizations is operating below the metrological standard the science now makes possible. Advocate for traceability, for adaptive precision, and for three-dimensional performance models that include OCB and CWB alongside task metrics.
Measurement is indispensable. Naive measurement is dangerous. Metrologically sound measurement is among the most powerful levers available to those who lead human systems. The mystery at the heart of this piece—how a system can improve its metrics while its underlying performance deteriorates—resolves the moment you demand that your measures be as rigorous as the decisions they inform.
Campbell, D. T. (2010). Assessing the impact of planned social change. Journal of MultiDisciplinary Evaluation, 7(15), 3–43. (Original work published 1979)
Dalal, R. S. (2005). A meta-analysis of the relationship between organizational citizenship behavior and counterproductive work behavior. Journal of Applied Psychology, 90(6), 1241–1255. https://doi.org/10.1037/0021-9010.90.6.1241
El-Mhamdi, E., & Hoang, L. (2024). On Goodhart's law, with an application to value alignment. arXiv. https://doi.org/10.48550/arXiv.2410.09638
Goodhart, C. A. E. (1975). Problems of monetary management: The U.K. experience. Papers in Monetary Economics, 1. Reserve Bank of Australia. Reprinted in Courakis, A. S. (Ed.). (1981). Inflation, depression, and economic policy in the West (pp. 111–116). Barnes & Noble Books. ISBN: 0-389-20144-8
Koretz, D. M. (2008). Measuring up: What educational testing really tells us. Harvard University Press. ISBN: 978-0-674-02805-0
Mackey, J. D., McAllister, C. P., Ellen, B. P., III, & Carson, J. E. (2019). A meta-analysis of interpersonal and organizational workplace deviance research. Journal of Management. https://doi.org/10.1177/0149206319862612
Manheim, D. (2018). Building less flawed metrics: Dodging Goodhart and Campbell's laws (MPRA Paper No. 98288). Munich Personal RePEc Archive. https://mpra.ub.uni-muenchen.de/id/eprint/98288
Manheim, D. (2023). Building less-flawed metrics: Understanding and creating better measurement and incentive systems. Patterns, 4(10), 100842. https://doi.org/10.1016/j.patter.2023.100842
Manheim, D., & Garrabrant, S. (2019). Categorizing variants of Goodhart's law. arXiv. https://doi.org/10.48550/arXiv.1803.04585
Marcus, B., Taylor, O. A., Hastings, S. E., Sturm, A., & Weigelt, O. (2016). The structure of counterproductive work behavior: A review, a structural meta-analysis, and a primary study. Journal of Management, 42(1), 203–233. https://doi.org/10.1177/0149206313503019
Muller, J. Z. (2018). The tyranny of metrics. Princeton University Press. ISBN: 978-0-691-17495-2
Pendrill, L. R. (2014). Man as a measurement instrument. NCSLi Measure, 9(4), 22–33. https://doi.org/10.1080/19315775.2014.11721702
Rodamar, J. (2018). There ought to be a law! Campbell versus Goodhart. Significance, 15(6), 9. https://doi.org/10.1111/j.1740-9713.2018.01205.x
Strathern, M. (1997). 'Improving ratings': Audit in the British university system. European Review, 5(3), 305–321. https://doi.org/10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184>3.0.CO;2-4
Stumborg, M. F., Blasius, T. D., Full, S. J., & Hughes, C. A. (2022). Goodhart's law: Recognizing and mitigating the manipulation of measures in analysis (COP-2022-U-033385-Final). CNA Corporation.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Lawrence Erlbaum Associates. ISBN: 978-0-805-84785-3