For decades, metrics have acted as the quiet governors of data science, machine learning, and AI. They decide what counts as progress, what qualifies as intelligence, and what kinds of mistakes a society will tolerate. They do not merely measure; they legislate. As Ian Hacking argued of statistics, measurement systems shape the very realities they claim to describe. To measure is not only to know but also to normalize and to enforce, to draw the horizon of what will be considered “real.” Metrics, in this sense, do not just track the world; they help make it go round.
But in the age of AI, the role of metrics has grown uncanny and contested. What once anchored evaluation now reappears as simulation. Models fabricate benchmarks, hallucinate citations, and generate tables that look convincing but bear no relation to any actual experiment.
What once served as proof of genuine research can now be simulated instantly, without any underlying experiments or validation. The mechanisms that ensure scientific integrity (peer review, reproducibility, empirical testing) can now be bypassed entirely when AI generates the complete appearance of rigorous evaluation.
The problem arises when such gestures are consumed as if they were real, leaving us unable to separate authentic knowledge from synthetic imitation.
This essay traces how metrics have been transformed from instruments of validation into aesthetic performances. First, we examine how generative AI systems learned to simulate results and the entire apparatus of scientific evaluation: tables, citations, statistical summaries. Then we explore what it means to lose the epistemic friction that once guaranteed authentic knowledge, a loss that leaves us vulnerable to mistaking synthetic performance for real evaluation.
When Measurement Becomes Ornament
Generative models are trained, among other things, on the literature of science itself: papers, benchmarks, evaluations. This means they can produce the language of being evaluated.
Ask a model to summarize a nonexistent experiment, and it may offer a results section: neat tables, confident percentages, citations to papers that don’t exist. None of it is grounded in experiment. But it looks like science.
At this point, metrics cease to discipline and begin to decorate. A hallucinated F1 score becomes a kind of rhetorical ornament. The strange loop is complete: AI now performs the performance of being measured.
When this happens, three risks emerge: epistemic pollution (fabricated results contaminating scientific literature), cascading errors (researchers building on phantom foundations), and the dissolution of expertise (when anyone can generate professional-looking results, what distinguishes actual knowledge?).
At the core, we risk losing our collective ability to distinguish between what we've actually learned and what merely looks like learning.
The Collapse of Epistemic Friction
Metrics once imposed friction. To publish results, you had to validate on real data, running experiments and measuring something. To climb a leaderboard, you had to submit a working model. Friction was the hard, empirical work that ensured scientific claims were grounded in reality rather than in aesthetically convincing performances.
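To make that friction concrete, here is a minimal sketch in Python of what it amounts to in practice. The model, data, and function names are illustrative placeholders, not any particular system; the point is only that the reported number is forced to come from real predictions measured against real labels.

```python
# A minimal sketch of evaluative friction: the reported score is
# traceable to a model that actually ran on held-out data.
# `model`, `X_test`, and `y_test` are placeholders, not a real system.
from sklearn.metrics import f1_score

def evaluate(model, X_test, y_test):
    """Return an F1 score grounded in actual predictions."""
    y_pred = model.predict(X_test)   # the model must actually run
    return f1_score(y_test, y_pred)  # the score is computed, never asserted

# A fabricated results section skips both steps above:
# there is no y_pred, and therefore nothing being measured.
```

A hallucinated score has the same surface form as the return value of this function, but none of its provenance.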
In generative AI, that friction collapses. Models can produce the discourse of evaluation without its discipline.
And just when we need them most, benchmarks themselves are exhausted—ImageNet, GLUE, SuperGLUE—saturated until “state of the art” means little more than marginal gains on tasks that no longer surprise us.
Progress risks becoming a performance: metric inflation without insight, curve-chasing without clarity. We mistake the appearance of rigor for the labor of rigor, eroding the boundary between hard-won knowledge and spectacle.
The two trends unfortunately compound one another. As genuine benchmarks lose their force and significance, synthetic ones rush in to fill the gap. What once slowed us down to guarantee validity now accelerates the production of appearances.
Reclaiming the Measure
Metrics still make the world go round. But in the generative age, they risk becoming decorative, aestheticized, and simulated.
To restore their force, we need to recover what they were meant to be: instruments of discipline rather than ornaments of discourse. This requires systems that expose their own uncertainty, and cultures of interpretation that treat metrics as wagers rather than neutral truths. It requires collective validation in which benchmarks are understood as living agreements, not static leaderboards to be gamed.
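As one concrete gesture toward systems that expose their own uncertainty, here is a minimal sketch, assuming a Python setting with NumPy and scikit-learn. The function name and its defaults are my own illustration, not a standard API; it reports a metric together with a bootstrap confidence interval rather than as a bare point estimate.

```python
# A hedged illustration of "exposing uncertainty": report F1 with a
# bootstrap confidence interval instead of a bare number. This is one
# possible convention, not a prescribed standard.
import numpy as np
from sklearn.metrics import f1_score

def f1_with_bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a (1 - alpha) bootstrap interval for F1."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        # resample prediction/label pairs with replacement
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return f1_score(y_true, y_pred), (lo, hi)
```

An interval does not make the number true, but it turns a decorative figure back into a falsifiable claim: a wager with stated stakes rather than an ornament.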
The task is not to abandon metrics but to remember them. Only then can they serve again as tools of knowledge rather than props in a performance. To measure should mean to engage reality, not to rehearse its image. To evaluate should mean to genuinely test our limits, not to synthetically flatter our illusions.
In a world where any result can be performed on demand, where the appearance of rigor is indistinguishable from rigor itself, we face the possibility of mistaking our sophisticated performances for genuine understanding.
This exposure through AI may resonate with an older suspicion that statistics is not welcome in the rigorous world of mathematics; some would argue that statistics is closer to rhetoric than to mathematics.
It also reminds me that we often create the world out of fiction and storytelling, and I think there is great power behind that.
Wouldn't repositioning metrics in the Cartesian sense circle back to a belief in objectivity? Don't we need to question that belief deeply, and with it the utility of metrics as measurements of truth?