Evaluating AI's Risk-Adjusted Returns
For much of the past two years, the AI industry has been obsessed with size and headline parameter counts, but not disciplined enough about the quality of returns. Parameters became the equivalent of market capitalisation: bigger was presumed better, with scale itself taken as proof of superiority. I think that view is going to look increasingly dated.
A more useful way to think about AI systems, particularly those run locally, is not in terms of intelligence alone but in terms of intelligence efficiency: how much useful work a model can produce within the real constraints of memory, power, cost and reliability.
The argument is that the relevant metric is no longer simply “how smart is the model?” but rather “how much usable intelligence does it deliver per unit of scarce resource?” In practical terms, that means looking at intelligence per gigabyte of memory, or per watt of power, rather than at raw benchmark performance in isolation.
Memory, not model size, is often the hard limit in local AI deployments. A model that cannot fit comfortably into available VRAM may be technically impressive but practically inconvenient. In that respect, “intelligence per GB” is a credible starting point.
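To make the arithmetic concrete, here is a minimal sketch; every benchmark score and memory footprint in it is an illustrative assumption, not a measured result.

```python
# Hypothetical "intelligence per GB" comparison of two local models.
# All scores and memory footprints below are illustrative assumptions,
# not real measurements of any actual model.

models = {
    # name: (benchmark score out of 100, memory footprint in GB)
    "large_model": (82.0, 48.0),  # impressive, but needs 48 GB of VRAM
    "small_model": (71.0, 8.0),   # weaker score, modest footprint
}

for name, (score, mem_gb) in models.items():
    print(f"{name}: {score / mem_gb:.1f} points per GB")

# large_model: 1.7 points per GB
# small_model: 8.9 points per GB
# On a 24 GB consumer GPU, only the small model fits at all.
```

The larger model "wins" the leaderboard, but per gigabyte of scarce memory it delivers a fraction of the return, and on ordinary hardware it cannot be deployed in the first place.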
For business users, the question is not how much intelligence can be packed into memory but how much reliable work can be extracted from a system before errors, hallucinations and supervision costs erode the gain.
This suggests that risk-adjusted intelligence would be a more mature metric. Such a measure would not reward a model merely for appearing clever on a benchmark; it would reward a system for producing accurate, grounded and decision-useful outputs with a tolerably low error rate.
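One way to sketch such a measure, by loose analogy with a Sharpe ratio, is below. The linear form, the penalty weights and all of the inputs are assumptions chosen for illustration, not a proposed standard.

```python
def risk_adjusted_intelligence(quality: float,
                               error_rate: float,
                               supervision_cost: float,
                               risk_aversion: float = 2.0) -> float:
    """Illustrative risk-adjusted score, loosely analogous to a Sharpe
    ratio: reward raw task quality, penalise the error rate and the
    human supervision it forces. All weights here are assumptions."""
    return quality - risk_aversion * error_rate - supervision_cost

# A flashy model with a 15% error rate versus a duller,
# well-grounded one at 3%. Numbers are hypothetical.
flashy = risk_adjusted_intelligence(quality=90.0, error_rate=15.0,
                                    supervision_cost=10.0)
grounded = risk_adjusted_intelligence(quality=78.0, error_rate=3.0,
                                      supervision_cost=2.0)
print(flashy, grounded)  # 50.0 vs 70.0: the grounded system wins
```

The shape matters more than the numbers: raw quality plays the role of the return, while errors and supervision act as the volatility drag that eats into it.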
The same logic applies to energy consumption: the best model is not necessarily the one that tops a leaderboard; it is the one that delivers the greatest useful output per watt. For enterprises deploying AI at scale, power draw is not a technical footnote. It is an operating expense.
A sensible framework might combine four variables: task quality, reliability, throughput and compute burden. In plain English: how well the model performs on the jobs one actually has, how often it gets them wrong, how quickly it works, and how much memory and energy it consumes while doing so. Once expressed this way, the evaluation begins to resemble portfolio construction rather than product marketing.
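A toy composite score makes the portfolio framing concrete. The multiplicative form, the choice of denominator and every figure below are assumptions one would calibrate per deployment; this is a sketch, not a definitive formula.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    task_quality: float  # 0-1: pass rate on the jobs one actually has
    reliability: float   # 0-1: fraction of outputs needing no correction
    throughput: float    # useful tasks completed per hour
    memory_gb: float     # resident memory footprint
    watts: float         # average power draw under load

def efficiency_score(p: SystemProfile) -> float:
    """Toy composite: useful, reliable work per unit of compute burden.
    The multiplicative form and the burden denominator are illustrative
    assumptions; a real evaluation would calibrate both per deployment."""
    useful_output = p.task_quality * p.reliability * p.throughput
    compute_burden = p.memory_gb * p.watts
    return useful_output / compute_burden

# Two hypothetical deployments with made-up figures.
workstation = SystemProfile(task_quality=0.74, reliability=0.95,
                            throughput=120, memory_gb=16, watts=250)
datacentre = SystemProfile(task_quality=0.88, reliability=0.97,
                           throughput=300, memory_gb=80, watts=1200)
print(efficiency_score(workstation))  # ~0.0211
print(efficiency_score(datacentre))   # ~0.0027
```

On these invented figures the smaller system earns roughly eight times the return per unit of compute burden, even though the larger one is plainly "smarter" in absolute terms.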
An AI system that performs brilliantly in controlled conditions but poorly in production is not unlike a fund that shines in back-tests and disappoints in live markets. What matters is not isolated brilliance but repeatable, risk-aware output.
Seen in that light, the next phase of AI competition may be less about absolute intelligence than about intelligence efficiency. The winners will not be those that maximise benchmark scores at any cost, but those that optimise for dependable, domain-specific returns under real-world constraints.

