27 September 2019

The Tyranny of Metrics – Jerry Z. Muller

Introduction

There are things that can be measured. There are things that are worth measuring. But what can be measured is not always what is worth measuring; what gets measured may have no relationship to what we really want to know. The costs of measuring may be greater than the benefits. The things that get measured may draw effort away from the things we really care about.

The problem is not measurement, but excessive measurement and inappropriate measurement – not metrics, but metric fixation.

When their scores are used as a basis of reward and punishment, surgeons, as do others under such scrutiny, engage in creaming, that is, they avoid the riskier cases. When hospitals are penalized based on the percentage of patients who fail to survive for thirty days beyond surgery, patients are sometimes kept alive for thirty-one days, so that their mortality is not reflected in the hospital’s metrics. In England, in an attempt to reduce wait times in emergency wards, the Department of Health adopted a policy that penalized hospitals with wait times longer than four hours. The program succeeded – at least on the surface. In fact, some hospitals responded by keeping incoming patients in queues of ambulances, beyond the doors of the hospital, until the staff was confident that the patient could be seen within the allotted four hours of being admitted.

Checklists – standardized procedures for how to proceed under routine conditions – have been shown to be valuable in fields as varied as airlines and medicine. And, as recounted in the book Moneyball, statistical analysis can sometimes discover that clearly measurable but neglected characteristics are more significant than is recognized by intuitive understanding based on accumulated experience.

Professionals don’t like to think about costs. Metrics folks do. When the two groups work together, the result can be greater satisfaction for both. When they are pitted against one another, the result is conflict and declining morale.

I. The Argument

A key premise of metric fixation concerns the relationship between measurement and improvement. There is a dictum (wrongly) attributed to the great nineteenth-century physicist Lord Kelvin: “If you cannot measure it, you cannot improve it.” In 1986 the American management guru, Tom Peters, embraced the motto, “What gets measured gets done,” which became a cornerstone belief of metrics. In time, some drew the conclusion that “anything that can be measured can be improved.”

The key components of metric fixation are:

the belief that it is possible and desirable to replace judgment, acquired by personal experience and talent, with numerical indicators of comparative performance based upon standardized data (metrics);
the belief that making such metrics public (transparent) assures that institutions are actually carrying out their purposes (accountability);
the belief that the best way to motivate people within these organizations is by attaching rewards and penalties to their measured performance, rewards that are either monetary (pay-for-performance) or reputational (rankings).

Trying to force people to conform their work to pre-established numerical goals tends to stifle innovation and creativity – valuable qualities in most settings. And it almost inevitably leads to a valuation of short-term goals over long-term purposes.

One of the purposes of this book is to specify when performance metrics are genuinely useful – how to use metrics without the characteristic dysfunctions of metric fixation.

The recurring flaws of metric fixation are:

Measuring the most easily measurable
Measuring the simple when the desired outcome is complex
Measuring inputs rather than outcomes
Degrading information quality through standardization
Gaming through creaming
Improving numbers by lowering standards. Airlines improve their on-time performance by increasing the scheduled flying time of their flights.
Improving numbers through omission or distortion of data
Cheating
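
The airline example of improving numbers by lowering standards can be made concrete with a little arithmetic. The following is a minimal sketch with made-up flight durations; the function name `on_time_rate`, the 15-minute grace period, and both scheduled times are assumptions chosen for illustration, not figures from the book.

```python
# Hypothetical gate-to-gate durations (minutes) for ten flights on one route.
actual_minutes = [118, 121, 125, 127, 130, 133, 136, 140, 145, 152]


def on_time_rate(actual, scheduled, grace=15):
    """Share of flights arriving no more than `grace` minutes past schedule."""
    on_time = sum(1 for minutes in actual if minutes <= scheduled + grace)
    return on_time / len(actual)


original_schedule = 115  # assumed scheduled flying time
padded_schedule = 130    # same route, schedule lengthened by 15 minutes

rate_before = on_time_rate(actual_minutes, original_schedule)
rate_after = on_time_rate(actual_minutes, padded_schedule)

print(f"on-time before padding: {rate_before:.0%}")
print(f"on-time after padding:  {rate_after:.0%}")
```

Not a single flight in this sketch gets any faster; only the standard it is judged against moves, and the reported on-time rate rises anyway.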

II. The Background

Military officers were themselves increasingly imbibing a managerial outlook, pursuing degrees in business administration, management, or economics. That led to what Luttwak called a “materialist bias,” aimed at measuring inputs and tangible outputs (such as firepower), rather than intangible human factors, such as strategy, leadership, group cohesion, and the morale of servicemen. What could be precisely measured tended to overshadow what was really important.

One vector of the metric fixation was the rise of management consultants, outfitted with the managerial skills of quantitative analysis, whose first maxim was “If you can’t measure it, you can’t manage it.” Reliance on numbers and quantitative manipulation not only gave the impression of scientific expertise based on “hard” evidence, it also minimized the need for specific, intimate knowledge of the institutions to whom advice was being sold. The culture of management demanded more data – standardized, numerical data.

Numerical metrics also give the appearance (if one does not analyze their genesis and relevance too closely) of transparency and objectivity. A good part of their attractiveness is that they appear to be readily understood by all.

The quest for numerical metrics of accountability is particularly attractive in cultures marked by low social trust. And mistrust of authority has been a leitmotif of American culture since the 1960s.

The suspicion of authority was intrinsic to the post-1960s political left: to rely upon the judgment of experts was to surrender to the prejudices of established elites. Thus, the left had its reasons for advancing an agenda that professed to make institutions accountable and transparent, using the purportedly objective and scientific standards of measured performance. On the right there was the suspicion, sometimes well founded, that public-sector institutions were being run more for the benefit of their employees than their clients and constituents. In some schools, police departments, and other government agencies, time-serving was indeed a reality, even if not as predominant or universal as its critics alleged. The culture of metric accountability was an understandable attempt to break the stranglehold of entrenched gerontocracy. When institutional establishments came under populist attack, they too resorted to metrics as a means of defense to demonstrate their effectiveness.

The problem is that management’s quest to get a handle on a complex organization often leads to what Yves Morieux and Peter Tollman have dubbed “complicatedness”: the expansion of procedures for reporting and decision-making, requiring ever more coordination bodies, meetings, and report-writing. With all that time spent reporting, meeting, and coordinating, there is little time left for actual doing.

A strange, egalitarian alchemy often assumes that there must be someone better to be found outside the organization than within it: that no one within the organization is good enough to ascend, but unknown people from other places might be. That assumption leads to a turnover of top leaders, executives, and managers, who arrive at their new posts with limited substantive knowledge of the institutions they are to manage. Hence their greater reliance on metrics, and preferably metrics that are similar from one organization to another (aka “best practices”). These outsiders-turned-insiders, lacking the deep knowledge of context that comes from experience, are more dependent on standardized forms of measurement. Not only that, but with an eye on their eventual exit to some better job with another organization, mobile managers are on the lookout for metrics of performance that can be deployed when the headhunter calls.

Yet another factor is the spread of information technology.

Public Management schemes are plausible solutions for dealing with units of government that produce a single product or service, such as issuing passports. But that is the exception rather than the rule. Moreover, in business there are clear financial criteria of success and failure: costs and benefits can be compared to determine profits, and managers can plausibly be rewarded on that basis. But in government and nonprofit organizations there are rarely single goals, and they cannot be readily measured.

People who choose to work for government agencies and nonprofit organizations, such as schools, universities, hospitals, or the Red Cross, are also interested in earning a living, but they tend to be more motivated by a commitment to the mission of the organization: to teach, to research, to heal, to rescue. They respond differently to the lure of monetary rewards, because their motivations are different, at least in degree.

It is simple-minded to assume that people are motivated only by the desire for more money, and naive to assume that they are motivated only by intrinsic rewards. The challenge is to figure out when each of these motivations is most effective.

Some rewards enhance intrinsic motivation. For example, when the rewards are verbal and expressed primarily to convey information (“You did a great job on that!”) rather than to exercise control. Or when awards are given out after the fact, for excellence in achievement, without having been offered as an incentive in advance. Or, in fields such as science or scholarship, when prizes or honorific titles are bestowed to recognize long-term achievement. More broadly, above-market wages can reinforce employees’ intrinsic motivation if those wages are perceived as a signal of the organization’s appreciation of the employees’ performance. Intrinsic and extrinsic motivation can work in tandem when the outcomes that are rewarded are in keeping with the agents’ own sense of mission: when hospitals, for example, are rewarded for better safety records. But when mission-oriented organizations try to use extrinsic rewards, as in promises of pay-for-performance, the result may actually be counterproductive.

Many of these outputs are not highly visible or measurable in any numerical sense. Organizations depend on employees engaging in mentoring and in teamwork, for example, which are often at odds with what the employees would do if their only interests were to maximize their measured performance for purposes of compensation. Thus, there is a gap between the measurable contribution and the actual, total contribution of the agent.

III. The Mismeasure of All Things? Case Studies

Education

By allowing more students to pass, a college transparently demonstrates its accountability through its excellent metric of performance. What is not so transparent is the lowered standards demanded for graduation.

High dropout rates seem to indicate that too many students are attempting college, not too few. And those who do obtain degrees find that a generic B.A. is of diminishing economic value, because it signals less and less to potential employers about real ability and achievement. Recognizing this, prospective college students and their parents seek admission not just to any college, but to a highly ranked one. And that, in turn, has led to the arms race of college rankings.

Lowering the standards for obtaining a B.A. means that using the percentage of those who attain a college degree as an indicator of “human capital” becomes a deceptive unit of measurement for public policy analysis.

(...) ignoring the fact that not all B.A.’s are the same, and that some may not reflect much ability or achievement.

In an age in which technology is replacing many tasks previously performed by those with low to moderate levels of human capital, national economic growth based on innovation and technological progress depends not so much on the average level of educational attainment as on the attainment of those at the top of the distribution of knowledge, ability, and skill. In recent decades, the percentage of the population with a college degree has gone up, while the rate of economic growth has declined. And though the gap between the earnings of those with and those without a college diploma remains substantial, the falling rate of earnings for college graduates seems to indicate that the economy already has an oversupply of graduates.

To be sure, public policy ought to aim at more than economic growth, and there is more to college education than its effect on earning capacity, as we will explore in a moment.

When individual faculty members, or whole departments, are judged by the number of publications, whether in the form of articles or books, the incentive is to produce more publications, rather than better ones.

What, you might ask, is the alternative to tallying up the number of publications, the times they were cited, and the reach of the journals in which articles are published? The answer is professional judgment. In an academic department, evaluation of faculty productivity can be done by the chair or by a small committee, who, consulting with other faculty members when necessary, draw upon their knowledge, based on accumulated experience, of what constitutes significance in a book or article.

The sort of life-long satisfaction that comes from an art history course that allows you thereafter to understand a work of art in its historical context; or a music course that trains you to listen for the theme and variations of a symphony or the jazz interpretation of a standard tune; or a literature course that heightens your appreciation of poetry; or an economics course that leaves you with an understanding of key economic institutions; or a biology course that opens your eyes to the wonders of the structures of the human body – none of these is captured by the metrics of return-on-investment. Nor is the fact that college is often a place where life-long friendships are made, often including that most important of friendships, marriage. All of these should be factored in when considering “return on investment”: but because they are not measurable in quantifiable terms, they are not included. The hazard of metrics so purely focused on monetary return on investment is that like so many metrics, they influence behavior. Already, universities at the very top of the rankings send a huge portion of their graduates into investment banking, consulting, and high-end law firms – all highly lucrative pursuits. These are honorable professions, but is it really in the best interests of the nation to encourage the best and the brightest to choose these careers?

Students too often learn test-taking strategies rather than substantive knowledge.

The problem does not lie in the use of standardized tests, which, when suitably refined, can serve as useful measures of student ability and progress. Value-added testing, which measures the changes in student performance from year to year, has real utility. It has helped to pinpoint poorly performing teachers, who have then left the system. More importantly, value-added testing can be genuinely useful as a diagnostic tool, used by the teachers themselves to discover which aspects of the curriculum work and which do not. But value-added tests work best when they are “low stakes.” It is the emphasis placed on these tests as the major criterion for evaluating schools that creates perverse incentives, including focusing on the tests themselves at the expense of the broader goals of the institution. High-stakes testing leads to other dysfunctions as well, such as creaming: studies of schools in Texas and in Florida showed that average achievement levels were increased by reclassifying weaker students as disabled, thus removing them from the assessment pool. Or out and out cheating, as teachers alter student answers.
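
The reclassification form of creaming described above is easy to see in numbers. The following is a minimal sketch with hypothetical test scores; `reported_average` and its `excluded_lowest` parameter are illustrative names, and the data is invented rather than taken from the Texas or Florida studies.

```python
from statistics import mean

# Hypothetical test scores for one class of ten students.
scores = [42, 48, 55, 61, 67, 72, 78, 84, 90, 93]


def reported_average(all_scores, excluded_lowest=0):
    """Average over the assessment pool after dropping the weakest students."""
    pool = sorted(all_scores)[excluded_lowest:]
    return mean(pool)


honest_average = reported_average(scores)
creamed_average = reported_average(scores, excluded_lowest=2)

print(f"all students assessed:       {honest_average}")
print(f"two weakest reclassified:    {creamed_average}")
```

The reported average rises although no student has learned anything more: the gain comes entirely from who is counted.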

No evidence that teacher incentives increase student performance, attendance, or graduation, nor … any evidence that the incentives change student or teacher behavior.

Development of these noncognitive qualities may well be going on in classrooms and schools without being reflected in performance metrics based on test scores.

The costs of trying to use metrics to turn schools into gap-closing factories are therefore not only monetary. The broader mission of schools to instruct in history and in civics is neglected as attention is focused on attempting to improve the reading and math scores of lower-performing groups. Pedagogic strategies that may be effective for lower-achieving students (such as longer school days and shorter summer vacations) are extended to students for whom these strategies are counterproductive. And resources are diverted away from maximizing learning on the part of the more gifted and talented – who may in fact hold the key to national economic performance.

Medicine

One role is informational and diagnostic: the process of keeping track of various methods and procedures, and then comparing the outcomes, makes it possible to determine which are most successful. The successful methods and procedures can then be followed by others. Another is publicly reported metrics, intended to provide transparency to consumers, and a basis for comparison and competition among providers. Yet another is pay-for-performance, in which accountability is backed up with monetary rewards or penalties.

The great push in recent decades has been for metrics to be used not only to improve safety and effectiveness but also to contain costs.

What about the figures for mortality and life expectancy? These, it turns out, are influenced in large part by factors outside the medical system, factors having to do with culture and styles of life.

Americans have disproportionately high rates of death from gunshot wounds, another factor that is lamentable, but has almost nothing to do with the medical system.

Many of the problems of American health are a function not of the medical system but of social and cultural factors beyond the medical system. When it comes to diagnosing and treating disease, Atlas notes, American medicine is among the best.

Metrics at Geisinger are effective because of the way in which they are embedded in a larger system. Crucially, the establishment of measurement criteria and the evaluation of performance are done by teams that include physicians as well as administrators. The metrics of performance, therefore, are neither imposed nor evaluated from above by administrators devoid of firsthand knowledge. They are based on collaboration and peer review. Geisinger also uses its metrics to continuously improve its performance in outpatient care for a variety of conditions.

There are some real advantages to publicly available metrics of surgeon success and of hospital mortality rates. Their publication can point out very poor performers, who may then cease practicing, in the case of surgeons – a sifting process all the more valuable in a profession in which practitioners are reluctant to dismiss incompetent fellow members of the guild. Or the lower-level performers can take steps to improve their measured performance, in the case of hospitals.

There are immediate benefits to discovering poorly performing outliers. The problem is that the metrics continue to get collected from everyone. And at some point the marginal costs exceed the marginal benefits.

“Pay for performance” reduces intrinsic motivation. Many tasks, especially in health care, are potentially intrinsically satisfying. Relieving pain, answering questions, exercising manual dexterity, being confided in, working on a professional team, solving puzzles, and experiencing the role of a trusted authority – these are not at all bad ways to spend part of one’s day at work. Pride and joy in the work of caring is among the many motivations that do result in “performance” among health care professionals.

Unfortunately, neglecting intrinsic satisfiers in work can inadvertently diminish them.

Another recurrent issue with medical metrics: hospitals serve very different patient populations, some of whom are more prone to illness and less able to take care of themselves once discharged. Pay-for-performance schemes try to compensate for this by what is known as “risk adjustment.” But calculations of the degree of risk are at least as prone to mismeasurement and manipulation as other metrics. In the end, hospitals that serve the most challenging patient population are most likely to be penalized. As in the case of schools punished for the poor performance of their students on standardized tests, by penalizing the least successful hospitals, performance metrics may end up exacerbating inequalities in the distribution of resources – hardly a contribution to the public health they are supposed to improve.

Policing

As one Chicago detective explained, “It’s so easy.” First, the responding officer can intentionally misclassify a case or alter the narrative to record a lesser charge. A house break-in becomes “trespassing”; a garage break-in becomes “criminal damage to property”; a theft becomes “lost property.” In each of these cases, what had been a major offense becomes a minor crime, not reflected in the FBI Uniform Crime Report. The temptation to understate crimes is sufficiently great that the New York Police Department devotes substantial resources to auditing the reports it receives, and to punishing officers found to have misreported. But not every police force has the resources – or the will – to create these countervailing forces.

In Britain, this process of directing police resources at easier-to-solve crimes in order to boost detection rates is known as “skewing.” Metrics, then, have played a useful role in policing. But the attempt to use metrics as a basis of reward and punishment can lead to metrics that are less reliable and even counterproductive.

Military

Kilcullen also warns against the use of all “input metrics,” that is, metrics that count what the army and its allies are doing, for these may be quite distinct from the outcomes of those actions:

Input metrics are indicators based on our own level of effort, as distinct from the effects of our efforts. For example, input metrics include numbers of enemy killed, numbers of friendly forces trained, numbers of schools or clinics built, miles of road completed, and so on. These indicators tell us what we are doing but not the effect we are having. To understand that effect, we need to look at output metrics (how many friendly forces are still serving three months after training, for example, or how many schools or clinics are still standing and in use after a year) or, better still, at outcome metrics. Outcome metrics track the actual and perceived effect of our actions on the population’s safety, security, and well-being.

Coming up with useful metrics often requires an immersion in local conditions. Take, for example, the market price of exotic (i.e., nonlocal) vegetables, which few outsiders look to as a useful indicator of a population’s perceived peace and well-being. Kilcullen, however, explains why they might be helpful:

Afghanistan is an agricultural economy, and crop diversity varies markedly across the country. Given the free-market economics of agricultural production in Afghanistan, risk and cost factors – the opportunity cost of growing a crop, the risk of transporting it across insecure roads, the risk of selling it at market and of transporting money home again – tend to be automatically priced in to the cost of fruits and vegetables. Thus, fluctuations in overall market prices may be a surrogate metric for general popular confidence and perceived security. In particular, exotic vegetables – those grown outside a particular district that have to be transported further at greater risk in order to be sold in that district – can be a useful telltale marker.

Knowledge that may be of no use in other circumstances – to the chagrin of those who look for universal templates and formulae. The hard part is knowing what to count, and what the numbers you have counted actually mean in context.

Business

There are indeed circumstances when pay for measured performance fulfills that promise: when the work to be done is repetitive, uncreative, and involves the production or sale of standardized commodities or services; when there is little possibility of exercising choice over what one does; when there is little intrinsic satisfaction in it; when performance is based almost exclusively on individual effort, rather than that of a team; and when aiding, encouraging, and mentoring others is not an important part of the job.

As one sociologist has put it, “Extrinsic rewards become an important determinant of job satisfaction only among workers for whom intrinsic rewards are relatively unavailable.”

People do want to be rewarded for their performance, both in terms of recognition and remuneration. But there is a difference between promotions (and raises) based on a range of qualities, and direct remuneration based on measured quantities of output. For most workers, contributions to their company include many activities that are intangible but no less real: coming up with new ideas and better ways to do things, exchanging ideas and resources with colleagues, engaging in teamwork, mentoring subordinates, relating to suppliers or customers, and more. It’s appropriate to reward such activities through promotions and bonuses – even if it is more difficult to document and requires a greater degree of judgment by those who decide on the rewards. Nor is the problem assigning numbers to performance. There is nothing wrong with rating people on a scale. The problems arise when the scale is too one-dimensional, measuring only a few outputs that are most easily measured because they can be standardized.

(...) the depressive effect of performance pay on creativity; the propensity to cook the books; the inevitable imperfections of the measurement instruments; the difficulty of defining long-term performance; and the tendency for extrinsic motivation to crowd out intrinsic motivation. They’ve concluded that it might be more advantageous to abolish pay-for-performance for top managers, and replace it with a higher fixed salary.

“Even though over half of the companies used forced ranking, the respondents reported that this approach resulted in lower productivity, inequity, skepticism, decreased employee engagement, reduced collaboration, damage to morale, and mistrust in leadership.” Increasing numbers of technology companies, conscious of the demotivating effect of performance rankings on the majority of their staff, are moving away from performance bonuses. They are replacing them with higher base salaries combined with shares or share options, to give employees a tangible interest in the long-term flourishing of the company (while paying special rewards to particularly high performers). Yet other companies are dropping annual ratings in favor of “crowdsourced” continuous performance data, by which supervisors, colleagues, and internal customers provide ongoing online feedback about employee performance. That may be substituting the frying pan for the fire, as employees constantly game for compliments, while resenting the omnipresent surveillance of their activities – a dystopian possibility captured in Dave Eggers’s 2013 novel The Circle. Yet as improvements in information technology make it easier to monitor one or another index of worker performance, it will become ever more tempting to link pay to performance, whether in the form of piece rates, bonuses, or commissions – in spite of evidence of the hazards of measuring too narrowly, and of discouraging teamwork and innovation.

Unable to count intangible assets such as reputation, employee satisfaction, motivation, loyalty, trust, and cooperation, those enamored of performance metrics squeeze assets in the short term at the expense of long-term consequences. For all these reasons, reliance upon measurable metrics is conducive to short-termism, a besetting malady of contemporary American corporations.

Charitable organizations

What gets measured is what is most easily measured, and since the outcomes of charitable organizations are more difficult to measure than their inputs, it is the inputs that get the attention. At the extremes, the ratio of overhead-to-program costs can provide a useful indicator of fraud or of poor financial management. But too often, measured performance that may be useful in aberrant cases is extended to all cases. For most charities, equating low overhead with higher productivity is not only deceptive but downright counterproductive. In order to be successful, charitable organizations need competent, trained staff. They need adequate computer and information systems. They need functional offices. And yes, the ability to keep raising funds. But the assumption that the effectiveness of charities is inversely proportional to their overhead expenses leads to underspending on overhead and the degradation of organizational capacities: instead of high-quality and well-trained staff, too many novices and too much staff turnover; computer systems that are out of date and inefficient; and as a result, less effectiveness in raising funds for ongoing activities or new programs. To make matters worse, the funders impose growing demands for reports, so that staff time devoted to documentation eats up an ever larger portion of the grant. In response, the leaders of charitable organizations often end up trying to game the figures: by reporting that the time of leading staff members is devoted almost entirely to programs, or that there is no spending on fundraising. That response is understandable. But it feeds the expectations of funders that low overhead is the measure they should be looking at to hold charities accountable. Thus the snake of accountability eats its own tail.

Foreign aid

Programs whose achievements are not easily measured in quantitative terms have been curtailed. It is easier to measure enrollment in primary schools and literacy rates, for example, than the sort of cultural education of future elites that comes from providing scholarships for students from poor countries to study in American universities. So when metrics becomes the standard of evaluation, programs that cannot demonstrate their short-term benefits are sacrificed. The U.S. Agency for International Development’s scholarship program, for example, was gutted by the White House Office of Management and Budget on the grounds that its benefits could not be put into dollar terms, and thus the government could not determine whether the program’s benefits exceeded its costs.

Those who suffer from Obsessive Measurement Disorder, Natsios writes, ignore “a central principle of development theory – that those development programs that are most precisely and easily measured are the least transformational, and those programs that are the most transformational are the least measureable.” High among those are the development of competent leadership and management.

Excursus: on transparency

Effective politicians must to some degree be two-faced, pursuing more flexibility in closed negotiations than in their public advocacy. Only when multiple compromises have been made and a deal has been reached can it be subjected to public scrutiny, that is, made transparent.

Outputs, he argues, ought to be made as publicly accessible as possible. Inputs, by contrast, are the discussions that go into government decision-making: discussions between policymakers and civil servants. There are increasing pressures to make those publicly available as well.

Making internal deliberations open to public disclosure – that is, transparent – is counterproductive, Sunstein argues, since if government officials know that all of their ideas and positions may be made public, it inhibits openness, candor, and trust in communications. The predictable result will be for government officials to commit ever less information to writing, either in print or in the form of emails. Instead, they will limit important matters to oral conversation. But that decreases the opportunity to carefully lay out positions. All policies have costs: if internal deliberations are subject to transparency, it makes it impossible to deflate policy prescriptions that may be popular but are ill advised, or desirable but likely to offend one or another constituency. Thus transparency of inputs becomes the enemy of good government.

IV. Conclusions

Goal displacement through diversion of effort to what gets measured
Promoting short-termism
Costs in employee time
Diminishing utility
Rule cascades
Rewarding luck
Discouraging risk-taking
Discouraging innovation
Discouraging cooperation and common purpose
Degradation of work
Costs to productivity

THE CHECKLIST

1. What kind of information are you thinking of measuring? The more the object to be measured resembles inanimate matter, the more likely it is to be measurable: that is why measurement is indispensable in the natural sciences and in engineering. When the objects to be measured are influenced by the process of measurement, measurement becomes less reliable. Measurement becomes much less reliable the more its object is human activity, since the objects – people – are self-conscious, and are capable of reacting to the process of being measured. And if rewards and punishments are involved, they are more likely to react in a way that skews the measurement’s validity. By contrast, the more they agree with the goals of those rewards, the more likely they are to react in a way that enhances the measurement’s validity.

2. How useful is the information? Always begin by reminding yourself that the fact that some activity is measurable does not make it worth measuring; indeed, the ease of measuring may be inversely proportional to the significance of what is measured. To put it another way, ask yourself: is what you are measuring a proxy for what you really want to know? If the information is not very useful or not a good proxy for what you’re really aiming at, you’re probably better off not measuring it.

3. How useful are more metrics? Remember that measured performance, when useful, is most effective in identifying outliers, especially poor performers or true misconduct. It is likely to be less useful in distinguishing between those in the middle or near the top of the ladder of performance. Moreover, the more you measure, the greater the likelihood that the marginal costs of measuring will exceed the benefits. So, the fact that metrics is helpful doesn’t mean that more metrics is more helpful.

4. What are the costs of not relying upon standardized measurement? Are there other sources of information about performance, based on the judgment and experience of clients, patients, or parents of students? In a school setting, for example, the degree to which parents request a particular teacher for their children is probably a useful indicator that the teacher is doing something right, whether or not the results show up on standardized tests. In the case of charities, it may be most useful to allow the beneficiaries to judge the results.

5. To what purposes will the measurement be put, or to put it another way, to whom will the information be made transparent? Here a key distinction is between data to be used for purposes of internal monitoring of performance by the practitioners themselves versus data to be used by external parties for reward and punishment. For example, is crime data being used to discover where the police ought to deploy more squad cars or to decide whether the precinct commander will get a promotion? Or is a surgical team using data to discover which procedures have worked best, or are administrators using that same data to decide whether the hospital will be financially rewarded or penalized for its scores? Measurement instruments, such as tests, are invaluable, but they are most useful for internal analysis by practitioners rather than for external evaluation by public audiences who may fail to understand their limits. Such measurement can be used to inform practitioners of their performance relative to their peers, offering recognition to those who have excelled and offering assistance to those who have fallen behind. To the extent that such measurements are used to determine continuing employment and pay, they will be subject to gaming the statistics or to outright fraud. (...) “Low-stakes” metrics are often more effective than high-stakes ones.

6. What are the costs of acquiring the metrics?

7. Ask why the people at the top of the organization are demanding performance metrics. As we’ve noted, the demand for performance measures sometimes flows from the ignorance of executives about the institutions they’ve been hired to manage, and that ignorance is often a result of parachuting into an organization with which one has little experience.

8. How and by whom are the measures of performance developed? Accountability metrics are less likely to be effective when they are imposed from above, using standardized formulas developed by those far from active engagement with the activity being measured.

9. Remember that even the best measures are subject to corruption or goal diversion.

10. Remember that sometimes, recognizing the limits of the possible is the beginning of wisdom. Not all problems are soluble, and even fewer are soluble by metrics. It’s not true that everything can be improved by measurement, or that everything that can be measured can be improved. Nor is making a problem more transparent necessarily a step to its solution. Transparency may make a troubling situation more salient, without making it more soluble.