It’s October. The seasons are changing. The air is growing crisper. And people in the United States are beginning to take more interest in Major League Baseball as the World Series fast approaches. In celebration of the season, we invite legal professionals to revisit “Moneyball,” the 2011 sports drama starring Brad Pitt and directed by Bennett Miller.
Pitt plays Billy Beane, who, it should be noted upfront, is neither a lawyer nor an AI expert. He is, however, the general manager of the 2002 Oakland A’s, a lifelong student of the game who struggled to consistently produce a winning team in a small baseball market. Beane didn’t have the budget to compete for players with teams in larger markets like New York or Boston. The A’s could develop players, but they couldn’t retain them when they became stars. They did, however, have access to vital statistics.
Baseball is a sport with a century of data behind it, and benchmarks like a player’s batting average are known to even casual fans. What Billy and the A’s did was use analytics and different key performance indicators to win. A player’s batting average is a useful metric, but it doesn’t account for other ways of reaching base, such as walks, so on-base percentage is a better measure. Getting on base leads to more runs scored. And more runs scored means winning more games.
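For readers who want to see that distinction in concrete terms, here is a minimal sketch using the standard on-base percentage formula. The stat lines are hypothetical, invented purely for illustration rather than drawn from any actual roster.

```python
# Illustrative only: hypothetical stat lines showing why on-base percentage (OBP)
# can tell a different story than batting average (AVG).

def batting_average(hits: int, at_bats: int) -> float:
    """AVG = H / AB."""
    return hits / at_bats

def on_base_percentage(hits: int, walks: int, hit_by_pitch: int,
                       at_bats: int, sac_flies: int) -> float:
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)

# Player A: higher batting average, but rarely walks.
avg_a = batting_average(hits=160, at_bats=550)                        # ~.291
obp_a = on_base_percentage(hits=160, walks=25, hit_by_pitch=2,
                           at_bats=550, sac_flies=5)                  # ~.321

# Player B: lower batting average, but walks far more often.
avg_b = batting_average(hits=140, at_bats=520)                        # ~.269
obp_b = on_base_percentage(hits=140, walks=90, hit_by_pitch=5,
                           at_bats=520, sac_flies=5)                  # ~.379

print(f"Player A: AVG {avg_a:.3f}, OBP {obp_a:.3f}")
print(f"Player B: AVG {avg_b:.3f}, OBP {obp_b:.3f}")
```

Judged by batting average alone, Player A looks like the better hitter; judged by on-base percentage, Player B reaches base far more often, and reaching base is what ultimately produces runs.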
After losing star players like Jason Giambi to the New York Yankees and Johnny Damon to the Boston Red Sox, conventional wisdom said the team would have to replace two stars. But Beane recognized the A’s needed to replace the production of the players they lost. In aggregate, they needed to get on base as much as the prior year. Using analytics, the A’s would go on to set an American League record by winning 20 consecutive games and reach the playoffs. But the way they did it was the bigger story.
So what can “Moneyball” teach us about benchmarking AI in legal?
The Stanford study on benchmarking GenAI solutions this past summer moved the conversation forward regarding the usefulness and impact of these tools. The study was not without controversy, which itself helped generate awareness of an important question: How do we measure the results of GenAI in the legal industry?
The Stanford study tested leading research products on their ability to generate answers to caselaw research questions. A correct answer was one that accurately reflected the current state of the law. An answer that did not reflect the current state of the law was considered a hallucination. The result? One in six queries produced a hallucination.
The study’s definition of hallucination is useful for benchmarking. But does a hallucination, as defined in the study, always equal a bad outcome? What if the answer moved your research in the right direction and then allowed you to formulate a Boolean search that answered your question? That’s another run crossing home plate.
And what about associates using traditional research solutions? Has anyone benchmarked their legal research skills to see how often their conclusions do not reflect the current state of the law?
The key points are:
- Benchmarks are important, and the right benchmarks for your goals matter more than the ones that are easy to measure.
- Benchmarks for new approaches need to be read in the context of how effective current approaches are.
- Outcomes are more important than benchmarks.
Outcomes are always interesting. The goals of two organizations can differ, and what counts as winning or a positive outcome at one level of an organization may look different at another.
An entertaining television advertisement that viewers recall is considered a winner in the advertising world. But what if viewers can’t remember the name of the advertiser? What if there is no discernible uptick in sales activity as a result of the ad campaign? Recall of an ad can be an example of a vanity metric, something that is perhaps easy to measure but doesn’t support the decisions a business or law firm should make. The same pitfalls can apply to measuring the efficacy of GenAI solutions. Is what we are measuring aligned with outcomes for the firm?
To be sure, goals and outcomes can change over time. Billy Beane came up with a winning strategy to confront the realities of being a general manager in a small market. Circumstances have changed: On September 26, 2024, the Oakland A’s played their last game in Oakland as they prepared for an eventual move to Las Vegas, a much bigger market with its own unique challenges.
Next month, I’ll explore different use cases for legal GenAI and relate the performance of tools to positive outcomes. Said another way, I’ll explore what it means to get on base, score runs, and win games with legal GenAI.
Ken Crutchfield is Vice President and General Manager of Legal Markets at Wolters Kluwer Legal & Regulatory U.S., a leading provider of information, business intelligence, regulatory and legal workflow solutions. Ken has more than three decades of experience as a leader in information and software solutions across industries. He can be reached at [email protected].