Summary
Yet another law firm has had to apologize for using citations hallucinated by AI in court. Jurisprudence can’t afford such goof-ups. Without some serious investment in fixing the problem, AI’s productivity promise will be eclipsed by its unreliability.
A few days ago, the law firm Sullivan & Cromwell apologized to the chief judge of Manhattan’s US Bankruptcy Court for a court filing with AI-hallucinated citations.
Andrew Dietderich, co-head of the firm’s global restructuring group, wrote the judge that the firm’s “comprehensive policies and training requirements governing the use of AI tools” had not been followed.
A secondary review process also failed. A database of similar incidents that had around 90 entries a year ago now has 1,333. Many are from pro se litigants and small-firm practitioners. Now add one of America’s leading firms.
It’s unlikely this is just about law firms; their errors simply happen to surface in public court filings. All professional service firms, from Boston Consulting Group to Goldman Sachs to the Big Four accountants, share a similar business model: partners oversee teams of associates and check their work.
The more associates per partner, the more the partners earn. That’s what’s known in professional services as leverage, and it’s so important that law firms are actually ranked by profits per partner.
But this model means they’re all going to struggle to capture the economic gains artificial intelligence seems to promise, because partners’ ability to supervise and verify AI-enabled work will become the rate-limiting step to the firms’ growth.
The model works because even though associates lack partners’ expertise, they mostly know when they don’t know something and can flag the points in the work they produce that need more senior eyes. Call it an SOS Post-It note.
Partners don’t have to check every citation and comma. They can focus on the places that genuinely require their expertise, so one partner can oversee 10 associates because the associates’ uncertainty tells them where to look.
AI is sometimes wrong. But it’s never uncertain. In a September paper, researchers at OpenAI explained why: language models are trained to suppress expressions of uncertainty, because “I don’t know” scores zero on the benchmarks the field uses to rank models.
Legal AI vendors claim they have solved the problem. LexisNexis promised “100% hallucination-free linked legal citations.” Thomson Reuters said its retrieval-augmented system “dramatically reduces hallucinations to nearly zero.” They haven’t. Stanford University tested them. (Bloomberg Law competes with Thomson Reuters in providing legal news, analysis, and workflow tools to the legal community.)
A paper in the Journal of Empirical Legal Studies showed that the three leading legal AI products hallucinate between 17% and 33% of the time. Westlaw’s AI-Assisted Research, the worst performer, confidently invented a paragraph of the Federal Rules of Bankruptcy Procedure to support a proposition the Supreme Court has rejected.
AI allows the production of much more material but strips out the crucial signals that made it possible for partners to supervise merely human associates. The verification task that made the whole thing work has changed in both kind (check everything) and magnitude (there’s so much more to check). Partners, though, still have the same amount of time, and that will become the choke point to these firms seeing the gains from AI they’re hoping for.
Three compounding mechanisms make this hard to fix.
First, incentives. Partners own and run these firms for their benefit. Although productivity gains from AI show up immediately, the risks from missing a hallucination may not show up for years, if at all. This creates a principal-agent problem, because there will be a constant temptation for the partners setting policy to reap the gains and not pay enough attention to the risks.
Historically, partnerships have managed principal-agent problems through mechanisms like deferred compensation, reputation enforcement and up-or-out selection. But all of them assume failure modes that are visible in the short run. AI’s is not.
Second, evaluation mismatch. Compensation committees reward visible outputs. A partner who catches three AI hallucinations before they reach a client has decreased their short-run productivity. A partner who pushes AI-assisted work through the pyramid faster looks better. The risk is that firms will select for the partners most enthusiastic about AI, not the ones best at managing it.
Third, the AI-nativity gap. Partners evaluating AI-assisted work are, on average, overseeing output produced by tools they use less fluently than the associates producing it. There are exceptions in both directions, but the claim is about averages. The usual quality-assurance intuition that makes partnerships work, in which the senior partner is more skilled at every task than his or her associates, is poorly calibrated for this type of failure.
All of this combines to mean that for professional service firms, there may well be a first-mover disadvantage with AI, not an advantage.
Learning how to use AI tools without overloading your verifiers is going to take time. The firms that make that investment will look worse in the short run, because they are building verification infrastructure, moving deliberately, and expanding their senior ranks to handle the greater verification burden (which means deleveraging and decreasing profits per partner).
Economists Erik Brynjolfsson and Lorin Hitt found the same pattern with corporate IT adoption in the 1990s. The firms that captured the productivity gains were the ones that invested in organizational complements, not just the hardware. The firms that bought the technology without redesigning around it spent a decade looking productive and ended up behind.
Every general-purpose technology has required an organizational innovation that took years to develop. The economic historian Paul David estimated that electricity took about 40 years to produce the factory-floor redesign necessary to reap productivity gains.
Professional service firms will need to focus on building a new version of the all-important verification function that operates at the scale necessary to handle AI-boosted work products. That’s likely to involve some combination of new AI tools, changes in staffing models, and new approaches to training for both associates and partners. If they don’t do so proactively, they’ll be forced via a high-profile failure. Don’t believe me? Ask Sullivan & Cromwell. ©Bloomberg
The author is a Bloomberg columnist who writes about corporate management and innovation.
