

The best AI model still fails 1 in 5 accounting tasks


The following is a guest post from Woosung Chun, CFO at DualEntry. Opinions are the author’s own.

We build accounting software powered by artificial intelligence. That’s exactly why we needed to know where AI accounting falls short.

We ran 19 leading AI models through 101 real accounting workflows. Not trivia. Not a “what is accounts payable” multiple choice. Actual accounting scenarios: Classify this transaction, create a journal entry for this scenario, reconcile this bank statement and close the month. The kind of work that sits in every finance team’s queue every single day.
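To make "task-level evaluation" concrete, here is a minimal sketch of how a workflow benchmark could score a model. The `Task` fields, the grading scheme and the example below are illustrative assumptions, not the published DualEntry methodology.

```python
# Hypothetical sketch of task-level benchmark scoring: each task carries a
# deterministic pass/fail grader, and accuracy is the fraction of tasks passed.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    category: str                  # e.g. "classification", "reconciliation"
    prompt: str                    # the scenario shown to the model
    grade: Callable[[str], bool]   # deterministic pass/fail check on the answer

def run_benchmark(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Overall accuracy: the fraction of tasks whose output passes grading."""
    passed = sum(1 for t in tasks if t.grade(model(t.prompt)))
    return passed / len(tasks)
```

The key design choice is the deterministic grader: a demo only shows you outputs that look right, while a grader forces every task into an explicit pass or fail.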

The best model we tested scored 79.2% accuracy. That was Claude Opus 4.7. Second place was OpenAI GPT-5.4 at 77.3%. GPT-4 scored 39.8% on the same tasks. Whatever you think about AI, that trajectory is hard to ignore.

But here’s the number I keep coming back to: The best model in the world gets one in five accounting tasks wrong.

In most contexts, 80% is a good score. In accounting, it’s not. A misclassified transaction doesn’t stay in one place. It flows into your P&L. Into your balance sheet. Into your tax filings. Into the materials you hand to auditors. One error at month-end close doesn’t just affect one line. It compounds.
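The compounding point can be put in back-of-envelope terms. If each task fails independently with probability one minus the accuracy, the chance of an entirely error-free close collapses quickly as the task count grows. The 20-task close and the independence assumption here are illustrative, not figures from the benchmark.

```python
# Back-of-envelope: probability that every task in a close is correct,
# assuming each task fails independently with probability (1 - accuracy).
def clean_close_probability(accuracy: float, n_tasks: int) -> float:
    return accuracy ** n_tasks

# At 79.2% per-task accuracy, a hypothetical 20-task close is almost
# never error-free: roughly a 1% chance of zero mistakes.
p = clean_close_probability(0.792, 20)
```

Real errors are not independent, but the direction of the arithmetic holds: a per-task error rate that sounds tolerable implies a near-certainty of at least one error somewhere in the close.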

What surprised me most wasn’t the headline accuracy. It was where the models broke down. The models that looked reasonable on conceptual accounting knowledge, the “do you understand how accruals work” type questions, often fell apart on bank reconciliation and month-end close tasks. The hardest operational work, the stuff finance teams actually lose sleep over, is exactly where AI performance dropped.

There’s a difference between knowing accounting and doing accounting. I knew that going in. I didn’t expect the gap to be quite so visible in the data.

The other thing worth saying: Most of the AI tools being pitched to CFOs right now aren’t being evaluated this way. Vendors show you demos and outputs that look correct, but they don’t show you task-level failure rates across a representative sample of real workflows. There’s no standard for this yet, which is part of why we built and published the 2026 Accounting AI Benchmark, a full methodology and results set, openly available.

So what does this mean practically? A few things.

First, any CFO considering AI for finance operations needs to push vendors on task-level accuracy in their specific workflow categories, not overall benchmark scores. The variance between categories in our data was significant. A model that performs reasonably at transaction classification can fail badly at reconciliation. Those are not the same risk profile.
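The category-variance point is easy to check if a vendor gives you task-level results. A sketch, with made-up data, of how an overall score can hide a large spread between categories:

```python
# Illustrative per-category breakdown. The categories and pass/fail data
# below are invented to show how a blended score masks variance.
from collections import defaultdict

def accuracy_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (category, passed) pairs -> accuracy within each category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        hits[category] += int(passed)
    return {c: hits[c] / totals[c] for c in totals}
```

In the test data below, a blended 70% score hides a 90% classification accuracy sitting next to a 50% reconciliation accuracy. Those are very different risk profiles behind one headline number.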

Second, the question of what happens when the model is wrong matters as much as how often it’s right. A 20% error rate is manageable if you have validation layers, review workflows and controls between AI output and your books. It’s not manageable without them. But the answer here isn’t to avoid AI in accounting. It’s to be deliberate about where the model sits in the workflow. There’s a meaningful difference between AI that drafts and surfaces, sitting inside a system with deterministic validation, audit trails and exception handling built in as core features, and AI bolted onto a legacy system or accessed raw through an API with none of that infrastructure around it. The former can work. The latter is where errors compound quietly until they don’t.
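What "deterministic validation between AI output and your books" can look like in miniature: check that a drafted journal entry balances and uses known accounts, and route anything suspect to an exception queue instead of the ledger. The field names, chart of accounts and rules here are assumptions for the sketch, not any particular product's implementation.

```python
# Illustrative validation layer between AI-drafted entries and the ledger.
# Accounts and entry structure are invented for the example.
VALID_ACCOUNTS = {"1000-Cash", "4000-Revenue", "6100-Office Supplies"}

def validate_entry(entry: dict) -> list[str]:
    """Deterministic checks on a drafted journal entry.
    Returns a list of problems; an empty list means the entry may post."""
    problems = []
    debits = sum(line["debit"] for line in entry["lines"])
    credits = sum(line["credit"] for line in entry["lines"])
    if round(debits - credits, 2) != 0:
        problems.append(f"unbalanced: debits {debits} != credits {credits}")
    for line in entry["lines"]:
        if line["account"] not in VALID_ACCOUNTS:
            problems.append(f"unknown account: {line['account']}")
    return problems

def post_or_queue(entry: dict, ledger: list, exceptions: list) -> None:
    """Post clean entries; route anything that fails validation to review."""
    (exceptions if validate_entry(entry) else ledger).append(entry)
```

The point is that these checks are boring, deterministic and cheap, which is exactly why they belong between a probabilistic model and the books.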

Third, I’d be skeptical of any AI tool for finance that can’t tell you its error rate. Not because the vendors are being dishonest, but because most of them genuinely haven’t tested it this way. Task-oriented evaluation against real accounting workflows is far harder to build than a product demo. That’s an observation, not a criticism.

But right now the gap is real. GPT-4 to today’s top performers in two years is a remarkable trajectory. I don’t think 79% accuracy is the ceiling. It’s probably closer to a floor for where frontier models will be in another 12 months.

But model capability and deployment readiness are two different things. The controls, validation workflows and audit trails that make AI safe to use in a real accounting environment take time to build and test. Right now, the models are running ahead of the systems designed to catch their mistakes.

That gap is where CFOs need to be paying attention.

The benchmark was designed and built by Ignacio Brasca, staff software engineer at DualEntry. 
