Databricks KARL Agent Targets Enterprise RAG Limitations

By alex2404

Databricks has released KARL, a knowledge agent trained across six distinct enterprise search behaviors simultaneously, claiming it matches Claude Opus 4.6 on a purpose-built benchmark at 33% lower cost per query and 47% lower latency, with no human-labeled training data required.

Most enterprise retrieval-augmented generation pipelines are built to handle one type of search well. A model tuned for cross-document synthesis performs poorly on constraint-driven entity lookup; a model optimized for simple fact retrieval collapses on multi-step reasoning over fragmented internal records. These failures tend to surface only in production.

What KARL Actually Does

KARL stands for Knowledge Agents via Reinforcement Learning. Databricks trained the agent using a new reinforcement learning algorithm across six enterprise search behaviors, evaluated on KARLBench, a benchmark the company built specifically for this purpose. The six tasks cover constraint-driven entity search, cross-document report synthesis, long-document traversal with tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal company notes.

That last category draws from Databricks’ own product manager meeting notes, a dataset the company calls PMBench. The notes are fragmented, ambiguous, and unstructured in ways that frontier models handle poorly.

The agent generated its own synthetic training data. No human labeling was involved.

The Reinforcement Learning Problem

Reinforcement learning's wins in AI over the past year have largely come from tasks with clear right and wrong answers, where automated verification is straightforward. Enterprise knowledge work is different.

“The tasks that we’re working on for KARL, and that are just normal for most enterprises, are not strictly verifiable in that same way,” said Jonathan Frankle, Chief AI Scientist at Databricks. “Doing reinforcement learning in a world where you don’t have a strict right and wrong answer, and figuring out how to guide the process and make sure reward hacking doesn’t happen — that’s really non-trivial.”

Building a competitive battle card for a financial services customer, for example, requires identifying relevant accounts, filtering for recency, reconstructing past deals, and inferring outcomes from data that carries no labels. None of it is verifiable by a simple scoring function.

Frankle describes what KARL does as “grounded reasoning,” anchoring each step in a reasoning chain to retrieved facts. “You can think of this as RAG,” he said, “but like RAG plus plus plus plus plus plus, all the way up to 200 vector database calls.”
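The article doesn't publish KARL's internals, but the shape of "grounded reasoning" can be sketched: a loop that only keeps reasoning steps it can anchor to retrieved documents, under a retrieval-call budget. Everything here — the toy store, the keyword matching, the step structure — is an illustrative stand-in, not Databricks' implementation.

```python
# Hedged sketch of a grounded-reasoning loop: each step must cite retrieved
# facts or it is dropped rather than asserted. Toy keyword search stands in
# for a real vector database.
from dataclasses import dataclass, field

@dataclass
class Step:
    claim: str
    citations: list = field(default_factory=list)

class ToyVectorStore:
    """Stand-in for a vector database; whole-word overlap instead of embeddings."""
    def __init__(self, docs):
        self.docs = docs
        self.calls = 0

    def search(self, query, k=2):
        self.calls += 1
        q_words = set(query.lower().split())
        scored = [(len(q_words & set(d.lower().split())), d) for d in self.docs]
        return [d for score, d in sorted(scored, reverse=True)[:k] if score > 0]

def grounded_answer(store, questions, max_calls=200):
    """Build a chain where every step is anchored to retrieved evidence.
    The call budget mirrors the 'up to 200 vector database calls' scale."""
    chain = []
    for q in questions:
        if store.calls >= max_calls:
            break
        hits = store.search(q)
        if hits:  # keep only steps with supporting citations
            chain.append(Step(claim=q, citations=hits))
    return chain

docs = [
    "Acme renewed its contract in Q3 2024",
    "Acme evaluated a competitor before renewing",
    "Globex churned after a failed pilot",
]
chain = grounded_answer(ToyVectorStore(docs), [
    "When did Acme renew?",
    "Did Acme evaluate competitors?",
    "airspeed of unladen swallows",  # no supporting evidence: step is dropped
])
for step in chain:
    print(step.claim, "->", step.citations[0])
```

The point of the sketch is the invariant, not the retrieval mechanics: a step with zero supporting documents never enters the chain, which is one plausible way to keep a long reasoning trace anchored to facts.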

Generalization Across Tasks

A key finding from the KARL research is that multi-task reinforcement learning generalizes in ways that single-task training does not. The team trained KARL on synthetic data covering two of the six benchmark tasks. The agent performed well on all four tasks it had never encountered during training. Training on any single task and testing on the others produces poor results.

The Algorithm Behind It

KARL runs on OAPL, short for Optimal Advantage-based Policy Optimization with Lagged Inference policy, developed jointly by researchers from Cornell, Databricks, and Harvard. Standard reinforcement learning algorithms like GRPO assume the model generating training data and the model being updated stay in sync. In distributed training environments, they never do.

Prior approaches corrected for this drift using importance sampling, which introduces variance and instability. OAPL takes a different approach, embracing the off-policy nature of distributed training with a regression objective that remains stable at policy lags exceeding 400 gradient steps, roughly 100 times more off-policy than earlier methods could handle.
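The variance problem with importance sampling can be shown with toy numbers: the importance weight is exponential in the log-probability gap between the learner and the stale behavior policy, so its variance explodes as the lag grows, while a regression-style loss stays bounded no matter how stale the sample is. The regression objective below is a generic illustration of that idea, not OAPL's published loss.

```python
# Hedged sketch: importance-sampling weight variance under policy lag vs. a
# bounded regression-style loss. Toy log-prob gaps, not a real training run.
import math
import random

random.seed(0)

def is_weight(logp_new, logp_old):
    """Importance ratio pi_new / pi_old -- exponential in the log-prob gap."""
    return math.exp(logp_new - logp_old)

# Simulate increasing policy lag: larger gaps between the learner's log-probs
# and those of the stale policy that generated the data.
variances = {}
for drift in [0.1, 1.0, 4.0]:
    ratios = [is_weight(0.0, -drift * random.random()) for _ in range(1000)]
    mean = sum(ratios) / len(ratios)
    variances[drift] = sum((r - mean) ** 2 for r in ratios) / len(ratios)
    print(f"lag ~{drift}: IS weight variance {variances[drift]:.3f}")

# A regression objective fits an advantage target with a squared error, so a
# stale sample changes the target, not a multiplicative weight on the gradient.
def regression_loss(pred, advantage_target):
    return (pred - advantage_target) ** 2

print(regression_loss(0.3, 1.0))
```

Running this shows the importance-weight variance growing sharply with the simulated lag while the squared-error loss stays finite by construction — a rough intuition, under these toy assumptions, for why an off-policy regression objective can tolerate hundreds of gradient steps of drift.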


This article is a curated summary based on third-party sources.
