SalesforceBench | Recursion Lab

Can Agents Actually Work Inside a CRM?

Frontier models have hill-climbed most of the standard white-collar work benchmarks. But inside real companies, adoption is still early.

One reason is integration. Every company has its own systems, permissions, data models, and internal processes.

The deeper reason is business understanding.

Humans are surprisingly good at deriving business knowledge from messy systems. Give an operations person access to a Salesforce org and they will explore records, notice weird fields, infer local conventions, and identify anomalies. They will learn that one quote should not be converted because the totals do not reconcile, or that a support case is a duplicate except for one scheduled-maintenance record that should be left alone.

LLMs are much worse at this. They can call tools and produce plausible business text, but they often fail to build the underlying model of how the business works. They stop searching too early, treat contradictory records as noise, miss the one field that changes the decision, or confidently make a CRM mutation from incomplete context.

We saw this directly. In one billing-dispute task, the agent needed to reconcile multiple asset usage logs before escalating. A human would keep digging until the customer claim, dates, and asset logs lined up or contradicted each other. The agent stopped short, missed the required logs, and produced an answer that looked like work without leaving the CRM in the right state.

This matters because Salesforce, SAP, NetSuite, ServiceNow, and similar systems are the operating memory of a company. They do not just store data. Their metadata, custom fields, automations, permissions, and configuration encode how the company actually works.

At Ressl, we spend a lot of time embedded in customer operations: understanding how teams work, replicating pieces of those workflows with agents, evaluating which models are mature enough for which use cases, and being honest about where the technology still breaks. SalesforceBench is part of that research effort.

We built a simulated Salesforce environment around an industrial equipment operations company and tested whether agents can perform real CRM work: investigating cases, reconciling billing state, creating quote options, respecting approval gates, avoiding duplicate orders, and leaving behind auditable CRM mutations.

The current single-model run lands below one-third overall on a strict evaluation slice. The interesting part is not the score. It is where the agent fails.

Dozens of Salesforce-style objects

Thousands of synthetic CRM records

150+ verifier checks

The Company

The environment is built around a synthetic industrial equipment operations business.

Think of a company that provides specialized vehicles and equipment to public-sector and commercial operators. Its Salesforce org tracks assets, deployments, rentals, quotes, quote line items, sales orders, logistics documents, support cases, customer tasks, pricing records, replacement units, freight charges, and approval workflows.

The daily work is exactly the kind of back-office Salesforce work that looks easy from far away and gets thorny up close:

A customer says a unit has no hydraulic power again.
A public-sector customer says it returned every unit, but billing still shows activity.
A replacement truck should be free through the existing rental term, unless the customer chooses a standard-rate option.
A quote should become a draft order only if totals reconcile.
A sales order has a suspicious six-figure freight line that should not be copied blindly.
A customer wants to extend a rental, but the same asset may already be booked for someone else.
Several cases describe the same down-unit issue, but one scheduled maintenance reminder is not part of the duplicate set.

Figure: a simplified map of the simulated CRM state, connecting Customer Accounts, Support Cases, Quotes, Commercial Orders, Equipment Assets, Asset Usage Logs, Logistics Documents, Pricing Staging, Quote Lines, Billing Review, Operational Tasks, and Approval Gates. The real benchmark forces the agent to move across these connected records before acting.

The point of the simulation is not to mimic one specific company. It is to recreate the kind of interconnected CRM state that enterprise operators actually work inside.

The Environment

The data is synthetic, but it is not toy data.

We seeded a fake Salesforce company using synthetic records informed by realistic CRM distributions: object mix, field sparsity, naming conventions, old records, duplicate records, custom fields, partial histories, stale cases, and the uneven data quality that accumulates in real business systems.

In other words, the records are fake. The shape of the work is real.

Dimension	Scale
Salesforce-style objects	Dozens
Exported/custom fields	Thousands
Synthetic CRM records	Thousands
Workflow families	Multiple
Verifier checks	150+

The org includes standard Salesforce objects like Account, Case, Task, Opportunity, Quote, and QuoteLineItem, plus custom objects specific to industrial equipment operations, such as Asset_Usage_Log__c, Equipment_Asset__c, Logistics_Document__c, Asset_Booking_Detail__c, Extension_Pricing_Staging__c, Commercial_Order__c, and Commercial_Order_Line__c.

That custom schema matters. A serious Salesforce agent cannot assume every org looks like the demo org. It has to discover the schema, recover from bad assumptions, and use local object relationships correctly.

Standard objects: Account, Case, Task, Opportunity, Quote, QuoteLineItem

Custom operations objects: Equipment_Asset__c, Asset_Usage_Log__c, Logistics_Document__c, Asset_Booking_Detail__c, Commercial_Order__c, Commercial_Order_Line__c

Workflow state: pricing staging, approval checks, duplicate gates, audit tasks, reconciliation notes, escalation routing

The Agent Tools

The agent does not get a database dump. It has to work through Salesforce-style APIs.

Tool	What it lets the agent do
`sf_rest_describe_global`	List available Salesforce objects and discover the org surface
`sf_rest_describe_object`	Inspect fields, create/update permissions, and object metadata
`sf_rest_retrieve_record`	Fetch a specific record by object and ID
`sf_soql_query`	Query structured records with Salesforce Object Query Language
`sf_sosl_search`	Search across objects when the agent has partial names, unit numbers, or customer text
`sf_rest_create_record`	Create allowed Salesforce records in the mock environment
`sf_rest_update_record`	Update allowed Salesforce records in the mock environment

This tool surface is intentionally plain. It tests whether an agent can do the unglamorous work: discover fields, query carefully, follow evidence, handle errors, and make safe writes.

One early trace shows why this matters. The agent tried to query a field that did not exist in this org’s Case schema. A reliable CRM agent has to adapt: describe the object, find the actual custom fields or text references, and continue. That kind of schema recovery is a core enterprise-agent skill.
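The recovery pattern described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's harness: `UnknownFieldError`, the field name `Equipment_Unit_Number__c`, and the two in-memory stand-ins for the describe and query tools are all hypothetical, and the name-matching heuristic is just one plausible way to pick a replacement field.

```python
from typing import Callable


class UnknownFieldError(Exception):
    """Hypothetical error raised when a SOQL query names a missing field."""


def query_with_schema_recovery(
    describe_object: Callable[[str], list[str]],
    soql_query: Callable[[str], list[dict]],
    sobject: str,
    assumed_field: str,
    where: str,
) -> list[dict]:
    """Query with the assumed field; on failure, describe the object and retry."""
    try:
        return soql_query(f"SELECT Id, {assumed_field} FROM {sobject} WHERE {where}")
    except UnknownFieldError:
        # Learn the real schema instead of giving up or guessing again.
        fields = describe_object(sobject)
        stem = assumed_field.removesuffix("__c").lower()
        matches = [name for name in fields if stem in name.lower()]
        # Fall back to a free-text field when no custom field matches.
        fallback = matches[0] if matches else "Description"
        return soql_query(f"SELECT Id, {fallback} FROM {sobject} WHERE {where}")


# In-memory stand-ins for the describe and query tools (illustrative only).
FAKE_CASE_FIELDS = ["Id", "Subject", "Description", "Equipment_Unit_Number__c"]


def fake_describe(sobject: str) -> list[str]:
    return FAKE_CASE_FIELDS


def fake_query(soql: str) -> list[dict]:
    selected = soql.split(" FROM ")[0].removeprefix("SELECT ").split(", ")
    for name in selected:
        if name not in FAKE_CASE_FIELDS:
            raise UnknownFieldError(name)
    return [{"Id": "500-demo", "selected": selected}]


rows = query_with_schema_recovery(
    fake_describe, fake_query, "Case", "Unit_Number__c", "Status = 'Open'"
)
```

Here the first query fails on the assumed `Unit_Number__c`, the describe call surfaces the org's actual field, and the retry succeeds. A real agent would also want to confirm that the substituted field means what it assumes before acting on it.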

User task: Investigate a support, billing, logistics, or quote-to-cash request.

Salesforce tools: Describe schema, search, query, retrieve, create, and update records.

Verifier: Check evidence, mutations, guardrails, audit text, and final response format.

The Tasks

The tasks are modeled on day-to-day Salesforce work owned by real office teams: support, billing, logistics, asset operations, sales ops, and finance.

Task Pack	What It Tests	Examples
Equipment Support	Customer-facing support triage and escalation	Hydraulic failures, billing disputes, paired delivery/pickup checks, damage coverage
CRM State Mutation	Safe updates to existing CRM state	Case deduplication, custody mismatch, asset extension conflicts, duplicate cleanup
QTC/CPQ Workflow	Quote-to-cash and sales operations workflows	Quote option creation, order generation, approval gates, freight splits, duplicate order checks

Each task forces the agent to use CRM evidence before acting. Some tasks require creating records. Some require updating existing records. Some require doing nothing except creating an audit task because the source data does not reconcile.

That last category is important. In enterprise software, “do not create the order” can be the most valuable action.
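A reconciliation gate of this kind might look like the sketch below. The function name, the returned dictionary shape, and the one-cent tolerance are assumptions made for illustration; the benchmark's actual reconciliation rule is not specified here.

```python
from decimal import Decimal


def decide_order_action(
    quote_total: Decimal,
    line_totals: list[Decimal],
    tolerance: Decimal = Decimal("0.01"),  # assumed tolerance, not from the benchmark
) -> dict:
    """Create a draft order only when the quote total matches its lines."""
    line_sum = sum(line_totals, Decimal("0"))
    gap = quote_total - line_sum
    if abs(gap) <= tolerance:
        return {"action": "create_draft_order", "gap": gap}
    # Totals do not reconcile: leave the order uncreated and record why.
    note = (
        f"Quote total {quote_total} differs from line sum {line_sum} "
        f"by {gap}; order not created."
    )
    return {"action": "create_audit_task", "gap": gap, "note": note}


clean = decide_order_action(
    Decimal("1200.00"), [Decimal("1000.00"), Decimal("200.00")]
)
mismatch = decide_order_action(
    Decimal("1250.00"), [Decimal("1000.00"), Decimal("200.00")]
)
```

The second call returns the audit-task action with the gap attached, so the "do nothing" outcome still leaves an auditable trace.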

The Verifier

Verification runs against concrete evidence, allowed mutations, and final Salesforce state.

Each task has a verifier that checks the final answer and the end-state mutations. Depending on the task, the checks are human-authored, compared against the expected Salesforce end state, or compared against the mock Salesforce transaction log.

Verification Layer	Example Check
Evidence	Did the agent find every required source record?
Decision	Did it choose the expected action or escalation path?
Mutation	Did the CRM end state contain the required create/update?
Guardrails	Did it avoid forbidden source-record mutations?
Auditability	Did created records include enough operational context?
Format	Was the final response parseable and contract-compliant?

This separates plausible-looking CRM work from correct CRM work.

For example, it is not enough to say, “I escalated this.” The verifier checks whether the right Task or Case exists, whether it is attached to the right Salesforce object, whether its text includes the required operational context, and whether the agent avoided unsafe mutations to source records.
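To make the escalation example concrete, here is a rough sketch of what such a check could look like over a transaction log. The `Mutation` record shape, the function name, and the sample IDs and phrases are hypothetical; `WhatId` and `Description` are standard Task fields, but the real verifier's logic is not shown in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Mutation:
    op: str          # "create" or "update"
    sobject: str     # e.g. "Task"
    record_id: str
    fields: dict = field(default_factory=dict)


def verify_escalation(
    log: list[Mutation],
    expected_parent_id: str,
    required_phrases: list[str],
    protected_ids: set[str],
) -> dict:
    """Check that an escalation Task exists, is attached, is explained, and is safe."""
    tasks = [m for m in log if m.op == "create" and m.sobject == "Task"]
    attached = [t for t in tasks if t.fields.get("WhatId") == expected_parent_id]
    text_ok = any(
        all(p.lower() in (t.fields.get("Description") or "").lower()
            for p in required_phrases)
        for t in attached
    )
    # Guardrail: no update may touch a protected source record.
    guardrails_ok = not any(
        m.op == "update" and m.record_id in protected_ids for m in log
    )
    return {
        "mutation": bool(tasks),
        "attachment": bool(attached),
        "auditability": text_ok,
        "guardrails": guardrails_ok,
    }


good_log = [
    Mutation("create", "Task", "00T-demo", {
        "WhatId": "500-demo",
        "Description": "Escalated: unit reports no hydraulic power again.",
    }),
]
bad_log = good_log + [Mutation("update", "Quote", "0Q0-source", {"Status": "Draft"})]

good_report = verify_escalation(good_log, "500-demo", ["hydraulic power", "unit"], {"0Q0-source"})
bad_report = verify_escalation(bad_log, "500-demo", ["hydraulic power", "unit"], {"0Q0-source"})
```

In this sketch the second log passes every check except the guardrail, which is the case a text-only grader would miss.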

Current Run

The current artifacts come from a single-model experiment run, so this is a run report rather than a multi-model leaderboard.

Run	Overall	QTC/CPQ Workflow	CRM State Mutation	Equipment Support
Current single-model run	Below one-third	Strongest category	Partial success	Hardest category

That result is low, but it is also useful. It shows that realistic Salesforce work stresses agents in ways ordinary text benchmarks do not.

A Walkthrough: Replacement Quotes

One successful task asked the agent to create replacement quote options for a large public-sector customer.

The CRM contained a quote for a replacement unit that superseded an original unit. The task asked for two new draft quote options:

Option A: keep the replacement unit free through the existing term.
Option B: charge the standard rear-loader multi-weekly rate from the next billing period.

The agent had to:

Find the correct source quote among several similar customer records.
Distinguish replacement-unit context from nearby similar quote records.
Preserve the original quote unchanged.
Create two new draft Quote records.
Create corresponding QuoteLineItem records.
Add an explanatory Task with replacement-unit and end-date context.

This passed.

Before (messy source state): source quote, original unit, replacement unit, similar nearby quotes.

After (controlled mutation): draft option A, draft option B, audit task, source quote unchanged.

That is a meaningful success because it is not simple retrieval. It is a small quote-to-cash workflow with source disambiguation, controlled mutation, and auditability.

A Walkthrough: The Wrong Discount Base

Another task looked similar on the surface but failed for a more subtle reason.

The agent had to create three extension pricing staging records for a regional operator’s ramp extension:

Year 1 at the current price.
Year 2 with a 5% discount.
Year 3 with a 10% discount.

The agent created the right object type and the right number of records. It even used the expected discount percentages: 0%, 5%, and 10%.

But it failed because the discount amounts were wrong. The oracle expected the discount to be applied against the full commercial base; the agent applied it against a narrower line-level base.

Expected: Apply the discount against the full commercial base, then stage the annual pricing records.

Actual: Applied the same percentages against a narrower line-level base, producing plausible but wrong amounts.

That is exactly the kind of mistake that makes enterprise automation risky. The workflow shape was right, but the business base for the calculation was wrong.
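A small worked example shows how the same percentages diverge once the base changes. The dollar amounts below are made up for illustration and do not come from the benchmark; only the 0%, 5%, and 10% schedule is taken from the task.

```python
from decimal import Decimal


def stage_discounts(base: Decimal, percents: list[int]) -> list[Decimal]:
    """Return the discount amount for each year, rounded to cents."""
    return [
        (base * Decimal(p) / Decimal(100)).quantize(Decimal("0.01"))
        for p in percents
    ]


# Hypothetical bases: the full commercial amount vs. a single line's amount.
FULL_COMMERCIAL_BASE = Decimal("120000.00")
LINE_LEVEL_BASE = Decimal("90000.00")
PERCENTS = [0, 5, 10]  # Year 1, Year 2, Year 3

expected = stage_discounts(FULL_COMMERCIAL_BASE, PERCENTS)
actual = stage_discounts(LINE_LEVEL_BASE, PERCENTS)
```

With these illustrative bases, Year 2 comes out at 6000.00 against the full base and 4500.00 against the line-level base. Both figures look reasonable in isolation, which is why a check on percentages alone would not catch the error.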

What Passed

The model was strongest when the task had a crisp source record and a clear mutation pattern.

It successfully created 2-year and 3-year equipment ramp quote options with the expected quote line structure.

It created replacement quote options for a public-sector customer while preserving the source quote.

It converted a municipal quote into a billing-ready draft order package with a recurring equipment line, an adjustment line, and a validation task.

It also handled several negative-control workflows correctly. In one approval-gate scenario, it detected that the non-rental delta exceeded the approval threshold and created an approval task instead of an order. In a reconciliation-gate scenario, it refused order creation when quote totals did not reconcile. In a duplicate-order scenario, it found an existing converted order package and avoided duplicate order regeneration.
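The three negative-control behaviors above can be read as a sequence of gates in front of order creation. The sketch below is one possible ordering under stated assumptions: the function name, the action labels, the threshold value, and the choice to check duplicates first are all hypothetical.

```python
from decimal import Decimal


def choose_order_action(
    existing_order_ids: list[str],
    totals_reconcile: bool,
    non_rental_delta: Decimal,
    approval_threshold: Decimal,
) -> str:
    """Walk the gates in order and return the first action that applies."""
    if existing_order_ids:
        # An order package already exists for this quote: do not regenerate it.
        return "skip_duplicate_order"
    if not totals_reconcile:
        # Source data is inconsistent: record an audit task instead of an order.
        return "create_audit_task"
    if abs(non_rental_delta) > approval_threshold:
        # Over the threshold: route to approval rather than creating the order.
        return "create_approval_task"
    return "create_draft_order"


THRESHOLD = Decimal("5000.00")  # illustrative value only

over_threshold = choose_order_action([], True, Decimal("7500.00"), THRESHOLD)
duplicate = choose_order_action(["a0X-demo"], True, Decimal("100.00"), THRESHOLD)
```

Every branch except the last one ends without an order, which matches the point that stopping is often the correct outcome.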

These are not trivial behaviors. The model can create records, respect gates, and sometimes know when to stop.

What Failed

The failures clustered into a few practical categories.

Failure Mode	What Happened
Under-searching	The agent stopped before collecting all required records
Wrong-source selection	The agent failed to distinguish similar quotes, cases, asset records, or customers
Malformed output	The final response was not parseable JSON in some tasks
Missed mutation	The agent answered but did not create the required task, case, quote, or order
Wrong business math	The agent used a plausible but incorrect calculation base
Incomplete audit trail	The agent created a record but omitted required operational context

Failure Mode	Equipment Support	CRM State Mutation	QTC/CPQ Workflow
Under-searching	High	Medium	Low
Wrong-source selection	High	Medium	Medium
Missed mutation	High	Medium	Low
Wrong business math	Low	Low	High
Incomplete audit trail	Medium	High	Medium

The equipment support pack was the clearest under-searching signal. Many failures involved too few tool calls, missing matched records, incorrect decisions, or missing required customer/date values.

A public-sector billing dispute is a good example. The agent needed to reconcile multiple asset usage logs and dates. Instead, it failed to produce parseable final JSON, missed every required log, omitted the required date values, and created no required escalation.
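Two of those failures, unparseable output and missing evidence, are cheap to detect mechanically. The following sketch assumes a hypothetical response contract in which the final answer is a JSON object with a `matched_log_ids` list; the key names and IDs are invented for illustration.

```python
import json


def check_final_response(
    raw: str, required_log_ids: set[str], required_keys: set[str]
) -> dict:
    """Report whether the response parses and which evidence or keys are missing."""
    failed = {
        "parseable": False,
        "missing_keys": sorted(required_keys),
        "missing_logs": sorted(required_log_ids),
    }
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return failed
    if not isinstance(payload, dict):
        return failed
    found = set(payload.get("matched_log_ids", []))
    return {
        "parseable": True,
        "missing_keys": sorted(required_keys - set(payload)),
        "missing_logs": sorted(required_log_ids - found),
    }


REQUIRED_LOGS = {"log-1", "log-2", "log-3"}
REQUIRED_KEYS = {"decision", "matched_log_ids", "return_date"}

prose_answer = check_final_response(
    "I escalated this to billing.", REQUIRED_LOGS, REQUIRED_KEYS
)
partial_answer = check_final_response(
    '{"decision": "escalate", "matched_log_ids": ["log-1"]}',
    REQUIRED_LOGS,
    REQUIRED_KEYS,
)
```

A prose-only answer fails every check at once, while the partial answer parses but still reports two missing logs and a missing date field.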

The CRM-state pack showed a different pattern: partial operational competence without exact end-state correctness. The agent passed the hydraulic case deduplication task, reopening the right existing case and creating an urgent follow-up. But in other tasks it created plausible tasks or updates that did not satisfy the verifier’s exact expected end state.

That is what makes the benchmark interesting. It does not just ask, “Did the answer sound right?” It asks, “Did the CRM end up in the right state?”

What This Measures

SalesforceBench measures five capabilities that matter for enterprise agents.

Capability	What It Means in Salesforce	Example
Grounded search	Keep querying until the evidence set is complete	Find every relevant asset log, case, quote, or order
Schema adaptation	Use describes and query errors to learn the org	Recover when an assumed field does not exist
Record disambiguation	Separate similar customers, units, quotes, and cases	Pick the replacement quote, not a nearby parks quote
Mutation discipline	Create/update only allowed objects and preserve sources	Do not mutate source quotes or duplicate orders
Business arithmetic	Apply the right base, thresholds, and reconciliation rules	Compute deltas, discounts, prorations, freight splits

These are not benchmark parlor tricks. They are the daily mechanics of enterprise software work.

Why Salesforce Is a Hard Agent Benchmark

Salesforce is a useful benchmark substrate because every org is its own little operating system.

The schema is customized. Processes are encoded in fields, tasks, notes, automation habits, and team conventions. Important evidence may be spread across standard objects, custom objects, history records, line items, attachments, and stale records. Similar records are common. The right answer often depends on what not to mutate.

A model that can answer Salesforce questions is not automatically a model that can do Salesforce work.

To do the work, it has to operate through tools, tolerate uncertainty, gather evidence, and leave the system in a correct end state.

Takeaway

SalesforceBench points at the real bottleneck for enterprise agents: business understanding inside messy systems of record.

Tool use and polished writing are table stakes. The hard part is reconstructing process from CRM state: which records matter, which contradictions are signal, when arithmetic must reconcile, and when the correct action is to hold off.

That is why simulated systems of record matter. They let us study agents against customized schemas, stale data, hidden process assumptions, and high-cost writes before those failures reach production.