Can Agents Actually Work Inside a CRM?
Frontier models have hill-climbed most of the standard white-collar work benchmarks. But inside real companies, adoption is still early.
One reason is integration. Every company has its own systems, permissions, data models, and internal processes.
The deeper reason is business understanding.
Humans are surprisingly good at deriving business knowledge from messy systems. Give an operations person access to a Salesforce org and they will explore records, notice weird fields, infer local conventions, and identify anomalies. They will learn that one quote should not be converted because the totals do not reconcile, or that a support case is a duplicate except for one scheduled-maintenance record that should be left alone.
LLMs are much worse at this. They can call tools and produce plausible business text, but they often fail to build the underlying model of how the business works. They stop searching too early, treat contradictory records as noise, miss the one field that changes the decision, or confidently make a CRM mutation from incomplete context.
We saw this directly. In one billing-dispute task, the agent needed to reconcile multiple asset usage logs before escalating. A human would keep digging until the customer claim, dates, and asset logs lined up or contradicted each other. The agent stopped short, missed the required logs, and produced an answer that looked like work without leaving the CRM in the right state.
This matters because Salesforce, SAP, NetSuite, ServiceNow, and similar systems are the operating memory of a company. They do not just store data. Their metadata, custom fields, automations, permissions, and configuration encode how the company actually works.
At Ressl, we spend a lot of time embedded in customer operations: understanding how teams work, replicating pieces of those workflows with agents, evaluating which models are mature enough for which use cases, and being honest about where the technology still breaks. SalesforceBench is part of that research effort.
We built a simulated Salesforce environment around an industrial equipment operations company and tested whether agents can perform real CRM work: investigating cases, reconciling billing state, creating quote options, respecting approval gates, avoiding duplicate orders, and leaving behind auditable CRM mutations.
The current single-model run lands below one-third overall on a strict evaluation slice. The interesting part is not the score. It is where the agent fails.
The Company
The environment is built around a synthetic industrial equipment operations business.
Think of a company that provides specialized vehicles and equipment to public-sector and commercial operators. Its Salesforce org tracks assets, deployments, rentals, quotes, quote line items, sales orders, logistics documents, support cases, customer tasks, pricing records, replacement units, freight charges, and approval workflows.
The daily work is exactly the kind of back-office Salesforce work that looks easy from far away and gets thorny up close:
- A customer says a unit has no hydraulic power again.
- A public-sector customer says it returned every unit, but billing still shows activity.
- A replacement truck should be free through the existing rental term, unless the customer chooses a standard-rate option.
- A quote should become a draft order only if totals reconcile.
- A sales order has a suspicious six-figure freight line that should not be copied blindly.
- A customer wants to extend a rental, but the same asset may already be booked for someone else.
- Several cases describe the same down-unit issue, but one scheduled maintenance reminder is not part of the duplicate set.
A simplified map of the simulated CRM state. The real benchmark forces the agent to move across these connected records before acting.
The point of the simulation is not to mimic one specific company. It is to recreate the kind of interconnected CRM state that enterprise operators actually work inside.
The Environment
The data is synthetic, but it is not toy data.
We seeded a fake Salesforce company using synthetic records informed by realistic CRM distributions: object mix, field sparsity, naming conventions, old records, duplicate records, custom fields, partial histories, stale cases, and the uneven data quality that accumulates in real business systems.
In other words, the records are fake. The shape of the work is real.
| Dimension | Scale |
|---|---|
| Salesforce-style objects | Dozens |
| Exported/custom fields | Thousands |
| Synthetic CRM records | Thousands |
| Workflow families | Multiple |
| Verifier checks | 150+ |
The org includes standard Salesforce objects like Account, Case, Task, Opportunity, Quote, and QuoteLineItem, plus industrial-equipment-operations-specific custom objects such as Asset_Usage_Log__c, Equipment_Asset__c, Logistics_Document__c, Asset_Booking_Detail__c, Extension_Pricing_Staging__c, Commercial_Order__c, and Commercial_Order_Line__c.
That custom schema matters. A serious Salesforce agent cannot assume every org looks like the demo org. It has to discover the schema, recover from bad assumptions, and use local object relationships correctly.
- Standard objects: Account, Case, Task, Opportunity, Quote, QuoteLineItem
- Custom operations objects: Equipment_Asset__c, Asset_Usage_Log__c, Logistics_Document__c, Asset_Booking_Detail__c, Commercial_Order__c, Commercial_Order_Line__c
- Workflow state: pricing staging, approval checks, duplicate gates, audit tasks, reconciliation notes, escalation routing
The Agent Tools
The agent does not get a database dump. It has to work through Salesforce-style APIs.
| Tool | What it lets the agent do |
|---|---|
| `sf_rest_describe_global` | List available Salesforce objects and discover the org surface |
| `sf_rest_describe_object` | Inspect fields, create/update permissions, and object metadata |
| `sf_rest_retrieve_record` | Fetch a specific record by object and ID |
| `sf_soql_query` | Query structured records with Salesforce Object Query Language |
| `sf_sosl_search` | Search across objects when the agent has partial names, unit numbers, or customer text |
| `sf_rest_create_record` | Create allowed Salesforce records in the mock environment |
| `sf_rest_update_record` | Update allowed Salesforce records in the mock environment |
This tool surface is intentionally plain. It tests whether an agent can do the unglamorous work: discover fields, query carefully, follow evidence, handle errors, and make safe writes.
One early trace shows why this matters. The agent tried to query a field that did not exist in this org’s Case schema. A reliable CRM agent has to adapt: describe the object, find the actual custom fields or text references, and continue. That kind of schema recovery is a core enterprise-agent skill.
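A minimal sketch of that recovery loop, under loud assumptions: `MockOrg`, `InvalidFieldError`, and the field names are hypothetical stand-ins, not the benchmark's tools or schema. The `describe_object` and `query` methods only imitate the role of `sf_rest_describe_object` and `sf_soql_query`, whose real signatures are not shown in this post.

```python
from dataclasses import dataclass, field


class InvalidFieldError(Exception):
    """Raised by the mock org when a query names a field the object lacks."""


@dataclass
class MockOrg:
    # Hypothetical schema: object name -> field names (not the benchmark's schema).
    schema: dict[str, list[str]]
    calls: list[str] = field(default_factory=list)

    def describe_object(self, obj: str) -> list[str]:
        self.calls.append(f"describe:{obj}")
        return list(self.schema[obj])

    def query(self, obj: str, fields: list[str]) -> list[str]:
        self.calls.append(f"query:{obj}")
        missing = [f for f in fields if f not in self.schema[obj]]
        if missing:
            raise InvalidFieldError(f"No such column {missing[0]!r} on {obj}")
        return fields  # stand-in for result rows


def query_with_schema_recovery(org, obj, wanted, keyword):
    """Try the assumed fields; on a schema error, describe the object and
    retry with the valid fields plus any field whose name contains keyword."""
    try:
        return org.query(obj, wanted)
    except InvalidFieldError:
        available = org.describe_object(obj)
        known = [f for f in wanted if f in available]
        candidates = [
            f for f in available
            if keyword.lower() in f.lower() and f not in known
        ]
        return org.query(obj, known + candidates)


org = MockOrg({"Case": ["Id", "Subject", "Unit_Number__c"]})
fields = query_with_schema_recovery(org, "Case", ["Id", "Asset__c"], keyword="unit")
```

In this toy org the first query fails on the assumed `Asset__c` field, the describe call surfaces `Unit_Number__c`, and the retry succeeds. A real agent would also need to fall back to text search when no field name matches.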
- User task: investigate a support, billing, logistics, or quote-to-cash request.
- Salesforce tools: describe schema, search, query, retrieve, create, and update records.
- Verifier: check evidence, mutations, guardrails, audit text, and final response format.
The Tasks
The tasks are modeled on day-to-day Salesforce work owned by real office teams: support, billing, logistics, asset operations, sales ops, and finance.
| Task Pack | What It Tests | Examples |
|---|---|---|
| Equipment Support | Customer-facing support triage and escalation | Hydraulic failures, billing disputes, paired delivery/pickup checks, damage coverage |
| CRM State Mutation | Safe updates to existing CRM state | Case deduplication, custody mismatch, asset extension conflicts, duplicate cleanup |
| QTC/CPQ Workflow | Quote-to-cash and sales operations workflows | Quote option creation, order generation, approval gates, freight splits, duplicate order checks |
Each task forces the agent to use CRM evidence before acting. Some tasks require creating records. Some require updating existing records. Some require doing nothing except creating an audit task because the source data does not reconcile.
That last category is important. In enterprise software, “do not create the order” can be the most valuable action.
The Verifier
Verification runs against concrete evidence, allowed mutations, and final Salesforce state.
Each task has a verifier that checks the final answer and the end-state mutations. Depending on the task, the checks are human-authored, compared against the expected Salesforce end state, or compared against the mock Salesforce transaction log.
| Verification Layer | Example Check |
|---|---|
| Evidence | Did the agent find every required source record? |
| Decision | Did it choose the expected action or escalation path? |
| Mutation | Did the CRM end state contain the required create/update? |
| Guardrails | Did it avoid forbidden source-record mutations? |
| Auditability | Did created records include enough operational context? |
| Format | Was the final response parseable and contract-compliant? |
This separates plausible-looking CRM work from correct CRM work.
For example, it is not enough to say, “I escalated this.” The verifier checks whether the right Task or Case exists, whether it is attached to the right Salesforce object, whether its text includes the required operational context, and whether the agent avoided unsafe mutations to source records.
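As an illustration only, a check of this kind could be sketched as below. The record and log shapes (`type`, `op`, `id`) and the function name `verify_escalation` are assumptions made for the example, not the benchmark's verifier. `WhatId` and `Description` are standard Salesforce `Task` fields.

```python
def verify_escalation(created, tx_log, parent_id, required_phrases, protected_ids):
    """Return named pass/fail checks for one escalation task (hypothetical shapes)."""
    # Tasks that are attached to the expected parent record.
    attached = [
        r for r in created
        if r["type"] == "Task" and r.get("WhatId") == parent_id
    ]
    text = " ".join(r.get("Description", "") for r in attached).lower()
    return {
        "task_exists_on_parent": bool(attached),
        "audit_context_present": bool(attached)
        and all(p.lower() in text for p in required_phrases),
        # Guardrail: no update in the transaction log touches a protected source record.
        "sources_untouched": not any(
            e["op"] == "update" and e["id"] in protected_ids for e in tx_log
        ),
    }


created = [{
    "type": "Task",
    "WhatId": "CASE-1",
    "Description": "Escalated: unit 4417 down since 2024-03-02",
}]
tx_log = [{"op": "create", "id": "TASK-9"}]
result = verify_escalation(
    created, tx_log, "CASE-1", ["unit 4417", "2024-03-02"], {"QUOTE-1"}
)
```

Returning one boolean per layer, rather than a single pass/fail, keeps it visible which part of the work went wrong: the missing record, the thin audit text, or the unsafe write.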
Current Run
The current artifacts come from a single-model experiment run, so this is a run report rather than a multi-model leaderboard.
| Run | Overall | QTC/CPQ Workflow | CRM State Mutation | Equipment Support |
|---|---|---|---|---|
| Current single-model run | Below one-third | Strongest category | Partial success | Hardest category |
That result is low, but it is also useful. It shows that realistic Salesforce work stresses agents in ways ordinary text benchmarks do not.
A Walkthrough: Replacement Quotes
One successful task asked the agent to create replacement quote options for a large public-sector customer.
The CRM contained a quote for a replacement unit that superseded an original unit. The task asked for two new draft quote options:
- Option A: keep the replacement unit free through the existing term.
- Option B: charge the standard rear-loader multi-weekly rate from the next billing period.
The agent had to:
- Find the correct source quote among several similar customer records.
- Distinguish replacement-unit context from nearby similar quote records.
- Preserve the original quote unchanged.
- Create two new draft `Quote` records.
- Create corresponding `QuoteLineItem` records.
- Add an explanatory `Task` with replacement-unit and end-date context.
This passed.
Before: messy source state
- Source quote
- Original unit
- Replacement unit
- Similar nearby quotes
After: controlled mutation
- Draft option A
- Draft option B
- Audit task
- Source quote unchanged
That is a meaningful success because it is not simple retrieval. It is a small quote-to-cash workflow with source disambiguation, controlled mutation, and auditability.
A Walkthrough: The Wrong Discount Base
Another task looked similar on the surface but failed for a more subtle reason.
The agent had to create three extension pricing staging records for a regional operator’s ramp extension:
- Year 1 at the current price.
- Year 2 with a 5% discount.
- Year 3 with a 10% discount.
The agent created the right object type and the right number of records. It even used the expected discount percentages: 0%, 5%, and 10%.
But it failed because the discount amounts were wrong. The oracle expected the discount to be applied against the full commercial base; the agent applied it against a narrower line-level base.
- Expected: apply the discount against the full commercial base, then stage the annual pricing records.
- Actual: applied the same percentages against a narrower line-level base, producing plausible but wrong amounts.
That is exactly the kind of mistake that makes enterprise automation risky. The workflow shape was right, but the business base for the calculation was wrong.
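To see how the same percentages produce different amounts, here is a small worked sketch. Every number is invented for illustration: the 120,000 full base and 90,000 line-level base are hypothetical, as is the `stage_pricing` helper, and the sketch assumes each year's discount is taken against the undiscounted base rather than compounded.

```python
from decimal import Decimal


def stage_pricing(base, discount_pcts):
    """Build one staging row per year, discounting each year against `base`."""
    rows = []
    for year, pct in enumerate(discount_pcts, start=1):
        discount = (base * pct / Decimal(100)).quantize(Decimal("0.01"))
        rows.append({
            "year": year,
            "discount_pct": pct,
            "discount_amount": discount,
            "net": base - discount,
        })
    return rows


pcts = [Decimal(0), Decimal(5), Decimal(10)]

# Hypothetical bases: the full commercial base vs. a narrower line-level base.
expected = stage_pricing(Decimal("120000.00"), pcts)
actual = stage_pricing(Decimal("90000.00"), pcts)

for e, a in zip(expected, actual):
    print(e["year"], e["discount_amount"], a["discount_amount"])
```

With these made-up bases, the Year 2 discount is 6000.00 against the full base but 4500.00 against the line-level base. Both sets of rows have the right object type, row count, and percentages, so only a check on the amounts catches the error.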
What Passed
The model was strongest when the task had a crisp source record and a clear mutation pattern.
It successfully created 2-year and 3-year equipment ramp quote options with the expected quote line structure.
It created replacement quote options for a public-sector customer while preserving the source quote.
It converted a municipal quote into a billing-ready draft order package with a recurring equipment line, an adjustment line, and a validation task.
It also handled several negative-control workflows correctly. In one approval-gate scenario, it detected that the non-rental delta exceeded the approval threshold and created an approval task instead of an order. In a reconciliation-gate scenario, it refused order creation when quote totals did not reconcile. In a duplicate-order scenario, it found an existing converted order package and avoided duplicate order regeneration.
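The three gates described above can be sketched as a single decision function. This is an assumption-laden illustration: the `Action` names, the parameter names, the ordering of the gates, and the one-cent reconciliation tolerance are all hypothetical and not taken from the benchmark.

```python
from decimal import Decimal
from enum import Enum


class Action(Enum):
    CREATE_ORDER = "create_order"
    AUDIT_TASK = "audit_task"          # totals do not reconcile
    APPROVAL_TASK = "approval_task"    # delta exceeds approval threshold
    SKIP_DUPLICATE = "skip_duplicate"  # an order already exists


def decide_order_action(quote_total, line_totals, non_rental_delta,
                        approval_threshold, existing_order_ids,
                        tolerance=Decimal("0.01")):
    """Pick the safest action for a quote-to-order request (gate order assumed)."""
    # Duplicate gate: never regenerate an order that already exists.
    if existing_order_ids:
        return Action.SKIP_DUPLICATE
    # Reconciliation gate: header total must match the sum of the lines.
    if abs(quote_total - sum(line_totals, Decimal(0))) > tolerance:
        return Action.AUDIT_TASK
    # Approval gate: large non-rental deltas need a human sign-off.
    if non_rental_delta > approval_threshold:
        return Action.APPROVAL_TASK
    return Action.CREATE_ORDER


D = Decimal
clean = decide_order_action(D("1000"), [D("600"), D("400")], D("50"), D("500"), [])
unreconciled = decide_order_action(D("1000"), [D("600"), D("300")], D("50"), D("500"), [])
```

Three of the four outcomes create no order at all, which matches the point that holding off is often the correct CRM action.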
These are not trivial behaviors. The model can create records, respect gates, and sometimes know when to stop.
What Failed
The failures clustered into a few practical categories.
| Failure Mode | What Happened |
|---|---|
| Under-searching | The agent stopped before collecting all required records |
| Wrong-source selection | The agent failed to distinguish similar quotes, cases, asset records, or customers |
| Malformed output | The final response was not parseable JSON in some tasks |
| Missed mutation | The agent answered but did not create the required task, case, quote, or order |
| Wrong business math | The agent used a plausible but incorrect calculation base |
| Incomplete audit trail | The agent created a record but omitted required operational context |

The failure modes by task pack:

| Failure Mode | Equipment Support | CRM State Mutation | QTC/CPQ Workflow |
|---|---|---|---|
| Under-searching | High | Medium | Low |
| Wrong-source selection | High | Medium | Medium |
| Missed mutation | High | Medium | Low |
| Wrong business math | Low | Low | High |
| Incomplete audit trail | Medium | High | Medium |
The equipment support pack showed the clearest under-searching signal. Many failures involved too few tool calls, missing matched records, incorrect decisions, or missing required customer/date values.
A public-sector billing dispute is a good example. The agent needed to reconcile multiple asset usage logs and dates. Instead, it failed to produce parseable final JSON, missed every required log, omitted the required date values, and created no required escalation.
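A rough sketch of how the format and evidence failures in that example could be detected separately. The response contract here (`decision`, `matched_record_ids`, `dates`) and the record IDs are hypothetical; the post does not specify the benchmark's real output schema.

```python
import json

# Hypothetical response contract; the benchmark's real keys are not given here.
REQUIRED_KEYS = {"decision", "matched_record_ids", "dates"}


def check_final_response(raw, required_ids):
    """Check that the answer is a JSON object, has the contract keys,
    and cites every required record ID."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # Unparseable (or not an object): nothing can be credited.
        return {
            "parseable": False,
            "missing_keys": sorted(REQUIRED_KEYS),
            "missing_ids": sorted(required_ids),
        }
    cited = payload.get("matched_record_ids")
    cited_ids = {i for i in cited if isinstance(i, str)} if isinstance(cited, list) else set()
    return {
        "parseable": True,
        "missing_keys": sorted(REQUIRED_KEYS - payload.keys()),
        "missing_ids": sorted(required_ids - cited_ids),
    }


prose = check_final_response("I escalated this.", {"LOG-1", "LOG-2"})
partial = check_final_response(
    '{"decision": "escalate", "matched_record_ids": ["LOG-1"], "dates": ["2024-01-05"]}',
    {"LOG-1", "LOG-2"},
)
```

The prose answer fails at the parse step and gets credit for no logs, while the second answer parses cleanly but still shows one log as missing, which is the under-searching case.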
The CRM-state pack showed a different pattern: partial operational competence without exact end-state correctness. The agent passed the hydraulic case deduplication task, reopening the right existing case and creating an urgent follow-up. But in other tasks it created plausible tasks or updates that did not satisfy the verifier’s exact expected end state.
That is what makes the benchmark interesting. It does not just ask, “Did the answer sound right?” It asks, “Did the CRM end up in the right state?”
What This Measures
SalesforceBench measures five capabilities that matter for enterprise agents.
| Capability | What It Means in Salesforce | Example |
|---|---|---|
| Grounded search | Keep querying until the evidence set is complete | Find every relevant asset log, case, quote, or order |
| Schema adaptation | Use describes and query errors to learn the org | Recover when an assumed field does not exist |
| Record disambiguation | Separate similar customers, units, quotes, and cases | Pick the replacement quote, not a nearby parks quote |
| Mutation discipline | Create/update only allowed objects and preserve sources | Do not mutate source quotes or duplicate orders |
| Business arithmetic | Apply the right base, thresholds, and reconciliation rules | Compute deltas, discounts, prorations, freight splits |
These are not benchmark parlor tricks. They are the daily mechanics of enterprise software work.
Why Salesforce Is a Hard Agent Benchmark
Salesforce is a useful benchmark substrate because every org is its own little operating system.
The schema is customized. Processes are encoded in fields, tasks, notes, automation habits, and team conventions. Important evidence may be spread across standard objects, custom objects, history records, line items, attachments, and stale records. Similar records are common. The right answer often depends on what not to mutate.
A model that can answer Salesforce questions is not automatically a model that can do Salesforce work.
To do the work, it has to operate through tools, tolerate uncertainty, gather evidence, and leave the system in a correct end state.
Takeaway
SalesforceBench points at the real bottleneck for enterprise agents: business understanding inside messy systems of record.
Tool use and polished writing are table stakes. The hard part is reconstructing process from CRM state: which records matter, which contradictions are signal, when arithmetic must reconcile, and when the correct action is to hold off.
That is why simulated systems of record matter. They let us study agents against customized schemas, stale data, hidden process assumptions, and high-cost writes before those failures reach production.