Can Agents Actually Work Inside a CRM?

Frontier models now score well on most standard white-collar work benchmarks. But inside real companies, adoption is still early.

One reason is integration. Every company has its own systems, permissions, data models, and internal processes.

The deeper reason is business understanding.

Humans are surprisingly good at deriving business knowledge from messy systems. Give an operations person access to a Salesforce org and they will explore records, notice weird fields, infer local conventions, and identify anomalies. They will learn that one quote should not be converted because the totals do not reconcile, or that a support case is a duplicate except for one scheduled-maintenance record that should be left alone.

LLMs are much worse at this. They can call tools and produce plausible business text, but they often fail to build the underlying model of how the business works. They stop searching too early, treat contradictory records as noise, miss the one field that changes the decision, or confidently make a CRM mutation from incomplete context.

We saw this directly. In one billing-dispute task, the agent needed to reconcile multiple asset usage logs before escalating. A human would keep digging until the customer claim, dates, and asset logs lined up or contradicted each other. The agent stopped short, missed the required logs, and produced an answer that looked like work without leaving the CRM in the right state.

This matters because Salesforce, SAP, NetSuite, ServiceNow, and similar systems are the operating memory of a company. They do not just store data. Their metadata, custom fields, automations, permissions, and configuration encode how the company actually works.

At Ressl, we spend a lot of time embedded in customer operations: understanding how teams work, replicating pieces of those workflows with agents, evaluating which models are mature enough for which use cases, and being honest about where the technology still breaks. SalesforceBench is part of that research effort.

We built a simulated Salesforce environment around an industrial equipment operations company and tested whether agents can perform real CRM work: investigating cases, reconciling billing state, creating quote options, respecting approval gates, avoiding duplicate orders, and leaving behind auditable CRM mutations.

The current single-model run lands below one-third overall on a strict evaluation slice. The interesting part is not the score. It is where the agent fails.

  • Salesforce-style objects: dozens
  • Synthetic CRM records: thousands
  • Verifier checks: 150+

The Company

The environment is built around a synthetic industrial equipment operations business.

Think of a company that provides specialized vehicles and equipment to public-sector and commercial operators. Its Salesforce org tracks assets, deployments, rentals, quotes, quote line items, sales orders, logistics documents, support cases, customer tasks, pricing records, replacement units, freight charges, and approval workflows.

The daily work is exactly the kind of back-office Salesforce work that looks easy from far away and gets thorny up close:

  • A customer says a unit has no hydraulic power again.
  • A public-sector customer says it returned every unit, but billing still shows activity.
  • A replacement truck should be free through the existing rental term, unless the customer chooses a standard-rate option.
  • A quote should become a draft order only if totals reconcile.
  • A sales order has a suspicious six-figure freight line that should not be copied blindly.
  • A customer wants to extend a rental, but the same asset may already be booked for someone else.
  • Several cases describe the same down-unit issue, but one scheduled maintenance reminder is not part of the duplicate set.

[Figure: A simplified map of the simulated CRM state, connecting Customer Accounts, Support Cases, Quotes, Commercial Orders, Equipment Assets, Asset Usage Logs, Logistics Documents, Pricing Staging, Quote Lines, Billing Review, Operational Tasks, and Approval Gates.] The real benchmark forces the agent to move across these connected records before acting.

The point of the simulation is not to mimic one specific company. It is to recreate the kind of interconnected CRM state that enterprise operators actually work inside.

The Environment

The data is synthetic, but it is not toy data.

We seeded a fake Salesforce company using synthetic records informed by realistic CRM distributions: object mix, field sparsity, naming conventions, old records, duplicate records, custom fields, partial histories, stale cases, and the uneven data quality that accumulates in real business systems.

In other words, the records are fake. The shape of the work is real.

Dimension | Scale
Salesforce-style objects | Dozens
Exported/custom fields | Thousands
Synthetic CRM records | Thousands
Workflow families | Multiple
Verifier checks | 150+

The org includes standard Salesforce objects like Account, Case, Task, Opportunity, Quote, and QuoteLineItem, plus industrial-equipment-operations-specific custom objects such as Asset_Usage_Log__c, Equipment_Asset__c, Logistics_Document__c, Asset_Booking_Detail__c, Extension_Pricing_Staging__c, Commercial_Order__c, and Commercial_Order_Line__c.

That custom schema matters. A serious Salesforce agent cannot assume every org looks like the demo org. It has to discover the schema, recover from bad assumptions, and use local object relationships correctly.

Standard objects: Account, Case, Task, Opportunity, Quote, QuoteLineItem

Custom operations objects: Equipment_Asset__c, Asset_Usage_Log__c, Logistics_Document__c, Asset_Booking_Detail__c, Commercial_Order__c, Commercial_Order_Line__c

Workflow state: pricing staging, approval checks, duplicate gates, audit tasks, reconciliation notes, escalation routing

The Agent Tools

The agent does not get a database dump. It has to work through Salesforce-style APIs.

Tool | What it lets the agent do
sf_rest_describe_global | List available Salesforce objects and discover the org surface
sf_rest_describe_object | Inspect fields, create/update permissions, and object metadata
sf_rest_retrieve_record | Fetch a specific record by object and ID
sf_soql_query | Query structured records with Salesforce Object Query Language
sf_sosl_search | Search across objects when the agent has partial names, unit numbers, or customer text
sf_rest_create_record | Create allowed Salesforce records in the mock environment
sf_rest_update_record | Update allowed Salesforce records in the mock environment

This tool surface is intentionally plain. It tests whether an agent can do the unglamorous work: discover fields, query carefully, follow evidence, handle errors, and make safe writes.

One early trace shows why this matters. The agent tried to query a field that did not exist in this org’s Case schema. A reliable CRM agent has to adapt: describe the object, find the actual custom fields or text references, and continue. That kind of schema recovery is a core enterprise-agent skill.
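A minimal sketch of that recovery loop, under loud assumptions: `FakeOrg`, its `describe_object` and `soql_query` methods, the `InvalidFieldError` type, and the `Asset_Id__c` and `Unit_Number__c` fields are hypothetical stand-ins invented for illustration. They are not the benchmark's tool signatures or schema.

```python
import re


class InvalidFieldError(Exception):
    """Raised by the fake org when a query names a field that does not exist."""


class FakeOrg:
    """Hypothetical in-memory stand-in for the describe and SOQL tools."""

    def __init__(self, schema, records):
        self.schema = schema      # object name -> list of field names
        self.records = records    # object name -> list of record dicts

    def describe_object(self, obj):
        return list(self.schema[obj])

    def soql_query(self, obj, fields):
        unknown = [f for f in fields if f not in self.schema[obj]]
        if unknown:
            raise InvalidFieldError(f"No such column {unknown[0]!r} on {obj}")
        return [{f: rec.get(f) for f in fields} for rec in self.records[obj]]


def query_with_schema_recovery(org, obj, wanted, hint):
    """Try the assumed fields; on failure, describe the object and retry
    with the fields that exist plus any real field matching the hint."""
    try:
        return wanted, org.soql_query(obj, wanted)
    except InvalidFieldError:
        actual = org.describe_object(obj)
        keep = [f for f in wanted if f in actual]
        matches = [
            f for f in actual
            if re.search(hint, f, re.IGNORECASE) and f not in keep
        ]
        fields = keep + matches
        return fields, org.soql_query(obj, fields)


org = FakeOrg(
    schema={"Case": ["Id", "Subject", "Unit_Number__c"]},
    records={"Case": [
        {"Id": "500A", "Subject": "No hydraulic power", "Unit_Number__c": "U-17"},
    ]},
)
# The agent wrongly assumes an Asset_Id__c field exists on Case.
fields, rows = query_with_schema_recovery(
    org, "Case", ["Id", "Subject", "Asset_Id__c"], hint="unit"
)
print(fields)
print(rows)
```

The design choice worth noting is that the error is treated as information about the org rather than as a dead end: the failed query triggers a describe call, and the retry uses only fields the org actually exposes.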

User task: Investigate a support, billing, logistics, or quote-to-cash request.

Salesforce tools: Describe schema, search, query, retrieve, create, and update records (describe, SOQL, SOSL, retrieve, create, update).

Verifier: Check evidence, mutations, guardrails, audit text, and final response format.

The Tasks

The tasks are modeled on day-to-day Salesforce work owned by real office teams: support, billing, logistics, asset operations, sales ops, and finance.

Task Pack | What It Tests | Examples
Equipment Support | Customer-facing support triage and escalation | Hydraulic failures, billing disputes, paired delivery/pickup checks, damage coverage
CRM State Mutation | Safe updates to existing CRM state | Case deduplication, custody mismatch, asset extension conflicts, duplicate cleanup
QTC/CPQ Workflow | Quote-to-cash and sales operations workflows | Quote option creation, order generation, approval gates, freight splits, duplicate order checks

Each task forces the agent to use CRM evidence before acting. Some tasks require creating records. Some require updating existing records. Some require doing nothing except creating an audit task because the source data does not reconcile.

That last category is important. In enterprise software, “do not create the order” can be the most valuable action.

The Verifier

Verification runs against concrete evidence, allowed mutations, and final Salesforce state.

Each task has a verifier that checks the final answer and the end-state mutations. Depending on the task, the check is human-authored, compares against an expected Salesforce end state, or inspects the mock Salesforce transaction log.

Verification Layer | Example Check
Evidence | Did the agent find every required source record?
Decision | Did it choose the expected action or escalation path?
Mutation | Did the CRM end state contain the required create/update?
Guardrails | Did it avoid forbidden source-record mutations?
Auditability | Did created records include enough operational context?
Format | Was the final response parseable and contract-compliant?

This separates plausible-looking CRM work from correct CRM work.

For example, it is not enough to say, “I escalated this.” The verifier checks whether the right Task or Case exists, whether it is attached to the right Salesforce object, whether its text includes the required operational context, and whether the agent avoided unsafe mutations to source records.
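To make that concrete, here is a rough sketch of what such a check could look like against a transaction log. The log entry shape (`op`, `object`, `id`, `fields`), the `verify_escalation` function, and the sample IDs are assumptions made for illustration; the benchmark's actual verifiers may be structured quite differently. `WhatId` and `Description` are standard Task fields in Salesforce.

```python
def verify_escalation(log, expected_parent_id, required_phrases, protected_ids):
    """Return pass/fail flags for one escalation task.

    log: list of dicts with keys op, object, id, fields (hypothetical shape).
    """
    created_tasks = [
        e for e in log if e["op"] == "create" and e["object"] == "Task"
    ]
    # The Task must be attached to the expected parent record.
    attached = [
        t for t in created_tasks
        if t["fields"].get("WhatId") == expected_parent_id
    ]
    # At least one attached Task must mention every required phrase.
    text_ok = any(
        all(
            p.lower() in t["fields"].get("Description", "").lower()
            for p in required_phrases
        )
        for t in attached
    )
    # Any update to a protected source record is a guardrail violation.
    touched_protected = [
        e for e in log if e["op"] == "update" and e.get("id") in protected_ids
    ]
    return {
        "mutation": bool(created_tasks),
        "attachment": bool(attached),
        "auditability": text_ok,
        "guardrails": not touched_protected,
    }


log = [
    {"op": "create", "object": "Task", "id": "00T1", "fields": {
        "WhatId": "500A",
        "Description": "Escalated: unit U-17 has no hydraulic power again.",
    }},
    {"op": "update", "object": "Quote", "id": "0Q01", "fields": {"Status": "Draft"}},
]
result = verify_escalation(
    log, "500A", ["U-17", "hydraulic"], protected_ids={"0Q01"}
)
print(result)
```

In this sample the agent did create a well-formed, correctly attached Task, but it also updated a protected source quote, so the guardrail flag fails even though the escalation itself looks right.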

Current Run

The current artifacts are a single-model experiment run, so this is a run report rather than a multi-model leaderboard.

Run | Overall | QTC/CPQ Workflow | CRM State Mutation | Equipment Support
Current single-model run | Below one-third | Strongest category | Partial success | Hardest category

That result is low, but it is also useful. It shows that realistic Salesforce work stresses agents in ways ordinary text benchmarks do not.

A Walkthrough: Replacement Quotes

One successful task asked the agent to create replacement quote options for a large public-sector customer.

The CRM contained a quote for a replacement unit that superseded an original unit. The task asked for two new draft quote options:

  • Option A: keep the replacement unit free through the existing term.
  • Option B: charge the standard rear-loader multi-weekly rate from the next billing period.

The agent had to:

  1. Find the correct source quote among several similar customer records.
  2. Distinguish replacement-unit context from nearby similar quote records.
  3. Preserve the original quote unchanged.
  4. Create two new draft Quote records.
  5. Create corresponding QuoteLineItem records.
  6. Add an explanatory Task with replacement-unit and end-date context.

This passed.

Before: messy source state
  • Source quote
  • Original unit
  • Replacement unit
  • Similar nearby quotes

After: controlled mutation
  • Draft option A
  • Draft option B
  • Audit task
  • Source quote unchanged

That is a meaningful success because it is not simple retrieval. It is a small quote-to-cash workflow with source disambiguation, controlled mutation, and auditability.
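The mutation pattern in this walkthrough can be sketched roughly as follows. Everything here is a simplified assumption: `MockCrm`, the `Replacement_Unit__c` and `Term_End__c` field names, the record IDs, and the rate are invented for illustration, and the sketch ignores details such as when Option B's billing actually starts.

```python
import copy
import itertools
from decimal import Decimal


class MockCrm:
    """Hypothetical in-memory store standing in for the create tool."""

    def __init__(self, records):
        self.records = records            # record id -> record dict
        self._ids = itertools.count(1)

    def create(self, obj, fields):
        rec_id = f"{obj}-{next(self._ids)}"
        self.records[rec_id] = {"object": obj, **fields}
        return rec_id


def create_replacement_options(crm, source_quote_id, standard_rate):
    source = crm.records[source_quote_id]   # read only; never written back
    # Option A: free through the existing term. Option B: standard rate.
    options = {"A": Decimal("0"), "B": standard_rate}
    created = []
    for label, unit_price in options.items():
        quote_id = crm.create("Quote", {
            "Name": f"{source['Name']} - Option {label}",
            "Status": "Draft",
            "AccountId": source["AccountId"],
        })
        crm.create("QuoteLineItem", {
            "QuoteId": quote_id, "UnitPrice": unit_price, "Quantity": 1,
        })
        created.append(quote_id)
    # Audit task carrying replacement-unit and end-date context.
    crm.create("Task", {
        "WhatId": source_quote_id,
        "Subject": "Replacement quote options created",
        "Description": (
            f"Options {', '.join(created)} for replacement unit "
            f"{source['Replacement_Unit__c']}; existing term ends "
            f"{source['Term_End__c']}."
        ),
    })
    return created


crm = MockCrm({"Q-src": {
    "object": "Quote", "Name": "Replacement rear loader", "Status": "Presented",
    "AccountId": "001X", "Replacement_Unit__c": "U-42", "Term_End__c": "2025-06-30",
}})
before = copy.deepcopy(crm.records["Q-src"])
new_ids = create_replacement_options(crm, "Q-src", Decimal("1200"))
assert crm.records["Q-src"] == before   # source quote is untouched
```

The final assertion mirrors the "source quote unchanged" requirement: the function only reads the source record and writes new ones.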

A Walkthrough: The Wrong Discount Base

Another task looked similar on the surface but failed for a more subtle reason.

The agent had to create three extension pricing staging records for a regional operator’s ramp extension:

  • Year 1 at the current price.
  • Year 2 with a 5% discount.
  • Year 3 with a 10% discount.

The agent created the right object type and the right number of records. It even used the expected discount percentages: 0%, 5%, and 10%.

But it failed because the discount amounts were wrong. The oracle expected the discount to be applied against the full commercial base; the agent applied it against a narrower line-level base.

Expected: Apply the discount against the full commercial base, then stage the annual pricing records.

Actual: Applied the same percentages against a narrower line-level base, producing plausible but wrong amounts.

That is exactly the kind of mistake that makes enterprise automation risky. The workflow shape was right, but the business base for the calculation was wrong.
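A small worked example shows how the same percentages diverge once the base changes. The two base amounts below are made-up numbers chosen for easy arithmetic, not figures from the benchmark, and `discount_amounts` is a hypothetical helper.

```python
from decimal import Decimal

# Discount schedule from the task: year -> discount rate.
DISCOUNTS = {1: Decimal("0.00"), 2: Decimal("0.05"), 3: Decimal("0.10")}


def discount_amounts(base):
    """Return {year: discount amount} for a given calculation base."""
    return {year: base * pct for year, pct in DISCOUNTS.items()}


full_base = Decimal("120000")   # hypothetical full commercial base
line_base = Decimal("90000")    # hypothetical narrower line-level base

expected = discount_amounts(full_base)
actual = discount_amounts(line_base)

for year in DISCOUNTS:
    gap = expected[year] - actual[year]
    print(f"Year {year}: expected {expected[year]}, actual {actual[year]}, gap {gap}")
```

With these illustrative bases, the Year 2 discount is 6,000 against the full base but 4,500 against the line-level base, and the Year 3 discount is 12,000 versus 9,000. The percentages match in both cases, which is why the error is easy to miss without checking the amounts.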

What Passed

The model was strongest when the task had a crisp source record and a clear mutation pattern.

It successfully created 2-year and 3-year equipment ramp quote options with the expected quote line structure.

It created replacement quote options for a public-sector customer while preserving the source quote.

It converted a municipal quote into a billing-ready draft order package with a recurring equipment line, an adjustment line, and a validation task.

It also handled several negative-control workflows correctly. In one approval-gate scenario, it detected that the non-rental delta exceeded the approval threshold and created an approval task instead of an order. In a reconciliation-gate scenario, it refused order creation when quote totals did not reconcile. In a duplicate-order scenario, it found an existing converted order package and avoided duplicate order regeneration.
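The three negative controls above can be read as a chain of gates that run before any order is created. The sketch below is one plausible arrangement, not the benchmark's logic: the gate order, the `kind` and `amount` line fields, and the simplification of the "non-rental delta" to the sum of non-rental lines are all assumptions.

```python
from decimal import Decimal


def decide_order_action(quote, existing_order_ids, approval_threshold):
    """Return 'skip_duplicate', 'audit_task', 'approval_task', or 'create_order'."""
    # Duplicate gate: an order package already exists for this quote.
    if existing_order_ids:
        return "skip_duplicate"
    # Reconciliation gate: line amounts must add up to the quote total.
    line_total = sum((line["amount"] for line in quote["lines"]), Decimal("0"))
    if line_total != quote["total"]:
        return "audit_task"
    # Approval gate: non-rental amounts above the threshold need sign-off.
    non_rental = sum(
        (line["amount"] for line in quote["lines"] if line["kind"] != "rental"),
        Decimal("0"),
    )
    if non_rental > approval_threshold:
        return "approval_task"
    return "create_order"


quote_ok = {"total": Decimal("1500"), "lines": [
    {"kind": "rental", "amount": Decimal("1400")},
    {"kind": "freight", "amount": Decimal("100")},
]}
quote_bad = {"total": Decimal("1600"), "lines": quote_ok["lines"]}

print(decide_order_action(quote_ok, [], Decimal("500")))
print(decide_order_action(quote_bad, [], Decimal("500")))
print(decide_order_action(quote_ok, [], Decimal("50")))
print(decide_order_action(quote_ok, ["CO-1"], Decimal("500")))
```

Only the first call ends in an order. The other three end in an audit task, an approval task, or no write at all, which is the behavior the negative-control tasks reward.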

These are not trivial behaviors. The model can create records, respect gates, and sometimes know when to stop.

What Failed

The failures clustered into a few practical categories.

Failure Mode | What Happened
Under-searching | The agent stopped before collecting all required records
Wrong-source selection | The agent failed to distinguish similar quotes, cases, asset records, or customers
Malformed output | The final response was not parseable JSON in some tasks
Missed mutation | The agent answered but did not create the required task, case, quote, or order
Wrong business math | The agent used a plausible but incorrect calculation base
Incomplete audit trail | The agent created a record but omitted required operational context

Failure Mode | Equipment Support | CRM State Mutation | QTC/CPQ Workflow
Under-searching | High | Medium | Low
Wrong-source selection | High | Medium | Medium
Missed mutation | High | Medium | Low
Wrong business math | Low | Low | High
Incomplete audit trail | Medium | High | Medium

The equipment support pack was the clearest under-searching signal. Many failures involved too few tool calls, missing matched records, incorrect decisions, or missing required customer/date values.

A public-sector billing dispute is a good example. The agent needed to reconcile multiple asset usage logs and dates. Instead, it failed to produce parseable final JSON, missed every required log, omitted the required date values, and created no required escalation.
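As a rough illustration of what "reconciling logs and dates" involves, the sketch below checks two things: whether usage logs were found for every unit on the claim, and whether any log shows activity after the claimed return date. The function name, the `asset` and `usage_date` keys, and the sample units and dates are hypothetical.

```python
from datetime import date


def reconcile_return_claim(claimed_return, expected_assets, usage_logs):
    """Compare a customer's return claim against the usage logs found so far."""
    seen = {log["asset"] for log in usage_logs}
    # Units with no logs yet mean the evidence set is incomplete.
    missing = sorted(set(expected_assets) - seen)
    # Usage dated after the claimed return contradicts the claim.
    late = {}
    for log in usage_logs:
        if log["usage_date"] > claimed_return:
            late.setdefault(log["asset"], []).append(log["usage_date"])
    return {
        "evidence_complete": not missing,
        "assets_without_logs": missing,
        "activity_after_return": {a: sorted(d) for a, d in late.items()},
    }


logs = [
    {"asset": "U-17", "usage_date": date(2024, 3, 1)},
    {"asset": "U-17", "usage_date": date(2024, 3, 9)},
    {"asset": "U-18", "usage_date": date(2024, 2, 27)},
]
report = reconcile_return_claim(date(2024, 3, 5), ["U-17", "U-18", "U-19"], logs)
print(report)
```

In this sample, unit U-19 has no logs yet, so the agent should keep searching before deciding anything, and unit U-17 shows usage after the claimed return date, which is exactly the kind of contradiction an escalation would need to cite.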

The CRM-state pack showed a different pattern: partial operational competence without exact end-state correctness. The agent passed the hydraulic case deduplication task, reopening the right existing case and creating an urgent follow-up. But in other tasks it created plausible tasks or updates that did not satisfy the verifier’s exact expected end state.
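The deduplication pattern, including the scheduled-maintenance exception from the task list, might be sketched like this. The case fields (`unit`, `type`, `opened`), the "Scheduled Maintenance" type value, and the rule that the oldest case becomes the primary are assumptions for illustration only.

```python
def plan_case_dedup(cases):
    """Group cases by unit into a primary plus duplicates,
    leaving scheduled-maintenance cases out of the duplicate set."""
    left_alone = [c["id"] for c in cases if c["type"] == "Scheduled Maintenance"]
    candidates = [c for c in cases if c["type"] != "Scheduled Maintenance"]
    by_unit = {}
    # ISO date strings sort chronologically, so the oldest case comes first.
    for case in sorted(candidates, key=lambda c: c["opened"]):
        by_unit.setdefault(case["unit"], []).append(case["id"])
    plan = {
        unit: {"primary": ids[0], "duplicates": ids[1:]}
        for unit, ids in by_unit.items()
    }
    return plan, left_alone


cases = [
    {"id": "C-1", "unit": "U-17", "type": "Problem", "opened": "2024-03-01"},
    {"id": "C-2", "unit": "U-17", "type": "Problem", "opened": "2024-03-03"},
    {"id": "C-3", "unit": "U-17", "type": "Scheduled Maintenance", "opened": "2024-03-02"},
    {"id": "C-4", "unit": "U-17", "type": "Problem", "opened": "2024-03-04"},
]
plan, left_alone = plan_case_dedup(cases)
print(plan)
print(left_alone)
```

The function returns a plan rather than performing writes, so the agent (or a reviewer) can inspect which cases would be merged and which are deliberately left alone before any mutation happens.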

That is what makes the benchmark interesting. It does not just ask, “Did the answer sound right?” It asks, “Did the CRM end up in the right state?”

What This Measures

SalesforceBench measures five capabilities that matter for enterprise agents.

Capability | What It Means in Salesforce | Example
Grounded search | Keep querying until the evidence set is complete | Find every relevant asset log, case, quote, or order
Schema adaptation | Use describes and query errors to learn the org | Recover when an assumed field does not exist
Record disambiguation | Separate similar customers, units, quotes, and cases | Pick the replacement quote, not a nearby parks quote
Mutation discipline | Create/update only allowed objects and preserve sources | Do not mutate source quotes or duplicate orders
Business arithmetic | Apply the right base, thresholds, and reconciliation rules | Compute deltas, discounts, prorations, freight splits

These are not benchmark parlor tricks. They are the daily mechanics of enterprise software work.

Why Salesforce Is a Hard Agent Benchmark

Salesforce is a useful benchmark substrate because every org is its own little operating system.

The schema is customized. Processes are encoded in fields, tasks, notes, automation habits, and team conventions. Important evidence may be spread across standard objects, custom objects, history records, line items, attachments, and stale records. Similar records are common. The right answer often depends on what not to mutate.

A model that can answer Salesforce questions is not automatically a model that can do Salesforce work.

To do the work, it has to operate through tools, tolerate uncertainty, gather evidence, and leave the system in a correct end state.

Takeaway

SalesforceBench points at the real bottleneck for enterprise agents: business understanding inside messy systems of record.

Tool use and polished writing are table stakes. The hard part is reconstructing process from CRM state: which records matter, which contradictions are signal, when arithmetic must reconcile, and when the correct action is to hold off.

That is why simulated systems of record matter. They let us study agents against customized schemas, stale data, hidden process assumptions, and high-cost writes before those failures reach production.