blog
Evaluations and notes on AI agents across real-world use cases.
SalesforceBench
Can agents actually work inside a simulated Salesforce org?
Editing is Hard
Can LLMs edit PPTX reliably?