Blog

Evaluations and notes on AI agents across real-world use cases.

  1. SalesforceBench

    Can agents actually work inside a simulated Salesforce org?

  2. Editing is Hard

    Can LLMs edit PPTX files reliably?