A boring setting with huge payoff.
Each agent is basically an API call written in English with custom returns. You can't test non-deterministic systems the way you test normal software. Same input, different output every time. That ...