Adam Issah

Software engineer working on AI agents and evaluation. I build agent gyms and LLM eval harnesses: reproducible systems that measure how well models actually do real work, and the cloud infrastructure that runs them at scale.

Projects

FireDrill agent gym

An RL-compatible incident-response gym. It drops an agent into a broken software project and scores whether the agent fixes the incident, how precisely, and at what cost. It covers 10 scenarios and 4 models and runs on autoscaling cloud infrastructure.

firedrill.adamissah.com →
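
As a rough illustration of the three scoring axes described above (fixed or not, precision, cost), here is a minimal sketch. Every name in it (EpisodeResult, touched, relevant, cost_usd) is hypothetical, and treating precision as the share of modified files that actually needed changing is an assumption, not necessarily how FireDrill defines it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one agent run on one scenario (all field names are hypothetical)."""
    fixed: bool          # did the incident check pass after the agent's changes?
    touched: frozenset   # files the agent modified
    relevant: frozenset  # files that actually needed changing
    cost_usd: float      # spend attributed to the run


def precision(result: EpisodeResult) -> float:
    """Share of touched files that were relevant; 0.0 if the agent touched nothing."""
    if not result.touched:
        return 0.0
    return len(result.touched & result.relevant) / len(result.touched)


def report(result: EpisodeResult) -> dict:
    """Keep the three axes separate instead of collapsing them into one number."""
    return {
        "fixed": result.fixed,
        "precision": precision(result),
        "cost_usd": result.cost_usd,
    }


example = EpisodeResult(
    fixed=True,
    touched=frozenset({"app/db.py", "README.md"}),
    relevant=frozenset({"app/db.py"}),
    cost_usd=0.42,
)
summary = report(example)
```

Reporting the axes side by side keeps a cheap but sloppy fix distinguishable from an expensive but surgical one. An RL reward could be derived from these fields, but how to combine them is left open here.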

ChessPuzzle Benchmark eval harness

An LLM evaluation harness benchmarking models on 300 Lichess mate-in-N puzzles. A composite score (correctness, valid-move rate, format compliance, and latency) goes beyond pass/fail, with accuracy broken down by difficulty tier and mate length to show where models degrade.

chess.adamissah.com →
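
To make the composite score and the per-tier breakdown concrete, here is a minimal sketch. The weights, the latency budget, and all names (PuzzleResult, composite, accuracy_by) are assumptions for illustration; the actual harness may weight and normalize these components differently.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights for the four components; they sum to 1.0.
WEIGHTS = {"correct": 0.6, "valid": 0.2, "format": 0.1, "latency": 0.1}
# Assumed budget: a response taking this long or longer earns no latency credit.
LATENCY_BUDGET_S = 10.0


@dataclass(frozen=True)
class PuzzleResult:
    """One model answer to one puzzle (field names are hypothetical)."""
    tier: str         # difficulty tier label, e.g. "easy"
    mate_in: int      # mate length N of the puzzle
    correct: bool     # answer matches the puzzle solution
    valid_move: bool  # answer is a legal move in the position
    format_ok: bool   # answer followed the requested output format
    latency_s: float  # response time in seconds


def composite(results: list) -> float:
    """Weighted mean of correctness, valid-move rate, format compliance, and latency."""
    if not results:
        raise ValueError("no results to score")
    n = len(results)
    parts = {
        "correct": sum(r.correct for r in results) / n,
        "valid": sum(r.valid_move for r in results) / n,
        "format": sum(r.format_ok for r in results) / n,
        # Linear falloff from 1.0 at 0 s to 0.0 at the budget, clamped at 0.
        "latency": sum(max(0.0, 1.0 - r.latency_s / LATENCY_BUDGET_S) for r in results) / n,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def accuracy_by(results: list, key) -> dict:
    """Accuracy grouped by an arbitrary key, such as tier or mate length."""
    buckets = defaultdict(list)
    for r in results:
        buckets[key(r)].append(r.correct)
    return {k: sum(v) / len(v) for k, v in buckets.items()}


runs = [
    PuzzleResult("easy", 1, True, True, True, 0.0),
    PuzzleResult("hard", 3, False, True, True, 10.0),
]
score = composite(runs)
by_tier = accuracy_by(runs, key=lambda r: r.tier)
by_mate = accuracy_by(runs, key=lambda r: r.mate_in)
```

Grouping with a key function means the same helper serves both breakdowns mentioned above, by difficulty tier and by mate length.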