Gensyn's Judge Tackles AI's Biggest Trust Gap: Who Evaluates the Evaluators?
GPT-4 disagrees with itself 40% of the time when asked to judge the same response twice. Bard hallucinated 91% of its references in medical systematic reviews. And the benchmarks meant to keep AI honest? Models are increasingly optimized to game them. The entire AI evaluation stack — the infrastructure that tells us whether a model is good, safe, or truthful — rests on foundations that are opaque, non-reproducible, and silently shifting under our feet.
Gensyn, the decentralized machine-learning protocol backed by $50 million from a16z crypto, CoinFund, and Protocol Labs, thinks it has a structural fix. Its new system, called Judge, brings cryptographically verifiable AI evaluation to production — replacing black-box API calls with deterministic, challengeable, on-chain proofs of model quality. If it works at scale, it could reshape how the AI industry establishes trust.