
Link: A new Artificial Analysis benchmark, focusing on OpenAI's gpt-oss-120b, shows how open-weight LLMs exhibit inconsistent performance across hosting providers (Simon Willison/Simon Willison's Weblog)

Artificial Analysis recently released a benchmark testing OpenAI's gpt-oss-120b model across various hosting providers, revealing notable performance differences. The study had each provider's deployment solve the American Invitational Mathematics Examination (AIME) with reasoning effort set to high, and accuracy ranged from 36.7% to 93.3%.
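A minimal sketch of what such a benchmark run might look like, assuming the provider exposes an OpenAI-compatible endpoint that honors a `reasoning_effort` parameter (the base URL, API key, and exact-match scoring are illustrative simplifications; a real harness would extract the final answer from the model's output):

```python
# Hypothetical benchmark sketch: query a hosted gpt-oss-120b endpoint with
# reasoning effort set to high, then score answers against the AIME key.
from openai import OpenAI

# Placeholder base URL and key; each hosting provider exposes its own.
client = OpenAI(base_url="https://example-provider.com/v1", api_key="PROVIDER_KEY")

def answer(problem: str) -> str:
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        reasoning_effort="high",  # the setting Artificial Analysis reports using
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content.strip()

def aime_accuracy(problems: list[str], answers: list[str]) -> float:
    # AIME answers are three-digit integers, so exact match is a reasonable
    # (if simplified) scoring rule once the final answer has been extracted.
    correct = sum(answer(p) == a for p, a in zip(problems, answers))
    return correct / len(problems)
```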

The highest score, 93.3%, was predominantly achieved by providers running the latest version of the vLLM serving framework. Notably, CompactifAI lagged far behind at 36.7%, explained by the heavy compression it applies to the model in exchange for lower costs.

Running outdated software also hurt performance in specific cases: Azure initially scored 80% and improved after updating its model-serving stack. The variation in results illustrates the nuanced challenge customers face when choosing a provider for open-weight models.

Assessing hosted model performance reliably remains difficult because factors like quantization level and choice of serving framework are often not disclosed. Tool-calling conventions pose another risk: deviations from the expected format can lead to unpredictable results.
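One hedged sketch of how a customer might sanity-check a provider's tool-calling behavior, assuming an OpenAI-compatible endpoint (the `get_weather` tool and prompt are hypothetical examples, not part of any existing suite):

```python
# Ask the hosted model to call a single hypothetical tool and verify the
# arguments come back as valid JSON with the expected fields. Deviations
# from the expected convention show up here as missing or malformed calls.
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1", api_key="PROVIDER_KEY")

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def tool_call_is_well_formed() -> bool:
    response = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=[weather_tool],
    )
    calls = response.choices[0].message.tool_calls or []
    if not calls or calls[0].function.name != "get_weather":
        return False
    try:
        args = json.loads(calls[0].function.arguments)
    except json.JSONDecodeError:
        return False
    return "city" in args
```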

A standardized conformance suite could improve reliability and consistency across hosts, but none exists yet, partly because model non-determinism makes strict pass/fail checks hard to define.
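One way such a suite could cope with non-determinism (an assumption about a possible design, not an existing standard) is to run each check several times and require a minimum pass rate rather than identical output on every run:

```python
from typing import Callable

def conformance_check(case: Callable[[], bool], trials: int = 8,
                      threshold: float = 0.75) -> bool:
    """Pass if the check succeeds in at least `threshold` of `trials` runs."""
    passes = sum(case() for _ in range(trials))
    return passes / trials >= threshold

# Example: the tool-calling check sketched above could be one such case.
# conformance_check(tool_call_is_well_formed)
```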

OpenAI has responded to some of these issues by including a compatibility test in the gpt-oss GitHub repository, helping providers ensure correct implementation of model specifications.


--

Yoooo, this is a quick note on a link that made me go, WTF? Find all past links here.