Benchmarking LLM Performance for Journalism


Figuring out which generative-AI model to use can be intimidating, and parsing the grid of model names and scores can be dizzying. Each model seems to vie for bragging rights in things like “graduate-level reasoning”, “agentic coding”, “multilingual Q&A”, and so on. Anthropic’s latest Claude release is emblematic. Other performance benchmarks include the popular LM Arena, which measures user preferences between model outputs. But as these metrics proliferate, one question has been nagging us: Do any of these scores tell us which models we should use for journalism, and when?
Benchmarks shown in one of the recent Claude releases.

To address this question, the Generative AI in the Newsroom Initiative convened 23 journalists to work through what a benchmark tailored to journalism might look like. The workshop, held at Northwestern, aimed to sketch news-oriented benchmarks supported by the community. Such benchmarks might serve as a clearer compass for comparing and choosing generative AI models and systems appropriate to news use cases, and they might inform tech companies about journalistic needs as they improve their models.

Workshop participants tackled questions such as: What do journalists care about measuring? How might we measure those things in ways that are valid and realistic with respect to practice? And what data would be needed to implement a rigorous yet practical benchmark? Participants worked through these questions in the context of six umbrella use cases (Information/Data Extraction, Semantic Search, Summarization, Content Transformation, Background Research, and Fact Checking) and six values that are typically important to the practice of journalism (Accuracy, Transparency, Confidence/Uncertainty, Accountability, Objectivity/Bias, and Timeliness/Recency). These selections were informed by our previous research and a pre-survey of the workshop participants.
