Google AI recently released Stax, an experimental evaluation tool designed to help developers test and analyze large language models (LLMs) more effectively. Unlike traditional software, LLMs are probabilistic systems that can produce different responses to the same prompt, which makes consistent, reproducible evaluation difficult. Stax gives developers a structured way to evaluate and compare LLMs against custom criteria.


Model evaluation typically relies on leaderboards and general-purpose benchmarks. These help track high-level progress but do not capture domain-specific requirements. For example, a model that performs well on open-domain reasoning tasks might struggle with compliance summaries, legal text analysis, or answering enterprise-specific questions. Stax addresses this by letting developers define evaluation processes relevant to their own use cases.

An important feature of Stax is "Quick Comparison," which lets developers run the same prompts across different models side by side, making it easier to see how prompt design or model choice affects outputs and cutting down trial-and-error time. Stax also offers "Projects and Datasets": for larger-scale testing, developers can build structured test sets and apply consistent evaluation criteria across many samples, which supports reproducibility and makes it easier to evaluate models under more realistic conditions.
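The workflow described here boils down to "same inputs, many models, consistent criteria." As a rough illustration of that idea only (this is not Stax's actual interface; the dataset fields, model identifiers, and the `generate()` helper below are all hypothetical), a structured test set and a side-by-side run might look like this:

```python
import json

# Hypothetical structured test set: each record pairs a prompt with the
# domain-specific reference it should be judged against.
TEST_SET = [
    {"id": "q1",
     "prompt": "Summarize the attached compliance policy in three bullets.",
     "reference": "Covers data retention, access control, and audit logging."},
    {"id": "q2",
     "prompt": "What is our refund window for enterprise contracts?",
     "reference": "30 days from invoice date."},
]

MODELS = ["model-a", "model-b"]  # placeholder model identifiers

def generate(model: str, prompt: str) -> str:
    """Stand-in for a real model call (e.g., your provider's SDK)."""
    return f"[{model} output for: {prompt[:40]}...]"

# Side-by-side comparison: identical prompts, different models.
rows = []
for item in TEST_SET:
    row = {"id": item["id"], "prompt": item["prompt"]}
    for model in MODELS:
        row[model] = generate(model, item["prompt"])
    rows.append(row)

print(json.dumps(rows, indent=2))
```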

The core concept of Stax is the "Auto Evaluator." Developers can build custom evaluators tailored to their use cases or use pre-built evaluators. Built-in options cover common evaluation categories, such as fluency (grammatical correctness and readability), factuality (factual consistency with reference material), and safety (ensuring outputs avoid harmful or inappropriate content). This flexibility allows evaluations to align with actual needs rather than relying on a single generic metric.
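To make the evaluator concept concrete, here is a minimal sketch of what a custom evaluator can look like in general: a function that scores an output against a reference and returns a result with a rationale. The function names, scoring rules, and 0-to-1 scale are assumptions for illustration, not Stax's built-in evaluators.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    name: str
    score: float       # assumed 0.0-1.0 scale
    rationale: str

def factuality_evaluator(output: str, reference: str) -> EvalResult:
    """Naive factuality proxy: fraction of reference tokens present in the output."""
    ref_tokens = set(reference.lower().split())
    out_tokens = set(output.lower().split())
    overlap = len(ref_tokens & out_tokens) / max(len(ref_tokens), 1)
    return EvalResult("factuality", overlap, f"{overlap:.0%} of reference tokens covered")

def safety_evaluator(output: str, reference: str = "") -> EvalResult:
    """Toy safety check against a small blocklist."""
    blocked = {"password", "ssn"}
    hits = [word for word in blocked if word in output.lower()]
    score = 0.0 if hits else 1.0
    detail = f"blocked terms found: {hits}" if hits else "no blocked terms"
    return EvalResult("safety", score, detail)

EVALUATORS: list[Callable[..., EvalResult]] = [factuality_evaluator, safety_evaluator]

output = "Refunds are allowed within 30 days of the invoice date."
reference = "30 days from invoice date."
for evaluator in EVALUATORS:
    result = evaluator(output, reference)
    print(f"{result.name}: {result.score:.2f} ({result.rationale})")
```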

Additionally, Stax's analytics dashboard makes it easier to interpret results. Developers can view performance trends, compare outputs from different evaluators, and analyze how different models perform on the same dataset. Overall, Stax provides developers with a tool to transition from ad-hoc testing to structured evaluation, helping teams better understand model performance under specific conditions in production environments and track whether outputs meet the standards required for real-world applications.
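The same kind of aggregation the dashboard performs can be sketched with ordinary tooling. Assuming per-sample scores keyed by model and evaluator (hypothetical data, not actual Stax output), a per-model summary might be produced like this:

```python
import pandas as pd

# Hypothetical per-sample scores; in practice these would come from
# evaluation runs like the ones sketched above.
scores = pd.DataFrame([
    {"model": "model-a", "evaluator": "factuality", "sample": "q1", "score": 0.82},
    {"model": "model-a", "evaluator": "safety",     "sample": "q1", "score": 1.00},
    {"model": "model-b", "evaluator": "factuality", "sample": "q1", "score": 0.64},
    {"model": "model-b", "evaluator": "safety",     "sample": "q1", "score": 1.00},
])

# Mean score per model and evaluator, similar in spirit to a dashboard view.
summary = scores.pivot_table(index="model", columns="evaluator",
                             values="score", aggfunc="mean")
print(summary)
```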

Project: https://stax.withgoogle.com/landing/index.html

Key Points:

🌟 Stax is an experimental tool launched by Google AI, aimed at helping developers evaluate large language models according to custom criteria.

🔍 With features like "Quick Comparison" and "Projects and Datasets," developers can conduct model testing and evaluation more efficiently.

📊 Stax supports custom and pre-built evaluators, helping developers obtain evaluation results aligned with actual needs.