generalization

Public

Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.
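As a rough illustration of the task described above, the following minimal sketch models one question as a set of examples, anti-examples, and candidates, and scores a picker by accuracy. The class `ThemeItem`, its field names, the sample data, and the accuracy metric are all assumptions made for illustration; this page does not describe the benchmark's actual data format or scoring method.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ThemeItem:
    """One hypothetical benchmark question (field names are assumptions)."""
    examples: tuple[str, ...]       # items that fit the hidden theme
    anti_examples: tuple[str, ...]  # near misses that do not fit the theme
    candidates: tuple[str, ...]     # one true match plus misleading distractors
    answer_index: int               # position of the true match in candidates


def accuracy(items: Sequence[ThemeItem],
             pick: Callable[[ThemeItem], int]) -> float:
    """Return the fraction of items where the picker selects the true match."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if pick(item) == item.answer_index)
    return correct / len(items)


def pick_first(item: ThemeItem) -> int:
    """Trivial baseline standing in for a model call: always pick candidate 0."""
    return 0


# Illustrative data only: themes are "birds" and "bowed string instruments".
items = [
    ThemeItem(("sparrow", "eagle"), ("bat", "butterfly"),
              ("owl", "airplane", "bee"), 0),
    ThemeItem(("violin", "cello"), ("guitar", "harp"),
              ("piano", "trumpet", "viola"), 2),
]

print(accuracy(items, pick_first))
```

In practice `pick` would wrap a call to the model under test; the baseline here only shows the shape of the interface.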

benchmark evaluation generalization gpt-4-5 llm llms llms-benchmarking sonnet3-7

Created: 2025-01-14T19:05:40

Updated: 2025-03-27T11:08:44

Related projects

Langfuse

Hot

analytics

Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. YC W23

19141

1 year ago

+126 today

Opik

Hot

langchain

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

16579

1 year ago

+99 today

RagaAI Catalyst

agentic-ai

Python SDK for agent AI observability, monitoring, and evaluation. Features include agent, LLM, and tool tracing, debugging of multi-agent systems, a self-hosted dashboard, and advanced analytics with timeline and execution graph views.

16080

1 year ago

+4 today

Transferlearning

deep-learning

Transfer learning / domain adaptation / domain generalization / multi-task learning etc. Papers, code, datasets, applications, tutorials.

14195

1 year ago

+6 today

Fashion Mnist

benchmark

An MNIST-like fashion product database and benchmark.

12557

2 years ago

+4 today

Deepeval

evaluation-framework

The LLM Evaluation Framework

12511

1 year ago

+41 today

Ragas

evaluation

Supercharge Your LLM Application Evaluations

11688

1 year ago

+37 today

Lm Evaluation Harness

evaluation-framework

A framework for few-shot evaluation of language models.

10879

1 year ago

+40 today

Promptfoo

Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

9346

1 year ago

+40 today

BELLE

bloom

BELLE: Be Everyone's Large Language model Engine (an open-source Chinese conversational large language model)

8275

1 year ago

+1 today
