
Human-Aligned-LLM-Evaluation-Audit


A data-driven audit of AI judge reliability using MT-Bench human annotations. This project analyzes 3,500+ model comparisons across 6 LLMs and 8 task categories to measure how well GPT-4 evaluations align with human judgment. It includes a Python workflow, disagreement metrics, and a Power BI dashboard.
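The core of such an audit is a disagreement metric between the AI judge's verdicts and human votes. The repository's actual metrics are not shown here; as a minimal sketch, the snippet below computes raw agreement rate and Cohen's kappa (chance-corrected agreement) over hypothetical pairwise verdicts, where each comparison is labeled "A", "B", or "tie".

```python
from collections import Counter

def agreement_rate(judge, human):
    """Fraction of comparisons where the judge's verdict matches the human vote."""
    assert len(judge) == len(human) and judge
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohen_kappa(judge, human):
    """Cohen's kappa: agreement corrected for chance, given the label marginals."""
    n = len(judge)
    p_o = agreement_rate(judge, human)                      # observed agreement
    jc, hc = Counter(judge), Counter(human)
    p_e = sum(jc[k] * hc[k] for k in set(jc) | set(hc)) / (n * n)  # expected by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts for six model comparisons (not real MT-Bench data)
judge = ["A", "A", "B", "tie", "A", "B"]
human = ["A", "B", "B", "tie", "A", "A"]
print(agreement_rate(judge, human))  # 0.666...
print(cohen_kappa(judge, human))     # ~0.4545
```

Reporting kappa alongside raw agreement matters because a judge that always picks the more common label can score a high raw agreement rate while adding no information.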

Created: 2025-11-23T21:21:46
Updated: 2025-11-24T20:03:12
Stars: 0
