reasoning-benchmarks
A reproducible harness for evaluating LLM reasoning strategies (CoT, Self-Consistency, ToT, etc.) across benchmarks like GSM8K, ARC-Challenge, and MMLU. Supports OpenAI, Hugging Face, and Ollama backends with unified metrics and plots.