In the rapidly evolving landscape of large language models (LLMs), comprehensive and robust evaluation methodologies remain a critical challenge, particularly for low-resource languages. In this blog, we introduce AraGen, a generative-task benchmark and leaderboard for Arabic LLMs, which we hope will inspire similar work for other languages.
The AraGen leaderboard makes three key contributions:
- 3C3H Measure: The 3C3H measure is central to this framework: a holistic approach that scores a model's response across six dimensions (Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness) using an LLM-as-judge. A minimal scoring sketch follows this list.
- Dynamic Evaluations: The AraGen Leaderboard follows a dynamic evaluation strategy based on three-month blind-testing cycles: the dataset and the evaluation code remain private during each cycle, are publicly released at its end, and are replaced by a new benchmark that is in turn kept private for the next cycle.
- Arabic Evaluation Dataset: The AraGen Benchmark offers a meticulously constructed evaluation dataset for Arabic LLMs, combining single-turn and multi-turn scenarios that test model capabilities across multiple domains and tasks.
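To make the idea concrete, here is a minimal sketch of how six judge scores could be aggregated into a single 3C3H value. The scales and the simple averaging below are illustrative assumptions (binary Correctness/Completeness, 1-5 for the other dimensions); the exact definition used by AraGen is given in the technical blog linked at the end of this post.

```python
# Hypothetical aggregation of judge scores into a single 3C3H value.
# Scales and weighting here are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    correctness: int    # assumed binary: 0 or 1
    completeness: int   # assumed binary: 0 or 1
    conciseness: int    # assumed 1-5 scale
    helpfulness: int    # assumed 1-5 scale
    honesty: int        # assumed 1-5 scale
    harmlessness: int   # assumed 1-5 scale


def three_c_three_h(s: JudgeScores) -> float:
    """Average the six dimensions after normalizing each to [0, 1]."""
    scaled = [
        s.correctness,
        s.completeness,
        (s.conciseness - 1) / 4,
        (s.helpfulness - 1) / 4,
        (s.honesty - 1) / 4,
        (s.harmlessness - 1) / 4,
    ]
    return sum(scaled) / len(scaled)


print(three_c_three_h(JudgeScores(1, 1, 4, 5, 5, 5)))  # ~0.96
```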
We believe that AraGen addresses persistent issues of data contamination through its dynamic evaluation approach, preserving the benchmark’s integrity. It also serves as the first application of a scalable, language-agnostic framework for nuanced and fair model assessment, an important step toward understanding LLM performance across diverse linguistic contexts that sets a new standard for comprehensive model benchmarking.
Summary
Evaluating large language models (LLMs) is a key challenge in AI research. While existing methodologies have improved our understanding of LLM capabilities, they often fail to comprehensively address both factuality—assessing a model’s core knowledge—and usability—its alignment with human (end user) expectations. Current evaluation approaches can broadly be categorized into knowledge or factuality-based benchmarks and preference-based benchmarks.
Knowledge-based benchmarks focus on evaluating foundational knowledge and factual correctness. For instance, initiatives like the Open LLM Leaderboard by Hugging Face assess the likelihood of the choices for a given prompt (question) and compare the most likely output with a golden reference choice. While effective in testing core knowledge, these benchmarks provide limited insight into how models perform in practical, user-facing contexts, leaving critical aspects of usability unaddressed.
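As a rough illustration of this likelihood-based protocol, the sketch below scores a multiple-choice item by comparing the model's likelihood for each candidate answer. The `log_likelihood` helper is a hypothetical placeholder for the model's summed token log-probabilities, not any particular harness's API.

```python
# Illustrative likelihood-based multiple-choice scoring, in the spirit of
# knowledge benchmarks such as those on the Open LLM Leaderboard.

def log_likelihood(model, prompt: str, continuation: str) -> float:
    """Placeholder: return the model's log P(continuation | prompt)."""
    raise NotImplementedError


def score_mcq(model, question: str, choices: list[str], gold_index: int) -> bool:
    # Score each candidate answer by its likelihood under the model...
    lls = [log_likelihood(model, question, choice) for choice in choices]
    # ...and count the item as correct if the most likely choice is the gold one.
    predicted = max(range(len(choices)), key=lambda i: lls[i])
    return predicted == gold_index
```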
In contrast, preference-based benchmarks aim to capture alignment with human or user preferences. Examples include LMSYS’s Chatbot Arena and AtlaAI’s Judge Arena, which mostly rely on subjective assessments of outputs based on style, tone, and overall utility. However, these approaches risk prioritizing stylistic alignment over factual accuracy, potentially skewing evaluations toward stylistically preferred yet less accurate responses. Additionally, crowdsourced arenas can reflect the biases of their annotators, who may lack strong voting guidelines, further impacting the consistency and reliability of evaluations.
To address these limitations, we propose a new evaluation measure that aims to combine both approaches, offering a comprehensive mechanism to evaluate language models. It assesses two key aspects of model outputs:
- Factuality: The accuracy and correctness of the model’s output, reflecting its core knowledge.
- Usability: The degree to which the model’s outputs align with human preferences, ensuring user-centric assessment.
This is done through an LLM-as-a-judge approach (see here for more on this approach), which evaluates model performance across six dimensions modeling factuality and usability. By adopting a balanced perspective, we ensure that usability does not come at the expense of factual accuracy, or vice versa.
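A minimal sketch of what such a judge call can look like is shown below. The prompt wording, the `call_judge` helper, and the JSON output format are illustrative assumptions, not AraGen's actual judge setup.

```python
# Sketch of an LLM-as-a-judge call for the six 3C3H dimensions.
import json

JUDGE_PROMPT = """You are an impartial judge. Given a question and a model's answer,
rate the answer on: correctness (0/1), completeness (0/1), and conciseness,
helpfulness, honesty, harmlessness (each 1-5). Reply with a JSON object only.

Question: {question}
Answer: {answer}
"""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM (e.g. via an API client)."""
    raise NotImplementedError


def judge_response(question: str, answer: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # e.g. {"correctness": 1, "completeness": 1, ...}
```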
AraGen: A Generative Benchmark and Leaderboard for Arabic LLMs
The AraGen Leaderboard ranks both open and proprietary models, evaluated on the AraGen Benchmark using the new 3C3H measure, which we introduce below. 3C3H provides a comprehensive framework for assessing both the factual accuracy and usability of large language models. Arabic was chosen as the first application of this framework, aligning with the mission of Inception to democratize AI for Arabic and the Global South in general, while addressing the lack of robust generative benchmarks for these languages and regions, and we hope to see extensions of this work in many other languages.
The leaderboard is dynamic, with evaluation datasets remaining private (blind testing) for three months to ensure fair and unbiased assessments. After this period, the dataset and the corresponding evaluation code will be publicly released, coinciding with the introduction of a new dataset for the next evaluation cycle, which will itself remain private for three months. This iterative process ensures that evaluations stay current and models are consistently tested on fresh, unseen data.
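As a toy illustration of this cycle logic (the 90-day constant, dates, and release rule below are assumptions for illustration, not the leaderboard's actual code):

```python
# Toy sketch of the three-month blind-testing cycle.
from datetime import date, timedelta

CYCLE_LENGTH = timedelta(days=90)  # ~three months of blind testing


def is_public(cycle_start: date, today: date) -> bool:
    """A cycle's dataset and evaluation code go public once the cycle ends."""
    return today >= cycle_start + CYCLE_LENGTH


# Example: a cycle starting on 2024-12-01 stays private until ~2025-03-01.
print(is_public(date(2024, 12, 1), date(2025, 1, 15)))  # False
print(is_public(date(2024, 12, 1), date(2025, 3, 5)))   # True
```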
We believe that this dynamic approach is both beneficial and robust, as it mitigates data leakage, encourages ongoing model improvement, and maintains the relevance of the benchmark in the rapidly evolving landscape of LLM development.
Read the detailed technical blog here:
https://huggingface.co/blog/leaderboard-3c3h-aragen