In the rapidly evolving landscape of large language models (LLMs), comprehensive and robust evaluation methodologies remain a critical challenge, particularly for low-resource languages. In this blog, we introduce AraGen, a generative-task benchmark and leaderboard for Arabic LLMs, which we hope will inspire similar work for other languages.
The AraGen leaderboard makes three key contributions:
- 3C3H Measure: The 3C3H measure is central to this framework: a holistic approach that scores a model's response across six dimensions (Correctness, Completeness, Conciseness, Helpfulness, Honesty, and Harmlessness) using an LLM-as-judge. A minimal scoring sketch follows this list.
- Dynamic Evaluations: The AraGen Leaderboard follows a dynamic evaluation strategy based on three-month blind-testing cycles: the dataset and the evaluation code remain private during each cycle, are publicly released at its end, and are replaced by a new benchmark that is in turn kept private for the next cycle.
- Arabic Evaluation Dataset: The AraGen Benchmark offers a meticulously constructed evaluation dataset for Arabic LLMs, combining single-turn and multi-turn scenarios that test model capabilities across multiple domains and tasks.
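To make the idea concrete, here is a minimal sketch of how six judge scores could be aggregated into a single 3C3H value. The scales and the simple averaging below are illustrative assumptions (binary Correctness/Completeness, 1-5 for the other dimensions); the exact definition used by AraGen is given in the technical blog linked at the end of this post.

```python
# Hypothetical aggregation of judge scores into a single 3C3H value.
# Scales and weighting here are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class JudgeScores:
    correctness: int    # assumed binary: 0 or 1
    completeness: int   # assumed binary: 0 or 1
    conciseness: int    # assumed 1-5 scale
    helpfulness: int    # assumed 1-5 scale
    honesty: int        # assumed 1-5 scale
    harmlessness: int   # assumed 1-5 scale


def three_c_three_h(s: JudgeScores) -> float:
    """Average the six dimensions after normalizing each to [0, 1]."""
    scaled = [
        s.correctness,
        s.completeness,
        (s.conciseness - 1) / 4,
        (s.helpfulness - 1) / 4,
        (s.honesty - 1) / 4,
        (s.harmlessness - 1) / 4,
    ]
    return sum(scaled) / len(scaled)


print(three_c_three_h(JudgeScores(1, 1, 4, 5, 5, 5)))  # ~0.96
```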
We believe that AraGen addresses persistent issues of data contamination through its dynamic evaluation approach, preserving the benchmark’s integrity. It also serves as the first application of a scalable, language-agnostic framework for nuanced and fair model assessment, an important step toward understanding LLM performance across diverse linguistic contexts that sets a new standard for comprehensive model benchmarking.
Summary
Evaluating large language models (LLMs) is a key challenge in AI research. While existing methodologies have improved our understanding of LLM capabilities, they often fail to comprehensively address both factuality—assessing a model’s core knowledge—and usability—its alignment with human (end user) expectations. Current evaluation approaches can broadly be categorized into knowledge or factuality-based benchmarks and preference-based benchmarks.
Knowledge-based benchmarks focus on evaluating foundational knowledge and factual correctness. For instance, initiatives like the Open LLM Leaderboard by Hugging Face assess the likelihood of the choices for a given prompt (question) and compare the most likely output with a golden reference choice. While effective in testing core knowledge, these benchmarks provide limited insight into how models perform in practical, user-facing contexts, leaving critical aspects of usability unaddressed.
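As a rough illustration of this likelihood-based protocol, the sketch below scores a multiple-choice item by comparing the model's likelihood for each candidate answer. The `log_likelihood` helper is a hypothetical placeholder for the model's summed token log-probabilities, not any particular harness's API.

```python
# Illustrative likelihood-based multiple-choice scoring, in the spirit of
# knowledge benchmarks such as those on the Open LLM Leaderboard.

def log_likelihood(model, prompt: str, continuation: str) -> float:
    """Placeholder: return the model's log P(continuation | prompt)."""
    raise NotImplementedError


def score_mcq(model, question: str, choices: list[str], gold_index: int) -> bool:
    # Score each candidate answer by its likelihood under the model...
    lls = [log_likelihood(model, question, choice) for choice in choices]
    # ...and count the item as correct if the most likely choice is the gold one.
    predicted = max(range(len(choices)), key=lambda i: lls[i])
    return predicted == gold_index
```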
In contrast, preference-based benchmarks aim to capture alignment with human or user preferences. Examples include LMSYS’s Chatbot Arena and AtlaAI’s Judge Arena, which mostly rely on subjective assessments of outputs based on style, tone, and overall utility. However, these approaches risk prioritizing stylistic alignment over factual accuracy, potentially skewing evaluations toward stylistically preferred yet less accurate responses. Additionally, crowdsourced arenas can reflect the biases of their annotators, who may lack strong voting guidelines, further impacting the consistency and reliability of evaluations.
To address these limitations, we propose a new evaluation measure that aims to combine both approaches, offering a comprehensive mechanism to evaluate language models. It assesses two key aspects of model outputs:
- Factuality: The accuracy and correctness of the model’s output, reflecting its core knowledge.
- Usability: The degree to which the model’s outputs align with human preferences, ensuring user-centric assessment.
This is done through an LLM-as-a-judge approach (see here for more on this approach), which evaluates model performance across six dimensions modeling factuality and usability. By adopting a balanced perspective, we ensure that usability does not come at the expense of factual accuracy, or vice versa.
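A minimal sketch of what such a judge call can look like is shown below. The prompt wording, the `call_judge` helper, and the JSON output format are illustrative assumptions, not AraGen's actual judge setup.

```python
# Sketch of an LLM-as-a-judge call for the six 3C3H dimensions.
import json

JUDGE_PROMPT = """You are an impartial judge. Given a question and a model's answer,
rate the answer on: correctness (0/1), completeness (0/1), and conciseness,
helpfulness, honesty, harmlessness (each 1-5). Reply with a JSON object only.

Question: {question}
Answer: {answer}
"""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM (e.g. via an API client)."""
    raise NotImplementedError


def judge_response(question: str, answer: str) -> dict:
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # e.g. {"correctness": 1, "completeness": 1, ...}
```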
AraGen: A Generative Benchmark and Leaderboard for Arabic LLMs
The AraGen Leaderboard ranks both open and proprietary models, evaluated on the AraGen Benchmark using the new 3C3H measure, which we introduce below. 3C3H provides a comprehensive framework for assessing both the factual accuracy and usability of large language models. Arabic was chosen as the first application of this framework, aligning with the mission of Inception to democratize AI for Arabic and the Global South in general, while addressing the lack of robust generative benchmarks for these languages and regions, and we hope to see extensions of this work in many other languages.
The leaderboard is dynamic, with evaluation datasets remaining private (blind testing) for three months to ensure fair and unbiased assessments. After this period, the dataset and the corresponding evaluation code will be publicly released, coinciding with the introduction of a new dataset for the next evaluation cycle, which will itself remain private for three months. This iterative process ensures that evaluations stay current and models are consistently tested on fresh, unseen data.
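As a toy illustration of this cycle logic (the 90-day constant, dates, and release rule below are assumptions for illustration, not the leaderboard's actual code):

```python
# Toy sketch of the three-month blind-testing cycle.
from datetime import date, timedelta

CYCLE_LENGTH = timedelta(days=90)  # ~three months of blind testing


def is_public(cycle_start: date, today: date) -> bool:
    """A cycle's dataset and evaluation code go public once the cycle ends."""
    return today >= cycle_start + CYCLE_LENGTH


# Example: a cycle starting on 2024-12-01 stays private until ~2025-03-01.
print(is_public(date(2024, 12, 1), date(2025, 1, 15)))  # False
print(is_public(date(2024, 12, 1), date(2025, 3, 5)))   # True
```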
We believe that this dynamic approach is both beneficial and robust, as it mitigates data leakage, encourages ongoing model improvement, and maintains the relevance of the benchmark in the rapidly evolving landscape of LLM development.
Read the detailed technical blog here:
https://huggingface.co/blog/leaderboard-3c3h-aragen