The Vision for GEM
We want to build a transparent, scalable, and inclusive benchmarking framework, aiming to measure and improve the performance of AI across diverse domains and multiple modalities, in every language.
Capturing this diversity of language is a hard problem. A lack of high-quality, diverse training data leads to poor performance from AI models: state-of-the-art models, which are trained primarily on Internet content from the world's richest economies, don't work particularly well for writers and speakers of under-resourced languages.
The GEM effort is an attempt to create transparency and healthy competition within the AI community and among its potential users, in order to highlight and improve the state of AI for every language and speaker. We also hope to prod private, government, and public technology makers to create ever-better AI models for global language understanding through the proven power of open, transparent competition. We take inspiration from other AI leaderboards such as Hugging Face's Open LLM Leaderboard. As private, public, and open-source technology makers publish ever-improving AI models each month, we hope to enable organisations to quickly benchmark new models and evaluate them on their own data, so they can make informed price, speed, and performance decisions as well.
Model Builders
People/Organisations building global AI models want to know how good their models are. With GEM's extensive evaluations, model builders can quantify the quality of their models and have a clear path to improving them.
Model Users
People/Organisations who want to use AI models can use GEM to identify which models best fit their particular use case and the language needs of their consumers. This is especially valuable for organisations working with low-resource language populations who need an AI system that understands them linguistically.
Research Groups
Research groups can use GEM to uncover significant gaps that remain unresolved in current models. Our open evaluation methods will also help the community understand better ways to conduct evaluations in general.
Curated Series of Evaluations
Working with experts such as the GEM Partners, we have created relevant and challenging prompts, ensuring that assessments go beyond accuracy to include fairness and diversity.
Transparent & Open Methodology
We detail each curator's evaluation criteria and process, allowing for clear understanding and replicability. Rankings may not be perfect, but they should be transparent: we aim to rank models by current performance and make both the outcomes and the process behind them open to scrutiny.
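To make the idea of a transparent ranking concrete, here is a minimal sketch. It assumes a simple unweighted mean over hypothetical task scores rather than GEM's actual aggregation rule; the model and task names are placeholders. The point it illustrates is that every component score feeding a model's rank is kept and reported alongside it.

```python
# A minimal sketch of a transparent ranking: every per-task score that feeds
# the final rank is kept next to it, so the outcome can be inspected and
# reproduced. Task names and the averaging rule are illustrative only.
from statistics import mean

# Hypothetical per-task scores (higher is better) for a handful of models.
scores = {
    "model-a": {"translation": 0.71, "summarisation": 0.64, "qa": 0.58},
    "model-b": {"translation": 0.66, "summarisation": 0.70, "qa": 0.61},
    "model-c": {"translation": 0.52, "summarisation": 0.55, "qa": 0.49},
}

def rank(score_table):
    """Rank models by the unweighted mean of their task scores."""
    rows = [
        {"model": name, "overall": mean(tasks.values()), **tasks}
        for name, tasks in score_table.items()
    ]
    return sorted(rows, key=lambda row: row["overall"], reverse=True)

for position, row in enumerate(rank(scores), start=1):
    # Print the overall score together with every component that produced it.
    print(position, row["model"], f"overall={row['overall']:.3f}", row)
```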
Extensible curation by design
To cater to the differing needs of different languages, modalities, and evaluation frameworks, the GEM Leaderboard is designed to be managed by multiple curators.
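As a rough illustration of this design, a curator could be modelled as a small plug-in that owns its own prompts, languages, and scoring logic behind a shared interface. The Curator class, its methods, and the example metrics below are assumptions for illustration, not GEM's actual API.

```python
# A sketch of "extensible curation": each curator contributes its own tasks
# and metrics behind a common interface, so new languages or modalities can
# be added without changing the leaderboard core. Names are hypothetical.
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

class Curator(ABC):
    """A curator owns a slice of the leaderboard: its prompts and its metrics."""

    name: str
    languages: List[str]

    @abstractmethod
    def load_prompts(self) -> List[str]:
        """Return the evaluation prompts this curator maintains."""

    @abstractmethod
    def score(self, outputs: List[str]) -> Dict[str, float]:
        """Score model outputs and return named metrics (e.g. accuracy, fairness)."""

class SwahiliNewsCurator(Curator):
    # A hypothetical curator contributing a low-resource-language task.
    name = "swahili-news"
    languages = ["sw"]

    def load_prompts(self) -> List[str]:
        return ["Fupisha habari ifuatayo: ..."]  # placeholder prompt

    def score(self, outputs: List[str]) -> Dict[str, float]:
        return {"coverage": 0.0, "fluency": 0.0}  # placeholder metrics

def evaluate(curators: List[Curator], generate: Callable[[str], str]) -> Dict[str, Dict[str, float]]:
    """Run every registered curator against a model's generate() function."""
    return {c.name: c.score([generate(p) for p in c.load_prompts()]) for c in curators}
```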
Designed as an Iterative Process
No evaluation is perfect. Acknowledging the imperfection of benchmarks, we incorporate regular updates and feedback loops to refine the evaluation process continuously.
Run it with your company's own data
The system is designed for customisation, allowing organisations to run the evaluations on their own data or to define custom evaluations tailored to their specific needs.
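As a rough sketch of what running an evaluation on your own data could look like: the CSV layout, column names, and exact-match metric below are illustrative assumptions, not GEM's actual interface, and organisations would substitute their own data and scoring rules.

```python
# A sketch of scoring a model against an organisation's own dataset, assumed
# here to be a CSV with "prompt" and "reference" columns. The file name,
# columns, and toy metric are placeholders.
import csv

def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 if the model output matches the reference exactly."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_custom_eval(csv_path: str, generate) -> float:
    """Score a model (a prompt -> text function) on a custom CSV dataset."""
    total, correct = 0, 0.0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            correct += exact_match(generate(row["prompt"]), row["reference"])
            total += 1
    return correct / total if total else 0.0

# Example usage with a stand-in "model":
# score = run_custom_eval("my_company_eval.csv", lambda prompt: "placeholder answer")
# print(f"exact-match on our data: {score:.2%}")
```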
Frequently Asked Questions