from dataclasses import dataclass
from enum import Enum


@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")

NUM_FEWSHOT = 0  # Change with your few-shot setting
# ---------------------------------------------------
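# A minimal sketch (not part of the original template) of how a Task entry is
# meant to be read: each Task pairs a task_key and metric_key from a results
# json with a display column name. The "results" layout below is an assumed
# example for illustration only, kept commented out so the module's behavior
# is unchanged.
#
#     example_results = {"results": {"anli_r1": {"acc": 0.42}}}
#     task = Tasks.task0.value
#     score = example_results["results"][task.benchmark][task.metric]  # 0.42
#     column = task.col_name                                           # "ANLI"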
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">MageBench Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
MageBench is a reasoning-oriented multimodal intelligent agent benchmark introduced in the paper ["MageBench: Bridging Large Multimodal Models to Agents"](https://arxiv.org/abs/2412.04531).
The tasks we selected meet the following criteria:
- Simple environments,
- A clear requirement for reasoning ability,
- A high level of visual involvement.

In our paper, we demonstrate that our benchmark can generalize well to other scenarios.
We hope our work can empower future research in the fields of intelligent agents, robotics, and more.
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## How it works
This platform does not run your model; it only provides a leaderboard.
You need to choose a preset that matches your results, test your model in your local environment,
and then submit the results to us for approval. Once approved, we will make your results public.

## Reproducibility
Since we are unable to reproduce submitters' results ourselves, to ensure the reliability of the results
we require all submitters to provide either a link to a paper/blog/report that includes contact information
or a link to an open-source GitHub repository that reproduces the results.
**Results that do not meet the above conditions, or that have other issues affecting fairness
(such as an incorrect setting category), will be removed.**
"""
EVALUATION_QUEUE_TEXT = """
# Instructions to submit results
- First, make sure you have read the content in the About section.
- Test your model locally and submit your results using the form below.
- Upload **one** result at a time by filling out the form and clicking "Upload One Eval"; you will then see the result in the "Uploaded results" section.
- Continue uploading until all results are in place, then click "Submit All". After the space restarts, your results will appear on the leaderboard, marked as checking.
- If your uploaded results contain errors, click "Click Upload" and re-upload all results.
- If there is an error in your submitted results, you can upload a replacement; we will use the most recently submitted results during our review.
- If there is an error in "checked" results, email us to withdraw them.

# Detailed settings
- **Score**: float, the corresponding evaluation number.
- **Name**: str, **fewer than 3 words**, an abbreviation representing your work; it can be a model name or paper keywords.
- **BaseModel**: str, the LMM backing the agent; we suggest using the unique hf model id.
- **Target-research**: (1) `Model-Eval-Online` and `Model-Eval-Global` represent the standard settings proposed in our paper; they are used to test model capability. (2) `Agent-Eval-Prompt`: any agent design that uses fixed model weights, including RAG, memory, etc. (3) `Agent-Eval-Finetune`: the model weights are changed, and the model is trained on in-domain (same-environment) data.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@article{zhang2024magebench,
  title={MageBench: Bridging Large Multimodal Models to Agents},
  author={Miaosen Zhang and Qi Dai and Yifan Yang and Jianmin Bao and Dongdong Chen and Kai Qiu and Chong Luo and Xin Geng and Baining Guo},
  journal={arXiv preprint arXiv:2412.04531},
  year={2024}
}
"""