Answer Overlap Module of QAFactEval Metric

This is the span scorer module used in the RQUGE paper to evaluate generated questions in the question generation task. The model was originally used in QAFactEval to compute the semantic similarity between a generated answer span and the reference answer, given the context and the question, in the question answering task. It outputs an answer overlap score on a 1-5 scale. The scorer is trained on the MOCHA dataset (initialized from Jia et al. (2021)), which consists of 40k crowdsourced judgments on QA model outputs.

The input to the model is defined as:

[CLS] question <q> gold answer <r> pred answer <c> context
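
As a minimal sketch of this layout, the helper below joins the four fields with the separator strings used in this card's usage script (`<q>`, `<r>`, `<c>`). The function name `build_scorer_input` is hypothetical and not part of any library. The leading `[CLS]` is not written out here, on the assumption that the tokenizer adds its own special tokens, as the usage script also relies on.

```python
def build_scorer_input(question: str, gold_answer: str,
                       pred_answer: str, context: str) -> str:
    """Join the four fields into one string for the span scorer.

    Hypothetical helper: the separators follow this card's usage script.
    No [CLS] token is added here (assumption: the tokenizer inserts
    its own special tokens when the string is encoded).
    """
    return f"{question} <q> {gold_answer} <r> {pred_answer} <c> {context}"


if __name__ == "__main__":
    print(build_scorer_input(
        question="Who wrote Hamlet?",
        gold_answer="William Shakespeare",
        pred_answer="Shakespeare",
        context="Hamlet is a tragedy written by William Shakespeare.",
    ))
```

Keeping the formatting in one place makes it harder to swap the gold and predicted answers by accident, since the model input distinguishes them only by position.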

Generation

You can use the following script to get the semantic similarity of the predicted answer given the gold answer, context, and question.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the span scorer and its tokenizer from the Hugging Face Hub
sp_scorer = AutoModelForSequenceClassification.from_pretrained('alirezamsh/quip-512-mocha')
tokenizer_sp = AutoTokenizer.from_pretrained('alirezamsh/quip-512-mocha')
sp_scorer.eval()

# Fill in your own example
pred_answer = ""
gold_answer = ""
question = ""
context = ""

# Build the model input: question <q> gold answer <r> pred answer <c> context
input_sp = f"{question} <q> {gold_answer} <r> {pred_answer} <c> {context}"

inputs = tokenizer_sp(input_sp, max_length=512, truncation=True,
                      padding="max_length", return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    outputs = sp_scorer(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# The model output is returned in the logits field
print(outputs.logits)
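
To score several examples at once, the script above could be wrapped roughly as follows. This is a sketch, not the authors' implementation: `score_texts` and `logits_to_scores` are hypothetical names, the inputs are assumed to be already formatted strings in the layout described earlier, and the code assumes the classification head emits a single value per example, which you should verify against the checkpoint's configuration.

```python
from typing import List

MODEL_NAME = "alirezamsh/quip-512-mocha"  # checkpoint named in this card


def logits_to_scores(rows: List[List[float]]) -> List[float]:
    """Take the first value of each logits row as that example's score.

    Assumption: the head has one output per example (regression-style).
    """
    return [row[0] for row in rows]


def score_texts(texts: List[str]) -> List[float]:
    """Score a batch of pre-formatted input strings (hypothetical helper)."""
    # Imported lazily so the pure helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
    model.eval()

    # padding=True pads only up to the longest sequence in the batch
    inputs = tokenizer(texts, max_length=512, truncation=True,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids=inputs["input_ids"],
                        attention_mask=inputs["attention_mask"])
    return logits_to_scores(outputs.logits.tolist())
```

Calling `score_texts([input_sp])` with the string built in the script above would return a one-element list. Loading the model inside the function keeps the sketch short; for repeated calls you would load it once and reuse it.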

Citations

@inproceedings{fabbri-etal-2022-qafacteval,
    title = "{QAF}act{E}val: Improved {QA}-Based Factual Consistency Evaluation for Summarization",
    author = "Fabbri, Alexander  and
      Wu, Chien-Sheng  and
      Liu, Wenhao  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.187",
    doi = "10.18653/v1/2022.naacl-main.187",
    pages = "2587--2601",
    abstract = "Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14{\%} average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.",
}

@inproceedings{mohammadshahi-etal-2023-rquge,
    title = "{RQUGE}: Reference-Free Metric for Evaluating Question Generation by Answering the Question",
    author = "Mohammadshahi, Alireza  and
      Scialom, Thomas  and
      Yazdani, Majid  and
      Yanki, Pouya  and
      Fan, Angela  and
      Henderson, James  and
      Saeidi, Marzieh",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.428",
    doi = "10.18653/v1/2023.findings-acl.428",
    pages = "6845--6867",
    abstract = "Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer modules, using pre-trained models from existing literature, thus it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on the reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on synthetic data generated by a question generation model and reranked by RQUGE.",
}