Table of Links

- 2 Background and 2.1 Automatic Evaluation Methods for LLMs
- 3 Design and Implementation and 3.1 Design Principles
- 3.2 FreeEval Architecture Overview and 3.3 Extensible Modular Design
- 3.5 Efficient Inference Backends
- 4 Conclusion, Ethical Considerations, and References
2 Background

In this section, we provide an overview of the current landscape of LLM evaluation methods, the challenges posed by data contamination, and the importance of meta-evaluation in assessing the reliability and validity of evaluation protocols.
2.1 Automatic Evaluation Methods for LLMs

The rapid development of Large Language Models (LLMs) has led to the emergence of various evaluation methods, each aiming to assess different aspects of model performance. These methods can be broadly categorized into three groups: classic reference-based evaluation, dataset-based benchmarks, and LLM-based evaluators.
Reference-Based Evaluation methods, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTScore (Zhang et al., 2019), assess the quality of generated text by comparing it against human-written references. While straightforward, they may not fully capture the open-ended nature of LLM-generated outputs and can be sensitive to reference quality and diversity (Wang et al., 2023c).
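To make the mechanics concrete, here is a minimal sketch of reference-based scoring using the sacrebleu and rouge-score packages; the choice of libraries and the toy sentence pair are ours, not the paper's, and real evaluations run over full test sets, ideally with multiple references per example.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

# Toy candidate/reference pair for illustration only.
candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Corpus-level BLEU: n-gram precision with a brevity penalty.
# sacrebleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-L: F-measure over the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], candidates[0])
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

BERTScore follows the same pattern but compares contextual embeddings rather than surface n-grams, which makes it more tolerant of paraphrase at the cost of a model forward pass.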
Dataset-Based Benchmarks, such as ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), and C-Eval (Huang et al., 2023), evaluate LLMs using carefully curated datasets that test specific skills or knowledge. However, they may not fully capture the open-ended nature of LLMs and can be vulnerable to data contamination (Schaeffer, 2023; Wei et al., 2023).
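Most of these benchmarks are multiple-choice, so accuracy reduces to picking the option the model finds most likely. The sketch below shows one common protocol; `score_option` is a hypothetical stand-in for a model-specific log-likelihood call, and the item schema is an assumption rather than any particular benchmark's format.

```python
from typing import Callable

# Hypothetical scorer: log-likelihood of `option` as a continuation of
# `question` under the evaluated model (not a real library call).
ScoreFn = Callable[[str, str], float]

def multiple_choice_accuracy(dataset: list[dict], score_option: ScoreFn) -> float:
    """Accuracy on MMLU/ARC-style items: predict the highest-likelihood option.

    Assumed item schema: {"question": str, "options": list[str], "answer": int}.
    """
    correct = 0
    for item in dataset:
        lls = [score_option(item["question"], opt) for opt in item["options"]]
        prediction = max(range(len(lls)), key=lls.__getitem__)  # argmax
        correct += int(prediction == item["answer"])
    return correct / len(dataset)
```

Contamination matters precisely because this protocol is so mechanical: if a question-answer pair appeared in the pretraining data, a high likelihood for the gold option reflects memorization rather than the skill the benchmark claims to measure.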
LLM-Based Evaluators leverage strong LLMs, such as GPT-4 (OpenAI, 2023), to assess the performance of other models. Examples include PandaLM (Wang et al., 2023c), MT-Bench (Zheng et al., 2023b), GPTScore (Fu et al., 2023), PRD (Li et al., 2023a), and KIEval (Yu et al., 2024). These evaluators can capture nuanced aspects of language understanding and generation, but their performance is influenced by the evaluator LLM and prompting strategies. Biases present in the evaluator LLM may propagate to the evaluation process (Zeng et al., 2023; Wang et al., 2023b), requiring careful meta-evaluation. Additionally, the inference cost of LLMs necessitates optimization for large-scale evaluation.
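As a rough illustration of pairwise LLM-as-a-judge evaluation, the snippet below queries GPT-4 through the OpenAI Python client; the judge prompt is our own simplified template, not the exact one used by PandaLM or MT-Bench, and it assumes an OPENAI_API_KEY in the environment.

```python
# pip install openai   (assumes OPENAI_API_KEY is set in the environment)
from openai import OpenAI

client = OpenAI()

# Simplified judge template; real systems use far more detailed rubrics.
JUDGE_PROMPT = """You are an impartial judge. Given an instruction and two
candidate responses, decide which response is better.

Instruction: {instruction}
Response A: {response_a}
Response B: {response_b}

Answer with exactly one of: "A", "B", or "Tie"."""

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Ask a strong LLM to compare two model outputs on one instruction."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce randomness in the verdict
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                instruction=instruction,
                response_a=response_a,
                response_b=response_b,
            ),
        }],
    )
    return completion.choices[0].message.content.strip()
```

Because judges of this kind exhibit position bias (Wang et al., 2023b), each pair is typically judged twice with the response order swapped, which also doubles the inference cost flagged above as a concern.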
:::info This paper is available on arXiv under a CC BY 4.0 DEED license.
:::
:::info Authors:
(1) Zhuohao Yu, Peking University;
(2) Chang Gao, Peking University;
(3) Wenjin Yao, Peking University;
(4) Yidong Wang, Peking University;
(5) Zhengran Zeng, Peking University;
(6) Wei Ye, Peking University and a corresponding author;
(7) Jindong Wang, Microsoft Research;
(8) Yue Zhang, Westlake University;
(9) Shikun Zhang, Peking University.
:::