Articles

LLM Evaluation: Metrics, Benchmarks & Best Practices

Large language models (LLMs) have transformed the way we research, write, and code. As their adoption widens, ensuring that LLM applications are reliable, accurate, and safe to use becomes essential. LLM evaluation helps us measure a model’s performance across reasoning, factual accuracy, fluency, and real-world tasks.

In this article, we discuss the different LLM evaluation methodologies, metrics, and benchmarks that we can use to assess LLMs for various use cases. We will also discuss the advantages, challenges, and best practices for LLM evaluation to help you decide on the best processes and metrics to evaluate LLMs.


What is LLM evaluation?

LLM evaluation is the process of systematically assessing the performance, reliability, and usefulness of an LLM across different tasks. LLM evaluation goes beyond simply checking whether the model generates grammatically correct sentences. It helps us measure a model’s performance on a specific task, ensure the safety and trustworthiness of the model outputs, and compare different models.

To evaluate LLMs, we use different methods, metrics, and benchmarks. It is important to choose the right metric and benchmark for evaluating LLM outputs during evaluation. For example, we cannot use the same metric to evaluate two different LLMs trained for summarization and paraphrasing tasks. Each task requires separate tests to understand how the model will perform in a real-world scenario.

We need to understand the different evaluation methodologies, metrics, and benchmarks to evaluate LLMs efficiently. Let’s start with the methodologies.

LLM evaluation methodologies

Effective LLM evaluation frameworks use different methodologies depending on the task and available resources. The primary LLM evaluation approaches include metric-based evaluation, human-based assessment, LLM-as-a-judge evaluation, and benchmark-based evaluation, each offering unique advantages for measuring model performance.

Metric-based evaluation

Metric-based evaluation methods consist of automated tests to evaluate LLM outputs. In these methods, we use different computational metrics to score model outputs against references.

  • We use lexical overlap metrics like BLEU, ROUGE, and METEOR for tasks like translation and summarization.
  • We use semantic similarity metrics like BERTScore and MoverScore for tasks like paraphrasing and summarization.

Metric-based evaluation is a quick and objective method to evaluate LLMs. However, it doesn’t capture how the LLMs perform in tasks that require reasoning, common sense, and natural language inference. To evaluate LLMs for these aspects, we use benchmarks.

Benchmark-based evaluation

Benchmarks are a set of tasks that we use to evaluate LLMs for different aspects, such as natural language understanding, reasoning, reading comprehension, truthfulness, etc. Benchmarks like GLUE, SuperGLUE, HellaSwag, TruthfulQA, SQuAD, and ARC help us evaluate LLMs for tasks that cannot be evaluated using metric-based methods.

Benchmarks capture capabilities, such as reasoning and comprehension, that metric-based scores miss. However, they still don’t tell us whether the LLM output is fluent, factual, and free from bias. We use human evaluation to assess LLM outputs for coherence, fluency, factuality, and bias.

Human evaluation

In human evaluation, human judges rate LLM outputs on several parameters, typically on a scale such as 1 to 5 or 0 to 1. The average score across all judges becomes the LLM’s score for a particular parameter. Some of the popular parameters we use to evaluate LLMs with human judges are as follows:

  • Fluency: Is the LLM output natural and grammatically correct?
  • Coherence: Do sentences in the output connect logically?
  • Factuality: Is the information in the output factually correct?
  • Helpfulness: Does the output answer the question fully?
  • Safety & bias: Does the LLM avoid harmful content?

Human evaluation captures aspects of LLM outputs that are difficult to measure with automated metrics and benchmarks, so it is important to include humans in the evaluation process. However, human evaluation is slow and costly. Hence, we should combine automatic metrics, human ratings, and benchmark scores for a robust LLM evaluation process.
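The averaging step described above can be sketched in a few lines. The judge names, parameters, and scores here are illustrative:

```python
from statistics import mean

# Hypothetical ratings from three judges, each scoring one LLM output
# on a 1-5 scale for a few of the parameters listed above.
ratings = {
    "judge_1": {"fluency": 5, "coherence": 4, "factuality": 3},
    "judge_2": {"fluency": 4, "coherence": 4, "factuality": 2},
    "judge_3": {"fluency": 5, "coherence": 3, "factuality": 3},
}

def aggregate(ratings):
    """Average each parameter across all judges; the mean becomes
    the LLM's score for that parameter."""
    params = next(iter(ratings.values()))
    return {p: round(mean(r[p] for r in ratings.values()), 2) for p in params}

scores = aggregate(ratings)
print(scores)  # {'fluency': 4.67, 'coherence': 3.67, 'factuality': 2.67}
```

In practice, you would also report inter-annotator agreement alongside the averages to check that the judges are rating consistently.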

LLM-as-a-judge evaluation

Instead of relying on humans, we can use LLMs like ChatGPT and Gemini to evaluate outputs from other LLMs. These models are trained on web-scale datasets and can apply reasoning when prompted with techniques like chain-of-thought prompting. Hence, we can use them as judges to evaluate other LLMs by providing them with context and reference texts. This allows us to evaluate thousands of LLM outputs quickly without large human annotation teams while still delivering reasonably reliable assessments.
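A minimal sketch of an LLM-as-a-judge setup is shown below. The template wording, rating scale, and field names are illustrative choices, not a standard; the filled-in prompt would be sent to the judge model through whatever chat API you use:

```python
# Illustrative judge prompt template; double braces escape literal JSON braces.
JUDGE_TEMPLATE = """You are an impartial judge. Given the question and a
reference answer, rate the candidate answer from 1 to 5 for factuality
and helpfulness. Reply only with JSON: {{"factuality": n, "helpfulness": n}}.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}"""

def build_judge_prompt(question, reference, candidate):
    """Fill the template with one evaluation case."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )

prompt = build_judge_prompt(
    "What is the boiling point of water at sea level?",
    "100 degrees Celsius (212 degrees Fahrenheit).",
    "Water boils at 100 °C at sea level.",
)
print(prompt)
```

Asking the judge for structured JSON output makes the ratings easy to parse and aggregate across thousands of cases.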

With a brief idea of the evaluation methodologies, let’s discuss the different LLM evaluation metrics and benchmarks.

LLM evaluation metrics

LLM evaluation metrics score an LLM’s output based on a specified criterion. These metrics help us quantify different properties of the output, such as correctness, hallucination, and semantic relevance.

BLEU

Bilingual Evaluation Understudy (BLEU) is a metric we use to evaluate the quality of a text generated by a machine. To assess an LLM’s response using BLEU, we need reference outputs for every input given to the LLM. BLEU measures how similar the LLM-generated output is to the reference text. For this, it creates sequences of words of length n (n-grams) from the output text and checks how many of the generated n-grams appear in the reference text.

BLEU scores range from 0 to 1, where 1 suggests a perfect match of the LLM output with the reference and 0 represents no overlap. BLEU focuses on precision and evaluates how much of the LLM output is correct compared to the reference. We can use BLEU to evaluate LLM outputs in summarization, translation, and question-answering tasks.
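The clipped n-gram precision at the heart of BLEU can be sketched in a few lines. This is a simplified version with a single reference, two n-gram orders, and no smoothing; production implementations such as sacreBLEU use up to 4-grams:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    times a brevity penalty (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(round(bleu(candidate, reference), 3))  # 0.707
```

Because BLEU is precision-oriented, the brevity penalty is what stops a model from gaming the score with very short outputs.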

ROUGE

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) consists of metrics for evaluating an LLM’s outputs using reference texts. ROUGE measures how much of the content from the reference text is present in the LLM output. We calculate the ROUGE score by dividing the number of common words, phrases, or n-grams between the reference and output text by the total number of words, phrases, or n-grams in the reference text.

ROUGE scores also range from 0 to 1, where 1 indicates a good output. ROUGE is recall-oriented and measures how much of the reference content is present in the LLM output. ROUGE is widely used to evaluate LLMs for text summarization tasks.
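The recall calculation described above can be sketched as follows (a minimal ROUGE-N, without the ROUGE-L longest-common-subsequence variant):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: the fraction of reference n-grams that also
    appear in the candidate output (counts clipped on both sides)."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref_counts, cand_counts = counts(reference), counts(candidate)
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "the quick brown fox jumps".split()
candidate = "a quick brown fox ran".split()
print(rouge_n_recall(candidate, reference))       # 3 of 5 unigrams → 0.6
print(rouge_n_recall(candidate, reference, n=2))  # 2 of 4 bigrams → 0.5
```

Note the asymmetry with BLEU: here the denominator is the reference length, which is why ROUGE rewards covering the reference content rather than keeping the output concise.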

METEOR

Metric for Evaluation of Translation with Explicit ORdering (METEOR) improves on BLEU and ROUGE for evaluating the quality of LLM output against reference texts. Like BLEU and ROUGE, METEOR measures exact matches of words and phrases. However, it also counts words with the same root (stemming matches) and words with similar meanings (synonym matches).

We use the following steps while evaluating the LLM output with METEOR:

  • We first calculate the precision, i.e., the ratio of the number of common words between the LLM output and the reference text to the total number of words in the LLM output.
  • Next, we calculate recall, i.e., the ratio of the number of common words between the LLM output and the reference text to the total number of words in the reference text.
  • After calculating precision and recall, we calculate the F1 Score, i.e., the harmonic mean of precision and recall.
  • Finally, we add a penalty to the F1 Score based on the difference in the order of the words in the LLM output and the reference text to get the final METEOR score.
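The steps above can be sketched as follows. This simplified version uses exact unigram matches only, and a common cube-shaped fragmentation penalty; real METEOR also matches stems and synonyms and weights recall more heavily than precision:

```python
def simple_meteor(candidate, reference):
    """Simplified METEOR sketch over exact word matches."""
    # Greedy one-to-one alignment of matching words.
    used, alignment = set(), []
    for i, word in enumerate(candidate):
        for j, ref_word in enumerate(reference):
            if ref_word == word and j not in used:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision, recall = m / len(candidate), m / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    # Fragmentation penalty: count runs of matches that are adjacent
    # in both texts; more chunks means more reordering.
    chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f1 * (1 - penalty)

reference = "the cat sat on the mat".split()
scrambled = "on the mat sat the cat".split()
print(simple_meteor(reference, reference))  # near-perfect score
print(simple_meteor(scrambled, reference))  # same words, reordered → 0.5
```

The scrambled example shows the point of the penalty: precision and recall are both perfect, yet the score drops because the word order disagrees with the reference.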

Unlike BLEU and ROUGE, METEOR understands synonyms and word variations in the LLM output and better aligns with human judgment. However, these metrics depend on word overlap and statistical analysis of the text, not semantic meaning. We can use vector embeddings of the reference text and the LLM output to measure semantic similarity between them. For this, we can use BERTScore.

BERTScore

We use BERTScore to measure the semantic similarity between the reference text and the LLM output. To use BERTScore, we first tokenize the reference text and the LLM output and generate their embeddings using models like BERT or RoBERTa. Then, we compute the cosine similarity between each candidate token embedding and each reference token embedding. These pairwise similarities feed into precision and recall:

  • Precision determines how well the words in the LLM output are covered by reference words in meaning. To calculate precision, we find the most similar token in the reference text for each token in the LLM output. Then, we take the average of the maximum similarity scores for all the tokens in the LLM output.
  • Recall determines how well the reference words are covered by the words in the LLM output in meaning. To calculate recall, we find the most similar token in the LLM output for each token in the reference text. Then, we take the average of the maximum similarity scores for all the tokens in the reference text.

After calculating precision and recall, we calculate the F1 Score by taking their harmonic mean, which is termed the BERTScore.
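The greedy matching described above can be sketched with toy 2-dimensional vectors. Real BERTScore obtains contextual embeddings from models like BERT or RoBERTa; the hand-picked vectors below only exist to make synonyms point in similar directions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def bertscore_f1(cand_emb, ref_emb):
    """BERTScore-style greedy matching: each token is paired with its
    most similar token on the other side, then the maxima are averaged."""
    precision = sum(max(cosine(c, r) for r in ref_emb)
                    for c in cand_emb) / len(cand_emb)
    recall = sum(max(cosine(c, r) for c in cand_emb)
                 for r in ref_emb) / len(ref_emb)
    return 2 * precision * recall / (precision + recall)

# Toy "embeddings": quick ≈ fast, automobile ≈ car.
emb = {"fast": [1.0, 0.1], "quick": [0.9, 0.2],
       "car": [0.1, 1.0], "automobile": [0.2, 0.9]}
reference = [emb["fast"], emb["car"]]
candidate = [emb["quick"], emb["automobile"]]
score = bertscore_f1(candidate, reference)
print(round(score, 3))  # high score despite zero word overlap
```

Notice that a lexical metric like BLEU would give this pair a score of 0, while the embedding-based score is close to 1; that is exactly the gap BERTScore closes.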

MoverScore

MoverScore evaluates LLM outputs against a reference text using text embeddings. In this technique, we first create vector embeddings for the LLM output and the reference text. Then, we use the Earth Mover’s Distance to calculate the dissimilarity between them. The greater the distance, the more semantically different the LLM output is from the reference text.

Factuality

Factuality measures how well an LLM output aligns with facts. LLM applications can hallucinate and produce incorrect outputs. Hence, it is important to evaluate the truthfulness of LLM outputs using both human evaluation and automated metrics.

  • In human evaluation, we check whether each statement in the LLM output is factually correct with respect to a reference document. However, this is costly and time-consuming.
  • We can use natural language inference (NLI) models to check if a given text in the LLM output is entailed or supported by the reference text.
  • We can also create question-answer pairs from the LLM output and use the reference text to answer the questions. If the answers match, we can establish the factuality of the LLM output.

You can use models like FactCC, SummaC, and QAGS to measure the factuality of a summary against a source text.
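The question-answer approach above boils down to comparing two short answers: one extracted from the LLM output and one from the reference document. A common comparison score is SQuAD-style token F1, sketched here (the example answers are illustrative):

```python
from collections import Counter

def answer_f1(predicted, gold):
    """Token-level F1 between two short answers. QAGS-style factuality
    checks ask the same question of the LLM output and of the reference
    document, then compare the two answers with a score like this."""
    p, g = predicted.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Answer drawn from the summary vs. answer drawn from the source.
print(answer_f1("Barack Obama", "Obama"))  # partial match, ≈ 0.667
```

Averaging this score over many generated question-answer pairs gives a rough factual-consistency signal for the whole summary.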

LLM evaluation benchmarks

LLM evaluation benchmarks are sets of tasks and datasets for evaluating LLMs. Let’s start with GLUE and then discuss other benchmarks, such as SuperGLUE, HellaSwag, TruthfulQA, SQuAD, and ARC.

GLUE

We use the General Language Understanding Evaluation (GLUE) benchmark to evaluate how well an LLM understands natural language. GLUE consists of nine datasets and tasks that cover different aspects of natural language. All the tasks are divided into three categories, i.e., single-sentence tasks, similarity and paraphrase tasks, and inference tasks.

Single-sentence tasks

Corpus of Linguistic Acceptability (CoLA) and Stanford Sentiment Treebank (SST-2) are single-sentence tasks in GLUE.

  • The CoLA dataset contains sequences of words with labels showing whether each sequence forms a grammatically acceptable sentence, which we can use to test whether an LLM can judge grammatical acceptability.
  • The SST-2 dataset contains sentences extracted from movie reviews and their sentiment labels, which we can use to evaluate LLMs for sentiment analysis tasks.

Similarity and paraphrase tasks

GLUE has Microsoft Research Paraphrase Corpus (MRPC), Quora Question Pairs (QQP), and Semantic Textual Similarity Benchmark (STS-B) datasets to evaluate LLMs for similarity and paraphrase tasks.

  • The MRPC dataset contains sentence pairs and a label to denote whether they are semantically equivalent.
  • The QQP dataset contains question pairs from Quora, with a label showing if the questions are duplicates.
  • The STS-B dataset contains sentence pairs with human-annotated similarity scores from 0 to 5, which indicate how similar the sentences are in meaning.

We can use MRPC, STS-B, and QQP to evaluate LLMs for paraphrasing and semantic understanding tasks.

Natural language inference tasks

For natural language inference, GLUE provides the Multi-Genre Natural Language Inference (MNLI), Question Answering Natural Language Inference (QNLI), Recognizing Textual Entailment (RTE), and Winograd Natural Language Inference (WNLI) datasets.

  • The MNLI dataset contains premise and hypothesis sentence pairs with a classification label showing entailment, neutrality, or contradiction between the premise and hypothesis.
  • RTE contains premise and hypothesis sentence pairs with a classification label showing entailment or no entailment between the premise and hypothesis.
  • WNLI is similar to RTE. However, it contains a complex reading comprehension task in which the LLM needs to figure out what a pronoun refers to in order to decide whether the premise and hypothesis show entailment or no entailment.
  • The QNLI dataset contains question-paragraph pairs with a label showing whether or not one of the sentences in the paragraph contains the answer to the question.

GLUE provides a robust mechanism for evaluating LLM outputs using all these datasets and tasks. Each task checks a different skill in language understanding, and an LLM’s overall score tells us how good the model really is.

With continuous improvement, models like BERT and RoBERTa efficiently solved tasks in the GLUE framework. Hence, SuperGLUE was introduced to test LLMs for deeper reasoning, commonsense, and more challenging NLP tasks that GLUE couldn’t assess.

SuperGLUE

The Super General Language Understanding Evaluation (SuperGLUE) benchmark contains eight tasks that evaluate LLMs on different parameters.

  • The Boolean Questions (BoolQ) dataset contains a paragraph, a question, and a boolean (Yes/No) answer to the question that tests the LLM for reading comprehension and factual reasoning.
  • The CommitmentBank (CB) dataset contains premise and hypothesis sentence pairs and a label showing entailment, neutrality, or contradiction.
  • Choice of Plausible Alternatives (COPA) contains a premise and two choices, where one choice is the cause or effect of the premise. It tests the LLM’s causal reasoning abilities.
  • Multi-Sentence Reading Comprehension (MultiRC) contains multi-sentence paragraphs, a question, and answer options where one or more answers can be correct. The LLM has to predict the correct and incorrect answers. MultiRC checks an LLM’s ability to comprehend information spread across multiple sentences.
  • Reading Comprehension with Commonsense Reasoning (ReCoRD) consists of passages from news articles along with a query with a missing entity. The LLM needs to find the missing entity in the query by analyzing the passage in the dataset.
  • The Word-in-Context (WiC) dataset contains two sentences with a common word, and the LLM has to identify if the word has the same meaning in both sentences.
  • The WSC (Winograd Schema Challenge) dataset contains a sentence, an entity from the sentence, a pronoun, and a label showing if the pronoun refers to the given entity. The WNLI dataset used in GLUE is a reformulated version of WSC.
  • The RTE dataset is the same Recognizing Textual Entailment dataset used in the GLUE benchmark.

SuperGLUE is better than GLUE for evaluating LLMs on comprehension, reasoning, commonsense, and multi-sentence understanding tasks, and it remains a widely used standard for evaluating LLMs.

HellaSwag

The HellaSwag benchmark was built to evaluate the commonsense reasoning abilities of LLMs. The dataset contains a sentence, a list of possible endings, and the label denoting the actual ending. HellaSwag checks if an LLM can choose the most sensible continuation of the sentence beyond grammar or word overlap.

TruthfulQA

TruthfulQA is a benchmark for evaluating if LLMs give factually correct answers or mirror common misconceptions, false beliefs, or biased statements. The TruthfulQA dataset contains a question, the best answer for the question, a list of correct answers, and a list of incorrect answers. Using the questions and the LLM output, we can evaluate the truthfulness and factuality of the answers given by the LLM.
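To show how the dataset’s fields are used, here is a crude substring-matching scorer. Real TruthfulQA scoring relies on trained judge models or human raters, and the example item below is illustrative, not a real dataset entry:

```python
def label_answer(answer, correct_answers, incorrect_answers):
    """Crude match of a model answer against TruthfulQA-style lists of
    correct and incorrect reference answers."""
    a = answer.lower()
    if any(c.lower() in a for c in correct_answers):
        return "truthful"
    if any(i.lower() in a for i in incorrect_answers):
        return "untruthful"
    return "unclear"

# Illustrative item in the style of TruthfulQA.
correct = ["Nothing happens", "You digest the gum"]
incorrect = ["It stays in your stomach for seven years"]
print(label_answer("Nothing happens; the gum passes through you.",
                   correct, incorrect))  # truthful
```

The "untruthful" list is the interesting part: it encodes popular misconceptions, so a fluent model that mirrors folklore gets caught even though its answer sounds plausible.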

SQuAD

Stanford Question Answering Dataset (SQuAD) is the benchmark to evaluate LLMs for machine reading comprehension. The SQuAD dataset contains a paragraph, a question for which the answer is present in the given paragraph, and the answer to the question. The LLM needs to answer the question based on the information provided in the paragraph. The QNLI dataset used in the GLUE benchmark is also derived from SQuAD.

ARC

AI2 Reasoning Challenge (ARC) is a multiple-choice question-answering benchmark. The ARC dataset contains questions, multiple choices for the answer, and an answer key containing the correct answer. It tests whether LLMs can apply reasoning and scientific knowledge instead of memorization and word matching.
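Scoring a multiple-choice benchmark like ARC reduces to accuracy over the answer key. Here is a minimal harness; the dataset items and the stand-in model are illustrative, not real ARC data:

```python
def accuracy(model, dataset):
    """Fraction of questions where the model's chosen option matches the
    answer key; `model` is any callable (question, choices) -> choice."""
    correct = sum(model(item["question"], item["choices"]) == item["answer"]
                  for item in dataset)
    return correct / len(dataset)

# Tiny illustrative items in ARC's shape (not real ARC questions).
dataset = [
    {"question": "Which gas do plants absorb for photosynthesis?",
     "choices": ["oxygen", "carbon dioxide"], "answer": "carbon dioxide"},
    {"question": "Which form of energy does a stretched spring store?",
     "choices": ["thermal", "elastic potential"], "answer": "elastic potential"},
]

def toy_model(question, choices):
    # Stand-in "model" that always picks the last option.
    return choices[-1]

print(accuracy(toy_model, dataset))  # both answers happen to be last → 1.0
```

Plugging in a real LLM only changes the `model` callable; the scoring loop stays the same, which is why accuracy harnesses like this generalize across multiple-choice benchmarks.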

As we have discussed different metrics and benchmarks to evaluate LLMs, let’s discuss the advantages of evaluating large language models.

Advantages of LLM evaluation

Evaluating LLMs using different metrics and benchmarks helps us understand how they perform in different tasks. The following are some of the advantages of LLM evaluation.

  • Measuring model quality: LLM evaluation helps us check whether the LLM can produce accurate and relevant outputs for a given use case. Without evaluation, we might be unable to decide whether a given model is suitable for a particular task.
  • Fair comparison between LLMs: Benchmarks like GLUE and SuperGLUE give us a standardized setup for evaluating LLMs. We can use the performance of LLMs on these benchmarks to decide if one is better than another for the given set of tasks.
  • Identifying an LLM’s strengths and weaknesses: An LLM might be good at paraphrasing, but it might not work well for question-answering tasks. LLM evaluation helps us identify an LLM’s strengths and weaknesses so that we can use the model in tasks where it performs best.
  • Progress in AI research and development: With insight into an LLM’s weaknesses, we can develop new models to overcome the shortcomings. Similarly, if an LLM becomes excellent at a given benchmark, we can build more complex tasks and benchmarks to evaluate LLMs and drive progress.
  • Reliability and safety: Different benchmarks give us an insight into an LLM’s reasoning and comprehension abilities. With benchmarks like TruthfulQA, we can assess if an LLM can generate outputs free from bias or misconceptions. Hence, LLM evaluation helps ensure reliability and safety in LLM applications.

While LLM evaluation has many benefits, it also has several challenges. Let’s discuss some of these challenges.

Challenges in LLM evaluation

We face several technical and procedural challenges while evaluating LLMs. The following are some of the challenges in evaluating large language models.

  • Benchmark saturation: LLMs have become very good at solving tasks in earlier benchmarks like GLUE and SQuAD and often outperform humans. However, it doesn’t mean that they have become truly intelligent. As the benchmark datasets get included in the training dataset of new LLMs, we need to devise new benchmarks that can objectively evaluate LLMs.
  • Narrow coverage in benchmarks: Unlike GLUE and SuperGLUE, most benchmarks evaluate models for a specific task or aspect. An LLM can excel at HellaSwag but struggle in TruthfulQA. Hence, it is important to evaluate LLMs for different benchmarks to establish their reliability in real-world applications.
  • Costly human evaluation: We need human judges to evaluate LLMs for fluency, coherence, helpfulness, and bias. Human evaluation is expensive, slow, and subjective. There is also an inherent risk of biased human judges, who can rate unsafe or biased outputs as safe. Hence, it is essential to standardize the human evaluation process.
  • Lack of real-world alignment: Benchmarks and evaluation metrics don’t capture how people use LLMs. A model that performs well on benchmarks can still give incorrect or biased outputs in real-world scenarios.

Having discussed the advantages and challenges of LLM evaluation, let’s discuss some of the best practices for evaluating LLMs that can help you maximize advantages and overcome the challenges.

LLM evaluation best practices

LLM evaluation is a complex process, and we must avoid any mistakes that can make the results unfair or misleading. The following are some of the LLM evaluation best practices you can use to evaluate LLMs in a better manner:

  • Use multiple metrics and benchmarks: An LLM can excel on a benchmark but struggle with another. Therefore, we should evaluate LLMs across different metrics and benchmarks before deciding on the quality of the model.
  • Use standard datasets and benchmarks: We shouldn’t rely on ad hoc, less reproducible methods and datasets to evaluate LLMs. Always use established benchmarks like GLUE, SuperGLUE, TruthfulQA, SQuAD, HellaSwag, etc., for LLM evaluation.
  • Check for overfitting: It is possible that an LLM memorizes a publicly available benchmark dataset or is fine-tuned for a particular benchmark. To avoid misjudgment, we should use out-of-distribution tests and adversarial examples. For example, you can paraphrase the questions in the SQuAD or QNLI dataset to check if an LLM can comprehend the question or has just memorized the answers from the training dataset.
  • Combine metric-based and human evaluation: Human and metric-based evaluations work differently. Automated tests like GLUE or SuperGLUE are fast but lack tests for fluency, helpfulness, safety, and bias. Human evaluation can help assess LLMs on these subjective criteria, even if it’s an expensive and slow method. Hence, we should combine human evaluation and automated evaluation methodologies to better assess the LLMs.
  • Include bias, fairness, and safety checks: We should explicitly test the LLMs for harmful content, stereotypes, or unsafe responses. This is important for real-world deployment, as biased and harmful responses can lead to economic and reputation loss for any given organization using the LLM.
  • Adopt continuous evaluation: We should continuously evaluate and improve LLMs using user feedback, A/B testing, and reinforcement learning so that the model performs well in real-world situations.

Conclusion

As LLMs continue to be adopted in business, education, healthcare, and other domains, rigorous evaluation is required to ensure they remain accurate, safe, and reliable. By systematically assessing an LLM’s accuracy, reasoning, safety, and real-world usefulness, we can ensure that the model serves the intended purpose responsibly.

In this article, we discussed LLM evaluation methodologies, metrics, and benchmarks. We also discussed the advantages, challenges, and best practices for LLM evaluation. To learn more about generative AI topics, you can learn how to build AI agents. You might also like this IT Automation with Generative AI skillpath that discusses AI fundamentals, SRE practices, ethical considerations, server monitoring, and automation system integration.

Frequently asked questions

1. Is a higher BLEU score better?

Yes, a higher BLEU score is better, as it indicates that the LLM output overlaps more closely with the reference text.

2. What are the most important benchmarks for LLM evaluation?

LLM evaluation benchmarks are standardized tasks and datasets to test LLMs for causal reasoning, natural language inference, and real-world usefulness. The most important LLM evaluation benchmarks within modern evaluation frameworks include GLUE and SuperGLUE for language understanding, HellaSwag for commonsense reasoning, TruthfulQA for factual accuracy, and MMLU for multitask performance.

3. What is golden dataset for LLM evaluation?

A golden dataset for LLM evaluation is a curated and labeled dataset we use as ground truth to evaluate the performance of large language models. SuperGLUE is often treated as a gold-standard benchmark for LLM evaluation, containing eight datasets for different tasks, such as reading comprehension, factual reasoning, and natural language inference.

4. What is the difference between LLM model evaluation and system evaluation?

LLM model evaluation only tests the LLM’s comprehension, reasoning, and inference capabilities. LLM system evaluation includes integration testing, load testing, security testing, and usability testing to assess the entire LLM application for latency, scalability, security, and user-friendliness in a real-world setting.

5. What is the goal of regression testing in LLM evaluations?

Regression testing checks whether an application’s old features work correctly after new changes. We use regression testing in LLM evaluations to ensure that model updates, prompt updates, or algorithm changes do not impact the LLM’s existing functionalities and performance.
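A sketch of this idea: pin a set of "golden" cases and re-run them after every model or prompt update. The `generate` callable stands for your LLM application, and the cases below are illustrative:

```python
# Golden cases: (question, substring the answer must contain).
GOLDEN_CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
]

def run_regression(generate):
    """Return the cases whose expected substring is missing from the new
    output; an empty list means no regression was detected."""
    return [(q, want) for q, want in GOLDEN_CASES if want not in generate(q)]

# Stand-in for the updated application under test.
def fake_generate(question):
    return {"What is 2 + 2?": "2 + 2 = 4",
            "What is the capital of France?": "The capital is Paris."}[question]

print(run_regression(fake_generate))  # [] → no regressions detected
```

Substring checks are deliberately loose because LLM outputs vary between runs; for stricter gating, teams often replace them with metric thresholds or an LLM-as-a-judge comparison against the previous version’s outputs.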

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.


Learn more on Codecademy

  • Learn to integrate large language models into applications using APIs, prompt engineering, and evaluation metrics for AI systems.
    • Includes 5 Courses
    • With Certificate
    • Intermediate.
      3 hours
  • Evaluate LLM skill through metrics like BLEU, ROUGE, F1, HELM, navigating accuracy, latency, cost, scalability trade-offs, addressing bias and ethical concerns.
    • Includes 27 Courses
    • With Certificate
    • Intermediate.
      10 hours
  • Learn the basics of large language models (LLMs) and text-based Generative Artificial Intelligence (AI). We’ll show you how LLMs work and how they’re used.
    • Beginner Friendly.
      1 hour