Traditional NLP Metrics

The direct use of metrics such as perplexity and BLEU has declined in popularity, largely because of their well-known flaws in many contexts. It remains important, however, to understand these metrics and know where they apply.

BLEU

Paper: BLEU: a Method for Automatic Evaluation of Machine Translation
Originally developed to evaluate machine translation. A BLEU score combines unigram, bigram, trigram, and 4-gram precision (in the original formulation, as a geometric mean of the clipped n-gram precisions, multiplied by a brevity penalty that penalizes overly short candidates).

Example 1:
candidate: the cat sat on the mat
reference: the cat is on the mat

BLEU-1 (unigram precision) = 5/6 ≈ 0.83
BLEU-2 (bigram precision) = 3/5 = 0.6
Of the five bigrams in the candidate, {the cat, cat sat, sat on, on the, the mat}, three ("the cat", "on the", "the mat") also occur in the reference, so the bigram precision is 3/5 = 0.6.

Example 2:
candidate: the the the the the
reference: the cat is on the mat

BLEU-1 (unigram precision) = 2/5 = 0.4
The count of "the" in the candidate (5) is clipped to its maximum count in the reference (2), so the unigram precision is 2/5.
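Both examples can be reproduced with a short sketch of BLEU's modified (clipped) n-gram precision. The function and variable names below are illustrative, not from the original paper, and this computes a single n-gram precision rather than the full BLEU score (no geometric mean or brevity penalty):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram count is
    clipped to its maximum count in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

# Example 1
print(clipped_precision("the cat sat on the mat", "the cat is on the mat", 1))  # ≈ 5/6
print(clipped_precision("the cat sat on the mat", "the cat is on the mat", 2))  # 3/5 = 0.6
# Example 2: "the" occurs 5 times but is clipped to 2 (its count in the reference)
print(clipped_precision("the the the the the", "the cat is on the mat", 1))     # 2/5 = 0.4
```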

Readings:
https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213

ROUGE

BLEU focuses on precision: what fraction of the words (and/or n-grams) in the candidate model output appear in the human reference.
ROUGE focuses on recall: what fraction of the words (and/or n-grams) in the human reference appear in the candidate model output.
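Using the same sentence pair as in the BLEU example, ROUGE-N recall can be sketched as below (helper names are illustrative; real ROUGE implementations typically also report precision and F-measure, and ROUGE-L uses the longest common subsequence instead of n-grams):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# The reference has 6 unigrams; 5 of them (counting "the" twice) appear in the candidate
print(rouge_n_recall("the cat sat on the mat", "the cat is on the mat", 1))  # ≈ 5/6
```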

Perplexity

Perplexity (PPL) is one of the most common metrics for evaluating language models. Intuitively, perplexity can be understood as a measure of uncertainty: it reflects how uncertain, or "perplexed," the model is when predicting the next symbol.

Most language models estimate this probability as a product of each symbol’s probability given its preceding symbols.

Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$$

where $\log p_\theta(x_i \mid x_{<i})$ is the log-likelihood of the i-th token conditioned on the preceding tokens $x_{<i}$ according to our model. This is equivalent to exponentiating the cross-entropy between the data and the model predictions. So log(perplexity) = cross-entropy, which is the training loss of a causal language model.
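As a sanity check on the definition, here is a minimal sketch that computes perplexity from per-token log-probabilities (the toy log-probabilities are made up for illustration, not from a real model):

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood.
    token_log_probs[i] = log p(x_i | x_<i), using the natural log."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)

# Toy model that assigns probability 0.25 to each of 4 tokens:
# cross-entropy = -log(0.25) = log(4), so perplexity ≈ 4
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))  # ≈ 4.0
```

A model that is always uniformly uncertain over k choices has perplexity k, which is why perplexity is often read as an "effective branching factor."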

Readings:
https://huggingface.co/docs/transformers/perplexity
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/

LLM Evaluation

Benchmarks

Traditionally we have benchmarks for different tasks in NLP.

| NLP Task | Description | Benchmark Dataset | Metrics |
|---|---|---|---|
| Sentiment Analysis | Determine sentiment (positive/negative/neutral) of text | IMDb Reviews | Accuracy, F1-score, ROC AUC |
| Named Entity Recognition (NER) | Identify and classify named entities in text | CoNLL-2003 | Precision, Recall, F1-score |
| Part-of-Speech Tagging (POS) | Assign grammatical categories to words in a sentence | Penn Treebank | Accuracy, F1-score |
| Machine Translation | Translate text from one language to another | WMT (Workshop on Machine Translation) | BLEU, METEOR, TER |
| Text Classification | Categorize text documents into predefined classes | GLUE (General Language Understanding Evaluation) | Accuracy, F1-score, Precision, Recall |
| Question Answering (QA) | Generate answers to questions posed in natural language | SQuAD (Stanford Question Answering Dataset) | Exact Match (EM), F1-score |
| Text Summarization | Generate concise summaries of longer text documents | CNN/Daily Mail | ROUGE, BLEU |
| Language Modeling | Predict the next word in a sequence of text | – | Perplexity |

For example, the GLUE benchmark is a collection of nine different NLP tasks.
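As an illustration of the QA metrics listed in the table above, here is a simplified sketch of SQuAD-style Exact Match and token-level F1. Note that this normalization is an assumption-laden simplification: the official SQuAD evaluation script additionally strips punctuation and articles, which this version does not.

```python
from collections import Counter

def normalize(text):
    # Simplified: the official SQuAD script also strips punctuation and articles
    return text.lower().split()

def exact_match(prediction, gold):
    """1 if the normalized answers are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "the Eiffel tower"))             # 1
print(round(token_f1("the Eiffel Tower in Paris", "the Eiffel Tower"), 2))  # 0.75
```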

LLM Benchmarks

Evaluating an LLM should not focus on a single traditional NLP task. The GPT-4 report, for instance, compares model performance against human-level performance on a variety of professional and academic benchmarks.

More benchmarks

Tools: Built on the Language Model Evaluation Harness, the Open LLM Leaderboard is the main benchmark for general-purpose LLMs (like ChatGPT). Other popular benchmarks include BIG-bench, MT-Bench, etc.

Beyond accuracy

LLM evaluation also expands upon the principles of the 3H rule: Helpfulness, Honesty, and Harmlessness.

Still need human evaluation

For example, the human evaluation set used for Llama 3 contains 1,800 prompts that cover 12 key use cases.

Tools

| Frameworks / Platforms | Description | Tutorials / Lessons | Reference |
|---|---|---|---|
| Azure AI Studio Evaluation (Microsoft) | Azure AI Studio is an all-in-one AI platform for building, evaluating, and deploying generative AI solutions and custom copilots. Technical landscape: no-code (model catalog in AzureML Studio & AI Studio), low-code (CLI), pro-code (azureml-metrics SDK). | Tutorials | Link |
| Prompt Flow (Microsoft) | A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, and evaluation to production, deployment, and monitoring. | Tutorials | Link |
| Weights & Biases (Weights & Biases) | A machine learning platform to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results, spot regressions, and share findings with colleagues. | Tutorials, DeepLearning.AI lesson | Link |
| LangSmith (LangChain) | Helps the user trace and evaluate language model applications and intelligent agents, to move from prototype to production. | Tutorials | Link |
| TruLens (TruEra) | TruLens provides a set of tools for developing and monitoring neural nets, including LLMs: evaluation of LLMs and LLM-based applications with TruLens-Eval, and deep learning explainability with TruLens-Explain. | Tutorials, DeepLearning.AI lesson | Link |
| Vertex AI Studio (Google) | You can evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The models are evaluated using a set of metrics against an evaluation dataset that you provide. | Tutorials | Link |
| Amazon Bedrock | Amazon Bedrock supports model evaluation jobs. Their results allow you to evaluate and compare model outputs and choose the model best suited for your downstream generative AI applications. Supported use cases include text generation, text classification, question answering, and text summarization. | Tutorials | Link |
| DeepEval (Confident AI) | An open-source LLM evaluation framework for LLM applications. | Examples | Link |
| Parea AI | Parea helps AI engineers build reliable, production-ready LLM applications, providing tools for debugging, testing, evaluating, and monitoring LLM-powered applications. | Article on evals | Link |

Readings:
https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html
https://openai.com/research/gpt-4
A Survey on Evaluation of Large Language Models
https://ai.meta.com/blog/meta-llama-3/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5