Traditional NLP Metrics
The direct use of metrics such as perplexity and BLEU score has declined in popularity, largely because of their well-known flaws in many contexts. However, it remains important to understand these metrics and know where they are still appropriate.
BLEU
Paper: BLEU: a Method for Automatic Evaluation of Machine Translation
Originally developed to evaluate machine translation quality, BLEU is based on an average of unigram, bigram, trigram, and 4-gram precision.
Example1:
candidate: the cat sat on the mat
reference: the cat is on the mat
bleu1 score = 5/6 ≈ 0.83
bleu2 score = 3/5 = 0.6
For bleu2, we check which of the five candidate bigrams, {the cat, cat sat, sat on, on the, the mat}, occur in the reference; three of them are present, so the precision is 3/5 = 0.6.
Example2:
candidate: the the the the the
reference: the cat is on the mat
bleu1 score = 2/5 = 0.4
We clip the count of "the" to its maximum number of occurrences in the reference (2), so the precision is 2/5 rather than 5/5.
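The clipped n-gram precision behind these examples can be sketched in a few lines of Python. This is a minimal illustration of the precision component only; the full BLEU score also combines the four n-gram precisions (geometric mean) with a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram count is
    clipped to its maximum count in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

reference = "the cat is on the mat"
p1 = clipped_precision("the cat sat on the mat", reference, 1)  # 5/6
p2 = clipped_precision("the cat sat on the mat", reference, 2)  # 3/5
p3 = clipped_precision("the the the the the", reference, 1)     # 2/5, "the" clipped to 2
```

Note how clipping handles the degenerate "the the the the the" candidate: without it, the unigram precision would be a perfect 5/5.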
Readings:
https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213
ROUGE
BLEU focuses on precision: how many of the words (and/or n-grams) in the candidate model output appear in the human reference.
ROUGE focuses on recall: how many of the words (and/or n-grams) in the human reference appear in the candidate model output.
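The recall-oriented n-gram overlap at the core of ROUGE-N can be sketched the same way, just normalizing by the reference counts instead of the candidate counts (a minimal sketch assuming whitespace tokenization; real ROUGE implementations also report precision and F1, and ROUGE-L uses longest common subsequence):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_recall(candidate, reference, n):
    """ROUGE-N style recall: the fraction of reference n-grams
    that also appear in the candidate (counts clipped)."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Same sentence pair as the BLEU examples: 5 of the 6 reference
# unigram tokens appear in the candidate ("is" does not).
rouge1 = ngram_recall("the cat sat on the mat", "the cat is on the mat", 1)  # 5/6
```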
Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. Intuitively, perplexity can be understood as a measure of uncertainty: the perplexity of a language model reflects how "perplexed" it is when predicting the next symbol.
Most language models estimate this probability as a product of each symbol’s probability given its preceding symbols.
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \ldots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$$

where $\log p_\theta(x_i \mid x_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$ according to our model. This is equivalent to the exponentiation of the cross-entropy between the data and the model predictions, so log(perplexity) = cross-entropy, which is the training loss of a causal language model.
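Given per-token log-probabilities from a model, the definition reduces to a one-liner: exponentiate the mean negative log-likelihood. A minimal sketch (the probabilities below are made-up values for illustration, standing in for a model's $p_\theta(x_i \mid x_{<i})$):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood
    over the tokens of a sequence."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probabilities log p(x_i | x_<i).
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]
ppl = perplexity(log_probs)  # geometric mean of 1/0.25, 1/0.5, 1/0.125 = 4.0
```

Equivalently, perplexity is the geometric mean of the inverse per-token probabilities, which is why a model that is always certain (probability 1 on every token) has perplexity 1.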
Readings:
https://huggingface.co/docs/transformers/perplexity
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
LLM Evaluation
Benchmarks
Traditionally we have benchmarks for different tasks in NLP.
| NLP Tasks | Description | Benchmark Dataset | Metrics |
|---|---|---|---|
| Sentiment Analysis | Determine sentiment (positive/negative/neutral) of text | IMDb Reviews | Accuracy, F1-score, ROC AUC |
| Named Entity Recognition (NER) | Identify and classify named entities in text | CoNLL-2003 | Precision, Recall, F1-score |
| Part-of-Speech Tagging (POS) | Assign grammatical categories to words in a sentence | Penn Treebank | Accuracy, F1-score |
| Machine Translation | Translate text from one language to another | WMT (Workshop on Machine Translation) | BLEU score, METEOR, TER |
| Text Classification | Categorize text documents into predefined classes | GLUE (General Language Understanding Evaluation) | Accuracy, F1-score, Precision, Recall |
| Question Answering (QA) | Generate answers to questions posed in natural language | SQuAD (Stanford Question Answering Dataset) | Exact Match (EM), F1-score |
| Text Summarization | Generate concise summaries of longer text documents | CNN/Daily Mail | ROUGE score, BLEU score |
| Language Modeling | Predict the next word in a sequence of text | - | Perplexity |
For example, the GLUE benchmark is a collection of 9 different NLP tasks.
LLM Benchmarks
Evaluating an LLM should not focus on a single traditional NLP task. For example, the GPT-4 report compares the model against human-level performance on a variety of professional and academic benchmarks.
More benchmarks
Tools: The Open LLM Leaderboard, built on the Language Model Evaluation Harness, is the main benchmark for general-purpose LLMs (like ChatGPT). Other popular benchmarks include BIG-bench, MT-Bench, etc.
Beyond accuracy
Evaluation should also expand upon the principles of the 3H rule (Helpfulness, Honesty, and Harmlessness), not just accuracy.
Human evaluation is still needed. For example, the human evaluation set used for Llama 3 contains 1,800 prompts that cover 12 key use cases.
Tools
| Frameworks / Platforms | Description | Tutorials/lessons | Reference |
|---|---|---|---|
| Azure AI Studio Evaluation (Microsoft) | Azure AI Studio is an all-in-one AI platform for building, evaluating, and deploying generative AI solutions and custom copilots. Technical landscape: no code (model catalog in AzureML studio & AI studio), low-code (as CLI), pro-code (as azureml-metrics SDK) | Tutorials | Link |
| Prompt Flow (Microsoft) | A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, and evaluation to production, deployment, and monitoring. | Tutorials | Link |
| Weights & Biases | A machine learning platform to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues. | Tutorials, DeepLearning.AI Lesson | Link |
| LangSmith (LangChain) | Helps the user trace and evaluate language model applications and intelligent agents to help user move from prototype to production. | Tutorials | Link |
| TruLens (TruEra) | TruLens provides a set of tools for developing and monitoring neural nets, including LLMs. This includes both tools for the evaluation of LLMs and LLM-based applications with TruLens-Eval and deep learning explainability with TruLens-Explain. | Tutorials, DeepLearning.AI Lesson | Link |
| Vertex AI Studio (Google) | You can evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The models are evaluated using a set of metrics against an evaluation dataset that you provide. | Tutorials | Link |
| Amazon Bedrock | Amazon Bedrock supports model evaluation jobs. The results of a model evaluation job allow you to evaluate and compare a model's outputs, and then choose the model best suited for your downstream generative AI applications. Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization. | Tutorials | Link |
| DeepEval (Confident AI) | An open-source LLM evaluation framework for LLM applications. | Examples | Link |
| Parea AI | Parea helps AI Engineers build reliable, production-ready LLM applications. Parea provides tools for debugging, testing, evaluating, and monitoring LLM-powered applications. | Article on evals | Link |
Readings:
https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html
https://openai.com/research/gpt-4
A Survey on Evaluation of Large Language Models
https://ai.meta.com/blog/meta-llama-3/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5
