Traditional NLP Metrics
The direct use of metrics such as perplexity and BLEU score has declined in popularity, largely because of their well-known flaws in many contexts. However, it remains important to understand these metrics and know where they are still appropriate.
BLEU
Paper: BLEU: a Method for Automatic Evaluation of Machine Translation
Originally developed to evaluate machine translation quality, BLEU is based on an average of unigram, bigram, trigram, and 4-gram precision.
Example1:
candidate: the cat sat on the mat
reference: the cat is on the mat
bleu1 score = 5/6 ≈ 0.83
bleu2 score = 3/5 = 0.6
For bleu2, we check which of the five candidate bigrams, {the cat, cat sat, sat on, on the, the mat}, occur in the reference; three of them are present, so the precision is 3/5 = 0.6.
Example2:
candidate: the the the the the
reference: the cat is on the mat
bleu1 score = 2/5 = 0.4
We clip the count of "the" to its maximum number of occurrences in the reference (2), so the precision is 2/5 rather than 5/5.
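The clipped n-gram precision behind these examples can be sketched in a few lines of Python. This is a minimal illustration of the precision component only; the full BLEU score also combines the four n-gram precisions (geometric mean) with a brevity penalty.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram count is
    clipped to its maximum count in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / sum(cand.values())

reference = "the cat is on the mat"
p1 = clipped_precision("the cat sat on the mat", reference, 1)  # 5/6
p2 = clipped_precision("the cat sat on the mat", reference, 2)  # 3/5
p3 = clipped_precision("the the the the the", reference, 1)     # 2/5, "the" clipped to 2
```

Note how clipping handles the degenerate "the the the the the" candidate: without it, the unigram precision would be a perfect 5/5.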
Readings:
https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213
ROUGE
BLEU focuses on precision: how many of the words (and/or n-grams) in the candidate model output appear in the human reference.
ROUGE focuses on recall: how many of the words (and/or n-grams) in the human reference appear in the candidate model output.
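The recall-oriented n-gram overlap at the core of ROUGE-N can be sketched the same way, just normalizing by the reference counts instead of the candidate counts (a minimal sketch assuming whitespace tokenization; real ROUGE implementations also report precision and F1, and ROUGE-L uses longest common subsequence):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_recall(candidate, reference, n):
    """ROUGE-N style recall: the fraction of reference n-grams
    that also appear in the candidate (counts clipped)."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Same sentence pair as the BLEU examples: 5 of the 6 reference
# unigram tokens appear in the candidate ("is" does not).
rouge1 = ngram_recall("the cat sat on the mat", "the cat is on the mat", 1)  # 5/6
```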
Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. Intuitively, perplexity can be understood as a measure of uncertainty: the perplexity of a language model reflects how "perplexed" it is when predicting the next symbol.
Most language models estimate this probability as a product of each symbol’s probability given its preceding symbols.
Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X = (x_0, x_1, \ldots, x_t)$, then the perplexity of $X$ is

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)$$

where $\log p_\theta(x_i \mid x_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$ according to our model. This is equivalent to the exponentiation of the cross-entropy between the data and the model predictions, so log(perplexity) = cross-entropy, which is the training loss of a causal language model.
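Given per-token log-probabilities from a model, the definition reduces to a one-liner: exponentiate the mean negative log-likelihood. A minimal sketch (the probabilities below are made-up values for illustration, standing in for a model's $p_\theta(x_i \mid x_{<i})$):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood
    over the tokens of a sequence."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical per-token log-probabilities log p(x_i | x_<i).
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]
ppl = perplexity(log_probs)  # geometric mean of 1/0.25, 1/0.5, 1/0.125 = 4.0
```

Equivalently, perplexity is the geometric mean of the inverse per-token probabilities, which is why a model that is always certain (probability 1 on every token) has perplexity 1.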
Readings:
https://huggingface.co/docs/transformers/perplexity
https://thegradient.pub/understanding-evaluation-metrics-for-language-models/
LLM Evaluation
Benchmarks
Traditionally we have benchmarks for different tasks in NLP.
| NLP Tasks | Description | Benchmark Dataset | Metrics |
|---|---|---|---|
| Sentiment Analysis | Determine sentiment (positive/negative/neutral) of text | IMDb Reviews | Accuracy, F1-score, ROC AUC |
| Named Entity Recognition (NER) | Identify and classify named entities in text | CoNLL-2003 | Precision, Recall, F1-score |
| Part-of-Speech Tagging (POS) | Assign grammatical categories to words in a sentence | Penn Treebank | Accuracy, F1-score |
| Machine Translation | Translate text from one language to another | WMT (Workshop on Machine Translation) | BLEU score, METEOR, TER |
| Text Classification | Categorize text documents into predefined classes | GLUE (General Language Understanding Evaluation) | Accuracy, F1-score, Precision, Recall |
| Question Answering (QA) | Generate answers to questions posed in natural language | SQuAD (Stanford Question Answering Dataset) | Exact Match (EM), F1-score |
| Text Summarization | Generate concise summaries of longer text documents | CNN/Daily Mail | ROUGE score, BLEU score |
| Language Modeling | Predict the next word in a sequence of text | - | Perplexity |
For example, the GLUE benchmark is a collection of 9 different NLP tasks.
LLM Benchmarks
Evaluating an LLM should not focus on a single traditional NLP task. For example, the GPT-4 report compares the model against human-level performance on a variety of professional and academic benchmarks.
More benchmarks
Tools: The Open LLM Leaderboard, built on the Language Model Evaluation Harness, is the main benchmark for general-purpose LLMs (like ChatGPT). Other popular benchmarks include BIG-bench, MT-Bench, etc.
Beyond accuracy
Evaluation should also expand upon the principles of the 3H rule (Helpfulness, Honesty, and Harmlessness), not just accuracy.
Human evaluation is still needed. For example, the human evaluation set used for Llama 3 contains 1,800 prompts that cover 12 key use cases.
Tools
| Frameworks / Platforms | Description | Tutorials/lessons | Reference |
|---|---|---|---|
| Azure AI Studio Evaluation (Microsoft) | Azure AI Studio is an all-in-one AI platform for building, evaluating, and deploying generative AI solutions and custom copilots. Technical landscape: no code (model catalog in AzureML studio & AI studio), low-code (as CLI), pro-code (as azureml-metrics SDK) | Tutorials | Link |
| Prompt Flow (Microsoft) | A suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, and evaluation to production, deployment, and monitoring. | Tutorials | Link |
| Weights & Biases | A machine learning platform to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues. | Tutorials, DeepLearning.AI Lesson | Link |
| LangSmith (LangChain) | Helps the user trace and evaluate language model applications and intelligent agents to help user move from prototype to production. | Tutorials | Link |
| TruLens (TruEra) | TruLens provides a set of tools for developing and monitoring neural nets, including LLMs. This includes both tools for the evaluation of LLMs and LLM-based applications with TruLens-Eval and deep learning explainability with TruLens-Explain. | Tutorials, DeepLearning.AI Lesson | Link |
| Vertex AI Studio (Google) | You can evaluate the performance of foundation models and your tuned generative AI models on Vertex AI. The models are evaluated using a set of metrics against an evaluation dataset that you provide. | Tutorials | Link |
| Amazon Bedrock | Amazon Bedrock supports model evaluation jobs. The results of a model evaluation job allow you to evaluate and compare a model's outputs, and then choose the model best suited for your downstream generative AI applications. Model evaluation jobs support common use cases for large language models (LLMs) such as text generation, text classification, question answering, and text summarization. | Tutorials | Link |
| DeepEval (Confident AI) | An open-source LLM evaluation framework for LLM applications. | Examples | Link |
| Parea AI | Parea helps AI Engineers build reliable, production-ready LLM applications. Parea provides tools for debugging, testing, evaluating, and monitoring LLM-powered applications. | Article on evals | Link |
Readings:
https://slds-lmu.github.io/seminar_nlp_ss20/resources-and-benchmarks-for-nlp.html
https://openai.com/research/gpt-4
A Survey on Evaluation of Large Language Models
https://ai.meta.com/blog/meta-llama-3/
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5
