RedBlock Reflections
Discover the latest trends in AI driven cybersecurity.
PARROT: How We Used Game Show Trivia to Build an LLM Benchmark
We have been experimenting with evaluating LLMs more effectively. The standard metrics, like accuracy or F1 scores, are good for some things but don’t tell us how well models reason through tough questions—especially those involving ambiguity or context. So, we decided to take a different approach: we turned to game shows.
The Problem with Current LLM Benchmarks
Most benchmarks (frameworks to assess the performance of an LLM) out there focus on whether a model can get the right answer to the question posed, but that’s just one piece of the puzzle. What happens when the model has to navigate complex or ambiguous situations or a differently phrased question? How does it adapt when things get harder? Those were the questions we wanted to answer, and existing benchmarks weren’t cutting it.
We needed something mimicking real-world decision-making and reasoning, not just fact recall. That’s when we thought of using game show trivia, which challenges both factual knowledge and the ability to reason through increasingly difficult questions.
What We Built: PARROT
Performance Assessment of Reasoning and Responses On Trivia (PARROT) is a combination of two datasets:
- PARROT-Millionaire – The key here is that the questions get progressively harder, and we weight them by how many contestants historically got them wrong. It’s a way of simulating real-world pressure, where a tough decision has a bigger impact.
- PARROT-Jeopardy – This subset is a whole different beast. It’s less about simple facts and more about reasoning through clues. We used the show’s dollar values to weight each clue, so tougher clues (with higher dollar values) count for more in the model’s final score.
Why Game Shows?
Game shows provide an ideal platform for testing LLMs, and here's why: They begin with relatively simple questions, gradually increasing the difficulty as the stakes rise. This structure mimics real-world decision-making, where tasks can start easy but become more complex under pressure. Game shows let us evaluate how LLMs handle a broad spectrum of questions, from straightforward to highly complex, all in real time [1][2].
Converting game show data into a useful benchmark for LLMs was a challenging process. We had to sift through a vast amount of historical game show data, sourced from dedicated fan websites, to understand how contestant performance changed as questions became tougher. By analyzing where contestants tended to drop off, we were able to weight each question by its difficulty. Harder questions, which knocked out many contestants, were assigned more points. This system allows us to evaluate LLMs in a way that mirrors the increasing complexity and stakes of real-world challenges.
What’s truly interesting about our approach is that it doesn’t just measure an LLM’s ability to answer fact-based questions—it tests how well models can navigate increasing pressure and complexity, just like contestants on a game show.
It’s the blend of increasing difficulty and real-time pressure that makes game shows an ideal framework for testing LLMs—just one of the many feathers in PARROT’s cap.
PARROT’s Features and Feathers
PARROT captures a broad range of question types and difficulty levels, offering a well-rounded assessment of an LLM's abilities [3][4]. Let’s look at its features, a.k.a. feathers, which make PARROT work as a benchmark:
The Millionaire Metric: Real-World Decision-Making Under Pressure
The PARROT-Millionaire Metric assigns weights to questions based on their difficulty. Harder questions carry more weight, simulating the impact of high-pressure decision-making [5].
Using past game show data, we identified how many contestants were eliminated at each question level—the more difficult the question, the higher its weight and contribution to the overall score.
This approach ensures that LLMs are rewarded for solving tougher challenges, similar to real-life scenarios where success in difficult tasks holds more value. The weights were standardized across all seasons, ensuring fairness. By applying these weights to a model's accuracy, we generate a score that prioritizes solving complex problems, not just easy ones.
The Jeopardy Metric: Navigating Ambiguity and Complexity
The PARROT-Jeopardy Metric assesses how well models handle ambiguity and complexity, assigning higher weights to clues with higher dollar values.
For example, a $1,000 clue is harder than a $200 clue. We incorporated this concept into the PARROT-Jeopardy Metric by weighting clues according to their value. This ensures that models that solve harder questions receive a proportionally higher score. This metric captures a model's ability to handle complexity, going beyond basic, fact-based answers [5].
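To make the dollar-value weighting concrete, here is a minimal sketch of one simple reading of it. It assumes each clue's weight is its dollar value divided by the total value of all clues; the function name `dollar_weighted_score` and the graded answers are hypothetical, and the actual PARROT-Jeopardy weighting may differ in detail.

```python
def dollar_weighted_score(results):
    """Score a list of (dollar_value, is_correct) pairs, one per clue.

    Assumption: a clue's weight is its dollar value divided by the total
    dollar value of all clues, so the score always lies between 0 and 1.
    """
    results = list(results)
    total_value = sum(value for value, _ in results)
    if total_value <= 0:
        raise ValueError("need at least one clue with a positive value")
    earned = sum(value for value, correct in results if correct)
    return earned / total_value


# Hypothetical graded answers: the model misses only the $1,000 clue.
graded = [(200, True), (400, True), (600, True), (800, True), (1000, False)]

plain_accuracy = sum(correct for _, correct in graded) / len(graded)
weighted = dollar_weighted_score(graded)

print(f"plain accuracy: {plain_accuracy:.2f}")
print(f"dollar-weighted score: {weighted:.2f}")
```

In this toy case the model answers four of five clues correctly, but missing the most valuable clue costs it a third of the available weight, so the weighted score falls below plain accuracy.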
The PARROT Score combines the results from both metrics, offering a composite view of a model's ability to handle both straightforward and complex tasks:
Model | PARROT | MMLU | IFeval | BBH | GPQA |
---|---|---|---|---|---|
GPT-4o | 0.81 | 0.88 | 0.90 | UNK | 0.53 |
Claude-3.5-Sonnet | 0.80 | 0.81 | 0.92 | 0.93 | 0.59 |
Gemini-flash-1.5 | 0.67 | 0.79 | UNK | 0.88 | 0.59 |
Mistral (7.2B) | 0.65 | 0.71 | 0.54 | 0.25 | 0.28 |
Gemma-2 (9B) | 0.62 | 0.69 | 0.74 | 0.42 | 0.36 |
Llama-3.1 (8B) | 0.56 | 0.68 | 0.78 | 0.30 | 0.27 |
Llama-3 (8B) | 0.53 | 0.68 | 0.47 | 0.26 | 0.29 |
Qwen-2 (7.6B) | 0.53 | 0.69 | 0.56 | 0.37 | 0.31 |
Gemma (7B) | 0.45 | 0.64 | 0.38 | 0.12 | 0.29 |
Phi-3 (3.8B) | 0.33 | 0.69 | 0.54 | 0.37 | 0.34 |
Phi-3.5-Mini (3.8B) | 0.19 | 0.69 | 0.57 | 0.36 | 0.33 |
Did you notice something? Models that look similar on MMLU can diverge sharply on PARROT: Phi-3.5-Mini scores 0.69 on MMLU but only 0.19 on PARROT, and even the strongest model in the table, GPT-4o, reaches 0.81. These results suggest that models still have a long way to go when it comes to reasoning under pressure, especially in scenarios where answers have consequences. Accuracy alone isn’t enough; models need to demonstrate adaptability and complex reasoning too, and PARROT is our attempt to address this gap.
What’s Next for PARROT?
We’re releasing PARROT to the community and believe it’s an important step forward in LLM evaluation. We’d love to see how other models perform and where you as part of the community can take this idea. We’re also curious to hear your thoughts on how we can continue improving ways to evaluate an LLM.
Bibliography
- Jeopardy! In Encyclopaedia Britannica. Retrieved August 13, 2024, from https://www.britannica.com/topic/Jeopardy-American-television-game-show.
- "Who Wants to Be a Millionaire (American game show)." In Wikipedia: The Free Encyclopedia. Retrieved August 13, 2024, from https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show).
- Redblock AI Team. "How good are LLMs at Trivia Game Shows?" Redblock AI, 2024. https://www.redblock.ai/blog/genai-on-gameshows
- Redblock AI Team. "Introducing PARROT–A New LLM Benchmark using Television Trivia Datasets" Redblock AI, 2024. https://www.redblock.ai/blog/parrot-benchmark.
- Redblock AI Team. "The PARROT Benchmark: How LLMs stack up" Redblock AI, 2024. https://www.redblock.ai/blog/parrot-benchmarking-evaluating-llms.
The PARROT Benchmark: How LLMs stack up
Introduction
Large Language Models (LLMs) have become pivotal in driving advancements across various fields, from customer service to data analysis. As their influence grows, so does the need to ensure these models perform optimally across different tasks. We’ve discussed this at length; if you’re new to LLM benchmarking, you might find it helpful to start here.
Why Benchmarking Matters in LLM Development
Benchmarking provides a structured way to assess and compare LLMs, allowing us to understand their strengths, weaknesses, and suitability for specific applications. At Redblock, we've taken this further by introducing a new dataset that pushes the limits of LLM benchmarking, particularly in question-answering (QA) tasks [1].
We believe benchmarking LLMs should be about more than just assessing accuracy; it should also show how well a model adapts to different contexts and responds to unforeseen challenges. Our benchmarking framework, Performance Assessment of Reasoning and Responses On Trivia (PARROT), focuses on several key metrics beyond simple accuracy, which are covered later in this blog. These metrics provide a multi-faceted view of how LLMs operate under various conditions, offering deeper insights into their real-world applicability.
Why do we need a new benchmark?
Existing benchmarks often fall short of providing a comprehensive evaluation of LLMs for several reasons:
- Over-Reliance on Simple Metrics: Many benchmarks heavily rely on a single metric like accuracy or F1-score. While these metrics are useful, they can lead to models that are finely tuned for specific types of errors or tasks but may not generalize well across different applications [2][3].
- Surface-Level Evaluation: Traditional benchmarks evaluate models on tasks that may not fully challenge their reasoning abilities or adaptability. Such benchmarks often over-represent simpler tasks, allowing models to score well without truly testing their depth of understanding [2][3][4].
- Lack of Contextual Understanding: Many existing benchmarks do not adequately test contextual understanding, focusing instead on tasks that require straightforward, fact-based answers. This gap means that models might perform well in controlled benchmark environments but struggle in more dynamic, real-world situations [3][4].
- Few-Shot Learning Bias: Many reported benchmark results rely on few-shot learning, which can give a false sense of a model’s capabilities. While few-shot learning allows models to adapt to new tasks with minimal examples, it doesn’t necessarily reflect their inherent understanding or reasoning skills. This can lead to overestimating a model’s performance in zero-shot or real-world scenarios where such guidance isn’t available [5].
PARROT: A New Dimension in QA-Benchmarking
At Redblock, we’ve gone beyond traditional metrics like Exact Match and F1 Score to evaluate LLM performance in unique contexts. PARROT includes two metrics, the Millionaire Metric and the Jeopardy Metric, both tailored to capture the intricacies of question-answering (QA) in formats inspired by popular television game shows. The aim was to create a benchmark that not only measures accuracy but also accounts for question difficulty in a way that reflects real-world challenges.
The Millionaire Metric focuses on scenarios where questions are progressively more difficult and require precise, often fact-based answers, simulating environments where each decision carries significant weight. On the other hand, the Jeopardy Metric assesses how well models can handle questions that require deeper reasoning, contextual understanding, and the ability to navigate ambiguity, reflecting situations where flexibility and the ability to infer from incomplete information are of utmost importance.
Designing the Millionaire Metric
The Millionaire Metric was inspired by the format of "Who Wants to Be a Millionaire?" where contestants answer questions with increasing difficulty [6]. The core idea behind this metric is that not all questions should be weighted equally; answering a more difficult question correctly should contribute more to the overall score than answering an easier one. Here’s how we designed this metric:
- Question Difficulty and Elimination Ratio
- We categorized questions into 15 levels, each corresponding to the difficulty typically seen in the game show. Questions at higher levels are more challenging, so an LLM’s performance at these levels should be given more weight.
- To quantify this, we calculated an elimination ratio for each question level, which reflects how many contestants were eliminated at each game stage.
- This ratio helps determine the weight of each level in the final score.
- Weight Coefficients
- We standardized these elimination ratios across all seasons of the show to develop a weight coefficient for each question level.
- This coefficient ensures that the contribution of a correct answer to the final score is proportional to the difficulty of the question.
- Performance Calculation
- The final Millionaire Metric score is calculated by multiplying the weight coefficient of each level by the LLM’s accuracy at that level. The sum of these products across all 15 levels gives the overall performance score.
- This approach allows us to measure how well an LLM performs under increasing pressure, mimicking the real-world stakes of decision-making under uncertainty.
LLM | Millionaire Score |
---|---|
GPT-4o | 0.75 |
Claude-3.5-Sonnet | 0.75 |
Mistral (7.2B) | 0.67 |
Gemini-flash-1.5 | 0.61 |
Gemma-2 (9B) | 0.57 |
Qwen-2 (7.6B) | 0.55 |
Llama-3.1 (8B) | 0.54 |
Llama-3 (8B) | 0.50 |
Gemma (7B) | 0.42 |
Phi-3 (3.8B) | 0.30 |
Phi-3.5-mini (3.8B) | 0.27 |
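The three design steps above (elimination ratio, weight coefficient, weighted sum of per-level accuracy) can be sketched as follows. This is an illustration under stated assumptions, not the exact PARROT computation: it assumes the elimination ratio at a level is the number of contestants eliminated divided by the number who attempted that level, and that "standardizing" means normalizing the ratios so they sum to 1. The contestant counts and model accuracies are invented.

```python
def weight_coefficients(eliminated, reached):
    """Convert per-level contestant counts into weights that sum to 1.

    eliminated[i]: contestants who answered level i incorrectly.
    reached[i]:    contestants who attempted level i.
    Assumption: elimination ratio = eliminated / reached, then normalized.
    """
    if len(eliminated) != len(reached):
        raise ValueError("both lists need one entry per level")
    ratios = [e / r if r else 0.0 for e, r in zip(eliminated, reached)]
    total = sum(ratios)
    if total == 0:
        raise ValueError("no eliminations recorded")
    return [ratio / total for ratio in ratios]


def millionaire_score(weights, accuracy_per_level):
    """Sum of (weight coefficient x model accuracy) over all levels."""
    if len(weights) != len(accuracy_per_level):
        raise ValueError("need one accuracy value per level")
    return sum(w * a for w, a in zip(weights, accuracy_per_level))


# Hypothetical counts pooled across seasons (invented for illustration).
eliminated = [10, 15, 20, 30, 40, 60, 70, 80, 90, 100, 90, 80, 60, 40, 20]
reached = [1000]
for out in eliminated[:-1]:
    reached.append(reached[-1] - out)  # ignores walk-aways for simplicity

weights = weight_coefficients(eliminated, reached)

# Hypothetical per-level accuracy of a model that fades on harder levels.
accuracy = [1.0] * 5 + [0.8] * 5 + [0.5] * 5
score = millionaire_score(weights, accuracy)
print(f"Millionaire score: {score:.2f}")
```

Because the weights sum to 1, a model that answers every level perfectly scores 1.0, and mistakes on levels that eliminated a larger share of contestants cost more than mistakes on the early levels.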
Designing the Jeopardy Metric
The Jeopardy Metric was developed to assess how well LLMs can handle questions that require deeper reasoning and the ability to manage ambiguity. Jeopardy’s unique format, where clues are presented with varying monetary values, served as the foundation for this metric. Here’s the design process:
- Difficulty Levels Based on Game Rounds
- Jeopardy is divided into three rounds: Jeopardy, Double Jeopardy, and Final Jeopardy [7].
- Each round has clues with different difficulty levels, reflected by their monetary value.
- In our metric, we associated these levels with specific coordinates on the game board and their respective rounds, creating a structured difficulty gradient from 0 (easiest) to 11 (most difficult).
- Handling Non-Elimination
- Unlike the Millionaire show, Jeopardy does not eliminate contestants during the game, which means difficulty must be assessed differently.
- We used the difficulty level of each clue and the associated category to gauge how challenging a question is, independent of contestant elimination.
- Performance at Each Level
- Similar to the Millionaire Metric, the Jeopardy Metric evaluates LLM performance by calculating accuracy at each difficulty level.
- However, the weighting is more nuanced, reflecting the varying difficulty within a single game round.
- The final Jeopardy Metric score is the sum of the weighted accuracies across all levels, providing a comprehensive view of the LLM’s ability to handle complex and ambiguous questions.
Model | Jeopardy Score |
---|---|
GPT-4o | 0.86 |
Claude-3.5-Sonnet | 0.85 |
Gemini-flash-1.5 | 0.73 |
Gemma-2 (9B) | 0.66 |
Mistral (7.2B) | 0.63 |
Llama-3.1 (8B) | 0.58 |
Llama-3 (8B) | 0.56 |
Qwen-2 (7.6B) | 0.50 |
Gemma (7B) | 0.47 |
Phi-3 (3.8B) | 0.37 |
Phi-3.5-mini (3.8B) | 0.11 |
The PARROT Score is a composite metric that reflects an LLM's performance across two subsets: PARROT-Jeopardy and PARROT-Millionaire. These non-overlapping datasets are uniquely designed to align with different types of QA tasks in Natural Language Processing (NLP). The mean performance of an LLM over these distinct subsets serves as the PARROT Score, providing a holistic view of the model's QA capabilities.
Model | PARROT | MMLU | IFeval | BBH | GPQA |
---|---|---|---|---|---|
GPT-4o | 0.81 | 0.88 | 0.90 | UNK | 0.53 |
Claude-3.5-Sonnet | 0.80 | 0.81 | 0.92 | 0.93 | 0.59 |
Gemini-flash-1.5 | 0.67 | 0.79 | UNK | 0.88 | 0.59 |
Mistral (7.2B) | 0.65 | 0.71 | 0.54 | 0.25 | 0.28 |
Gemma-2 (9B) | 0.62 | 0.69 | 0.74 | 0.42 | 0.36 |
Llama-3.1 (8B) | 0.56 | 0.68 | 0.78 | 0.30 | 0.27 |
Llama-3 (8B) | 0.53 | 0.68 | 0.47 | 0.26 | 0.29 |
Qwen-2 (7.6B) | 0.53 | 0.69 | 0.56 | 0.37 | 0.31 |
Gemma (7B) | 0.45 | 0.64 | 0.38 | 0.12 | 0.29 |
Phi-3 (3.8B) | 0.33 | 0.69 | 0.54 | 0.37 | 0.34 |
Phi-3.5-Mini (3.8B) | 0.19 | 0.69 | 0.57 | 0.36 | 0.33 |
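Since the PARROT Score is described as the mean over the two subsets, it can be checked against the subset tables with a few lines of Python. The sketch below uses the rounded Millionaire and Jeopardy scores published above, so a recomputed mean can differ from the published PARROT column in the second decimal place (for example, Phi-3 averages to 0.335 from the rounded inputs, while the table lists 0.33). The function name `parrot_score` is ours.

```python
def parrot_score(millionaire, jeopardy):
    """Mean of the two subset scores, as described in the text."""
    return (millionaire + jeopardy) / 2


# (Millionaire Score, Jeopardy Score) taken from the tables above.
subset_scores = {
    "GPT-4o": (0.75, 0.86),
    "Claude-3.5-Sonnet": (0.75, 0.85),
    "Gemini-flash-1.5": (0.61, 0.73),
    "Mistral (7.2B)": (0.67, 0.63),
    "Phi-3.5-mini (3.8B)": (0.27, 0.11),
}

for model, (millionaire, jeopardy) in subset_scores.items():
    print(f"{model}: {parrot_score(millionaire, jeopardy):.3f}")
```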
How is PARROT different?
PARROT sets itself apart by not only testing LLMs on direct, factual questions but also on those with implicit challenges embedded in the format of how a question is posed. Many existing benchmarks focus on explicit knowledge retrieval, where the answer is a direct fact, but trivia questions often involve implicit reasoning, such as understanding wordplay, interpreting subtle hints, or drawing connections between seemingly unrelated concepts. For instance, in PARROT-Jeopardy, questions are often framed in reverse, requiring the model to interpret the clue and supply the correct question. Similarly, in PARROT-Millionaire, questions can involve choosing between closely similar options [8]. PARROT thus tests an LLM's ability to handle more complex and less structured queries, setting it apart from traditional benchmarks.
1. Novelty in Curation of Weights and Coefficients
PARROT introduces a unique approach by assigning weights based on question difficulty, unlike most benchmarks that treat all tasks equally.
- Millionaire Metric: Uses elimination ratios to weight questions, ensuring tougher questions contribute more to the final score.
- Jeopardy Metric: Reflects difficulty through monetary values associated with game rounds, requiring models to navigate complex reasoning tasks.
This weighted scoring offers a more realistic evaluation of a model’s ability to handle real-world challenges. It also reinforces the point that answers have consequences in the real world, which PARROT is designed to capture.
2. Size and Scope of the PARROT Dataset
With roughly 84,000 samples, PARROT provides a deep and varied evaluation, reducing the risk of score inflation seen in smaller benchmarks.
3. Rigorous Zero-Shot Evaluation
PARROT evaluates models in a zero-shot context, unlike many benchmarks that use few-shot learning. This approach ensures models are tested on their inherent reasoning skills, providing a purer measure of their true abilities without prior guidance.
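The difference between zero-shot and few-shot evaluation comes down to what is placed in the prompt. The sketch below is a hypothetical prompt builder for a multiple-choice question: with no worked examples it produces a zero-shot prompt, and passing examples turns it into a few-shot one. The prompt wording, function names, and sample questions are assumptions made for illustration, not PARROT's actual prompts.

```python
def format_question(question, options):
    """Render a question with lettered options (A, B, C, D)."""
    lines = [f"Question: {question}"]
    for letter, option in zip("ABCD", options):
        lines.append(f"{letter}. {option}")
    return "\n".join(lines)


def build_prompt(question, options, examples=()):
    """Build a prompt; an empty `examples` gives a zero-shot prompt.

    Each example is a (question, options, answer_letter) tuple that is
    shown to the model with its answer before the real question.
    """
    parts = []
    for ex_question, ex_options, ex_answer in examples:
        parts.append(format_question(ex_question, ex_options))
        parts.append(f"Answer: {ex_answer}\n")
    parts.append(format_question(question, options))
    parts.append("Answer:")
    return "\n".join(parts)


# Invented sample questions, not taken from the dataset.
target = ("Which planet is known as the Red Planet?",
          ["Venus", "Mars", "Jupiter", "Saturn"])
worked_example = ("What is the capital of France?",
                  ["Paris", "Rome", "Madrid", "Berlin"], "A")

zero_shot = build_prompt(*target)
few_shot = build_prompt(*target, examples=[worked_example])
print(zero_shot)
```

In the zero-shot case the model sees only the question it must answer, which is the setting described above.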
In automation and other high-stakes environments, the ability to reason and handle ambiguity is not just an advantage; it's a necessity. PARROT challenges models with the kind of complex, context-dependent questions that they are likely to encounter in real-world applications, making it a more accurate and reliable benchmark for evaluating an LLM's true capabilities.
Contributing to the Community
“We believe in the power of community-driven innovation. We’re excited to share our PARROT benchmark and other metrics with the wider AI community, inviting researchers and developers to collaborate with us in pushing the boundaries of LLM evaluation.
As we continue to refine our benchmarks and develop new ones, we’re committed to staying at the forefront of AI evaluation. Our goal is to create tools and frameworks that not only assess LLM performance but also drive the next generation of AI models. We look forward to seeing how these benchmarks are adopted and adapted by the community in the years to come.”
- Redblock AI Team.
Bibliography
- Redblock AI Team. "PARROT: Performance Assessment of Reasoning and Responses on Trivia." Redblock, 2024. https://huggingface.co/datasets/redblock/parrot
- Beeson, L. "LLM Benchmarks." GitHub, 2023. https://github.com/leobeeson/llm_benchmarks
- McIntosh, T. R., et al. (2024). Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. arXiv. https://arxiv.org/abs/2402.09880
- Gema, A. P., et al. (2024). Are We Done with MMLU? arXiv. https://arxiv.org/abs/2406.04127
- Anthropic. Claude 3.5 Sonnet: A new chapter in AI development. https://www.anthropic.com/news/claude-3-5-sonnet
- Wikipedia. "Who Wants to Be a Millionaire (American game show)." Wikipedia, The Free Encyclopedia, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show)
- Encyclopaedia Britannica. "Jeopardy! (American television game show)." Encyclopaedia Britannica, 2024. https://www.britannica.com/topic/Jeopardy-American-television-game-show
- RedBlock AI. "Introducing PARROT–A New LLM Benchmark using Television Trivia Datasets." RedBlock AI, https://www.redblock.ai/blog/parrot-benchmark
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.
Introducing PARROT–A New LLM Benchmark using Television Trivia Datasets
Why a new LLM Benchmark?
Evaluating the performance of a Large Language Model (LLM) is crucial for understanding its ability to answer questions accurately and effectively. Ideally, this evaluation would involve domain-specific human experts manually judging the generated outputs. However, the cost of hiring such experts is often prohibitive, especially for small businesses that need to benchmark LLMs for their specific use cases. In our previous blog posts, we discussed various aspects of LLM evaluation, including methods to quantify performance, available datasets for benchmarking, and the potential of using game show data to test an LLM's knowledge. To address these challenges, we developed PARROT using data from popular game shows.
Introducing PARROT: Performance Assessment of Reasoning and Responses on Trivia
PARROT is a novel dataset designed to evaluate the question-answering capabilities of LLMs in a more challenging and human-centric context. Unlike traditional datasets that focus solely on accuracy, PARROT leverages the dynamic and interactive nature of trivia game shows to test LLMs' abilities to:
- Understand and respond to complex questions.
- Categorize knowledge and synthesize coherent answers.
- Exhibit common-sense reasoning and real-world knowledge.
- Handle uncertainty and ambiguity in a way that mimics human behavior during trivia.
Purpose of PARROT
PARROT tests LLMs' ability to mimic human behavior in several key areas:
- Conversational fluency: Tests their ability to understand and respond to context-dependent questions, a skill essential for human-like communication.
- Common sense reasoning: Many questions require LLMs to apply common-sense knowledge and reasoning to arrive at a correct answer. This tests their ability to understand and interpret information in a way that is consistent with human intuition.
- Real-world knowledge: PARROT's questions cover a wide range of topics, from historical events to popular culture. This tests LLMs' ability to access and apply relevant real-world knowledge, a crucial skill for effective communication.
- Handling uncertainty: Some questions are designed to be ambiguous or have multiple correct interpretations. This tests LLMs' ability to handle uncertainty and provide informative responses even when the information is incomplete or contradictory.
PARROT: A Dual-Dataset Approach
PARROT comprises two distinct datasets:
- PARROT-Jeopardy: A dataset comprising questions from the show Jeopardy, with short, concise questions that test topic breadth, reasoning, and ambiguity handling.
- PARROT-Millionaire: A dataset comprising questions from the show Who Wants to Be a Millionaire. Known for its straightforward questions and broad range of topics, it is a valuable dataset for evaluating an LLM's knowledge.
By combining these two datasets, PARROT offers a more comprehensive evaluation of LLMs’ question-answering capabilities, testing their ability to handle different question styles and sentence structures and providing insights into their strengths and weaknesses across various domains.
Why These Two Game shows for PARROT?
I spoke to Indus, Founder and CEO of Redblock, about this choice of game shows for our dataset:
“I have been a quizzing nerd since my middle school days. When the Millionaire show launched in India, I qualified to be a contestant in the show's first season hosted by the legendary actor Amitabh Bachchan.
And I have never missed a Jeopardy episode for the last many years!
When considering this problem of benchmarking the LLMs, I thought, why not use the game show data? They span time, the clues are cryptic, and have questions around multiple categories–and the best part is that the answers are already available.”
Curating the Millionaire Dataset
Initially, we searched for an existing Millionaire game show dataset to work on this idea. Given the show’s global reach and decades-long run [1], one would assume there would be a wealth of data out there, but that wasn’t the case.
So, we took matters into our own hands at Redblock and curated the PARROT-Millionaire dataset by scraping and organizing data from the Millionaire Fandom site [2]. We will write a separate blog on our steps to scrape and curate this dataset.
The PARROT-Millionaire dataset has 22,698 samples (questions and answers). Each sample was scraped from the fandom page and includes features such as question type, difficulty level, and answer choices.
Field Name | Description |
---|---|
question_info | The price value and the current question number. |
question | The text of the question. |
options | The four answer options for the question. |
correct_answer | The correct answer to the question. |
price | An engineered feature derived from question_info that gives the dollar value of the question. |
normalized_options | An engineered feature that gives the text-normalized form of the options. |
normalized_correct_opt | An engineered feature that gives the normalized text of the correct_answer. |
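As an illustration of how the engineered `normalized_options` and `normalized_correct_opt` fields might be produced and used for grading, here is a minimal sketch. The normalization rules (lowercasing, removing punctuation, collapsing whitespace) are an assumption, since the exact rules are not described here, and the sample record is invented rather than taken from the dataset.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace.

    These rules are assumed for illustration; the dataset's actual
    normalization may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical record using the field names from the table above.
record = {
    "question": "Which planet is known as the Red Planet?",
    "options": ["Venus", "Mars", "Jupiter", "Saturn"],
    "correct_answer": "Mars",
}
record["normalized_options"] = [normalize(o) for o in record["options"]]
record["normalized_correct_opt"] = normalize(record["correct_answer"])


def is_correct(model_answer, rec):
    """Compare a model's answer with the normalized correct option."""
    return normalize(model_answer) == rec["normalized_correct_opt"]


print(is_correct("  MARS. ", record))
print(is_correct("Venus", record))
```

Comparing normalized strings keeps differences in case, spacing, or trailing punctuation from being counted as wrong answers.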
Curating Jeopardy
Jeopardy, created by Merv Griffin, has been a staple of American television since it first aired in 1964. With over 8,000 episodes across 40 seasons, curating a dataset from all seasons would be impractical due to the sheer volume of data and the complexity of processing such a large set [3]. Although Jeopardy datasets are available on open-source platforms like Kaggle, GitHub, and Hugging Face, at Redblock, we've curated a version tailored to meet our specific requirements.
We selected seven key seasons—1, 2, 12, 13, 19, 33, and 34—to ensure a representative sample across the show's timeline. We scraped data from the J! Archive, a fan-created archive containing over 500,000 clues [4], to create the PARROT-Jeopardy dataset.
The PARROT-Jeopardy dataset features clues as questions, allowing us to effectively gauge the reasoning and ambiguity-handling ability of an LLM. In contrast, PARROT-Millionaire focuses on straightforward questions, so together the two subsets assess how well an LLM answers questions that are structured in different ways. PARROT-Jeopardy comprises a total of 61,462 samples and includes features such as category, clue format, and difficulty level [5].
Field Name | Description |
---|---|
ep_num | Episode number from the season. |
air_date | The date on which the episode was aired. |
extra_info | Additional information related to the episode, including the host's name. |
round_name | The round being played (e.g., Jeopardy, Double Jeopardy, Final Jeopardy). |
coord | The coordinates of the clues on the game board. |
category | The category to which the clue belongs. |
value | The monetary value of the clue. |
daily_double | A boolean variable indicating whether the clue is a Daily Double. |
question | The corresponding clue within the category. |
answer | The labeled answer or guess. |
correct_attempts | The count of contestants who answered correctly. |
wrong_attempts | The count of contestants who answered incorrectly. |
PARROT Benchmarking
Now that we’ve introduced the brand-new datasets, we have some interesting observations coming up! We evaluated some of the best LLMs against these datasets with our unique metric. This metric is built into the benchmark, making PARROT more than just a dataset: it is a framework to gauge an LLM’s ability to answer questions accurately and effectively. Keep an eye out for our next blog, where we’ll share how these LLMs perform in handling trivia questions to win a prize.
Bibliography
- Wikipedia, "Who Wants to Be a Millionaire (American game show)." In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show)
- "Jackie Edmiston." Millionaire Wiki, Fandom, August 13, 2024. https://millionaire.fandom.com/wiki/Jackie_Edmiston
- Encyclopaedia Britannica. Jeopardy! (American television game show). In Encyclopaedia Britannica. Retrieved August 13, 2024, from https://www.britannica.com/topic/Jeopardy-American-television-game-show
- J! Archive. Retrieved August 14, 2024, from https://j-archive.com/
- Redblock AI Team. (2024). PARROT: Performance Assessment of Reasoning and Responses on Trivia. Redblock. https://huggingface.co/datasets/redblock/parrot
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.
How good are LLMs at Trivia Game Shows?
Introduction
At Redblock, we are exploring how large language models (LLMs) can handle trivia questions. To do this, we are focusing on evaluating and benchmarking large language models. If you're unfamiliar with these concepts or haven't read our previous article on various benchmarking techniques used to assess an LLM’s performance, you might find it helpful to start here. We're trying to find where LLMs might be weak at answering questions, particularly by using data from two well-known U.S. television game shows: Jeopardy and Who Wants to Be a Millionaire.
Using game show data for Question & Answering
Game shows have a wide range of fact-checked questions on a wide range of topics. This makes them a valuable source of human-labeled data, essential for effective benchmarking and ensuring that evaluations are grounded in real-world knowledge.
Let us examine two game shows to understand why they're well suited for testing AI's question-answering capabilities.
Jeopardy
Jeopardy is an American television game show created by Merv Griffin [1]. Unlike traditional quiz competitions, here, the typical question-and-answer format is reversed. Contestants are given general knowledge clues related to a specific category. They must identify the person, place, thing, or idea described by the clue, responding as a question.
More importantly, each clue is associated with a monetary value that a contestant wins if they guess the answer correctly.
Consider the following example:
Category: Regions and States
Clue: This state is known as the Sunshine State
Contestant’s Answer: What is Florida?
The game consists of three standard rounds (excluding the specials), namely:
- Jeopardy
- Double Jeopardy
- Final Jeopardy
The Jeopardy and Double Jeopardy rounds each feature a game board comprising six categories, with five clues per category. The clues are valued by dollar amounts from lowest to highest, ostensibly increasing in difficulty. In the Final Jeopardy round, contestants are given the category upfront and write down their wager amount; the clue is then revealed, and they write down their guess on a board. A correct guess adds the wagered amount to their score, while an incorrect guess subtracts the amount from their score.
The novelty of using Jeopardy for QA lies in its potential to evaluate the reasoning ability of an LLM. The monetary value of a clue often correlates with the complexity or ambiguity of the answer—the higher the value, the more challenging or misleading the clue might be. It will be fascinating to see how an LLM navigates these scenarios and what outputs it generates. Redblock is currently exploring this in a novel way.
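Because Jeopardy responses are phrased as questions ("What is Florida?"), an automated grader has to strip that framing before comparing a model's response with the labeled answer. The sketch below shows one simple way to do this; the regular expression, function names, and exact-match comparison are assumptions made for illustration, and a production grader would likely need fuzzier matching.

```python
import re

# Matches leading frames such as "What is", "Who are", "Where was".
_QUESTION_FRAME = re.compile(
    r"^\s*(what|who|where|when)\s+(is|are|was|were)\s+", re.IGNORECASE
)


def extract_answer(response):
    """Strip the 'What is ...?' frame and surrounding punctuation."""
    core = _QUESTION_FRAME.sub("", response)
    return core.strip().rstrip("?.!").strip().lower()


def matches(response, labeled_answer):
    """Exact match after removing the question frame (assumed rule)."""
    return extract_answer(response) == labeled_answer.strip().lower()


# The example clue from above: "This state is known as the Sunshine State".
print(matches("What is Florida?", "Florida"))
print(matches("What is Texas?", "Florida"))
```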
Who Wants to be a Millionaire?
Who Wants to Be a Millionaire is a widely popular game show with various adaptations across the globe. The show first aired in Britain in 1998 and was later adapted in the United States and other countries [2]. For our research, we are evaluating the performance of LLMs using questions from the United States syndicated version of this game show.
If you’re unfamiliar with this show, here’s a quick and simple breakdown of the game structure. The game consists of 15 multiple-choice questions, starting with a $100 question and culminating in a $1,000,000 question for the 15th. A contestant is eliminated if they answer any question incorrectly during their session.
We use questions from Who Wants to Be a Millionaire to assess LLM performance with multiple-choice questions across various fields [3], where difficulty increases as the game progresses. This method contrasts with the Jeopardy format discussed earlier and offers a valuable comparison of how an LLM performs when it is given a vague clue versus a straightforward question.
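The elimination rule suggests one simple way to summarize a model's run: count how many questions it clears before its first miss. The sketch below assumes each answer has already been graded as correct or incorrect; the function name and the sample run are illustrative only.

```python
def questions_cleared(answers_correct: list[bool]) -> int:
    """Count questions answered correctly before the first incorrect answer."""
    cleared = 0
    for is_correct in answers_correct:
        if not is_correct:
            break  # the run ends at the first wrong answer
        cleared += 1
    return cleared


# A hypothetical run in which the model misses the fourth question.
run = [True, True, True, False, True]
print(questions_cleared(run))  # 3
```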
Advantages of Employing Questions from These Game Shows
- Data Quality: A validated test set is essential for LLMs to ensure that their performance is accurately measured and compared to other models, preventing inflated benchmarks and promoting fair evaluation. Since these questions are sourced from a game show, they are generally reliable, allowing us to use them without concern about the authenticity of the question-answer pairs.
- Variety: These questions cover a wide range of topics, from music and literature to history and science (photosynthesis, for example), so they test knowledge across many domains rather than just math.
- Quantity: These shows have been on the air for several decades, amassing a wealth of knowledge that is ripe for use in LLM testing. The sheer volume of curated questions offers a rich resource for evaluating LLM performance.
Disadvantages of Using Game Show Questions for LLM QA Evaluation
- Dataset Curation: A significant challenge in using data from game shows is finding a reliable dataset. Even though one might expect to find plenty on platforms like Kaggle or Google Dataset Search, our initial search didn't turn up much. And even the few datasets we did find often lacked clear information about where they came from, making it hard to trust their accuracy.
- Domain Bias: While game shows cover a broad range of topics, they may still reflect a bias toward certain domains that are more popular or easily accessible information.
- Outdated Information: Questions can become outdated, especially in contexts like trivia games or quizzes where knowledge is time-sensitive. For instance, a question about the artist with the most awards in 2014 could have a different answer in 2024, reflecting a shift in the relevance of such data over time.
Redblock is actively working to address these gaps by curating valuable datasets and planning to release open-source versions to the community in the coming weeks. We are committed to advancing LLM evaluation, providing meaningful insights, tackling the challenges of testing LLMs on their knowledge, and addressing the shortcomings in the current set of benchmarks.
Bibliography
- Wikipedia, "Jeopardy!" In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Jeopardy!#Gameplay
- Wikipedia, "Who Wants to Be a Millionaire (American game show)." In Wikipedia, The Free Encyclopedia. Retrieved August 13, 2024. https://en.wikipedia.org/wiki/Who_Wants_to_Be_a_Millionaire_(American_game_show)
- "Jackie Edmiston." Millionaire Wiki, Fandom, August 13, 2024. https://millionaire.fandom.com/wiki/Jackie_Edmiston
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.
Introduction to LLM Evaluation and Benchmarking
Large Language Models: the talk of the town
Unless someone has been disconnected from the technology world, they are likely aware of the significance of Generative AI in today’s landscape. With the emergence of new opportunities and unfolding questions about the applications of Large Language Models (LLMs), AI remains a central topic of discussion. Large corporations are increasingly adopting this technological marvel of the past decade to enhance their business practices in customer service, data analysis, predictive modeling, and decision-making processes.
With that in mind, how do these LLMs work? Are they difficult to incorporate? And do they genuinely make a significant impact in their applications? The answers are not a simple yes or no. The effectiveness of an LLM depends on leveraging its strengths, as not every released model excels in every domain. Their capabilities and limitations are directly tied to the data on which they have been trained.
One might wonder, wouldn’t it be easier if we could test or supervise the model while it is learning? Ideally, that’s exactly what we would want. However, it’s impractical to manually fact-check the vast amount of information being scanned, parsed, and processed by an LLM when billions of documents are involved. These models are self-learning agents that do not require human supervision to verify the quality of the information they’re picking up; instead, they recognize patterns and learn from them each time they encounter new information.
Several other ways exist to assess the quality of an LLM's responses. In this post, we explore its performance in question answering, one of this technology's primary applications.
Before we dive into the details of standard practice, it is important to get familiar with these definitions:
- Metric: A metric is a standard way to measure something. For example, height can be measured in inches or centimeters, depending on your location.
- Benchmarking: Benchmarking involves comparing a candidate LLM against a set of other LLMs using a standard metric on a well-defined task to assess its performance. For example, we can compare how a list of LLMs performs on text translation using a well-established machine-translation metric such as BLEU [1].
How are the LLM outputs judged?
The concept is straightforward: to assess the correctness of a response, we use a dataset of questions paired with labeled answers (often referred to as 'gold answers') that covers specific domain knowledge. The LLM is tested by asking it questions from this set; the candidate answers it generates are then validated.
The newly generated response for each question is compared to the labeled correct answer using a metric to assess its correctness. While it would be ideal for experts with domain-specific knowledge to evaluate the output from an LLM, the cost and time required for human judgment make this approach impractical for widespread use. Fortunately, thanks to the efforts of various academic and AI institutions, less precise but more cost-effective evaluation methods have been developed, which we will explore in later sections of this blog.
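The simplest form of this comparison is an exact-match check between the generated response and the gold answer. The sketch below is one possible version, not a reference implementation. It assumes a light normalization step (lowercasing, stripping punctuation and English articles) that many QA evaluations apply in some form. All function names are illustrative.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def exact_match_rate(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


print(exact_match("The Florida.", "florida"))      # True
print(exact_match("What is Florida?", "Florida"))  # False
```

The second call shows how strict this metric is: a correct answer phrased as a question still counts as a miss, which is one reason softer overlap-based metrics are often reported alongside it.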
Evaluation can sometimes be subjective. At Redblock, we believe its outcome can depend on who performs it. Consider this: what if we used an AI to judge or rate the quality of its own outputs? This approach, known as 'LLM-as-a-judge' evaluation, is exposed to the well-known problem of hallucination. In AI, hallucination refers to the phenomenon where the model confidently provides an incorrect answer. Even when challenged, the AI might still consider its false answer valid.
As discussed earlier, human evaluation is the preferred method for assessing an LLM's performance. In this approach, the LLM generates a set of possible outputs for a question within a specific domain. A human judge, typically an expert in that domain, selects the answer that appears closest to the correct one and sounds the most natural. The model's weights (the parameters used to generate its output) are then adjusted so that it produces similar answers, thereby improving the quality of its text generation.
More importantly, companies that develop large-scale LLMs often prioritize metrics in which their models outperform direct competitors in the market. This focus has led to the creation of sophisticated datasets which allow us to evaluate LLMs across a broad range of topics, extending beyond the standard metrics mentioned earlier. The goal is to ensure that the evaluation process remains fair and challenging, as increasing the difficulty of questions is one of the most effective ways to identify an LLM's weaknesses.
On the other hand, evaluation also depends on the specific objectives of the tasks that an LLM is designed to address. LLMs have various capabilities; they can translate and mask text, generate summaries, tell stories, and answer questions based on the knowledge they have acquired during training. These abilities are measured differently, as shown in the table below. However, the overarching purpose of evaluation remains consistent: to assess the model’s strengths and weaknesses, particularly its factual reasoning across domains; to identify any biases it may have developed during training; and to compare its performance with that of other LLMs to determine which is more accurate and human-like in its reasoning, without the errors a human might make. Let’s take a look at these metrics:
Category | Metric | Description |
---|---|---|
Text Translation | BLEU | Compares how similar the translated text is to reference translations by checking for common phrases. |
Text Translation | METEOR | Checks translation accuracy by matching words, considering synonyms and word forms. |
Text Translation | TER/Minimum Edit Distance | Counts the number of changes needed to make the translation match the reference. |
Text Masking | Accuracy | Measures how often the model correctly fills in missing words. |
Text Masking | Perplexity | Shows how confidently the model predicts missing words; lower values are better. |
Text Classification | Accuracy | Determines the percentage of correct predictions overall. |
Text Classification | Precision | Determines the percentage of correct positive predictions out of all positive predictions made. |
Text Classification | Recall | Determines the percentage of correct positive predictions out of all actual positives. |
Text Classification | F1 Score | Harmonic mean of precision and recall. |
Text Classification | AUC-ROC | Measures how well the model separates different classes. |
Text Generation | Perplexity | Indicates how well the model predicts the next word; lower values are better. |
Text Generation | ROUGE | Compares the generated text to reference text by looking for similar phrases. |
Text Generation | Human Evaluation | Involves experts rating the quality of text generated by an LLM. |
Question & Answering | Exact Match (EM) | Percentage of answers that are exactly the same as the correct answer. |
Question & Answering | F1 Score | Measures overlap between predicted and correct answers, considering both precision and recall. |
Question & Answering | Mean Reciprocal Rank (MRR) | Averages the reciprocal of the rank of the first correct answer in a list of predictions. |
Question & Answering | BLEU | Compares generated answers to reference answers using phrase similarity. |
These are a few metrics that can be used to evaluate the responses of an LLM within various types of applications.
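To make two of the question-answering metrics concrete, here is a minimal sketch of a token-overlap F1 score and Mean Reciprocal Rank. It assumes whitespace tokenization and case-insensitive matching. Real evaluation scripts usually add more normalization, and the function names here are illustrative.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection: how many tokens the two answers share.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked: list[list[str]], golds: list[str]) -> float:
    """Average of 1/rank of the first correct answer; 0 when none is correct."""
    if not golds:
        return 0.0
    total = 0.0
    for candidates, gold in zip(ranked, golds):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate.lower() == gold.lower():
                total += 1.0 / rank
                break
    return total / len(golds)


print(round(token_f1("the sunshine state florida", "florida"), 2))  # 0.4
print(mean_reciprocal_rank([["Texas", "Florida"], ["Ohio"]], ["Florida", "Ohio"]))  # 0.75
```

In the first call the prediction contains the gold answer but adds three extra tokens, so recall is 1.0 while precision is 0.25, giving an F1 of 0.4.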
What are some of the existing methods to evaluate LLMs?
Question and Answering is a crucial and widely applied task for LLMs. It involves decision-making, logical reasoning for problem-solving, information retrieval from documents or larger text corpora, and fact-checking. To evaluate these essential attributes, we use various datasets that allow us to benchmark the performance of LLMs.
- Popular datasets used for LLM evaluation:
The following datasets are curated from various sources and cover a wide variety of questions. Here are some of the most widely used datasets for benchmarking [2]:
Dataset | Description | Purpose | Relevance |
---|---|---|---|
Massive Multitask Language Understanding (MMLU) | Measures general knowledge across 57 subjects, ranging from STEM to social sciences. | To assess the LLM's understanding and reasoning in various subject areas. | Ideal for multifaceted AI systems that require extensive world knowledge and problem-solving ability. |
AI2 Reasoning Challenge (ARC) | Tests LLMs on grade-school science questions that require deep general knowledge and reasoning abilities. | To evaluate the ability to answer complex science questions that require logical reasoning. | Useful for educational AI applications, automated tutoring systems, and general knowledge assessments. |
General Language Understanding Evaluation (GLUE) | A collection of various language tasks from multiple datasets designed to measure overall language understanding. | To provide a comprehensive assessment of language understanding abilities in different contexts. | Crucial for applications requiring advanced language processing, such as chatbots and content analysis. |
HellaSwag | Tests natural language inference by requiring LLMs to complete passages in a way that requires understanding intricate details. | To evaluate the model's ability to generate contextually appropriate text continuations. | Useful in content creation, dialogue systems, and applications requiring advanced text generation capabilities. |
TriviaQA | A reading comprehension test that pairs trivia questions with evidence documents from sources like Wikipedia and demands contextual analysis. | To assess the ability to sift through context and find accurate answers in complex texts. | Suitable for AI systems in knowledge extraction, research, and detailed content analysis. |
GSM8K | A set of 8.5K grade-school math problems that require basic to intermediate math operations. | To test LLMs’ ability to work through multi-step math problems. | Useful for assessing AI’s capability in solving fundamental mathematical problems valuable in educational contexts. |
Big-Bench Hard (BBH) | A subset of BIG-Bench that focuses on the most challenging tasks requiring multi-step reasoning. | To challenge LLMs with complex tasks demanding advanced reasoning skills. | Important for evaluating the upper limits of AI capabilities in complex reasoning and problem-solving. |
These are just a few of the many datasets used for LLM benchmarking that have become industry standards for AI leaders developing new models. Each of these datasets is specifically designed to evaluate LLMs on targeted tasks within the broader field of Natural Language Processing.
- Benchmarking state-of-the-art LLMs:
Now that we’ve reviewed the available datasets and benchmarks, let us examine where the current LLMs stand against some of the benchmarks mentioned above:
Benchmark | GPT-4o | Claude-3.5 | Gemini-1.5-Pro |
---|---|---|---|
MMLU | 88.7% | 88.7% | 85.9% |
Code/HumanEval | 87.8% | 92.0% | 82.6% |
Math | 52.9% | 71.1% | 67.7% |
Reasoning/GPQA | 53.6% | 59.4% | 46.2% |
Big-Bench Hard | 86.8% | 93.1% | 89.2% |
Each of the benchmarks from the table above is associated with the performance of LLMs toward a specific task.
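When reading a table like this, it helps to rank the models per benchmark mechanically rather than by eye. The sketch below copies the scores from the table above into a dictionary and prints each benchmark's ordering. The `leaders` helper is a hypothetical name introduced for this example.

```python
# Scores (percent) copied from the table above.
scores = {
    "MMLU": {"GPT-4o": 88.7, "Claude-3.5": 88.7, "Gemini-1.5-Pro": 85.9},
    "Code/HumanEval": {"GPT-4o": 87.8, "Claude-3.5": 92.0, "Gemini-1.5-Pro": 82.6},
    "Math": {"GPT-4o": 52.9, "Claude-3.5": 71.1, "Gemini-1.5-Pro": 67.7},
    "Reasoning/GPQA": {"GPT-4o": 53.6, "Claude-3.5": 59.4, "Gemini-1.5-Pro": 46.2},
    "Big-Bench Hard": {"GPT-4o": 86.8, "Claude-3.5": 93.1, "Gemini-1.5-Pro": 89.2},
}


def leaders(benchmark_scores: dict[str, float]) -> list[str]:
    """Return every model tied for the top score on one benchmark."""
    best = max(benchmark_scores.values())
    return [model for model, score in benchmark_scores.items() if score == best]


for benchmark, by_model in scores.items():
    # Sort models from highest to lowest score for this benchmark.
    ranked = sorted(by_model.items(), key=lambda item: item[1], reverse=True)
    summary = ", ".join(f"{model} {score}%" for model, score in ranked)
    print(f"{benchmark}: {summary}")
```

Returning a list from `leaders` keeps ties visible, which matters for MMLU, where two models share the top score.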
Reasoning Tasks:
- MMLU contains questions covering a wide range of subjects, such as the humanities, STEM, and social sciences, and is designed to test a model's knowledge and reasoning ability.
- GPT-4o and Claude Sonnet tie on the MMLU benchmark at 88.7%, as shown in the table above, while Gemini-Pro trails at 85.9%. Gemini has made strides with its newer model, but on this benchmark it still sits just under three points behind the other two.
Math and Coding Proficiency:
- HumanEval: Evaluates coding ability with tasks that require the model to generate Python code that passes specific unit tests.
- Math (Benchmark Dataset): Assesses problem-solving in math with questions of varying difficulty. It consists of mathematical problems with corresponding gold answers.
- On the HumanEval benchmark, which assesses coding performance, Claude Sonnet leads with 92.0%, followed by GPT-4o at 87.8% and Gemini-Pro at 82.6%. GPT-4o sits roughly midway, about four points behind Claude Sonnet and five ahead of Gemini-Pro.
- On the Math benchmark, which assesses the LLM’s ability to solve mathematical problems, Claude Sonnet leads with 71.1%, followed by Gemini-Pro at 67.7%. GPT-4o scores 52.9%, well behind both, so here Claude's closest competition comes from Gemini-Pro rather than GPT-4o.
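The unit-test check behind coding benchmarks such as HumanEval can be sketched in a few lines: run the generated code, run the tests against it, and count the task as solved only if nothing fails. The helper name, task, and sample outputs below are hypothetical, and this is a simplification rather than the benchmark's actual harness.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run generated code and its unit tests; True if nothing raises.

    Warning: exec() runs arbitrary code. Real harnesses sandbox this step
    and enforce timeouts.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the generated function
        exec(test_source, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True


# Hypothetical model outputs for a task asking for an `add` function.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(passes_tests(good, tests))  # True
print(passes_tests(bad, tests))   # False
```

A benchmark score is then the fraction of tasks for which the generated code passes.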
Reasoning/GPQA:
- Measures how well models handle complex reasoning and problem-solving tasks. It contains questions that typically require graduate-level understanding.
- GPT-4o scores 53.6% on the GPQA benchmark, which measures reasoning abilities. Claude Sonnet surpasses GPT-4o with a score of 59.4%, and Gemini-Pro lags further behind this time with a score of 46.2%. This indicates that while GPT-4o performs decently, Claude has an edge in reasoning tasks, and Gemini continues to show room for improvement.
Big-Bench Hard:
- Challenges models with difficult questions that test advanced reasoning skills. It contains a variety of texts, including comprehension questions.
- On the Big-Bench Hard benchmark, GPT-4o performs well with a score of 86.8%, although Claude Sonnet outperforms it again with 93.1%. Gemini-Pro, at 89.2%, outperforms GPT-4o for the second time. On this benchmark, which involves complex reasoning tasks, Claude again delivers the strongest performance, while GPT-4o remains competitive but trails both Claude and Gemini.
Conclusion
Based on the evaluations and benchmarks discussed, it is evident that the LLMs tested show distinct performance differences across various tasks. The benchmarking results indicate that while some models excel in specific areas like reasoning or coding, others fall short in critical aspects.
These benchmarks highlight the importance of selecting the right model based on the task requirements. At Redblock AI, we use benchmarks to choose the right LLM and to keep our analysis both precise and actionable. The methodologies employed, including metric-based assessments and benchmarking against industry standards, provide clear insights into the strengths and limitations of each model tested.
Bibliography
- Papers with Code, "BLUE Dataset." Retrieved August 14, 2024, from https://paperswithcode.com/dataset/blue
- Beeson, L., "LLM Benchmarks." GitHub, 2023. https://github.com/leobeeson/llm_benchmarks
- OpenAI, "GPT-4o Mini: Advancing Cost-Efficient Intelligence." OpenAI, 2024. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
- Hugging Face, "Chapter 7: Transformers and transfer learning." Hugging Face, n.d. https://huggingface.co/learn/nlp-course/chapter7/1?fw=pt
- Clefourrier, L., "LLM Evaluation." Hugging Face, 2024. https://huggingface.co/blog/clefourrier/llm-evaluation
- DeepMind, "Gemini: Flash." DeepMind, 2024. https://deepmind.google/technologies/gemini/flash/
- AllenAI, "AI2 ARC Dataset." Hugging Face, n.d. https://huggingface.co/datasets/allenai/ai2_arc
- NYU MLL, "GLUE Dataset." Hugging Face, n.d. https://huggingface.co/datasets/nyu-mll/glue
- Zellers, R., "HellaSwag Data." GitHub, n.d. https://github.com/rowanz/hellaswag/tree/master/data
- Joshi, M., "TriviaQA Dataset." Hugging Face, n.d. https://huggingface.co/datasets/mandarjoshi/trivia_qa
- OpenAI, "GSM8K Dataset." Hugging Face, n.d. https://huggingface.co/datasets/openai/gsm8k
- Lukaemon, "BBH Dataset." Hugging Face, n.d. https://huggingface.co/datasets/lukaemon/bbh
Thanks to Indus Khaitan, Aviral Srivastava, Basem Rizk, and Raj Khaitan for reading drafts of this.
Manual is out. Automate is in.
Better, Bigger, Bolder
The year from February 2023 to February 2024 was incredibly tough, personally and professionally.
I faced the heartbreaking loss of both my mother-in-law and father-in-law, who passed away one year apart—same hour, same day. Precisely 12 months apart. Uncanny and deeply painful.
These personal tragedies unfolded as I grappled with a difficult decision about the future of Quolum–a fintech startup I co-founded in 2019.
Quolum faced a small, competitive market. Despite our best efforts and the value our products brought to customers, scaling became increasingly challenging. Introducing procurement services further complicated our product story—not just for us but for every vendor in the market. I have many thoughts on this but for another time.
Ultimately, the decision was clear: downsize to a lean 4-5 person team and continue generating $500K-$1M ARR, or wind down the business.
I chose the latter.
However, I wasn’t ready to step away from entrepreneurship. I’m passionate about tackling unsolved problems, no matter how niche they may seem. Plus, the macro shifts in AI, especially with the rise of large language models, made me rethink how every technology category could be reimagined.
Picking up the pieces, Redblock was born, bringing me back to my roots in cybersecurity.
We are on a mission to build a product that empowers AI to automate tasks traditionally destined for humans. We aim to let AI handle the rote, repetitive, and complex tasks. We’ve identified a few specific use cases and are using them to understand human behavior, teaching AI to take over these tasks.
I’m back–better, bigger, bolder, and more ambitious than I was.
A Quick Introduction to AI Agents
Introduction
Definition and Purpose
AI agents are sophisticated computer programs designed with the goal of simulating human cognitive functions. They are developed to undertake tasks that typically require human intelligence, such as understanding natural language and solving complex problems.
Importance in Technological Evolution
The development of AI agents represents a leap forward in the field of artificial intelligence. Their ability to mimic human abilities not only showcases the progress in AI but also opens up new possibilities for interaction between humans and machines.
Early Interactions and Examples
First Wave of AI Agents
The initial wave of AI agents, including Siri, Alexa, and Google Assistant, introduced us to the convenience of voice-activated digital assistance. These agents made it easier to perform simple tasks such as setting reminders or playing music, integrating technology more seamlessly into daily life.
Impact on Daily Life
The presence of these early AI agents in smartphones and home devices marked the beginning of widespread consumer interaction with artificial intelligence. Their ability to carry out specific commands significantly improved user experience and introduced the mass market to the potential of AI.
Evolution and Advancements
Multi-Step Processes and Memory
Recent advancements have enabled AI agents to not only execute complex multi-step tasks but also remember past interactions. This progression allows for more personalized and efficient assistance, reflecting a closer mimicry of human cognitive processes.
Dynamic and Adaptive Behavior
Modern AI agents are now capable of learning from interactions and adapting their responses accordingly. This dynamic behavior enables them to provide more accurate and contextually relevant information, enhancing the user experience.
Interface and Automation
Direct Interaction with Platforms
The latest generation of AI agents can directly interface with both physical and virtual platforms, enabling a hands-free approach to task execution. This direct interaction facilitates seamless integration of AI into various aspects of work and life.
Enhancements in Task Automation
By automating tasks without human intervention, AI agents are revolutionizing efficiency and accuracy in numerous fields. This automation not only saves time but also reduces the likelihood of human error, leading to more reliable outcomes.
Application in Cybersecurity
Preventing Sophisticated Threats
At RedBlock, the AI agents we're developing are designed to identify and prevent advanced cyber threats before they can cause harm. This proactive approach to cybersecurity represents a significant step forward in protecting digital assets.
Containment and Prevention
While cybersecurity experts handle the aftermath of an attack, our AI agents specialize in containment and prevention. By focusing on these areas, the agents aim to minimize the impact of cyber threats and maintain the integrity of digital systems.