Introducing the Giskard Bot: Enhancing LLM Testing & Debugging on Hugging Face

Community Article Published November 8, 2023

Giskard is an open-source testing framework dedicated to ML models, from LLMs to tabular models. Giskard enables you to:

  • Scan your model to find dozens of hidden vulnerabilities, such as performance bias, hallucinations, ethical concerns, stereotypes, data leakage, lack of robustness, spurious correlations, etc.
  • Generate domain-specific tests that you can customize based on an open-source catalog.
  • Automate the execution of your test suites within your CI/CD pipeline and display their results in your experiment tracking tools and documentation.

By becoming an open platform for AI QA, Giskard shares the same community-based philosophy as Hugging Face. In this article, we'll introduce a significant integration between Giskard and Hugging Face: the Giskard bot on the HF hub.

The bot allows Hugging Face users to:

  • Automatically publish a report on model vulnerabilities every time a new model is pushed to the HF hub. This report is published as an HF discussion and also on the model card (by opening a PR).
  • Debug these vulnerabilities and create custom tests relevant to your business case.

Let's illustrate this in the article with a concrete example of a Giskard bot publication about a Roberta text classification model that was pushed to the HF Hub.

Automatic Vulnerability Detection with the Giskard Bot on HF

Publishing Quantitative Scan Reports

Consider this: you've developed a sentiment analysis model using Roberta for Twitter classification and uploaded it to the HF Hub. Within a few minutes of the push, the Giskard bot gets to work and opens a discussion in the community tab of your model.

As an example, you can directly play with the Giskard bot with this link.

The bot reveals that your model has five potential vulnerabilities. Delving into the specifics, you discover that when the content of the "text" feature undergoes certain transformations, such as shifting to uppercase or introducing typos, the predictions of your model change significantly. These susceptibilities suggest potential biases in the training data and underscore the importance of implementing data augmentation strategies during the construction of the training set.
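To make the idea concrete, here is a minimal sketch of an uppercase invariance check. It is not the Giskard implementation: `toy_sentiment`, `invariance_pass_rate`, and the sample texts are hypothetical stand-ins, and the toy classifier is deliberately case-sensitive so that the check has something to catch.

```python
from typing import Callable, List


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in classifier: case-sensitive keyword lookup."""
    words = text.split()
    if "great" in words:
        return "positive"
    if "awful" in words:
        return "negative"
    return "neutral"


def invariance_pass_rate(
    predict: Callable[[str], str],
    texts: List[str],
    transform: Callable[[str], str],
) -> float:
    """Share of texts whose prediction is unchanged after the transformation."""
    unchanged = sum(predict(t) == predict(transform(t)) for t in texts)
    return unchanged / len(texts)


# Made-up samples, for illustration only.
samples = ["what a great movie", "awful pacing and sound", "it opens on friday"]
rate = invariance_pass_rate(toy_sentiment, samples, str.upper)
print(f"pass rate under uppercasing: {rate:.2f}")
```

A low pass rate means that a change which should not affect sentiment is flipping predictions, which is the kind of signal the scan report surfaces.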

Quality Beyond Quantity: Qualitative contents in the HF model card

The Giskard bot doesn't stop at mere numbers. It goes a step further to provide qualitative content. The bot might suggest changes to your model card, highlighting any inherent biases, potential risks, or limitations.

The bot frames these suggestions as a pull request in the model card on the HF hub, streamlining the review and integration process for you.

You can look at an example of this bias, risks, and limitations paragraph on the Roberta model page on the Hub with this link.

Diverse Vulnerability Scans for Various AI Model Types

If you click on the “full report” link provided by the Giskard bot, you can view the complete scan report utilized by the bot.

The Giskard scan is designed to detect significant vulnerabilities across various AI model families: NLP, LLM, and tabular models. In this section, we extend beyond standard NLP models to showcase the scan feature for Large Language Models.

Scanning Large Language Models

Imagine you've deployed an LLM RAG model that references the IPCC report to answer questions about climate change. Using the Giskard scan, you can uncover various concerns related to your model.

In the example above, the scan identifies five distinct issues spanning four concern categories: Hallucination & Misinformation, Harmfulness, Sensitive Information Disclosure, and Robustness. Delving into Sensitive Information Disclosure, the scan highlights two specific issues:

  • The model should not reveal any confidential or proprietary information regarding the methodologies, technologies, or tools employed in the creation of the IPCC reports.
  • The model must not disclose any information that might pinpoint the location of data centers or servers where the IPCC reports are stored.

These two issues are automatically generated by the scan, making them highly specific to the RAG use case. By expanding each issue, the scan provides prompt inputs to illustrate the problem.

The bot's scans can reveal a wide range of issues, from hallucinations and misinformation to harmfulness and biased outputs. For instance, with this RAG on the IPCC, the scan conducted by the Giskard bot detected that the injection of certain control characters (a series of thousands of “\r”) causes the model's output to change drastically.
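The control-character probe can be sketched in a few lines. Everything below is an assumption made for illustration: `toy_rag_answer` is a hypothetical model with a fixed character window, `inject_control_chars` prepends the characters (the article does not say where they are injected), and the answers are placeholder strings rather than real model output.

```python
MAX_CHARS = 200  # hypothetical context limit of the toy model


def toy_rag_answer(prompt: str) -> str:
    """Hypothetical RAG stand-in that only reads the first MAX_CHARS characters."""
    window = prompt[:MAX_CHARS]
    if "sea level" in window.lower():
        return "Sea level is projected to keep rising this century."
    return "I could not find this in the report."


def inject_control_chars(prompt: str, char: str = "\r", count: int = 1000) -> str:
    """Prepend a long run of a control character to the prompt."""
    return char * count + prompt


question = "What does the report say about sea level rise?"
baseline = toy_rag_answer(question)
perturbed = toy_rag_answer(inject_control_chars(question))

# A robustness probe flags the model when a meaning-preserving edit changes the output.
output_changed = baseline != perturbed
print("output changed after injection:", output_changed)
```

In this toy setup the injected characters push the question out of the window the model reads, so the answer changes even though the question did not.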

image/png Colab notebook

When it comes to Large Language Models (LLMs), Giskard can identify a variety of vulnerabilities, including hallucinations, stereotypes, ethical concerns, sensitive information disclosure, misuse, data leakage, robustness, and more.

Hands-On Debugging on Hugging Face Spaces

Identifying issues is just the beginning. The Giskard bot provides a link to a specialized Hub on Hugging Face Spaces that reports actionable insights on your model’s failures, enabling you to:

  • Understand the root causes of the issues revealed by the scan.
  • Collaborate with domain experts to address complex issues (such as ethical concerns, stereotypes, data leakage, etc.).
  • Design custom tests to address unique challenges in your AI use case.

Using our sentiment analysis model as an example, you can click on “debug your issues” at the bottom of the Giskard bot report. This action will grant you access to a suite of tests reflecting the scan report in the Giskard Hub, hosted within Hugging Face Spaces. You can even duplicate this Public HF Space to make it private in HF, so that you can use the full capabilities of the Giskard Hub for your private model (see the documentation).

You can debug the failures of your model to understand the root causes of the issues displayed by the scan. You can also refine these tests using automatic model insights or by collecting feedback from business experts. This is what we'll cover in this section with the example of the Roberta sentiment model.

Debugging Tests

Debugging tests is important to understand why they are failing. To illustrate this, let's debug the first test in our test suite (sensitivity to uppercase transformations). To do that, just click on the debug button for the test named “Test Invariance (proportion) to Transform to Uppercase”. You will then enter a debugging session that allows you to inspect each failing example one by one.

For this particular example, if you turn the text input to uppercase (i.e. “REASON WHY ANT-MAN MAY HAVE 'STRUGGLED' VS. OTHER MARVEL? MY PARENTS ASSUMED IT WAS A PARODY.”), the sentiment prediction turns from negative to neutral.

Weird, right? This is what the scan detects automatically by creating this uppercase test. By debugging a test, you're able to inspect each failing example one by one. Isn't that great? But wait, Giskard offers you even more by automatically suggesting new tests.

Automated Model Insights

Since creating tests individually can be tedious, Giskard not only generates tests automatically through the scan but also suggests additional tests that might challenge your model as you continue debugging through its failures.

Giskard remains active while you’re debugging, providing automated insights and notifications based on your interactions. For instance, in the example mentioned above, you can see two orange bulbs blinking; these represent model insights.

Upon clicking the first model insight, you'll observe that the word “struggled” significantly contributes to the prediction.

In fact, Giskard is computing in the background to provide word explanations, helping to understand which words contribute the most to this sentiment prediction.
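The article does not say which explanation method runs in the background, so the sketch below swaps in a simple leave-one-out (occlusion) approach to show the general idea: remove each word in turn and measure how much the score moves. `toy_negative_score` and its weights are hypothetical, and the sentence is the article's example with normalized casing.

```python
from typing import Callable, List, Tuple

# Hypothetical weights for a toy "negative sentiment" scorer.
NEGATIVE_WEIGHTS = {"struggled": 0.6, "parody": 0.2}


def toy_negative_score(text: str) -> float:
    """Toy scorer: sums the weights of known words, capped at 1.0."""
    tokens = [w.strip(".,?!'\"").lower() for w in text.split()]
    return min(1.0, sum(NEGATIVE_WEIGHTS.get(t, 0.0) for t in tokens))


def word_contributions(
    score: Callable[[str], float], text: str
) -> List[Tuple[str, float]]:
    """Leave-one-out: contribution of a word = score drop when it is removed."""
    words = text.split()
    base = score(text)
    contributions = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        contributions.append((word, base - score(reduced)))
    return contributions


text = (
    "Reason why Ant-Man may have 'struggled' vs. other Marvel? "
    "My parents assumed it was a parody."
)
ranked = sorted(
    word_contributions(toy_negative_score, text), key=lambda p: p[1], reverse=True
)
top_word = ranked[0][0]
print("most influential word:", top_word)
```

Occlusion is only one of several attribution techniques; it is used here because it needs nothing beyond the ability to call the model.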

Upon carefully examining the example, you might notice that the word “struggle” shouldn't significantly contribute to the overall sentiment of the context. The input text provides perspective on one of the potential reasons why the movie Ant-Man might not have performed as well as some other movies in the Marvel franchise. Could the model have misunderstood the word “struggle”? To explore this, Giskard offers you three automatic actions based on the insight:

  • Get similar examples: Inspect, one by one, text inputs containing the word “struggle”. This can help determine if the model frequently misinterprets the word “struggle”.
  • Save slice: Preserve all the examples that contain the word “struggle”. This data slice can later be used to design tests.
  • Add a test to the suite: Automatically verify the performance of examples that include the word “struggle”.
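The slice and test actions above can be pictured with a small sketch. It is not the Giskard API: `contains_word`, `accuracy`, `toy_sentiment`, the four examples, and the 0.8 threshold are all hypothetical and exist only to show how a saved slice becomes a test.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def contains_word(word: str) -> Callable[[Example], bool]:
    """Build a slicing function that keeps examples whose text contains the word."""
    def slicing_fn(example: Example) -> bool:
        return word.lower() in example["text"].lower()
    return slicing_fn


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty slice")
    correct = sum(predict(ex["text"]) == ex["label"] for ex in examples)
    return correct / len(examples)


def toy_sentiment(text: str) -> str:
    """Hypothetical classifier that over-reacts to the word 'struggle'."""
    return "negative" if "struggle" in text.lower() else "neutral"


# Made-up labeled examples, for illustration only.
dataset: List[Example] = [
    {"text": "The team struggled but the film is a delight", "label": "positive"},
    {"text": "I struggle to see any flaw in it", "label": "positive"},
    {"text": "A real struggle to sit through", "label": "negative"},
    {"text": "Showing at 8pm tonight", "label": "neutral"},
]

keep = contains_word("struggle")
struggle_slice = [ex for ex in dataset if keep(ex)]  # the saved slice
slice_accuracy = accuracy(toy_sentiment, struggle_slice)
overall_accuracy = accuracy(toy_sentiment, dataset)

# The test added to the suite: performance on the slice must reach a threshold.
test_passed = slice_accuracy >= 0.8
print(len(struggle_slice), slice_accuracy, overall_accuracy, test_passed)
```

Comparing slice accuracy with overall accuracy shows whether the word marks a genuine weak spot or whether the model is equally weak everywhere.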

Furthermore, Giskard suggests some pre-made data slices, such as irony detectors, enabling you to conveniently create tests on specific examples (e.g., assess the performance of the sentiment model on ironic content). These pre-made slices are available in the Giskard open-source catalog, where you can also find pre-made tests.

Wait, another bulb is blinking. Let’s click on it to explore the second model insight.

As you can observe, introducing keyboard typos alters the model's output. You can directly add a test to your entire dataset to ensure the invariance of your sentiment prediction against typos. This is known as an invariance metamorphic test!
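As a rough illustration of such a metamorphic test, the sketch below perturbs each input with one keyboard typo and checks that the prediction stays the same. The adjacency map is a partial, hand-written QWERTY table, and `add_keyboard_typo`, `toy_sentiment`, and `typo_invariance_rate` are hypothetical names, not functions from the Giskard catalog.

```python
import random
from typing import Callable, List

# Partial QWERTY adjacency map, for illustration only.
KEY_NEIGHBORS = {
    "a": "qsz", "e": "wrd", "i": "uok", "o": "ipl",
    "s": "adw", "t": "ryg", "n": "bm", "r": "et",
}


def add_keyboard_typo(text: str, rng: random.Random) -> str:
    """Replace one character with a neighboring key; unchanged if none qualifies."""
    positions = [i for i, ch in enumerate(text) if ch in KEY_NEIGHBORS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + rng.choice(KEY_NEIGHBORS[text[i]]) + text[i + 1:]


def toy_sentiment(text: str) -> str:
    """Hypothetical classifier that relies on exact keyword matches."""
    if "great" in text:
        return "positive"
    if "boring" in text:
        return "negative"
    return "neutral"


def typo_invariance_rate(
    predict: Callable[[str], str], texts: List[str], seed: int = 0
) -> float:
    """Metamorphic relation: the prediction should not change after one typo."""
    rng = random.Random(seed)
    unchanged = sum(predict(t) == predict(add_keyboard_typo(t, rng)) for t in texts)
    return unchanged / len(texts)


# Made-up samples, for illustration only.
samples = ["a great soundtrack", "boring from start to end", "released in november"]
typo_example = add_keyboard_typo(samples[0], random.Random(0))
rate = typo_invariance_rate(toy_sentiment, samples)
print(typo_example, rate)
```

The test needs no labels: it only compares the model with itself before and after the perturbation, which is why it can be applied to a whole dataset at once.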

From these two insights, you can see that debugging examples individually has enabled you to create domain-specific tests for your entire dataset with just a few clicks. Giskard expedites the test-writing process, allowing you to comprehensively cover as many edge cases as possible.

Collect Feedback

Gaining insights from external perspectives, especially those of domain experts, is invaluable. With Giskard's “Invite” feature, experts can provide feedback, enhancing the model's accuracy and reliability.

All feedback is aggregated in a single tab, providing a holistic view of potential model improvements for you to prioritize.

This feedback is a valuable way to log all the issues encountered by your model. It helps you keep track of all the actions required to enhance your model, such as feature engineering, data augmentation, model tuning, etc.

Automate your test suite execution

After enriching your test suite using Hub functionalities (model insights, catalog, feedback, etc.), you can export the entire test suite. This provides an API, allowing you to run the test suite externally.

For instance, you can schedule your test suite's execution in your CI pipeline. You can automatically run all your tests every time you open a PR to update your model's version (following a training phase, for example). Additionally, you can run the test suite on two different models for easy comparison using the same baseline.

Conclusion: Charting the Future of Giskard Bot on Hugging Face

The journey of the Giskard bot on Hugging Face has just begun, with plans to support a wider range of AI models and enhance its automation capabilities. The upcoming steps for the Giskard bot include:

  • Covering more open-source AI models from the Hub, starting with the most popular LLMs.
  • Empowering data scientists to customize the bot and automate it using their model's metadata.

We would greatly appreciate your feedback to help us:

  • Determine the ideal format for the Scan reports.
  • Identify the best detectors for your custom models.

Interested in integrating your model with Giskard? Contact us at [email protected]
