How to test an LLM-based application as a DevOps engineer

Testing is vital for software, but with LLM-based apps, how to go about it isn’t clear. Here are the types of tests you should run and how to organize them.

Oct 27, 2023 • 10 Minute Read


Testing is a valuable tool for software engineering—for the applications and the underlying infrastructure configurations. While DevOps engineers may not write the tests, they’re often tasked with implementing and automating testing logic. 

As generative AI (GenAI) and large language models (LLMs) have grown in popularity, organizations have started to invest significant resources in building and deploying software based on these systems. Like any other software application, testing is essential to successful deployments. However, generative AI presents unique challenges in terms of testing and validation. Unlike traditional software, where the output is deterministic and predictable, generative AI models produce outputs that are inherently probabilistic and non-deterministic. This makes the testing story much fuzzier.

How should DevOps engineers think about designing and deploying a testing suite that’s effective for LLMs and generative AI-based applications? In this article we’ll cover the types of testing and how tests should be organized. And we’ll provide an example GitHub repository with a demo LLM and testing implementation.

Different tests and their objectives 

Before implementing tests, it’s important to understand the different tests and their use cases. This section provides a brief overview of unit tests and integration tests and a note on what testing can mean in the context of machine learning (ML).

Unit testing

The objective of unit tests is to ensure that individual components of a software application function correctly. They’re typically written with these objectives in mind:

  • Functionality: Does this code behave as expected? Business logic isn’t as important here. Tests might check whether a function returns a valid JSON object or that an array is never empty.

  • Avoiding regression: A test suite with good coverage helps ensure new changes don’t break existing functionality. Even simplistic tests will often catch unintended results from a seemingly unrelated change.

  • Detecting errors or bugs: You can use tests to catch errors like unexpected null values, boundary conditions, or syntax errors.
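The sketch below shows the kind of unit test these objectives describe: purely local checks with no external dependencies. The `parse_response` helper is hypothetical and used only for illustration.

```python
import json


def parse_response(raw: str) -> list[str]:
    """Parse a JSON-encoded list of strings, dropping empty entries."""
    return [item for item in json.loads(raw) if item]


def test_parse_response_returns_non_empty_list_of_strings():
    result = parse_response('["red", "", "blue"]')
    assert isinstance(result, list)
    assert result  # the list is never empty for valid input
    assert all(isinstance(item, str) for item in result)
```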

In some cases, unit tests can also emulate elements of integration testing. Rather than depending on external services, like databases or API endpoints, developers can use mocking tools, such as Python's unittest.mock, to simulate objects or responses.
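As a rough sketch of that idea, the test below uses Python's built-in `unittest.mock` to stand in for an external weather API so the test stays fast and isolated. The `fetch_weather` function and endpoint are hypothetical and only here for illustration.

```python
from unittest import mock

import requests


def fetch_weather(city: str) -> dict:
    """Call an external weather API and return its JSON payload."""
    response = requests.get("https://api.example.com/v1/weather", params={"city": city})
    response.raise_for_status()
    return response.json()


def test_fetch_weather_returns_expected_fields():
    # Simulate the external API so no real network call is made.
    fake_response = mock.Mock()
    fake_response.json.return_value = {"city": "Paris", "temperature_c": 21.0}
    fake_response.raise_for_status.return_value = None

    with mock.patch("requests.get", return_value=fake_response) as mocked_get:
        payload = fetch_weather("Paris")

    mocked_get.assert_called_once()
    assert payload["city"] == "Paris"
    assert isinstance(payload["temperature_c"], float)
```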

Integration testing

Integration testing is the other primary test category for software development. The objective of integration tests is to validate the interactions between different components of the software and external services. Integration test objectives include:

  • Functionality: Functionality has a broader scope in integration testing. Individual components still need to behave as expected, but in an integration context they often depend on external components. For example, a test might check that an API always returns specific headers or that a database call always returns a value.

  • Data consistency: Modern software applications often need to pass data between different systems before presenting a finalized result or output. It’s important to ensure the data remains consistent during the entire lifecycle of a request. For example, part of an application may read numerical values from a database, perform mathematical operations, then write new values back to the database. Integration tests can assert things like consistency of data types (ensuring the data is always integers or floats instead of strings) and that the database responds appropriately to read and write operations.

  • Performance: Complex software systems depend on a lot of moving parts, and if any individual components fall below a certain responsiveness threshold, it can impact the performance of the entire system. A test might check to make sure that an API call never takes longer than n milliseconds to complete; if it does, it might indicate inefficient query logic or a problem with an upstream endpoint.

Integration testing is where the business logic of an application is often evaluated. While these tests are immensely valuable, they often require longer test cycles to return useful information. Design and implement them with care.
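To make those objectives concrete, here's a hedged sketch of a generic integration test that touches all three. The endpoint URL and field names are placeholders, not part of the example repo described later in this article.

```python
import pytest
import requests

WEATHER_API = "https://api.example.com/v1/weather"  # placeholder endpoint
MAX_LATENCY_SECONDS = 0.5


@pytest.mark.integration
def test_weather_endpoint_contract_and_latency():
    response = requests.get(WEATHER_API, params={"city": "Paris"}, timeout=5)

    # Functionality: the endpoint responds successfully and returns JSON.
    assert response.status_code == 200
    assert response.headers.get("Content-Type", "").startswith("application/json")

    # Data consistency: numeric fields come back as numbers, not strings.
    payload = response.json()
    assert isinstance(payload["temperature_c"], (int, float))

    # Performance: the server responds within the latency budget.
    assert response.elapsed.total_seconds() <= MAX_LATENCY_SECONDS
```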

Testing machine learning

It’s important to distinguish the types of testing we’re covering in this article from actual testing of machine learning models. Testing and fine-tuning a machine learning model is a complex process that typically involves large amounts of parallelized, GPU-driven workloads. ML engineers and data scientists often need to apply state-of-the-art research and software development to train and test these models, which is a very different discipline from the operations-focused tests covered in this article.

This section is by no means an exhaustive list of the types and applications of these tests. I recommend that anyone involved in software development and deployment spend time studying and getting hands-on experience with software testing and test automation to expand their knowledge and help drive improved application quality and performance.

Strategies for organizing and categorizing tests

Organizing and categorizing tests effectively can help engineers make informed decisions about how, where, and when to implement the various types of tests.

Deterministic vs. non-deterministic tests

One way to categorize tests is by their predictability, dividing them into deterministic and non-deterministic tests. Deterministic tests are those where the inputs, process, and outputs are predictable and repeatable, producing the same results every time they run when given the same conditions. If a user applies input A, they’ll always receive output B. In most contexts, unit tests are considered deterministic tests, although some integration tests could be considered deterministic as well. 

Conversely, non-deterministic tests have variable outcomes and potential randomness. Input A may not always produce output B, even if all other testing conditions are the same. As previously mentioned, integration tests can be deterministic but can also produce non-deterministic output. Consider these scenarios:

  • Testing an API call to an endpoint that provides weather data: Users can test the structure of the returned data, but the data itself will almost always vary.

  • Responsiveness: A call to an external system may always be expected to return the same data, but the response time will change from test run to test run.

  • Unique ID or key values: An application component may be responsible for generating UUIDs or other unique values as metadata. You can test the functionality, but the returned value will be different on every run if the system is working correctly.

LLM-based systems in particular will often generate non-deterministic outputs. For example, if a user prompts an LLM with the question, "Where is the Eiffel Tower located?", the responses might include "Paris," "paris," "Paris, France," or "Europe." All of these answers are technically correct, but they aren't identical values from a data-validation standpoint. As the complexity of the prompts increases, these tests become harder to conduct and evaluate.
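One common way to cope with this is to normalize the response and assert on the key fact rather than the exact string. A minimal sketch:

```python
def test_answer_mentions_paris():
    # Stand-in for a live LLM reply; in a real test this would come from the model.
    response = "The Eiffel Tower is located in Paris, France."
    assert "paris" in response.strip().lower()
```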

Early unit testing 

Running unit tests earlier, before integration tests, is the recommended approach. These tests are typically lightweight, fast, and isolated, making them ideal for catching low-hanging fruit early in the development cycle. Developers can use mocks or stubs to simulate APIs or datastores during these tests. Many unit testing suites can also be run locally on a development machine in addition to the CI/CD pipeline, giving developers fast, actionable feedback on their changes before they commit their code and are subject to the longer test cycles of an integration testing suite.

Integration testing in CI/CD pipeline 

On the other hand, you should incorporate integration testing into the continuous integration/continuous deployment (CI/CD) pipeline. If possible, conduct linting and unit tests in the earlier stages of the pipeline and run them on all commits prior to pushing them to the upstream repository. Isolate integration tests in a separate stage or pipeline job after unit tests have already been performed. You should configure CI/CD pipeline stages, particularly later ones, to more closely mirror the environments and systems into which the application will be deployed.
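One simple way to implement this split (a sketch, not the only approach) is to tag slower tests with a pytest marker and filter on it: the early pipeline stage runs `pytest -m "not integration"` while a later stage runs `pytest -m integration`. The marker name is a local convention and should be registered in `pytest.ini` or `pyproject.toml` to avoid warnings.

```python
import pytest


def test_output_parser_handles_whitespace():
    # Unit test: fast and isolated, so it runs on every commit.
    items = [item.strip() for item in "red, green , blue".split(",")]
    assert items == ["red", "green", "blue"]


@pytest.mark.integration
def test_llm_round_trip():
    # Integration test: needs network access and API credentials, so it only
    # runs in the later pipeline stage (pytest -m integration).
    ...
```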

The rest of this article will provide an example GitHub repo and GitHub Actions workflow with an LLM-based application using Python as the primary language.

Setting up a Python environment in GitHub Actions

We’re going to set up a basic Python-based LLM application and testing environment in GitHub and GitHub Actions. This section assumes you have some familiarity with GitHub and GitHub Actions, CI/CD, and Python development. I’ll provide full examples in a GitHub repository. If you want to replicate this example in its entirety, you’ll need access to the OpenAI API and a paid account.

Configuring Python and GitHub Actions

For the application itself, we’ll use the amazing LangChain library to handle calls to the LLM, which will be OpenAI’s GPT-4 model in this case. The `main.py` script will use this example from the LangChain documentation. I use Poetry to handle local dependency and virtual environment management. 
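For reference, here's a minimal sketch of what such a `main.py` might look like, assuming the comma-separated-list example from the LangChain documentation (import paths vary between LangChain versions, and the actual repo code may differ):

```python
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers import CommaSeparatedListOutputParser
from langchain.prompts import ChatPromptTemplate


def get_items(category: str) -> list[str]:
    """Ask the LLM for five items in a category and parse the reply into a list."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant. Respond only with a "
                   "comma-separated list, and nothing more."),
        ("human", "List five {category}."),
    ])
    model = ChatOpenAI(model="gpt-4", temperature=0)  # requires OPENAI_API_KEY
    parser = CommaSeparatedListOutputParser()
    chain = prompt | model | parser
    return chain.invoke({"category": category})


if __name__ == "__main__":
    print(get_items("colors"))
```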

Within the GitHub Actions workflow, we’ll base our workflow configuration on the example provided in the documentation, with some additional tweaks.

For testing, we’ll use pytest, ruff for linting, and pyspellchecker for testing the prompt and inputs.

The complete GitHub Actions workflow can be found here.

Basic testing for an LLM prompt

Software applications based on LLMs are highly dependent on the quality of input data and prompts. Prompts are essentially text-based instructions that tell an LLM things like the desired tone, structure, and context for a given question and its response. It’s important to maintain quality and consistency in prompts, and this is where testing comes in handy.

Deterministic tests

The unit tests in this example are designed to be deterministic and don’t depend on external API calls for successful validation. These will run in an isolated workflow step. If there’s any kind of test failure, further time isn’t wasted making potentially costly external API calls in the integration tests.

  1. Check the application for syntactical correctness: Linting looks for common formatting issues and other errors that often prevent the code from running. 

  2. Validate inputs and prompts: Using the `unit_tests.py` module, we validate that our input and output parsing handles expected values correctly and the prompt doesn’t contain any misspelled words that might affect the quality of the LLM output.
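As a rough illustration of that second check, a unit test along these lines could use pyspellchecker to flag typos in the prompt text (the prompt string and test name here are illustrative, not the repo's actual code):

```python
import re

from spellchecker import SpellChecker

PROMPT = (
    "You are a helpful assistant. Respond only with a comma-separated "
    "list of five colors, and nothing more."
)


def test_prompt_contains_no_misspelled_words():
    spell = SpellChecker()
    words = re.findall(r"[a-zA-Z]+", PROMPT.lower())
    misspelled = spell.unknown(words)
    assert not misspelled, f"Misspelled words found in prompt: {misspelled}"
```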

Non-deterministic tests

The integration test examples are potentially non-deterministic and may produce different results or test failures, even with consistent testing conditions.

  1. Evaluate responsiveness: The test fixture contains a simplified evaluation of response time: if the response takes longer than the configured threshold (in seconds), the test fails. If the OpenAI API is experiencing issues or high demand, a failure may occur even though the code is correct.

  2. Validate output format: Although the LLM has been given explicit instructions to return the output in the form of a list, it may still provide malformed output. This assertion validates the format.

  3. Validate response data: The prompt instructs the LLM to respond with a list of “objects” that fit into a category (in this case, colors). The primary and secondary colors are defined here to help fulfill this assertion, but it’s entirely possible that the LLM might respond with something like “light gray,” causing a test failure.
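Putting those three checks together, a hedged sketch of such an integration test might look like the following. It assumes the hypothetical `get_items()` helper from the earlier `main.py` sketch; the real repo's fixtures and thresholds will differ.

```python
import time

import pytest

from main import get_items  # hypothetical import; adjust to the real module

MAX_RESPONSE_SECONDS = 10
KNOWN_COLORS = {"red", "yellow", "blue", "orange", "green", "purple", "violet"}


@pytest.mark.integration
def test_llm_returns_a_timely_list_of_known_colors():
    start = time.monotonic()
    result = get_items("colors")
    elapsed = time.monotonic() - start

    # 1. Responsiveness: fail if the call takes longer than the threshold.
    assert elapsed <= MAX_RESPONSE_SECONDS

    # 2. Output format: the parser should return a non-empty list of strings.
    assert isinstance(result, list) and result
    assert all(isinstance(item, str) for item in result)

    # 3. Response data: every item should be a recognized primary or secondary
    #    color. This can fail legitimately if the model answers "light gray".
    assert all(item.strip().lower() in KNOWN_COLORS for item in result)
```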

Evaluating test results

The code and test examples here are contrived and heavily simplified. A production LLM-based application will typically contain more complex logic and behavior and require a more comprehensive suite of tests. However, these tests provide a good starting point and a benchmark for evaluating and deploying tests.

You can view an example of a successful GitHub Actions workflow run here.

Responding to linting and unit test failures is often a simple proposition: fix the offending code and run the tests again. Responding to integration test failures typically requires more thought. If the LLM consistently provides incorrectly formatted output, or the output is nonsensical in relation to the query, it’s likely you’ll need to rewrite the prompt to be more effective. Users with an OpenAI account can make use of the playground to quickly test various inputs and prompt combinations against the live model without going through the testing cycle each time they make a change.

Wrap-up

Applying traditional software testing methodology to generative AI software requires a shift in how you think about testing. DevOps engineers will increasingly be handed roadmap objectives that involve deploying LLM-based software applications and infrastructure.

Understanding how to take advantage of CI/CD and testing automation to help operationalize them will be incredibly important for the organization to stay competitive in this new landscape—and for DevOps engineers to maintain a relevant skill set for the future.