How to Write Tests for AI-Generated Functions When You Didn't Write the Code

A step-by-step process for testing code you received from an AI model rather than wrote yourself

Testing code you wrote yourself is conceptually straightforward. You built the function, you know where your uncertainty lives, and you write tests for the parts you're least confident about. The tests encode your mental model of the implementation.

Testing code generated by an AI assistant is different in a specific way: you do not have that mental model. You received code that looks correct. You need to verify correctness without the benefit of having built the thing you're verifying.

This guide covers a step-by-step process for writing tests in that context. The approach is designed for the specific failure modes of AI-generated code: edge case gaps, silent error handling, assumption mismatches with your system, and input validation issues.

Step 1: Read the Function as a Specification, Not as Implementation

Before writing any tests, read the AI-generated function as you would read a specification document, not an implementation. Your goal is to identify the contract: what inputs does the function accept, what does it promise to do with them, and what does it return?

Write down the contract explicitly:

  • Valid input range (minimum and maximum values, accepted types)
  • Expected output for valid inputs
  • Expected behavior for invalid inputs (throw, return error, return default?)
  • External dependencies the function calls
  • State that the function reads from or writes to
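One way to make the contract concrete is to record it as data next to the tests. The sketch below is only an illustration: the `FunctionContract` structure, its field names, and the example values are assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionContract:
    """Written-down contract for one function under test (hypothetical structure)."""

    name: str
    valid_inputs: str                    # accepted types and value range
    expected_output: str                 # what valid inputs should produce
    invalid_input_behavior: str          # throw, return error, or return default
    dependencies: tuple[str, ...] = ()   # external calls the function makes
    state_touched: tuple[str, ...] = ()  # state the function reads or writes


# Example contract; every value here is an assumption made for illustration.
MONTHLY_RATE_CONTRACT = FunctionContract(
    name="calculate_monthly_rate",
    valid_inputs="float annual rate between 0.001 and 0.99 inclusive",
    expected_output="(1 + rate) ** (1 / 12) - 1",
    invalid_input_behavior="ValueError when out of range, TypeError for non-numbers",
)


def failure_conditions(contract: FunctionContract) -> list[str]:
    """Turn each listed dependency into one failure scenario to test."""
    return [f"{dependency} fails" for dependency in contract.dependencies]
```

A pure function such as this one has no dependencies, so `failure_conditions` returns an empty list. A function that calls a database would list it, and each entry becomes a test to write later.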

This step is not optional. Without the explicit contract, you have no basis for writing a test that can fail for the right reason. Most gaps in testing AI-generated code come from skipping this step and writing tests directly against the code as written, which only verifies that the code does what it does, not what it should do.

Step 2: Write the Happy Path Test First

The happy path test verifies that the function produces the correct output for a representative valid input. This test should pass immediately if the AI-generated code is correct for the typical case.

In Jest:

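// Hypothetical module path; point this at wherever the generated function lives.
const { calculateMonthlyRate } = require('./rates');

describe('calculateMonthlyRate', () => {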
  it('returns correct monthly rate for standard annual rate', () => {
    const result = calculateMonthlyRate(0.12); // 12% annual
    expect(result).toBeCloseTo(0.009489, 5);   // (1.12)^(1/12) - 1
  });
});

If this test fails, the function has a fundamental logic error that you can identify and report before investing time in edge case tests. If it passes, you have a baseline.
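The expected value in a happy path test should come from the contract, not from running the function and copying its output. As a small worked check, assuming the contract specifies compound conversion (the formula in the comment above), the expected number can be derived independently:

```python
annual_rate = 0.12

# Compound conversion: the monthly rate that compounds to the annual rate.
expected_monthly = (1 + annual_rate) ** (1 / 12) - 1

# Sanity check: compounding the monthly rate twelve times recovers the annual rate.
recovered_annual = (1 + expected_monthly) ** 12 - 1

# Simple division gives a different number, so the test also pins down
# which convention the function is supposed to follow.
simple_monthly = annual_rate / 12

print(f"compound: {expected_monthly:.6f}, simple: {simple_monthly:.6f}")
```

If the generated function returns the simple-division value instead, the happy path test fails, and the failure points at a contract mismatch rather than at an arithmetic slip.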

Step 3: Build the Edge Case Matrix

This is the step that separates adequate AI code testing from thorough AI code testing. The edge case matrix is a structured enumeration of the boundary conditions the function needs to handle.

For every numeric input: the minimum valid value, the maximum valid value, one below the minimum, one above the maximum, zero, and negative values if the type allows them.

For every string input: empty string, very long string, strings with special characters, strings with Unicode edge cases, null, and undefined.

For every collection input: empty collection, single-element collection, maximum-size collection, and collection with null or undefined elements.
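The numeric rows of the matrix can be generated rather than typed by hand. This is a minimal sketch: the helper name `numeric_edge_cases` is hypothetical, and the `step` argument (the smallest meaningful increment for the input) is an assumption you supply from the contract.

```python
def numeric_edge_cases(minimum, maximum, step):
    """Return (value, should_be_accepted) rows for one numeric input."""
    candidates = [
        minimum - step,  # one below the minimum
        minimum,         # minimum valid value
        maximum,         # maximum valid value
        maximum + step,  # one above the maximum
        0,               # zero
        -step,           # a negative value
    ]
    rows = []
    for value in candidates:
        row = (value, minimum <= value <= maximum)
        if row not in rows:  # drop duplicates, e.g. when minimum - step is 0
            rows.append(row)
    return rows


# Example: an integer input documented as valid from 1 to 10.
for value, accepted in numeric_edge_cases(1, 10, 1):
    print(value, "accept" if accepted else "reject")
```

The returned rows are plain data, so they can be passed straight to a parameterized test.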

Write these as parameterized tests. In Pytest:

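import pytest

# Hypothetical module path; point this at wherever the generated function lives.
from rates import calculate_monthly_rate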

@pytest.mark.parametrize("rate,expected_error", [
    (-0.01, ValueError),   # negative rate not allowed
    (0.0, ValueError),     # zero rate is meaningless
    (0.001, None),         # minimum valid rate
    (0.12, None),          # typical case
    (0.99, None),          # maximum valid rate
    (1.0, ValueError),     # 100% annual rate should be rejected
    (None, TypeError),
])
def test_calculate_monthly_rate_boundaries(rate, expected_error):
    if expected_error:
        with pytest.raises(expected_error):
            calculate_monthly_rate(rate)
    else:
        result = calculate_monthly_rate(rate)
        assert 0 < result < 1

Using Pytest's parametrize pattern, you run all seven cases with one test structure. Adding a new case means adding one row.

In Vitest, the same pattern applies using test.each. The edge case matrix is data, not code.

Step 4: Test Error Handling Behavior Explicitly

AI-generated code has a consistent failure mode in error handling: it catches exceptions from dependencies and returns a default value (null, empty array, empty string) rather than propagating the error or converting it to a domain-specific type.

Because the function never throws, every test that only exercises working dependencies passes. But in production, when the dependency fails, the function returns a silent incorrect result that propagates through the system and surfaces as a confusing downstream failure.

To catch this, write tests that inject dependency failures:

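// Hypothetical module path; point this at wherever the generated function lives.
const { getUserProfile } = require('./userProfile');

describe('getUserProfile', () => {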
  it('propagates database error rather than returning null', async () => {
    const mockDb = {
      findUser: jest.fn().mockRejectedValue(new Error('Connection timeout'))
    };

    await expect(getUserProfile('user-123', mockDb))
      .rejects.toThrow('Connection timeout');
    // If this assertion fails, the function is swallowing the error
  });
});

If the function returns null or an empty object instead of throwing, the test fails and you have found a silent error handler to fix before shipping.

This test pattern requires you to identify every external dependency the function calls, which is another reason the explicit contract from Step 1 matters. The dependency list in the contract becomes the list of failure conditions to test.
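The same failure-injection check can be sketched in Python with the standard library's `unittest.mock`. Both versions of `get_user_profile` below are hypothetical stand-ins: one shows the swallowing pattern, the other the behavior the test demands.

```python
from unittest.mock import Mock


def get_user_profile_swallowing(user_id, db):
    """Hypothetical generated code: hides dependency failures behind None."""
    try:
        return db.find_user(user_id)
    except Exception:
        return None


def get_user_profile_propagating(user_id, db):
    """Hypothetical fixed version: lets the dependency error reach the caller."""
    return db.find_user(user_id)


def propagates_dependency_error(func) -> bool:
    """Inject a failing dependency and report whether the error surfaces."""
    failing_db = Mock()
    failing_db.find_user.side_effect = ConnectionError("Connection timeout")
    try:
        func("user-123", failing_db)
    except ConnectionError:
        return True
    return False


print(propagates_dependency_error(get_user_profile_swallowing))
print(propagates_dependency_error(get_user_profile_propagating))
```

Repeating the check once per dependency in the contract gives one failure-injection test per external call.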

Step 5: Add Security Tests for Input-Handling Functions

Any function that accepts data from outside the trusted perimeter of your system needs a specific set of tests that verifies it correctly handles adversarial inputs. OWASP maintains documentation on the standard test patterns for each vulnerability category.

For functions that construct database queries or shell commands, test with injection strings. For functions that read files or construct paths, test with path traversal sequences. For functions that return HTML or text that will be rendered in a browser, test with XSS payloads.

ESLint, typically with a security-focused plugin, can flag the patterns in the code that suggest these tests are needed: string formatting in query construction, unsanitized string interpolation in template literals, and direct use of user input in sensitive operations. But static analysis only flags potential issues; tests verify actual behavior.

Semgrep also flags security-relevant patterns during static analysis. Both tools are complementary to these tests rather than substitutes for them.
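As an illustration of the injection tests described in this step, the sketch below runs a few well-known SQL injection strings against a query function using an in-memory SQLite database. The function `find_user_by_name`, the `users` table, and the payload list are all assumptions made for the example; a real suite would draw payloads from the relevant OWASP material.

```python
import sqlite3

INJECTION_PAYLOADS = [
    "' OR '1'='1",
    "alice'; DROP TABLE users; --",
    "alice' --",
]


def find_user_by_name(conn, name):
    """Hypothetical function under test; binds the value instead of formatting it."""
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


def make_seeded_db():
    """Create a throwaway database with two known users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def check_injection_resistance(query_func):
    """Return the payloads that leaked rows, raised a SQL error, or damaged the table."""
    failures = []
    for payload in INJECTION_PAYLOADS:
        conn = make_seeded_db()
        try:
            leaked = query_func(conn, payload)
            remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        except (sqlite3.Error, sqlite3.Warning):
            failures.append(payload)  # payload text was interpreted as SQL
        else:
            if leaked or remaining != 2:
                failures.append(payload)
        finally:
            conn.close()
    return failures


print(check_injection_resistance(find_user_by_name))
```

No user is literally named like a payload, so any returned row, SQL error, or change in row count means the input was treated as SQL rather than as data.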

Step 6: Write Integration Tests for External Dependency Assumptions

Unit tests with mocked dependencies verify the function's logic given assumed dependency behavior. They cannot verify that the assumed behavior matches the actual behavior of your specific system.

AI-generated code makes assumptions about external dependencies based on the most common patterns in its training data. Your database might return column names differently than the ORM default. Your API client might wrap responses differently than the standard SDK. These mismatches only surface when the code runs against real or realistic system state.

For each external dependency in the function's contract:

  1. Identify what the AI-generated code assumes about the response shape and behavior
  2. Verify whether your system actually matches that assumption
  3. If the assumption is wrong, write a failing test that documents the mismatch and fix the code
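The steps above can be sketched as a small shape check against a seeded database. Everything named here is hypothetical: the `users` table, its columns, and the set of keys the generated code is assumed to read. The in-memory SQLite database stands in for whatever seeded test database your pipeline uses.

```python
import sqlite3

# Keys the generated code reads from each row (an assumption found by reading it).
ASSUMED_KEYS = {"id", "email", "created_at"}


def actual_row_keys(conn, table):
    """Ask the seeded database which columns a row of `table` actually has."""
    # `table` is a trusted constant from the test, never user input.
    cursor = conn.execute(f"SELECT * FROM {table} LIMIT 0")
    return {column[0] for column in cursor.description}


def missing_keys(conn, table, assumed):
    """Columns the generated code expects but the system does not provide."""
    return assumed - actual_row_keys(conn, table)


# Seeded test database standing in for the real schema, which uses camelCase.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, createdAt TEXT)")

mismatches = missing_keys(conn, "users", ASSUMED_KEYS)
print(mismatches)
```

An integration test that asserts `mismatches` is empty fails here and names the mismatched column, which documents the wrong assumption before the code is fixed.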

Integration tests take longer to run than unit tests. Run them in a dedicated CI stage against a seeded test database or a realistic stub rather than in the same stage as unit tests.

SonarCloud can provide project-level visibility into how integration test coverage changes over time as AI-generated code ships. If coverage trends downward as the volume of AI-generated code grows, testing is not keeping pace with generation.

Step 7: Add Test Coverage to the Pull Request Requirement

Once you have the test structure in place, enforce it. Configure your CI pipeline to require that new files have test coverage above a threshold before a pull request can merge.

This is not primarily about the threshold percentage. It is about making the conversation about untested AI-generated code explicit during review rather than implicit after deployment.

The reviewer's job for AI-generated code is different from the reviewer's job for human-written code. The reviewer did not write the code and lacks the developer's mental map of where the uncertainty lives. Requiring test coverage before merge puts the behavioral verification into the PR conversation rather than leaving it to production incidents.
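A per-file gate for new files can be sketched as a short script that a CI step runs after the test suite. The report layout below is assumed to match the JSON report produced by coverage.py, so check it against the version you use. The file paths and the 80% threshold are made up for illustration.

```python
def files_below_threshold(report, new_files, threshold):
    """Return the new files whose line coverage is under `threshold` percent.

    `report` is assumed to look like:
    {"files": {path: {"summary": {"percent_covered": float}}}}
    A new file that is absent from the report counts as 0% covered.
    """
    failing = []
    for path in new_files:
        summary = report.get("files", {}).get(path, {}).get("summary", {})
        if summary.get("percent_covered", 0.0) < threshold:
            failing.append(path)
    return failing


# Hypothetical report and list of files added in the pull request.
sample_report = {
    "files": {
        "billing/rates.py": {"summary": {"percent_covered": 92.5}},
        "billing/invoices.py": {"summary": {"percent_covered": 41.0}},
    }
}
new_files = ["billing/rates.py", "billing/invoices.py", "billing/untested.py"]

failing = files_below_threshold(sample_report, new_files, threshold=80.0)
print(failing)
```

Treating a missing file as uncovered is a deliberate choice: a new file that no test imports never appears in the report, and that is exactly the case the gate should surface.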

For a complete framework covering this workflow from topic selection through CI integration, 137Foundry documents the approach used in AI-assisted development engagements where testing is built into the generation workflow rather than added as a downstream step.

The companion guide on testing AI-generated code before it ships covers the reasoning behind each stage in more detail, including the confidence calibration problem that makes human review less reliable for AI output than for code the reviewer wrote themselves.

Test-driven development advocates have long argued that writing tests first forces explicit thinking about what the code should do. The same principle applies in reverse to AI-generated code: making the behavioral contract explicit before testing is the step that prevents tests from simply verifying the AI's choices rather than verifying your requirements.

Node.js projects specifically benefit from explicit async error propagation tests, since AI-generated async functions frequently use try-catch patterns that swallow Promise rejections rather than propagating them up the call chain. This is one of the more common silent failure patterns in AI-generated JavaScript and deserves its own test category in any Node.js project using AI assistance.
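The same test category carries over to other async runtimes. As a rough analogue in Python's asyncio (not Node.js), with hypothetical `load_settings` functions and a mocked client, the check looks like this:

```python
import asyncio
from unittest.mock import AsyncMock


async def load_settings_swallowing(client):
    """Hypothetical generated code: the except block hides the failure."""
    try:
        return await client.fetch("/settings")
    except Exception:
        return {}


async def load_settings_propagating(client):
    """Hypothetical fixed version: the awaited error reaches the caller."""
    return await client.fetch("/settings")


async def error_reaches_caller(func) -> bool:
    """Make the awaited dependency fail and report whether the caller sees it."""
    client = AsyncMock()
    client.fetch.side_effect = TimeoutError("upstream timed out")
    try:
        await func(client)
    except TimeoutError:
        return True
    return False


print(asyncio.run(error_reaches_caller(load_settings_swallowing)))
print(asyncio.run(error_reaches_caller(load_settings_propagating)))
```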

Understanding static program analysis helps frame why the tooling in Steps 5 and 6 complements rather than replaces test-based verification. Static analysis identifies patterns without running the code. Tests verify actual behavior. Both matter, and neither substitutes for the other.

The process described here is more deliberate than what most teams apply to human-written code. That is appropriate. When you write the code yourself, you build an implicit test suite in your head before you start. When the AI writes the code, the explicit test suite is the only verification you have.