How to Write Unit Tests for Legacy Code Using AI Coding Assistants
A step-by-step approach to generating, reviewing, and extending AI-produced test suites for legacy code
Untested legacy code is the highest-risk category in software modernization. You cannot change what you cannot verify. And you cannot verify behavior you have not defined in tests. The traditional advice - "write tests before you refactor" - is sound but has always been slow in practice, because writing tests for code you did not write means first understanding that code deeply enough to know what the tests should cover.
AI coding assistants accelerate this phase significantly. They can analyze a legacy function and generate an initial test suite in seconds. That test suite is not sufficient as-is, but it is a dramatically better starting point than a blank test file.
This guide covers how to use AI to generate useful legacy code tests, how to evaluate and extend the output, and where the approach breaks down.
Why Legacy Code Is Hard to Test Retroactively
Testing code after it is written is harder than testing it during design because the code was not designed for testability. Legacy code has several structural properties that make test generation difficult:
Global state dependencies. Functions that read from or write to global variables, class-level state, or module-level singletons are hard to test in isolation because their behavior depends on state set elsewhere in the system.
Implicit dependencies. Functions that create their own database connections, make HTTP requests inline, or read from the filesystem cannot be tested without those external systems available - or without mocking infrastructure the original code did not anticipate.
Deeply nested conditionals. Complex branching logic with many paths requires many test cases to achieve reasonable coverage. AI can help enumerate these paths, but the test setup for each may require significant work.
Side effects as primary output. Functions whose purpose is to write to a database or send an email are hard to test without examining the side effect, which requires mocking the relevant systems.
Understanding these properties in your specific legacy code shapes how you approach AI-assisted test generation.
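As a minimal sketch of the first two properties, the example below shows a legacy-style function that reads shared state and opens its own connection, then isolates both in a test. Every name here (Settings, Database, order_total) is invented for illustration and does not come from any real codebase.

```python
from unittest.mock import MagicMock, patch


class Settings:
    """Stand-in for global state that other parts of the system mutate."""
    tax_rate = 0.08


class Database:
    """Stand-in for a real connection that the legacy code creates inline."""

    @staticmethod
    def connect():
        raise RuntimeError("no database available in unit tests")


def order_total(order_id):
    """Legacy-style function with a hidden global read and an inline connection."""
    conn = Database.connect()                    # implicit dependency
    subtotal = conn.fetch_subtotal(order_id)
    return round(subtotal * (1 + Settings.tax_rate), 2)   # global state


def test_order_total_isolated():
    fake_conn = MagicMock()
    fake_conn.fetch_subtotal.return_value = 100.00
    # Pin the global state and replace the connection for this test only
    with patch.object(Settings, "tax_rate", 0.10), \
         patch.object(Database, "connect", return_value=fake_conn):
        assert order_total("ORD-1") == 110.00
    fake_conn.fetch_subtotal.assert_called_once_with("ORD-1")
```

Both patches are undone when the with block exits, so the test does not leak its state into other tests.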
Step 1: Prepare the Function for AI Analysis
Before asking the AI to generate tests, give it the context to generate useful ones.
Provide:
- The function itself
- The function's callers (so the AI understands how it is used)
- Any related constants or configuration it depends on
- A plain-language description of what the function is supposed to do
A prompt template that works well:
# Context prompt template for AI test generation
prompt = """
I need unit tests for the following function.
Business context: This function calculates the final discount amount
to apply to an order. The key business rule is that discounts do not
stack - the highest applicable discount wins, not the sum.
The function is called by: order_processor.py (finalize_order method)
and batch_processor.py (process_bulk_discounts).
Generate pytest unit tests that cover:
1. The happy path with typical inputs
2. Boundary conditions (zero price, single item, maximum discount tier)
3. Error cases (invalid tier, negative price, missing required fields)
4. The specific business rule about non-stacking discounts
5. Any edge cases you identify from reading the code
For each test, add a comment explaining what behavior it tests.
[function code here]
"""
The "business context" section is the most important part. Without it, the AI generates tests based on inferences from the code, which means it may test incorrect behavior as if it were correct.
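If you generate tests for many functions, the template can be assembled programmatically so that the business context is never omitted. The helper below is a hypothetical sketch, not part of any tool; its name and parameters are assumptions made for illustration.

```python
def build_test_prompt(function_source, business_context, callers):
    """Assemble a test-generation prompt from the pieces listed above."""
    caller_list = "\n".join(f"- {caller}" for caller in callers)
    return "\n\n".join([
        "I need unit tests for the following function.",
        f"Business context: {business_context}",
        f"The function is called by:\n{caller_list}",
        "Generate pytest unit tests that cover the happy path, boundary "
        "conditions, error cases, and the business rules stated above. "
        "For each test, add a comment explaining what behavior it tests.",
        function_source.strip(),
    ])


# Hypothetical usage; the source text could also come from inspect.getsource()
prompt = build_test_prompt(
    function_source="def calculate_discount(price, tier):\n    ...",
    business_context="Discounts do not stack; the highest applicable discount wins.",
    callers=[
        "order_processor.py (finalize_order method)",
        "batch_processor.py (process_bulk_discounts)",
    ],
)
```

Making business_context a required argument means a prompt cannot be built without it.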
Step 2: Review and Classify the AI Output
AI-generated test suites for legacy code have consistent strengths and consistent gaps. Review the output by classifying each test:
# Example: AI-generated tests that need review
# KEEP AS-IS: Standard happy path - AI got this right
def test_calculate_discount_standard_customer():
    result = calculate_discount(price=100.00, tier="standard")
    assert result == 90.00  # 10% standard discount

# KEEP BUT VERIFY: Edge case - confirm that no tier means no discount (full price returned)
def test_calculate_discount_no_tier():
    result = calculate_discount(price=100.00, tier=None)
    assert result == 100.00  # verify: is a None tier a 0% discount or an error?

# NEEDS REVISION: Business rule test - AI tested stacking, but rule says no stacking
def test_calculate_discount_multiple_discounts():
    # AI generated this test assuming stacking behavior - WRONG
    # The actual behavior: highest discount wins
    result = calculate_discount(price=100.00, tier="gold", promo="SAVE15")
    # AI asserted: assert result == 65.00 (20% gold + 15% promo summed)
    assert result == 80.00  # correct: gold 20% wins, promo 15% ignored

# MISSING: Production incident #1447 - very old accounts had a different calculation
# Need to add this test manually based on incident history
def test_calculate_discount_legacy_account_flag():
    result = calculate_discount(price=100.00, tier="gold", is_legacy=True)
    assert result == 85.00  # legacy accounts use fixed 15% regardless of tier
The review process sorts the output into four groups: tests to keep as-is, tests to keep once an assumption is verified, tests whose assertions need correcting, and missing tests that the AI could not have known to generate. The last group - tests based on production incidents and bug history - is the most valuable addition you make.
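To make the expected values concrete, here is a hypothetical implementation that is consistent with the reviewed assertions. The real legacy function is not shown in this guide, so the rate tables and the parameter defaults below are assumptions inferred from the test comments.

```python
# Assumed rates, inferred from the comments in the reviewed tests
TIER_RATES = {"standard": 0.10, "gold": 0.20}
PROMO_RATES = {"SAVE15": 0.15}
LEGACY_RATE = 0.15


def calculate_discount(price, tier=None, promo=None, is_legacy=False):
    """Return the price after applying the single highest applicable discount."""
    if is_legacy:
        # Legacy accounts use a fixed rate regardless of tier
        rate = LEGACY_RATE
    else:
        # Non-stacking rule: take the maximum rate, never the sum
        rate = max(TIER_RATES.get(tier, 0.0), PROMO_RATES.get(promo, 0.0))
    return round(price * (1 - rate), 2)
```

The max() call is the non-stacking rule in code form. An implementation that summed the two rates would satisfy the AI's original assertion and violate the business rule.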
Step 3: Add Tests Based on Historical Context
The AI knows what the code looks like. It does not know what bugs have been fixed, what incidents have occurred, or what edge cases production traffic has surfaced. That knowledge lives in your Git history, issue tracker, and incident reports.
For each module you are testing, run:
# Find all commits that touched this file and their messages
git log --oneline --follow path/to/legacy_module.py
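# Show the full diff for commits whose message mentions a fix, bug, incident, or hotfix
# (--format=%H prints only the hash, so each line is a valid argument to git show)
git log --format=%H -i --grep="fix\|bug\|incident\|hotfix" -- path/to/legacy_module.py | \
    xargs -I {} git show {} -- path/to/legacy_module.py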
Each bug fix or incident commit is a test case that the AI would not generate on its own. Convert each one into a regression test and add it to the suite manually. These tests are your highest-value coverage because they test the exact conditions that broke in production.
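One way to keep track of this conversion is to turn the commit list into a checklist of regression test names. The sketch below assumes input lines in the "hash subject" shape that git log --oneline prints. The helper name and the sample commits are hypothetical.

```python
import re

# Deliberately loose: matches "fix", "fixed", "bugfix", "hotfix", and so on
FIX_PATTERN = re.compile(r"(fix|bug|incident)", re.IGNORECASE)


def regression_test_names(log_lines):
    """Map 'hash subject' lines to (hash, suggested test name) pairs."""
    names = []
    for line in log_lines:
        commit, _, subject = line.strip().partition(" ")
        if not subject or not FIX_PATTERN.search(subject):
            continue  # not a fix commit, so no regression test is needed
        slug = re.sub(r"[^a-z0-9]+", "_", subject.lower()).strip("_")
        names.append((commit, f"test_regression_{slug}"))
    return names


# Hypothetical sample input, not taken from a real repository
sample_log = [
    "a1b2c3d Fix rounding for zero-price orders",
    "d4e5f6a Add enterprise tier",
    "0f9e8d7 hotfix: legacy account flag ignored",
]
checklist = regression_test_names(sample_log)
```

The generated names are only placeholders. The test body still has to be written by reading the diff and reproducing the condition that failed.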
Step 4: Handle Untestable Code with Mocking
For legacy functions with implicit dependencies - database connections, filesystem access, external HTTP calls - you need to isolate those dependencies before testing the logic. Python's unittest.mock and pytest fixtures provide the infrastructure for this.
Ask the AI to help identify which dependencies need mocking and generate the mock setup:
# AI-generated mock structure for a database-dependent function
import pytest
from unittest.mock import MagicMock, patch

from discount_module import calculate_discount


@pytest.fixture
def mock_db():
    """Mock database connection for discount module tests."""
    db = MagicMock()
    # Set up the return values the discount function expects
    db.get_customer_tier.return_value = {"tier": "gold", "discount_rate": 0.20}
    db.get_active_promos.return_value = [{"code": "SAVE15", "rate": 0.15}]
    return db


def test_calculate_discount_with_db(mock_db):
    with patch('discount_module.get_db_connection', return_value=mock_db):
        result = calculate_discount(price=100.00, customer_id="CUST001")
    assert result == 80.00
    # Verify the function actually called the database
    mock_db.get_customer_tier.assert_called_once_with("CUST001")
The AI is good at generating this kind of mock scaffolding. Review it carefully to ensure the mock return values match what the real database would actually return - incorrect mock values are a common source of tests that pass but do not actually verify real behavior.
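A plain MagicMock accepts any method name and any arguments, so a test can pass even when the code calls a method the real database layer does not have. One partial safeguard is to build the mock from the real interface with create_autospec. In the sketch below, CustomerRepository is a hypothetical stand-in for that interface, not a class from the original code.

```python
from unittest.mock import create_autospec


class CustomerRepository:
    """Hypothetical interface of the real database layer."""

    def get_customer_tier(self, customer_id):
        raise NotImplementedError

    def get_active_promos(self):
        raise NotImplementedError


def make_mock_db():
    """Build a mock that rejects calls the real interface would not accept."""
    db = create_autospec(CustomerRepository, instance=True)
    db.get_customer_tier.return_value = {"tier": "gold", "discount_rate": 0.20}
    db.get_active_promos.return_value = [{"code": "SAVE15", "rate": 0.15}]
    return db
```

With this mock, accessing a misspelled method raises AttributeError and calling a method with the wrong arguments raises TypeError. It does not check the return values, so those still need to be compared against what the real database returns.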
Step 5: Measure Coverage and Fill Gaps
After generating and reviewing tests, measure coverage against the legacy function to identify untested paths. pytest-cov provides line and branch coverage reporting:
pytest --cov=discount_module --cov-branch --cov-report=html tests/
The HTML report identifies exactly which lines and branches are not covered. Feed the uncovered paths back to the AI:
The following lines of the discount function are not covered by the current tests:
Lines 145-160: the enterprise customer path
Lines 178-185: the refund discount recalculation path
Generate tests that specifically exercise these paths.
Include any setup needed to reach these code paths.
This iterative coverage-gap-fill cycle, alternating between AI generation and human review, converges on comprehensive coverage faster than either approach alone.
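The feedback prompt can be built from a list of uncovered line numbers, for instance the ones shown in the Missing column of pytest-cov's term-missing report. The helpers below are a hypothetical sketch. How the line numbers are collected is left open.

```python
def to_ranges(line_numbers):
    """Collapse line numbers into compact ranges, e.g. [1, 2, 3, 7] -> ['1-3', '7']."""
    ranges = []
    for n in sorted(set(line_numbers)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n          # extend the current run
        else:
            ranges.append([n, n])      # start a new run
    return [f"{a}-{b}" if a != b else str(a) for a, b in ranges]


def coverage_gap_prompt(function_name, missing_lines):
    """Build the follow-up prompt asking the AI to cover specific lines."""
    ranges = ", ".join(to_ranges(missing_lines))
    return (
        f"The following lines of {function_name} are not covered by the "
        f"current tests: {ranges}.\n"
        "Generate tests that specifically exercise these paths.\n"
        "Include any setup needed to reach these code paths."
    )
```

Adding a short description of each range by hand, such as "the enterprise customer path", gives the AI the same kind of business context that improved the first round of tests.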
Integrating with Your CI Pipeline
Once you have a test suite you trust, configure it to run automatically on every commit. GitHub Actions provides straightforward CI configuration. Jest for JavaScript and pytest for Python both integrate cleanly with standard CI pipelines.
The critical step is requiring the test suite to pass before merging any AI-generated refactoring. This enforces that the safety net you built is actually used. A test suite that is optional will eventually be bypassed under time pressure.
ESLint and Prettier add static analysis and formatting checks to the same pipeline, making the full quality gate automatic.
The Relationship to Safe Refactoring
A test suite built before refactoring transforms the risk profile of AI-assisted refactoring. Instead of manually reviewing every line of AI-generated refactored code for behavioral differences, you can run the test suite against the refactored code and trust that any behavioral difference the tests cover is caught.
This does not eliminate the need for code review - the test suite covers the cases you thought to test, not all possible cases. But it changes code review from the primary safety mechanism to a secondary one, which significantly speeds up the refactoring cycle.
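While the old and new implementations exist side by side, one complementary check is to run both over the same grid of inputs and report any mismatch. The two functions below are hypothetical stand-ins for a legacy function and its refactored version. The input grid is also an assumption.

```python
from itertools import product


def legacy_discount(price, tier):
    """Hypothetical legacy version with nested conditionals."""
    if tier == "gold":
        return round(price - price * 0.20, 2)
    else:
        if tier == "standard":
            return round(price - price * 0.10, 2)
        else:
            return round(price, 2)


RATES = {"standard": 0.10, "gold": 0.20}


def refactored_discount(price, tier):
    """Hypothetical refactored version using a lookup table."""
    return round(price - price * RATES.get(tier, 0.0), 2)


def find_behavior_differences(old, new, inputs):
    """Return (args, old_result, new_result) for every input where they differ."""
    differences = []
    for args in inputs:
        expected, actual = old(*args), new(*args)
        if expected != actual:
            differences.append((args, expected, actual))
    return differences


INPUTS = list(product([0.0, 0.01, 19.99, 100.00], [None, "standard", "gold", "unknown"]))
differences = find_behavior_differences(legacy_discount, refactored_discount, INPUTS)
```

An empty result only shows agreement on the inputs in the grid, which is the same limitation the test suite has: it covers the cases you thought to include.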
For the full framework on how test generation fits into the broader AI-assisted legacy modernization process, the guide on using AI coding assistants for legacy code modernization covers the three-phase approach including the context-building phase that makes test generation significantly more accurate.
137Foundry works with engineering teams on legacy modernization, including test infrastructure setup and refactoring strategy. OWASP provides security-focused testing guidance that is particularly relevant for the security-critical paths in legacy systems. SonarSource offers code quality tools including coverage analysis that complement the manual approach described here. Node.js testing documentation covers the JavaScript ecosystem equivalents of the Python patterns described above.
The test suite you build before refactoring is the difference between a modernization project that discovers problems in staging and one that discovers them in production.

