# Testing Strategy Design Note

## Overview
Testing LLM-dependent code presents unique challenges: API calls cost money, responses are non-deterministic, and external services may be unavailable. This document describes the testing strategy for causaliq-knowledge.
## Testing Pyramid

```text
            ┌─────────────────────────┐
            │    Functional Tests     │ ← Cached responses:
            │                         │   real scenarios, reproducible
            └─────────────────────────┘
        ┌─────────────────────────────────┐
        │        Integration Tests        │ ← Real API calls (optional):
        │      (with live LLM APIs)       │   expensive, non-deterministic
        └─────────────────────────────────┘
    ┌─────────────────────────────────────────┐
    │               Unit Tests                │ ← Mocked LLM responses:
    │         (mocked LLM responses)          │   fast, free, deterministic
    └─────────────────────────────────────────┘
```
## Test Categories

### 1. Unit Tests (Always Run in CI)
Unit tests mock all LLM calls, making them:
- Fast: No network latency
- Free: No API costs
- Deterministic: Same result every time
- Isolated: No external dependencies
```python
# tests/unit/graph/test_generator.py
from unittest.mock import MagicMock


def test_generator_creates_edges_from_response():
    """Test that valid LLM JSON is correctly parsed into edges."""
    from causaliq_knowledge.graph import GraphGenerator, GraphGeneratorConfig
    from causaliq_knowledge.graph import ModelSpec, VariableSpec

    # Create a minimal test model spec with two variables
    spec = ModelSpec(
        name="test",
        variables=[
            VariableSpec(id="A", name="smoking"),
            VariableSpec(id="B", name="cancer"),
        ],
    )
    config = GraphGeneratorConfig(
        llm_model="groq/llama-3.1-8b-instant",
        prompt_detail="standard",
    )

    # Mock the internal LLM client so no network call is made
    mock_response = {
        "edges": [
            {"source": "A", "target": "B", "confidence": 0.85}
        ]
    }
    generator = GraphGenerator(config)
    generator._client = MagicMock()
    generator._client.complete_json.return_value = (mock_response, None)

    result = generator.generate(spec)

    assert len(result.edges) == 1
    assert result.edges[0].source == "A"
    assert result.edges[0].target == "B"
```
### 2. Integration Tests (Optional, Manual or CI with Secrets)
Integration tests use real LLM APIs to validate actual behaviour:
- Expensive: May cost money per call (though free tiers are available)
- Non-deterministic: LLM responses vary
- Slow: Network latency
- Validates real integration: Catches API changes
```python
# tests/integration/test_graph_generation_live.py
import os

import pytest

pytestmark = pytest.mark.skipif(
    not os.getenv("GROQ_API_KEY"),
    reason="GROQ_API_KEY not set",
)


@pytest.mark.slow
@pytest.mark.integration
def test_groq_generates_valid_graph():
    """Validate that the real Groq API returns a parseable graph."""
    from causaliq_knowledge.graph import GraphGenerator, GraphGeneratorConfig
    from causaliq_knowledge.graph import NetworkContext

    context = NetworkContext.load("tests/data/simple_context.json")
    config = GraphGeneratorConfig(
        temperature=0.1,
        prompt_detail="standard",
    )
    generator = GraphGenerator(
        model="groq/llama-3.1-8b-instant",
        config=config,
    )
    result = generator.generate_from_context(context)

    # Don't assert specific values - LLM responses may vary.
    # Just validate structure and reasonable bounds.
    assert len(result.edges) >= 0
    for edge in result.edges:
        assert 0.0 <= edge.confidence <= 1.0
```
### 3. Functional Tests with Cached Responses
Functional tests use cached LLM responses for reproducible testing:
- Realistic: Uses actual LLM responses (captured once)
- Deterministic: Same cached response every time
- Free: No API calls after initial capture
- Fast: Disk read instead of network call
#### How It Works
```text
┌──────────────────────────────────────────────────────────────┐
│                   Test Fixture Generation                    │
│                     (run once, manually)                     │
│                                                              │
│  1. Run graph generation against real LLMs                   │
│  2. Cache stores responses in tests/data/functional/cache/   │
│  3. Commit cache files to git                                │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                    Functional Tests (CI)                     │
│                                                              │
│  1. Load cached responses from tests/data/functional/cache/  │
│  2. GraphGenerator configured to use cache-only mode         │
│  3. Tests run with real LLM responses, no API calls          │
└──────────────────────────────────────────────────────────────┘
```
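Concretely, a functional test reads much like a unit test but replays a captured response. The sketch below is illustrative only: the `cache_only` flag is an assumed name for the cache-only mode described above, not a confirmed API, and the assertions depend on whatever response was committed as the fixture.

```python
# tests/functional/test_graph_cached.py (sketch)
import pytest

from causaliq_knowledge.graph import GraphGenerator, GraphGeneratorConfig
from causaliq_knowledge.graph import NetworkContext


@pytest.mark.functional
def test_simple_graph_from_cached_response():
    """Replay a captured Groq response; no network access required."""
    context = NetworkContext.load("tests/data/network_contexts/simple.json")
    config = GraphGeneratorConfig(
        prompt_detail="standard",
        cache_only=True,  # assumed flag: fail rather than call the live API
    )
    generator = GraphGenerator(
        model="groq/llama-3.1-8b-instant",
        config=config,
    )
    result = generator.generate_from_context(context)

    # Because the cached response is deterministic, exact assertions become
    # possible here; structural checks are shown since the fixture content
    # above is illustrative.
    assert len(result.edges) >= 1
    for edge in result.edges:
        assert 0.0 <= edge.confidence <= 1.0
```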
## CI Configuration

### pytest Markers
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests requiring live external APIs",
    "functional: marks functional tests using cached responses",
]
addopts = "-ra -q --strict-markers -m 'not slow and not integration'"
```
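With this `addopts` line, a plain `pytest` run executes only the unit and functional tests; slow and integration tests are deselected by default and must be requested explicitly (a later `-m` on the command line, e.g. `pytest -m integration`, overrides the one in `addopts`).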
### GitHub Actions Strategy
| Test Type | When | API Keys | Cost |
|---|---|---|---|
| Unit | Every push/PR | No | Free |
| Functional | Every push/PR | Uses cache | Free |
| Integration | Main branch only, optional | GitHub Secrets | ~$0.01/run |
```yaml
# .github/workflows/ci.yml (conceptual)
jobs:
  unit-and-functional:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit and functional tests
        run: pytest tests/unit tests/functional
  integration:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only on main
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: pytest tests/integration -m integration
```
## Test Data Management

### Directory Structure
```text
tests/
├── __init__.py
├── unit/
│   ├── __init__.py
│   ├── graph/
│   │   ├── test_generator.py    # Graph generation
│   │   ├── test_models.py       # Network context models
│   │   └── test_view_filter.py  # View filtering
│   └── llm/
│       ├── test_clients.py      # LLM clients
│       └── test_config.py       # Configuration
├── integration/
│   ├── __init__.py
│   └── test_graph_live.py       # Real API calls
├── functional/
│   ├── __init__.py
│   └── test_graph_cached.py     # Using cached responses
└── data/
    ├── network_contexts/        # Test network context files
    │   ├── simple.json
    │   └── cancer.json
    └── functional/
        └── cache/               # Committed to git
            └── groq/
                └── simple_graph.json
```
## Benefits of This Strategy
| Benefit | How Achieved |
|---|---|
| Fast CI | Unit tests are mocked; functional tests read from cache |
| Low cost | Only integration tests (optional) call APIs |
| Reproducible | Cached responses are deterministic |
| Realistic | Functional tests use real LLM responses |
| Stable experiments | Same cache = same results across runs |
| Version controlled | Cache files in git track response changes |
## Future Considerations

### Cache Invalidation for Tests
When updating test fixtures (a helper sketch follows the list):

1. Delete the relevant cache files
2. Run the fixture generation script
3. Review the new responses
4. Commit the updated cache files
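A small script can automate the first two steps, as sketched below. The script name, the cache path, and the assumption that `GraphGenerator` writes its response back into the cache during a live run are all illustrative; adapt them to the actual cache wiring.

```python
# scripts/regenerate_fixtures.py (illustrative sketch, not a committed tool)
from pathlib import Path

from causaliq_knowledge.graph import GraphGenerator, GraphGeneratorConfig
from causaliq_knowledge.graph import NetworkContext

CACHE_DIR = Path("tests/data/functional/cache/groq")  # assumed layout (see above)


def regenerate(context_path: str, cache_file: str) -> None:
    """Delete a stale cache entry and re-capture it from the live API."""
    # Step 1: delete the relevant cache file
    stale = CACHE_DIR / cache_file
    if stale.exists():
        stale.unlink()

    # Step 2: re-run generation against the real LLM; the fresh response
    # is assumed to be written back into the cache directory.
    config = GraphGeneratorConfig(temperature=0.1, prompt_detail="standard")
    generator = GraphGenerator(
        model="groq/llama-3.1-8b-instant",
        config=config,
    )
    context = NetworkContext.load(context_path)
    result = generator.generate_from_context(context)
    print(f"Recaptured {cache_file}: {len(result.edges)} edges")
    # Steps 3 and 4 remain manual: review the new response, then commit it.


if __name__ == "__main__":
    regenerate("tests/data/network_contexts/simple.json", "simple_graph.json")
```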
### Model Version Tracking

Cache files should include the model version so that response changes caused by model updates can be detected.
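For instance, a loader could compare the model recorded in each cache file against the model the test expects and flag any mismatch. The `model` and `response` field names below describe an assumed cache schema, not the project's current format:

```python
# Sketch of model-version checking when loading a cached fixture.
import json
from pathlib import Path


def load_cached_response(path: Path, expected_model: str) -> dict:
    """Return the cached response, warning if it came from another model."""
    entry = json.loads(path.read_text())
    # "model" and "response" are assumed field names in the cache schema.
    cached_model = entry.get("model")
    if cached_model != expected_model:
        print(
            f"warning: {path.name} was captured with {cached_model!r}, "
            f"expected {expected_model!r}; consider regenerating fixtures"
        )
    return entry["response"]
```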