# Testing Strategy Design Note

## Overview

Testing LLM-dependent code presents unique challenges: API calls cost money, responses are non-deterministic, and external services may be unavailable. This document describes the testing strategy for causaliq-knowledge.

## Testing Pyramid

```text
                ┌─────────────────┐
                │   Functional    │  ← Cached responses (v0.3.0+)
                │      Tests      │    Real scenarios, reproducible
                └────────┬────────┘
           ┌─────────────┴─────────────┐
           │     Integration Tests     │  ← Real API calls (optional)
           │   (with live LLM APIs)    │    Expensive, non-deterministic
           └─────────────┬─────────────┘
┌────────────────────────┴────────────────────────┐
│                   Unit Tests                     │  ← Mocked LLM responses
│             (mocked LLM responses)               │    Fast, free, deterministic
└──────────────────────────────────────────────────┘
```

## Test Categories

### 1. Unit Tests (Always Run in CI)

Unit tests mock all LLM calls, making them:
- Fast: No network latency
- Free: No API costs
- Deterministic: Same result every time
- Isolated: No external dependencies

```python
# tests/unit/test_llm_providers.py
from unittest.mock import MagicMock


def test_query_edge_parses_valid_response():
    """Test that valid LLM JSON is correctly parsed."""
    from causaliq_knowledge.llm import LLMKnowledge
    from causaliq_knowledge.llm.groq_client import GroqClient

    # Mock the Groq client's complete_json method
    mock_json = {
        "exists": True,
        "direction": "a_to_b",
        "confidence": 0.85,
        "reasoning": "Smoking causes lung cancer via carcinogens."
    }
    mock_client = MagicMock(spec=GroqClient)
    mock_client.complete_json.return_value = (mock_json, MagicMock())

    knowledge = LLMKnowledge(models=["groq/llama-3.1-8b-instant"])
    knowledge._clients["groq/llama-3.1-8b-instant"] = mock_client

    result = knowledge.query_edge("smoking", "lung_cancer")

    assert result.exists is True
    assert result.direction.value == "a_to_b"
    assert result.confidence == 0.85


def test_query_edge_handles_malformed_json():
    """Test graceful handling of an invalid LLM response."""
    from causaliq_knowledge.llm import LLMKnowledge
    from causaliq_knowledge.llm.groq_client import GroqClient

    # Mock a failed parse: complete_json returns None for the payload
    mock_client = MagicMock(spec=GroqClient)
    mock_client.complete_json.return_value = (None, MagicMock())

    knowledge = LLMKnowledge(models=["groq/llama-3.1-8b-instant"])
    knowledge._clients["groq/llama-3.1-8b-instant"] = mock_client

    result = knowledge.query_edge("A", "B")

    assert result.exists is None  # Uncertain
    assert result.confidence == 0.0
```

### 2. Integration Tests (Optional, Manual or CI with Secrets)

Integration tests use real LLM APIs to validate actual behavior:
- Expensive: May cost money per call (though free tiers available)
- Non-deterministic: LLM responses vary
- Slow: Network latency
- Validates real integration: Catches API changes

```python
# tests/integration/test_llm_live.py
import os

import pytest

pytestmark = pytest.mark.skipif(
    not os.getenv("GROQ_API_KEY"),
    reason="GROQ_API_KEY not set"
)


@pytest.mark.slow
@pytest.mark.integration
def test_groq_returns_valid_response():
    """Validate real Groq API returns parseable response."""
    from causaliq_knowledge.llm import LLMKnowledge

    knowledge = LLMKnowledge(models=["groq/llama-3.1-8b-instant"])
    result = knowledge.query_edge("smoking", "lung_cancer")

    # Don't assert specific values - LLM output may vary.
    # Just validate structure and reasonable bounds.
    assert result.exists in [True, False, None]
    assert 0.0 <= result.confidence <= 1.0
    assert len(result.reasoning) > 0
```

### 3. Functional Tests with Cached Responses (v0.3.0+)

Once response caching is implemented, we can create reproducible functional tests using cached LLM responses. This is the best of both worlds:
- Realistic: Uses actual LLM responses (captured once)
- Deterministic: Same cached response every time
- Free: No API calls after initial capture
- Fast: Disk read instead of network call

#### How It Works

```text
┌─────────────────────────────────────────────────────────────────┐
│                     Test Fixture Generation                     │
│                      (run once, manually)                       │
│                                                                 │
│  1. Run queries against real LLMs                               │
│  2. Cache stores responses in tests/data/functional/cache/      │
│  3. Commit cache files to git                                   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Functional Tests (CI)                      │
│                                                                 │
│  1. Load cached responses from tests/data/functional/cache/     │
│  2. LLMKnowledge configured to use cache-only mode              │
│  3. Tests run with real LLM responses, no API calls             │
└─────────────────────────────────────────────────────────────────┘
```

#### Example Functional Test

```python
# tests/functional/test_edge_queries.py
from pathlib import Path

import pytest

from causaliq_knowledge.llm import LLMKnowledge

# Mark everything in this module with the "functional" marker from pyproject.toml
pytestmark = pytest.mark.functional

# Cached responses committed to git under tests/data/functional/cache/
CACHE_DIR = Path(__file__).parent.parent / "data" / "functional" / "cache"


@pytest.fixture
def cached_knowledge():
    """LLMKnowledge using only cached responses."""
    return LLMKnowledge(
        models=["groq/llama-3.1-8b-instant"],
        cache_dir=str(CACHE_DIR),
        cache_only=True  # Fail on a cache miss, don't call the API
    )


def test_smoking_cancer_relationship(cached_knowledge):
    """Test with cached response for the smoking -> lung_cancer query."""
    result = cached_knowledge.query_edge("smoking", "lung_cancer")

    # Specific values can be asserted because the response is cached
    assert result.exists is True
    assert result.direction.value == "a_to_b"
    assert result.confidence > 0.8


def test_consensus_across_models():
    """Test multi-model consensus with cached responses."""
    knowledge = LLMKnowledge(
        models=["groq/llama-3.1-8b-instant", "gemini/gemini-2.5-flash"],
        cache_dir=str(CACHE_DIR),
        cache_only=True
    )
    result = knowledge.query_edge("exercise", "heart_health")
    assert result.exists is True
```

#### Generating Test Fixtures

```python
# scripts/generate_test_fixtures.py
"""
Run this script manually to generate/update cached responses for
functional tests. Requires API keys for all models being tested.
"""
from pathlib import Path

from causaliq_knowledge.llm import LLMKnowledge

CACHE_DIR = Path("tests/data/functional/cache")

TEST_EDGES = [
    ("smoking", "lung_cancer"),
    ("exercise", "heart_health"),
    ("education", "income"),
    ("rain", "wet_ground"),
]


def generate_fixtures():
    knowledge = LLMKnowledge(
        models=["groq/llama-3.1-8b-instant", "gemini/gemini-2.5-flash"],
        cache_dir=str(CACHE_DIR)
    )
    for node_a, node_b in TEST_EDGES:
        print(f"Caching: {node_a} -> {node_b}")
        knowledge.query_edge(node_a, node_b)
    print(f"Fixtures saved to {CACHE_DIR}")


if __name__ == "__main__":
    generate_fixtures()
```

## CI Configuration

### pytest Markers

```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests requiring live external APIs",
    "functional: marks functional tests using cached responses",
]
addopts = "-ra -q --strict-markers -m 'not slow and not integration'"
```
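
Because `addopts` deselects `slow` and `integration` tests, a plain `pytest` run executes only the unit and functional suites; the integration suite is selected explicitly with `pytest -m integration` (as the CI job below does), since a `-m` expression given on the command line takes precedence over the one in `addopts`.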

### GitHub Actions Strategy

| Test Type | When | API Keys | Cost |
|---|---|---|---|
| Unit | Every push/PR | ❌ Not needed | Free |
| Functional | Every push/PR | ❌ Uses cache | Free |
| Integration | Main branch only, optional | ✅ GitHub Secrets | ~$0.01/run |

```yaml
# .github/workflows/ci.yml (conceptual addition)
# Python setup and dependency installation steps omitted for brevity.
jobs:
  unit-and-functional:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit and functional tests
        run: pytest tests/unit tests/functional

  integration:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only on main
    steps:
      - uses: actions/checkout@v4
      - name: Run integration tests
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: pytest tests/integration -m integration
```
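
The integration tests also guard themselves with the `skipif` on `GROQ_API_KEY` shown earlier, so a branch or fork where the secret is not configured gets a skipped (not failing) integration job.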

## Test Data Management

### Directory Structure

```text
tests/
├── __init__.py
├── unit/
│   ├── __init__.py
│   ├── test_models.py            # EdgeKnowledge, etc.
│   ├── test_prompts.py           # Prompt formatting
│   └── test_llm_providers.py     # Mocked LLM calls
├── integration/
│   ├── __init__.py
│   └── test_llm_live.py          # Real API calls
├── functional/
│   ├── __init__.py
│   └── test_edge_queries.py      # Using cached responses
└── data/
    └── functional/
        └── cache/                # Committed to git
            ├── groq/
            │   ├── smoking_lung_cancer.json
            │   └── exercise_heart_health.json
            └── gemini/
                └── ...
```

### Cache File Format

```json
{
  "query": {
    "node_a": "smoking",
    "node_b": "lung_cancer",
    "context": {"domain": "epidemiology"}
  },
  "model": "groq/llama-3.1-8b-instant",
  "timestamp": "2026-01-05T10:30:00Z",
  "response": {
    "exists": true,
    "direction": "a_to_b",
    "confidence": 0.92,
    "reasoning": "Smoking is an established cause of lung cancer..."
  }
}
```
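
When fixtures are regenerated, it can be worth sanity-checking them before committing. The sketch below is illustrative only: the `validate_cache_entry` helper is not part of the library, and it assumes the file layout and field names shown above.

```python
# Illustrative fixture sanity check -- not part of the library API.
# Assumes the cache layout and entry fields shown above.
import json
from pathlib import Path


def validate_cache_entry(path: Path) -> None:
    """Check that a cached response file has the expected structure."""
    entry = json.loads(path.read_text())
    assert {"query", "model", "timestamp", "response"} <= entry.keys()
    response = entry["response"]
    assert response["exists"] in (True, False, None)
    assert isinstance(response["direction"], (str, type(None)))
    assert 0.0 <= response["confidence"] <= 1.0
    assert isinstance(response["reasoning"], str)


if __name__ == "__main__":
    for cache_file in Path("tests/data/functional/cache").rglob("*.json"):
        validate_cache_entry(cache_file)
        print(f"OK: {cache_file}")
```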

## Benefits of This Strategy

| Benefit | How Achieved |
|---|---|
| Fast CI | Unit tests are mocked, functional tests use the cache |
| Low cost | Only integration tests (optional) call APIs |
| Reproducible | Cached responses are deterministic |
| Realistic | Functional tests use real LLM responses |
| Stable experiments | Same cache = same results across runs |
| Version controlled | Cache files in git track response changes |

## Future Considerations

### Cache Invalidation for Tests

When updating test fixtures:

1. Delete the relevant cache files
2. Run the fixture generation script
3. Review the new responses
4. Commit the updated cache files

### Model Version Tracking

Cache files should include the model version so that response changes caused by model updates can be detected; a possible check is sketched below.
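
This is illustrative only: a `model_version` field does not exist in the current cache format, and the expected-version mapping is made up for the example.

```python
# Illustrative only: assumes a future "model_version" field in cache entries.
import json
import warnings
from pathlib import Path

# Hypothetical mapping of model id -> version the fixtures were generated with.
EXPECTED_VERSIONS = {"groq/llama-3.1-8b-instant": "2025-01-snapshot"}


def check_model_versions(cache_root: Path) -> None:
    """Warn when a cached response came from an unexpected model version."""
    for cache_file in cache_root.rglob("*.json"):
        entry = json.loads(cache_file.read_text())
        expected = EXPECTED_VERSIONS.get(entry["model"])
        cached = entry.get("model_version")  # assumed new field
        if expected and cached and cached != expected:
            warnings.warn(
                f"{cache_file}: generated with {cached}, expected {expected}; "
                "consider regenerating this fixture."
            )
```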

### Semantic Similarity Testing

For v0.4.0+, consider testing that semantically similar queries hit the cache (e.g., "smoking" vs "tobacco use" → "cancer" vs "lung cancer"); a speculative sketch follows.
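
The sketch below is speculative: it reuses the `cached_knowledge` fixture and `cache_only=True` semantics from above, and assumes the proposed semantic cache resolves paraphrased node names to the same cached entry.

```python
# Speculative sketch for v0.4.0+ semantic cache matching -- not implemented yet.
import pytest

SYNONYM_QUERIES = [
    (("smoking", "lung_cancer"), ("tobacco use", "lung cancer")),
]


@pytest.mark.functional
@pytest.mark.parametrize("canonical,paraphrase", SYNONYM_QUERIES)
def test_synonym_queries_share_cached_answer(cached_knowledge, canonical, paraphrase):
    """Paraphrased queries should resolve to the same cached response."""
    result_a = cached_knowledge.query_edge(*canonical)
    result_b = cached_knowledge.query_edge(*paraphrase)
    # With cache_only=True, a semantic cache miss for the paraphrase would
    # fail the test before reaching these assertions.
    assert result_a.exists == result_b.exists
    assert result_a.direction == result_b.direction
```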