# LLM Integration Design Note

## Overview
This document describes how causaliq-knowledge integrates with Large Language Models (LLMs) to provide knowledge about causal relationships. The primary use case for v0.1.0 is answering queries about edge existence and edge orientation to support graph averaging in causaliq-analysis.
## How it works

### Query Flow
- Consumer requests knowledge about a potential edge (e.g., "Does smoking cause cancer?")
- KnowledgeProvider receives the query with optional context
- LLM client formats the query using structured prompts
- One or more LLMs are queried (configurable)
- Responses are parsed into structured `EdgeKnowledge` objects
- Multi-LLM consensus combines responses (if multiple models are used)
- Result returned with confidence score and reasoning (see the usage sketch below)
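A minimal usage sketch of this flow. The import path is an assumption, and the context keys mirror the prompt placeholders described later in this document; `LLMKnowledge` and `EdgeKnowledge` are defined in the sections below:

```python
from causaliq_knowledge import LLMKnowledge  # hypothetical import path

provider = LLMKnowledge(models=["groq/llama-3.1-8b-instant"])

result = provider.query_edge(
    "smoking",
    "lung_cancer",
    context={
        "domain": "epidemiology",                 # assumed context key
        "variable_descriptions": {                # assumed context key
            "smoking": "Cigarettes smoked per day",
            "lung_cancer": "Lung cancer diagnosis (yes/no)",
        },
    },
)

print(result.exists, result.direction, result.confidence)
print(result.reasoning)
```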
### Core Interface

```python
from abc import ABC, abstractmethod

from pydantic import BaseModel


class EdgeKnowledge(BaseModel):
    """Structured knowledge about a potential causal edge."""

    exists: bool | None        # True, False, or None (uncertain)
    direction: str | None      # "a_to_b", "b_to_a", "undirected", None
    confidence: float          # 0.0 to 1.0
    reasoning: str             # Human-readable explanation
    model: str | None = None   # Which LLM provided this (for logging)


class KnowledgeProvider(ABC):
    """Abstract interface for all knowledge sources."""

    @abstractmethod
    def query_edge(
        self,
        node_a: str,
        node_b: str,
        context: dict | None = None
    ) -> EdgeKnowledge:
        """
        Query whether a causal edge exists between two nodes.

        Args:
            node_a: Name of first variable
            node_b: Name of second variable
            context: Optional context (domain, variable descriptions, etc.)

        Returns:
            EdgeKnowledge with existence, direction, confidence, reasoning
        """
        pass
```
### LLM Implementation

```python
class LLMKnowledge(KnowledgeProvider):
    """LLM-based knowledge provider using vendor-specific API clients."""

    def __init__(
        self,
        models: list[str] = ["groq/llama-3.1-8b-instant"],
        consensus_strategy: str = "weighted_vote",
        temperature: float = 0.1,
        max_tokens: int = 500,
    ):
        """
        Initialize LLM knowledge provider.

        Args:
            models: List of model identifiers with provider prefix,
                e.g., ["groq/llama-3.1-8b-instant", "gemini/gemini-2.5-flash"]
            consensus_strategy: How to combine multi-model responses:
                "weighted_vote" or "highest_confidence"
            temperature: LLM temperature (0.0-1.0)
            max_tokens: Maximum tokens in response
        """
        ...
```
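A brief construction example for the multi-model case (the model identifiers come from the provider table below; the consensus behaviour is described in the Multi-LLM Consensus section):

```python
# Two-model setup: responses from both models are combined using the
# weighted-vote consensus strategy described later in this document.
knowledge = LLMKnowledge(
    models=["groq/llama-3.1-8b-instant", "gemini/gemini-2.5-flash"],
    consensus_strategy="weighted_vote",
    temperature=0.1,
    max_tokens=500,
)

edge = knowledge.query_edge("exercise", "blood_pressure")
print(edge.exists, edge.direction, edge.confidence)
```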
## LLM Provider Configuration

### Architectural Decision: Vendor-Specific APIs
We use direct vendor-specific API clients rather than wrapper libraries like LiteLLM or LangChain. Each provider has a dedicated client class that uses httpx for HTTP communication.
Benefits of this approach:
- Reliability: No wrapper bugs or version conflicts
- Minimal dependencies: Only httpx required for HTTP
- Full control: Direct access to vendor-specific features
- Better debugging: Clear stack traces without abstraction layers
- Predictable behavior: No surprises from wrapper library updates
### Supported Providers

| Provider | Client Class | Model Examples | API Key Variable |
|---|---|---|---|
| Groq | `GroqClient` | `groq/llama-3.1-8b-instant` | `GROQ_API_KEY` |
| Google Gemini | `GeminiClient` | `gemini/gemini-2.5-flash` | `GEMINI_API_KEY` |
| OpenAI | `OpenAIClient` | `openai/gpt-4o-mini` | `OPENAI_API_KEY` |
| Anthropic | `AnthropicClient` | `anthropic/claude-sonnet-4-20250514` | `ANTHROPIC_API_KEY` |
| DeepSeek | `DeepSeekClient` | `deepseek/deepseek-chat` | `DEEPSEEK_API_KEY` |
| Mistral | `MistralClient` | `mistral/mistral-small-latest` | `MISTRAL_API_KEY` |
| Ollama | `OllamaClient` | `ollama/llama3` | N/A (local) |
Additional providers can be added by implementing new client classes following the same pattern.
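The individual client classes are not specified in this note. The sketch below shows one plausible shape for such a client, using httpx against Groq's OpenAI-compatible chat-completions endpoint; the class layout, the method name `complete`, and the error handling are illustrative assumptions rather than the actual implementation:

```python
import os

import httpx


class GroqClient:
    """Illustrative sketch of a vendor client (not the actual implementation)."""

    BASE_URL = "https://api.groq.com/openai/v1/chat/completions"  # OpenAI-compatible

    def __init__(self, timeout: float = 30.0):
        self._api_key = os.environ["GROQ_API_KEY"]
        self._http = httpx.Client(timeout=timeout)

    def complete(self, model: str, messages: list[dict],
                 temperature: float = 0.1, max_tokens: int = 500) -> str:
        """Send a chat completion request and return the raw text reply."""
        response = self._http.post(
            self.BASE_URL,
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={
                "model": model,              # e.g. "llama-3.1-8b-instant"
                "messages": messages,        # [{"role": "system", ...}, ...]
                "temperature": temperature,
                "max_tokens": max_tokens,
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
```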
### Cost Considerations

For edge queries (~500 tokens each):

| Provider | Model | Cost per 1000 queries | Quality | Speed |
|---|---|---|---|---|
| Groq | llama-3.1-8b-instant | Free tier | Good | Very fast |
| Google Gemini | gemini-2.5-flash | Free tier | Good | Fast |
| Ollama | llama3 | Free (local) | Good | Depends on hardware |
| DeepSeek | deepseek-chat | ~$0.07 | Excellent | Fast |
| Mistral | mistral-small-latest | ~$0.50 | Good | Fast |
| OpenAI | gpt-4o-mini | ~$0.15 | Excellent | Fast |
| Anthropic | claude-sonnet-4-20250514 | ~$1.50 | Excellent | Fast |
Recommendation: Use Groq free tier for development and testing. Ollama is great for local development. Both Groq and Gemini offer generous free tiers suitable for most research use cases.
## Prompt Design

### Edge Existence Query

```text
System: You are an expert in causal reasoning and domain knowledge.
Your task is to assess whether a causal relationship exists between two variables.
Respond in JSON format with: exists (true/false/null), direction (a_to_b/b_to_a/undirected/null), confidence (0-1), reasoning (string).

User: In the domain of {domain}, does a causal relationship exist between "{node_a}" and "{node_b}"?

Consider:
- Direct causation (A causes B)
- Reverse causation (B causes A)
- Bidirectional/feedback relationships
- No causal relationship (correlation only or independence)

Variable context:
{variable_descriptions}
```
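One way the template above might be rendered in code; the helper name `build_messages` and the exact context keys are assumptions based on the placeholders in the template:

```python
SYSTEM_PROMPT = (
    "You are an expert in causal reasoning and domain knowledge. "
    "Your task is to assess whether a causal relationship exists between two variables. "
    "Respond in JSON format with: exists (true/false/null), "
    "direction (a_to_b/b_to_a/undirected/null), confidence (0-1), reasoning (string)."
)

USER_TEMPLATE = """In the domain of {domain}, does a causal relationship exist between "{node_a}" and "{node_b}"?

Consider:
- Direct causation (A causes B)
- Reverse causation (B causes A)
- Bidirectional/feedback relationships
- No causal relationship (correlation only or independence)

Variable context:
{variable_descriptions}"""


def build_messages(node_a: str, node_b: str, context: dict | None = None) -> list[dict]:
    """Render the system/user prompt pair for one edge query (illustrative helper)."""
    context = context or {}
    user = USER_TEMPLATE.format(
        domain=context.get("domain", "general"),
        node_a=node_a,
        node_b=node_b,
        variable_descriptions=context.get("variable_descriptions", "None provided"),
    )
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user}]
```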
### Response Format

```json
{
  "exists": true,
  "direction": "a_to_b",
  "confidence": 0.85,
  "reasoning": "Smoking is an established cause of lung cancer through well-documented biological mechanisms including DNA damage from carcinogens in tobacco smoke."
}
```
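A sketch of how such a reply could be parsed and validated into `EdgeKnowledge`. The defaulting behaviour mirrors the error-handling section below, but the helper itself is illustrative:

```python
import json
import logging

from pydantic import ValidationError

logger = logging.getLogger(__name__)


def parse_response(raw: str, model: str) -> EdgeKnowledge:
    """Parse a raw LLM reply into EdgeKnowledge, defaulting to 'uncertain' on failure."""
    try:
        payload = json.loads(raw)
        return EdgeKnowledge(model=model, **payload)
    except (json.JSONDecodeError, ValidationError) as exc:
        logger.warning("Could not parse response from %s: %s", model, exc)
        return EdgeKnowledge(
            exists=None,
            direction=None,
            confidence=0.0,
            reasoning="Response could not be parsed",
            model=model,
        )
```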
## Multi-LLM Consensus

When multiple models are configured, responses are combined:

### Weighted Vote Strategy (default)

```python
def weighted_vote(responses: list[EdgeKnowledge]) -> EdgeKnowledge:
    """Combine responses weighted by confidence."""
    # For existence: weighted majority vote
    # For direction: weighted majority among those agreeing on existence
    # Final confidence: average confidence of agreeing models
    # Reasoning: concatenate key points from each model
```
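A sketch of one possible implementation of the comments above; tie-breaking and the exact way reasoning strings are concatenated are assumptions, and at least one response is assumed to be present:

```python
from collections import defaultdict


def weighted_vote(responses: list[EdgeKnowledge]) -> EdgeKnowledge:
    """Combine responses weighted by confidence (illustrative sketch)."""
    # Weighted majority vote on existence (True / False / None).
    existence_votes: dict = defaultdict(float)
    for r in responses:
        existence_votes[r.exists] += r.confidence
    exists = max(existence_votes, key=existence_votes.get)

    # Weighted majority on direction among models agreeing on existence.
    agreeing = [r for r in responses if r.exists == exists]
    direction_votes: dict = defaultdict(float)
    for r in agreeing:
        direction_votes[r.direction] += r.confidence
    direction = max(direction_votes, key=direction_votes.get)

    # Final confidence: average confidence of the agreeing models.
    confidence = sum(r.confidence for r in agreeing) / len(agreeing)

    # Reasoning: concatenate each agreeing model's explanation.
    reasoning = " | ".join(f"[{r.model}] {r.reasoning}" for r in agreeing)

    return EdgeKnowledge(exists=exists, direction=direction,
                         confidence=confidence, reasoning=reasoning)
```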
### Highest Confidence Strategy

```python
def highest_confidence(responses: list[EdgeKnowledge]) -> EdgeKnowledge:
    """Return the response with the highest confidence."""
    return max(responses, key=lambda r: r.confidence)
```
## Integration with Graph Averaging

The primary consumer is `causaliq_analysis.graph.average()`:

```python
import math

# Current output from average()
df = average(traces, sample_size=1000)
# Returns: node_a, node_b, p_a_to_b, p_b_to_a, p_undirected, p_no_edge

# Entropy calculation identifies uncertain edges
def edge_entropy(row):
    probs = [row.p_a_to_b, row.p_b_to_a, row.p_undirected, row.p_no_edge]
    probs = [p for p in probs if p > 0]
    return -sum(p * math.log2(p) for p in probs)

df["entropy"] = df.apply(edge_entropy, axis=1)
uncertain_edges = df[df["entropy"] > 1.5]  # High uncertainty

# Query LLM for uncertain edges
knowledge = LLMKnowledge(models=["groq/llama-3.1-8b-instant"])
for _, row in uncertain_edges.iterrows():
    result = knowledge.query_edge(row.node_a, row.node_b)
    # Combine statistical and LLM probabilities...
```
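The combination step is deliberately left open above. Purely as an illustration, and not something this design prescribes, one simple option is a confidence-weighted blend of the statistical and LLM edge probabilities:

```python
def blend_edge_probability(p_stat: float, llm: EdgeKnowledge,
                           weight: float = 0.3) -> float:
    """Illustrative blend: nudge the statistical p(edge) toward the LLM view,
    scaled by the LLM's confidence. Not part of the specified design."""
    if llm.exists is None:
        return p_stat  # LLM is uncertain: keep the statistical estimate
    p_llm = 1.0 if llm.exists else 0.0
    w = weight * llm.confidence
    return (1 - w) * p_stat + w * p_llm
```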
## Design Rationale

### Why Vendor-Specific APIs (not LiteLLM/LangChain)?
- Minimal dependencies: Only httpx for HTTP, no wrapper libraries
- Reliability: No wrapper bugs or version conflicts to debug
- Full control: Direct access to vendor-specific features and error handling
- Predictable: Behavior doesn't change when wrapper library updates
- Debuggable: Clear stack traces without abstraction layers
- Lightweight: ~5KB of client code vs ~50MB of wrapper dependencies
### Why structured JSON responses?
- Reliable parsing: Avoids regex/heuristic extraction
- Validation: Pydantic ensures response integrity
- Consistency: Same structure regardless of model
### Why multi-model consensus?
- Reduced hallucination: Multiple models catch individual errors
- Confidence calibration: Agreement increases confidence
- Robustness: Not dependent on single provider availability
## Error Handling and Resilience

### API Failures
- Automatic retry with timeout handling
- Fallback to next model in list if primary fails
- Return `EdgeKnowledge(exists=None, confidence=0.0)` if all models fail (see the sketch below)
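A sketch of the fallback behaviour described above. The `query_one` callable stands in for a hypothetical per-model helper that calls a vendor client and parses its reply; neither it nor the specific exception types are defined in this document:

```python
import logging

import httpx

logger = logging.getLogger(__name__)


def query_with_fallback(models: list[str], query_one, node_a: str, node_b: str,
                        context: dict | None = None) -> EdgeKnowledge:
    """Try each configured model in turn; return an 'uncertain' result if all fail."""
    for model in models:
        try:
            # query_one(model, node_a, node_b, context) -> EdgeKnowledge (hypothetical)
            return query_one(model, node_a, node_b, context)
        except (httpx.HTTPError, ValueError) as exc:
            logger.warning("Model %s failed (%s); trying next model", model, exc)
    # All models failed: return the documented uncertain default.
    return EdgeKnowledge(
        exists=None, direction=None, confidence=0.0,
        reasoning="All configured models failed",
    )
```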
### Invalid Responses

- Pydantic validation catches malformed JSON
- Default to `exists=None` if parsing fails
- Log warnings for debugging
### Rate Limiting
- Vendor clients handle rate limit errors gracefully
- Configure timeout per client
## Performance

### Latency
- Single query: 0.5-2s depending on model/provider
- Batch queries: Can parallelize across edges (async)
- Cached queries: <10ms
### Throughput (v0.3.0 with caching)
- First query to new edge: 1-2s
- Cached query: <10ms
- 1000 unique edges: ~20-30 minutes (sequential), ~5 minutes (parallel; see the sketch below)
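The interface above is synchronous, so the parallel figure can be reached either with async vendor clients or simply by fanning queries out over a thread pool. A minimal thread-pool sketch (worker count is illustrative and should respect provider rate limits):

```python
from concurrent.futures import ThreadPoolExecutor


def query_edges_parallel(knowledge: KnowledgeProvider,
                         edges: list[tuple[str, str]],
                         max_workers: int = 8) -> list[EdgeKnowledge]:
    """Query many edges concurrently with a thread pool (illustrative sketch)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(knowledge.query_edge, a, b) for a, b in edges]
        return [f.result() for f in futures]
```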
## Future Extensions

### v0.3.0: Caching
- Disk-based cache keyed by `(node_a, node_b, context_hash)` (a possible key shape is sketched below)
- Semantic similarity cache for similar variable names
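A possible shape for the planned cache key, shown only as a sketch; the actual hashing scheme for v0.3.0 is not yet specified:

```python
import hashlib
import json


def cache_key(node_a: str, node_b: str, context: dict | None = None) -> str:
    """Build a stable key of the form (node_a, node_b, context_hash)."""
    context_hash = hashlib.sha256(
        json.dumps(context or {}, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{node_a}|{node_b}|{context_hash}"
```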
### v0.4.0: Rich Context
- Variable descriptions and roles
- Domain-specific literature retrieval (RAG)
- Conversation history for follow-up queries
### v0.5.0: Algorithm Integration
- Direct integration with structure learning search
- Knowledge-guided constraint generation