Architecture Vision for causaliq-knowledge

CausalIQ Ecosystem

causaliq-knowledge is a component of the overall CausalIQ ecosystem architecture.

This package provides knowledge services to other CausalIQ packages, enabling them to incorporate LLM-derived and human-specified knowledge into causal discovery and inference workflows.

Architectural Principles

Simplicity First

  • Use lightweight libraries over heavy frameworks
  • Start with minimal viable features, extend incrementally
  • Prefer explicit code over framework "magic"
  • Use vendor-specific APIs rather than abstraction wrappers

Cost Efficiency

  • Built-in cost tracking and budget management, critical for independent research (see the sketch after this list)
  • Caching of LLM queries and responses to avoid redundant API calls
  • Support for cheap/free providers (Groq, Gemini free tiers)
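
As an illustration of the cost-tracking principle above, a per-run budget guard might look like the following sketch; the class and field names are illustrative only and are not part of the package's API:

# Hypothetical sketch of per-run budget tracking; names are illustrative,
# not the actual causaliq_knowledge API.
from dataclasses import dataclass

@dataclass
class BudgetTracker:
    budget_usd: float        # hard spending ceiling for one experiment run
    spent_usd: float = 0.0

    def record_call(self, prompt_tokens: int, completion_tokens: int,
                    usd_per_1k_in: float, usd_per_1k_out: float) -> None:
        """Accumulate the cost of one LLM call and stop if the budget is exceeded."""
        self.spent_usd += prompt_tokens / 1000 * usd_per_1k_in
        self.spent_usd += completion_tokens / 1000 * usd_per_1k_out
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(f"LLM budget exceeded: ${self.spent_usd:.4f}")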

Transparency and Reproducibility

  • Cache all LLM interactions for experiment reproducibility
  • Provide reasoning/explanations with all knowledge outputs
  • Log confidence levels to enable uncertainty-aware decisions

Clean Interfaces

  • Abstract KnowledgeProvider interface allows multiple implementations (see the sketch after this list)
  • LLM-based, rule-based, and human-input knowledge sources use same interface
  • Easy integration with causaliq-analysis and causaliq-discovery
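
A minimal sketch of what this shared interface could look like is shown below; the exact method signature is an assumption based on the query_edge call in the data flow diagram, not the published API:

# Illustrative sketch of the abstract interface; signature details are assumed.
from abc import ABC, abstractmethod

class KnowledgeProvider(ABC):
    """Common interface for LLM-based, rule-based and human-input knowledge sources."""

    @abstractmethod
    def query_edge(self, var_a: str, var_b: str) -> "EdgeKnowledge":
        """Return knowledge about a possible causal edge between two variables."""
        ...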

Architecture Components

Core Components (v0.1.0)

causaliq_knowledge/
├── __init__.py              # Package exports
├── cli.py                   # Command-line interface
├── base.py                  # Abstract KnowledgeProvider interface
├── models.py                # Pydantic models (EdgeKnowledge, etc.)
└── llm/
    ├── __init__.py          # LLM module exports
    ├── groq_client.py       # Direct Groq API client
    ├── gemini_client.py     # Direct Google Gemini API client
    ├── prompts.py           # Prompt templates for edge queries
    └── provider.py          # LLMKnowledge implementation

Data Flow

┌─────────────────────────────────────────────────────────────────┐
│                 Consuming Package (e.g., causaliq-analysis)     │
│                                                                 │
│   uncertain_edges = df[df["entropy"] > threshold]               │
│                          │                                      │
│                          ▼                                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │              causaliq-knowledge                          │   │
│   │                                                          │   │
│   │   knowledge.query_edge("smoking", "cancer")              │   │
│   │       │                                                  │   │
│   │       ▼                                                  │   │
│   │   ┌───────────┐    ┌───────────┐    ┌───────────┐       │   │
│   │   │  LLM 1    │    │  LLM 2    │    │  Cache    │       │   │
│   │   │ (GPT-4o)  │    │ (Llama3)  │    │ (disk)    │       │   │
│   │   └───────────┘    └───────────┘    └───────────┘       │   │
│   │       │                 │                │               │   │
│   │       └─────────────────┴────────────────┘               │   │
│   │                         │                                │   │
│   │                         ▼                                │   │
│   │               EdgeKnowledge(                             │   │
│   │                   exists=True,                           │   │
│   │                   direction="a_to_b",                    │   │
│   │                   confidence=0.85,                       │   │
│   │                   reasoning="Established medical..."     │   │
│   │               )                                          │   │
│   └─────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                          ▼                                      │
│   Combine with statistical probabilities                        │
└─────────────────────────────────────────────────────────────────┘
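
The EdgeKnowledge object shown in the diagram suggests a Pydantic model along the following lines; the field types and defaults here are assumptions based on the diagram, not the definitive contents of models.py:

# Sketch of the EdgeKnowledge model; field types and defaults are assumed.
from typing import Optional
from pydantic import BaseModel, Field

class EdgeKnowledge(BaseModel):
    exists: Optional[bool] = None      # whether an edge exists at all
    direction: Optional[str] = None    # e.g. "a_to_b" or "b_to_a"
    confidence: float = Field(0.0, ge=0.0, le=1.0)  # confidence level in [0, 1]
    reasoning: str = ""                # explanation accompanying the answer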

Technology Choices

Vendor-Specific APIs over Wrapper Libraries

We use direct vendor-specific API clients rather than wrapper libraries like LiteLLM or LangChain. This architectural decision provides:

Aspect       | Direct APIs                  | Wrapper Libraries
-------------|------------------------------|-------------------------------
Reliability  | ✅ Full control, predictable  | ❌ Wrapper bugs, version drift
Debugging    | ✅ Clear stack traces         | ❌ Abstraction layers
Dependencies | ✅ Minimal (httpx only)       | ❌ Heavy transitive deps
API Coverage | ✅ Full vendor features       | ❌ Lowest common denominator
Maintenance  | ✅ We control updates         | ❌ Wait for wrapper updates

Why Not LiteLLM?

  • Adds 50+ transitive dependencies
  • Version conflicts with other packages
  • Wrapper bugs mask vendor API issues
  • We only need 2-3 providers, not 100+

Why Not LangChain?

  • Massive dependency footprint (~100MB+)
  • Over-engineered for simple structured queries
  • Rapid breaking changes between versions
  • May be reconsidered for RAG features in v0.4.0 and later

Current Provider Clients

  • GroqClient: Direct Groq API via httpx (free tier, fast inference)
  • GeminiClient: Direct Google Gemini API via httpx (generous free tier)
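
To illustrate the "direct API via httpx" approach, a Groq chat-completions call can be as small as the sketch below; Groq exposes an OpenAI-compatible endpoint, the model name is only an example, and retries/error handling are omitted:

import os
import httpx

# Minimal direct call to Groq's OpenAI-compatible chat completions endpoint.
def ask_groq(prompt: str, model: str = "llama-3.1-8b-instant") -> str:
    response = httpx.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=30.0,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]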

Key Dependencies

  • httpx: HTTP client for API calls
  • pydantic: Structured response validation
  • click: Command-line interface
  • diskcache (v0.3.0): Persistent query caching
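
A sketch of how diskcache-backed caching could avoid redundant API calls once it lands in v0.3.0; the cache location and key scheme are assumptions:

import hashlib
import diskcache

cache = diskcache.Cache(".causaliq_knowledge_cache")  # persistent on-disk cache

def cached_query(model: str, prompt: str, call_llm) -> dict:
    """Return a cached response for (model, prompt) if seen before, else call the LLM."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key in cache:
        return cache[key]        # cache hit: no API call, no cost, reproducible
    response = call_llm(model, prompt)
    cache[key] = response        # persist for reuse and experiment reproducibility
    return response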

Integration Points

With causaliq-analysis

The primary integration point is the average() function, which produces edge probability tables. Future versions will accept a knowledge parameter:

# Future usage (v0.5.0 of causaliq-analysis)
from causaliq_knowledge import LLMKnowledge

knowledge = LLMKnowledge(models=["gpt-4o-mini"])
df = average(traces, sample_size=1000, knowledge=knowledge)

With causaliq-discovery

Structure learning algorithms will use knowledge to guide search in uncertain areas of the graph space.
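
As a purely speculative sketch of that integration (none of these names are a published API), a score-based search step might add a knowledge-derived bonus when evaluating an uncertain edge:

# Speculative sketch: bias a structure-learning score using knowledge confidence.
def edge_prior_bonus(knowledge, var_a: str, var_b: str, weight: float = 1.0) -> float:
    """Return a score bonus for the candidate edge var_a -> var_b."""
    ek = knowledge.query_edge(var_a, var_b)
    if ek.exists and ek.direction == "a_to_b":
        return weight * ek.confidence  # favour graphs containing the supported edge
    return 0.0                         # otherwise leave the score unchanged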

See Also