Ollama Client API Reference¶
Client for a locally running Ollama server, used to run Llama and other open-source models without cloud APIs. It implements the BaseLLMClient interface and uses httpx to communicate with the local Ollama server.
Overview¶
The Ollama client provides:
- Local LLM inference without API keys or internet access
- An implementation of the `BaseLLMClient` abstract interface
- Support for Llama 3.2, Llama 3.1, Mistral, and other models
- JSON response parsing with error handling
- Call counting for usage tracking
- Availability checking via the `is_available()` method
Prerequisites¶
- Install Ollama from ollama.com/download
- Pull a model, e.g. `ollama pull llama3.2:1b`
- Ensure Ollama is running (it usually auto-starts after installation)
Usage¶
```python
from causaliq_knowledge.llm import OllamaClient, OllamaConfig

# Create client with default config (llama3.2:1b on localhost:11434)
client = OllamaClient()

# Or with custom config
config = OllamaConfig(
    model="llama3.1:8b",
    temperature=0.1,
    max_tokens=500,
    timeout=120.0,  # Local inference can be slow
)
client = OllamaClient(config=config)

# Check if Ollama is available
if client.is_available():
    # Make a completion request
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ]
    response = client.completion(messages)
    print(response.content)
else:
    print("Ollama not running or model not installed")
```
Using with LLMKnowledge Provider¶
```python
from causaliq_knowledge.llm import LLMKnowledge

# Use local Ollama for causal queries
provider = LLMKnowledge(models=["ollama/llama3.2:1b"])
result = provider.query_edge("smoking", "lung_cancer")
print(f"Exists: {result.exists}, Confidence: {result.confidence}")

# Mix local and cloud models for consensus
provider = LLMKnowledge(
    models=[
        "ollama/llama3.2:1b",
        "groq/llama-3.1-8b-instant",
    ],
    consensus_strategy="weighted_vote",
)
```
OllamaConfig¶
OllamaConfig dataclass¶

```python
OllamaConfig(
    model: str = "llama3.2:1b",
    temperature: float = 0.1,
    max_tokens: int = 500,
    timeout: float = 120.0,
    api_key: Optional[str] = None,
    base_url: str = "http://localhost:11434",
)
```
Configuration for Ollama API client.
Extends LLMConfig with Ollama-specific defaults.
Attributes:

- `model` (str) – Ollama model identifier (default: llama3.2:1b).
- `temperature` (float) – Sampling temperature (default: 0.1).
- `max_tokens` (int) – Maximum response tokens (default: 500).
- `timeout` (float) – Request timeout in seconds (default: 120.0, since local inference can be slow).
- `api_key` (Optional[str]) – Not used for Ollama (local server).
- `base_url` (str) – Ollama server URL (default: http://localhost:11434).
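If the Ollama server is not on the default localhost address, `base_url` can point the client elsewhere. A minimal sketch, assuming a server reachable on another machine (the host, port, and timeout below are illustrative, not defaults):

```python
from causaliq_knowledge.llm import OllamaClient, OllamaConfig

# Illustrative values: an Ollama server running on another machine on the LAN.
config = OllamaConfig(
    model="llama3.2:1b",
    base_url="http://192.168.1.50:11434",
    timeout=180.0,  # extra headroom for a slower remote host
)
client = OllamaClient(config=config)
```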
OllamaClient¶

```python
OllamaClient(config: Optional[OllamaConfig] = None)
```
Local Ollama API client.
Implements the BaseLLMClient interface for a locally running Ollama server. Uses httpx for HTTP requests to the local Ollama API.
Ollama provides an OpenAI-compatible API for running open-source models like Llama locally without requiring API keys or internet access.
Example:

```python
config = OllamaConfig(model="llama3.2:1b")
client = OllamaClient(config)
msgs = [{"role": "user", "content": "Hello"}]
response = client.completion(msgs)
print(response.content)
```
Parameters:
- `config` (Optional[OllamaConfig], default: None) – Ollama configuration. If None, uses defaults connecting to localhost:11434 with the llama3.2:1b model.
Methods:
- `_build_cache_key` – Build a deterministic cache key for the request.
- `cached_completion` – Make a completion request with caching.
- `complete_json` – Make a completion request and parse response as JSON.
- `completion` – Make a chat completion request to Ollama.
- `is_available` – Check if Ollama server is running and model is available.
- `list_models` – List installed models from Ollama.
- `set_cache` – Configure caching for this client.
Attributes:
- `_total_calls`
- `cache` (Optional['TokenCache']) – Return the configured cache, if any.
- `call_count` (int) – Return the number of API calls made.
- `config`
- `model_name` (str) – Return the model name being used.
- `provider_name` (str) – Return the provider name.
- `use_cache` (bool) – Return whether caching is enabled.
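A quick sketch of reading these properties on a fresh client; the commented values are assumptions about the defaults, not guaranteed outputs:

```python
from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()

print(client.provider_name)  # provider identifier (assumed to be "ollama")
print(client.model_name)     # "llama3.2:1b" with the default config
print(client.call_count)     # assumed 0 before any requests are made
print(client.use_cache)      # assumed False until a cache is configured via set_cache
```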
model_name property¶

Return the model name being used.

Returns:

- str – Model identifier string.
_build_cache_key¶

```python
_build_cache_key(
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> str
```
Build a deterministic cache key for the request.
Creates a SHA-256 hash from the model, messages, temperature, and max_tokens. The hash is truncated to 16 hex characters (64 bits).
Parameters:
- `messages` (List[Dict[str, str]]) – List of message dicts with "role" and "content" keys.
- `temperature` (Optional[float], default: None) – Sampling temperature (defaults to config value).
- `max_tokens` (Optional[int], default: None) – Maximum tokens (defaults to config value).

Returns:

- str – 16-character hex string cache key.
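The exact serialisation is internal to the client, but a key with these properties could be built roughly as in the sketch below (the JSON encoding and field names are assumptions; only the SHA-256 hash and 16-character truncation are documented):

```python
import hashlib
import json
from typing import Dict, List


def example_cache_key(
    model: str,
    messages: List[Dict[str, str]],
    temperature: float,
    max_tokens: int,
) -> str:
    # Serialise everything that affects the response deterministically,
    # then hash and keep the first 16 hex characters (64 bits).
    payload = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```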
cached_completion¶

```python
cached_completion(messages: List[Dict[str, str]], **kwargs: Any) -> LLMResponse
```
Make a completion request with caching.
If caching is enabled and a cached response exists, returns the cached response without making an API call. Otherwise, makes the API call and caches the result.
Parameters:
- `messages` (List[Dict[str, str]]) – List of message dicts with "role" and "content" keys.
- `**kwargs` (Any, default: {}) – Provider-specific options (temperature, max_tokens, etc.).

Returns:

- LLMResponse – LLMResponse with the generated content and metadata.
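For example, repeating an identical request should be served from the cache on the second call. This sketch assumes a cache has already been configured with `set_cache` (see that method for its exact arguments):

```python
from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()
# Assumes caching was enabled beforehand, e.g. client.set_cache(...)

messages = [{"role": "user", "content": "Name one cause of lung cancer."}]

# First call goes to the Ollama server and the response is cached.
first = client.cached_completion(messages)

# An identical request is answered from the cache, with no API call.
second = client.cached_completion(messages)
print(second.content)
```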
complete_json¶

```python
complete_json(
    messages: List[Dict[str, str]], **kwargs: Any
) -> tuple[Optional[Dict[str, Any]], LLMResponse]
```
Make a completion request and parse response as JSON.
Parameters:
- `messages` (List[Dict[str, str]]) – List of message dicts with "role" and "content" keys.
- `**kwargs` (Any, default: {}) – Override config options passed to completion().

Returns:

- tuple[Optional[Dict[str, Any]], LLMResponse] – Tuple of (parsed JSON dict or None, raw LLMResponse).
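Since the parsed value is None whenever the model's reply is not valid JSON, check it before use. A small example (the prompt is illustrative):

```python
from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()
prompt = (
    'Reply with JSON only, e.g. {"answer": "yes"}. '
    "Does smoking cause lung cancer?"
)
messages = [{"role": "user", "content": prompt}]

parsed, raw = client.complete_json(messages)
if parsed is not None:
    print(parsed.get("answer"))
else:
    # Fall back to the raw text when JSON parsing fails.
    print("Unparseable reply:", raw.content)
```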
completion¶

```python
completion(messages: List[Dict[str, str]], **kwargs: Any) -> LLMResponse
```
Make a chat completion request to Ollama.
Parameters:
- `messages` (List[Dict[str, str]]) – List of message dicts with "role" and "content" keys.
- `**kwargs` (Any, default: {}) – Override config options (temperature, max_tokens).

Returns:

- LLMResponse – LLMResponse with the generated content and metadata.

Raises:

- ValueError – If the API request fails or Ollama is not running.
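Per-call keyword arguments override the values in `OllamaConfig`, and failures surface as `ValueError`, so a defensive call might look like this (the override values are illustrative):

```python
from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()
messages = [{"role": "user", "content": "Does rainfall cause soil erosion?"}]

try:
    # Override temperature and max_tokens for this call only.
    response = client.completion(messages, temperature=0.0, max_tokens=200)
    print(response.content)
except ValueError as exc:
    print(f"Ollama request failed: {exc}")
```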
is_available¶

Check if Ollama server is running and model is available.

Returns:

- bool – True if Ollama is running and the configured model exists.
list_models¶
List installed models from Ollama.
Queries the local Ollama server to get installed models. Unlike cloud providers, this returns only models the user has explicitly pulled/installed.
Returns:

- List[str] – List of model identifiers (e.g., ['llama3.2:1b', ...]).

Raises:

- ValueError – If the Ollama server is not running.
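A typical use is verifying that the configured model has actually been pulled before starting a long run; a minimal sketch:

```python
from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()

try:
    installed = client.list_models()  # raises ValueError if the server is down
except ValueError:
    print("Ollama is not running; start the app or run `ollama serve`.")
else:
    if client.model_name not in installed:
        print(f"Model missing; run: ollama pull {client.model_name}")
```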
Supported Models¶
Ollama supports many open-source models. Recommended for causal queries:
| Model | Size | RAM Needed | Quality |
|---|---|---|---|
| `llama3.2:1b` | ~1.3GB | 4GB+ | Good for simple queries |
| `llama3.2` | ~2GB | 6GB+ | Better reasoning |
| `llama3.1:8b` | ~4.7GB | 10GB+ | Best quality |
| `mistral` | ~4GB | 8GB+ | Good alternative |
See Ollama Library for all available models.
Troubleshooting¶
"Could not connect to Ollama"
- Ensure Ollama is installed and running
- Run
ollama servein a terminal, or start the Ollama app - Check that nothing else is using port 11434
"Model not found"
- Run
ollama pull <model-name>to download the model - Run
ollama listto see installed models
Slow responses
- Local inference is CPU/GPU bound
- Use smaller models like
llama3.2:1b - Increase the timeout in
OllamaConfig - Consider using GPU acceleration if available