
Ollama Client API Reference

Client for a locally running Ollama server, used to run Llama and other open-source models on your own machine. This client implements the BaseLLMClient interface and uses httpx to communicate with the local Ollama server.

Overview

The Ollama client provides:

  • Local LLM inference without API keys or internet access
  • An implementation of the BaseLLMClient abstract interface
  • Support for Llama 3.2, Llama 3.1, Mistral, and other models
  • JSON response parsing with error handling
  • Call counting for usage tracking
  • Availability checking via the is_available() method

Prerequisites

  1. Install Ollama from ollama.com/download
  2. Pull a model:
    ollama pull llama3.2:1b    # Small, fast (~1.3GB)
    ollama pull llama3.2       # Medium (~2GB)
    ollama pull llama3.1:8b    # Larger, better quality (~4.7GB)
    
  3. Ensure Ollama is running (it usually auto-starts after installation)
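
To confirm the server is reachable before creating a client, a minimal check might look like the sketch below. It assumes the standard Ollama /api/tags endpoint, which lists installed models:

import httpx

# Query the local Ollama server for its installed models.
# A successful response means the server is up and reachable.
try:
    resp = httpx.get("http://localhost:11434/api/tags", timeout=5.0)
    resp.raise_for_status()
    print("Ollama is running")
except httpx.HTTPError:
    print("Ollama does not appear to be running on localhost:11434")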

Usage

from causaliq_knowledge.llm import OllamaClient, OllamaConfig

# Create client with default config (llama3.2:1b on localhost:11434)
client = OllamaClient()

# Or with custom config
config = OllamaConfig(
    model="llama3.1:8b",
    temperature=0.1,
    max_tokens=500,
    timeout=120.0,  # Local inference can be slow
)
client = OllamaClient(config=config)

# Check if Ollama is available
if client.is_available():
    # Make a completion request
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ]
    response = client.completion(messages)
    print(response.content)
else:
    print("Ollama not running or model not installed")

Using with LLMKnowledge Provider

from causaliq_knowledge.llm import LLMKnowledge

# Use local Ollama for causal queries
provider = LLMKnowledge(models=["ollama/llama3.2:1b"])
result = provider.query_edge("smoking", "lung_cancer")
print(f"Exists: {result.exists}, Confidence: {result.confidence}")

# Mix local and cloud models for consensus
provider = LLMKnowledge(
    models=[
        "ollama/llama3.2:1b",
        "groq/llama-3.1-8b-instant",
    ],
    consensus_strategy="weighted_vote"
)

OllamaConfig

OllamaConfig dataclass

OllamaConfig(
    model: str = "llama3.2:1b",
    temperature: float = 0.1,
    max_tokens: int = 500,
    timeout: float = 120.0,
    api_key: Optional[str] = None,
    base_url: str = "http://localhost:11434",
)

Configuration for Ollama API client.

Extends LLMConfig with Ollama-specific defaults.

Attributes:

  • model (str) –

    Ollama model identifier (default: llama3.2:1b).

  • temperature (float) –

    Sampling temperature (default: 0.1).

  • max_tokens (int) –

    Maximum response tokens (default: 500).

  • timeout (float) –

    Request timeout in seconds (default: 120.0; local inference can be slow).

  • api_key (Optional[str]) –

    Not used for Ollama (local server).

  • base_url (str) –

    Ollama server URL (default: http://localhost:11434).

api_key class-attribute instance-attribute

api_key: Optional[str] = None

base_url class-attribute instance-attribute

base_url: str = 'http://localhost:11434'

max_tokens class-attribute instance-attribute

max_tokens: int = 500

model class-attribute instance-attribute

model: str = 'llama3.2:1b'

temperature class-attribute instance-attribute

temperature: float = 0.1

timeout class-attribute instance-attribute

timeout: float = 120.0
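
Because base_url is configurable, the client is not limited to localhost; it can also point at an Ollama server running on another machine. A minimal sketch (the host address below is hypothetical):

from causaliq_knowledge.llm import OllamaClient, OllamaConfig

# Target an Ollama server exposed on another machine (hypothetical address).
config = OllamaConfig(
    model="mistral",
    base_url="http://192.168.1.50:11434",
)
client = OllamaClient(config=config)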

OllamaClient

OllamaClient

OllamaClient(config: Optional[OllamaConfig] = None)

Local Ollama API client.

Implements the BaseLLMClient interface for a locally running Ollama server. Uses httpx for HTTP requests to the local Ollama API.

Ollama provides an OpenAI-compatible API for running open-source models like Llama locally without requiring API keys or internet access.

Example

config = OllamaConfig(model="llama3.2:1b")
client = OllamaClient(config)
msgs = [{"role": "user", "content": "Hello"}]
response = client.completion(msgs)
print(response.content)

Parameters:

  • config

    (Optional[OllamaConfig], default: None ) –

    Ollama configuration. If None, uses defaults connecting to localhost:11434 with the llama3.2:1b model.

_total_calls instance-attribute

_total_calls = 0

cache property

cache: Optional['TokenCache']

Return the configured cache, if any.

call_count property

call_count: int

Return the number of API calls made.

config instance-attribute

config = config or OllamaConfig()

model_name property

model_name: str

Return the model name being used.

Returns:

  • str

    Model identifier string.

provider_name property

provider_name: str

Return the provider name.

use_cache property

use_cache: bool

Return whether caching is enabled.
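
Taken together, the read-only properties above can be used to inspect a client at runtime; a small sketch (printed values depend on the configuration in use):

from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()

print(client.provider_name)  # provider identifier string
print(client.model_name)     # "llama3.2:1b" with the default config
print(client.use_cache)      # whether caching is currently enabled
print(client.call_count)     # 0 until completion requests are made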

_build_cache_key

_build_cache_key(
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
) -> str

Build a deterministic cache key for the request.

Creates a SHA-256 hash from the model, messages, temperature, and max_tokens. The hash is truncated to 16 hex characters (64 bits).

Parameters:

  • messages
    (List[Dict[str, str]]) –

    List of message dicts with "role" and "content" keys.

  • temperature
    (Optional[float], default: None ) –

    Sampling temperature (defaults to config value).

  • max_tokens
    (Optional[int], default: None ) –

    Maximum tokens (defaults to config value).

Returns:

  • str

    16-character hex string cache key.
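
For intuition, a key with the properties described above could be derived roughly as follows; this is an illustrative sketch, not the client's exact serialization:

import hashlib
import json

def build_cache_key(model, messages, temperature, max_tokens):
    # Serialize the request deterministically, then hash and truncate
    # to 16 hex characters (64 bits).
    payload = json.dumps(
        {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]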

cached_completion

cached_completion(messages: List[Dict[str, str]], **kwargs: Any) -> LLMResponse

Make a completion request with caching.

If caching is enabled and a cached response exists, returns the cached response without making an API call. Otherwise, makes the API call and caches the result.

Parameters:

  • messages
    (List[Dict[str, str]]) –

    List of message dicts with "role" and "content" keys.

  • **kwargs
    (Any, default: {} ) –

    Provider-specific options (temperature, max_tokens, etc.)

Returns:

  • LLMResponse

    LLMResponse with the generated content and metadata.
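
A typical flow attaches a cache with set_cache() and then routes requests through cached_completion(). The sketch below assumes a TokenCache that can be constructed without arguments, and an assumed import path; see its own reference for the actual constructor:

from causaliq_knowledge.llm import OllamaClient
# Import path and constructor for TokenCache are assumptions for this sketch.
from causaliq_knowledge.cache import TokenCache

client = OllamaClient()
client.set_cache(TokenCache())

messages = [{"role": "user", "content": "Does smoking cause lung cancer?"}]

first = client.cached_completion(messages)   # makes the API call and caches the result
second = client.cached_completion(messages)  # identical request served from the cache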

complete_json

complete_json(
    messages: List[Dict[str, str]], **kwargs: Any
) -> tuple[Optional[Dict[str, Any]], LLMResponse]

Make a completion request and parse response as JSON.

Parameters:

  • messages
    (List[Dict[str, str]]) –

    List of message dicts with "role" and "content" keys.

  • **kwargs
    (Any, default: {} ) –

    Override config options passed to completion().

Returns:

  • tuple[Optional[Dict[str, Any]], LLMResponse]

    Tuple of (parsed JSON dict or None, raw LLMResponse).
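
Since the JSON parse can fail, complete_json() returns the parsed dict alongside the raw response, so the caller can fall back to the raw text; a short sketch:

from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()

messages = [
    {"role": "system", "content": "Reply with a JSON object only."},
    {"role": "user", "content": 'Return {"answer": "yes"} or {"answer": "no"}: is 7 prime?'},
]

parsed, response = client.complete_json(messages)
if parsed is not None:
    print(parsed.get("answer"))
else:
    # Parsing failed; the raw model output is still available.
    print("Could not parse JSON:", response.content)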

completion

completion(messages: List[Dict[str, str]], **kwargs: Any) -> LLMResponse

Make a chat completion request to Ollama.

Parameters:

  • messages
    (List[Dict[str, str]]) –

    List of message dicts with "role" and "content" keys.

  • **kwargs
    (Any, default: {} ) –

    Override config options (temperature, max_tokens).

Returns:

  • LLMResponse

    LLMResponse with the generated content and metadata.

Raises:

  • ValueError

    If the API request fails or Ollama is not running.
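
Because completion() raises ValueError when the request fails or the server is unreachable, callers that cannot guarantee a running server may want to wrap the call; a minimal sketch:

from causaliq_knowledge.llm import OllamaClient

client = OllamaClient()
messages = [{"role": "user", "content": "What is 2+2?"}]

try:
    response = client.completion(messages, temperature=0.0)
    print(response.content)
except ValueError as exc:
    print(f"Ollama request failed: {exc}")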

is_available

is_available() -> bool

Check if Ollama server is running and model is available.

Returns:

  • bool

    True if Ollama is running and the configured model exists.

list_models

list_models() -> List[str]

List installed models from Ollama.

Queries the local Ollama server to get installed models. Unlike cloud providers, this returns only models the user has explicitly pulled/installed.

Returns:

  • List[str]

    List of model identifiers (e.g., ['llama3.2:1b', ...]).

Raises:

  • ValueError

    If Ollama server is not running.
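
One way to use list_models() is to check that the configured model has actually been pulled before making requests; a sketch (the exception is only raised if the server itself is unreachable):

from causaliq_knowledge.llm import OllamaClient, OllamaConfig

config = OllamaConfig(model="llama3.2:1b")
client = OllamaClient(config=config)

try:
    installed = client.list_models()
    if config.model not in installed:
        print(f"Run: ollama pull {config.model}")
except ValueError:
    print("Ollama server is not running")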

set_cache

set_cache(cache: Optional['TokenCache'], use_cache: bool = True) -> None

Configure caching for this client.

Parameters:

  • cache
    (Optional['TokenCache']) –

    TokenCache instance for caching, or None to disable.

  • use_cache
    (bool, default: True ) –

    Whether to use the cache (default True).
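
The use_cache flag makes it possible to keep a cache attached while temporarily bypassing it; a small sketch (assuming cache is an existing TokenCache instance):

client.set_cache(cache)                   # enable caching (use_cache defaults to True)
client.set_cache(cache, use_cache=False)  # keep the cache attached but skip lookups
client.set_cache(None)                    # remove the cache entirely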

Supported Models

Ollama supports many open-source models. Recommended for causal queries:

Model         Size     RAM Needed   Quality
llama3.2:1b   ~1.3GB   4GB+         Good for simple queries
llama3.2      ~2GB     6GB+         Better reasoning
llama3.1:8b   ~4.7GB   10GB+        Best quality
mistral       ~4GB     8GB+         Good alternative

See Ollama Library for all available models.

Troubleshooting

"Could not connect to Ollama"

  • Ensure Ollama is installed and running
  • Run ollama serve in a terminal, or start the Ollama app
  • Check that nothing else is using port 11434

"Model not found"

  • Run ollama pull <model-name> to download the model
  • Run ollama list to see installed models

Slow responses

  • Local inference is CPU/GPU bound
  • Use smaller models like llama3.2:1b
  • Increase the timeout in OllamaConfig
  • Consider using GPU acceleration if available
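
A quick way to narrow down which of these problems applies is to combine is_available() and list_models(); a small diagnostic sketch:

from causaliq_knowledge.llm import OllamaClient, OllamaConfig

client = OllamaClient(OllamaConfig(model="llama3.2:1b"))

if client.is_available():
    print("Ollama is running and the model is installed")
else:
    try:
        installed = client.list_models()
        print("Server is running; install the model with: ollama pull llama3.2:1b")
        print("Currently installed:", installed)
    except ValueError:
        print("Could not connect to Ollama; start it with: ollama serve")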