
πŸ’« Local Bench

Validate your inference. Benchmark LLM performance directly on your local hardware to ensure absolute confidence in your agent’s speed and capabilities.

Overview

Local Bench is a benchmarking and profiling tool for local language models running on your Companion Hub. It provides standardized test suites, hardware utilization metrics, and comparative reports that tell you exactly how fast and capable your setup is β€” so you can make informed decisions about which models to run for which tasks.

Before deploying a new model to your Companion Agents or Spellbook workflows, run it through Local Bench to understand its latency profile, throughput, memory footprint, and quality on standard tasks.

Key Features

  • Token throughput (tok/s) β€” measures prompt and generation throughput across different context lengths
  • Time-to-first-token (TTFT) β€” latency for agent responsiveness evaluation
  • Memory footprint β€” RAM and VRAM usage per model and quantization level
  • CPU/GPU utilization β€” real-time hardware metrics during inference
  • Quality benchmarks β€” runs MMLU, HellaSwag, and custom task suites to evaluate accuracy
  • Side-by-side comparison β€” compare multiple models across all metrics in a single report
  • Export reports β€” exportable JSON/CSV for custom analysis
  • Hardware profile β€” automatically detects and documents your Hub’s hardware capabilities

Use Cases

  • β€œWhich quantization of Llama 3.1 8B runs fastest on my hardware?”
  • β€œDoes my new GPU significantly improve TTFT for coding tasks?”
  • β€œCan I run Qwen 2.5 72B acceptably on my Core server?”
  • Before deploying a new model to production Spellbook workflows
  • Document your Hub’s capabilities for sharing or support purposes

Supported Backends

| Backend | Status |
| --- | --- |
| Ollama | βœ… |
| llama.cpp (direct) | βœ… |
| vLLM | βœ… |
| OpenAI-compatible API | βœ… |
| LM Studio | βœ… |

Benchmark Suites

| Suite | Description | Duration |
| --- | --- | --- |
| Quick | Token throughput at 3 context lengths | ~2 min |
| Standard | Throughput + TTFT + memory | ~8 min |
| Full | All metrics + MMLU sample | ~20 min |
| Custom | User-defined prompts and evaluation | Variable |

Setup

Install from Hub

Search for Local Bench in the Hub app store and install.

Open Local Bench

Navigate to http://local-bench.ci.localhost.

Connect to your inference backend

In Settings β†’ Backends, add your inference endpoint. For Ollama: http://ollama.ci.localhost:11434.
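To confirm the endpoint is reachable before adding it, you can query Ollama's model list directly (`/api/tags` is Ollama's standard model-listing endpoint; a quick sanity check, assuming a stock Ollama install):

```python
import json
from urllib.request import urlopen

def installed_models(payload: bytes) -> list[str]:
    """Extract model names from an Ollama /api/tags response."""
    return [m["name"] for m in json.loads(payload).get("models", [])]

# Uncomment to query your Hub's Ollama endpoint:
# resp = urlopen("http://ollama.ci.localhost:11434/api/tags", timeout=5)
# print(installed_models(resp.read()))
```

If the call fails with a connection error, see the Troubleshooting section below.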

Run a quick benchmark

Select a model from your Ollama library, choose Quick suite, and click Run Benchmark.

Usage

Running a Benchmark

  1. Navigate to New Benchmark
  2. Select the model(s) to test
  3. Choose a suite (Quick, Standard, or Full)
  4. Click Start and wait for completion

Results are displayed in the Results tab and saved for future comparison.

Comparing Models

On the Compare page, select two or more previous results to see a side-by-side matrix of all metrics.

Scheduling Automated Benchmarks

Configure recurring benchmarks in Settings β†’ Schedule to track performance regressions after system updates.

CLI

```shell
# Run a quick benchmark on llama3.2:3b
ci-bench run --model llama3.2:3b --suite quick

# Compare two results
ci-bench compare result-001 result-002

# Export results as JSON
ci-bench export result-001 --format json > benchmark.json
```
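An exported JSON report can be fed into your own analysis scripts. A sketch, assuming a hypothetical report shape with a `results` list of per-model entries (the real export schema may differ; check an actual export first):

```python
import json

def fastest_model(report: dict) -> str:
    """Return the model with the highest generation throughput.

    Assumes each entry in report["results"] carries "model" and
    "gen_tok_s" keys -- illustrative names, not a documented schema.
    """
    best = max(report["results"], key=lambda r: r["gen_tok_s"])
    return best["model"]

# Example usage with an exported file:
# with open("benchmark.json") as f:
#     print(fastest_model(json.load(f)))
```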

Understanding Results

| Metric | Good target (typical Hub hardware) |
| --- | --- |
| Generation tok/s (7B Q4) | > 30 tok/s |
| Time to first token (7B) | < 500 ms |
| RAM usage (7B Q4) | < 6 GB |
| MMLU (7B) | > 58% |

Results will vary significantly based on your specific hardware. The Companion Core server is benchmarked and documented in the Reference section.
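If you post-process exported results, the targets in the table above are easy to encode as a pass/fail check. A minimal sketch (metric keys are illustrative, not the export schema):

```python
# Targets from the table above, as (direction, bound) pairs.
TARGETS = {
    "gen_tok_s": ("min", 30.0),   # generation throughput, 7B Q4
    "ttft_ms":   ("max", 500.0),  # time to first token, 7B
    "ram_gb":    ("max", 6.0),    # RAM usage, 7B Q4
    "mmlu_pct":  ("min", 58.0),   # MMLU accuracy, 7B
}

def check_targets(result: dict) -> dict:
    """Flag which measured metrics meet the 7B Q4 targets."""
    out = {}
    for key, (kind, bound) in TARGETS.items():
        if key not in result:
            continue
        val = result[key]
        out[key] = val >= bound if kind == "min" else val <= bound
    return out
```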

Troubleshooting

**Backend connection refused**
Ensure the inference backend is running. For Ollama, check Hub β†’ Apps β†’ Ollama β†’ Status. Verify the endpoint URL in Local Bench settings.

**Benchmark stalls at 0%**
The model may be loading for the first time. Check Ollama logs in the Hub for download/load progress.

**VRAM out of memory errors**
The selected model is too large for your GPU. Try a smaller quantization (Q4_K_M instead of F16) or a smaller model variant.
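A rough rule of thumb for whether a model's weights fit: parameters × bits per weight / 8, ignoring KV cache and runtime overhead (which add roughly 10–30% more in practice). A back-of-the-envelope sketch:

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate in GB: params x bits/8.

    Ignores KV cache and runtime overhead, so treat the result
    as a lower bound on required VRAM.
    """
    return params_billions * bits_per_weight / 8

# A 7B model at F16 needs ~14 GB of weights alone, while Q4
# (~4.5 effective bits/weight including scales) needs ~4 GB --
# hence the suggestion to drop from F16 to Q4_K_M.
```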

**Inconsistent results between runs**
Background processes on your Hub can affect measurements. For reproducible results, pause other inference-heavy apps before benchmarking.
