Local Bench
Validate your inference. Benchmark LLM performance directly on your local hardware to ensure absolute confidence in your agent's speed and capabilities.
Overview
Local Bench is a benchmarking and profiling tool for local language models running on your Companion Hub. It provides standardized test suites, hardware utilization metrics, and comparative reports that tell you exactly how fast and capable your setup is, so you can make informed decisions about which models to run for which tasks.
Before deploying a new model to your Companion Agents or Spellbook workflows, run it through Local Bench to understand its latency profile, throughput, memory footprint, and quality on standard tasks.
Key Features
- Token throughput (tok/s): measures prompt and generation throughput across different context lengths
- Time-to-first-token (TTFT): latency for agent responsiveness evaluation
- Memory footprint: RAM and VRAM usage per model and quantization level
- CPU/GPU utilization: real-time hardware metrics during inference
- Quality benchmarks: runs MMLU, HellaSwag, and custom task suites to evaluate accuracy
- Side-by-side comparison: compare multiple models across all metrics in a single report
- Export reports: exportable JSON/CSV for custom analysis
- Hardware profile: automatically detects and documents your Hub's hardware capabilities
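To make the throughput and TTFT metrics concrete, here is a minimal sketch of how they are derived from a streaming generation call. The `measure_generation` helper and the simulated backend are illustrative only, not Local Bench internals:

```python
import time

def measure_generation(generate, prompt):
    """Time a streaming generation and derive TTFT and tok/s.

    `generate` is any callable returning an iterator of tokens;
    here we pass a simulated backend.
    """
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in generate(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        tokens += 1
    elapsed = time.perf_counter() - start
    return {
        "ttft_s": first_token_at - start,   # time to first token
        "tok_per_s": tokens / elapsed,      # generation throughput
    }

def fake_generate(prompt):
    # Simulated backend: yields 20 tokens with a small fixed delay.
    for _ in range(20):
        time.sleep(0.001)
        yield "tok"

print(measure_generation(fake_generate, "hello"))
```

Real measurements depend on prompt length and context size, which is why Local Bench samples throughput at several context lengths.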
Use Cases
- "Which quantization of Llama 3.1 8B runs fastest on my hardware?"
- "Does my new GPU significantly improve TTFT for coding tasks?"
- "Can I run Qwen 2.5 72B acceptably on my Core server?"
- Before deploying a new model to production Spellbook workflows
- Document your Hub's capabilities for sharing or support purposes
Supported Backends
| Backend | Status |
|---|---|
| Ollama | ✅ |
| llama.cpp (direct) | ✅ |
| vLLM | ✅ |
| OpenAI-compatible API | ✅ |
| LM Studio | ✅ |
Benchmark Suites
| Suite | Description | Duration |
|---|---|---|
| Quick | Token throughput at 3 context lengths | ~2 min |
| Standard | Throughput + TTFT + memory | ~8 min |
| Full | All metrics + MMLU sample | ~20 min |
| Custom | User-defined prompts and evaluation | Variable |
Setup
Install from Hub
Search for Local Bench in the Hub app store and install.
Open Local Bench
Navigate to http://local-bench.ci.localhost.
Connect to your inference backend
In Settings → Backends, add your inference endpoint. For Ollama: http://ollama.ci.localhost:11434.
Run a quick benchmark
Select a model from your Ollama library, choose Quick suite, and click Run Benchmark.
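Before the first run, it can help to confirm the backend endpoint is reachable and see which models it exposes. Ollama's `/api/tags` endpoint returns the installed models as JSON; the sketch below separates the parsing from the network call so the parsing is easy to verify offline (the base URL is the one from the setup step above):

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://ollama.ci.localhost:11434"  # endpoint from the setup step

def list_models(payload: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in json.loads(payload).get("models", [])]

def fetch_models(base_url: str = OLLAMA_URL) -> list[str]:
    # Network call; requires the Ollama app to be running on your Hub.
    with urlopen(f"{base_url}/api/tags") as resp:
        return list_models(resp.read().decode())

# Parsing demo with a sample response body (no network needed):
sample = '{"models": [{"name": "llama3.2:3b"}, {"name": "qwen2.5:7b"}]}'
print(list_models(sample))
```

If `fetch_models()` raises a connection error, see the troubleshooting section below.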
Usage
Running a Benchmark
- Navigate to New Benchmark
- Select the model(s) to test
- Choose a suite (Quick, Standard, or Full)
- Click Start and wait for completion
Results are displayed in the Results tab and saved for future comparison.
Comparing Models
On the Compare page, select two or more previous results to see a side-by-side matrix of all metrics.
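The same side-by-side view can be reproduced offline from exported results. The result schema below is a simplified assumption for illustration, not Local Bench's actual export format:

```python
def comparison_matrix(results):
    """Build a metric-by-model table from result dicts.

    Each result is assumed to look like
    {"model": str, "metrics": {metric_name: value}} -- a simplified
    stand-in for the real export schema.
    """
    metrics = sorted({m for r in results for m in r["metrics"]})
    header = ["metric"] + [r["model"] for r in results]
    rows = [[m] + [r["metrics"].get(m, "-") for r in results] for m in metrics]
    return [header] + rows

results = [
    {"model": "llama3.2:3b", "metrics": {"tok_per_s": 52.1, "ttft_ms": 180}},
    {"model": "qwen2.5:7b", "metrics": {"tok_per_s": 31.4, "ttft_ms": 320}},
]
for row in comparison_matrix(results):
    print("  ".join(str(c) for c in row))
```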
Scheduling Automated Benchmarks
Configure recurring benchmarks in Settings → Schedule to track performance regressions after system updates.
CLI
# Run a quick benchmark on llama3.2:3b
ci-bench run --model llama3.2:3b --suite quick
# Compare two results
ci-bench compare result-001 result-002
# Export results as JSON
ci-bench export result-001 --format json > benchmark.json
Understanding Results
| Metric | Good target (typical Hub hardware) |
|---|---|
| Generation tok/s (7B Q4) | > 30 tok/s |
| Time to first token (7B) | < 500ms |
| RAM usage (7B Q4) | < 6 GB |
| MMLU (7B) | > 58% |
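The targets above can be turned into an automated pass/fail check on an exported report. The metric key names in this sketch are assumptions; adjust them to match the fields in your actual JSON export:

```python
# Targets from the table above; key names are illustrative, not the
# real export schema. ("min", x) means the value must be at least x,
# ("max", x) means it must be at most x.
TARGETS = {
    "gen_tok_per_s": ("min", 30.0),   # 7B Q4 generation throughput
    "ttft_ms":       ("max", 500.0),  # time to first token
    "ram_gb":        ("max", 6.0),    # RAM usage
    "mmlu_pct":      ("min", 58.0),   # MMLU accuracy
}

def check_targets(metrics, targets=TARGETS):
    """Return (metric, value, passed) triples for metrics present."""
    report = []
    for name, (kind, limit) in targets.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not in this export; skip it
        ok = value >= limit if kind == "min" else value <= limit
        report.append((name, value, ok))
    return report

metrics = {"gen_tok_per_s": 34.2, "ttft_ms": 410, "ram_gb": 5.1, "mmlu_pct": 55.0}
for name, value, ok in check_targets(metrics):
    print(f"{name}: {value} {'PASS' if ok else 'FAIL'}")
```

In a scheduled setup, a FAIL after a system update is a useful early warning of a performance regression.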
Results will vary significantly based on your specific hardware. The Companion Core server is benchmarked and documented in the Reference section.
Troubleshooting
Backend connection refused: Ensure the inference backend is running. For Ollama, check Hub → Apps → Ollama → Status. Verify the endpoint URL in Local Bench settings.
Benchmark stalls at 0%: The model may be loading for the first time. Check Ollama logs in the Hub for download/load progress.
VRAM out of memory errors: The selected model is too large for your GPU. Try a smaller quantization (Q4_K_M instead of F16) or a smaller model variant.
Inconsistent results between runs: Background processes on your Hub can affect measurements. For reproducible results, pause other inference-heavy apps before benchmarking.