Open Source · macOS Native

Local inference,
no compromises.

vMLX is an open source inference engine for Mac built on Apple's MLX framework. Prefix caching, paged KV cache, continuous batching, and MCP tool integration: features other local inference engines for Mac don't offer.

Free & Open Source · macOS Native · Apple Silicon Optimized
Screenshot: the vMLX chat interface running mlx-community/Llama-3.2-3B-Instruct-4bit locally, with the inference settings panel open.
Features

Everything you need.
Nothing you don't.

Built from the ground up for Apple Silicon, with advanced inference capabilities that most competing local engines don't offer.

Prefix Cache

Reuse previously computed prefill tokens across conversations. Dramatically reduce time-to-first-token for repeated system prompts and shared context prefixes.
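
For example (the endpoint and model name match the quick-start example further down; the system prompt is just an illustrative stand-in for any long, repeated prefix):

# The same system prompt is sent twice; the second request reuses the
# cached prefill, so time-to-first-token drops sharply.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"system","content":"You are a concise technical assistant."},
    {"role":"user","content":"Explain prefix caching."}]}'
 
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"system","content":"You are a concise technical assistant."},
    {"role":"user","content":"Now explain paged KV caching."}]}'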

Paged KV Cache

Memory-efficient key-value caching with configurable block sizes. Handle longer contexts without running out of unified memory on your Mac.
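
As a back-of-the-envelope sketch (the block size and context length below are hypothetical, not vMLX defaults), paging means a sequence only holds as many fixed-size cache blocks as its tokens actually need:

# Hypothetical numbers: a 4096-token context split into 32-token cache blocks.
# Blocks are allocated on demand, so this sequence occupies
# ceil(4096 / 32) = 128 blocks instead of one contiguous region
# pre-sized for the model's maximum context length.
$ echo $(( (4096 + 31) / 32 ))
128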

Continuous Batching

Serve multiple concurrent requests with intelligent batch scheduling. Maximize throughput with up to 256 concurrent sequences.
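
Firing several requests at once shows the effect; the scheduler interleaves them instead of serving each one to completion. A sketch against the same local endpoint (prompt contents are arbitrary):

# Send eight requests concurrently; the server batches them on the fly.
$ for i in $(seq 1 8); do
    curl -s http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"default","messages":[{"role":"user","content":"Request '"$i"'"}]}' &
  done; wait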

MCP Tool Integration

Native Model Context Protocol support. Connect your models to external tools, APIs, and data sources for agentic workflows.
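
Whether vMLX injects MCP tools automatically or expects them in the request body depends on your configuration; through the OpenAI-compatible endpoint, a tool-calling request would follow the standard tools field. The function schema below is purely hypothetical and stands in for whatever your MCP servers expose:

# Hypothetical tool schema; in practice the available tools come from the
# MCP servers configured in vMLX, not from a hand-written definition.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default",
       "messages":[{"role":"user","content":"List the files in my notes folder."}],
       "tools":[{"type":"function","function":{
         "name":"list_directory",
         "description":"List files in a directory",
         "parameters":{"type":"object",
           "properties":{"path":{"type":"string"}},
           "required":["path"]}}}]}'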

OpenAI-Compatible API

Drop-in replacement for OpenAI's chat completions endpoint. Use your existing tools, scripts, and integrations without changing a line of code.
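
For example, recent official OpenAI SDKs (and many tools built on them) read the base URL from the environment, so pointing them at the local server is usually just two variables; the key is an arbitrary placeholder because nothing leaves your machine:

# OPENAI_BASE_URL is honored by recent official OpenAI SDKs; other clients
# may expect a --base-url flag or a config entry instead.
$ export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
$ export OPENAI_API_KEY="local-placeholder"
# Existing scripts and integrations now talk to vMLX on localhost, unchanged.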

Session Management

Save, restore, and manage conversation sessions. Switch between models and contexts without losing your chat history.

Performance

Built for speed.

Optimized for Apple Silicon's unified memory architecture. Every token, every batch, every cache hit — tuned for maximum throughput.

256
Max concurrent sequences
512
Prefill batch size
20%
Automatic cache memory allocation
0ms
Config overhead — native Swift

Up and running in seconds.

Download the latest release, pick a model from HuggingFace, and start running inference. vMLX handles model management, the server lifecycle, and the API endpoint automatically.

  • Download any MLX-compatible model
  • One-click server start with auto config
  • OpenAI-compatible API on localhost
  • Full chat UI with settings panel
  • No Python, no Docker, no fuss
terminal
# Download and launch vMLX
$ open vMLX.app
 
# Select a model, hit Start — server is running
Server started on http://127.0.0.1:8000
 
# OpenAI-compatible API ready instantly
$ curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"user","content":"Hello!"}]}'
 
# Response comes back immediately
{"choices":[{"message":{"content":"Hi! How can I help?"}}]}
Server Configuration
Screenshot: vMLX advanced server settings showing concurrent processing, prefix cache, paged KV cache, and tool integration options.

Total control over
every parameter.

Fine-tune every aspect of the inference pipeline. From prefill batch sizes to cache memory allocation — vMLX gives you the knobs that other engines hide or simply don't have.

prefill_batch_size · max_concurrent_seq · prefix_caching · paged_kv_cache · block_size · cache_memory_% · continuous_batching · mcp_tools · temperature · top_p
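
Server-side settings such as prefill_batch_size and block_size are configured in the settings panel, while sampling parameters like temperature and top_p ride along on each request. The values below are illustrative, not recommended defaults:

# Illustrative sampling values; "stream": true asks for incremental chunks
# per the OpenAI chat completions spec.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default",
       "messages":[{"role":"user","content":"Summarize MLX in one sentence."}],
       "temperature":0.7,
       "top_p":0.9,
       "stream":true}'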

Ready to run models locally?

vMLX is free, open source, and built for the Mac you already have.
No cloud. No API keys. No rate limits. Just inference.

Download for Mac