Open Source · macOS Native

Local inference,
no compromises.

vMLX is an open source inference engine for Mac built on Apple's MLX framework. Prefix caching, paged KV cache, continuous batching, and MCP tool integration: features other local inference engines for Mac don't offer.

Free & Open Source · macOS Native · Apple Silicon Optimized
Screenshot: the vMLX chat interface running mlx-community/Llama-3.2-3B-Instruct-4bit locally, with the inference settings panel open.
Features

Everything you need.
Nothing you don't.

Built from the ground up for Apple Silicon, with advanced inference capabilities that most competing local engines don't offer.

Prefix Cache

Reuse previously computed prefill tokens across conversations. Dramatically reduce time-to-first-token for repeated system prompts and shared context prefixes.
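
For example (the endpoint and model name match the quick-start example further down; the system prompt is just an illustrative stand-in for any long, repeated prefix):

# The same system prompt is sent twice; the second request reuses the
# cached prefill, so time-to-first-token drops sharply.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"system","content":"You are a concise technical assistant."},
    {"role":"user","content":"Explain prefix caching."}]}'
 
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"system","content":"You are a concise technical assistant."},
    {"role":"user","content":"Now explain paged KV caching."}]}'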

Paged KV Cache

Memory-efficient key-value caching with configurable block sizes. Handle longer contexts without running out of unified memory on your Mac.
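
As a back-of-the-envelope sketch (the block size and context length below are hypothetical, not vMLX defaults), paging means a sequence only holds as many fixed-size cache blocks as its tokens actually need:

# Hypothetical numbers: a 4096-token context split into 32-token cache blocks.
# Blocks are allocated on demand, so this sequence occupies
# ceil(4096 / 32) = 128 blocks instead of one contiguous region
# pre-sized for the model's maximum context length.
$ echo $(( (4096 + 31) / 32 ))
128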

Continuous Batching

Serve multiple concurrent requests with intelligent batch scheduling. Maximize throughput with up to 256 concurrent sequences.
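
Firing several requests at once shows the effect; the scheduler interleaves them instead of serving each one to completion. A sketch against the same local endpoint (prompt contents are arbitrary):

# Send eight requests concurrently; the server batches them on the fly.
$ for i in $(seq 1 8); do
    curl -s http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"default","messages":[{"role":"user","content":"Request '"$i"'"}]}' &
  done; wait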

MCP Tool Integration

Native Model Context Protocol support. Connect your models to external tools, APIs, and data sources for agentic workflows.
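
Whether vMLX injects MCP tools automatically or expects them in the request body depends on your configuration; through the OpenAI-compatible endpoint, a tool-calling request would follow the standard tools field. The function schema below is purely hypothetical and stands in for whatever your MCP servers expose:

# Hypothetical tool schema; in practice the available tools come from the
# MCP servers configured in vMLX, not from a hand-written definition.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default",
       "messages":[{"role":"user","content":"List the files in my notes folder."}],
       "tools":[{"type":"function","function":{
         "name":"list_directory",
         "description":"List files in a directory",
         "parameters":{"type":"object",
           "properties":{"path":{"type":"string"}},
           "required":["path"]}}}]}'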

OpenAI-Compatible API

Drop-in replacement for OpenAI's chat completions endpoint. Use your existing tools, scripts, and integrations without changing a line of code.
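
For example, recent official OpenAI SDKs (and many tools built on them) read the base URL from the environment, so pointing them at the local server is usually just two variables; the key is an arbitrary placeholder because nothing leaves your machine:

# OPENAI_BASE_URL is honored by recent official OpenAI SDKs; other clients
# may expect a --base-url flag or a config entry instead.
$ export OPENAI_BASE_URL="http://127.0.0.1:8000/v1"
$ export OPENAI_API_KEY="local-placeholder"
# Existing scripts and integrations now talk to vMLX on localhost, unchanged.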

Session Management

Save, restore, and manage conversation sessions. Switch between models and contexts without losing your chat history.

Performance

Built for speed.

Optimized for Apple Silicon's unified memory architecture. Every token, every batch, every cache hit — tuned for maximum throughput.

256
Max concurrent sequences
512
Prefill batch size
20%
Automatic cache memory allocation
0ms
Config overhead — native Swift

Up and running in seconds.

Download the latest release, pick a model from HuggingFace, and start running inference. vMLX handles model management, the server lifecycle, and the API endpoint automatically.

  • Download any MLX-compatible model
  • One-click server start with auto config
  • OpenAI-compatible API on localhost
  • Full chat UI with settings panel
  • No Python, no Docker, no fuss
terminal
# Download and launch vMLX
$ open vMLX.app
 
# Select a model, hit Start — server is running
Server started on http://127.0.0.1:8000
 
# OpenAI-compatible API ready instantly
$ curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"user","content":"Hello!"}]}'
 
# Response comes back immediately
{"choices":[{"message":{"content":"Hi! How can I help?"}}]}
Server Configuration
Screenshot: vMLX advanced server settings showing concurrent processing, prefix cache, paged KV cache, and tool integration options.

Total control over
every parameter.

Fine-tune every aspect of the inference pipeline. From prefill batch sizes to cache memory allocation — vMLX gives you the knobs that other engines hide or simply don't have.

prefill_batch_size · max_concurrent_seq · prefix_caching · paged_kv_cache · block_size · cache_memory_% · continuous_batching · mcp_tools · temperature · top_p
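
Server-side settings such as prefill_batch_size and block_size are configured in the settings panel, while sampling parameters like temperature and top_p ride along on each request. The values below are illustrative, not recommended defaults:

# Illustrative sampling values; "stream": true asks for incremental chunks
# per the OpenAI chat completions spec.
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default",
       "messages":[{"role":"user","content":"Summarize MLX in one sentence."}],
       "temperature":0.7,
       "top_p":0.9,
       "stream":true}'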

Ready to run models locally?

vMLX is free, open source, and built for the Mac you already have.
No cloud. No API keys. No rate limits. Just inference.

Download for Mac