Inference Server

- MLX-native inference on Apple Silicon
- OpenAI-compatible /v1/chat/completions endpoint (streaming example after this list)
- Streaming SSE responses
- GGUF model loading (llama.cpp fallback)
- Automatic quantization selection
- Model hot-swap without restart
- Concurrent request handling
- KV cache management
- Context length enforcement
- vLLM backend for parallel batching
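A minimal sketch of a streaming request against the OpenAI-compatible chat endpoint. The base URL, port, and model id are assumptions; any OpenAI-style client works because the wire format is the same.

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local server address
    api_key="not-needed",                 # local servers typically ignore the key
)

stream = client.chat.completions.create(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",  # hypothetical model id
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,  # tokens arrive incrementally as server-sent events (SSE)
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```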

Model Management

- Model inventory API (/admin/models)
- Dynamic load/unload API (request sketch after this list)
- Memory headroom enforcement (never crash on OOM)
- Model health monitoring
- HuggingFace model download integration
- GGUF format support
- Model benchmarking suite
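A sketch of driving the management API over plain HTTP. Only the /admin/models inventory path appears in the list above; the load/unload routes and payload shape here are assumptions for illustration.

```python
import requests

BASE = "http://localhost:8000"  # assumed server address

# List the current model inventory.
print(requests.get(f"{BASE}/admin/models").json())

# Hypothetical load/unload calls; the real routes and body may differ.
requests.post(f"{BASE}/admin/models/load", json={"model": "some-model-id"})
requests.post(f"{BASE}/admin/models/unload", json={"model": "some-model-id"})
```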

API Compatibility

- OpenAI /v1/chat/completions
- OpenAI /v1/models
- Health endpoint /healthz
- Admin events SSE /admin/events
- Rerank endpoint /v1/rerank
- Skills API /v1/skills
- Agents API /v1/agents
- Embeddings endpoint /v1/embeddings (example after this list)
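A sketch exercising two of the routes above: /healthz as a liveness probe and /v1/embeddings through the standard OpenAI client. The port and embedding model id are assumptions.

```python
import requests
from openai import OpenAI

BASE = "http://localhost:8000"  # assumed server address

# Liveness probe: any 2xx response means the server is up.
assert requests.get(f"{BASE}/healthz").ok

client = OpenAI(base_url=f"{BASE}/v1", api_key="not-needed")
resp = client.embeddings.create(
    model="mlx-community/bge-small-en-v1.5",  # hypothetical embedding model id
    input=["Apple Silicon inference", "OpenAI-compatible API"],
)
print(len(resp.data[0].embedding))  # dimensionality of the returned vector
```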

Performance

- ANE (Apple Neural Engine) utilization
- Batch request coalescing
- Throughput metrics (tokens/sec; client-side measurement sketch below)
- Active request tracking
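A rough client-side way to estimate tokens/sec by timing a streamed completion. Each content delta is counted as approximately one token, so the server's own throughput metrics remain authoritative; the address and model id are assumptions.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed address

start, n_tokens = time.perf_counter(), 0
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",  # hypothetical model id
    messages=[{"role": "user", "content": "Write a haiku about memory bandwidth."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        n_tokens += 1  # approximation: one content delta ~= one token
elapsed = time.perf_counter() - start
print(f"~{n_tokens / elapsed:.1f} tokens/sec")
```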

Voice & Audio

- Speech-to-text endpoint POST /v1/audio/transcriptions (mlx-whisper; upload sketch after this list)
- Text-to-speech endpoint POST /v1/audio/speech (mlx-audio + Kokoro)
- Multipart form upload for audio files (WebM, WAV, MP3)
- soundfile I/O for audio processing
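A sketch of both audio routes, mirroring the OpenAI audio API shape: a multipart form upload for transcription and a JSON request for speech synthesis. The port, model ids, voice name, and response fields are assumptions.

```python
import requests

BASE = "http://localhost:8000"  # assumed server address

# Speech-to-text: multipart form upload of a WAV file.
with open("meeting.wav", "rb") as f:
    stt = requests.post(
        f"{BASE}/v1/audio/transcriptions",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"model": "whisper-large-v3"},  # hypothetical model id
    )
print(stt.json()["text"])  # OpenAI-style responses carry a "text" field

# Text-to-speech: request synthesized audio and write the bytes to disk.
tts = requests.post(
    f"{BASE}/v1/audio/speech",
    json={"model": "kokoro", "input": "Hello from Opta LMX", "voice": "af_heart"},  # hypothetical params
)
with open("hello.wav", "wb") as out:
    out.write(tts.content)
```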