1. What machine learning models does MLX support?
The MLX framework supports a variety of popular machine learning and deep learning models, primarily large language models (LLMs) and text generation models such as LLaMA, Mistral, Phi-2, and Qwen; image generation models such as Stable Diffusion; speech recognition models such as OpenAI’s Whisper; and models for other common tasks, including text recognition, machine translation, image classification, and object detection.
Most importantly, MLX provides extensive support for large language models. Some tutorials claim that MLX supports only a limited set of models, but this is a misunderstanding. According to the official documentation, “most models like Mistral, Llama, Phi-2, and Mixtral can be directly loaded and used.” In practical tests, the Gemma 2 model can also be used directly within this framework.
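For example, with the mlx-lm package built on top of MLX, a supported model hosted on Hugging Face can be loaded directly. The snippet below is a minimal sketch, assuming a recent mlx-lm release and using microsoft/phi-2 as an illustrative model:

from mlx_lm import load

# The original Hugging Face weights are downloaded and used as-is; no separate conversion step is required
model, tokenizer = load("microsoft/phi-2")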
2. Does MLX support fine-tuning? What fine-tuning methods are available?
MLX supports fine-tuning neural network models, and Apple users can directly fine-tune large language models locally.
- LoRA (Low-Rank Adaptation): a parameter-efficient fine-tuning technique that adapts a model by adding trainable low-rank matrices alongside the original weights, without directly modifying the original parameters (see the sketch after this list).
- QLoRA (Quantized LoRA): the quantized version of LoRA, which allows LoRA fine-tuning on quantized models and further reduces memory requirements.
- Full-Parameter Fine-tuning: while not a parameter-efficient method, MLX also supports fine-tuning all of a model’s parameters.
- Other Fine-tuning Methods: adapter fine-tuning, prefix fine-tuning, and prompt fine-tuning are also supported.
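To illustrate the idea behind LoRA, here is a minimal sketch of a low-rank adapter layer written with MLX’s neural-network API. The LoRALinear class and the rank and scale values are illustrative assumptions, not MLX’s built-in implementation (mlx-lm ships its own LoRA tooling):

import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    # Hypothetical sketch: add a trainable low-rank update on top of an existing linear layer
    def __init__(self, linear: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        out_dims, in_dims = linear.weight.shape
        self.linear = linear  # the original weights are left untouched
        self.scale = scale
        # Only these two small matrices are trained
        self.lora_a = mx.random.normal((in_dims, rank)) * 0.01
        self.lora_b = mx.zeros((rank, out_dims))

    def __call__(self, x):
        y = self.linear(x)                   # frozen base projection
        z = (x @ self.lora_a) @ self.lora_b  # low-rank correction
        return y + self.scale * z

In practice the base model is frozen so that only the small adapter matrices receive gradients, which is what keeps the memory footprint of fine-tuning low.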
3. Why can MLX run efficiently on Apple Silicon chips? What is the Convert function?
In MLX, the “convert” function may confuse some users. If models can be used directly, what is the purpose of “convert”? According to the official explanation, the MLX format is specifically optimized for Apple Silicon chips, leveraging the performance advantages of the M-series chips to enhance the speed and efficiency of model execution.
MLX uses a unified memory model to share memory between the CPU and GPU, reducing data transfer overhead and improving memory utilization. Models can run locally on Apple devices like Mac, iPad, and iPhone.
MLX employs lazy computation, meaning computations are only executed when necessary, optimizing computational efficiency.
Looking at the source code, MLX defines its own array type and framework, similar to NumPy but built around unified memory, which makes machine learning tasks on Apple Silicon machines more efficient and flexible. With lazy computation, output values are calculated only when they are needed, and the same arrays can run on both the CPU and the GPU. The “convert” function transforms arrays stored in the PyTorch format into MLX arrays so that models can take advantage of these features.
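A small sketch of what lazy computation and unified memory look like in practice (the array sizes are arbitrary):

import mlx.core as mx

a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))
c = a @ b                           # builds a lazy computation graph; nothing is executed yet
mx.eval(c)                          # the matrix multiplication actually runs here
d = mx.matmul(a, b, stream=mx.cpu)  # the same arrays can also be computed on the CPU, with no copies thanks to unified memory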
4. Should models be quantized?
The “convert” function also allows model quantization, which is the process of converting model weights from high-precision representations (typically 32-bit floating point) to lower-precision ones (such as 8-bit integers). The main purpose is to reduce model size, lower computational complexity, and improve inference speed while preserving model accuracy as much as possible.
Quantized models are significantly smaller, and using integer operations instead of floating-point ones can speed up inference and reduce memory usage.
Based on this, you can decide whether a model should be converted. For instance, if you only need to fine-tune the model on an Apple Silicon machine and will eventually deploy it to a Linux server, you can use the original model without conversion. However, if you plan to deploy on an Apple Silicon Mac, conversion is recommended.
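As an example, converting and quantizing a model with mlx-lm’s convert utility might look like the following; the exact arguments are assumptions based on recent mlx-lm releases, again using microsoft/phi-2 as an illustrative model:

from mlx_lm import convert

# Convert the Hugging Face weights to the MLX format and quantize them (4-bit by default)
convert("microsoft/phi-2", mlx_path="mlx_phi-2_4bit", quantize=True)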
5. What is MLX Community on HuggingFace, and what models are available there?
Regarding the MLX Community and its models on HuggingFace, one of MLX’s authors, Awni Hannun, explained:
For the most part there is no real difference between MLX community models and the original Hugging Face model when the precision is fp16, bf16, or fp32. In some cases the model could have a slightly different format but in many cases they are identical.
The main difference between MLX Community models is that we keep the quantized models there (4-bit and 8-bit). The quantization format is quite specific to MLX. But there is no rule that quantized models must live in the MLX community. That’s just a convenient place to put them if the original model creator didn’t make MLX quantized models.
This means you can download already quantized models directly from HuggingFace’s MLX Community and use them without setting up a HuggingFace token, since there are no gated models there. For example:
from mlx_lm import generate, load
model, tokenizer = load("mlx-community/gemma-2-9b-8bit")
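Text can then be generated from the loaded model; the call below is a sketch that assumes the generate signature of recent mlx_lm releases:

# Generate a completion from the quantized model loaded above
text = generate(model, tokenizer, prompt="Why is the sky blue?", max_tokens=100)
print(text)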