To speed up embedding generation, three effective techniques are FP16 precision, model quantization, and conversion to ONNX. Each approach targets a different aspect of computation: memory usage, arithmetic efficiency, or framework-level graph optimization. These methods can be applied individually or combined for greater performance gains, depending on hardware and model compatibility.
First, FP16 (16-bit floating-point) precision halves the memory footprint of model weights and activations compared to standard FP32 (32-bit) precision and reduces computation cost. Modern GPUs such as the NVIDIA A100 or V100 accelerate FP16 operations through Tensor Cores, which perform matrix math faster at reduced precision. In PyTorch, enabling FP16 is straightforward using the autocast context manager or the half() method to cast model weights and inputs to 16-bit. However, FP16 can sometimes cause numerical instability due to reduced precision, especially in models with large dynamic ranges. To mitigate this, mixed precision (keeping sensitive layers in FP32) is often used; PyTorch's native AMP, NVIDIA's Apex, and PyTorch Lightning automate this process, balancing speed and stability.
Second, model quantization converts floating-point weights and activations to lower-bit representations (e.g., INT8), shrinking model size and accelerating inference. Post-training quantization, supported by frameworks such as TensorFlow Lite and PyTorch's dynamic quantization, requires minimal code changes. For instance, applying torch.quantization.quantize_dynamic to a BERT model reduces its size by roughly 4x and can speed up CPU inference by 2-3x. However, aggressive quantization can degrade embedding quality, so calibration with representative data is critical to minimize accuracy loss. Quantization-aware training (QAT) addresses this by simulating lower precision during training, producing more robust quantized models. Tools like TensorRT further optimize quantized models for specific hardware, maximizing throughput.
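To make the quantization step concrete, below is a minimal sketch of post-training dynamic quantization in PyTorch applied to a BERT encoder. It assumes CPU inference and the Hugging Face transformers package:

```python
# Minimal sketch: post-training dynamic quantization with PyTorch.
# Assumes CPU inference and the Hugging Face "transformers" package.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Quantize the Linear layers to INT8; activations are quantized on the fly at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# quantized_model is a drop-in replacement for the original model in CPU inference code.
```

Note that PyTorch dynamic quantization primarily benefits CPU inference; for GPU deployments, INT8 paths in TensorRT or ONNX Runtime are the more common route.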
Third, converting models to ONNX (Open Neural Network Exchange) standardizes them in a portable format, enabling framework-agnostic optimizations. ONNX Runtime applies graph-level optimizations such as operator fusion (combining layers into single operations) and hardware-specific kernel tuning. For example, exporting a PyTorch transformer model with torch.onnx.export and running it via ONNX Runtime can reduce inference latency by 20-30%. ONNX also simplifies deployment, since models exported from PyTorch or TensorFlow can run on servers and edge devices alike. Additionally, ONNX Runtime supports FP16 and quantization, so these techniques can be combined. However, not all model architectures convert flawlessly to ONNX, so testing for compatibility is essential; tools like ONNX Simplifier can help resolve conversion errors by streamlining complex graph structures.
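The sketch below shows the export-then-run flow with torch.onnx.export and ONNX Runtime; it assumes the onnx and onnxruntime packages are installed, and the model name is illustrative:

```python
# Minimal sketch: export a transformer encoder to ONNX and run it with ONNX Runtime.
# Assumes the "transformers", "onnx", and "onnxruntime" packages; names are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer
import onnxruntime as ort

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Export with dynamic batch and sequence dimensions.
dummy = tokenizer("example text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)

# Run the exported graph (CPU provider shown; CUDA/TensorRT providers are optional).
session = ort.InferenceSession("encoder.onnx", providers=["CPUExecutionProvider"])
feeds = tokenizer(["vector search with ONNX"], return_tensors="np")
outputs = session.run(
    None, {"input_ids": feeds["input_ids"], "attention_mask": feeds["attention_mask"]}
)
embedding = outputs[0].mean(axis=1)  # mean-pool token vectors from last_hidden_state
```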
By leveraging these techniques (FP16 for faster computation, quantization for smaller and faster models, and ONNX for cross-platform graph optimizations), developers can significantly accelerate embedding generation while maintaining acceptable accuracy.
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.