In federated learning, data is distributed across multiple devices or servers without being centralized. Instead of pooling data into a single location, each participant (e.g., a smartphone, IoT device, or organization) retains their local dataset. The global model is trained by aggregating updates (like gradients or model weights) from these participants, ensuring raw data never leaves its source. This approach is designed for scenarios where data privacy, regulatory compliance, or bandwidth constraints make centralized training impractical. For example, a keyboard app might train a next-word prediction model using data from millions of users’ devices without accessing their actual messages.
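The aggregation step described above is commonly implemented as federated averaging (FedAvg): the server combines participants' weights, weighting each by its local dataset size. A minimal sketch, with hypothetical client updates and a made-up `federated_average` helper:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation: average per-client model weights,
    weighted by how many local examples each client trained on."""
    coeffs = np.array(client_sizes, dtype=float) / sum(client_sizes)
    return coeffs @ np.stack(client_weights)  # weighted sum across clients

# Three hypothetical clients with different amounts of local data
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]

global_weights = federated_average(updates, sizes)
# Clients with more data pull the global model proportionally more.
```

Only the weight vectors (or gradients) travel to the server; the raw training examples stay on each device.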
The distribution of data in federated learning can vary in structure. In horizontal federated learning, participants share the same feature space but have different data points. For instance, hospitals in different regions might collect similar patient metrics (e.g., blood pressure, age) but serve distinct populations. In vertical federated learning, participants hold different features for the same data points. A bank and an e-commerce platform might collaborate to train a fraud detection model: the bank has transaction histories, while the e-commerce platform has user browsing behavior. A hybrid approach, federated transfer learning, combines these when participants overlap only minimally in both samples and features. For example, a self-driving car consortium might train a model using diverse sensor data (cameras, lidar) from cars in different environments.
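The horizontal/vertical distinction is easiest to see as two ways of slicing the same conceptual data matrix. A small illustration (the hospital/bank/e-commerce names are just labels for the slices, echoing the examples above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Conceptual "full" dataset: 6 entities x 4 features
full = rng.normal(size=(6, 4))

# Horizontal FL: same feature space, different rows (distinct patients)
hospital_a = full[:3, :]
hospital_b = full[3:, :]

# Vertical FL: same rows (shared entity IDs), different feature columns
bank_features = full[:, :2]       # e.g., transaction-derived features
ecommerce_features = full[:, 2:]  # e.g., browsing-derived features

assert hospital_a.shape[1] == hospital_b.shape[1]             # shared features
assert bank_features.shape[0] == ecommerce_features.shape[0]  # shared entities
```

In practice, vertical FL also requires privately aligning the shared entity IDs (e.g., via private set intersection) before training, since neither party should reveal its full customer list.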
Key challenges arise from this distribution. Data is often non-IID (not independently and identically distributed), meaning one device's data may not represent the global distribution. One user's smartphone might hold mostly cat photos while another's holds dogs, leading to biased model updates. Communication costs are another concern: sending frequent, large model updates between devices and a central server can be inefficient. Techniques like gradient compression or selective participant sampling help reduce this overhead. Privacy risks also persist; even though raw data isn't shared, model updates can leak information. Methods like differential privacy (adding noise to updates) or secure aggregation (encrypting updates before aggregation) mitigate these risks while maintaining model performance.
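Two of the mitigations above can be sketched in a few lines: top-k sparsification (a common form of gradient compression) transmits only the largest-magnitude entries, and DP-style noising clips an update's norm before adding Gaussian noise. The function names and parameter values are illustrative, not from any particular library:

```python
import numpy as np

def compress_topk(update, k):
    """Gradient compression sketch: keep only the k largest-magnitude
    entries; everything else is zeroed and need not be transmitted."""
    sparse = np.zeros_like(update)
    keep = np.argsort(np.abs(update))[-k:]
    sparse[keep] = update[keep]
    return sparse

def add_dp_noise(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Differential-privacy-style update: clip the L2 norm to bound any
    single client's influence, then add Gaussian noise."""
    rng = rng or np.random.default_rng()
    scale = min(1.0, clip_norm / max(np.linalg.norm(update), 1e-12))
    return update * scale + rng.normal(0.0, noise_std, size=update.shape)

update = np.array([0.9, -0.05, 0.02, -1.4, 0.3])
sparse = compress_topk(update, k=2)   # only 2 of 5 values sent upstream
private = add_dp_noise(update, rng=np.random.default_rng(42))
```

Real deployments tune the clipping norm and noise scale to a target privacy budget (epsilon), and often combine both techniques with secure aggregation so the server only ever sees the sum of noised updates.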