
How does federated learning manage slow or unreliable devices?

Federated learning manages slow or unreliable devices through asynchronous communication, adaptive device selection, and fault-tolerant aggregation strategies. In this setup, devices train local models on their data and send updates (like gradients or weights) to a central server. The server combines these updates to improve a global model. Slow or unreliable devices can disrupt this process, so federated learning employs techniques to minimize their impact without requiring constant connectivity or high-performance hardware.
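The aggregation step described above can be sketched as a simple weighted average of client updates (the core of the FedAvg algorithm). This is a minimal illustration, not a production implementation; the function name `fed_avg` and the toy weight vectors are assumptions for the example.

```python
import numpy as np

def fed_avg(updates, sizes):
    """Combine client weight vectors into a global update,
    weighting each client by its local dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

# Two clients: one trained on 100 samples, one on 300.
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
global_update = fed_avg(updates, sizes=[100, 300])
# Weighted result: 0.25 * [1, 2] + 0.75 * [3, 4] = [2.5, 3.5]
```

Because the weighting depends only on whichever clients actually report back, the same function works unchanged when stragglers are dropped from a round.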

One key approach is asynchronous model aggregation. Instead of waiting indefinitely for every device to report back, the server proceeds with updates from devices that complete training within a set window. For example, a server might collect contributions for 5 minutes and skip devices that take longer due to low compute power or spotty connections. This prevents bottlenecks and ensures progress even if some devices lag. The server can also assign per-device timeouts, dropping a device's participation in a given round if it exceeds the limit. This avoids indefinite waiting while still letting slower devices contribute in later rounds when they're available.
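The deadline logic above can be sketched as follows. For clarity, device completion times are simulated as plain numbers rather than measured from real training; the function name and the 300-second deadline are assumptions for the example.

```python
def collect_within_deadline(device_times, deadline):
    """Partition devices by whether their (simulated) training time
    fits inside the round's deadline. Stragglers are skipped this
    round but remain eligible for future rounds."""
    arrived, dropped = [], []
    for device, seconds in device_times.items():
        if seconds <= deadline:
            arrived.append(device)
        else:
            dropped.append(device)
    return arrived, dropped

# A 5-minute (300 s) collection window.
times = {"phone_a": 120, "phone_b": 290, "phone_c": 540}
arrived, dropped = collect_within_deadline(times, deadline=300)
# arrived: ["phone_a", "phone_b"]; dropped: ["phone_c"]
```

In a real system the server would aggregate only the `arrived` updates and record the dropped devices' lateness for future selection decisions.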

Another strategy involves prioritizing reliable devices during selection. The server can track historical performance metrics, such as a device’s average response time or dropout rate, and exclude those with poor reliability. For instance, a smartphone with frequent network disconnections might be deprioritized until its connectivity stabilizes. To further reduce strain on slow devices, techniques like model compression (e.g., quantizing weights to 8-bit integers) or partial updates (sending only a subset of parameters) minimize data transfer and computation. Frameworks like TensorFlow Federated also implement retry logic for failed transmissions and redundancy by duplicating critical training tasks across multiple devices. These steps ensure that the global model continues to improve even when some participants are intermittently unavailable or underperforming.
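Two of the ideas above, reliability-based selection and 8-bit weight quantization, can be sketched briefly. The threshold value, function names, and dropout-rate bookkeeping are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

def select_reliable(devices, max_dropout=0.2):
    """Keep only devices whose historical dropout rate is below a threshold."""
    return [d for d, rate in devices.items() if rate < max_dropout]

def quantize_int8(weights):
    """Linearly map float weights to 8-bit integers plus a scale factor,
    shrinking the payload each device must upload by ~4x vs float32."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return np.round(weights / scale).astype(np.int8), scale

devices = {"phone_a": 0.05, "phone_b": 0.45, "tablet_c": 0.10}
chosen = select_reliable(devices)          # phone_b is deprioritized

w = np.array([1.0, -0.5, 0.0])
q, scale = quantize_int8(w)
recovered = q.astype(np.float32) * scale   # close to the original weights
```

Dequantization introduces at most about half a quantization step of error per weight, which federated averaging over many clients tends to smooth out.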
