Multi-Room Audio on Android: Setup and Sync
How to play music in sync across multiple speakers and devices from your Android phone. Learn about UPnP, Chromecast, and cross-protocol multi-room with practical setup tips.
What Multi-Room Actually Means
Multi-room audio is playing the same music through speakers in different rooms at the same time. Walk from the kitchen to the living room and the song continues seamlessly. No gaps, no echo, no weird delay between rooms.
That’s the promise. Every speaker manufacturer puts it on the box. And within a single ecosystem — Sonos, Apple, Google — it mostly works. The problems start when you have speakers from different manufacturers, which is basically everyone.
The real definition of multi-room isn’t just “same song in multiple places.” It’s synchronized playback — audio arriving at your ears from different speakers within a few tens of milliseconds of each other. If two speakers are more than about 50ms apart, you hear it. Not as an echo, but as a smearing, a thickness to the sound that makes it feel wrong. Under 30ms, most people can’t tell the difference. That’s the target.
The Sync Problem
Here’s why synchronized playback across different devices is genuinely hard.
When you tap play, a chain of events happens: the app sends a command over WiFi to the speaker, the speaker receives it, buffers some audio data, processes it through its DAC, and sound comes out. Every step takes time, and every device takes a different amount of time.
A Chromecast might take 800ms from play command to first sound. A Denon receiver might take 200ms. A SoundTouch speaker sits somewhere around 150ms. Your phone’s headphone jack? Under 10ms. If you send the play command to all four devices at the same instant, sound comes out of your phone almost immediately, the SoundTouch about 150ms later, the Denon a fifth of a second after the command, and the Chromecast nearly a full second after you tapped play.
That’s just startup. Once playing, devices drift. Network jitter adds a few milliseconds of randomness to every command. Internal clocks aren’t perfectly synchronized: a speaker running at 44,099.8 Hz instead of 44,100 Hz will drift by about 16ms over an hour, and larger clock errors drift proportionally faster. Some devices pause briefly during WiFi channel switches. Some buffer aggressively and ignore seek commands. Some report their playback position accurately; others round to the nearest second.
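Clock drift is easy to estimate from the rate error alone. The sketch below is only illustrative arithmetic; the function name `clock_drift_ms` is ours and not part of any library.

```python
def clock_drift_ms(actual_hz: float, nominal_hz: float, seconds: float) -> float:
    """Milliseconds a device falls behind (positive) or runs ahead (negative)
    after `seconds` of wall-clock time, given its real sample clock."""
    rate_error = (nominal_hz - actual_hz) / nominal_hz
    return rate_error * seconds * 1000.0


# A speaker clocked 0.2 Hz slow, measured over one hour.
one_hour = clock_drift_ms(44_099.8, 44_100.0, 3600)
print(f"{one_hour:.1f} ms per hour")  # 16.3 ms per hour
```

The drift scales linearly with both the clock error and the listening time, which is why a correction loop has to keep running for as long as the music plays.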
None of these problems are individually catastrophic. But combined across a house full of different speakers, they add up fast.
Protocol Landscape
Four major protocols handle multi-room audio today. Each takes a different approach to the sync problem.
| | Sonos | AirPlay 2 | Chromecast | UPnP/DLNA |
|---|---|---|---|---|
| Ecosystem | Sonos only | Apple only | Google only | Open standard |
| Sync accuracy | ~1ms (proprietary mesh) | ~5-20ms (Apple-optimized) | ~50-200ms (Cast groups) | No native sync |
| Grouping | Native, excellent | Native, good | Native, decent | No grouping standard |
| Device range | Sonos speakers only | Apple + licensed | Cast-enabled | Receivers, TVs, streamers, speakers — hundreds of brands |
| Android support | Via Sonos app only | Not available | Yes | Yes |
Sonos solves sync with proprietary hardware — their speakers form a mesh network with microsecond-accurate clock synchronization. It’s the gold standard, but you’re locked into Sonos hardware.
AirPlay 2 uses Apple’s timing synchronization protocol to keep speakers in sync. It works well, but it’s Apple-only. Not an option on Android.
Chromecast groups work reasonably well within the Google ecosystem. You create a speaker group in Google Home, and Cast-enabled apps can target the group. Sync is decent — usually within 50-200ms — but the system is closed. Only Cast-enabled devices participate.
UPnP/DLNA is the open standard. It works with the widest range of hardware — Denon receivers, Yamaha soundbars, Panasonic Blu-ray players, WiiM streamers, smart TVs, and more. The catch: UPnP has absolutely no concept of device grouping or synchronized playback. The spec simply doesn’t address it. Each renderer is treated as an independent device.
Most people don’t live entirely within one ecosystem. You’ve got the receiver you bought five years ago, the Chromecast in the bedroom, the smart TV in the living room, and maybe a Bluetooth speaker on the patio. No single protocol covers all of them.
The Cross-Protocol Challenge
This is the gap that most music apps leave wide open.
Spotify Connect works great — with Spotify Connect devices. Google Home groups Chromecasts beautifully — but only Chromecasts. Your Denon receiver supports UPnP and AirPlay, but your Chromecast doesn’t speak UPnP, and your Android phone doesn’t speak AirPlay.
So what happens when you want the kitchen Chromecast and the living room Denon receiver playing the same song at the same time? In most apps: nothing. You pick one device. The other room stays silent.
The underlying problem is architectural. Most apps are built around a single output paradigm — they send audio to one destination. Multi-room requires a fundamentally different model: one source of truth for what’s playing, fanning out to multiple destinations using whatever protocol each device understands, then continuously monitoring and correcting drift across all of them.
How We Built Multi-Room in Echobox
We call them output groups. You pick the devices you want — any combination of UPnP renderers, Chromecast devices, and your phone’s own output — and Echobox creates a group that coordinates playback across all of them.
The core idea is simple. Echobox maintains a single playback timeline — one authoritative record of what track is playing and what position it’s at. When you tap play, the group coordinator fans out play commands to every device in the group using each device’s native protocol: SOAP commands for UPnP renderers, the Cast SDK for Chromecasts, and the local audio engine for your phone.
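A rough way to picture the fan-out is a dispatch table keyed by protocol. This is a minimal sketch, not Echobox’s actual code: the `Timeline`, `Device`, and `send_*` names are hypothetical, the track URL is made up, and the senders only return a description of what a real adapter would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Timeline:
    """Single source of truth: which track is playing and where."""
    track_uri: str
    position_ms: int = 0


@dataclass
class Device:
    name: str
    protocol: str  # "upnp", "cast", or "local"


Sender = Callable[[Device, Timeline], str]


# Hypothetical adapters: real ones would speak SOAP, the Cast SDK,
# or the local audio engine. Here they just describe the command.
def send_upnp(device: Device, timeline: Timeline) -> str:
    return f"{device.name}: SOAP SetAVTransportURI + Play ({timeline.track_uri})"


def send_cast(device: Device, timeline: Timeline) -> str:
    return f"{device.name}: Cast load ({timeline.track_uri})"


def send_local(device: Device, timeline: Timeline) -> str:
    return f"{device.name}: local engine start ({timeline.track_uri})"


SENDERS: Dict[str, Sender] = {"upnp": send_upnp, "cast": send_cast, "local": send_local}


def fan_out_play(timeline: Timeline, devices: List[Device]) -> List[str]:
    """Issue one play command per device, using that device's own protocol."""
    return [SENDERS[d.protocol](d, timeline) for d in devices]


timeline = Timeline("http://phone.local:8080/track.flac")  # made-up URL
group = [
    Device("Living room Denon", "upnp"),
    Device("Kitchen Chromecast", "cast"),
    Device("Phone", "local"),
]
log = fan_out_play(timeline, group)
for line in log:
    print(line)
```

The point of the shape is that every device reads from the same `Timeline`; only the last hop differs per protocol.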
Getting them to start at roughly the same time was the first challenge. Different devices have wildly different startup latencies — a Chromecast might need 500ms to buffer and begin, while a UPnP renderer needs 300ms and the local engine needs almost nothing. So we stagger the play commands: the slowest device gets its command first, then we wait, then send to the next-slowest, and so on. The timing uses absolute clock references rather than cumulative delays, so execution time for each command doesn’t throw off the schedule.
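The staggering can be sketched as a schedule of absolute send times derived from one target instant. This is an illustration under assumptions: the function names, the 100ms safety margin, and the 5ms figure for the local engine are ours; the 500ms and 300ms latencies are the examples from the text.

```python
import time
from typing import Callable, Dict, List, Tuple


def build_schedule(
    latencies_ms: Dict[str, float], now: float, margin_ms: float = 100.0
) -> List[Tuple[float, str]]:
    """Return (send_at, device) pairs, slowest device first.

    Every send time is computed from the same target instant, so a slow
    command call cannot push the later commands back.
    """
    target = now + (max(latencies_ms.values()) + margin_ms) / 1000.0
    schedule = [(target - lat / 1000.0, name) for name, lat in latencies_ms.items()]
    return sorted(schedule)


def run_schedule(
    schedule: List[Tuple[float, str]],
    send: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait until each absolute deadline, then send that device's command."""
    for send_at, name in schedule:
        delay = send_at - clock()
        if delay > 0:
            sleep(delay)
        send(name)


latencies = {"chromecast": 500.0, "upnp-renderer": 300.0, "local": 5.0}
schedule = build_schedule(latencies, now=0.0)
order = [name for _, name in schedule]
print(order)  # ['chromecast', 'upnp-renderer', 'local']
```

In real use `now` would be `time.monotonic()` at the moment of the tap. Because each wait is measured against the clock rather than added to the previous wait, time spent inside `send` shortens the next sleep instead of accumulating.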
We spent months getting drift correction right. The first version was terrible — it would overcorrect, creating audible skips, or undercorrect and let devices drift apart over minutes.
The current system works like this: once all devices are playing, Echobox polls each one for its current playback position roughly once per second. It compares each device’s reported position against where it should be based on the sync anchor — a reference point that ties a wall-clock time to a media position. If a device reports that it’s at 1:23.400 but the anchor says it should be at 1:23.650, that’s 250ms of drift.
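The anchor arithmetic can be sketched in a few lines. The class and function names here are hypothetical, chosen for illustration; the numbers reproduce the 1:23.400 versus 1:23.650 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncAnchor:
    """Ties a wall-clock instant to a media position."""
    wall_clock_s: float      # when the anchor was taken
    media_position_ms: int   # where playback was at that instant

    def expected_position_ms(self, now_s: float) -> int:
        elapsed_ms = round((now_s - self.wall_clock_s) * 1000)
        return self.media_position_ms + elapsed_ms


def drift_ms(anchor: SyncAnchor, reported_position_ms: int, now_s: float) -> int:
    """Positive: the device is behind the timeline. Negative: it is ahead."""
    return anchor.expected_position_ms(now_s) - reported_position_ms


anchor = SyncAnchor(wall_clock_s=1000.0, media_position_ms=0)
now = 1083.65        # 1:23.650 after the anchor was taken
reported = 83_400    # the device says it is at 1:23.400
print(drift_ms(anchor, reported, now))  # 250
```

A renderer that reports position in whole seconds would return 83000 here, so its measured drift can be off by up to a second regardless of how well it is actually playing.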
What happens next depends on the device.
Per-device intelligence is critical here. Echobox maintains a profile for every device it’s seen, built from three layers: what the device advertises about itself, what we know about its family (all Denon receivers behave similarly), and what we’ve observed at runtime. A Denon AVR with reliable seek gets tight drift thresholds — we’ll correct at 150ms of drift using a seek command. A generic DLNA TV with unreliable seek gets much wider tolerance — 500ms before we even flag it, and we might skip corrections entirely if seek has failed too many times. The system learns: if a device’s seek commands fail 70% of the time, Echobox stops trying to seek-correct that device and accepts looser sync. See our guide on UPnP streaming for more on how device capabilities are detected.
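The decision logic can be sketched as a small policy function. This is an assumption-laden illustration rather than the real profile format: the field names and the `MIN_ATTEMPTS` guard are ours, while the 150ms and 500ms thresholds and the 70% failure rate come from the text.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    name: str
    drift_threshold_ms: int  # correct only at or beyond this much drift
    seek_attempts: int = 0
    seek_failures: int = 0

    def record_seek(self, ok: bool) -> None:
        """Runtime observation: remember whether a seek correction worked."""
        self.seek_attempts += 1
        if not ok:
            self.seek_failures += 1

    @property
    def seek_failure_rate(self) -> float:
        if self.seek_attempts == 0:
            return 0.0
        return self.seek_failures / self.seek_attempts


MIN_ATTEMPTS = 10   # assumption: don't judge a device on a handful of seeks
GIVE_UP_RATE = 0.7  # stop seek-correcting at a 70% failure rate


def should_seek_correct(profile: DeviceProfile, drift_ms: int) -> bool:
    if abs(drift_ms) < profile.drift_threshold_ms:
        return False  # within tolerance for this device
    unreliable = (
        profile.seek_attempts >= MIN_ATTEMPTS
        and profile.seek_failure_rate >= GIVE_UP_RATE
    )
    return not unreliable  # accept looser sync rather than keep failing


denon = DeviceProfile("Denon AVR", drift_threshold_ms=150)
tv = DeviceProfile("Generic DLNA TV", drift_threshold_ms=500)
print(should_seek_correct(denon, 250), should_seek_correct(tv, 250))  # True False
```

The same 250ms of drift triggers a correction on the receiver and is ignored on the TV, which is the whole point of keeping thresholds per device.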
UPnP has no native grouping standard, so we had to build the coordination layer ourselves. Every UPnP renderer in a group gets its own SOAP commands — SetAVTransportURI, Play, Seek — as if it were the only device playing. The coordination layer just makes sure those commands go out in the right order with the right timing.
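For readers who have not seen one, an AVTransport command is a small SOAP envelope posted to the renderer’s control URL. The sketch below only builds the requests; it does not send them. The helper names and the track URL are hypothetical, and the control URL (which comes from each device’s description document) is left out.

```python
from typing import Dict, Tuple
from xml.sax.saxutils import escape

AVTRANSPORT = "urn:schemas-upnp-org:service:AVTransport:1"


def soap_request(action: str, args: Dict[str, object]) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for one AVTransport action."""
    arg_xml = "".join(f"<{k}>{escape(str(v))}</{k}>" for k, v in args.items())
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        '<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/" '
        's:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">'
        f'<s:Body><u:{action} xmlns:u="{AVTRANSPORT}">{arg_xml}</u:{action}></s:Body>'
        "</s:Envelope>"
    )
    headers = {
        "Content-Type": 'text/xml; charset="utf-8"',
        "SOAPACTION": f'"{AVTRANSPORT}#{action}"',
    }
    return headers, body


def rel_time(position_ms: int) -> str:
    """Format a position as H:MM:SS for Seek with Unit=REL_TIME.
    Truncates to whole seconds."""
    total_s = position_ms // 1000
    hours, rem = divmod(total_s, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# The three commands named above, in the order a renderer expects them.
set_uri = soap_request(
    "SetAVTransportURI",
    {
        "InstanceID": 0,
        "CurrentURI": "http://192.168.1.10:8080/track.flac",  # made-up URL
        "CurrentURIMetaData": "",
    },
)
play = soap_request("Play", {"InstanceID": 0, "Speed": "1"})
seek = soap_request(
    "Seek", {"InstanceID": 0, "Unit": "REL_TIME", "Target": rel_time(83_650)}
)
print(seek[0]["SOAPACTION"])
```

Each renderer in a group receives its own copy of these requests; nothing in the envelope refers to any other device.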
One thing we’re particularly proud of: your phone can play locally and to network devices simultaneously. The local engine reads directly from the audio buffer with under 10ms latency, while the same track streams to network devices over HTTP. This means you can walk around with headphones connected via Bluetooth while speakers in the house play the same music, though the Bluetooth link adds its own latency on top of the local engine’s. The local Android audio path handles its own timing independently, so local playback is never degraded by network coordination overhead.
When a device drops out — WiFi hiccup, speaker goes to sleep — the system doesn’t panic. It tracks consecutive failures and only marks a device as failed after five missed polls. If it comes back, three successful polls restore it. This prevents the group from constantly toggling devices in and out when network conditions are marginal.
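This kind of hysteresis fits in a small state tracker. A minimal sketch, with hypothetical names, assuming that the three restoring polls must be consecutive in the same way the five failures are:

```python
from dataclasses import dataclass

FAIL_AFTER = 5      # consecutive missed polls before a device is marked failed
RESTORE_AFTER = 3   # consecutive good polls before it is restored (assumed consecutive)


@dataclass
class DeviceHealth:
    failed: bool = False
    misses: int = 0  # current run of missed polls
    hits: int = 0    # current run of successful polls

    def record_poll(self, ok: bool) -> bool:
        """Update counters for one poll and return the current failed state."""
        if ok:
            self.misses = 0
            self.hits += 1
            if self.failed and self.hits >= RESTORE_AFTER:
                self.failed = False
        else:
            self.hits = 0
            self.misses += 1
            if not self.failed and self.misses >= FAIL_AFTER:
                self.failed = True
        return self.failed


health = DeviceHealth()
states = [health.record_poll(ok) for ok in [False] * 5 + [True] * 3]
print(states)  # [False, False, False, False, True, True, True, False]
```

Because any success resets the miss counter and any miss resets the hit counter, a device that alternates between good and bad polls never changes state, which is what keeps a marginal network from toggling it in and out of the group.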
Honestly, the hardest part wasn’t any single technical challenge. It was making the system work reliably across the sheer diversity of devices out there. A firmware update on one manufacturer’s speakers changed their seek behavior. Some renderers report position in integer seconds, which makes sub-second drift detection impossible. Some Chromecast models take so long to start that they’re already a full second behind by the time audio begins. Each quirk needed its own workaround, and the system had to degrade gracefully rather than break entirely.
Practical Setup Tips
Network requirements. All devices need to be on the same subnet. If you have a mesh WiFi system with separate IoT and main networks, your speakers and phone need to be on the same one. SSDP multicast discovery needs to work — some enterprise-grade routers block multicast by default. If Echobox can’t find your UPnP devices, multicast filtering is the first thing to check.
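One way to check whether multicast discovery works on your network is to send an SSDP search yourself from a laptop on the same WiFi. The sketch below is a diagnostic aid, not how Echobox does discovery; the function names are ours, and it searches only for UPnP media renderers.

```python
import socket
from typing import List

SSDP_ADDR = ("239.255.255.250", 1900)  # standard SSDP multicast group and port


def msearch_message(
    search_target: str = "urn:schemas-upnp-org:device:MediaRenderer:1", mx: int = 2
) -> bytes:
    """Build an SSDP M-SEARCH request asking renderers to announce themselves."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR[0]}:{SSDP_ADDR[1]}",
        'MAN: "ssdp:discover"',
        f"MX: {mx}",  # devices reply after a random delay of up to MX seconds
        f"ST: {search_target}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def discover(timeout_s: float = 3.0) -> List[str]:
    """Send one M-SEARCH and return the IP addresses that answered."""
    found: List[str] = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(msearch_message(), SSDP_ADDR)
        try:
            while True:
                _data, (host, _port) = sock.recvfrom(65507)
                if host not in found:
                    found.append(host)
        except socket.timeout:
            pass  # no more replies within the timeout
    return found
```

Calling `discover()` should list the addresses of your receivers and streamers. An empty list on a network where those devices are powered on points toward multicast filtering or client isolation on the router.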
WiFi stability matters more than speed. Multi-room audio doesn’t need much bandwidth — even uncompressed CD-quality audio is only about 1.4 Mbps. But it needs consistent latency. If your WiFi drops packets or has intermittent congestion, drift will accumulate faster than the correction system can compensate. A 5 GHz network generally provides more consistent latency than 2.4 GHz, though 2.4 GHz has better range through walls.
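The bandwidth figure is just sample rate times bit depth times channel count. A quick illustrative calculation, with a function name of our own choosing:

```python
def pcm_bitrate_mbps(sample_rate_hz: int = 44_100, bit_depth: int = 16, channels: int = 2) -> float:
    """Raw bitrate of uncompressed PCM audio in megabits per second."""
    return sample_rate_hz * bit_depth * channels / 1_000_000


print(pcm_bitrate_mbps())               # 1.4112  (CD quality)
print(pcm_bitrate_mbps(96_000, 24, 2))  # 4.608   (24-bit / 96 kHz)
```

Even the high-resolution case is a small fraction of what a healthy WiFi link carries, which is why latency consistency, not throughput, is the limiting factor.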
Choosing devices for groups. For the tightest sync, group devices with similar characteristics. Two UPnP receivers from the same manufacturer will stay in sync better than a Chromecast paired with a UPnP TV. That said, mixed groups work — just set your expectations appropriately. Two UPnP streamers might stay within 50ms; a Chromecast plus a UPnP renderer might settle around 200ms.
Managing latency expectations. Hardware-synced systems like Sonos achieve 1-5ms accuracy because the speakers share a clock. Software-coordinated sync over WiFi, which is what every cross-protocol solution uses, tops out around 30-50ms in the best case. For background listening — music playing throughout the house while you cook or work — anything under 200ms is fine. You’ll only notice drift if you’re standing in a doorway between two rooms.
When to use what. If you have multiple Chromecast devices, creating a Google Home speaker group for them is simpler and will sync better than app-level coordination. Use Echobox groups for the cases that native grouping can’t handle: mixing Chromecast with UPnP, adding your phone as a group member, or grouping UPnP devices that have no native grouping support at all. The two approaches aren’t mutually exclusive — you can use Chromecast groups for some rooms and Echobox groups for others.
Troubleshooting. If a device consistently shows high drift, check its WiFi signal strength first. Weak WiFi means variable latency, which means the sync engine is constantly chasing a moving target. If a specific device causes problems in groups but works fine solo, it may have firmware quirks that affect seek or position reporting — Echobox’s signal path diagnostics will show you the device’s profile, including its sync suitability rating and any learned observations about its behavior.
Where Things Stand
Multi-room across different protocols is a hard problem, and we don’t pretend it’s solved perfectly. Software-coordinated sync will never match hardware-synced systems like Sonos for critical listening. Some devices have firmware quirks we haven’t encountered yet. Position reporting accuracy varies wildly across manufacturers.
But for the real-world scenario — different speakers from different eras and different brands, all playing your music library in sync from your phone — it works. Not with audiophile-grade synchronization, but with the kind of accuracy that makes whole-house listening genuinely enjoyable. And the system gets smarter over time: every playback session teaches Echobox a little more about how each device behaves, tightening corrections for reliable devices and loosening them for flaky ones.