Why are mobile networks dropping like flies?

Updated. Last week, Orange France’s mobile network tanked, knocking out the mobile phones of millions of subscribers. This week the same thing happened to O2 in the U.K. The U.S. isn’t immune either. Just last week T-Mobile suffered from a smaller glitch, but the granddaddy of all network failures hit Verizon Wireless in December when its LTE network went down on three separate occasions in a single month.
Why are networks suddenly conking out all over the world? It looks like global networks are developing a signaling problem – more specifically a signaling overload problem.
Details are starting to emerge about just what caused the Orange and O2 outages. Computerworld UK and Information Age separately reported that the network element at fault in both cases was the home location register, or HLR. It’s not exactly the most commonly known piece of gear, but in brief the HLR acts as an anchor point to which we remain tethered as we move about the network. It stores our subscriber identities and knows what services we can access, but most importantly, it tracks each device’s present location so the network knows where to direct inbound and outbound traffic.
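To make the HLR's job concrete, here is a toy sketch of the bookkeeping described above: who the subscriber is, what they may use, and where they were last seen. The class and field names (ToyHLR, serving_area and so on) are made up for illustration and do not reflect any vendor's actual HLR interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubscriberRecord:
    imsi: str                                   # subscriber identity
    services: set = field(default_factory=set)  # services the subscriber may use
    serving_area: Optional[str] = None          # last reported location, if any


class ToyHLR:
    """Hypothetical, in-memory stand-in for a home location register."""

    def __init__(self):
        self._records = {}

    def provision(self, imsi, services):
        # Create the subscriber's record with no known location yet.
        self._records[imsi] = SubscriberRecord(imsi, set(services))

    def update_location(self, imsi, serving_area):
        # Devices send these updates constantly as they move around.
        self._records[imsi].serving_area = serving_area

    def route_for(self, imsi):
        # Where should inbound traffic go? None means "the network has no idea".
        record = self._records.get(imsi)
        if record is None:
            return None
        return record.serving_area
```

If the location updates stop landing, route_for has nothing useful to return, which is the situation the rest of this article describes.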
The HLR plays its dispatch role by receiving a constant stream of signals from devices updating the database on their current locations and activities. According to Computerworld, a data glitch in an Orange HLR node generated error messages, which then multiplied as they got knocked back and forth around the network. The HLR’s failure didn’t stop devices from sending out their updates. Like a million kids screaming “look at me!” from the backseat while you’re trying to deal with the coffee you just spilled in your lap, smartphones kept pinging the suffering HLR, creating a huge bottleneck. The end result: the whole system fails, leaving millions of handsets without their lifelines to the network core.
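The pile-up can be sketched with a few lines of arithmetic. This is a rough toy model, not a reconstruction of what happened inside Orange's network: the function name, the per-tick numbers and the retry_factor parameter are all assumptions chosen for illustration. A retry_factor of 1 means unanswered messages simply wait their turn; 2 means each unanswered message is also resent once.

```python
def simulate_backlog(ticks, fresh_per_tick, capacity, glitch_ticks, retry_factor):
    """Return the number of unanswered signaling messages after each tick."""
    backlog = 0
    history = []
    for tick in range(ticks):
        # New location updates plus everything still waiting (and its retries).
        offered = fresh_per_tick + backlog * retry_factor
        # During a glitch the element answers nothing; otherwise it serves
        # as much as its capacity allows.
        served = 0 if tick in glitch_ticks else min(offered, capacity)
        backlog = offered - served
        history.append(backlog)
    return history


if __name__ == "__main__":
    # One-tick glitch, element has 20 percent headroom over normal load.
    print(simulate_backlog(6, 1000, 1200, {1}, retry_factor=1))
    print(simulate_backlog(6, 1000, 1200, {1}, retry_factor=2))
```

Under these made-up numbers, the backlog drains slowly after the glitch when messages just queue, but keeps growing once retries are added, even though the element itself has recovered.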
If the Orange and O2 failures sound familiar, it’s because the exact same thing happened to Verizon in December. Since Verizon’s network is an LTE system, not an HSPA one, its core architecture is a bit different, but the basic problem seems to be the same. A software bug generated error messages that backed up its core elements, causing them to be oversaturated by signals and ultimately forcing the whole core to crash.

A whole lot of bandwidth but nowhere to go

In all three situations, the radio networks weren’t the problem. The networks still had plenty of capacity, and all devices were capable of connecting to their towers to send and receive data. But with a broken core, the networks had no idea where and whom to send that data to. Imagine playing Where’s Waldo? with 10 million people in a single storybook frame.
In Verizon’s case you could chalk it up to the relative newness of both the network and the LTE standard, but in the case of Orange and O2, their UMTS networks have been up and running for nearly a decade. For their HLRs to now start developing random terminal bugs seems rather odd. The problem doesn’t appear to be inherent in the equipment itself but in the sheer volume of signaling traffic, driven by the smartphone boom, that now traverses mobile networks.
That constant network chatter from smartphones and their applications is overwhelming network cores. On normal days they can handle that traffic, but even a small glitch throws everything out of whack. Smartphone use is only increasing, so this problem is only going to get worse.

What’s to be done?

If you talk to the signaling system vendors such as Tekelec, Acme Packet, Traffix Systems, Intellinet and Openet, you’ll get a single resounding answer: Diameter! Diameter is the signaling protocol used in LTE core networks, and those aforementioned vendors claim that more robust and flexible routers using that protocol will nip the signaling problem in the bud. Diameter’s load balancing techniques would allow the network to shift the signaling load away from elements experiencing problems, isolating failures rather than allowing them to infect everything around them.
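The idea of shifting load away from a troubled element can be sketched as a round-robin picker that skips anything marked unhealthy. This is only an illustration of the load-shifting principle the vendors describe; it is not the Diameter protocol or any vendor's router, and the class name, method names and peer labels are hypothetical.

```python
class ToySignalingRouter:
    """Hypothetical round-robin router that steers around unhealthy peers."""

    def __init__(self, peers):
        self._peers = list(peers)
        self._healthy = set(self._peers)
        self._next = 0  # index of the next peer to try

    def mark_down(self, peer):
        # Stop sending new signaling traffic to a peer that is misbehaving.
        self._healthy.discard(peer)

    def mark_up(self, peer):
        # Put a recovered peer back into rotation.
        if peer in self._peers:
            self._healthy.add(peer)

    def pick_peer(self):
        # Try each peer at most once, in rotation order, skipping unhealthy ones.
        for _ in range(len(self._peers)):
            peer = self._peers[self._next]
            self._next = (self._next + 1) % len(self._peers)
            if peer in self._healthy:
                return peer
        return None  # every peer is down; nothing to route to
```

Note the last line: if every element is in trouble at once, a router like this has nowhere left to send traffic, so isolation only helps while something healthy remains.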
Given O2 and Orange’s failures, those vendors are jumping at the chance to claim Diameter routers are now necessary for 3G networks as well, and they’re probably right. The vast majority of smartphone traffic currently runs through 3G towers, and it’s going to remain that way for a while. But Diameter is by no means a cure-all.
Verizon has experienced a record number of network failures, even though it uses the next-generation signaling protocol. The outages were caused by software bugs in other elements, yet Diameter signaling wasn’t able to contain the problem, either, when the network started going haywire. Update: While Tekelec in August revealed that Verizon was a customer for its Diameter signaling router, Tekelec officials told me that Verizon hadn’t actually deployed its equipment by the time of the December outages.
Whatever the eventual cure, the wireless industry had better find it quick. O2’s London outage was particularly embarrassing because of the upcoming Olympics. But other operators should be just as worried. A network that needs to be shut down and rebooted every few months isn’t much of a network at all. 
Tower Image courtesy of Flickr user Nikhil Verma; Compass photo courtesy of Shutterstock user Sashkin