A nationwide T-Mobile US, Inc. network outage in June was caused by equipment failure and exacerbated by a network routing misconfiguration, according to a report prepared by the FCC’s Public Safety and Homeland Security Bureau and released today. The carrier failed to "follow several network reliability best practices that could have prevented the outage, or at least mitigated its effects," according to the report.
"T-Mobile’s outage was a failure," FCC Chairman Ajit Pai said in a statement. "Our staff investigation found that the company did not follow several established network reliability best practices that could have either prevented the outage or at least mitigated its impact. All telecommunications providers must ensure they are adhering to relevant industry best practices, and I encourage network reliability standards bodies to apply their expertise to the issues identified in this report for further study."
The report on the June 15 incident (TR Daily, June 16 and 17) said that "T-Mobile experienced an outage on its wireless networks that lasted over twelve hours, disrupting calling and texting services nationwide, including 911 service, as well as access to data service in some areas. The Public Safety and Homeland Security Bureau (PSHSB) estimates based on data provided by T-Mobile and other affected service providers that at least 41% of all calls that attempted to use T-Mobile’s network during the outage failed, including at least 23,621 failed calls to 911. The outage was initially caused by an equipment failure and then exacerbated by a network routing misconfiguration that occurred when T-Mobile introduced a new router into its network. In addition, the outage was magnified by a software flaw in T-Mobile’s network that had been latent for months and interfered with customers’ ability to initiate or receive voice calls during the outage."
"This outage provides the Federal Communications Commission (Commission) and stakeholders with the opportunity to learn valuable lessons about network reliability and the implementation of industry-accepted best practices. For example, the outage demonstrates the importance of network operators periodically auditing the diversity of their networks and taking appropriate measures to ensure resilience as needed," the report said.
In comments filed with the FCC in July, T-Mobile said the nationwide outage was an "anomaly" that it acted quickly to resolve and that it was taking steps to prevent such an incident from happening again (TR Daily, July 9).
The report released today said that T-Mobile took the following steps after the outage to prevent a recurrence: (1) "[o]ptimized Open Shortest Path First weights on links connecting to the routers in the Atlanta market"; (2) "[c]reated a separate communications channel to enable T-Mobile to manage the affected router even during an outage condition so that, in the case of a recurrence of a similar event, T-Mobile would be able to restore the affected router to a working state more quickly after intentionally taking it offline"; (3) "[a]ugmented processes regarding the phased integration of new devices into the network to include additional potential failure scenarios like those seen in this outage"; (4) "[a]ctivated additional IP Multimedia Subsystem registration nodes to increase capacity"; (5) "[r]evised IP Multimedia Subsystem registration nodes’ overload settings for better management of overload conditions"; (6) "[c]orrected the software error in the IP Multimedia Subsystem"; (7) "[i]ntroduced additional dedicated 911 nodes to enhance resiliency"; (8) "[r]educed the number of retries allowed by the IP Multimedia Subsystem registration nodes responsible for managing the secure connection with the mobile device from 4 to 2"; (9) "[i]mproved the clarity and specificity of the error message generated on nodes that interconnect with external networks when IP Multimedia Subsystem services are impacted to facilitate future troubleshooting"; (10) "[i]mproved call distribution logic for Voice over Wi-Fi services to allow regional containment during potential future outages"; (11) "[d]eployed new vendor software updates to improve IP Multimedia Subsystem node robustness and resiliency"; and (12) "[a]udited multiple systems across the circuit-switch, IP Multimedia Subsystem, and transport networks for potential enhancements."
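The effect of step (8) can be illustrated with simple arithmetic. The sketch below is not drawn from the report. It assumes that a "retry" is an additional attempt after the first one and that every attempt fails during an outage; the function name and the transaction count are hypothetical.

```python
def worst_case_messages(transactions: int, retries: int) -> int:
    """Upper bound on messages sent when every attempt fails.

    Assumption: each transaction is tried once and then retried up to
    `retries` more times, so it generates at most 1 + retries messages.
    """
    return transactions * (1 + retries)


# Hypothetical load: 1,000 failing transactions (not a figure from the report).
before = worst_case_messages(1000, retries=4)
after = worst_case_messages(1000, retries=2)

# Integer percentage reduction in worst-case message volume.
reduction_pct = (before - after) * 100 // before
print(before, after, reduction_pct)
```

Under those assumptions, cutting the retry limit from 4 to 2 lowers the worst-case message volume on an overloaded node from five messages per transaction to three.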
The report added, "While fiber link failures are common, PSHSB finds that these steps, taken together, will reduce the likelihood that a fiber link failure could result in the recurrence of a similar event in T-Mobile’s network because traffic would be routed to an alternative path that could handle it. Moreover, if such an event recurred on T-Mobile’s network, it would not cause such a large service disruption because T-Mobile would have improved its networks’ ability to manage congestion in the case of a similar event and would have increased network capacity to maintain the network in a working state even with an increased volume of traffic."
The report said that the bureau "plans to engage in stakeholder outreach and guidance regarding industry-accepted, recommended network reliability best practices to protect against similar outages in the future." It outlined best practices that T-Mobile failed to follow.
"Network operators should periodically audit the physical and logical diversity called for by the design of their network segment(s) and take appropriate measures as needed," the report said. "The router that dropped signaling traffic and precipitated this outage could never have provided functional diversity for the link that failed because the router was not provisioned to process the signaling traffic that the failed link carried. Further, T-Mobile could have prevented the outage if it had audited its network during the new router integration to ensure that the traffic destined for the failed link would redirect to a router that was able to pass it. If the backup route had operated as it was designed, a nationwide outage would likely not have occurred."
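A diversity audit of the kind the report describes can be sketched in a few lines. The example below is illustrative only and does not represent T-Mobile's network or tooling: the node names, link weights, and the `CAN_CARRY` provisioning table are all hypothetical. It fails each link in turn, recomputes the lowest-cost path (the way a link-state protocol such as Open Shortest Path First would), and flags any backup path that crosses a router not provisioned for the traffic.

```python
import heapq


def shortest_path(links, src, dst, failed=frozenset()):
    """Dijkstra over weighted, bidirectional links; returns a node list or None."""
    graph = {}
    for (a, b), weight in links.items():
        if (a, b) in failed or (b, a) in failed:
            continue  # skip links that are simulated as down
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


def audit_diversity(links, src, dst, can_carry):
    """For each single-link failure, report a missing backup path or one that
    transits a node not provisioned for the traffic class."""
    findings = []
    for link in links:
        path = shortest_path(links, src, dst, failed={link})
        if path is None:
            findings.append((link, "no backup path"))
            continue
        blocked = [n for n in path if not can_carry.get(n, False)]
        if blocked:
            findings.append((link, "backup transits unprovisioned node(s): " + ", ".join(blocked)))
    return findings


# Hypothetical topology: r_new is a newly added router that cannot pass signaling.
LINKS = {
    ("ims_core", "r1"): 10, ("r1", "peer"): 10,
    ("ims_core", "r_new"): 5, ("r_new", "peer"): 30,
    ("ims_core", "r2"): 20, ("r2", "peer"): 25,
}
CAN_CARRY = {"ims_core": True, "r1": True, "r2": True, "r_new": False, "peer": True}

findings = audit_diversity(LINKS, "ims_core", "peer", CAN_CARRY)

# Raising the weight on the r_new link steers the backup path through r2 instead.
REWEIGHTED = dict(LINKS)
REWEIGHTED[("r_new", "peer")] = 50
findings_after = audit_diversity(REWEIGHTED, "ims_core", "peer", CAN_CARRY)
```

In this toy topology the audit flags both links on the primary path, because their failure would shift traffic onto `r_new`. After the reweighting, the same audit returns no findings.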
"Network operators and service providers should consider validating upgrades, new procedures and commands in a lab or other test environment that simulates the target network and load prior to the first application in the field," the report added. "T-Mobile had a latent software error in its network that it failed to identify and address before it had a catastrophic impact. Had T-Mobile validated its IP Multimedia Subsystem registration node software and router integration in a test environment that simulated the relevant network segment, it could have discovered the software flaw and routing misconfiguration before they could impact live calls."
"Service providers should use virtual interfaces for routing protocols and network management to maintain connectivity to network elements in the presence of physical interface outages," according to the report. "The most severe impact on calling that this outage caused occurred when T-Mobile engineers intentionally took down a link in the course of troubleshooting and then were unable to restore it for an hour. Had T-Mobile maintained a separate communications channel to enable it to manage the affected router even when they took the suspected link down during troubleshooting, they could have maintained superior visibility into the network and potentially resolved the outage more quickly. T-Mobile implemented this best practice as a corrective action to prevent a recurrence of this event."
"Network operators and service providers should actively monitor and manage 911 network components using network management controls, where available, to quickly restore 911 service and provide priority repair during network failure events," the report said. "Reasonable 911 network monitoring would have revealed to T-Mobile in real time that the outage was causing call blocking on PSAP administrative lines, but the content of T-Mobile’s PSAP notification manifests that it likely did not understand the extent of its outage’s 911 impact while it was occurring. Had T-Mobile actively monitored its 911 network components, it might have been able to provide more accurate PSAP notification."
The report also suggested that network reliability standards bodies study: (1) "[w]hether VoLTE providers should prioritize redundancy for links that provide transport for signaling and registration traffic between IP Multimedia Subsystem cores and other networks"; and (2) "[w]hether, during any provisioning or rearrangement of IP Multimedia Subsystem routes, a VoLTE provider should prioritize audits of all signaling and registration traffic that would need to be rerouted in the event that the IP Multimedia Subsystem becomes unavailable."
The report concluded, "In keeping with past practice, the Bureau plans to release a Public Notice, based on its analysis of this and other recent outages, reminding companies of industry-accepted best practices, including those recommended by the Communications Security, Reliability and Interoperability Council, and their importance. In addition, the Bureau will contact other major transport providers to discuss their network practices and will offer its assistance to smaller providers to help ensure that our nation’s communications networks remain robust, reliable, and resilient."
In a statement today, T-Mobile said, "We take our commitment to keep our customers connected seriously and our thousands of dedicated and passionate engineers are working to deliver for our customers every single day. Immediately following this incident back in June we took the necessary steps to address the issues that created the service interruption and remain committed to continual improvement."
Dan Henry, director-government affairs for NENA, said that his group "appreciates the FCC staff report on the T-Mobile outage and thanks the Commission for their continued commitment to improving wireless-network resiliency and ensuring that all Americans have 24/7/365 uninterrupted access to 9-1-1 and emergency services. When outages like this do occur, it is important to conduct a thorough investigation to understand what happened and why. NENA continues to work alongside government and industry to develop and implement best practices that ensure no call for help ever goes unanswered."
"We are pleased to see the FCC’s rapid inquiry into this outage," said Harriet Rennie-Brown, executive director of the National Association of State 911 Administrators. "The reliability of networks is paramount to ensuring access to 9-1-1 during a consumer’s greatest time of need."
But Public Knowledge criticized the report released today.
"The staff report makes it clear that T-Mobile did not follow best industry practices. The report also makes clear that this resulted in serious disruption and financial loss for individuals and businesses. Nevertheless, the Report proposes no fines or corrective measures other than further study," complained Harold Feld, senior vice president of Public Knowledge.
"Once again, the Commission is confronted with stark evidence that voluntary best practices are not enough," he added. "Companies need clear rules to follow, and clear penalties when companies fail to follow them. The American people have a right to expect that their critical communications networks work reliably. Protecting that right is the core mission of the Federal Communications Commission. If the FCC will not hold T-Mobile or other carriers accountable, then Congress needs to hold the FCC accountable for its constant refusal to adopt mandatory reliability safeguards." —Paul Kirby