Landstrike is a fictional disaster novel about a hurricane that hits the New York City area. Amazon.com called it “the gripping story of Hurricane Nicole’s … catastrophic rampage up New York City’s Hudson River. It is also an enthralling account of the horrific storm’s aftermath as residents, suddenly isolated from the world without electricity, food, water, and even communications, try to survive in Stone Age conditions!” (emphasis added)
With Superstorm Sandy fresh in our memory, the statement that the fictional hurricane was so bad that “even communications” were lost seems quaintly naïve. But it captures the depth of a long-standing faith in the reliability of the public switched telephone network (PSTN).
Events of last year shattered that faith. The surprise “derecho” storm that hit the Washington, D.C. area cut off access to 9-1-1 for two million people. Four months later, Sandy flooded central offices, crippled cell service and cut power for so long that back-up generators were running out of fuel. The devastating storms of 2012 are the most recent events to spur enterprises to scrutinize their business continuity and disaster recovery (BC/DR) plans.
This article details the emerging lessons of Sandy and the derecho outlined in the FCC’s recent report, Impact of the June 2012 Derecho on Communications Networks and Services. While the report focuses on service providers, its lessons are equally important to enterprises, including critical infrastructure industries.
What All Enterprises Can Learn from the FCC Report
To the FCC, Superstorm Sandy in October, Hurricane Isaac in September, and the derecho in June all “highlighted shortcomings in the reliability and resiliency of communications, and raised concerns about commercial power and telecommunications providers’ implementation of procedures to ensure adequate backup power. Moreover, such events shed light on the possible impact of power outages on consumers who rely at their premises on communications devices that operate on commercial power … and/or have a limited battery life …”
These problems are preventable: “Communications failures during the derecho revealed that many providers failed to implement crucial best practices developed by CSRIC (the Communications Security, Reliability, and Interoperability Council) that could have prevented or mitigated many of the storm’s most serious effects …”
Here are the core lessons enterprises need to learn:
Five Nines Ain’t What it Used to Be: In the good old days, the PSTN delivered high reliability (though voice minutes cost a quarter). A recent article (Want a Five Nines Network?) noted that “[c]onverged networks today rarely” achieve 99.999% availability, which translates into “5.25 minutes of downtime per year … Most networks blow that figure in the first month.”
Lesson: Scrutinize your network, your provider(s), and your BC/DR plans. Plan not just to restore services, but also to operate with degraded or reduced connectivity.
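The arithmetic behind the "five nines" figure is easy to check. The short sketch below converts an availability percentage into permitted downtime per year. The function name and the 365-day year are illustrative assumptions, not something taken from the cited article.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.2f} minutes of downtime per year")
```

For 99.999% the result is roughly 5.26 minutes, in line with the quoted figure. Running the same calculation against the service levels in your own agreements shows how much downtime a provider has actually committed to tolerate.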
Find a Way to Keep the Lights On: The FCC found that many 9-1-1 communications failures were caused by a loss of commercial power followed by generator failures. While wireless networks fared better during the derecho, the FCC noted that many wireless providers do not know how long battery back-up at cell sites will last. The FCC also found numerous instances of power-related design flaws, such as critical monitoring equipment with only 30 minutes of battery back-up power and dual generators that could not operate independently.
Lesson: Determine how much back-up power you need and engineer it properly.
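As a rough starting point for that sizing exercise, the sketch below estimates how long a site could ride through an outage on batteries and a generator. Every number and name in it is a hypothetical placeholder. It also assumes a simple model in which the battery carries the load first and the generator then runs until its fuel is gone. Real engineering should rely on manufacturer data and measured loads.

```python
def battery_runtime_hours(capacity_wh: float, load_w: float,
                          usable_fraction: float = 0.8) -> float:
    """Estimate battery runtime; usable_fraction is an assumed derating factor."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return capacity_wh * usable_fraction / load_w


def generator_runtime_hours(fuel_gallons: float, burn_gallons_per_hour: float) -> float:
    """Estimate generator runtime from on-site fuel and an assumed burn rate."""
    if burn_gallons_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_gallons / burn_gallons_per_hour


def covers_outage(battery_h: float, generator_h: float, outage_h: float) -> bool:
    """Simplified check: battery and generator run back to back, with no refueling."""
    return battery_h + generator_h >= outage_h


# Hypothetical site: 4,800 Wh battery string, 1,200 W load,
# 500 gallons of fuel burned at 7 gallons per hour.
battery_h = battery_runtime_hours(4800, 1200)
generator_h = generator_runtime_hours(500, 7)
print(f"Battery: {battery_h:.1f} h, generator: {generator_h:.1f} h")
print("Covers a 72-hour outage:", covers_outage(battery_h, generator_h, 72))
```

The point of the exercise is less the exact numbers than knowing them in advance. The FCC's findings suggest many providers could not answer the battery question for their own cell sites.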
Update and Test BC/DR Plans Regularly: While Sandy’s size and trajectory were tracked for days, the derecho struck with minimal warning. Disaster recovery became a “come as you are” affair that exposed widespread weaknesses. As a result, Verizon now plans to perform a “complete review and update of monthly and annual preventative maintenance requirements for generators, batteries, and rectifiers that supply power at host central offices. Power technicians will be trained in critical facility ‘blackout’ testing to simulate total commercial power failure, as well as manual generator start procedures and ‘prioritized system load transfer’ scenarios to distribute backup power to critical equipment.”
Lesson: Implement a similar testing plan.
Build BC/DR into Your Service Agreements: Here are several easy BC/DR “wins” in your service agreements:
Force majeure: carriers’ default force majeure clauses are overbroad and too easily excuse their failure to perform. Carriers should commit to taking reasonable precautions to mitigate the effects of such events.
Cooperation, testing and planning: these provisions can a) address cooperation between the parties during a disaster; b) allow customers to attend carrier disaster recovery exercises; c) foster the exchange of BC/DR plans and network diagrams; and d) let customers set restoration priorities during multiple outages.
Service levels: customers with specialized needs can negotiate special service levels for priority sites.
Negotiate agreements with more than one carrier: a multi-vendor strategy increases your options and boosts your network’s reliability.
Lesson: Your BC/DR plan starts with your service agreements.
Lessons for Public Safety, Utilities, and other Providers of Critical Services
Diversity Audits: The FCC Report proposed requiring audits of the physical routes of 9-1-1 circuits and ALI links “to verify diversity and understand, avoid, or address instances where a single failure causes loss” of connectivity.
Lesson: Don’t take the carriers’ word for it; verify that you have the necessary physical and logical diversity for critical sites.
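Once a carrier has supplied the physical route for each circuit, the core of a diversity check is simple: look for any facility that every route passes through. The sketch below illustrates the idea. The route data and facility names are invented for illustration, and a real audit would also need to examine shared conduits, power feeds, and logical dependencies.

```python
def single_points_of_failure(routes: list[list[str]]) -> set[str]:
    """Return the facilities that appear on every route serving a site.

    Each route is a list of intermediate facility identifiers (central
    offices, conduits, and so on). A facility common to all routes means
    that one failure there would cut every path at once.
    """
    if not routes:
        return set()
    return set.intersection(*(set(route) for route in routes))


# Hypothetical routes for two "diverse" circuits serving the same site.
primary = ["CONDUIT-A", "CO-MAIN", "TANDEM-1"]
backup = ["CONDUIT-B", "CO-MAIN", "TANDEM-2"]

shared = single_points_of_failure([primary, backup])
if shared:
    print("Not fully diverse; shared facilities:", sorted(shared))
else:
    print("No shared facilities found on the supplied routes")
```

In this made-up example both circuits pass through the same central office, so the "diverse" pair would fail together if that office lost power.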
Service Priority Programs: The Telecommunications Service Priority Program (TSP), the Government Emergency Telecommunications Service (GETS) and the Wireless Priority Service (WPS) can help some customers obtain priority provisioning and restoration. However, participants must pre-register, service is “best efforts” only, and the programs include no guarantees.
Lesson: Service priority programs are valuable, but are not perfect.
Don’t Expect Regulators to Watch Your Back, Even if You’re “Critical”
Regulators are just starting to address whether they can help users (including public safety and other “critical infrastructure” users) obtain quality of service on broadband networks. Indeed, broadband providers question whether the FCC has statutory authority to impose mandates to improve network reliability. Others have suggested that reliability will improve if the FCC encourages carriers to adopt “best practices” while encouraging users to take self-help measures like purchasing diverse services for mission-critical sites and maintaining duplicate “hot sites” for critical data and applications. Users should press their vendors to answer the critical technical questions that the sales team can’t or won’t.
IP-based networks raise particular concerns for “critical infrastructure” users, e.g., utilities, financial services, transportation, government services, and public health. These sectors are highly interdependent, thereby increasing the potential for cascading effects from network disruptions. Moreover, the mingling of traffic on IP networks limits the carriers’ ability to offer priority restoration following a disaster or to guarantee a certain level of reliability or throughput to any particular customer. While IP networks are highly robust, don’t assume that the underlying network infrastructure will survive or be restored in any particular timeframe. If service can be lost to 9-1-1 call centers in a major metropolitan area, outages can hit any customer anywhere. Moreover, don’t assume that regulators can do much beyond issuing reports on what should have been done.
It will take months before regulators, service providers and enterprises grasp the lessons of the communications outages of 2012 for business continuity and disaster recovery. But you can take the steps outlined above and realize tangible and cost-effective gains in reliability and robustness.