
Lessons Learned: Flashback to Summer’s Great Rogers Outage (Part 2)

by Malik Datardina, CPA, CA, CISA, GRC Strategist, Auvenir

In our last post, we looked at the Great Rogers Outage of 2022.

 

Millions of Canadians experienced life without mobile and Internet service – a necessity in our pandemic life. The cause was traced back to a system change gone wrong. It appears that although Rogers had tested some parts of the planned change, the testing was insufficient to identify all the issues. The result was that the network was flooded with traffic and the systems then went down.

 

What are some lessons we can learn from this outage?

 

Major control frameworks, like COBIT and ISO 27001, and audit standards, like SOC 2, require that management implement change management controls. Consequently, the outage presents a unique opportunity to understand what can go wrong when it comes to change management. Moreover, it highlights what types of controls are relevant in a real-life scenario, as Rogers documented in its submission to the CRTC.

 

With that in mind, let’s look at four lessons from the Great Rogers Outage of 2022. 

 

Lesson #1: The Importance of Redundancy

 

When commenting on the impact of the outage on governments within Canada, Rogers noted: “It is important to note that in most of the cases, we provide a portion of the telecommunications solution, but not all underlying services. Many institutional customers have redundant services.” 

 

Also, as previously noted, Rogers had “established reciprocal agreements between Rogers and Bell, and between Rogers and TELUS, to exchange alternate carrier SIM cards in support of Business Continuity.”

 

The implication of this lesson is that we should try to diversify the telecom providers within our professional and personal lives. For example, my personal device is provisioned through Fido (a Rogers sub-brand), while my work cell is provisioned through Bell.  

 

Lesson #2: Test, Test, Test

 

They say that in real estate it’s about location, location, location. In change management it’s test, test, test. In the aftermath of the outage, Rogers doesn’t deny that it needs to review its change implementation process:

 

“Most importantly, Rogers is examining its “change, planning and implementation” process to identify improvements to eliminate risk of further service interruptions.”

 

To be fair, it’s not like there was no testing done. Instead, Rogers had used a phased approach to rolling out the change:

 

“Concerning the July 8th outage, the proposed activities were very carefully reviewed, as we normally do with all network changes. We validated all aspects of this change.  In fact, we had begun introducing this change weeks ago, on February 8th and had already implemented successfully the first five (5) phases in our core network.” 

 

It’s a good reminder that in the world of IT General Controls, and IT Risk Management more broadly, it’s not about what goes right but what goes wrong. Consequently, companies should ensure that the scenarios tested are comprehensive enough to identify hidden assumptions or dependencies. For example, Rogers had a procedure that relied on “alternate carrier SIM cards”. Hypothetically, testing this procedure ahead of time could have revealed whether employees could find their SIM cards, and how they would activate them without Internet access.
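
One way to make such hidden dependencies visible is to list each continuity asset alongside the carriers it needs in order to work, and then ask which assets would be unusable if a given carrier failed. The following is only a minimal sketch of that idea: the carrier names, the asset list, and the function name are hypothetical illustrations, not details from the Rogers submission.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContinuityAsset:
    name: str
    depends_on: frozenset  # carriers this asset needs in order to function


def unusable_during_outage(assets, failed_carrier):
    """Return the names of assets that rely on the failed carrier."""
    return [a.name for a in assets if failed_carrier in a.depends_on]


# Hypothetical inventory: "CarrierA" is the primary network, "CarrierB" the alternate.
assets = [
    ContinuityAsset("alternate-carrier SIM card", frozenset({"CarrierB"})),
    ContinuityAsset("SIM activation portal login", frozenset({"CarrierA"})),
    ContinuityAsset("social media second factor", frozenset({"CarrierA"})),
]

blocked = unusable_during_outage(assets, "CarrierA")
print(blocked)
```

In this made-up inventory, the backup SIM card itself survives the outage, but the steps needed to activate it and to post updates do not, which is the kind of gap a walkthrough test is meant to surface.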

 

Lesson #3: Planning Crisis Communications from Content to Channels

 

According to the Rogers submission, the company conducted the following communications:

 

“During the outage, Rogers communicated with customers across several different channels, including social media, media outlets, Rogers Sports & Media properties, website banners, virtual assistants, interactive voice responses (“IVR”), public service announcements and community forums. In addition, Rogers’ CEO conducted broadcast interviews with CP24, Global News, CTV News, BNN, and CityNews. Rogers SVP of Access Networks & Operations also conducted broadcast interviews on CBC and CityNews.”

 

The following CBC news clip illustrates what was communicated and how:

 

 

As can be seen, the reporter was a little surprised that they got a message from the IT team – instead of Rogers themselves. However, Rogers did admit that they “will be updating [their] plans and procedures”. Specifically, they plan to:

 

  • Equip the communications team with “back-up devices on [an] alternate network”
  • Be more timely “in posting details to customer care channels, web properties, social media, as well as public service announcements (“PSAs”) across media properties”
  • Provide more frequent updates “even if there is limited or no additional information to share”
  • Determine an alternative way for the communications team to authenticate themselves, when the second-factor registered with the social media service is reliant on “a device on the Rogers network”
  • Provide specific “status of critical services (such as 9-1-1), how they may be impacted by the outage, and advice for customers”

 

 

The outage is a good illustration of how critical crisis communications can be. Maintaining effective communications with customers or other stakeholders is key to minimizing the reputational damage that such incidents can potentially have.

 

Lesson #4: Monitoring

 

The final takeaway is the importance of having resources and tools to monitor the restoration efforts. That is, the fixes deployed may not resolve all the issues. Rogers reported the following results with respect to bringing things back online:

 

"Once the technology team confirmed stability of our core network, and that traffic volumes were returning to normal level across the network, we proceeded to inform customers that our network and systems were returning to fully operational service for the vast majority of our customers. We also notified them that some customers may experience intermittent issues, and that our technology teams are monitoring and would work to resolve any issue as quickly as possible.” 

 

As can be seen, Rogers was able to restore the service for the vast majority of customers. However, there were a few that still experienced lingering issues. Consequently, it’s important to have continuous monitoring in place to ensure that the service is restored fully before returning to business as usual.
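
As a rough illustration of what "restored fully" could mean in practice, the sketch below treats a network segment as stable only after a set number of consecutive healthy checks, and reports the segments that still need attention. The segment names, the sample data, and the three-check threshold are assumptions made for this example, not figures reported by Rogers.

```python
def still_unstable(history, required_streak=3):
    """Return segments whose most recent checks are not all healthy.

    `history` maps a segment name to a list of booleans, oldest first,
    where True means the health check passed.
    """
    return sorted(
        segment
        for segment, samples in history.items()
        if len(samples) < required_streak or not all(samples[-required_streak:])
    )


# Hypothetical health-check results collected after the fix was deployed.
history = {
    "core": [False, True, True, True],
    "wireless-east": [True, False, True, True],
    "wireline-west": [True, True, True],
}

pending = still_unstable(history)
print(pending)
```

Here one segment has passed its two latest checks but failed the one before, so it stays on the watch list rather than being declared restored. That is the intermittent-issue case the Rogers statement describes.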

 

Closing thoughts

 

The incident highlights how dependent society has become on wireless carriers for its day-to-day transactions and functioning. Vass Bednar (also interviewed in the CBC news clip above) summarized the situation in an op-ed in the Globe and Mail as follows:

 

“Enormous advances in mobile tech have made Canada's telecoms enormously powerful, and that power has consolidated in just five major players. That number threatens to get smaller, too, with the proposed Rogers-Shaw merger currently under review by Canada's Competition Bureau. If the deal goes through, the company that caused so many Canadians to lose connection with each other would serve roughly 40 per cent of all households in English Canada… it reinforced the idea that our telecommunication networks are vital public infrastructure that is controlled by private corporations. We've lost sight of that balance, despite the ways we rely on those networks.”

 

As discussed in the first takeaway, the issue of redundancy is paramount when it comes to ensuring ongoing access. Ironically, the lack of sufficient alternatives in the mobile carrier space amplifies the availability risk for us all.

 

Malik Datardina, CPA, CA, has more than 20 years of experience in information systems, risk and assurance, information security governance and audit data analytics. In his current role as a Governance, Risk Management, and Compliance (GRC) Strategist, he manages internal compliance at Auvenir and takes a strategic lens towards the latest trends in innovation to build the audit platform of the future.

LinkedIn Profile