February 2 started just like any other Thursday morning, with students and faculty heading to class and University staff arriving at their offices to begin the workday.
By 8:30 a.m., Illinois State’s IT infrastructure was facing one of its most serious outages in recent memory, preventing Redbirds from using critical services like ReggieNet, Campus Solutions, iPeople, and even campus phones. A rapid, all-hands-on-deck response led by Administrative Technologies (AT) diagnosed the problem before lunchtime, with most services restored within six hours.
The work is far from over. AT is digging deep into the cause of the issue and is using the outage to further hone a new incident-response plan that, as of February 2, was only a few days old. The episode also revealed a deep level of cooperation between AT and various IT units across campus.
“There was urgency without panic,” said Craig Jackson, director of Infrastructure, Operations and Networking (ION) in AT. “It was great to see how the campus community worked together.”
The first sign of trouble came around 8:15 a.m. on February 2, when campus phones and the chat service Jabber began acting up. Within 15 minutes, “everybody was realizing this is a major incident,” Jackson said.
AT snapped into action. While a technical team hopped on a conference call and tried to diagnose the problem, other AT staff gathered in a Situation Room in Julian Hall and in a shared chat room to plan communications to campus about the outage. Several campuswide emails and Tech Alerts were sent throughout the day, as were posts and tweets from AT’s social media accounts.
“We wanted people to know we were very focused on fixing it,” said Ed Vize, who handles communications for AT’s Client Services team.
The outage was also a valuable experience for AT’s new team of trained incident coordinators in the Technology Support Center that formally launched February 1, Vize said.
Finding the culprit
Jackson praised AT’s partner IT units across campus for offering up manpower and technical expertise to help solve the problem. But those first few hours were the most frustrating, simply because it was so challenging to identify what exactly was wrong and what was causing it.
“For IT people like us, it’s the not knowing part,” Jackson said. “We want to give answers as fast as everyone wants to have answers.”
AT is still not certain what caused the problem, but the prime suspect is a small piece of failed hardware: a network connector about the size of a flash drive. That hardware plays a crucial role in connecting ISU’s virtualized servers with the campus network and the storage needed to make everything work.
Once that issue was identified and fixed around 11:30 a.m., systems began to recover one by one. (Around midday, AT temporarily opened alternative ways to log into ReggieNet and iPeople.)
AT continues to study the incident and address key questions, such as why a backup system—called a “redundancy”—did not kick in as it is supposed to do during a big outage like this.
“I want to acknowledge that I know this outage disrupted work and academics across the University for most or part of the day, unfortunately inconveniencing the community on a busy day,” said Charles Edamala, chief technology officer and associate vice president in AT. “While hardware failures are inevitable, this particular event exposed a serious vulnerability that my teams are focusing on mitigating. They have engaged vendor support to help understand the root cause and to engineer a solution around it.
“I offer my sincere gratitude to the ISU community for their patience,” he added.
Ryan Denham can be reached at rmdenha@IllinoisState.edu.