Incident Communications Procedures

Purpose

This document specifies both the duties and the communications required of various Office of Information Technology (OIT) personnel while they are responding to an incident that involves a computing resource that is provided or managed by OIT.

Definitions

In some circumstances, an OIT staff member may be serving in more than one of the roles defined here.

  • Service
    A combination of people, processes and technology that supports a customer’s business process.

    • Externally facing services that support customers outside of OIT (e.g., email).
    • Internally facing services that underpin externally facing services (e.g., the campus network).
  • Incident
    Outage or degradation of a service provided by or managed by OIT.
  • First Responder
    OIT staff member who:

    • Is first to recognize that an incident is occurring but usually cannot resolve it alone.
    • Is usually an Operations or NC State Help Desk staff member but can be any OIT staff member.
    • Contacts the appropriate service team about the incident.
      See Duties of a First Responder (below).
  • Service Team
    Group of OIT staff members responsible for maintaining a particular OIT-provided service or group of services.
    See Duties of a Service Team (below).
  • Service Responder
    Member of a service team, typically the manager of the service or his/her designee, who is contacted by a first responder about an incident and who accepts responsibility for responding to it.
    See Duties of a Service Responder (below).
  • Service Offering Manager
    OIT staff member who:

    • Oversees all aspects of a service offering in which the root cause of an incident appears to be located.
    • Likely knows the impact of an incident on all components of the service offering and perhaps on related services.
    • Will most likely be responsible for the overall coordination of OIT’s response to an incident involving the service offering.
      See Duties of a Service Offering Manager (below).
  • Stakeholders
    Consumers of a service as well as OIT staff who have a vested interest in the service or are responsible for it.

Duties of a First Responder

  1. If the first responder knows which service team is responsible for the affected service, he/she will skip to Step 2. Otherwise, he/she will report the incident as follows:
    • During business hours:
      NC State Help Desk (919-515-4357)
      or
      Network Operations Center (NOC) (919-513-9675)
    • After business hours:
      Operations (919-515-5500)
    Once this report is made, the first responder’s duties are complete.
  2. Contact the service responder on the appropriate service team by the fastest of the following means:
    • Phone.
    • In person.
    • Instant message / text.
  3. Obtain confirmation from the service responder that the service team has accepted responsibility for responding to the incident. Once this confirmation is received, the first responder’s duties are complete.
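
The routing in Step 1 and Step 2 can be sketched as a small function. This is only an illustration, not an OIT tool: the function name and parameters are hypothetical, and because this document does not define business hours, the caller is assumed to supply that as a flag.

```python
from typing import List, Optional

# Contact details copied from Step 1 of this procedure.
HELP_DESK = "NC State Help Desk (919-515-4357)"
NOC = "Network Operations Center (NOC) (919-513-9675)"
OPERATIONS = "Operations (919-515-5500)"


def first_responder_contacts(service_team: Optional[str],
                             business_hours: bool) -> List[str]:
    """Return the contact options open to a first responder.

    service_team   -- name of the responsible team, or None if unknown
    business_hours -- assumption: the caller decides what counts as
                      business hours; the procedure does not define it
    """
    if service_team is not None:
        # Step 2: the team is known, so go straight to its service responder.
        return [f"Service responder on {service_team}"]
    if business_hours:
        # Step 1, business hours: either contact is acceptable.
        return [HELP_DESK, NOC]
    # Step 1, after hours.
    return [OPERATIONS]
```

For example, `first_responder_contacts(None, business_hours=False)` returns only the Operations contact, after which the first responder's duties are complete.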

Duties of a Service Responder

  1. Immediately confirm to the contacting party (first responder, NC State Help Desk, NOC, or Operations) that the service team has accepted responsibility for responding to the incident. Confirm in person, by phone, or through a previously established chat channel (e.g., the OIT Critical Emergency Google Chat room).
  2. Notify the other service team members about the incident.

Duties of a Service Team

  1. Determine the following:
    • Number of stakeholders affected by the incident. (Tens? Hundreds? Thousands?)
    • Type of impact. (E.g., full service outage, reduced functionality.)
    • Scope of impact to stakeholders. (E.g., portal outage, AutoCAD license server outage.)
    • Expected time required to resolve the incident. (Minutes? Hours? Days?)
  2. Respond to the incident in one of the following two ways, based on its impact:
    • Incident that has relatively low impact
      The on-call service team member will inform the stakeholder and all other appropriate individuals via outage / degradation posts (and other means, as needed) as soon as the incident has been resolved, without the need for further communication, provided that the incident meets ALL of these three criteria:
      • Can be resolved by the on-call service team within one hour.
      • Has no negative effect on critical business functions.
      • Adversely affects no more than one stakeholder.
    • Incident that has relatively high impact
      The on-call service team member will contact the service offering manager by phone or in person and transfer the responsibility for communicating about the incident response if the incident meets AT LEAST ONE of these three criteria:
      • Will require a period of hours and multiple staff members to resolve.
      • Has a negative effect on critical business functions.
      • Adversely affects two or more stakeholders.
  3. Communicate status information to the service offering manager throughout the incident.
  4. Provide targeted information through additional means of communication. (Optional or as requested by key stakeholders.)
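
The low-impact versus high-impact decision in Step 2 can be expressed as a short check. The sketch below is illustrative only: the class and field names are hypothetical, and it makes the simplifying assumption that any incident failing one of the three low-impact criteria is handled as high impact.

```python
from dataclasses import dataclass


@dataclass
class IncidentAssessment:
    """Answers gathered in Step 1 (field names are hypothetical)."""
    resolvable_within_one_hour: bool
    affects_critical_business_functions: bool
    stakeholders_affected: int


def is_low_impact(a: IncidentAssessment) -> bool:
    """True only if ALL three low-impact criteria are met."""
    return (
        a.resolvable_within_one_hour
        and not a.affects_critical_business_functions
        and a.stakeholders_affected <= 1
    )


def response_path(a: IncidentAssessment) -> str:
    """Pick the response path described in Step 2.

    Assumption: an incident that is not low impact is treated as high impact.
    """
    if is_low_impact(a):
        return "post outage / degradation notice once resolved"
    return "contact service offering manager by phone or in person"
```

For instance, an incident that can be fixed in 20 minutes but affects three stakeholders fails the third criterion, so `response_path` directs the on-call member to the service offering manager.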

Duties of a Service Offering Manager

After accepting responsibility for coordinating the incident response, the service offering manager will become the hub of all incident-related communications. This role includes:

  • Determining which of his/her team members will need to help respond to the incident.
  • Determining whether assistance in posting updates will be needed from the NC State Help Desk or Project Portfolio and Process Services (PPPS).
  • Maintaining contact with the service team and any other response personnel as they attempt to restore the service.
  • Updating the response status to all stakeholders.

Project Portfolio and Process Services (PPPS)

The service offering manager should confer with PPPS (ors@ncsu.edu) about the incident if at least one of the following is true:

  • A large number of stakeholders are affected.
  • Affected stakeholders are performing work critical to university business.
  • Multiple services are affected (e.g., data center issues).
  • The incident has remained unresolved for an extended period of time.

Even if none of these applies, the service offering manager may still want to confer with PPPS about the incident.
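
The conferral rule above is an "at least one of" test, which can be sketched as follows. The function and parameter names are hypothetical, and since this document sets no numeric thresholds for "large number" or "extended period", each condition is assumed to be a judgment the service offering manager passes in as a yes/no value.

```python
def should_confer_with_ppps(large_number_affected: bool,
                            critical_work_affected: bool,
                            multiple_services_affected: bool,
                            unresolved_for_extended_period: bool) -> bool:
    """True if at least one of the four listed conditions holds.

    A False result does not forbid conferring with PPPS; the manager
    may still choose to do so.
    """
    return any((
        large_number_affected,
        critical_work_affected,
        multiple_services_affected,
        unresolved_for_extended_period,
    ))
```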

Initial outage / degradation post

Immediately after accepting responsibility for coordinating the incident response, the service offering manager must ensure that the stakeholders are informed about the incident’s status.

  • If an initial post has already been made by a member of the NC State Help Desk or Operations (who in many cases will have been the first responder), then the service offering manager should use that post to make subsequent updates as indicated below.
  • If no initial post has been made, the service offering manager should create one following the directions described in the Creating an Outage knowledgebase article.

Update frequency

For the duration of the incident response, the service offering manager must post frequent updates to all stakeholders. Frequency will depend on the nature, scope, and timing of the incident.

In general, posts should be made:

  • Every 1-2 hours during business hours.
  • Every 3-4 hours during non-business hours.
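
As a rough aid, the general intervals above can be turned into a deadline for the next post. This sketch is an assumption-laden illustration: it uses the upper end of each range (2 hours and 4 hours), the function name is hypothetical, and the caller decides whether the current period counts as business hours.

```python
from datetime import datetime, timedelta

# Upper bounds of the general ranges given above (1-2 h and 3-4 h).
MAX_GAP_BUSINESS_HOURS = timedelta(hours=2)
MAX_GAP_NON_BUSINESS_HOURS = timedelta(hours=4)


def latest_next_update(last_post: datetime, business_hours: bool) -> datetime:
    """Return the latest time by which the next update should be posted."""
    gap = MAX_GAP_BUSINESS_HOURS if business_hours else MAX_GAP_NON_BUSINESS_HOURS
    return last_post + gap
```

Because frequency also depends on the nature, scope, and timing of the incident, the result is best read as a ceiling rather than a schedule.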

Update content

When creating an update, the service offering manager must:

  • Indicate any change in the status of the incident.
  • Indicate any change in the service restoration time frame.
  • Include additional information as needed by stakeholders.

Tailored communication

In addition to outage / degradation posts, the service offering manager should use the means of communication most appropriate for (or requested by) specific groups of stakeholders. In general, the more urgent the communication, the faster the means should be.

  • For the general public and all other stakeholders:
    • Outage / degradation posts.
  • For OIT technical staff responding to the incident:
    • OIT Critical Emergency Google Chat channel.
    • Phone.
    • In-person contact.
    • Email, but without lengthy message threads that may not include all of the necessary staff.
  • For the NC State Help Desk and PPPS (and Operations, if after hours):
    • OIT Critical Emergency Google Chat channel.
    • Email.
    • Phone.
    • In-person contact.
  • For the Directors and CIO:
    • Email summary.
    • Other means as requested.
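
The audience-to-channel list above lends itself to a simple lookup table. The sketch below copies the channels in the order listed; the dictionary keys and the function name are hypothetical labels chosen for illustration.

```python
from typing import Dict, List

# Channels per audience, in the order listed in this section.
CHANNELS_BY_AUDIENCE: Dict[str, List[str]] = {
    "general_public_and_other_stakeholders": [
        "Outage / degradation posts",
    ],
    "oit_technical_responders": [
        "OIT Critical Emergency Google Chat channel",
        "Phone",
        "In-person contact",
        "Email (avoid lengthy threads)",
    ],
    "help_desk_ppps_operations": [
        "OIT Critical Emergency Google Chat channel",
        "Email",
        "Phone",
        "In-person contact",
    ],
    "directors_and_cio": [
        "Email summary",
        "Other means as requested",
    ],
}


def channels_for(audience: str) -> List[str]:
    """Return the listed channels for an audience key, or raise ValueError."""
    if audience not in CHANNELS_BY_AUDIENCE:
        raise ValueError(f"Unknown audience: {audience!r}")
    return CHANNELS_BY_AUDIENCE[audience]
```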

Additional communications

Depending on the stakeholders and the service involved, it may also be appropriate for the service offering manager to communicate the status of the incident through one or more of the following:

  • Online Notify.
  • Service-specific email lists (e.g., Majordomo, Bronto).
  • Websites (e.g., OIT site, main NC State site).
  • OIT News.
  • Technician.
  • News Services.

Final actions

  • When the service has been restored, the service offering manager will send a final follow-up outage / degradation update to all stakeholders, ensuring that they are aware of:
    • What happened.
    • What was done to restore the service.
    • Resumption of the service.
  • If an after-action review is needed, the service offering manager will coordinate with PPPS to arrange for it.
