Incident Communications Procedures

Purpose

This document specifies both the duties and the communications required of various Office of Information Technology (OIT) personnel while they are responding to an incident that involves a computing resource that is provided or managed by OIT.

Definitions

In some circumstances, an OIT staff member may be serving in more than one of the roles defined here.

  • Service
    A combination of people, processes and technology that supports a customer’s business process.

    • Externally facing services support customers outside of OIT; e.g., email.
    • Internally facing services (“Component Services”) support teams or units within OIT; e.g., server firewall.
  • Incident
    Abnormal functioning or outage of a service, server, or system provided by or managed by OIT.
  • First Responder
    OIT staff member who:

    • is first to recognize that an incident is occurring but usually cannot resolve it alone
    • is usually an Operations or NC State Help Desk staff member but can be any OIT staff member
    • contacts the appropriate Service Team about the incident
      See Duties of a First Responder (below).
  • Service Team
    Group of OIT staff members responsible for maintaining a particular OIT-provided service or group of services.
    See Duties of a Service Team (below).
  • Service Responder
    Member of a Service Team, typically the manager of the service (or his/her designee), who is contacted by a First Responder about an incident and who accepts responsibility for responding to it.
    See Duties of a Service Responder (below).
  • Service Owner
    OIT staff member who:

    • oversees all aspects of a service (typically an internally facing one) in which the root cause of an incident appears to be located
    • likely knows the impact of an incident on all components of the service and perhaps on related services
    • will most likely be responsible for the overall coordination of OIT’s response to an incident involving the service.
      See Duties of a Service Owner (below).
  • Stakeholders
    Users (“customers”) of a service as well as OIT staff who have a vested interest in the service or are responsible for it.

Duties of a First Responder

  1. If the First Responder knows which Service Team is responsible for the affected service, skip to Step 2. Otherwise, report the incident as follows:
    • During business hours:
      NC State Help Desk (919-515-4357)
      or
      Network Operations Center (NOC) (919-513-9675)
    • After business hours:
      Operations (919-515-5500)
    Once this report is made, the First Responder’s duties are complete.
  2. Contact the Service Responder on the appropriate Service Team by the fastest of these four means:
    • phone
    • in person
    • chat
    • page, using the Online Pager System, following the instructions for that Service Team’s pager group. If needed, re-page every 10 minutes until receipt is confirmed.
  3. Obtain confirmation from the Service Responder that the Service Team has accepted responsibility for responding to the incident. Once this confirmation is received, the First Responder’s duties are complete.
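
The re-paging rule in Step 2 can be sketched as a simple loop. This is only an illustration under stated assumptions: `send_page` and `is_confirmed` are hypothetical callables standing in for the Online Pager System and for the Service Responder's acknowledgment; this procedure does not describe a programmatic interface for either.

```python
import time
from typing import Callable

# From Step 2: "re-page every 10 minutes until receipt is confirmed".
REPAGE_INTERVAL_SECONDS = 10 * 60


def page_until_confirmed(
    send_page: Callable[[], None],      # hypothetical: sends one page
    is_confirmed: Callable[[], bool],   # hypothetical: True once receipt is confirmed
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Page repeatedly until receipt is confirmed; return the number of pages sent."""
    pages_sent = 0
    while True:
        send_page()
        pages_sent += 1
        # Wait one full interval before checking whether to page again.
        sleep(REPAGE_INTERVAL_SECONDS)
        if is_confirmed():
            return pages_sent
```

Passing `sleep` as a parameter is a design choice made here so the loop can be exercised without actually waiting 10 minutes between pages.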

Duties of a Service Responder

  1. Immediately confirm to the contacting party (First Responder, NC State Help Desk, NOC, or Operations), in person, by phone, or through a previously established chat channel (e.g., OIT jabber chat room), that the Service Team has accepted the responsibility for responding to the incident.
  2. Notify the other Service Team members about the incident.

Duties of a Service Team

  1. Determine the following:
    • number of Stakeholders affected by the incident (tens? hundreds? thousands?)
    • type of impact (e.g., full service outage, reduced functionality)
    • scope of impact to Stakeholders (e.g., Portal outage, AutoCAD license server outage)
    • expected time required to resolve the incident (minutes? hours? days?)
  2. Respond to the incident in one of the following two ways, based on its impact:
    • Incident that has relatively low impact
      If the incident meets ALL three of the following criteria, the on-call Service Team member will inform the Stakeholder and all other appropriate individuals via SysNews (and other means, as needed) as soon as the incident has been resolved. No further communication is needed.

      • can be resolved by the on-call Service Team within one hour
      • has no negative effect on critical business functions
      • adversely affects no more than one Stakeholder
    • Incident that has relatively high impact
      The on-call Service Team member will contact the Service Owner by phone or in person and transfer the responsibility for communicating about the incident response if the incident meets AT LEAST ONE of these three criteria:

      • will require a period of hours and multiple staff members to resolve
      • has a negative effect on critical business functions
      • adversely affects two or more Stakeholders
  3. Communicate status information to the Service Owner throughout the incident.
  4. (Optional or as requested by key Stakeholders) Provide targeted information through additional means of communication.
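
The low-impact and high-impact criteria in Step 2 can be sketched as a small decision function. The `IncidentAssessment` record and its field names are hypothetical, introduced only for illustration. The two sets of criteria are not exact complements (for example, an incident that takes more than one hour but needs only one staff member meets neither), so this sketch returns "unclassified" in that case rather than guessing at the intended handling.

```python
from dataclasses import dataclass


@dataclass
class IncidentAssessment:
    """Hypothetical record of what the Service Team determined in Step 1."""
    resolvable_within_one_hour: bool
    needs_hours_and_multiple_staff: bool
    affects_critical_business_functions: bool
    stakeholders_affected: int


def classify(a: IncidentAssessment) -> str:
    """Return "high", "low", or "unclassified" per the Step 2 criteria."""
    # High impact: AT LEAST ONE of the three criteria is met.
    if (
        a.needs_hours_and_multiple_staff
        or a.affects_critical_business_functions
        or a.stakeholders_affected >= 2
    ):
        return "high"
    # Low impact: ALL three criteria are met.
    if (
        a.resolvable_within_one_hour
        and not a.affects_critical_business_functions
        and a.stakeholders_affected <= 1
    ):
        return "low"
    # Neither list applies; the procedure does not say how to proceed.
    return "unclassified"
```

Under this sketch, "high" corresponds to contacting the Service Owner by phone or in person, and "low" corresponds to a SysNews post once the incident is resolved.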

Duties of a Service Owner

After accepting responsibility for coordinating the incident response, the Service Owner will become the hub of all incident-related communications, including:

  • determining which of his/her team members will need to help respond to the incident
  • determining whether assistance in posting updates will be needed from the NC State Help Desk or Project Portfolio Services (PPS)
  • maintaining contact with the Service Team and any other response personnel as they attempt to restore the service
  • updating the response status to all Stakeholders

Project Portfolio Services (PPS)

The Service Owner should confer with PPS (ors@ncsu.edu) about the incident if at least one of the following is true:

  • A large number of Stakeholders are affected.
  • Affected Stakeholders are performing work critical to university business.
  • Multiple services are affected (e.g., Data Center issues).
  • The incident has remained unresolved for an extended period of time.

Even if none of these applies, the Service Owner may still want to confer with PPS about the incident.

Initial SysNews post

Immediately after accepting responsibility for coordinating the incident response, the Service Owner must ensure that the Stakeholders are informed about the incident’s status.

  • If an initial SysNews post has already been made by a member of the NC State Help Desk or Operations (who in many cases will have been the First Responder), then the Service Owner should use that post to make subsequent updates as indicated below.
  • If no initial SysNews post has been made, the Service Owner should create one as follows:
    1. Go to SysNews.
    2. In the User Info section, if you are already logged in, skip to Step 3; otherwise, click the login now link and follow the prompts to enter your Unity ID and password. (If you need assistance making the post, call one of the numbers listed under Duties of a First Responder (above) as appropriate.)
    3. In the System Announcements section, look for the Post a News Entry link (near the bottom).
      • If this link is present, then click on it and continue to Step 4.
      • If this link is not present, then skip the rest of these instructions and call one of the numbers listed under Duties of a First Responder (above) as appropriate for assistance in making the post.
    4. Click the New Post button.
    5. Follow the instructions for completing the form.
    6. If applicable, post a notice about the incident to additional groups (e.g., Online Notify) as indicated at the bottom of the form.

Updates

  • Updates should be posted to SysNews before posting them through other channels.
  • The Service Owner is responsible for submitting these updates but may contact the NC State Help Desk or ORS for assistance in posting them.
  • SysNews posts are accessible to both the general public and the entire campus.
  • These posts ensure that all Stakeholders are notified, especially those whose work effectiveness and efficiency are most heavily impacted by the incident.

To update a SysNews post (create a follow-up post):

  1. Go to SysNews.
  2. In the User Info section, if you are already logged in, skip to Step 3; otherwise, click the login now link and follow the prompts to enter your Unity ID and password. (If you need assistance making the post, call one of the numbers listed under Duties of a First Responder (above) as appropriate.)
  3. In the System Announcements section, click on the Post a News Entry link (near the bottom).
  4. On the page that opens, in the list of posts, highlight the one you want to update and do one of the following:
    • Click on the Followup On Post button
      or
    • Click on the Edit Post button and from the Type drop-down menu, select FollowUp.
  5. Follow the instructions for completing the form.
  6. If applicable, post a notice about the incident to additional groups (e.g., Online Notify) as indicated at the bottom of the form.

Update frequency

For the duration of the incident response, the Service Owner must post frequent updates to all Stakeholders. Frequency will depend on the nature, scope, and timing of the incident. In general:

  • every 1-2 hours during business hours
  • every 3-4 hours during non-business hours
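
As a rough illustration of these intervals, the sketch below computes the latest time by which the next update would be due, using the upper end of each range (2 hours and 4 hours). The definition of business hours (Monday to Friday, 8:00 to 17:00) is an assumption made for the example; this document does not define them. As a simplification, the interval is chosen from the time of the last update only.

```python
from datetime import datetime, time, timedelta

# Assumed business hours; not specified in this procedure.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(moment: datetime) -> bool:
    """True on Monday-Friday between the assumed start and end times."""
    return moment.weekday() < 5 and BUSINESS_START <= moment.time() < BUSINESS_END


def latest_next_update(last_update: datetime) -> datetime:
    """Latest time the next update is due, using the upper end of each range."""
    if is_business_hours(last_update):
        max_gap = timedelta(hours=2)   # "every 1-2 hours during business hours"
    else:
        max_gap = timedelta(hours=4)   # "every 3-4 hours during non-business hours"
    return last_update + max_gap
```

Because frequency also depends on the nature, scope, and timing of the incident, a value computed this way is at most an outer bound, not a schedule.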

Update content

When creating an update, the Service Owner must:

  • indicate any change in the status of the incident
  • indicate any change in the service restoration time frame
  • include additional information as needed by Stakeholders

Tailored communication

In addition to SysNews posts, the Service Owner should use the means of communication most appropriate for (or requested by) specific groups of Stakeholders. In general, a more immediate level of communication requires a faster means.

  • For the general public and all other Stakeholders:
    • SysNews posts
  • For OIT technical staff responding to the incident:
    • previously established chat channel
    • phone
    • in-person contact
    • email, but without lengthy message threads that may not include all of the necessary staff
  • For the NC State Help Desk and ORS (and Operations, if after hours):
    • Help Desk jabber chat channel as much as possible
    • email
    • phone
    • in-person contact
  • For the Directors and CIO:
    • email summary
    • other means as requested

Additional communications

Depending on the Stakeholders and the service involved, it may also be appropriate for the Service Owner to communicate the status of the incident through one or more of the following:

  • Online Notify
  • service-specific email lists (e.g., Majordomo, Bronto)
  • websites (e.g., OIT site, main NC State site)
  • OIT News
  • Technician
  • News Services

Final actions

  • When the service has been restored, the Service Owner will send a final follow-up SysNews post to all Stakeholders, ensuring that they are aware of:
    • what happened
    • what was done to restore the service
    • resumption of the service
  • If an after-action review is needed, the Service Owner will coordinate with ORS to arrange for it.
