Contents
- Purpose
- Definitions
- Duties of a First Responder
- Duties of a Service Responder
- Duties of a Service Team
- Duties of a Service Offering Manager
Purpose
This document specifies both the duties and the communications required of various Office of Information Technology (OIT) personnel while they are responding to an incident that involves a computing resource that is provided or managed by OIT.
Definitions
In some circumstances, an OIT staff member may be serving in more than one of the roles defined here.
- Service
A combination of people, processes and technology that supports a customer’s business process.
- Externally facing services that support customers outside of OIT; e.g. email.
- Internally facing services that underpin externally facing services; e.g. the campus network.
- Incident
Outage or degradation of a service provided by or managed by OIT.
- First Responder
OIT staff member who:
- Is first to recognize that an incident is occurring but usually cannot resolve it alone.
- Is usually an Operations or NC State Help Desk staff member but can be any OIT staff member.
- Contacts the appropriate service team about the incident.
See Duties of a First Responder (below).
- Service Team
Group of OIT staff members responsible for maintaining a particular OIT-provided service or group of services.
See Duties of a Service Team (below).
- Service Responder
Member of a service team, typically the manager of the service or his/her designee, who is contacted by a first responder about an incident and who accepts responsibility for responding to it.
See Duties of a Service Responder (below).
- Service Offering Manager
OIT staff member who:
- Oversees all aspects of a service offering in which the root cause of an incident appears to be located.
- Likely knows the impact of an incident on all components of the service offering and perhaps on related services.
- Will most likely be responsible for the overall coordination of OIT’s response to an incident involving the service offering.
See Duties of a Service Offering Manager (below).
- Stakeholders
Consumers of a service as well as OIT staff who have a vested interest in the service or are responsible for it.
Duties of a First Responder
- If the first responder knows which service team is responsible for the affected service, he/she will skip to Step 2 (contacting the service responder); otherwise, he/she will report the incident as follows:
- During business hours:
NC State Help Desk (919-515-4357)
or
Network Operations Center (NOC) (919-513-9675)
- After business hours:
Operations (919-515-5500)
Once this report is made, the first responder’s duties are complete.
- Contact the service responder on the appropriate service team by the fastest of the following means:
- > Phone.
- > In person.
- > Instant message / text.
- Obtain confirmation from the service responder that the service team has accepted responsibility for responding to the incident. Once this confirmation is received, the first responder’s duties are complete.
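The routing decision described in this section can be sketched in a few lines of Python. This is only an illustration: the phone numbers come from the text above, but the function names and the business-hours window (8:00 to 17:00, Monday to Friday) are assumptions, since this document does not define business hours.

```python
from datetime import datetime, time

# Contact points as listed in this document.
HELP_DESK = "NC State Help Desk (919-515-4357)"
NOC = "Network Operations Center (NOC) (919-513-9675)"
OPERATIONS = "Operations (919-515-5500)"

# ASSUMPTION: business hours are not defined in this document;
# 8:00-17:00, Monday-Friday is a placeholder.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """Return True on weekdays inside the assumed business-hours window."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def first_responder_next_step(knows_service_team: bool, now: datetime) -> list[str]:
    """Return the remaining actions for a first responder (hypothetical helper)."""
    if knows_service_team:
        # Steps 2 and 3: contact the service responder, then get confirmation.
        return [
            "Contact the service responder (phone, in person, or instant message / text)",
            "Obtain confirmation that the service team has accepted responsibility",
        ]
    # Otherwise a single report completes the first responder's duties.
    if is_business_hours(now):
        return [f"Report to {HELP_DESK} or {NOC}"]
    return [f"Report to {OPERATIONS}"]
```

For example, a first responder who does not know the responsible team on a Saturday would be directed to Operations, after which his/her duties are complete.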
Duties of a Service Responder
- Immediately confirm to the contacting party (first responder, NC State Help Desk, NOC, or Operations) that the service team has accepted the responsibility for responding to the incident. Give this confirmation in person, by phone, or through a previously established chat channel (e.g., the OIT Critical Emergency Google Chat room).
- Notify the other service team members about the incident.
Duties of a Service Team
- Determine the following:
- > Number of stakeholders affected by the incident. (Tens? Hundreds? Thousands?)
- > Type of impact. (E.g., full service outage, reduced functionality.)
- > Scope of impact to stakeholders. (E.g., portal outage, AutoCAD license server outage.)
- > Expected time required to resolve the incident. (Minutes? Hours? Days?)
- Respond to the incident in one of the following two ways, based on its impact:
- Incident that has relatively low impact
The on-call service team member will inform the stakeholder and all other appropriate individuals via outage / degradation posts (and other means, as needed) as soon as the incident has been resolved, without the need for further communication, provided that the incident meets ALL of these three criteria:
- > Can be resolved by the on-call service team within one hour.
- > Has no negative effect on critical business functions.
- > Adversely affects no more than one stakeholder.
- Incident that has relatively high impact
The on-call service team member will contact the service offering manager by phone or in person and transfer the responsibility for communicating about the incident response if the incident meets AT LEAST ONE of these three criteria:
- > Will require a period of hours and multiple staff members to resolve.
- > Has a negative effect on critical business functions.
- > Adversely affects two or more stakeholders.
- Communicate status information to the service offering manager throughout the incident.
- Provide targeted information through additional means of communication. (Optional or as requested by key stakeholders.)
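The low-impact and high-impact criteria listed above amount to a small decision rule, which the following Python sketch illustrates. The class, field, and function names are hypothetical. Two points are assumptions rather than policy: the first high-impact criterion is read as requiring both "a period of hours" and "multiple staff members", and an incident that meets neither set of criteria (for example, one stakeholder affected but not resolvable within one hour) is reported as "unclassified" because the text does not say how to treat it.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Facts the service team determines about an incident (hypothetical model)."""
    resolvable_within_one_hour: bool
    needs_hours_and_multiple_staff: bool
    affects_critical_business_functions: bool
    stakeholders_affected: int


def classify_impact(incident: Incident) -> str:
    """Return 'high', 'low', or 'unclassified' per the criteria in this section."""
    # High impact: AT LEAST ONE of the three criteria holds.
    if (
        incident.needs_hours_and_multiple_staff
        or incident.affects_critical_business_functions
        or incident.stakeholders_affected >= 2
    ):
        return "high"
    # Low impact: ALL three criteria hold.
    if (
        incident.resolvable_within_one_hour
        and not incident.affects_critical_business_functions
        and incident.stakeholders_affected <= 1
    ):
        return "low"
    # ASSUMPTION: the policy does not cover this case; flag it for judgment.
    return "unclassified"
```

A "high" result corresponds to handing communication over to the service offering manager, while a "low" result corresponds to a single notice once the incident has been resolved.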
Duties of a Service Offering Manager
After accepting responsibility for coordinating the incident response, the service offering manager will become the hub of all incident-related communications, including:
- Determining which of his/her team members will need to help respond to the incident.
- Determining whether assistance in posting updates will be needed from the NC State Help Desk or Project Portfolio and Process Services (PPPS).
- Maintaining contact with the service team and any other response personnel as they attempt to restore the service.
- Updating the response status to all stakeholders.
Project Portfolio and Process Services (PPPS)
The service offering manager should confer with PPPS (ors@ncsu.edu) about the incident if at least one of the following is true:
- A large number of stakeholders are affected.
- Affected stakeholders are performing work critical to university business.
- Multiple services are affected (e.g., data center issues).
- The incident has remained unresolved for an extended period of time.
Even if none of these applies, the service offering manager may still want to confer with PPPS about the incident.
Initial outage / degradation post
Immediately after accepting responsibility for coordinating the incident response, the service offering manager must ensure that the stakeholders are informed about the incident’s status.
- If an initial post has already been made by a member of the NC State Help Desk or Operations (who in many cases will have been the first responder), then the service offering manager should use that post to make subsequent updates as indicated below.
- If no initial post has been made, the service offering manager should create one following the directions described in the Creating an Outage knowledgebase article.
Update frequency
For the duration of the incident response, the service offering manager must post frequent updates to all stakeholders. Frequency will depend on the nature, scope, and timing of the incident.
In general, posts should be made:
- Every 1-2 hours during business hours.
- Every 3-4 hours during non-business hours.
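As a rough illustration of these intervals, the sketch below computes the latest time by which the next post would be due, using the upper bound of each range (2 hours and 4 hours). The function name is hypothetical, and the caller decides whether the period counts as business hours, since this document does not define them.

```python
from datetime import datetime, timedelta

# Upper bounds of the ranges given above: 1-2 hours and 3-4 hours.
MAX_INTERVAL_BUSINESS_HOURS = timedelta(hours=2)
MAX_INTERVAL_NON_BUSINESS_HOURS = timedelta(hours=4)


def next_update_deadline(last_post: datetime, during_business_hours: bool) -> datetime:
    """Return the latest time the next update should be posted (hypothetical helper)."""
    if during_business_hours:
        return last_post + MAX_INTERVAL_BUSINESS_HOURS
    return last_post + MAX_INTERVAL_NON_BUSINESS_HOURS
```

These are general guidelines only; the nature, scope, and timing of the incident may call for more frequent posts.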
Update content
When creating an update, the service offering manager must:
- Indicate any change in the status of the incident.
- Indicate any change in the service restoration time frame.
- Include additional information as needed by stakeholders.
Tailored communication
In addition to outage / degradation posts, the service offering manager should use the means of communication most appropriate for (or requested by) specific groups of stakeholders. In general, a more immediate level of communication requires a faster means.
- For the general public and all other stakeholders:
- Outage / degradation posts.
- For OIT technical staff responding to the incident:
- OIT Critical Emergency Google Chat channel.
- Phone.
- In-person contact.
- Email, but without lengthy message threads that may not include all of the necessary staff.
- For the NC State Help Desk and PPPS (and Operations, if after hours):
- OIT Critical Emergency Google Chat channel.
- Email.
- Phone.
- In-person contact.
- For the Directors and CIO:
- Email summary.
- Other means as requested.
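For teams that keep this guidance in a script or runbook, the audience-to-channel pairing above could be recorded as a simple lookup table. The sketch below is one possible representation; the short audience keys and the function name are hypothetical, and the order of channels simply mirrors the lists above rather than expressing a priority.

```python
# Channels per audience, copied from the lists above.
# ASSUMPTION: the dictionary keys are illustrative labels, not official terms.
CHANNELS_BY_AUDIENCE = {
    "public": ["Outage / degradation posts"],
    "responders": [
        "OIT Critical Emergency Google Chat channel",
        "Phone",
        "In-person contact",
        "Email (without lengthy message threads)",
    ],
    "support": [
        "OIT Critical Emergency Google Chat channel",
        "Email",
        "Phone",
        "In-person contact",
    ],
    "leadership": ["Email summary", "Other means as requested"],
}


def channels_for(audience: str) -> list[str]:
    """Return the listed channels for an audience; raises KeyError if unknown."""
    return CHANNELS_BY_AUDIENCE[audience]
```

Here "support" stands for the NC State Help Desk and PPPS (and Operations, if after hours), and "leadership" stands for the Directors and CIO.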
Additional communications
Depending on the stakeholders and the service involved, it may also be appropriate for the service offering manager to communicate the status of the incident through one or more of the following:
- Online Notify.
- Service-specific email lists (e.g., Majordomo, Bronto).
- Websites (e.g., OIT site, main NC State site).
- OIT News.
- Technician.
- News Services.
Final actions
- When the service has been restored, the service offering manager will send a final follow-up outage / degradation update to all stakeholders, ensuring that they are aware of:
- What happened.
- What was done to restore the service.
- Resumption of the service.
- If an after-action review is needed, the service offering manager will coordinate with PPPS to arrange for it.