webrgp/incident-report-template.md

Last active August 1, 2024 17:42

Incident Report / Post Mortem

Title: Example Website Outage

Date: MM/DD/YYYY

Severity	Critical / High / Medium / Low
Root cause	Describe the root cause of the issue, if discovered.
Incident duration	1.5 hours
Time the incident was detected	MMMM DD @ hh:mm a
Time the incident was mitigated	MMMM DD @ hh:mm a

Executive summary

Keep it brief, 1-2 paragraphs.

Problem Summary

What happened?

Keep it brief and direct, just a few sentences.

Who was impacted?

Keep it brief and direct, just a few sentences.

Why did it happen?

Keep it brief and direct, just a few sentences.

Product(s) affected:

List the properties and/or services affected.

User impact:

Keep it brief and direct, just a few sentences.

Detection:

Keep it brief and direct, just a few sentences.

Resolution:

Keep it brief and direct, just a few sentences.

Duration:

Describe duration, for example:

Approximately 1.5 hours (from 9:00 to 10:30 am on June 5th UTC).

Timeline (in UTC): (Example)

June 5th, 9:00 am: Updown monitor reported outage.
June 5th, 9:10 am: HC notified client about the incident.
June 5th, 9:40 am: HC completed review of the issue and started mitigation measures.
June 5th, 9:50 am: Client notified HC of additional issues.
June 5th, 10:05 am: HC deployed fix.
June 5th, 10:05 am: HC confirmed the website was back up and running.
June 5th, 10:15 am: Client confirmed the website was back up and running.
June 5th, 10:30 am: HC completed comprehensive testing of the website.
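
The duration figures in the summary can be derived from the timeline rather than estimated by hand. The following is a minimal sketch, not part of the original template. It assumes the incident runs from the first monitor alert to the end of testing, and that mitigation is the moment the fix was deployed. The year (2024) and the helper names `utc` and `format_duration` are placeholders introduced for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the example timeline gives no year, so 2024 is a placeholder.
YEAR = 2024


def utc(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on June 5th of the placeholder year."""
    return datetime(YEAR, 6, 5, hour, minute, tzinfo=timezone.utc)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as hours and minutes for the report."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


# Three events copied from the example timeline.
detected = utc(9, 0)     # Updown monitor reported outage
mitigated = utc(10, 5)   # HC deployed fix
closed = utc(10, 30)     # HC completed comprehensive testing

# Assumed definitions: both intervals start at the first alert.
time_to_mitigate = mitigated - detected
incident_duration = closed - detected

print("Time to mitigate:", format_duration(time_to_mitigate))
print("Incident duration:", format_duration(incident_duration))
```

If your team measures duration to the point of mitigation instead, swap `closed` for `mitigated` and state the definition in the Duration section.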

What are we going to do about it?

Lessons learned:

List the lessons learned and areas for improvement.

Lesson learned 1
Lesson learned 2
Lesson learned 3

What went wrong:

List the failures that led to the issue.

Failure 1
Failure 2
Failure 3

What went well:

List things that went well during the incident (mitigation plans, response time, etc.).

Success 1
Success 2
Success 3

Where we were lucky:

Identify and list the ways the incident could have led to a bigger problem.

Luck 1
Luck 2
Luck 3

Action items

Create a list of action items that we can take to prevent a similar incident from happening in the future.

Action item 1
Action item 2
Action item 3
