@webrgp
Last active August 1, 2024 17:42

Incident Report / Post Mortem

Title: Example Website Outage

Date: MM/DD/YYYY
Severity: Critical / High / Medium / Low
Root cause: Describe the root cause of the issue, if discovered.
Incident duration: 1.5 hours
Time to detect the incident: MMMM DDD @ hh:mm a
Time to mitigate the incident: MMMM DDD @ hh:mm a

Executive summary

Keep it brief, 1-2 paragraphs.

Problem Summary

What happened?

Keep it brief and direct, just a few sentences.

Who was impacted?

Keep it brief and direct, just a few sentences.

Why did it happen?

Keep it brief and direct, just a few sentences.

Product(s) affected:

List the properties and/or services affected.

User impact:

Keep it brief and direct, just a few sentences.

Detection:

Keep it brief and direct, just a few sentences.

Resolution:

Keep it brief and direct, just a few sentences.

Duration:

Describe duration, for example:

Approximately 1.5 hours (from 9:00 to 10:30 am on June 5th UTC).

Timeline (in UTC): (Example)

  • June 5th, 9:00 am: Updown monitor reported outage
  • June 5th, 9:10 am: HC notified client about the incident
  • June 5th, 9:40 am: HC completed review of the issue and started mitigation measures.
  • June 5th, 9:50 am: Client notified HC of additional issues.
  • June 5th, 10:05 am: HC deployed fix.
  • June 5th, 10:05 am: HC confirmed the website was back up and running.
  • June 5th, 10:15 am: Client confirmed the website was back up and running.
  • June 5th, 10:30 am: HC completed comprehensive testing of the website.
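
The intervals reported in a post mortem can be derived directly from the timeline entries. The following is a minimal Python sketch, not part of the template itself. It assumes a placeholder year (the example timeline gives none), treats the first monitor alert as the start of the incident, and treats the end of comprehensive testing as the end. The names `utc`, `minutes_between`, and the variables are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone

# Assumption: placeholder year; the example timeline gives only day and time.
YEAR = 2024


def utc(hour, minute):
    """Build a UTC timestamp on June 5th of the placeholder year."""
    return datetime(YEAR, 6, 5, hour, minute, tzinfo=timezone.utc)


# Selected entries from the example timeline.
outage_reported = utc(9, 0)      # Updown monitor reported outage
fix_deployed = utc(10, 5)        # HC deployed fix
testing_completed = utc(10, 30)  # HC completed comprehensive testing


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Assumption: mitigation is counted from the first alert to the deployed fix.
time_to_mitigate = minutes_between(outage_reported, fix_deployed)

# Assumption: the incident is counted as closed once testing is complete.
total_duration = minutes_between(outage_reported, testing_completed)

print(f"Time to mitigate: {time_to_mitigate} minutes")
print(f"Incident duration: {total_duration / 60:.1f} hours")
```

Which timeline entries mark the start and end of an incident is a team convention. Whatever convention is chosen, stating it next to the reported duration keeps the summary and the timeline consistent.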

What are we going to do about it?

Lessons learned:

List lessons learned and areas for improvement.

  • Lesson learned 1
  • Lesson learned 2
  • Lesson learned 3

What went wrong:

List the failures that led to the issue.

  • Failure 1
  • Failure 2
  • Failure 3

What went well:

List things that went well during the incident (mitigation plans, response time, etc.).

  • Success 1
  • Success 2
  • Success 3

Where we were lucky:

Identify and list how the incident could have led to a bigger problem.

  • Luck 1
  • Luck 2
  • Luck 3

Action items

Create a list of action items that we can take to prevent a similar incident from happening in the future.

  • Action item 1
  • Action item 2
  • Action item 3