Never Waste A Production Incident

Production incidents are one of the best avenues for accelerating the maturity of a reusable asset. Incidents are stressful while we are dealing with them, but they provide clear, direct feedback on gaps in the platform. First, don’t indulge in blame games, and don’t waste time fretting that the incident happened. Second, if you step back from the heat, incidents are an excellent means of learning more about your assumptions and risks.

  • Did you assume that an external dependency would always be available? More specifically, did you assume that the dependency would respond within a certain latency threshold? (A timeout sketch that makes this assumption explicit follows this list.)

  • Was there manual effort involved in identifying the problem? If so, how much time did it take to get to the root cause? What was missing in your supportability tooling? Every manual task opens the door to additional risk, so examining each one is key. Think about how to get to the root cause faster:

    • Instrumentation about what was happening during the incident: were there pending transactions? Pending events waiting to be processed? How “busy” was your process or service, and was that above or below expected thresholds? (See the instrumentation sketch after this list.)

    • Was there a particular poison or rogue message that triggered a chain reaction of sorts? Did your platform get overwhelmed by too many requests within a certain time window? (A poison-message quarantine sketch follows this list.)

    • Did you get alerted? If so, was the alert about a symptom, or did it provide clues to the underlying root cause? Did it include enough diagnostic information for further troubleshooting? Was there an opportunity to treat the issue as an intermittent failure: instead of alerting, could the platform have automatically healed itself? (A self-healing sketch follows this list.)

    • Was the issue caused by an ill-behaved component or external dependency? If so, has this happened before (is it routine), or is it new behavior?

  • Think about defect prevention and proactive controls. There are a variety of strategies for achieving this: load shedding, deferring non-critical maintenance activities, monitoring trends for out-of-band behavior, and so on. Invest in automated controls that warn of threshold breaches: availability of individual services within the platform, unusual peaks or drops in requests, rogue clients that hog the file system or other critical platform resources, etc. (A load-shedding sketch follows this list.)
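
To make the latency assumption in the first bullet explicit, here is a minimal sketch in Python of wrapping a dependency call in a hard timeout. The endpoint, the `fetch_quote` function, and the `LATENCY_BUDGET_SECONDS` value are all hypothetical illustrations, not part of the original text:

```python
import requests  # assumed HTTP client; any client that accepts a timeout works

LATENCY_BUDGET_SECONDS = 2.0  # hypothetical budget; derive yours from incident data

def fetch_quote(symbol: str) -> dict:
    """Call an external dependency with an explicit latency budget.

    A timeout turns the implicit assumption ("the dependency always responds
    quickly") into an enforced, observable contract: a slow dependency now
    raises requests.Timeout instead of silently hanging the caller.
    """
    response = requests.get(
        f"https://pricing.example.com/quotes/{symbol}",  # hypothetical endpoint
        timeout=LATENCY_BUDGET_SECONDS,
    )
    response.raise_for_status()
    return response.json()
```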
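
One way to answer the “how busy was the service?” questions is for the asset to track and expose a snapshot of its in-flight work. A minimal sketch, assuming a service that maintains its own counters (the `WorkTracker` name and its fields are illustrative):

```python
import threading
import time

class WorkTracker:
    """Tracks in-flight work so an incident responder can see at a glance
    whether the service was idle, at expected load, or saturated."""

    def __init__(self, expected_max_inflight: int):
        self._lock = threading.Lock()
        self._inflight = 0
        self._expected_max = expected_max_inflight

    def start(self) -> None:
        with self._lock:
            self._inflight += 1

    def finish(self) -> None:
        with self._lock:
            self._inflight -= 1

    def snapshot(self) -> dict:
        """Expose this via a diagnostics endpoint or log it on demand."""
        with self._lock:
            return {
                "timestamp": time.time(),
                "inflight": self._inflight,
                "expected_max": self._expected_max,
                "saturated": self._inflight > self._expected_max,
            }
```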
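
For the poison-message question, a common mitigation is to cap redelivery attempts and quarantine repeat offenders in a dead-letter store so one bad message cannot trigger a chain reaction. A sketch under assumed names (`process`, `dead_letter`, and `MAX_ATTEMPTS` are hypothetical):

```python
import logging

MAX_ATTEMPTS = 3  # hypothetical redelivery cap

attempts: dict[str, int] = {}  # message id -> failures observed so far

def handle(message_id: str, payload: bytes, process, dead_letter) -> None:
    """Process a message, quarantining it after repeated failures so a single
    poison message cannot wedge the whole consumer."""
    try:
        process(payload)
        attempts.pop(message_id, None)  # success clears the failure count
    except Exception:
        attempts[message_id] = attempts.get(message_id, 0) + 1
        if attempts[message_id] >= MAX_ATTEMPTS:
            logging.exception("Quarantining poison message %s", message_id)
            dead_letter(message_id, payload)  # park it for offline analysis
            attempts.pop(message_id, None)
        else:
            raise  # let the broker redeliver and try again
```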
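
The self-healing question can often be addressed with a retry-with-backoff layer: treat the first few failures as intermittent, and alert a human only when recovery fails, carrying the root-cause exception in the alert. A minimal sketch (the function name and defaults are assumptions):

```python
import logging
import time

def call_with_self_healing(operation, retries: int = 3, base_delay: float = 0.5):
    """Retry an intermittent failure with exponential backoff; alert only once
    self-healing is exhausted, including the root-cause exception so responders
    start from a clue rather than a symptom."""
    last_exc = None
    for attempt in range(retries):
        try:
            return operation()
        except Exception as exc:
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    logging.error("Self-healing exhausted after %d attempts: %r", retries, last_exc)
    raise last_exc
```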
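
Finally, the proactive-controls bullet can start as small as a request-rate check that sheds excess load before the platform tips over. A minimal fixed-window sketch (the limit and window values are placeholders, not recommendations):

```python
import time

class LoadShedder:
    """Rejects requests beyond a cap within a fixed time window: a simple
    proactive control that turns "too many requests" from an outage into a
    warning signal."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._window_start = time.monotonic()
        self._count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self._window_start >= self.window_seconds:
            self._window_start, self._count = now, 0  # start a new window
        self._count += 1
        return self._count <= self.max_requests

# Hypothetical usage: reject excess requests and alert on sustained shedding.
# shedder = LoadShedder(max_requests=100, window_seconds=1.0)
# if not shedder.allow():
#     reject_with_http_503()  # hypothetical helper
```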
