shiftyp/incident.md

Incident Report: OpenTelemetry Demo System Outage

1. Executive Summary

The OpenTelemetry Demo system is experiencing cascading failures across several microservices. The issues appear to date from around 2025-05-27T00:06Z, when the frontend, image provider, and load generator pods started, which suggests a deployment or restart at that time. The primary symptoms are load generator errors, service connectivity issues, and possible performance degradation across multiple services.

2. Incident Analysis

2.1 Timeline of Events

Time (UTC)	Event
2025-05-27T00:06:53Z	Frontend pod started
2025-05-27T00:06:55Z	Image provider pod started
2025-05-27T00:06:56Z	Load generator pod started
~2025-05-27T01:30:00Z	Multiple errors observed in load generator

2.2 Affected Services

graph TD
    A[Frontend] --> B[Product Catalog]
    A --> C[Cart]
    A --> D[Recommendation]
    A --> E[Ad Service]
    F[Checkout] --> C
    F --> G[Payment]
    F --> H[Shipping]
    F --> I[Email]
    F --> J[Currency]
    F --> K[Kafka]
    K --> L[Fraud Detection]
    M[Load Generator] -- "Synthetic Traffic" --> A
    N[Feature Flags] -- "Controls Behavior" --> A
    N -- "Controls Behavior" --> B
    N -- "Controls Behavior" --> D
    N -- "Controls Behavior" --> E
    N -- "Controls Behavior" --> G
    N -- "Controls Behavior" --> L
    N -- "Controls Behavior" --> M

2.3 Key Metrics and Observations

Based on the OpenTelemetry data, we observed:

Error Logs: Significant errors in the load generator service with "Browser.new_context: Target page, context or browser has been closed" errors.
Service Activity: High activity in the flagd service (191,439 metrics records), followed by load-generator (87,014) and recommendation service (84,608).
Feature Flag Status: Several feature flags are defined that can trigger failures, including:
- kafkaQueueProblems: Introduces delays in Kafka message processing
- paymentFailure: Causes payment service to fail at configurable rates
- loadGeneratorFloodHomepage: Floods the frontend with excessive requests
- imageSlowLoad: Causes slow image loading in the frontend

3. Code Analysis

3.1 Load Generator Issues

The load generator is experiencing browser context errors:

# src/load-generator/locustfile.py
class WebsiteBrowserUser(PlaywrightUser):
    headless = True  # to use a headless browser, without a GUI

    @task
    @pw
    async def open_cart_page_and_change_currency(self, page: PageWithRetry):
        try:
            page.on("console", lambda msg: print(msg.text))
            await page.route('**/*', add_baggage_header)
            await page.goto("/cart", wait_until="domcontentloaded")
            await page.select_option('[name="currency_code"]', 'CHF')
            await page.wait_for_timeout(2000)  # giving the browser time to export the traces
        except:
            pass

The error suggests that the browser context is being closed prematurely, possibly due to resource constraints or timing issues. The bare except: pass swallows every exception without logging it, so the details that would point to the root cause are lost.

3.2 Kafka Queue Problems

The fraud detection service has a feature flag that can introduce artificial delays:

// src/fraud-detection/src/main/kotlin/frauddetection/main.kt
if (getFeatureFlagValue("kafkaQueueProblems") > 0) {
    logger.info("FeatureFlag 'kafkaQueueProblems' is enabled, sleeping 1 second")
    Thread.sleep(1000)
}

This intentional delay can cause message processing backlogs in Kafka, affecting the checkout flow.
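
To make the backlog effect concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It assumes the 1 second sleep applies to every consumed message and dominates processing time, and that messages arrive at a steady rate. The function name and the example rates are hypothetical.

```python
def consumer_lag(arrival_rate: float, per_message_delay_s: float,
                 duration_s: float, consumers: int = 1) -> float:
    """Estimate how many messages are left unprocessed after duration_s.

    Assumption: each consumer handles one message per per_message_delay_s
    seconds (the injected sleep dominates), and arrivals are steady.
    """
    if per_message_delay_s <= 0:
        raise ValueError("per_message_delay_s must be positive")
    service_rate = consumers / per_message_delay_s      # messages per second
    backlog_growth = max(0.0, arrival_rate - service_rate)
    return backlog_growth * duration_s


if __name__ == "__main__":
    # Hypothetical load: 5 messages/s arriving, 1 s sleep, one consumer, 10 minutes.
    print(consumer_lag(5.0, 1.0, 600))  # 2400.0
```

Under these assumptions a single consumer tops out at one message per second, so any sustained arrival rate above that grows the lag linearly until the flag is disabled or consumers are added.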

3.3 Payment Service Issues

The payment service has a feature flag to introduce failures:

// src/payment/charge.js
const numberVariant = await OpenFeature.getClient().getNumberValue("paymentFailure", 0);

if (numberVariant > 0) {
  // n% chance to fail with app.loyalty.level=gold
  if (Math.random() < numberVariant) {
    span.setAttributes({'app.loyalty.level': 'gold' });
    span.end();

    throw new Error('Payment request failed. Invalid token. app.loyalty.level=gold');
  }
}

This can cause payment failures at a configurable rate, disrupting the checkout process.
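
The check above compares the flag value with a random number in [0, 1), so the value behaves as a probability rather than a percentage. The small simulation below mirrors that comparison in Python to illustrate the resulting failure rate. It is a sketch for reasoning about the flag, not code from the payment service, and the function names are hypothetical.

```python
import random


def should_fail(flag_value: float, rng: random.Random) -> bool:
    """Mirror the service's check: fail when a random draw in [0, 1) is below the flag value."""
    return flag_value > 0 and rng.random() < flag_value


def observed_failure_rate(flag_value: float, attempts: int, seed: int = 0) -> float:
    """Simulate `attempts` charges and return the fraction that fail."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    failures = sum(should_fail(flag_value, rng) for _ in range(attempts))
    return failures / attempts


if __name__ == "__main__":
    for value in (0.0, 0.1, 0.5, 1.0):
        print(value, observed_failure_rate(value, 10_000))
```

A flag value of 1 or more fails every charge, because the random draw is always below 1.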

4. Root Causes

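Based on the analysis, the following likely root causes have been identified. The flag states below are inferred from the observed symptoms and should be confirmed against the running flagd configuration: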

Feature Flag Activation: Several feature flags appear to be enabled that are intentionally causing failures:
- kafkaQueueProblems: Causing delays in the fraud detection service
- paymentFailure: Introducing random failures in payment processing
- loadGeneratorFloodHomepage: Creating excessive load on the frontend
Resource Constraints: The load generator's browser automation is failing with "Target page, context or browser has been closed" errors, suggesting resource constraints or improper cleanup.
Cascading Failures: Issues in one service (like payment or Kafka) are causing downstream effects in dependent services.

5. Mitigation Strategies

5.1 Immediate Actions

Disable Problematic Feature Flags:

// Update src/flagd/demo.flagd.json
"kafkaQueueProblems": {
  "defaultVariant": "off"
},
"paymentFailure": {
  "defaultVariant": "off"
},
"loadGeneratorFloodHomepage": {
  "defaultVariant": "off"
}
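
The same change can be scripted so that it is repeatable during an incident. The sketch below assumes the flagd file keeps its definitions under a top-level "flags" key, each with a "variants" map and a "defaultVariant", and that each targeted flag defines an "off" variant. The function name and the sample values are hypothetical.

```python
import json

DISRUPTIVE_FLAGS = ("kafkaQueueProblems", "paymentFailure", "loadGeneratorFloodHomepage")


def disable_flags(config: dict, names=DISRUPTIVE_FLAGS, off_variant: str = "off") -> list:
    """Set defaultVariant to off_variant for each named flag; return the names changed."""
    changed = []
    flags = config.get("flags", {})
    for name in names:
        flag = flags.get(name)
        if flag is None:
            continue  # flag not defined in this file; nothing to do
        if off_variant not in flag.get("variants", {}):
            raise KeyError(f"{name} has no '{off_variant}' variant")
        if flag.get("defaultVariant") != off_variant:
            flag["defaultVariant"] = off_variant
            changed.append(name)
    return changed


if __name__ == "__main__":
    # Hypothetical sample; in practice load src/flagd/demo.flagd.json with json.load.
    sample = {"flags": {"paymentFailure": {
        "state": "ENABLED", "variants": {"50%": 0.5, "off": 0}, "defaultVariant": "50%"}}}
    print(disable_flags(sample))
    print(json.dumps(sample, indent=2))
```

Returning the list of changed flags gives the incident timeline a record of which flags were actually active before the change.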

Reduce Load Generator Traffic:
- Scale down the load generator deployment or reduce the rate of requests
Restart Affected Services:
- Restart the load generator service to clear browser contexts
- Restart the fraud detection service to clear any backlog

5.2 Development Improvements

Improve Error Handling:

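# Better error handling in the load generator
import logging

logger = logging.getLogger(__name__)

try:
    ...  # existing browser automation code
except Exception:
    # logger.exception records the message together with the full traceback
    logger.exception("Browser automation error")
    # Implement proper cleanup here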

Add Circuit Breakers:

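// Add circuit breaker to payment service
// Sketch only: CircuitBreaker stands for whichever circuit-breaker library is chosen; option names vary by library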
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 0.3,
  resetTimeout: 30000
});

module.exports.charge = circuitBreaker.wrap(async request => {
  // Existing charge code
});
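
To show what such a breaker does internally, here is a minimal, library-independent sketch in Python. It is illustrative only: the class, its parameters (window, min_calls, the injectable clock) and the half-open behavior are assumptions chosen for the example, with the 0.3 failure threshold and 30 second reset timeout taken from the snippet above.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.3, reset_timeout_s=30.0,
                 window=20, min_calls=5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.min_calls = min_calls
        self._outcomes = deque(maxlen=window)  # True marks a failed call
        self._opened_at = None                 # None means the circuit is closed
        self._clock = clock

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; call rejected")
            # Reset timeout elapsed: let this call through as a trial (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed: bool) -> None:
        if self._opened_at is not None:
            # Outcome of a half-open trial: reopen on failure, close on success.
            if failed:
                self._opened_at = self._clock()
            else:
                self._opened_at = None
                self._outcomes.clear()
            return
        self._outcomes.append(failed)
        if len(self._outcomes) >= self.min_calls:
            ratio = sum(self._outcomes) / len(self._outcomes)
            if ratio >= self.failure_threshold:
                self._opened_at = self._clock()
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on a failing dependency, which is what limits the cascade. The min_calls guard keeps a single early failure from tripping the breaker.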

Implement Graceful Degradation:

// In frontend API gateway
async getCart(currencyCode: string) {
  try {
    return await request<IProductCart>({
      url: `${basePath}/cart`,
      queryParams: { sessionId: userId, currencyCode },
      timeout: 3000 // Add timeout
    });
  } catch (error) {
    // Log error
    // Return empty cart as fallback
    return { userId, items: [] };
  }
}

5.3 Infrastructure Improvements

Resource Allocation:
- Increase memory and CPU limits for the load generator pod
- Configure proper resource requests and limits for all services
Monitoring Improvements:
- Add alerts for Kafka consumer lag
- Set up alerts for payment service error rates
- Monitor browser resource usage in load generator
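
As an illustration of the consumer lag alert, lag per partition is the log end offset minus the consumer group's committed offset. The sketch below works on plain dictionaries so it does not depend on a particular Kafka client. The function names, the threshold, and the choice to treat a missing committed offset as 0 are assumptions.

```python
def total_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Sum per-partition lag (end offset minus committed offset, never negative)."""
    lag = 0
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts fully as lag.
        committed = committed_offsets.get(partition, 0)
        lag += max(0, end - committed)
    return lag


def should_alert(lag_samples: list, threshold: int, sustained: int = 3) -> bool:
    """Alert only when the last `sustained` samples all exceed the threshold."""
    recent = lag_samples[-sustained:]
    return len(recent) == sustained and all(sample > threshold for sample in recent)


if __name__ == "__main__":
    print(total_lag({0: 100, 1: 50}, {0: 90, 1: 50}))  # 10
    print(should_alert([10, 600, 700, 800], threshold=500))  # True
```

Requiring several consecutive samples above the threshold avoids paging on a short burst that the consumer clears on its own.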
Resilience Testing:
- Implement chaos engineering practices to regularly test system resilience
- Create automated tests that verify system behavior when feature flags are enabled

5.4 DevOps Improvements

Feature Flag Management:
- Implement approval workflow for enabling disruptive feature flags
- Add automatic rollback if metrics exceed thresholds after flag changes
Deployment Strategy:
- Implement canary deployments to detect issues before full rollout
- Add pre-deployment checks for feature flag status
Incident Response:
- Create runbooks for common failure scenarios
- Automate initial diagnosis steps

6. Additional Recommendations

Service Mesh Implementation:
- Consider implementing a service mesh like Istio for better traffic management, retries, and circuit breaking
Load Testing Improvements:
- Separate synthetic monitoring from load testing
- Implement more gradual ramp-up of load testing traffic
Observability Enhancements:
- Add custom metrics for feature flag status
- Implement distributed tracing visualization for better dependency analysis
- Create dashboards specific to critical user journeys (checkout flow)
Architecture Review:
- Review the tight coupling between services
- Consider implementing event-driven patterns for better resilience

7. Conclusion

The current incidents appear to be primarily caused by intentionally activated feature flags designed to simulate failures. These flags are creating cascading failures across the system. By disabling these flags and implementing the suggested improvements, the system's stability and resilience can be significantly improved.

The incident highlights the importance of proper feature flag management, robust error handling, and comprehensive monitoring in microservice architectures.

	We're experiencing a perfect storm of incidents. We're unsure when it started or what's going on. Create an incident report from OpenTelemetry data and our source code that includes:

	1. all relevant analysis of incidents and relationships
	2. code analysis with snippets
	3. visualizations with mermaid and tables where useful
	4. root causes
	5. mitigation strategies in terms of dev, infra, devops
	6. anything else you can think of