The OpenTelemetry Demo system is experiencing multiple cascading failures across several microservices. The issues appear to have started around 2025-05-27T00:06:00Z when the system was deployed or restarted. The primary symptoms include load generator errors, service connectivity issues, and potential performance degradation across multiple services.
| Time (UTC) | Event |
|---|---|
| 2025-05-27T00:06:53Z | Frontend pod started |
| 2025-05-27T00:06:55Z | Image provider pod started |
| 2025-05-27T00:06:56Z | Load generator pod started |
| ~2025-05-27T01:30:00Z | Multiple errors observed in load generator |
The service dependency graph below shows how traffic flows between services:

```mermaid
graph TD
    A[Frontend] --> B[Product Catalog]
    A --> C[Cart]
    A --> D[Recommendation]
    A --> E[Ad Service]
    F[Checkout] --> C
    F --> G[Payment]
    F --> H[Shipping]
    F --> I[Email]
    F --> J[Currency]
    F --> K[Kafka]
    K --> L[Fraud Detection]
    M[Load Generator] -- "Synthetic Traffic" --> A
    N[Feature Flags] -- "Controls Behavior" --> A
    N -- "Controls Behavior" --> B
    N -- "Controls Behavior" --> D
    N -- "Controls Behavior" --> E
    N -- "Controls Behavior" --> G
    N -- "Controls Behavior" --> L
    N -- "Controls Behavior" --> M
```
Based on the OpenTelemetry data, we observed:
- Error Logs: the load generator is repeatedly failing with "Browser.new_context: Target page, context or browser has been closed" errors.
- Service Activity: the flagd service produced the most metric records (191,439), followed by the load generator (87,014) and the recommendation service (84,608).
- Feature Flag Status: several feature flags are defined that can trigger failures, including:
  - `kafkaQueueProblems`: introduces delays in Kafka message processing
  - `paymentFailure`: causes the payment service to fail at a configurable rate
  - `loadGeneratorFloodHomepage`: floods the frontend with excessive requests
  - `imageSlowLoad`: causes slow image loading in the frontend
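For context, these flags live in `src/flagd/demo.flagd.json`. A flagd flag definition generally has this shape (the description and variant values below are illustrative, not the demo's exact configuration):

```json
"kafkaQueueProblems": {
  "description": "Introduces delays in Kafka message processing",
  "state": "ENABLED",
  "variants": { "on": 100, "off": 0 },
  "defaultVariant": "off"
}
```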
The load generator is experiencing browser context errors:
```python
# src/load-generator/locustfile.py
from locust import task
from locust_plugins.users.playwright import PlaywrightUser, PageWithRetry, pw


class WebsiteBrowserUser(PlaywrightUser):
    headless = True  # use a headless browser, without a GUI

    @task
    @pw
    async def open_cart_page_and_change_currency(self, page: PageWithRetry):
        try:
            page.on("console", lambda msg: print(msg.text))
            await page.route("**/*", add_baggage_header)  # helper defined elsewhere in this file
            await page.goto("/cart", wait_until="domcontentloaded")
            await page.select_option('[name="currency_code"]', "CHF")
            await page.wait_for_timeout(2000)  # give the browser time to export the traces
        except:
            pass
```
The error suggests the browser context is being closed prematurely, possibly due to resource constraints or timing issues. The broad exception handler (`except: pass`) swallows the exception and hides the root cause.
The fraud detection service has a feature flag that can introduce artificial delays:
```kotlin
// src/fraud-detection/src/main/kotlin/frauddetection/main.kt
if (getFeatureFlagValue("kafkaQueueProblems") > 0) {
    logger.info("FeatureFlag 'kafkaQueueProblems' is enabled, sleeping 1 second")
    Thread.sleep(1000)
}
```
This intentional delay caps the consumer at roughly one message per second, so any sustained checkout rate above that builds a growing Kafka backlog and delays the checkout flow.
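For reference, `getFeatureFlagValue` can be implemented with the OpenFeature SDK. The following is a hedged sketch, not the demo's exact code; it assumes flagd has been registered as the OpenFeature provider at startup:

```kotlin
import dev.openfeature.sdk.OpenFeatureAPI

// Resolve an integer-valued flag; flagd serves the actual variant values.
fun getFeatureFlagValue(flagName: String): Int {
    val client = OpenFeatureAPI.getInstance().client
    return client.getIntegerValue(flagName, 0)  // default 0 means "flag off"
}
```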
The payment service has a feature flag to introduce failures:
```javascript
// src/payment/charge.js
const numberVariant = await OpenFeature.getClient().getNumberValue("paymentFailure", 0);
if (numberVariant > 0) {
  // n% chance to fail with app.loyalty.level=gold
  if (Math.random() < numberVariant) {
    span.setAttributes({ 'app.loyalty.level': 'gold' });
    span.end();
    throw new Error('Payment request failed. Invalid token. app.loyalty.level=gold');
  }
}
```
This causes payment failures at a configurable rate (for example, a variant value of 0.25 fails roughly 25% of charge requests), disrupting the checkout process.
Based on the analysis, the following root causes have been identified:
- Feature Flag Activation: several feature flags appear to be enabled that intentionally cause failures:
  - `kafkaQueueProblems`: causing delays in the fraud detection service
  - `paymentFailure`: introducing random failures in payment processing
  - `loadGeneratorFloodHomepage`: creating excessive load on the frontend
- Resource Constraints: the load generator's browser automation is failing with "Target page, context or browser has been closed" errors, suggesting resource exhaustion or improper cleanup.
- Cascading Failures: issues in one service (such as payment or Kafka) are causing downstream effects in dependent services.

The following immediate and longer-term actions are recommended:
- Disable Problematic Feature Flags: update `src/flagd/demo.flagd.json` (flagd typically picks up file changes without a restart):

  ```json
  "kafkaQueueProblems": { "defaultVariant": "off" },
  "paymentFailure": { "defaultVariant": "off" },
  "loadGeneratorFloodHomepage": { "defaultVariant": "off" }
  ```
- Reduce Load Generator Traffic:
  - Scale down the load generator deployment or reduce the rate of requests (see the sketch below)
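  A minimal sketch, assuming a Kubernetes install with a deployment named `load-generator` in an `otel-demo` namespace (both names vary by environment):

  ```sh
  # Scale the load generator down to one replica (or 0 to stop it entirely)
  kubectl scale deployment/load-generator --replicas=1 -n otel-demo
  ```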
- Restart Affected Services (commands sketched below):
  - Restart the load generator service to clear browser contexts
  - Restart the fraud detection service to clear any backlog
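  Again assuming Kubernetes and illustrative deployment names:

  ```sh
  # Rolling restarts recreate the pods with fresh state
  kubectl rollout restart deployment/load-generator -n otel-demo
  kubectl rollout restart deployment/fraud-detection -n otel-demo
  ```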
- Improve Error Handling:

  ```python
  # Better error handling in the load generator
  try:
      ...  # existing browser automation code
  except Exception as e:
      logger.error(f"Browser automation error: {e}")
      # Implement proper cleanup (e.g., close the page and browser context) here
  ```
- Add Circuit Breakers:

  ```javascript
  // Add a circuit breaker to the payment service.
  // CircuitBreaker is illustrative here; a library such as opossum provides equivalents.
  const circuitBreaker = new CircuitBreaker({
    failureThreshold: 0.3,
    resetTimeout: 30000
  });

  module.exports.charge = circuitBreaker.wrap(async request => {
    // Existing charge code
  });
  ```
- Implement Graceful Degradation:

  ```typescript
  // In the frontend API gateway
  async getCart(currencyCode: string) {
    try {
      return await request<IProductCart>({
        url: `${basePath}/cart`,
        queryParams: { sessionId: userId, currencyCode },
        timeout: 3000, // add a timeout
      });
    } catch (error) {
      // Log the error and return an empty cart as a fallback
      return { userId, items: [] };
    }
  }
  ```
- Resource Allocation (manifest sketch below):
  - Increase memory and CPU limits for the load generator pod
  - Configure proper resource requests and limits for all services
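  An illustrative Kubernetes `resources` stanza for the load generator pod spec; the actual values should be tuned from observed usage:

  ```yaml
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi
  ```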
- Monitoring Improvements (an example alert rule follows):
  - Add alerts for Kafka consumer lag
  - Set up alerts for payment service error rates
  - Monitor browser resource usage in the load generator
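  A hedged example of a Prometheus alert rule for consumer lag; the metric name depends on which Kafka exporter is deployed (`kafka_consumergroup_lag` is emitted by kafka-exporter), and the threshold is illustrative:

  ```yaml
  groups:
    - name: kafka
      rules:
        - alert: KafkaConsumerLagHigh
          expr: sum(kafka_consumergroup_lag) by (consumergroup) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} lag above 1000 for 5 minutes"
  ```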
- Resilience Testing:
  - Implement chaos engineering practices to regularly test system resilience
  - Create automated tests that verify system behavior when feature flags are enabled
- Feature Flag Management (rollback sketch below):
  - Implement an approval workflow for enabling disruptive feature flags
  - Add automatic rollback if metrics exceed thresholds after flag changes
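  A hypothetical sketch of the automatic-rollback idea; `query_error_rate` and `set_flag_default` stand in for whatever metrics and flag-management APIs are actually available:

  ```python
  import time

  ERROR_RATE_THRESHOLD = 0.05  # roll back if more than 5% of requests fail
  WATCH_WINDOW_SECONDS = 300   # observe for 5 minutes after a flag change

  def watch_flag_change(flag_name: str, previous_variant: str) -> None:
      """Poll the error rate after a flag change and roll back on regression."""
      deadline = time.time() + WATCH_WINDOW_SECONDS
      while time.time() < deadline:
          if query_error_rate() > ERROR_RATE_THRESHOLD:  # hypothetical helper
              set_flag_default(flag_name, previous_variant)  # hypothetical helper
              return
          time.sleep(15)
  ```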
- Deployment Strategy:
  - Implement canary deployments to detect issues before full rollout
  - Add pre-deployment checks for feature flag status
- Incident Response:
  - Create runbooks for common failure scenarios
  - Automate initial diagnosis steps
- Service Mesh Implementation:
  - Consider a service mesh such as Istio for better traffic management, retries, and circuit breaking
- Load Testing Improvements (ramp-up sketch below):
  - Separate synthetic monitoring from load testing
  - Implement a more gradual ramp-up of load-testing traffic
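  Since the demo's load generator is Locust-based, a gradual ramp-up can be expressed with Locust's `LoadTestShape`; the stage values here are illustrative:

  ```python
  from locust import LoadTestShape

  class GradualRampShape(LoadTestShape):
      """Step the user count up in stages instead of all at once."""
      stages = [
          {"duration": 60,  "users": 5,  "spawn_rate": 1},   # first minute: 5 users
          {"duration": 180, "users": 20, "spawn_rate": 2},   # up to 3 min: 20 users
          {"duration": 600, "users": 50, "spawn_rate": 5},   # up to 10 min: 50 users
      ]

      def tick(self):
          run_time = self.get_run_time()
          for stage in self.stages:
              if run_time < stage["duration"]:
                  return stage["users"], stage["spawn_rate"]
          return None  # stop the test after the final stage
  ```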
- Observability Enhancements (flag-metrics sketch below):
  - Add custom metrics for feature flag status
  - Implement distributed tracing visualization for better dependency analysis
  - Create dashboards for critical user journeys (e.g., the checkout flow)
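  One way to emit a custom metric for flag evaluations, using the OpenTelemetry JavaScript API; the meter and counter names are illustrative:

  ```javascript
  const { metrics } = require('@opentelemetry/api');

  const meter = metrics.getMeter('feature-flags');
  const flagEvaluations = meter.createCounter('feature_flag.evaluations', {
    description: 'Count of feature flag evaluations by flag key and variant',
  });

  // Record one evaluation, tagged with the flag key and the resolved variant
  flagEvaluations.add(1, {
    'feature_flag.key': 'paymentFailure',
    'feature_flag.variant': 'on',
  });
  ```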
- Architecture Review:
  - Review the tight coupling between services
  - Consider event-driven patterns for better resilience
The current incidents appear to be primarily caused by intentionally activated feature flags designed to simulate failures. These flags are creating cascading failures across the system. By disabling these flags and implementing the suggested improvements, the system's stability and resilience can be significantly improved.
The incident highlights the importance of proper feature flag management, robust error handling, and comprehensive monitoring in microservice architectures.