Incident Post Mortem: June 25–26, 2019


Mark Hudnall

From 15:16 PT to 15:46 PT (22:16–22:46 UTC), buy/sell/trade functionality was severely degraded: 97% of buy/sell/trade requests received an error response.

From 13:37 PT to 14:09 PT (20:37–21:09 UTC), all endpoints experienced sustained errors. Error rates hovered around 35% for the duration of the incident.

  1. Soon after, a large number of real-time price alerts were triggered, causing significantly increased query throughput for this cluster.
  2. Increased query throughput, combined with the prior cache evictions, caused additional cache pressure and resulted in query queueing and increased query latency. These are initial conclusions; our investigation, in collaboration with MongoDB, is ongoing.
  3. Due to increased query latency during the request/response cycle, web workers became saturated, serving HTTP 502s.

Figure: Storage engine cache activity for the affected cluster

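To illustrate how increased query latency can saturate a fixed pool of web workers, here is a rough back-of-the-envelope sketch using Little's law (mean concurrency = arrival rate × mean time per request). The pool size, request rate, and latencies below are hypothetical numbers chosen for illustration; they are not measurements from this incident.

```python
import math


def workers_needed(requests_per_second: float, mean_latency_ms: float) -> int:
    """Estimate busy workers via Little's law: L = arrival rate * time in system."""
    return math.ceil(requests_per_second * mean_latency_ms / 1000)


def is_saturated(pool_size: int, requests_per_second: float, mean_latency_ms: float) -> bool:
    """A pool is saturated when it needs more workers than it has."""
    return workers_needed(requests_per_second, mean_latency_ms) > pool_size


# Hypothetical values, for illustration only.
POOL_SIZE = 200
REQUESTS_PER_SECOND = 1000

for latency_ms in (100, 500):
    needed = workers_needed(REQUESTS_PER_SECOND, latency_ms)
    state = "saturated" if is_saturated(POOL_SIZE, REQUESTS_PER_SECOND, latency_ms) else "ok"
    print(f"latency={latency_ms}ms needs {needed} workers ({state})")
```

Under these assumed numbers, a fivefold increase in per-request latency pushes the required concurrency from 100 to 500 workers, well past a 200-worker pool, even though the request rate itself has not changed.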
  • We’ve removed the background job and are performing an audit of similar queries, moving them to analytics nodes.
  • We’ve tuned certain high-throughput endpoints that are hit hardest during price alerts.
  • We’re continuing work to reduce load on MongoDB through caching and reading from secondary data stores that can be scaled horizontally.
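As a minimal sketch of the caching idea in the last item, the snippet below shows a generic in-process read-through cache with a time-to-live. It is not our production implementation: the `ReadThroughCache` class, the `loader` callback, and the TTL value are hypothetical names and choices made for illustration, and the loader stands in for whatever query would otherwise hit the primary database.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory; call the loader only on a miss or expiry."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # hypothetical stand-in for a database query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database round trip
        value = self._loader(key)  # miss or stale: fall through to the loader
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a fake loader that counts how often the "database" is queried.
calls = []

def load_price(asset: str) -> str:
    calls.append(asset)
    return f"price-of-{asset}"

cache = ReadThroughCache(load_price, ttl_seconds=5.0)
cache.get("BTC")
cache.get("BTC")  # served from the cache; the loader is not called again
print(len(calls))
```

The trade-off is staleness: a longer TTL removes more load from the database but serves older values, so the right TTL depends on how fresh each endpoint's data needs to be.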

