Cloud Tenant Outage
Incident Report for ZenML Public Services
Postmortem

Incident Summary

  • On September 7, 2023, at 12:28 CST, it was reported that the ZenML Cloud service was down, affecting all customers.
  • The root cause was that the load balancer could not send any traffic to the ingress controllers on the cluster nodes, because the nodes did not respond to the target group's health checks. After a set number of failed health checks, the nodes were removed from the group.
  • The incident was resolved by adding the nodes back to the target group.
  • No data was lost during the outage.

Root Cause Analysis

  • The NGINX ingress controller currently runs on only one node, so our target group, which consisted of all nodes in our cluster, had only one healthy node.
  • The nodes were evicted from the target group, so no traffic reached the ingress controller pods.
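
The eviction behaviour described above can be illustrated with a small simulation. This is a minimal sketch, not the load balancer's actual logic: `TargetGroup`, `simulate`, the node names, and the threshold value are all hypothetical, and it assumes that failures are counted consecutively and that a removed node stays out until it is registered again.

```python
from dataclasses import dataclass, field


@dataclass
class TargetGroup:
    """Toy model of a load balancer target group (hypothetical, not a real API)."""

    unhealthy_threshold: int  # failed checks tolerated before a node is removed
    failures: dict = field(default_factory=dict)  # node name -> consecutive failures

    def register(self, node: str) -> None:
        """Add a node to the group, or add it back, with a clean failure count."""
        self.failures[node] = 0

    def record_check(self, node: str, passed: bool) -> None:
        """Record one health check result and evict the node at the threshold."""
        if node not in self.failures:
            return  # already evicted; only register() brings it back
        if passed:
            self.failures[node] = 0
            return
        self.failures[node] += 1
        if self.failures[node] >= self.unhealthy_threshold:
            del self.failures[node]

    def targets(self) -> list:
        """Nodes the load balancer can still send traffic to."""
        return sorted(self.failures)


def simulate(nodes, responding, rounds, threshold):
    """Run `rounds` of health checks; only nodes in `responding` pass."""
    group = TargetGroup(unhealthy_threshold=threshold)
    for node in nodes:
        group.register(node)
    for _ in range(rounds):
        for node in nodes:
            group.record_check(node, passed=node in responding)
    return group


if __name__ == "__main__":
    # No node answers the health checks, so every node is evicted.
    group = simulate(["node-a", "node-b"], responding=set(), rounds=3, threshold=3)
    print(group.targets())
    # Re-registering a node mirrors the manual fix applied during the incident.
    group.register("node-a")
    print(group.targets())
```

In this model an empty `targets()` list corresponds to the outage, since the load balancer has nowhere left to send traffic.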

Corrective and Preventive Actions

The ZenML Cloud team took the following corrective actions to resolve the issue:

  • Added evicted nodes back to the target group.
  • Investigating how to create more points of contact between the load balancer and the cluster.
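
One way to reason about "points of contact" is to count the nodes that can actually receive traffic. The sketch below is illustrative only: the function names and node names are hypothetical, and it assumes that a node passes the health checks only if it hosts an ingress controller pod, which is consistent with the root cause analysis but is not confirmed as a general rule.

```python
def points_of_contact(cluster_nodes, controller_nodes):
    """Return the cluster nodes that host an ingress controller pod.

    Assumption: only these nodes pass the load balancer's health checks.
    """
    return sorted(set(cluster_nodes) & set(controller_nodes))


def is_single_point_of_failure(cluster_nodes, controller_nodes):
    """True when at most one node can receive traffic from the load balancer."""
    return len(points_of_contact(cluster_nodes, controller_nodes)) <= 1


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]
    # Controller on one node only, as in the incident.
    print(is_single_point_of_failure(nodes, ["node-a"]))
    # Controller spread over two nodes gives a second point of contact.
    print(is_single_point_of_failure(nodes, ["node-a", "node-b"]))
```

A check like this could run against the cluster's node and pod inventory to flag the single-node situation before it causes an outage.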
Posted Sep 08, 2023 - 10:57 UTC

Resolved
Posted Sep 07, 2023 - 10:48 UTC