
Infrastructure

Constraint Management for cluster operation safety and reliability at Twitter

By
Thursday, 19 January 2023

As a team and its infrastructure grow, so does the risk of an unplanned service outage and potential data loss. As an example, the Platforms Messaging (Apache Kafka and PubSub services) team at Twitter runs dozens of Kafka broker clusters. These Kafka brokers are stateful services that run on hybrid-mesos, and the clusters range in size from tens to several hundred brokers. Because these are stateful services, the information stored on the SSDs of these machines cannot be lost. To accomplish this, the team uses ‘n’ replicas, meaning the same data is replicated across ‘n’ machines in the cluster. We can therefore lose up to n-1 machines without data loss, but losing more can cause publishing failures. Machines can be lost for many reasons. Some are inevitable, such as hardware failures, power failures and network failures, while others can be mitigated if we are careful, such as mistakes in maintenance operations, technology refreshes and other general operational events.

There have been a few occasions in the past where a cluster unintentionally went into an under-replicated state, primarily because of operator-to-operator collisions and operator-to-automation collisions. For example, an operator removes a machine from service before realizing another removal was already in progress from a different operator. In another case, a cluster is undergoing a kernel-upgrade rolling reboot while an operator unknowingly performs a separate, independent operation targeting the same systems.

In addition to those examples, humans are bound to make mistakes. There are several historical incidents where operational tasks were not executed as intended. On some occasions an operator attempting a rolling reboot of a cluster accidentally entered the wrong batch size, causing more machines than intended to be rebooted at once. On others, a staging cluster was meant to be decommissioned and a production cluster was accidentally targeted.

We realized that to safely execute maintenance operations (e.g. throttle, drain, task restart, host swap), there needs to be a cluster-aware constraint management service: one that knows each cluster’s constraints, its current state, and the operations it is already undergoing.

Hence, we built an in-house, centralized, cluster-aware service for constraint management that serves as the final arbiter of cluster operations. Its primary function is to determine which actions are safe and ready for execution. Whether a cluster is safe is judged against cluster resource minimums defined by the owner of the cluster. Certain operations are deemed potentially dangerous (i.e. draining, rebooting, killing shards, restarting shards); if they are executed in an unsafe manner, the service may enter an unhealthy state. The goal of the constraint management service is to gather data from multiple sources and determine whether an operation is safe to execute.

These are a few example constraints we use while automating our fleet management. They are stored in ConfigBus, and our constraint management service injects them at runtime (an illustrative example of such a definition follows the list):

  1. You can perform no more than X operations on Y machines at once in one cluster. The operations with this constraint are rebooting, draining, aurora restart and aurora kill. For instance, in Kafka, certain operations should only be performed on one shard at a time to prevent under-replicated partitions and data loss.
  2. Cluster X should not experience operation Y more than Z times within a given time frame. We need to rate-limit certain operations, which is essential when automation is trying to solve pages. These rate limits are defined in ConfigBus; if a rate limit is breached, the operation is rejected.
  3. A machine can only undergo one operation at a time. For example, a machine should not be draining and undergoing a reboot at the same time.
  4. We need to ensure that operations from multiple parties in the same cluster do not cause constraints to be violated. We want to allow multiple people to work together on the same cluster in the event of a site-impacting incident.
  5. During a site-impacting incident, no automatic operations will be executed unless forced. This ensures we do not worsen the situation, since services may be fragile at that point in time. Humans can still override this manually with a --force flag accompanied by a --reason flag.
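
To make these constraints concrete, here is a rough sketch of what a per-cluster constraint definition could look like. The field names, values and structure below are illustrative assumptions, not the actual ConfigBus schema:

    # Hypothetical per-cluster constraint definition; all field names and values
    # are illustrative only and do not reflect the real ConfigBus schema.
    KAFKA_PROD_CONSTRAINTS = {
        "role": "kafka",
        "cluster": "kafka-prod-example",           # hypothetical cluster name
        "max_hosts_in_maintenance": 1,             # constraint 1: one shard at a time
        "rate_limits": {                           # constraint 2: per-operation rate limits
            "reboot": {"max_ops": 5, "window_minutes": 60},
            "drain":  {"max_ops": 3, "window_minutes": 60},
        },
        "one_operation_per_host": True,            # constraint 3
        "min_healthy_hosts": 95,                   # minimum healthy, active hosts
        "block_automation_during_incident": True,  # constraint 5: require --force
    }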

While building this system, we had to keep in mind two major concepts - resources and constraints.

Resources

Fundamentally, a cluster is defined by resources. These resources include compute, storage, memory and network. In the case of bare metal services, this is defined by the number of machines and their individual SKUs which define their capacities.

A cluster is provisioned with a certain amount of these resources to perform the task it was designed for. For instance, a storage cluster is provisioned with a certain amount of storage capacity and must maintain a minimum amount of storage to serve customers. Another example is network throughput: the cluster may need a certain amount of network throughput to sustain operations.

To ensure a cluster maintains excellent success rates, a resource buffer needs to be built into the cluster. For example, the cluster may have 20% more storage or 30% more network throughput capacity than it strictly needs. One purpose of this buffer is maintenance. Machines can have hardware issues, require software or kernel updates, or go out of warranty and need to be replaced. The cluster needs enough redundancy and headroom that removing or rebooting machines does not affect its operation or have negative customer effects.
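
As a back-of-the-envelope illustration of how a buffer translates into maintenance headroom, the short calculation below (with made-up numbers) estimates how many hosts could be taken down at once while the cluster still retains the capacity it actually needs:

    # Illustrative numbers only: a 100-host cluster provisioned with ~20% headroom.
    total_hosts = 100
    buffer_fraction = 0.20
    required_hosts = int(total_hosts / (1 + buffer_fraction))  # ~83 hosts of capacity must stay available
    max_in_maintenance = total_hosts - required_hosts           # ~17 hosts could be down at once
    print(required_hosts, max_in_maintenance)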

Constraints

A constraint is defined as the minimum amount of resources required to maintain cluster operations, as well as the maximum amount of resources that can temporarily be removed from a cluster for maintenance at a time. Constraints are defined for each cluster individually. Some examples include: maximum simultaneous hosts in maintenance, maximum simultaneous hosts in maintenance as a percentage of the cluster, maximum simultaneous hosts in maintenance per rack, maximum number of non-drained downtime hosts, minimum network bandwidth for the cluster, minimum number of healthy and active hosts, minimum compute capacity for the cluster, and so on.

The constraint management service would need to be able to query the cluster’s number of hosts, number of healthy hosts, cluster success rate, machine types, etc.

In addition, custom success rate queries can be defined with an associated percentage and threshold. For example, the Messaging team would define a query to check for under-replication of a Kafka cluster before allowing maintenance.
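
A minimal sketch of how such a check might be evaluated is shown below. The field names, data sources and thresholds are assumptions for illustration; the real service aggregates this state from internal systems:

    # Minimal sketch of a safety check; keys and thresholds are illustrative assumptions.
    def is_maintenance_safe(constraints, cluster_state, hosts_requested):
        # Never exceed the allowed number of simultaneous hosts in maintenance.
        in_maintenance = cluster_state["hosts_in_maintenance"] + len(hosts_requested)
        if in_maintenance > constraints["max_hosts_in_maintenance"]:
            return False
        # Keep at least the minimum number of healthy, active hosts.
        healthy_after = cluster_state["healthy_hosts"] - len(hosts_requested)
        if healthy_after < constraints["min_healthy_hosts"]:
            return False
        # Custom success-rate query, e.g. reject maintenance if the Kafka cluster
        # is already under-replicated (hypothetical metric name and threshold).
        if cluster_state["under_replicated_partitions"] > 0:
            return False
        return True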

Moratoriums

In addition to constraints, a customer can place their cluster or role in a moratorium. This covers production change freezes or any other period during which they do not want maintenance performed on the cluster. An emergency flag in the request still allows emergency maintenance.
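
A moratorium check could look roughly like the following sketch, assuming a per-cluster moratorium flag and an emergency override in the request (both names are illustrative):

    # Sketch only: "moratorium" and "emergency" are hypothetical field names.
    def passes_moratorium(cluster_config, request):
        if cluster_config.get("moratorium", False) and not request.get("emergency", False):
            return False  # reject non-emergency maintenance during a change freeze
        return True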

ConfigBus

The constraints configuration files for each cluster are stored in ConfigBus under a well-defined directory. Each role has a subdirectory, and under each role subdirectory is a cluster-specific constraints definition file. This keeps the configuration manageable and quick to modify.
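
As a hypothetical illustration (the actual paths and file format are internal details), the layout might resemble:

    constraints/                        <- hypothetical top-level ConfigBus directory
        kafka/                          <- one subdirectory per role
            kafka-prod-example.json     <- per-cluster constraints definition
            kafka-staging-example.json
        pubsub/
            pubsub-prod-example.json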

REST API

We selected REST APIs as the interface because they are language agnostic, simple, lightweight and flexible. The client invokes a REST API with appropriate authentication and authorization; the response is based on the current state of the cluster and the number of machines the request wants to operate on. The constraint management service examines the request and predicts whether the cluster, given its defined constraints, can tolerate losing those machines, or a subset of them, for the duration of the maintenance. It calculates whether all of the machines or only a subset can be operated on and returns that list to the requester. The requester then requests to start maintenance on each host or set of hosts in the response and notifies the service when the work is complete. Example APIs include requesting, starting, stopping and cancelling maintenance. Note that the maintenance operation itself is not carried out by the constraint manager; for modularity and flexibility, a separate maintenance orchestration service should perform it.
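
The sketch below illustrates this request/start/stop flow from a client’s point of view. The endpoints, payload fields and authentication shown are hypothetical placeholders rather than the real API:

    # Illustrative client flow against hypothetical endpoints; real paths, payloads
    # and auth are internal and will differ.
    import requests

    BASE = "https://constraint-manager.example.com/api/v1"  # hypothetical base URL
    AUTH = {"Authorization": "Bearer <token>"}               # placeholder auth header

    # 1. Ask the constraint manager which of the requested hosts are safe to operate on.
    resp = requests.post(f"{BASE}/maintenance/request", headers=AUTH, json={
        "role": "kafka",
        "cluster": "kafka-prod-example",
        "operation": "reboot",
        "hosts": ["host-001", "host-002", "host-003"],
        "requested_duration_minutes": 60,
        "reason": "kernel upgrade",
    })
    approved_hosts = resp.json().get("approved_hosts", [])

    # 2. Start maintenance only on the approved subset, then notify on completion.
    for host in approved_hosts:
        requests.post(f"{BASE}/maintenance/start", headers=AUTH, json={"host": host})
        # ... a separate orchestration service actually performs the reboot here ...
        requests.post(f"{BASE}/maintenance/stop", headers=AUTH, json={"host": host})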

With each component that operates on the clusters communicating with the constraint management service, we can be sure that no more than the allowed number of hosts are operated on at a time.

Metrics

The constraint management service also collects metrics on the maintenance activities and provides a dashboard that reports the maintenance state of each cluster.

Some metrics that are collected are - operation type, requester, reason, role, cluster, requested duration, hosts per role, accepted/denied requests and platform type. These metrics provide us with various insights.

Audit Log

There is an audit log of each request and the resulting response. This audit log will reside in a database table which can be used for post-mortem or other types of analysis. The log can be used to see the sequence of all requests beginning with the request, start of maintenance, end of maintenance and cancellation of maintenance. This is immensely useful for finding the root cause of issues that happen in production.

Database

We used MySQL databases to store audit logs, maintenance states and reasons as well as maintenance operations.
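
As an illustration of the kind of audit record described above, the following sketch writes one row using a hypothetical table layout and connection details:

    # Sketch only: the table name, columns and connection parameters are assumptions.
    import mysql.connector

    conn = mysql.connector.connect(host="db.example.com", user="constraint_mgr",
                                   password="...", database="constraints")
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO maintenance_audit_log "
        "(requester, role, cluster, operation, hosts, state, reason, requested_at) "
        "VALUES (%s, %s, %s, %s, %s, %s, %s, NOW())",
        ("alice", "kafka", "kafka-prod-example", "reboot",
         "host-001,host-002", "REQUESTED", "kernel upgrade"),
    )
    conn.commit()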

Having the constraint management service in place has allowed us to ensure that maintenance operations on a role or cluster are performed safely without maintenance overlaps and that the maintenance will not negatively impact success rate, required replication or capacity minimums. This has drastically reduced the number of site impacting incidents and hence, has increased the reliability and robustness of the fleet.

We would like to recommend the use of a constraint management service to anyone in the industry who runs clusters at scale for increased fleet reliability, reduced toil and increased ease of operations.

Acknowledgements

We’d like to thank Matthew Swartzendruber, Morgan Horst, Tom Handal and many others who were involved in the design, development and deployment of the constraint management service at Twitter.
