Engineering Blog

Part #3: How Cayan helps protect merchants from costly downtime

(This is part 3 of a 3 part series)

In our second article, we discussed several strategies that merchants can employ to protect themselves against power losses and internet outages. In this third and final article in the series, we'll dive into Cayan's high availability architecture, disaster recovery, and redundancy, and show how they let us deliver best-in-class uptime and service.

We've previously discussed what merchants can do to harden themselves against outages, so now it's time to talk about what payment processors, like Cayan, do to harden themselves and provide the best possible uptime guarantees. It turns out that we're up against many of the same challenges that merchants are, just at a larger scale.

30,000 foot view of Cayan's High Availability architecture

Cayan's high availability architecture begins with our data centers. Our data centers, managed by CenturyLink, are like most data centers, in that they provide lots of redundancy and safeguards out of the box. Most data centers have redundant electricity vendors, with backup generators, and backup fuel suppliers. They usually have redundant cooling and fire detection & suppression systems. They also ensure that all of the cages are 24" off the floor, as a precaution against moderate flooding.

Cayan's data centers are run in an "active/active" configuration. This means that both sites are always processing credit card transactions, so that we can continuously prove that both data centers are "fit for purpose". Having two data centers also allows us to do routine maintenance on one of them without impacting merchants' service, and helps assure service to our merchants in the event of a catastrophic outage at one of our data centers.

Cayan's data centers are geographically diverse, located in Boston and Chicago, separated by just under 1000 miles / 1600 km. Geographic diversity is imperative when it comes to disaster recovery planning. Primary and secondary sites should have enough distance between each other to minimize the potential for a disaster, such as an outage, earthquake, flood, tornado, hurricane, or fire, to take down both sites.

Which of our data centers your transaction is routed to is determined by our global server load balancer (GSLB), which uses the Domain Name System (DNS) under the hood. Since last October, Cayan's DNS providers have been redundant, a response to the DDoS attack on the internet's #1 DNS provider. Our GSLB is primarily responsible for managing our global availability. If, for any reason, our DNS providers can't successfully perform health checks against a particular data center, the GSLB removes that data center from the High Availability Group and temporarily redirects all traffic to our second data center. When the affected data center passes its health checks again, the GSLB automatically restores traffic to it.
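The health-check behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Cayan's GSLB: the `HighAvailabilityGroup` class and its method names are hypothetical, and a real GSLB would typically require several consecutive failed checks before pulling a site.

```python
from dataclasses import dataclass, field


@dataclass
class HighAvailabilityGroup:
    """Hypothetical model of the set of data centers eligible for traffic."""

    # Maps data center name -> result of its most recent health check.
    health: dict = field(default_factory=dict)

    def record_health_check(self, site: str, passed: bool) -> None:
        # A failed check removes the site; a later passing check restores it.
        self.health[site] = passed

    def active_sites(self) -> list:
        # Only sites whose latest health check passed receive traffic.
        return sorted(site for site, ok in self.health.items() if ok)


group = HighAvailabilityGroup()
group.record_health_check("boston", True)
group.record_health_check("chicago", True)
both_up = group.active_sites()

# Boston fails its health check: all traffic goes to Chicago.
group.record_health_check("boston", False)
during_outage = group.active_sites()

# Boston passes again: it rejoins the group automatically.
group.record_health_check("boston", True)
after_recovery = group.active_sites()
```

In this sketch no operator action is needed to bring a site back; the next passing health check is enough, which mirrors the automatic restore described above.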

Our GSLB also allows us to control how much traffic either of our data centers accepts. Typically, it's configured for a "50/50" traffic split, with roughly equal amounts of traffic being routed to each data center. During a maintenance window, it'll be 100/0, with all traffic being routed to a single data center. After we bring a site back online from maintenance, our GSLB solution allows us to slowly ramp up traffic to a data center in a 90/10 or 80/20 split, while we actively monitor our merchants' transactions for any sort of anomalies.
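As a rough illustration of how a weighted split like this behaves, the sketch below picks a data center for each request in proportion to configured weights. The `pick_data_center` function is a hypothetical stand-in; an actual GSLB applies its weights when answering DNS queries rather than per transaction.

```python
import random


def pick_data_center(weights: dict, rng: random.Random) -> str:
    """Choose a data center name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(42)  # seeded so the sketch is repeatable

# Maintenance window: 100/0, every request goes to a single data center.
maintenance = {"boston": 100, "chicago": 0}
maintenance_picks = [pick_data_center(maintenance, rng) for _ in range(1000)]

# Ramp-up after maintenance: 90/10, a small share returns to the second site.
ramp_up = {"boston": 90, "chicago": 10}
ramp_picks = [pick_data_center(ramp_up, rng) for _ in range(10000)]
chicago_share = ramp_picks.count("chicago") / len(ramp_picks)
```

With a 90/10 split, `chicago_share` comes out close to 0.10, so only a small fraction of transactions is exposed to the newly restored site while it is being monitored.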

Traffic arrives at our data centers via redundant ISPs. Using different ISPs at each of our data centers helps ensure that an issue affecting one of our ISPs or their peering partners is isolated to a single data center, leaving our other data center intact. Our ISPs also offer protection against Denial of Service attacks.

Within our data centers, all of our networking and computing equipment is fully redundant, with redundant switches, firewalls, load balancers, application servers, and virtual machine hosts ensuring that our services remain available to our merchants even if we were to lose multiple components within a data center. This infrastructure is self-monitoring and self-healing: the loss of even our most critical components within a data center is automatically detected and corrected in under 20 seconds.

Communication between our data centers happens over redundant dedicated circuits, to ensure that we always have 1 Gbps of bandwidth available, so that our databases are synced off-site in real time.

This system design, with redundancy in place at every level of our architecture, helps Cayan provide its merchants with the highest level of service and availability.