PingOne runs on two separate data-layer technologies for customer-facing applications. Without this configuration, we simply could not claim true geographic fault tolerance or high availability.

One is MySQL with Galera and the other is Cassandra. Both are geographically distributed, write-anywhere/read-anywhere technologies. Our data layers are spread across a hybrid cloud architecture made up of two private data centers and one public cloud. The reason for this setup? Scalability, flexibility, and high availability.

PingOne can autonomously handle failures in the stack, ranging from a single subsystem to an entire data center.

Today I'm going to lay out our Galera MySQL architecture. I will dissect other parts of the architecture in later articles, so stick with me.
As so many people say but so few follow, choosing the right technology for the problem can be the difference between success and failure. Galera is not the right MySQL solution for every use case, nor is it perfect, but it does have some advantages we found attractive.

The Pain
High availability and fault tolerance come at a cost, but once you’ve paid it your enterprise application is a lot better off. Our MySQL Galera implementation required a massive investment of time and effort at every level of our stack.

Among other things, WAN replication imposed programming considerations on developers and hurdles for performance engineers to clear; monitoring and reporting engines had to be built to simplify troubleshooting; database queries were rewritten to boost performance; test engineers sacrificed sleep; and databases were reconfigured. By no means was this a small endeavor.

The Benefits
Hurricane Sandy unfortunately battered the East Coast, but it couldn't bring us down, even though PingOne lost a data center. As a result of our distributed architecture, the majority of PingOne customers knew Sandy only as a video clip on the nightly news.
As one Ping employee put it, “Yup...Still Up”. I think there were even a few songs written about it that were wildly popular here.

The Architecture
Our MySQL Galera implementation contains all configuration data for PingOne. Various subsystems contribute information to, or retrieve it from, this geographically dispersed relational database.

Functional requirements for the RDBMS were fairly stringent. Security around authentication and authorization was paramount. If at all possible, we wanted reads and writes from anywhere. Why not, right? We have a small dataset and our writes aren't enormous, so let's take advantage of that. Data needed to be consistent across the architecture. The cluster was required to live through a data center failure. Write performance was important, not in the nanosecond realm, but it needed to be relatively fast, with "fast" judged holistically across our entire stack: is PingOne within web standards and meeting customer needs?

The Galera MySQL architecture was designed as a six-node cluster distributed across three geographically dispersed data centers. That equates to two servers per data center, allowing a single server in any data center to go down without triggering a data center failover event.
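For concreteness, here is roughly what such a topology looks like from a single node's my.cnf. This is only a sketch: the hostnames, cluster name, and provider path are placeholders, not our actual production values.

    # my.cnf fragment for one node in a six-node, three-data-center cluster.
    # Hostnames, cluster name, and the provider path are placeholders.
    [mysqld]
    binlog_format            = ROW       # Galera requires row-based replication
    default_storage_engine   = InnoDB    # Galera replicates InnoDB tables only
    innodb_autoinc_lock_mode = 2         # interleaved auto-increment locking

    wsrep_provider        = /usr/lib64/galera/libgalera_smm.so
    wsrep_cluster_name    = example_cluster
    # Two nodes in each of three data centers (dc1, dc2, dc3):
    wsrep_cluster_address = gcomm://db1.dc1,db2.dc1,db1.dc2,db2.dc2,db1.dc3,db2.dc3
    wsrep_node_name       = db1.dc1
    wsrep_sst_method      = rsync        # how a joining node pulls a full state snapshot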
The largest performance impact, as many would guess, is network latency and thus WAN replication. It's the bane of distributed databases. Galera uses certification-based replication (also known as certification-based conflict resolution), which requires far less overhead, and therefore performs better over the WAN, than traditional two-phase commit. This does, however, come with a limitation.

The single largest hurdle with certification-based replication, and thus Galera, is deadlocks, known to the PingOne team as WANlocks. Why do we call them WANlocks? Because they aren't traditional deadlocks, even though the exception generated is a deadlock.

Traditional deadlocks occur when:

Process 1 holds a lock on TableA and wants a lock on TableB. Process 2 holds a lock on TableB and wants a lock on TableA.
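In SQL terms, the classic case looks something like this (the tables and columns here are made up purely for illustration):

    -- Session 1
    BEGIN;
    UPDATE accounts  SET balance = balance - 10 WHERE id = 1;   -- locks the accounts row

    -- Session 2
    BEGIN;
    UPDATE audit_log SET checked_at = NOW() WHERE id = 7;       -- locks the audit_log row
    UPDATE accounts  SET balance = balance + 10 WHERE id = 1;   -- blocks, waiting on session 1

    -- Session 1 again
    UPDATE audit_log SET checked_at = NOW() WHERE id = 7;       -- blocks, waiting on session 2

    -- Each session now waits on the other. InnoDB detects the cycle and rolls one
    -- transaction back with ER_LOCK_DEADLOCK (error 1213).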

WANlocks are the result of optimistic locking on replication across the cluster: each node accepts a write locally, and a transaction that turns out to conflict with one from another node is rolled back at certification time with a deadlock error. This can happen in a LAN environment too; it's simply more pronounced in a WAN model. Optimistic locking is inherent in certification-based replication. More on this from my friends over at Severalnines.

 

[Diagram: Galera MySQL optimistic locking]

 

The question is, how did we get around this? There are several options, including:

  • only writing to one MySQL server
  • partitioning writes to different servers depending on the database
  • eliminating hot tables/rows

The most common is directing all writes at a single MySQL server. It's often recommended because it's dead simple to implement: just redirect all writes to one location and you're done. Set and forget. However, it adds a vertical-scaling burden; you have to add server resources to both the primary and the secondary machines designated for writes. If your servers are large and you have plenty of headroom, this is a perfectly viable solution.

The second option is partitioning writes. It wouldn't be a bad idea if PingOne had more than a couple of databases to distribute writes across.

Last is eliminating hot tables, whether through application changes or schema changes, and quite possibly both. Yes, it can hurt the brain a little. Yes, it takes more time to get straightened out. Yes, we chose it. Why? Because once it's done, you have eliminated a major roadblock to being able to write anywhere.
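As a made-up illustration of the pattern (our actual schema isn't shown here), the trouble usually comes from a row that every node wants to update at roughly the same time, and the fix is to stop updating it:

    -- Hot row: every node increments the same counter, so writes landing on
    -- different nodes at nearly the same time fail certification (a WANlock).
    UPDATE org_stats SET login_count = login_count + 1 WHERE org_id = 42;

    -- One way to eliminate it: append instead of update, and aggregate on read.
    INSERT INTO org_logins (org_id, node_name, logged_at)
    VALUES (42, @@hostname, NOW());

    SELECT COUNT(*) FROM org_logins WHERE org_id = 42;   -- the old counter value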

The benefit of certification-based replication, and thus optimistic locking, is lower replication overhead. Combined with Galera's parallel replication, this allows for very fast synchronization, especially over the WAN. It was probably one of the largest factors in our design choice.

As a result of the PingOne team’s effort and our current configuration, we have managed to achieve 5,500 randomized reads/writes per second, or 330,000 reads/writes per minute, at 17% writes and 83% reads in stress testing. That ratio of writes to reads is fairly indicative of our typical workload.

Next post I'll delve a little bit into our Cassandra architecture.

About the Author: Michael Ward is a Site Reliability Engineer for Ping Identity, specializing in high availability solutions and system architecture. mward@pingidentity.com

 

AUTHOR: Robert Hodges
EMAIL: robert.hodges@continuent.com
DATE: 02/21/2013 02:53:40 PM Quick question--if you eliminate optimistic deadlocks for Galera doesn't this mean any multi-master replication would work?

DATE: 02/21/2013 04:59:54 PM Robert,

This is a good question. Thanks for asking it.

Galera addresses a number of issues compared with multi-master replication; here is a list of the benefits we received by moving to our configuration. Granted, not all of these are purely Galera.

I'm sure I'm missing some things but here are a few.

-Management of the Cluster
I really, really hated managing and monitoring the master/slave connections.
With Galera I don't even worry about this anymore.
-Write conflicts when writing to multiple masters
One could debate whether I got rid of one tiger and replaced it with another on this one. WANlocks anyone?
-Replication lag
There are several things that helped us out on this front.
1) row-based replication (we used statement-based prior)
Caveat: This is not particular to Galera, just a change we made
2) certification-based replication (no two-phase commit)
3) parallel replication
(There's a small config sketch of these settings after this list.)
-Resynchronization after node failure (Now possible without Donor node becoming Read-Only)
Dead simple - Start up MySQL after failure, point it at a node in the cluster and it does the rest.
-Parallel replication
Galera replicates changes out to other nodes in the cluster in parallel,
whereas MySQL multi-master replicates data in a ring. If at any point you lose connectivity somewhere in that ring, that data is missing until the connection is re-established, which, as you know, is not always automatic. This is a problem if you are reading from and writing to every node in the cluster, as we are.
-Fewer complications when adding nodes to the cluster
As long as the base configuration is good (my.cnf), point it at a node in the cluster and Galera does the rest.
-Auto-increment-increment and auto-increment-offset are, well....... automatic
Go Figure!!!
-Synchronous replication and thus guaranteed consistency
Not huge for us, but with asynchronous replication, if you completely lose a server before it replicates its data, that data is lost. If the server happens to be under high write load, replication takes a back seat.
With Galera, if the write was successful, it now lives on multiple servers.
-User creds automatically replicated across the cluster
-Write scalability
Writes cost a lot in server resource consumption; however, in Galera that cost only pertains to the initial server being written to. The nodes receiving the replication incur a far smaller resource penalty.
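For reference, here is roughly where the replication-related items above live in the configuration. The values are illustrative only, not our production settings:

    # Illustrative my.cnf fragment; values are examples, not PingOne's settings.
    [mysqld]
    binlog_format                = ROW   # row-based replication (point 1 above)
    wsrep_slave_threads          = 8     # parallel appliers on each receiving node
    wsrep_auto_increment_control = ON    # per-node auto-increment increment/offset handled automatically
    # Certification-based replication (point 2) is simply how Galera works; there is no switch for it.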

Yes, I understand there are tools to automate some of the items above, such as re-establishing connections between multi-masters or automating auto-increment configuration, but why build or run a tool to do this if I don't need to?

Cons to Galera other than those listed in my article:
One thing I didn't mention in my article is the inability to replicate databases to only certain nodes in the cluster.
Currently, if a database exists in Galera, it's replicated everywhere. For good horizontal scalability, Galera needs some form of partitioning to allow better write scalability. Hopefully that's coming in the not-too-distant future. It MAY already be out, but honestly I haven't kept up in the last several months.

DATE: 02/22/2013 11:28:48 AM Another advantage I failed to mention:

Tunable options for network latency
Galera provides fairly granular options for tuning a cluster for network latency and lets a cluster admin determine at what point a node should be failed out of the cluster. This is very important when considering brownout conditions, which can occur frequently in a WAN cluster. Configuring these parameters too tightly creates the potential for flapping, which can wreak havoc on a cluster.
Here is a link for tuning in WAN replication.
http://www.codership.com/wiki/doku.php?id=configuration_tips

These are base settings for WAN replication and should be tested and modified for any given cluster, depending on that cluster's failure requirements.
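For illustration, these knobs live in wsrep_provider_options. The values below are only a starting-point sketch in the spirit of the Codership tips, not our production settings:

    # Loosen failure detection so WAN latency spikes and brownouts don't get a
    # healthy node evicted; looser values also mean slower detection of real failures.
    [mysqld]
    wsrep_provider_options = "evs.keepalive_period=PT3S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT1M; evs.install_timeout=PT1M"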
