For a few years now, enterprises have leveraged the idea of a "hybrid cloud" as a way to start deploying workloads into IaaS platforms where it makes sense, while still complying with the regulatory standards and security practices attached to legacy applications.
Ideas start on a napkin, or a whiteboard, or (sometimes) with crayons on a table at Croc's in Denver over a couple of large margaritas. It's great fun to add that extra cloud to the mix and think of the possibilities all these services provide with a credit card and a simple API call. In reality, once you really dig in, there are tons of very interesting problems to be solved. Engineers love interesting problems.
What makes a hybrid cloud?
There are many interpretations of what a "hybrid cloud" actually is. In the earliest stages of adoption, it could just mean that a company uses Gmail combined with Microsoft Windows file shares located in a private datacenter. Others may refer to a hybrid cloud as the use of public cloud resources for development or test purposes.
Combining the use of SaaS applications and on-premise resources can technically be called a "hybrid cloud strategy." These initial cloud strategies invariably lead to more discussions around how to further leverage resources outside the enterprise datacenter.
Ping's been lucky enough to enjoy highly available VMware private cloud data centers that we've been scaling up since 2005. We were early adopters of the virtualization trend, and have optimized the design of our private data centers around hypervisors and their supporting storage, networking, and security technologies. Our SaaS applications have leveraged this infrastructure to great success.
Understand The Detailed Differences
This might seem like a "duh" to people, but it's one of the most important things to have nailed down. For all its infinite scale, the public cloud has been designed largely with its details hidden under the hood of a glossy API or a crappy web interface. How do you map the intricately crafted persistence profiles of an F5 LTM virtual server over to Amazon's feature-nerfed Elastic Load Balancer? The ELB doesn't even allow different load balancing algorithms, let alone the iRules you may be leveraging to enforce traffic and access rules. We had to push a lot of the infrastructure's intelligence into each subsystem of our application for it to work properly in any environment. You'll find that your applications may be leaning heavily on your enterprise infrastructure and all its intricate capabilities. Moving those capabilities into your application will make it far, far more platform agnostic.
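As a concrete illustration of pushing load balancer intelligence into the application: one thing an F5 persistence profile does is pin a session to the same backend. A minimal, hypothetical sketch of doing that in application code, by hashing the session ID to a backend (the backend names here are invented for illustration, not our actual topology):

```python
import hashlib

# Hypothetical backend pool; in a real deployment this list would come
# from service discovery or configuration, not a hard-coded constant.
BACKENDS = ["app-node-1", "app-node-2", "app-node-3"]

def backend_for_session(session_id: str, backends=BACKENDS) -> str:
    """Deterministically map a session to a backend, replacing an
    LB-side persistence profile with application-side logic."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return backends[int(digest, 16) % len(backends)]
```

Because the mapping lives in the application rather than the load balancer, it behaves identically behind an F5, an ELB, or anything else in front of it.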
Virtual Private Clouds
Until the capabilities of VPCs matured, this was a blocking feature of the public cloud for us. Amazon's VPC feature set has grown to match up nicely with our own network architecture, allowing the creation of real subnets with security groups, network access rules, and the ability to route layers of traffic back to our own data centers. PingOne uses meticulous network segmentation within the core application to ensure that each subsystem in the fleet talks only to authorized peers. Not every application relies heavily on network segmentation as a security feature, so consulting with your own security and network teams to map regulatory requirements onto public cloud networks is a big one. We took the 1:1 approach, meaning the network architecture of our VPCs is required to map perfectly to what we build in the private cloud. This can have a big impact on the cloud vendors you choose: some providers have very mature VPC capabilities, others do not.
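The 1:1 idea can be thought of as a single allow-list that drives both the private-cloud firewall rules and the VPC security groups. A hedged sketch, with invented tier names and rules purely for illustration:

```python
# Hypothetical segmentation policy: one source of truth for which
# subsystems may talk to which peers, regardless of which cloud the
# rule is ultimately rendered into. Tier names are made up.
ALLOWED_PEERS = {
    "web-tier": {"app-tier"},
    "app-tier": {"db-tier", "cache-tier"},
    "db-tier":  set(),  # databases initiate no outbound connections
}

def is_authorized(src: str, dst: str) -> bool:
    """True only if src is explicitly allowed to reach dst."""
    return dst in ALLOWED_PEERS.get(src, set())
```

Rendering the same policy into iptables rules on one side and security group definitions on the other is what keeps the two environments from drifting apart.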
Flexible & Scalable Data Layer
Mike Ward has posted some very detailed walkthroughs of our Galera MySQL and Cassandra infrastructure on this blog, and I cannot stress enough how important it is to have your database architecture configured properly when using multiple data centers. The ability of applications to write locally while the data layer pushes the data out to all your other locations is key to truly distributing your application between clouds with 100% functionality and failover capability. Extending PingOne into AWS was successful in large part due to the stability and geographic scalability of our databases. Speaking strictly as an Ops guy, I love Cassandra and DataStax DSE in this role for their incredible resilience and scalability. The ability to write to any node anywhere, with granular control over how that data is distributed around the ring, is a dream come true for reliability. Reads and writes need to happen as close to the application as possible for good performance, and that's very tricky when distributing nodes all over the globe. In the private cloud, we can set up EVPL links, fiber, MPLS, or whatever else is needed to get the best performance out of data center intercommunication. Linking up to VPCs through IPsec VPNs is entirely different. Your database technology needs configuration options that allow for tuning to these differing conditions, and it needs to understand where each node sits in the environment.
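The "write locally" property comes down to quorum arithmetic. Cassandra's LOCAL_QUORUM needs a majority only within the local data center, while a plain QUORUM needs a majority across every replica, which means crossing the WAN. A small sketch of that math (the data center names and replication factors are illustrative, not our actual configuration):

```python
def quorum(replicas: int) -> int:
    """A quorum is a simple majority of replicas."""
    return replicas // 2 + 1

def local_quorum(rf_by_dc: dict, local_dc: str) -> int:
    """LOCAL_QUORUM: majority of replicas in the local DC only."""
    return quorum(rf_by_dc[local_dc])

def global_quorum(rf_by_dc: dict) -> int:
    """QUORUM: majority of all replicas across every DC."""
    return quorum(sum(rf_by_dc.values()))

# Hypothetical topology: RF 3 in a private DC, RF 3 in AWS.
rf = {"private-dc": 3, "aws-us-east": 3}
```

With this topology, a LOCAL_QUORUM write needs 2 acks from nearby nodes, while a QUORUM write needs 4 acks and has to wait on the IPsec tunnel. That difference is exactly why tunable consistency matters in a hybrid deployment.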
Global DNS Load Balancers
I'm kind of a DNS geek, I'll admit it. In my career I've built and re-built quite a few DNS infrastructures for various companies. Initially at Ping, we used our own in-house F5 GTM DNS load balancers for global data center balancing, but quickly realized the F5's functionality wouldn't scale to handle the diversity of our platforms. We moved over to DynECT in early 2011, prior to taking PingOne GA, and I couldn't be happier with their service. The ability to direct traffic and monitor data center endpoints is absolutely key to running a hybrid application. It not only gives PingOne the ability to fail production traffic away from affected data center regions, but allows us to direct failovers into data centers that can automatically increase scale. The concept of a "Scaling Data Center" is specifically designed to take advantage of the strengths of the public cloud: if PingOne gets a blast of traffic, or an east coast region goes offline due to a weather event, we can easily direct traffic over to AWS and let the autoscaling groups take over. There are some tricks to this transition, however. Specifically, DynECT's traffic manager can handle A records or CNAMEs, but not both at the same time within a traffic-managed group. That can be a bit problematic when you toss AWS into the mix, because AWS requires CNAMEs.
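The failover logic itself is conceptually simple: answer queries with the first healthy endpoint in priority order. A hedged sketch of that decision, which also shows the record-type wrinkle described above (the endpoints here are invented examples; the IP is from the documentation range):

```python
# Hypothetical endpoint list in failover priority order. Note the
# mixed record types: a private DC answers with an A record, while an
# AWS ELB must be answered with a CNAME to its generated hostname --
# a combination a single traffic-managed group may not support.
ENDPOINTS = [
    {"dc": "denver",      "type": "A",     "value": "203.0.113.10"},
    {"dc": "aws-us-east", "type": "CNAME",
     "value": "example-elb.us-east-1.elb.amazonaws.com"},
]

def dns_answer(health: dict) -> dict:
    """Return the first healthy endpoint, in priority order."""
    for ep in ENDPOINTS:
        if health.get(ep["dc"], False):
            return ep
    return ENDPOINTS[-1]  # last-resort answer if everything is down
```

When the Denver health check fails, the answer flips from an A record to a CNAME, which is exactly the transition that needs careful handling in the traffic-managed group.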
Our future plans for PingOne include additional platforms and providers both within the US and globally. We're also looking into ways to integrate "edge services" into the hybrid mix. These nodes and services would be subsystems able to cache and cluster locally, while calling home through a secure API to access data objects.
So grab a margarita and some crayons, or scribble some ideas on a napkin after a couple beers at Freshcraft like our DevOps team does every Friday afternoon. As always, please feel free to contact me or the Site Reliability Engineering team with ideas or questions about our infrastructure.
About the Author: Beau Christensen is the Manager of SaaS Operations & Reliability at Ping Identity, responsible for cloud infrastructure architecture and reliability of all SaaS applications. firstname.lastname@example.org | @beauchristensen