PingOne Site Reliability Engineering
The PingOne service has the unique and sometimes frightening job of providing authentication middleware services to some of the world's largest companies. We take this responsibility quite seriously within Ping Identity, so much so that a new type of operations team was embedded within the engineering organization to handle the uptime, security, and integrity of the service. It's our interpretation of Site Reliability Engineering. Combining Site Reliability with Development is what we call the PingOne DevOps Team.
Hosting a Tier 1 SSO service that the most security-conscious enterprises in the world depend on forces you to make the most secure design choices possible, even when that makes things very, very hard to execute. This is a common ethos throughout Ping Identity's culture.
Do things the right way, even if it’s hard. No compromises.
The SRE Team had hands-on input into designing PingOne, working within DevOps to build security and reliability into the service from the foundation up. Ping has been running PingConnect, the first-ever cloud enterprise identity SaaS application, since 2008, so Operations has had plenty of experience developing process, building infrastructure, and sourcing talent. The PingConnect infrastructure provided us with a world-class VMware-based private cloud spanning two data centers, in Boston and Denver, that was already 98% virtualized and gave us a foundation for automating the private cloud.
PingOne was designed to be a completely different system than PingConnect. Where PingConnect has components built for multi-tenancy, PingOne is 100% multi-tenant. From the foundation, PingOne was designed as a cloud-scale application with components and services built for durability in a multi-datacenter environment. The aim of the team was to build an application that could retain 100% functionality in the event of a complete datacenter failure, allowing reads and writes to continue unimpeded and failover to be handled autonomously by intelligent monitoring systems. The key to this challenge is what we call PingOne's multi-master data layer. It uses a mix of Galera MySQL clusters, which allow applications to write to any node in any cluster in any datacenter location, and an enterprise-class Cassandra cluster for our NoSQL storage needs. We will talk more in depth about this technology in a later post.
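To make the multi-master idea concrete, here is a minimal Python sketch of the write pattern it enables. The hostnames, credentials, and schema are illustrative assumptions, not our actual deployment:

    import pymysql

    # Hypothetical Galera nodes spread across both datacenters; any node accepts writes.
    GALERA_NODES = [
        "mysql-a1.denver.internal.example",
        "mysql-a2.denver.internal.example",
        "mysql-b1.boston.internal.example",
    ]

    def write_anywhere(sql, params):
        # Try each node in turn; with multi-master replication a write accepted
        # by any reachable node is replicated to every cluster member.
        last_error = None
        for host in GALERA_NODES:
            try:
                conn = pymysql.connect(host=host, user="app", password="secret",
                                       database="pingone", connect_timeout=2)
                try:
                    with conn.cursor() as cur:
                        cur.execute(sql, params)
                    conn.commit()
                    return host  # the node that took the write
                finally:
                    conn.close()
            except pymysql.MySQLError as exc:
                last_error = exc  # node (or its datacenter) unreachable; try the next one
        raise RuntimeError("no Galera node reachable: %s" % last_error)

Because every node accepts writes, the same loop works unchanged whichever datacenter happens to be unreachable.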
This highly capable data layer, combined with the twin data centers we already had in place for servicing PingConnect, provided the foundation we needed to fulfill PingOne's design requirement for 100% active/active data center load balancing and autonomous failover. On top of this foundation, the DevOps team built multiple sub-systems and services with the ability to scale individually at different rates, and attached them to the multi-master data layer. We designed load balancer pools and virtual servers to peer into each of these services and take them offline if needed, and to feed that failure information up to the top-tier DynECT global load balancers, which then decide whether or not to take a service offline and fail it over to another location.
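As a simplified illustration of that feedback loop, a monitor might probe each service pool in a datacenter and decide when the site should be pulled from global rotation. The endpoints below are hypothetical, and the actual hand-off to the global load balancers is omitted:

    import requests

    # Hypothetical per-pool health endpoints for one datacenter.
    SERVICE_POOLS = {
        "web":    "https://denver.internal.example/web/health",
        "tokens": "https://denver.internal.example/tokens/health",
        "msg":    "https://denver.internal.example/msg/health",
    }

    def datacenter_healthy(pools=SERVICE_POOLS, timeout=2.0):
        # Returns True only if every service pool answers healthy. A False result
        # is the signal that would be fed upstream so the global load balancer
        # can fail traffic over to the other datacenter.
        for name, url in pools.items():
            try:
                if requests.get(url, timeout=timeout).status_code != 200:
                    return False  # pool is reachable but reporting unhealthy
            except requests.RequestException:
                return False      # pool (or the zone in front of it) is unreachable
        return True

In the real system the decision happens per service: the local load balancer pools take an individual service offline, and only the aggregated failure information reaches the global tier.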
The PingOne application itself is not a large, monolithic stack. It is designed in small pieces: subsystems that each have one, or maybe two, functions. The interconnecting systems and services communicate over internal networks only, never between public cloud resources or networks, and only in private secure zones. This fleet of subsystems and services can be deployed all at once during data center provisioning, or as individual layers, one at a time. This gives us a few key advantages:
Service Segmentation. We deploy these services into different security zones, depending on how sensitive their function is. We can create incredibly granular security between datacenters, zones, and services, and it gives the SRE Team the ability to statefully inspect traffic where needed, at any interchange between security zones. IDS systems not only inspect traffic at the border DMZ, but can also be deployed into internal zones, monitoring inter-system traffic on our own private networks for anomalies.
Independently Scalable Services. Maybe the messaging systems don't require as many resources as the web site? The token processor nodes are far more efficient than the logging service? No problem: each of these gets its own scaling group and pool of servers, letting the SRE Team provision resources far more efficiently across the service stack. If we need to change the shirt size of an instance for one component that requires additional memory, the change is added to deployment automation and pushed only to the system that needs it (a minimal sizing sketch follows this list).
Instance deployment, not code deployment. Because each of the services is small and self-contained, we can spin up entirely new layers of the service during deployment. Developers don't push code into production; they spin up a new version of the service, test it, and pass the packages to Site Reliability. SRE then deploys the new service version into production alongside the current service. Once complete, traffic is flipped to the new service version and monitored closely for errors or performance anomalies. If we need to roll back changes, the versions are flipped back and we're running the previous version of the code within a few minutes. Once the new service version has bedded in for a few hours, the old versions are recycled (the flip is sketched below). Between all the system and service deployments, this equates to an entirely new PingOne infrastructure roughly every 3 weeks. The front end of the system gets redeployed every 7 days. Instances usually aren't around long enough to fail or fill up a disk. It's like getting a fleet of fresh machines all the time.
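To illustrate the per-service sizing from the list above, the deployment automation can be thought of as consuming a map like the following. The service names, instance sizes, and counts are invented for the example:

    # Hypothetical per-service scaling definition consumed by deployment automation.
    # Changing one service's "shirt size" touches only its own entry, not the stack.
    SCALING_GROUPS = {
        "web":     {"instance_size": "large",  "min": 4, "max": 12},
        "tokens":  {"instance_size": "medium", "min": 2, "max": 6},
        "logging": {"instance_size": "xlarge", "min": 2, "max": 4},
        "msg":     {"instance_size": "small",  "min": 2, "max": 4},
    }

    def resize(service, new_size):
        # Bump a single service's instance size; every other service is untouched.
        SCALING_GROUPS[service]["instance_size"] = new_size
        return SCALING_GROUPS[service]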
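And here is a minimal sketch of the version flip itself. The deploy, traffic, and health-check helpers stand in for our internal tooling and load balancer automation; they are assumptions, not real APIs:

    def deploy_new_version(service, version, deploy, healthy, point_traffic_at, retire):
        # Blue/green-style flip: stand the new version up next to the old one,
        # move traffic, and keep the old stack around until we're confident.
        # `deploy`, `healthy`, `point_traffic_at`, and `retire` are hypothetical
        # hooks into deployment automation and the load balancer tier.
        old_stack = service + "-current"
        new_stack = deploy(service, version)   # spun up alongside the running stack
        if not healthy(new_stack):
            retire(new_stack)                  # never took traffic; just discard it
            raise RuntimeError("new version failed pre-flight checks")
        point_traffic_at(new_stack)            # the flip: production now runs the new code
        if not healthy(new_stack):
            point_traffic_at(old_stack)        # roll back within minutes by flipping back
            retire(new_stack)
            return old_stack
        return new_stack                       # old stack is recycled after it beds in

The important property is that the old stack stays untouched until the new one has proven itself, which is what makes the few-minute rollback possible.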
Since going live, the SRE Team has continued to improve the speed, process, and functionality of PingOne's infrastructure. We've gone through quite a few iterations of monitoring, learning from past mistakes and filling gaps where we see them. Deployment of instances and code has become far faster and more feature-packed as our internal tools get more and more sophisticated. We have grown the system into a true AWS / VMware hybrid cloud while retaining our fundamental design requirement of active/active autonomous failover. We will dive deeper into these challenges, and we hope to provide a transparent and holistic view of the design and philosophy we use to keep our customers coming back to us time and time again.
Do things the right way, even if it’s the hard way.