How we started building our cloud in 2009 and what pitfalls we fell into
In October 2009, we checked everything once again. We were about to build an 800-rack data center. Our decision was backed by intuition, local market forecasts, and U.S. market analytics. It sounded logical enough, but we were still a bit nervous: cloud computing, or cloud hosting, was a new thing for the Russian market at the time.
Even the term itself was not yet a buzzword. However, we saw that such cloud installations were in demand in the U.S., and, having already built large-scale 500-node HPC clusters for aircraft design companies, we believed a cloud to be the same kind of big computing cluster.
In 2009, we never imagined clouds would be used for anything except distributed computing, and that was the trap we fell into. CPU time was what everyone needed, we thought, and so we started building an architecture just as we used to when constructing HPC clusters for R&D centers.
Do you know the main difference between such a cluster and a modern cloud infrastructure? Very few disk accesses, and mostly sequential reads. Each job is segmented, and every machine does its own part. Back then, who could have known that the disk subsystem load in an HPC cluster differs dramatically from that in a cloud: sequential reads and writes versus totally random operations. And it was not the only challenge we had to face.
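The difference is easy to feel with a toy benchmark. The sketch below (pure Python, purely illustrative, not something we ran back then) reads the same file block by block in sequential and in shuffled order. On a spinning disk the shuffled pass is dramatically slower because of seek time; on a small, cached file like this one it mainly illustrates the two access patterns rather than the real penalty.

```python
import os
import random
import tempfile
import time

BLOCK = 4096   # 4 KiB blocks, a typical random-I/O request size
BLOCKS = 2048  # 8 MiB test file, small on purpose

def make_test_file(path):
    with open(path, "wb") as f:
        f.write(os.urandom(BLOCK * BLOCKS))

def read_blocks(path, offsets):
    """Read one block at each offset; return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total

path = os.path.join(tempfile.mkdtemp(), "testfile")
make_test_file(path)

sequential = [i * BLOCK for i in range((BLOCKS))]  # HPC-style access
shuffled = sequential[:]                           # cloud-style access
random.shuffle(shuffled)

t0 = time.perf_counter(); read_blocks(path, sequential); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); read_blocks(path, shuffled); t_rnd = time.perf_counter() - t0
print(f"sequential: {t_seq:.4f}s, random: {t_rnd:.4f}s")
```

In a cloud, hundreds of independent VMs generate exactly the shuffled pattern, all at once.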
The first critical choice was between InfiniBand and Ethernet for the network inside the main cloud site. We spent a long time comparing them and finally opted for InfiniBand. Ask me why. First, again, we believed a cloud to be yet another HPC cluster; second, 10 Gbps connections were the universal building blocks for any networking project in those days. InfiniBand promised marvelous speed, simpler support, and cheaper network operations.
The first cloud platform, released in 2010, was powered by 10G InfiniBand. At the time, we started using the world's first SDN solution, originally made by Nicira, later acquired by VMware for heaps of money and renamed VMware NSX. We were learning to build clouds by trial and error, just as Nicira was building SDN. Sure, there were pitfalls, even a few crashes. The NICs of the day failed after long use, driving us truly mad. Shit happens, you know. For a while, after yet another major update from Nicira, our operations team was practically on antidepressants. However, by the time 56G InfiniBand was released, we had fixed some of the bugs together with Nicira's engineers, the storm seemed almost over, and everybody relaxed.
If we were designing a cloud today, we would bet on Ethernet, I guess, simply because that is apparently how the architecture was meant to evolve. Back then, however, it was InfiniBand that gave us great advantages to leverage a bit later.
The first rise came in 2011-2012, bringing two types of customers. "We want it Amazon-like, but cheaper and in Russia," some demanded. "Show us some magic," asked the others. Since every ad back then glorified the cloud as a magic wand for infrastructure uptime, misunderstandings between us and customers happened now and then. Very soon, cloud market players got a kick-in-the-butt boomerang from large customers accustomed to near-zero downtime of physical infrastructure, where a failed server was instantly followed by a department director being taken to task. Due to the virtualization layer and the orchestration on top of it, clouds are a bit less stable than physical servers.

Dealing with VM failures was a nightmare, as the cloud of the day was all about manual setup, with neither automation nor cluster solutions to improve the situation. Amazon said that anything could break in the cloud, but that wasn't what customers wanted to hear. They believed in the magic of the cloud: ever-continuous performance, with VMs migrating between data centers all by themselves... All the recently onboarded customers ran a single server instance per VM. It was really far from IT maturity: no automation, everything done once and manually, followed by an "if it ain't broke, don't fix it" approach. That's why a physical host reboot always led to a manual restore of all VMs, something our helpdesk did for customers as well and, actually, one of the first things we learned to do internally.
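We no longer have those early restore scripts, but the idea behind automating post-reboot recovery is a simple reconciliation loop: compare the list of VMs that should be running with what the hypervisor actually reports, and boot the difference. A minimal sketch, with a hypothetical `start_vm` callback standing in for the real hypervisor API:

```python
def reconcile(desired, running, start_vm):
    """Restart every VM that should be running but is not.

    desired  - set of VM names that must be up (the customer's config)
    running  - set of VM names currently reported by the hypervisor
    start_vm - callback that boots one VM; returns True on success
    """
    restored = []
    for name in sorted(desired - running):  # deterministic restart order
        if start_vm(name):
            restored.append(name)
    return restored
```

Run after every host reboot (or on a timer), this replaces the helpdesk engineer clicking through VMs one by one.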
Who jumped into the cloud? Oh, customers of all sorts and types. Distributed online stores were the first to knock on the cloud's door, followed by companies migrating their business-critical services built on proper architecture. Many saw the cloud as a failover site, something like a backup data center. Later, they made the cloud their main site, keeping the other one as a backup. Most customers who opted for such an architecture back then are still doing very fine. A prudently configured migration scheme was something we were proud of; it was really cool to see how, after a major crash in Moscow, a customer's services automatically migrated and went live on the backup site.
Disks and flash
At first, we boomed, growing even faster than we had expected when designing the architecture. Quick as we were with hardware procurement, we suddenly hit disk limits. At the time, we were about to lay the cornerstone of our third data center: our second cloud-oriented one, the future Kompressor, to be certified by the Uptime Institute for Tier III compliance.
In 2014, we hooked some truly large customers, and it turned out to be too much for our storage systems. When you serve seven banks, five retail chains, a travel agency, and an R&D center doing geological exploration, they can all hit peak loads at the same time. Rather unexpectedly, of course.
The typical storage architecture of the day had no per-user write speed quotas: all read and write jobs were processed by the storage in FIFO order. And then came a black day, Black Friday actually, the sales peak when retailers conquered the entire write capacity, slowing the storage down almost 30-fold for all other users. A healthcare center's website went down, taking 15 minutes to open a single page. We had to save the world ASAP.
Even high-performance disk arrays that usually cost a fortune didn't support job priority management, meaning customers could still impact each other. Our challenge was to either rewrite a hypervisor driver or come up with something else, and fast.
We finally fixed it by switching to all-flash arrays with almost one million IOPS of total throughput, roughly 100,000 IOPS per volume. Capacity-wise it was more than enough, but we still had to limit reads and writes somehow, a challenge impossible to address at the disk array level at the time (late 2014). Our cloud platform was based on open-source KVM, so we could dive deep into the source code. It took us some nine months to carefully rewrite the code and test the functionality.
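We can't show the KVM patches themselves, but the classic technique for capping per-volume IOPS is a token bucket: tokens refill at the guaranteed rate, each I/O request spends one, and a request without a token waits. A minimal sketch (class name and numbers are illustrative, not our production code):

```python
class IopsLimiter:
    """Token bucket: allows up to `rate` I/O ops per second,
    with short bursts of up to `burst` ops."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)    # sustained ops/second
        self.burst = float(burst)  # bucket capacity
        self.tokens = float(burst)
        self.last = now

    def allow(self, now, ops=1):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= ops:
            self.tokens -= ops
            return True   # the I/O request may proceed
        return False      # the request must be queued or delayed
```

Today libvirt exposes comparable per-disk throttling out of the box (e.g. `virsh blkdeviotune`); in late 2014 we had to build it into the stack ourselves.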
At that moment, the combination of InfiniBand and all-flash technology let us pioneer a never-before-seen service on the Russian market: guaranteed disk performance, with penalties set out in the strictest SLA. It was jaw-dropping news for our competitors. We offered 100,000 IOPS per volume. "It's not possible," other providers said. "Well, it's all guaranteed," we responded. "No way! You guys are going nuts!" It was a complete shock for the market. With such leverage, we won eight large contracts out of ten. We felt like kings of the world in those days.
Sixteen arrays, 1M IOPS and 40 TB each! All directly connected to the servers via InfiniBand. No one thought anything could go wrong, but it did, in the least expected way. Six months of testing had passed with flying colors.
The point is that when an InfiniBand array controller failed, rerouting took some 30 seconds. It could be reduced to 15 seconds, but no further, due to underlying protocol limitations. It turned out that once a certain number of volumes had been created (by customers themselves), the all-flash storage controller hit a rare heisenbug: asked to create a new volume, it could go crazy, soak up 100% of the load, go into thermal shutdown, and trigger that very 15-second rerouting. VMs lost their volumes. Nice. Together with the storage vendor, we hunted the bug for several months and ultimately got it fixed, with the vendor having to update the array controller microcode. Meanwhile, we coded a fat middleware layer to work around the issue and had to recode almost the entire control stack.
Our support engineers still keep array-related demotivational posters on the walls.