
Our Long-lasting Experiment, or How We Deployed Dell EMC ScaleIO in CROC’s Cloud

26.06.2018

In this post, I’ll tell you how we deployed ScaleIO, walking blindfolded across a minefield. I’ll also tell you about the storage architectural specifics, its integration with our cloud, and, of course, load testing.

For two years, maybe even longer, we have been mercilessly using a large test installation running the Dell EMC ScaleIO storage system. After many changes, we finally managed to tailor it to our cloud infrastructure. In our case, this storage occupies a sought-after niche between conventional slow HDD-based storage and a fast all-flash solution. Moreover, its Software Defined nature lets you build fault-tolerant storage from whatever hardware you have at hand. Still, don't skimp too much on the hardware, since the license cost will outweigh your savings.

Almost as good as AWS

As I mentioned, CROC’s Cloud offers two types of storage: standard and all-flash. The first is built from conventional HDDs, with storage system LUNs presented to Cloud servers and assembled into a cluster file system. The second is based on all-flash arrays (aka «violins»), with their LUNs presented to Cloud servers running the hypervisor and forwarded directly to VMs for maximum performance.


During operation, we ran into trouble with both. When there are many LUNs (or, to be exact, many of their exports to servers), the violins respond to API requests with much higher latency or even freeze (our worst case), and LUN presentation time also increases dramatically. As the load grows, especially during scaling attempts, the HDD-based cluster file system becomes less stable. We also faced hardware problems.


Both storage types are built in a rather conservative manner and have a minimal number of abstraction layers. This makes their maintenance far more complicated than that of typical «cloud» solutions, not to mention certain scalability and fault tolerance issues. That is why we opted for Software Defined Storage (SDS), where most of these issues are inherently resolved by the architecture.


We compared many commercially available SDS solutions, including the Dell EMC product, Ceph (RBD), GlusterFS, MooseFS, LizardFS, etc. In addition to functionality, we looked at SDS performance on HDDs, and ScaleIO came out ahead in both sequential and random IO. In his posts, our colleague Roman (@RPOkruchin) described some interesting ScaleIO-based solutions: Do not throw away your old servers — just spend an hour of your time to build a fast Ethernet storage system from them and How to make a fault-tolerant storage system from Russian servers.


Jumping ahead, I can say that, following the performance tests, we decided to position the new storage type as a faster counterpart of the AWS «Throughput Optimized HDD (st1)» type, with a smaller minimum disk size (32 GB) and boot support. We called it «universal» (st2) for its flexibility.

Architecture

First, let’s talk about Dell EMC ScaleIO cluster components.


  • Meta Data Manager (MDM) manages the ScaleIO cluster, stores its entire configuration, and handles all tasks related to managing other ScaleIO components. It can run on a single server or on a management cluster of three or five servers. The MDM cluster also monitors the entire system: it detects malfunctions and errors, drives cluster rebuilds and rebalancing, and ships logs to a remote server;
  • Tie Breaker MDM (TB) participates in MDM cluster voting to maintain quorum and does not store data;
  • ScaleIO Data Server (SDS) is a service that aggregates a server’s local disks into ScaleIO Storage Pools, handles all client IO operations, and performs data replication and other background operations;
  • ScaleIO Data Client (SDC) is a device driver that exposes ScaleIO disks as block devices. In Linux, it is implemented as a kernel module;
  • Gateway (GW) is a service that receives REST API requests and forwards them to MDM; it is the main communication hub between the Cloud and the MDM cluster. These servers also host the Installation Manager (IM), whose web console is used for automatic cluster installation, updates, and analysis;
  • Light Installation Agent (LIA) must be installed on all cluster servers so that the IM can manage the entire cluster, automatically installing and updating packages.

In general, ScaleIO cluster built in our Cloud on each site looks like this:

cloud.jpg


The Cloud has a Storage Controller (SC) service that handles all VM disk operations: LUN creation on the various storage systems we use and export to the proper hypervisor, as well as expansion, deletion, and the entire device lifecycle. Given this architecture, it made sense to deploy the MDM cluster on these very servers. SC and MDM are installed on Dell EMC PowerEdge R510 servers running CentOS 7.2. In addition, two servers of this group host ScaleIO Gateways providing the REST API for cluster management. Both gateways are active, but the Cloud accesses them through a load balancer configured in Active-Backup mode.


For pre-production testing, we built a pool of 12 SDSs, each with four Seagate SAS HDDs (1.8 TB, 12 Gbps, 10K, 2.5″). We then added all SDSs to a single Protection Domain (since the servers are located in one data center) and a single Storage Pool. Afterwards, the SDSs were grouped into Fault Sets. A Fault Set is an interesting concept: servers are grouped into sets with a high probability of simultaneous failure, whatever the cause, from OS specifics to placement in the same rack in a spot exposed to a falling meteorite. Data is then mirrored so that no two copies end up in the same risk group.


When using Fault Sets, a portion of raw capacity must be allocated as Spare space. This is a reserve equal to the largest Fault Set, used for cluster rebuilds in case a disk, server, or entire Fault Set is lost. That is why we group SDSs by the rack their servers live in, so that data copies remain available even if all servers in one Fault Set are lost (for example, if an entire rack loses power), although the probability of such an event in our TIER III certified data center is extremely low. Total size of the pool under test:


  • 78.5 TB raw space
  • 19.6 TB (25%) spare space
  • 29.5 TB usable space (with a replication factor of 2)
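
The arithmetic behind these figures is straightforward; here is a minimal sketch in Python, assuming spare space is sized to cover the largest Fault Set and data is mirrored with a replication factor of 2:

# Rough capacity math for the test pool (assumption: spare space covers the
# largest Fault Set, and data is mirrored with a replication factor of 2).
RAW_TB = 78.5           # raw capacity of the 48-disk pool
SPARE_FRACTION = 0.25   # reserved for rebuilds after a disk/server/Fault Set loss
REPLICATION_FACTOR = 2

spare_tb = RAW_TB * SPARE_FRACTION                       # ~19.6 TB
usable_tb = (RAW_TB - spare_tb) / REPLICATION_FACTOR     # ~29.5 TB, up to rounding

print("spare:  %.1f TB" % spare_tb)
print("usable: %.1f TB" % usable_tb)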

All SDSs are hosted on nodes that already run a hypervisor (NC) and an SDC, which is how the system ends up converged. Co-locating SDS with the hypervisor was a conscious decision: during initial tests we monitored the load on the servers hosting the disks and tried to detect any significant impact on CPU, RAM, or OS utilization, and found none, which was a pleasant surprise. Dedicating servers to disks alone would have wasted idle CPUs, and the number of SDSs was going to grow over time anyway, so the decision made itself.


Accordingly, the clients (SDCs) are the hypervisor servers that provision ScaleIO disks to VMs as block devices. SDC and SDS run on Dell EMC PowerEdge R720/R740 servers with CentOS 7.2. As for the network, data transfers use 56 Gbps InfiniBand (IPoIB), while management traffic uses 1 Gbps Ethernet.


After assembling the entire cluster and making sure it worked, we proceeded with testing.

Initial tests and analysis

We conducted the initial performance tests without a hypervisor or VMs: volumes were mounted on empty servers and mercilessly loaded to find the IOPS and Mbps ceilings. We also measured maximum latency under various conditions: empty cluster, fully utilized cluster, and so on. After getting inconsistent results, we realized that HDD performance simply isn’t stable, especially under maximum load. Still, we managed to roughly estimate the cluster’s performance ceiling: for a test cluster of 48 SAS HDDs (1.8 TB, 12 Gbps, 10K) with 78.5 TB of raw space, 8,000 to 9,000 IOPS was the maximum achieved in the heaviest tests (test methods and other details are given later). We subsequently used these values as a reference.
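
We drove the volumes with fio. The exact job files are beyond the scope of this post, but a minimal sketch of the kind of wrapper we use to collect IOPS numbers could look like this (Python; the job parameters and the /dev/scinia device path are illustrative, not our exact methodology):

import json
import subprocess

def run_fio(device, rw="randrw", bs="32k", iodepth=16, runtime=60):
    """Run a single fio job against a raw block device and return (read, write) IOPS."""
    cmd = [
        "fio", "--name=bench", "--filename=" + device,
        "--ioengine=libaio", "--direct=1", "--rw=" + rw, "--bs=" + bs,
        "--iodepth=" + str(iodepth), "--runtime=" + str(runtime),
        "--time_based", "--output-format=json",
    ]
    result = json.loads(subprocess.run(cmd, check=True, capture_output=True).stdout)
    job = result["jobs"][0]
    return job["read"]["iops"], job["write"]["iops"]

read_iops, write_iops = run_fio("/dev/scinia")
print("read: %.0f IOPS, write: %.0f IOPS" % (read_iops, write_iops))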


We also analyzed load on production storage systems and virtual disks that we planned to migrate to ScaleIO. For two sites, we obtained the following average 3-month results:


Site (Availability Zone) | Operation | IOPS  | Throughput, Mbps | Design block size, KB
ru-msk-vol51             | Read      | 2,480 | 86.5             | 36
ru-msk-vol51             | Write     | 3,000 | 67.5             | 23
ru-msk-comp1p            | Read      | 2,250 | 106.5            | 48
ru-msk-comp1p            | Write     | 3,020 | 61               | 21


On the one hand, writes are more frequent than reads (W/R = 55/45); on the other, reads use larger blocks and load the storage system 30-70% more. We then used these results in the VM tests to reproduce real-life server load as accurately as possible.

Deployment never goes smoothly

A cheap chamber of horrors: too dark and full of traps.


It was a challenge for the whole cloud team, but challenges are what you learn from best. Here are the most critical problems we faced.

API client to ScaleIO

When selecting a solution, we drew on our extensive experience with «violin» array APIs and decided to test each candidate’s API under parallel requests. Since most Cloud services, including SC, are written in Python, the API client for the chosen solution also had to be written in Python.


And here we hit a snag: the only ScaleIO API client at hand targeted v1.32, while we were already testing v2.0. So we had to put testing on hold for a while and write our own API client, pyscaleio. Besides basic CRUD operations, it supports a full-fledged ORM for certain API entities, response validation, and client instance management. More importantly, the entire code is covered by unit tests, and for the key API operations we wrote functional tests that run against a live ScaleIO cluster.
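
To give a feel for what talking to ScaleIO from Python involves, here is a minimal sketch of a Gateway REST client built on requests. Note that this is not the pyscaleio API itself, and the endpoint paths and token-based auth flow are our assumptions based on the v2.0 Gateway API:

import requests

class ScaleIOClient:
    """Bare-bones ScaleIO Gateway REST client (illustrative, not pyscaleio)."""

    def __init__(self, gateway, user, password, verify=False):
        self.base = "https://%s/api" % gateway
        self.session = requests.Session()
        self.session.verify = verify
        # The Gateway returns a token, which is then used as the password
        # for HTTP Basic auth on all subsequent requests.
        token = self.session.get(self.base + "/login", auth=(user, password)).json()
        self.session.auth = (user, token)

    def volumes(self):
        """List all volumes known to the MDM cluster."""
        response = self.session.get(self.base + "/types/Volume/instances")
        response.raise_for_status()
        return response.json()

client = ScaleIOClient("gateway.example.com", "admin", "secret")
for volume in client.volumes():
    print(volume["id"], volume["name"], volume["sizeInKb"])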


And, of course, we found some bugs in the API: for example, you could set an IO limit (throttling) for a disk but couldn’t unset it, because «iopsLimit (0) must be a number larger than 10». Since we did not plan to use throttling at the SDS level, this wasn’t critical for us. Overall, the ScaleIO API handled the load with flying colors: neither a large number of created disks nor parallel request execution caused any problems.

Throttling

During the initial load testing described above, we figured out the maximum capacity of the ScaleIO cluster and then had to decide how to limit customer disks and how to position the new storage type. After analyzing the results, we settled on 500 IOPS per disk and a throughput limit that depends on disk size, which is exactly what AWS does for its «Throughput Optimized HDD». The limit is automatically recalculated when a disk is expanded.


For example, here are the figures for a 104 GB disk (ScaleIO disk size must be a multiple of 8 GB):



Size (GB) | MBPS Baseline (per GB) | MAX MBPS
104       | 0.25                   | 26



Based on these restrictions, a VM disk can use its full 500 IOPS until it starts writing blocks larger than 53 KB. For a 104 GB disk:

(MBPS Throughput / KB per IO) * 1024 = MAX IOPS

26 / 53 * 1024 = 502 IOPS

15.6 / 32 * 1024 = 500 IOPS (with 32 KB blocks, the 500 IOPS cap kicks in first, so the effective throughput drops to 15.6 MBPS)

26 / 64 * 1024 = 416 IOPS

26 / 128 * 1024 = 208 IOPS
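
In code, the same arithmetic looks like this (a small Python sketch; the 0.25 MBPS-per-GB baseline and the 500 IOPS cap are the values described above):

IOPS_LIMIT = 500
MBPS_PER_GB = 0.25   # MBPS Baseline per GB of disk size

def disk_limits(size_gb, block_kb):
    """Return (mbps_limit, effective_iops) for a given disk size and IO block size."""
    mbps_limit = size_gb * MBPS_PER_GB                  # e.g. 104 GB -> 26 MBPS
    iops_by_throughput = mbps_limit / block_kb * 1024   # (MBPS / KB per IO) * 1024
    return mbps_limit, min(IOPS_LIMIT, iops_by_throughput)

for block_kb in (32, 53, 64, 128):
    mbps, iops = disk_limits(104, block_kb)
    print("104 GB disk, %3d KB blocks: %d MBPS cap, ~%d IOPS" % (block_kb, mbps, iops))
# With 32 KB blocks the 500 IOPS cap kicks in first (effective throughput ~15.6 MBPS);
# with 64 KB blocks the 26 MBPS cap limits the disk to ~416 IOPS.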


If you need higher performance, you should expand your disk; if you need high performance on small disks, you have to choose the all-flash storage type with guaranteed (though adjustable) IOPS. Luckily, the Cloud supports live migration between storage types, which makes life much easier.

Thin disks and empty space

It happened right when we were testing live migration from other storage types to our experimental ScaleIO: we discovered that a disk created as thin became thick and fully allocated, even though the source disk was barely half-filled with data. Inefficient resource utilization never makes anybody happy, so we went looking for the cause of the leak.


Since the Cloud uses QEMU-KVM as its hypervisor, we could dig into the problem at a low level. Instead of actually writing zeros to the device, the block device driver inside the hypervisor issues the BLKDISCARD ioctl and deallocates the corresponding space on the disk. However, for this to be safe, the block device must also support BLKDISCARDZEROES; otherwise, reading blocks handled by BLKDISCARD may return non-zero data.


ScaleIO disks turned out to lack BLKDISCARDZEROES support, yet when we applied BLKDISCARD to an entire ScaleIO disk, space was deallocated correctly. We concluded that either ScaleIO implemented the BLKDISCARDZEROES semantics in a very peculiar way or the flag was simply set incorrectly. We went to the Dell EMC forum and learned that the BLKDISCARDZEROES semantics is not fully supported and that, to have zeros written effectively, BLKDISCARD requests should be issued in multiples of 1 MB.
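
You can check what a block device actually promises by looking at the discard attributes the kernel exposes in sysfs; here is a rough Python sketch (the sysfs attribute names are standard Linux ones, and the scinia device name is just an example of an SDC-mapped volume):

import os

def discard_info(dev="scinia"):
    """Read the discard-related queue attributes of a block device from sysfs."""
    queue = "/sys/block/%s/queue" % dev
    read = lambda name: open(os.path.join(queue, name)).read().strip()
    return {
        "discard_granularity": int(read("discard_granularity")),   # bytes
        "discard_max_bytes": int(read("discard_max_bytes")),
        # sysfs counterpart of BLKDISCARDZEROES: 1 means discarded blocks
        # are guaranteed to read back as zeros
        "discard_zeroes_data": int(read("discard_zeroes_data")),
    }

print(discard_info("scinia"))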


So we knew what to do; the only thing left was to teach the hypervisor this peculiar semantics of effective zero-writing. We modified the QEMU-KVM block device driver to bring its handling of ScaleIO disks and zero deallocation closer to what it does for XFS. We adjust the hypervisor rather often, but it is mostly backports from newer versions or upstream. This time, though, these were our own hard-won changes, which makes us happy. Cluster space is now used efficiently during live migration.

Really empty?

When re-testing live migration with the improved hypervisor, we noticed one more issue: source and target disks differed after migration, and they differed randomly, sometimes in the first blocks, sometimes somewhere in the middle. We had to recheck our recent hypervisor driver changes, the integrity of the source disk image, and the use of the page cache during writes. Isolating the problem was hard, but after several hours of eliminating irrelevant factors one by one, we succeeded. We created a new thin disk, made sure it read back zeros at the beginning, wrote a known pattern there, deallocated that region with BLKDISCARD, read the disk again, and saw the very pattern we had written!
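
The whole check fits in a few lines. Here is a rough reproducer of what we did, assuming the stock Linux ioctl numbers and a hypothetical /dev/scinia test volume; it destroys data, so point it only at a disposable disk:

import fcntl
import os
import struct

BLKDISCARD = 0x1277   # _IO(0x12, 119) from <linux/fs.h>
BLKFLSBUF = 0x1261    # _IO(0x12, 97): drop the device buffer cache before re-reading
MIB = 1 << 20         # ScaleIO expects discard requests in 1 MB multiples

fd = os.open("/dev/scinia", os.O_RDWR)
try:
    pattern = b"\xde\xad\xbe\xef" * (MIB // 4)
    os.pwrite(fd, pattern, 0)                                  # write a known pattern
    os.fsync(fd)
    fcntl.ioctl(fd, BLKDISCARD, struct.pack("QQ", 0, MIB))     # deallocate the first 1 MB
    fcntl.ioctl(fd, BLKFLSBUF)                                 # force a re-read from the backend
    data = os.pread(fd, len(pattern), 0)
    print("zeros after discard" if data.count(0) == len(data) else "pattern survived!")
finally:
    os.close(fd)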


This means that the space is deallocated, but the data is left intact. Although it looked very much like a bug, we didn’t panic and carefully read the documentation instead. It turned out that a ScaleIO Storage Pool has a «Zero padding policy» option responsible both for filling sectors with zeros on the first write and for such zeros being read back later. By default, this option is disabled, and it can only be enabled while no physical disks have been added to the pool. Since our ScaleIO had already been assembled in the production environment and intensively tested, the pool had to be reassembled.


No doubt, the extra zero writes performed on the first write to an unallocated block were bound to affect performance. To estimate the impact, we repeated some of the load tests and observed up to 15% degradation on first writes, with SDS load unchanged.


So our own driver changes helped us find our own mistake in the cluster assembly. After reassembling the pool and enabling the option, everything worked perfectly.

Device naming

As I said earlier, each server with an SDS is also a hypervisor hosting many VMs. This means that the total number of block devices on a server can be quite large, and it changes frequently as VMs are migrated and disks are attached and detached.


Among all these disks, special attention should be paid to the devices handed over to ScaleIO. If you add such a device by its short name (for example, /dev/sdx) and the disk ends up enumerated after some temporary devices (such as VM disks), then after a reboot the name will change and ScaleIO will lose the disk. As a result, you have to add it to the pool again, with a rebalance required after every maintenance. Naturally, we couldn’t be happy with that.


We had assembled the ScaleIO pool and run the first tests on empty servers without VMs, so we only realized all this when we started planning the main pool expansion and were waiting for a new batch of SAS drives to arrive.


How to avoid such a mess? Well, you can use symbolic links to a block device:




# udevadm info -q symlink /dev/sdx
disk/by-id/scsi-36d4ae5209bf3cc00225e154d1dafd64d
disk/by-id/wwn-0x6d4ae5209bf3cc00225e154d1dafd64d
disk/by-path/pci-0000:02:00.0-scsi-0:2:2:0

The Red Hat documentation states that accurate SCSI device identification requires a system-independent identifier, the WWID (World Wide Identifier). This is nothing new to us: we use such identifiers to present all-flash block devices. Our situation is trickier, though, because the physical disks sit behind a RAID controller that cannot operate in non-RAID mode (device pass-through, JBOD), so we have to assemble a RAID-0 out of each physical disk using MegaCli. Since reassembling such a RAID-0 produces a new SCSI device with a new WWID, relying on WWIDs makes no sense here.


This forced a trade-off: use the disk/by-path symlink, which identifies the physical disk by its location on the PCI bus. Since the disks had originally been added the wrong way, the admins on duty had to re-add each physical disk to the ScaleIO pool under its new path and then anxiously watch the cluster rebalancing progress bar for dozens of hours.
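
The bookkeeping this implies is simple: map the persistent by-path names to whatever kernel name a disk currently has. A small stdlib-only Python sketch (the partition filtering is illustrative):

import os

BY_PATH = "/dev/disk/by-path"

def stable_device_map():
    """Return {by-path symlink: /dev/sdX it currently resolves to}."""
    mapping = {}
    for name in sorted(os.listdir(BY_PATH)):
        if "-part" in name:          # skip partition links, keep whole disks
            continue
        link = os.path.join(BY_PATH, name)
        mapping[link] = os.path.realpath(link)
    return mapping

for path, dev in stable_device_map().items():
    print(path, "->", dev)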

Intra-VM performance testing

Since we planned to use ScaleIO disks as VM disks, we waited for our developers to ship a major release with all the improvements described above and then, armed with the results of the initial tests and the load analysis of the old storage system, began testing ScaleIO disks inside VMs.


We used the following configuration: 30 VMs running CentOS 7.2, each with a 128 GB ScaleIO disk attached as a second block device (so the 32 Mbps and 500 IOPS limits apply); one Zabbix server for monitoring and measurement (once per second); and one more VM continuously writing 4k blocks sequentially and alerting whenever its writes dropped below 500 IOPS, which would mean it was starved of storage resources. We then used the fio utility to run tests with identical parameters on a given number of VMs simultaneously while watching the ScaleIO GUI and the Zabbix server. fio start parameters:


fio --fallocate=keep --ioengine=libaio --direct=1 --buffered=0 --iodepth=16 --bs=64k,32k --name=test --rw=randrw --rwmixread=45 --loops=2 --filename=/dev/vdb


With 10 VMs generating load, the ScaleIO GUI shows that we closely reproduce the average load on the production storage:

Workload.jpg


Zabbix:

Zabbix.png

Each VM gets the IOPS and MB/s «promised» to it, while the high latency is explained by the queue depth used in the test. With iodepth=1, latency is 1.6 to 2 ms, which is quite good considering the virtualization layer. By the way, I strongly advise you to read a useful post about all this: Understanding IOPS, Latency and Storage Performance.


Looking at the graphs of the separate VM that watched for impact and performance sags, we saw no deviation from normal behavior during the entire test.

We then gradually increased the number of simultaneously writing VMs and hit the ceiling at about 9,000 IOPS and 800 Mbps, at which point the other VMs were still unaffected by sags. These values were close to those obtained when testing the cluster without VMs, so we considered them normal for the current number and type of disks.

Next, we explored how performance grows as the cluster expands. We disabled two SDSs and measured performance after re-enabling each node. It turned out that every SDS added to the cluster raises the IOPS/Mbps ceiling proportionally. So we are now waiting for new disks to increase our usable capacity five- or maybe six-fold, and then, armed with better performance and smart throttling, we will easily cope with the peak loads of the existing production storage.

We also tested fault tolerance: we pulled disks out of servers, shut down all servers within one Fault Set, and did other nasty things, short of cutting cables. Cluster performance held up and data remained accessible; we saw 10-15% IOPS sags for about a minute. The key to success was a proper rebuild/rebalance policy, which is set with the following commands:


scli --set_rebalance_policy
scli --set_rebuild_policy

We ended up limiting rebuild/rebalance IOPS and Mbps for each disk connected to an SDS.

Conclusion

The cluster is now in a trial run, with our internal users, rather than customers, acting as guinea pigs and trying to uncover as many bugs and problems as possible. Once the last minor bugs in the Cloud services are fixed, the new «Universal» storage type will be made available to all our customers.
