
Glusterfs crash test. Our experience

23.04.2018

A year ago, a good fellow of ours, a colleague and enterprise storage guru, showed up and said: “Hi guys, I’ve got a cool 90 TB storage array with all the fancy features, you know.” We didn’t particularly need it, but refusing would have been rather stupid. So we configured a couple of backups to it and forgot about it.

From time to time, we used the device to move large files between hosts, to build WALs for PostgreSQL replicas, and so on. Eventually, we moved all the scattered stuff related to our project there, set up backup rotation, and configured notifications for both successful and failed backups. Over the year, this storage became one of the key infrastructure elements for our operations team.


Everything was good... Until the guru came to us again and said he wanted his thing back. Instantly. Urgently. Right now.


We had little choice. Next to none, in fact: either dump our stuff just anywhere or build our own storage from what we had at hand. By that time, wise old birds that we were, we had seen plenty of not-quite-fault-tolerant systems, so fault tolerance had become something of an obsession for us.


Among the many alternatives, we opted for Gluster. It just looked promising. So we began our crash test.

What is Gluster and why do we need it?

It is a distributed file system compatible with OpenStack and integrated into oVirt/RHEV. Although our IaaS is not based on OpenStack, Gluster has a large and active community and native QEMU support via the libgfapi interface. We thus killed two birds with one stone:


1. Getting a backup storage we could support ourselves without waiting for a vendor to supply a replacement part

2. Testing a new volume type so we could later offer it to our customers

We checked whether:

1. Gluster worked. — Yes, it really did.

2. Gluster was fault-tolerant: any node could be rebooted with the cluster still running and the data still available; even several nodes could be rebooted without data loss. — Proven.

3. Gluster was reliable, did not crash on its own, had no memory leaks, etc. — True to some extent. It took us a while to understand that the problem was not on our side but with the Striped volume type, which remained unstable in every configuration we built (see details below).


We spent about a month experimenting with different configurations and versions. Then we tried it in production as the second destination for technical backups. Before fully adopting it, we watched it closely for six months to make sure it worked well.

How did we do it?

We had enough hardware for experimenting: a rack with Dell PowerEdge R510 servers and a number of not-too-fast two-terabyte SATA disks left over from our legacy S3. A 20 TB storage seemed more than enough, and it took us about half an hour to install 10 disks in each of two old Dell PowerEdge R510s, add one more server as an arbiter, download the software, and deploy the whole thing. It looked like this:

[Cluster layout diagram]

We opted for a striped replicated volume with an arbiter as it was fast (data were evenly distributed across several bricks), reliable enough (Replica 2), and could survive one node crash without a split brain. Boy, were we wrong.


A major shortcoming of the then-current cluster configuration was a narrow (1G) channel, but that was not a problem for us, as this post is about resilience and disaster recovery rather than speed. We are planning to move to 56G InfiniBand with RDMA and test performance, but that is a story for another time.

I won’t dive deep into cluster creation, since it is quite simple:


Make directories for bricks:


for i in {0..9} ; do mkdir -p /export/brick$i ; done

Create xfs on disks for bricks:

for i in {b..k} ; do mkfs.xfs /dev/sd$i ; done

Add mount points to /etc/fstab:


/dev/sdb /export/brick0/ xfs defaults 0 0
/dev/sdc /export/brick1/ xfs defaults 0 0
/dev/sdd /export/brick2/ xfs defaults 0 0
/dev/sde /export/brick3/ xfs defaults 0 0
/dev/sdf /export/brick4/ xfs defaults 0 0
/dev/sdg /export/brick5/ xfs defaults 0 0
/dev/sdh /export/brick6/ xfs defaults 0 0
/dev/sdi /export/brick7/ xfs defaults 0 0
/dev/sdj /export/brick8/ xfs defaults 0 0
/dev/sdk /export/brick9/ xfs defaults 0 0
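
As a side note, we referred to disks by their /dev/sdX names. A more reboot-proof alternative, which we did not use here and show purely as a sketch, is to label each file system and mount by label:

mkfs.xfs -f -L brick0 /dev/sdb                  # assign a label when creating the file system
LABEL=brick0 /export/brick0 xfs defaults 0 0    # then refer to the label in /etc/fstab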

Mount:

mount -a

Add a directory for the volume (we called it holodilnik) to each brick:

for i in {0..9} ; do mkdir -p /export/brick$i/holodilnik ; done

Then make the cluster hosts peers and create a volume.

Install the software packages on all three hosts:

pdsh -w server[1-3] -- yum install glusterfs-server -y

Start Gluster:

systemctl enable glusterd
systemctl start glusterd
Remember that Gluster has several processes:

glusterd = management daemon.
The main daemon that manages the volume and the other daemons responsible for bricks and data recovery.

glusterfsd = per-brick daemon.
A separate glusterfsd daemon is started for each brick.

glustershd = self-heal daemon.
It rebuilds data on replicated volumes after cluster node failures.

glusterfs = usually the client-side process, but it also serves NFS on the servers.
It comes, for example, with the glusterfs-fuse native client package.

Make the nodes peers:

gluster peer probe server2
gluster peer probe server3
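To verify that peering succeeded and that the daemons described above are running, a quick check could look like this:

gluster peer status    # each peer should report "State: Peer in Cluster (Connected)"
gluster pool list
ps -ef | grep gluster  # only glusterd for now; glusterfsd and glustershd appear once a volume is created and started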
Assemble a volume (brick sequence is important here and replicated bricks must follow each other):

gluster volume create holodilnik stripe 10 replica 3 arbiter 1 transport tcp server1:/export/brick0/holodilnik server2:/export/brick0/holodilnik server3:/export/brick0/holodilnik server1:/export/brick1/holodilnik server2:/export/brick1/holodilnik server3:/export/brick1/holodilnik server1:/export/brick2/holodilnik server2:/export/brick2/holodilnik server3:/export/brick2/holodilnik server1:/export/brick3/holodilnik server2:/export/brick3/holodilnik server3:/export/brick3/holodilnik server1:/export/brick4/holodilnik server2:/export/brick4/holodilnik server3:/export/brick4/holodilnik server1:/export/brick5/holodilnik server2:/export/brick5/holodilnik server3:/export/brick5/holodilnik server1:/export/brick6/holodilnik server2:/export/brick6/holodilnik server3:/export/brick6/holodilnik server1:/export/brick7/holodilnik server2:/export/brick7/holodilnik server3:/export/brick7/holodilnik server1:/export/brick8/holodilnik server2:/export/brick8/holodilnik server3:/export/brick8/holodilnik server1:/export/brick9/holodilnik server2:/export/brick9/holodilnik server3:/export/brick9/holodilnik force
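One step not shown above: a newly created volume has to be started before it can be mounted, and it is worth checking its state afterwards:

gluster volume start holodilnik
gluster volume info holodilnik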
To get a stable Gluster, we had to try many parameter combinations, kernel versions (3.10.0, 4.5.4), and versions of glusterfs itself (3.8, 3.10, 3.13). Eventually, we arrived at the following parameter values:

gluster volume set holodilnik performance.write-behind on
gluster volume set holodilnik nfs.disable on
gluster volume set holodilnik cluster.lookup-optimize off
gluster volume set holodilnik performance.stat-prefetch off
gluster volume set holodilnik server.allow-insecure on
gluster volume set holodilnik storage.batch-fsync-delay-usec 0
gluster volume set holodilnik performance.client-io-threads off
gluster volume set holodilnik network.frame-timeout 60
gluster volume set holodilnik performance.quick-read on
gluster volume set holodilnik performance.flush-behind off
gluster volume set holodilnik performance.io-cache off
gluster volume set holodilnik performance.read-ahead off
gluster volume set holodilnik performance.cache-size 0
gluster volume set holodilnik performance.io-thread-count 64
gluster volume set holodilnik performance.high-prio-threads 64
gluster volume set holodilnik performance.normal-prio-threads 64
gluster volume set holodilnik network.ping-timeout 5
gluster volume set holodilnik server.event-threads 16
gluster volume set holodilnik client.event-threads 16
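To double-check which of these options actually took effect, the volume can be queried directly:

gluster volume info holodilnik        # lists the reconfigured options
gluster volume get holodilnik all     # full list of effective option values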
Additional useful parameters:

sysctl vm.swappiness=0
sysctl vm.vfs_cache_pressure=120
sysctl vm.dirty_ratio=5
echo "deadline" > /sys/block/sd[b-k]/queue/scheduler
echo "256" > /sys/block/sd[b-k]/queue/nr_requests
echo "16" > /proc/sys/vm/page-cluster
blockdev --setra 4096 /dev/sd[b-k]
These values suit our use case, which is all about backups, i.e. sequential operations; random reads and writes would call for different tuning.
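
Bear in mind that these are runtime settings and will not survive a reboot. A rough way to persist them, with arbitrary file names chosen just for this sketch, is a sysctl.d drop-in plus a udev rule:

cat > /etc/sysctl.d/90-storage.conf <<EOF
vm.swappiness = 0
vm.vfs_cache_pressure = 120
vm.dirty_ratio = 5
vm.page-cluster = 16
EOF
sysctl --system

cat > /etc/udev/rules.d/60-storage-disks.rules <<EOF
ACTION=="add|change", KERNEL=="sd[b-k]", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="256", ATTR{queue/read_ahead_kb}="2048"
EOF
udevadm control --reload && udevadm trigger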

Now let's talk about the pros and cons of the different Gluster mount types and the test failures we ran into.

We tested all basic volume mount options:



Gluster Native Client (glusterfs-fuse) with backupvolfile-server parameter

What’s bad:

  • Additional client software to be installed
  • Speed

Bad, but OK:

  • Long data inaccessibility if one cluster node fails. To mitigate this, use the network.ping-timeout parameter on the server side: with it set to 5, access to the shared folder is lost for only 5 seconds.

What’s good:

  • Rather stable operation, and few problems with corrupt files
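
For reference, a native-client mount with the backupvolfile-server option looks roughly like this (the mount point is just an example):

mount -t glusterfs -o backupvolfile-server=server2 server1:/holodilnik /mnt/holodilnik

or, as an /etc/fstab entry:

server1:/holodilnik /mnt/holodilnik glusterfs defaults,_netdev,backupvolfile-server=server2 0 0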

    Gluster Native Client (glusterfs-fuse) + VRRP (keepalived)

    We configured an IP migrating between two cluster nodes and shut down one of them.


    What’s bad:

    • Additional software to be installed

    What’s good:

    • Configurable failover timeout in case of a cluster node failure

    It turned out that specifying the backupvolfile-server parameter or setting up keepalived was unnecessary: the client connects to the Gluster daemon at whatever address it is given, obtains the addresses of the other nodes, and starts writing to all cluster nodes. We observed symmetric traffic from the client to server1 and server2. Even if you specify a VIP address, the client still uses the Glusterfs cluster addresses. The parameter is only useful when the client, at mount time, finds the first glusterfs server inaccessible and then connects to the host specified in backupvolfile-server.


    Comment from a white paper:

    The FUSE client allows the mount to happen with a GlusterFS «round robin» style connection. In /etc/fstab, the name of one node is used; however, internal mechanisms allow that node to fail, and the clients will roll over to other connected nodes in the trusted storage pool. The performance is slightly slower than the NFS method based on tests, but not drastically so. The gain is automatic HA client failover, which is typically worth the effect on performance.

    NFS-Ganesha server with Pacemaker

    Mount type recommended if, for any reason, you don’t want to use a native client


    What’s bad:

    NFSv3 and NLM + VRRP (keepalived)

    Classic NFS with lock support and IP migrating between two cluster nodes
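
    For context, such a mount would look roughly like this (10.0.0.100 stands for whichever VIP keepalived currently holds):

    mount -t nfs -o vers=3,proto=tcp 10.0.0.100:/holodilnik /mnt/holodilnik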


    What’s good:

    • Fast failover in case of a node failure
    • Simple setup of keepalived
    • nfs-utils is installed on all our client hosts by default

    What’s bad:

    • The NFS client hangs in D state after several minutes of copying data to the mount point with rsync
    • Complete crash of a node running a client — BUG: soft lockup — CPU stuck for Xs!
    • Many cases of files corrupted with "Stale file handle", "Directory not empty at rm -rf", "Remote I/O error" and other errors

    This option turned out to be the worst of all, and it is even deprecated in later GlusterFS versions.


    Finally, we chose glusterfs-fuse without keepalived but with the backupvolfile-server parameter, since it was the only stable option, despite its relatively low speed.

    In addition to configuring a highly available solution, for production use we must also be able to restore the service after a disaster. That is why, once we finally had a stable cluster, we proceeded with destructive tests:

    Cold reboot

    We started rsync'ing a large number of files from one client, switched one of the cluster nodes off hard, and got very interesting results. After the node crash, writing first stopped for 5 seconds (as defined by network.ping-timeout = 5), and then the speed of writing to the shared folder doubled: the client could no longer replicate the data and started sending all the traffic to the remaining node, still limited by our 1G channel.


    After the server reboot, glustershd daemon automatically ran data healing, and the speed sagged dramatically.

    You may view the number of files being healed after a node crash:

    
    [16:41]:[root@sl051 ~]# gluster volume heal holodilnik info
    ...
    Brick server2:/export/brick1/holodilnik
    /2018-01-20-weekly/billing.tar.gz
    Status: Connected
    Number of entries: 1

    Brick server2:/export/brick5/holodilnik
    /2018-01-27-weekly/billing.tar.gz
    Status: Connected
    Number of entries: 1

    Brick server3:/export/brick5/holodilnik
    /2018-01-27-weekly/billing.tar.gz
    Status: Connected
    Number of entries: 1
    ...

    After healing completed, the counters dropped to zero and the writing speed recovered.


    Disk failure and replacement

    Neither the failure nor the replacement of a disk carrying a brick slowed down writing to the shared folder. The bottleneck here was probably the channel between the nodes rather than the disk speed. As soon as we get additional InfiniBand cards, we will try a faster channel.

    Mind that the failed disk and its replacement must end up with the same name in sysfs (/dev/sdX). A new disk often gets the next free letter; do not leave it like that, otherwise after the next reboot the disk will get its old name back, block device names will shift, and the bricks will stop working. So a few extra steps are needed.

    Most likely, stale mount points of the old disk are still present in the system, so unmount it:

    umount /dev/sdX

    Also, check which processes may be holding the device:

    lsof | grep sdX

    And stop them.

    Then find the failed disk's SCSI address before rescanning. Check dmesg -H for details on its location:

    
    
    [Feb14 12:28] quiet_error: 29686 callbacks suppressed
    [ +0.000005] Buffer I/O error on device sdf, logical block 122060815
    [ +0.000042] lost page write due to I/O error on sdf
    [ +0.001007] blk_update_request: I/O error, dev sdf, sector 1952988564
    [ +0.000043] XFS (sdf): metadata I/O error: block 0x74683d94 ("xlog_iodone") error 5 numblks 64
    [ +0.000074] XFS (sdf): xfs_do_force_shutdown(0x2) called from line 1180 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa031bbbe
    [ +0.000026] XFS (sdf): Log I/O Error Detected. Shutting down filesystem
    [ +0.000029] XFS (sdf): Please umount the filesystem and rectify the problem(s)
    [ +0.000034] XFS (sdf): xfs_log_force: error -5 returned.
    [ +2.449233] XFS (sdf): xfs_log_force: error -5 returned.
    [ +4.106773] sd 0:2:5:0: [sdf] Synchronizing SCSI cache
    [ +25.997287] XFS (sdf): xfs_log_force: error -5 returned.
    
    

    Where sd 0:2:5:0 is:

    h == host adapter ID (0, the first one found)
    c == SCSI channel on the host adapter (2 here), which also corresponds to the PCI slot
    t == target ID (5 here), which is also the slot number of the failed disk
    l == LUN (0, the first one)
    
    

    Rescan:

    echo 1 > /sys/block/sdY/device/delete
    echo "2 5 0" > /sys/class/scsi_host/host0/scan

    where sdY is the wrong name that the replacement disk received.

    Then, for brick replacement, make a new directory for mounting, create a file system, and mount it:

    mkdir -p /export/newvol/brick
    mkfs.xfs /dev/sdf -f
    mount /dev/sdf /export/newvol/
    
    

    Replace the brick:

    gluster volume replace-brick holodilnik server1:/export/sdf/brick server1:/export/newvol/brick commit force
    
    

    Start healing:

    gluster volume heal holodilnik full
    gluster volume heal holodilnik info summary
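
    Healing progress can also be watched via the heal statistics, for example:

    watch -n 10 gluster volume heal holodilnik statistics heal-count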
    
    

    Arbiter failure:


    We saw the same 5-7 seconds of shared folder inaccessibility and a 3-second sag caused by metadata synchronization with the quorum node.


    Summary

    The destructive test results encouraged us to partially adopt it for production use, but it was a short-lived joy...


    Problem 1, which is also a known bug:

    When deleting a large number of files and directories (about 100,000), we got the following:

    rm -rf /mnt/holodilnik/*
    rm: cannot remove ‘backups/public’: Remote I/O error
    rm: cannot remove ‘backups/mongo/5919d69b46e0fb008d23778c/mc.ru-msk’: Directory not empty
    rm: cannot remove ‘billing/2018-02-02_before-update_0.10.0/mongodb/’: Stale file handle
    
    

    I’ve read about 30 similar user complaints posted since 2013. There is no solution to the problem.


    Red Hat recommends version update, but it was of no help to us.

    https://access.redhat.com/solutions/1264803

    Our workaround was to simply clear the leftovers of the broken directories from the bricks on all nodes:

    pdsh -w server[1-3] -- rm -rf /export/brick[0-9]/holodilnik/<failed_dir_path>

    And that was not the end of it: the worst was yet to come.


    Problem 2, the worst:

    We tried to unpack an archive containing many files into a shared folder on a Striped volume and got tar xvfz hanging in uninterruptible sleep (D state), curable only by rebooting the client node.
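
    A quick way to spot such hangs, by the way, is to list the processes stuck in uninterruptible sleep:

    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'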


    Once we realized we could not put up with this any longer, we turned to the only configuration we had not yet tried, a rather tricky one: erasure coding. The only difficult part was understanding its volume creation principles. Red Hat offers a good manual with examples:

    https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.1/html/administration_guide/chap-recommended-configuration_dispersed
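
    For reference, creating a dispersed volume of the 4 + 2 kind we ended up planning looks roughly like this (six servers with one brick each, and the volume name, are assumed purely for illustration):

    gluster volume create holodilnik_ec disperse 6 redundancy 2 transport tcp server1:/export/brick0/ec server2:/export/brick0/ec server3:/export/brick0/ec server4:/export/brick0/ec server5:/export/brick0/ec server6:/export/brick0/ec force
    gluster volume start holodilnik_ec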


    After running all the same destructive tests, we obtained the same encouraging results. We wrote and then deleted millions of files. All our attempts to break Dispersed volume failed. We observed higher CPU load, but it was not critical for us, not yet.


    Now it stores backups of our infrastructure segment and serves as a file depot for our internal needs. We want to spend some time using it and see how it behaves under various loads. So far it is clear that striped volumes work in a strange way, while the others behave very well. We also plan to build a 50 TB (4 + 2) dispersed volume on six servers with a fast InfiniBand interconnect, test its performance, and continue studying how it works.
