Cheap network storage system
By: Eddie Aronovich, School of Computer Science, Tel-Aviv University
July 2012
Abstract
This document describes the setup that we use for a cheap (less than $10K for 120TB) and fast (500MB/sec read, almost 350MB/sec write) storage system.
We assume the reader is familiar with the Backblaze JBOD and focus on configuring it for performance.
Motivation
We need large storage for research. This means very large storage systems where reliability is only moderately important, but price is a major issue.
We looked at many vendors and solutions, but the common issues were:
- The price was too high. Most storage vendors are really software suppliers, which means that you pay for the hardware and much more for the software.
Combined with the next point, you end up paying for features you do not need, since everything is sold as a bundle.
- Too many features. We needed a place to write data, mainly over NFS with some CIFS.
The other capabilities were not important to us.
- Fast I/O is expensive.
Our heavy processing is performed on a parallel computing cluster, and we write from many hosts (a few hundred) concurrently.
Supporting this is an expensive feature.
The solution that we implemented
We decided to go with the Backblaze design as a low-cost storage system.
During procurement (a long-lasting process in itself, at least in our institution), Backblaze v2.0 appeared, which is very similar to what we had planned, but we had already started with hardware that was a bit different.
We bought two cases and, due to the problems that we encountered, one runs with Sil cards and the other with FastTrak cards.
Our experience and measurements indicate that the Sil cards are faster and have better driver support, but we had some heating problems with the disks that were connected to the Sil cards.
The heating problem eventually vanished (without the patch mentioned later).
Hardware deviation
The hardware components that we used, compared to the original Backblaze configuration, are described below.
- Processor - we used an Intel Core i5.
- Motherboard - we used an Intel DH55HC motherboard, since the original one was no longer available.
- Port multiplier cards - we used FastTrak TX4650 cards. These caused us major problems, but once we solved them, the solution worked for the Sil cards as well.
One of the things we learned (which should have been obvious) is that fewer port multipliers per card gives better performance.
- 40 disks rather than 45 were used, arranged in 8 chunks of 5 disks each.
- Memory was increased to 16GB.
During the setup we noticed that fast I/O requires a lot of RAM.
Building the case
The case was assembled and wired as recommended by Backblaze.
Challenges
After we got all the hardware and assembled it, we installed Ubuntu Linux, but almost nothing worked. The problems we encountered were:
- The boot process was slow (almost 3 min).
- Not all the disks were recognized, and the set of recognized devices changed between reboots.
- Disks that were recognized got a different device filename on each boot.
- The read/write speed was awful. Building a RAID on 8 SATA III 3TB disks gave a read speed of ~40MB/sec, and writes were almost half of that.
- While writing large files (e.g. backup tar files), some of the disks in the RAID array stopped working.
We tried to debug the system, but it did not help.
Actually, it looked bad! We had invested almost $10K and nothing worked.
But it gave us a hell of a lot of motivation to make it work!
We also tried Openfiler and FreeNAS. Openfiler recognized all the disks, but when we created RAID arrays, some disks disappeared and it was not able to detect them even after a reboot.
FreeNAS could not work with the port multiplier cards that we had.
We tried to switch to a newer kernel, but did not succeed.
Solutions
In the end, we used Debian (we switched from Ubuntu to Debian for no particular reason) with mdadm as the RAID mechanism.
The disks were partitioned into 5 RAID arrays of 8 disks each.
Stripe n was built from the nth disk of each chunk.
In this way, each stripe uses the full bandwidth of all the port multipliers.
Since the port multipliers and the multiplier cards support only SATA II, each stripe is limited to SATA II speeds.
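As an illustration, one such stripe could be created with mdadm roughly like this (a minimal sketch: the RAID level and the device names are assumptions, since the actual names depend on how the controllers enumerate the disks in each chunk):
# mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sdb /dev/sdg /dev/sdl /dev/sdq /dev/sdv /dev/sdaa /dev/sdaf /dev/sdak
(each of the 8 member disks is taken from a different chunk, so a single stripe is spread across all the port multipliers)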
We made the following changes to the Linux kernel and to the mdadm configuration to make it work:
- Patch the kernel for Sil3726 port multipliers (details can be found in the libata patch "libata: Allow SOFT_RESET for Sil3726").
We applied only the patch for drivers/ata/libata-pmp.c, and it was enough.
- Change the stripe cache size to 32768:
echo "32768" > /sys/block/mdX/md/stripe_cache_size
(replace the X with the RAID id)
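This setting does not survive a reboot, so it has to be reapplied to every array at boot time. A minimal sketch (assuming the arrays are md0 through md4; on Debian the line can go into /etc/rc.local):
# for md in md0 md1 md2 md3 md4; do echo 32768 > /sys/block/$md/md/stripe_cache_size; done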
Some measurements
A summary of the IOzone filesystem benchmark can be seen here.
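A comparable run can be reproduced with something like the following (a sketch only; the mount point and file-size limit are assumptions, and it requires a filesystem mounted on the array):
# iozone -a -g 16g -i 0 -i 1 -f /mnt/md3/iozone.tmp -R -b iozone-results.xls
(-i 0 and -i 1 select the write and read tests; -R -b produce a spreadsheet-style summary)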
Working on a single RAID array
The measurements below were performed on one of the RAID arrays (stripes) while the rest were idle.
Reading using zcav after increasing the cache size:
# echo "32768" > /sys/block/md3/md/stripe_cache_size
# zcav -b 1024 /dev/md3
#block offset (GiB), MiB/s, time
0.00 308.82 3.316
1.00 497.39 2.059
2.00 506.10 2.023
3.00 495.45 2.067
4.00 505.45 2.026
5.00 501.43 2.042
6.00 496.08 2.064
7.00 504.34 2.030
8.00 499.66 2.049
9.00 494.67 2.070
10.00 503.68 2.033
11.00 498.78 2.053
12.00 494.46 2.071
Writing to a single RAID array while the other arrays are idle:
# date ; dd if=/dev/zero of=/dev/md0 bs=64M
Tue Jun 12 17:09:40 IDT 2012
30802+0 records in
30802+0 records out
2067087228928 bytes (2.1 TB) copied, 6006.19 s, 344 MB/s
Reading from that RAID array on a network station:
> time dd if=aaa-2012-07-01 of=/dev/null bs=64M
100+0 records in
100+0 records out
6710886400 bytes (6.7 GB) copied, 77.9076 s, 86.1 MB/s
0.000u 5.008s 1:18.02 6.4% 0+0k 13107944+8io 1pf+0w
So we have cheap storage that can read locally at ~500MB/sec and write locally at 344MB/sec.
Writing from a regular network node to a single RAID array:
> time dd if=/dev/zero of=aaa-2012-07-03 bs=64M count=100
100+0 records in
100+0 records out
6710886400 bytes (6.7 GB) copied, 97.0069 s, 69.2 MB/s
0.000u 9.196s 1:37.06 9.4% 0+0k 0+13107208io 0pf+0w
> time dd if=/dev/zero of=aaa bs=64M count=100
100+0 records in
100+0 records out
6710886400 bytes (6.7 GB) copied, 77.2204 s, 86.9 MB/s
0.000u 9.336s 1:17.29 12.0% 0+0k 0+13107208io 0pf+0w
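Network clients access this storage mainly over NFS. A minimal sketch of such a setup (the export path, network range, hostname and mount point here are assumptions, not our exact configuration):
# cat /etc/exports
/export/md0 192.168.0.0/24(rw,async,no_subtree_check)
# exportfs -ra
and on the client:
# mount -t nfs storage-host:/export/md0 /mnt/storage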
Writing speed from the same station to a Network Appliance system (that was not idle):
> time dd if=/dev/zero of=aaa bs=64M count=50
50+0 records in
50+0 records out
3355443200 bytes (3.4 GB) copied, 59.4391 s, 56.5 MB/s
0.000u 4.252s 0:59.51 7.1% 0+0k 0+6553608io 0pf+0w
Using hdparm:
# hdparm -tT /dev/md3
/dev/md3:
Timing cached reads: 11042 MB in 2.00 seconds = 5523.24 MB/sec
Timing buffered disk reads: 1110 MB in 3.03 seconds = 366.81 MB/sec
#
What's next?
If you can support an SSD-based JBOD, please contact me!
Thanks!
First, I would like to acknowledge Prof. Ronitt Rubinfeld for her generous support of this project.
The team that made this project happen includes:
- Avi Shtibel - our devoted technician, who spent many hours building the cases and performing endless debugging tests.
- Leon Ankonina - our Linux super sysadmin, who made major debugging efforts and supported Avi while building the cases.
- Diana Yalin - our procurement superwoman, who dealt with all the bureaucracy of bringing the hardware from around the world.