Disk performance

I’m doing a small side project that requires me to randomly query/update about 1TB of data. Managing random access to this amount of data is non-trivial.

Attempt 1, or the problem

So I took a 2TB hard disk out of my drawer and tried to use it.

Lessons learned:

Lesson 1: Don’t use BTRFS for data volumes

Don’t use BTRFS for data volumes (at least not without serious tweaking: for starters you need to disable COW for data volumes).
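If you do want to keep data on BTRFS, my understanding is that COW can be disabled either per directory with chattr +C (it only affects files created afterwards) or for the whole mount with the nodatacow option. A rough sketch (paths and devices are just examples):

    # disable COW for everything created under this directory from now on
    mkdir /data/db
    chattr +C /data/db

    # or mount the whole filesystem without COW
    mount -o nodatacow /dev/sdb1 /data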

On the other hand, BTRFS is a totally cool fs for daily use, and has native support for compression and native out-of-band deduplication.

I decided to use XFS, which is a nice, tweakable file system.

Lesson 2: There are some kernel settings that require tweaking

The kernel is not really tuned for serious disk IO out of the box.

The following settings are especially critical:

  • /proc/sys/vm/dirty_ratio or /proc/sys/vm/dirty_bytes. Both control the maximal amount of dirty pages the system can hold. If a process dirties more pages than that, its writes are blocked until enough data has been written to disk to get back under the threshold.

    If some process writes a lot of data and then issues fsync, the system may become unresponsive. (This is what inserting a lot of data into an SQL database does.)

  • /proc/sys/vm/dirty_background_bytes or /proc/sys/vm/dirty_background_ratio

    If there are more dirty pages than this threshold, the system starts a background process that writes some of the data to disk.

The _ratio variables are older and are the ones mentioned everywhere. They basically mean: if more than N percent of RAM is used as dirty cache, flush.

The _bytes variables were added “recently” and are often not mentioned in tutorials. They are easier to understand and are a better fit for systems with a lot of RAM. They simply express the threshold in bytes.

Default values are ridiculously high on systems with a lot of RAM: by default the _ratio variant is used, with a setting of about 10 or 20 percent, which may mean that your system will flush gigabytes of data on fsync.

These values should be much lower than the defaults. You need to measure what works for you (this is one thing I learned much later).

The easiest way to set these variables is to add something like the following to rc.local.
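For example (the exact numbers below are made up and should be measured for your workload):

    # cap the dirty page cache at 256 MB, start background writeback at 64 MB
    echo 268435456 > /proc/sys/vm/dirty_bytes
    echo 67108864 > /proc/sys/vm/dirty_background_bytes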

Lesson 3: Set noatime on your data partition

noatime is a mount option that disables the last-access-time attribute on files. While this attribute might sometimes be useful, it turns every file read into a write (to update the attribute), which is a big no.

To do this, add noatime to the appropriate /etc/fstab entry:

/dev/mapper/..      /mount-point xfs defaults,noatime 0 0
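If the partition is already mounted, I believe you can apply the option without rebooting by remounting:

    mount -o remount,noatime /mount-point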

Lesson 4: Read a book

I recommend: PostgreSQL High Performance. If you know any books on tuning IO devices I’d be happy to read them (so e-mail me)

Lesson 5: IOPS is an important parameter

IOPS stands for “IO Operations Per Second”, or “how many random reads your disk can make in a second”.

The nice thing about SSDs is that they mostly allow random access to data: it doesn’t matter much whether you read one big contiguous file or just random parts of it.

HDDs don’t have this property: if you read data sequentially you’ll get much better read speeds (orders of magnitude better).

So for HDDs, IOPS are the limiting factor in case of random access.

To see how many IOPS your disk is currently doing, install the sysstat Debian package and use iostat:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc             113.53

and tps is the column you need.
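By default iostat prints averages since boot; to watch current activity you can pass a device name and an interval in seconds, e.g.:

    iostat -d sdc 5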

Typical IOPS you can get from an HDD are 15-50 for 5400 RPM disks, and 50-100 for 7200 RPM disks (according to Wikipedia). iostat showed more IOPS for my 5400 RPM disk, which might be caused by the on-disk cache.

Problem

I just didn’t get enough IOPS.

Non solutions

Use cloud

The easiest solution would be to just do what everybody does: “put it in the cloud”. The problem is that it’s expensive, and I’m stingy. There will probably be no money from this project, so if I can scrape it together from parts that lie in my drawer, why spend money?

For now let’s ignore VM costs, and focus on the hardware:

  • On AWS I could buy 1TB of SSD storage for 119$ a month, or “throughput optimized HDD” for 50$. SSD would be way faster than my solution, and I believe the HDDs would be slower.
  • On Google Cloud both HDD and SSD would be faster than my solution, as Google’s HDDs are not that throughput oriented. SSDs cost 200$ and HDDs 50$.

If I add a VM to this, I end up at slightly over 200$ a month.

Note: I’m totally for using the cloud to deploy your services; managing hardware with a couple of nines of availability is not trivial, and well worth the price.

Just buy SSDs

Just buy SSD disks. This would work, but again: I’m stingy, and a 2 TB SSD disk is about 850$ (and that’s not one of my favourite makes).

If I were sure the total data would be less than 1TB I’d probably buy a 1TB SSD, but SSD prices skyrocket above 1 TB.

Attempt 2

I bought two 2TB 7200 RPM Western Digital hard disks. I’m a big fan of WD disks.

And I decided to just use them as some form of RAID-0. This way I get 4TB of total storage with decent speeds (but if either drive dies I lose my data).

Lesson 0: Raid is not a backup

This is not something I ran into just now, but I feel it can’t be stressed enough.

Note

I also dislike using hardware RAID controllers: there is no standard RAID on-disk format and each manufacturer may use their own, which means that when the controller dies your data might be hard to recover.

Lesson 1: Raid is everywhere

On Linux, at least, you can get RAID or RAID-like behaviour on many different levels:

  • RAID controller
  • Software RAID (see the mdadm sketch after this list)
  • LVM
  • Filesystems: it’s totally cool that you can have native RAID-X in BTRFS at the filesystem level.
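For comparison, plain Linux software RAID-0 over two disks is created with mdadm, roughly like this (device names are made up):

    # stripe two whole disks into one md device and put a filesystem on it
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0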

Lesson 2: LVM striping is cool

If you have more than a single disk in your LVM volume group, you can use striping, which works more or less like RAID-0 but with some advantages (see the sketch after this list):

  • It’s easy to set up.
  • You can easily create and delete volumes with and without striping.
  • You are not bound by the assumption that the volume size is dictated by the smallest of the physical volumes.
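A minimal sketch of setting up a striped volume, assuming two disks /dev/sdb and /dev/sdc (names and sizes are made up):

    # put both disks in one volume group
    pvcreate /dev/sdb /dev/sdc
    vgcreate data_vg /dev/sdb /dev/sdc

    # create a logical volume striped over 2 disks with a 64k stripe size
    lvcreate --stripes 2 --stripesize 64k --size 500G --name data_lv data_vg
    mkfs.xfs /dev/data_vg/data_lv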

The downside is that real RAID probably gives you better performance (at least that’s what I read on the Internet)!

Lesson 3: Stripe size matters

I created a striped volume on the two disks and got no performance increase whatsoever. Synthetic tests actually showed a performance decrease: yes, two 7200 RPM disks got slower than a single 5400 RPM disk.

Lesson 4: Measure performance

There are very nice performance tools. One I used was sysbench (it’s in the Debian repositories).

My initial assumption was that I had selected the wrong stripe size. A stripe is the amount of contiguous data that LVM puts on a single disk; the next stripe goes on the next disk, and so forth.

There is no “best” stripe size, but I would suggest a stripe larger than your typical random read, so that a typical random read doesn’t need to coordinate a couple of disks.

Lesson 5: How to measure performance

So far I have done a very simple benchmark:

  1. I created a volume with given stripe size
  2. I created an XFS filesystem with block size equal to the stripe size (not sure if this was a good idea)
  3. Prepared sysbench by: sysbench fileio --file-total-size=128GB prepare
  4. Ran sysbench fileio --file-total-size=128GB --file-test-mode=rndrw --file-fsync-freq=10 --threads=8 for a couple of block sizes (block size is how much data is read or written in a single batch); see the sketch below.
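A condensed sketch of steps 3 and 4 as a shell loop (the block sizes are just examples; sysbench’s --file-block-size option controls how much data each IO request moves):

    sysbench fileio --file-total-size=128GB prepare
    for bs in 16384 65536 262144; do
        sysbench fileio --file-total-size=128GB --file-test-mode=rndrw \
                 --file-fsync-freq=10 --file-block-size=$bs --threads=8 run
    done
    sysbench fileio --file-total-size=128GB cleanup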

If you can read Python, here is the script I used: :download:data/striped-benchmark.py.

My results were as follows:

  • For larger block sizes, bigger stripes gave me much more throughput.
  • For smaller block sizes, bigger stripes were not that important.

Ultimately I decided to use a 64KB stripe size.

Right now I get about 100 IOPS from each drive and more than 100MB/s of read and write throughput across both drives (in my specific workload!).

Which is probably enough for the time being, and I’m not sure I could get better performance from two disks that cost less than 300$ (or 1000 złotys).

If I do some more tweaking, I’ll let you know what I did and how.