I’m doing a small side project that requires me to randomly query and update about 1TB of data. Managing random access to this much data is non-trivial.
Attempt 1 or the problem¶
So I took a 2TB hard disk out of my drawer and tried to use it.
Lesson 1: Don’t use BTRFS for data volumes¶
Don’t use BTRFS for data volumes (at least not without serious tweaking: for starters you need to disable COW for data volumes).
On the other hand, BTRFS is a totally cool fs for daily use, and has native support for compression and native out-of-band deduplication.
I decided to use xfs, which is a nice, tweakable file system.
Lesson 2: There are some kernel settings that require tweaking¶
The kernel is not really tuned for serious disk IO out of the box.
The following settings are especially critical:
/proc/sys/vm/dirty_bytes and its _ratio counterpart. Both control the maximum amount of dirty pages that the system can hold. If any process dirties more pages, writes are blocked until enough bytes have been written to disk to get back under the threshold.
If some process writes a lot of data and then issues fsync, the system may become unresponsive. (This is what inserting a lot of data into an SQL database does.)
If there are more dirty pages than this threshold, the system starts a background process to write some data to disk.
_ratio variables are older, and are mentioned everywhere. They basically mean: if more than N percent of RAM is used as dirty cache, then flush.
_bytes variables were added “recently” and are often not mentioned in
tutorials. They are easier to understand, and are a better fit for systems
with a lot of RAM. They just mean a threshold in bytes.
Default values are ridiculously high on systems with a lot of RAM, as by default
the _ratio variant is used with a setting of about 10 or 20, which may mean that
your system will flush gigabytes of data on
These values should be much lower than the defaults. You need to measure this (this is one thing I learned much later).
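To get a feel for what a _ratio setting means in bytes, here is a minimal Python sketch. It assumes the simplification used above (a percentage of total RAM; the kernel actually applies the percentage to the memory it considers dirtyable, which is somewhat less), and it assumes that /proc/sys/vm/dirty_ratio is the _ratio counterpart in question. The function names are made up for illustration.

```python
import os


def parse_mem_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # /proc/meminfo reports this value in kB
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def ratio_to_bytes(ratio_percent, ram_bytes):
    """Approximate dirty threshold in bytes for a given _ratio value.

    Assumption: the percentage is applied to total RAM, which slightly
    overestimates what the kernel really uses.
    """
    return ram_bytes * ratio_percent // 100


if __name__ == "__main__":
    meminfo_path = "/proc/meminfo"
    ratio_path = "/proc/sys/vm/dirty_ratio"  # assumed _ratio counterpart
    if os.path.exists(meminfo_path) and os.path.exists(ratio_path):
        with open(meminfo_path) as f:
            ram = parse_mem_total_bytes(f.read())
        with open(ratio_path) as f:
            ratio = int(f.read())
        print(f"dirty_ratio={ratio}% is roughly {ratio_to_bytes(ratio, ram)} bytes")
    else:
        # Fall back to a made-up machine with 64 GiB of RAM and a ratio of 20
        print(ratio_to_bytes(20, 64 * 1024**3))
```

On a machine with 64 GiB of RAM, a ratio of 20 works out to almost 13 GiB of dirty pages, which is why a fixed _bytes value is easier to reason about.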
The easiest way to set these variables is to add the following to
Lesson 3: Set noatime on your data partition¶
noatime is a mount option that disables the last access time attribute on
files. Although this attribute might be useful, it turns every file read into a
write (to update the attribute), which is a big no.
To do this, add noatime to the appropriate fstab entry:
/dev/mapper/.. /mount-point xfs defaults,noatime 0 0
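After remounting, it is worth checking that the option actually took effect. A minimal Python sketch, assuming a Linux system where /proc/mounts lists the active mounts; the device name and mount point in the sample text are hypothetical:

```python
import os


def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    Each line has the form: device mount_point fstype options dump pass.
    If a mount point appears more than once, the last entry wins.
    Returns None when the mount point is not listed.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def has_noatime(mounts_text, mount_point):
    """True if mount_point is mounted with the noatime option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noatime" in options


# Hypothetical sample; real content comes from /proc/mounts
SAMPLE_MOUNTS = (
    "/dev/mapper/vg-data /mount-point xfs rw,noatime,attr2 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)

if __name__ == "__main__":
    print(has_noatime(SAMPLE_MOUNTS, "/mount-point"))
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            print(has_noatime(f.read(), "/"))
```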
Lesson 4: Read a book¶
PostgreSQL High Performance. If you know any books on tuning
IO devices, I’d be happy to read them (so e-mail me).
Lesson 5: IOPS is an important parameter¶
IOPS stands for “IO Operations Per Second”, or “how many random reads” your disk can make in a second.
A nice thing about SSDs is that they mostly allow random access to data: it doesn’t matter much whether you read one big contiguous file or random parts of that file.
HDDs don’t have this property: if you read data sequentially you’ll get much better read speeds (orders of magnitude better).
So for HDDs, IOPS are the limiting factor for random access.
To see how many IOPS your disk is currently doing, install the package that provides iostat and run iostat:
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdc 113.53
tps is the column you need.
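If you want to log this number over time, the tps column can be pulled out of the report with a few lines of Python. This is only a sketch: it assumes the report has a header line starting with "Device" and containing a "tps" column (the layout differs between iostat versions), and every number in the sample except 113.53 is made up.

```python
def parse_iostat_tps(output):
    """Map device name to tps from iostat-style text.

    The tps column is located by name in the header line, so extra or
    reordered columns do not matter as long as "tps" is present.
    """
    tps_by_device = {}
    tps_index = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            tps_index = None  # a blank line ends the device table
            continue
        if fields[0].rstrip(":") == "Device" and "tps" in fields:
            tps_index = fields.index("tps")
            continue
        if tps_index is not None and len(fields) > tps_index:
            tps_by_device[fields[0]] = float(fields[tps_index])
    return tps_by_device


# Sample report; only the sdc tps value comes from the text above,
# the remaining numbers are placeholders.
SAMPLE_REPORT = """\
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc             113.53      1200.00       800.00     120000      80000
"""

if __name__ == "__main__":
    for device, tps in parse_iostat_tps(SAMPLE_REPORT).items():
        print(device, tps)
```

To feed it live data, the text could come from something like subprocess.run(["iostat", "-d"], capture_output=True, text=True).stdout.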
Typical number of IOPS you can get from HDD is
5400 RPM disks, and
7200 RPM disks
(according to Wikipedia).
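Figures like these can be sanity-checked with back-of-the-envelope arithmetic: one random read costs roughly an average seek plus half a rotation. The sketch below is an approximation under that assumption only; the 9 ms seek time is a hypothetical placeholder, not a measured value for any particular drive.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def estimate_iops(rpm, avg_seek_ms):
    """Rough random-read IOPS: one operation per (seek + half rotation).

    Ignores on-disk cache, command queueing and transfer time.
    """
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))


if __name__ == "__main__":
    ASSUMED_SEEK_MS = 9.0  # hypothetical average seek time
    for rpm in (5400, 7200):
        print(rpm, round(estimate_iops(rpm, ASSUMED_SEEK_MS)))
```

Because the model ignores caching and queueing, measured numbers can come out higher than the estimate.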
iostat showed more IOPS for my 5400 RPM disk, which might be caused by the on-disk cache.
I just didn’t get enough IOPS.
The easiest solution would be to do what everybody does: “just put it in the cloud”. The problem is that it’s expensive, and I’m stingy. There will probably be no money from this project, and if I can scrape it together from parts lying in my drawer, why spend money?
For now let’s ignore VM costs, and focus on the hardware:
- On AWS I could buy 1TB of SSD storage for 119$ a month, or “throughput optimized HDD” for 50$. SSD would be way faster than my solution, and I believe the HDDs would be slower.
- On Google Cloud both HDD and SSD would be faster than my solution, as Google’s HDDs are not that throughput oriented. SSDs cost 200$ and HDDs 50$.
If I add a VM to this I get slightly above 200$ a month.
Note: I’m totally for using the cloud for deploying your services. Managing hardware that has a couple of nines of availability is not trivial, and the cloud is well worth the price.
Just buy SSDs¶
Just buy SSD disks. This would work, but again: I’m stingy, and a 2TB SSD disk is about 850$ (and this is not one of my favourite makes).
If I were sure the total data would stay below 1TB I’d probably buy a 1TB SSD, but SSD prices skyrocket above 1TB.
I bought two 2TB 7400 RPM Western Digital hard disks. I’m a big fan of WD disks.
And I decided to just use them as some form of RAID-0. This way I’ll get 4TB of total storage with decent speeds (but if any drive dies I lose my data).
Lesson 0: RAID is not a backup¶
This is not really something I encountered right now, but I feel this can’t be stressed enough.
I also dislike using RAID controllers, as there is no such thing as a RAID disk format: each manufacturer might make their own disk format, which means that when the controller dies your data might be hard to recover.
Lesson 1: RAID is everywhere¶
In Linux at least, you might get RAID or RAID-like behaviour on many different levels:
- RAID controller
- Software RAID
- Filesystems: it is totally cool that you can have native RAID-X in your BTRFS system at the file system level.
Lesson 2: LVM striping is cool¶
If you have more than a single disk in your LVM volume group, you can use striping, which works more or less like RAID-0 but with some advantages:
- It’s easy to set up.
- You can easily create and delete volumes with and without striping.
- You don’t have the constraint that “the size of the volume is the lowest of the sizes of the physical volumes”.
The downside is that RAID probably gives you better performance (at least this is what I read on the Internet)!
Lesson 3: Stripe size matters¶
I basically created a striped volume on two disks, and I got no performance
increase whatsoever. Synthetic tests showed a performance decrease:
7400RPM disks got slower than single
Lesson 4: Measure performance¶
There are very nice performance tools. One I used was sysbench (it’s in
My initial assumption was that I had selected the wrong stripe size. A stripe is the amount of contiguous data that LVM puts on a single disk; the next stripe goes on the next disk, and so forth.
There is no “best” stripe size. I would suggest a stripe larger than your typical random read, so that a typical random read does not require coordinating a couple of disks.
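The effect of stripe size on a single read can be illustrated with a small model. This is a simplified sketch of RAID-0 style striping, not LVM’s actual implementation; the function names and the sizes in the demo are illustrative assumptions.

```python
def stripe_location(offset, stripe_size, n_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Stripes are dealt out round-robin: stripe 0 goes to disk 0,
    stripe 1 to disk 1, and so on, wrapping around after n_disks.
    """
    stripe_index = offset // stripe_size
    disk = stripe_index % n_disks
    disk_offset = (stripe_index // n_disks) * stripe_size + offset % stripe_size
    return disk, disk_offset


def disks_touched(offset, length, stripe_size, n_disks):
    """Number of distinct disks a read of `length` bytes at `offset` hits."""
    first_stripe = offset // stripe_size
    last_stripe = (offset + length - 1) // stripe_size
    return min(last_stripe - first_stripe + 1, n_disks)


if __name__ == "__main__":
    KIB = 1024
    # A 16 KiB read on two disks, with 4 KiB and 64 KiB stripes
    for stripe in (4 * KIB, 64 * KIB):
        print(stripe, disks_touched(0, 16 * KIB, stripe, 2))
```

With a stripe smaller than the read, every read involves both disks, so the two drives cannot serve different requests at the same time. With a larger stripe most reads stay on one disk, and only those that cross a stripe boundary touch two.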
Lesson 5: How to measure performance¶
For now I ran a very simple benchmark:
- I created a volume with a given stripe size
- I created an XFS filesystem with block size equal to the stripe size (not sure if this is a good idea)
- I ran sysbench fileio --file-total-size=128GB prepare
- I ran sysbench fileio --file-total-size=128GB --file-test-mode=rndrw --file-fsync-freq=10 --threads=8 for a couple of block sizes (block size is how much data is read or written in a single batch)
If you can read Python, here is the script I used: :download:
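Independently of that script, a minimal sketch of such a loop could look like the following. It only builds and prints the command lines. The options are the ones listed above, plus sysbench’s --file-block-size option and run command, and the list of block sizes is a hypothetical choice.

```python
import shlex

# Options taken from the benchmark description above
COMMON = [
    "sysbench", "fileio",
    "--file-total-size=128GB",
]
RUN_OPTIONS = [
    "--file-test-mode=rndrw",
    "--file-fsync-freq=10",
    "--threads=8",
]

# Hypothetical block sizes to try, in bytes
BLOCK_SIZES = [4096, 16384, 65536, 262144]


def prepare_command():
    """Command that creates the test files."""
    return COMMON + ["prepare"]


def run_commands(block_sizes):
    """One benchmark command per block size."""
    return [
        COMMON + RUN_OPTIONS + [f"--file-block-size={size}", "run"]
        for size in block_sizes
    ]


if __name__ == "__main__":
    print(shlex.join(prepare_command()))
    for command in run_commands(BLOCK_SIZES):
        # To execute for real: subprocess.run(command, check=True)
        print(shlex.join(command))
```

Repeating the whole loop once per stripe size (recreating the volume and filesystem in between) gives a stripe size by block size grid of results to compare.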
My results were as follows:
- For larger block sizes, bigger stripes gave me much more throughput.
- For smaller block sizes, bigger stripes were not that important.
Ultimately I decided I’ll use a 64kb stripe size.
Right now I get about 100 IOPS from each drive and more than 100MB/s read and write speeds from both drives (in my specific workload!).
That is probably enough for the time being, and I’m not sure I could get better performance from two disks that cost less than 300$ (or 1000 złotys).
If I do some more tweaking I’ll let you know what I did and how.