Dr. StrangeRAID, or, how I learned to stop worrying and love ZFS


So now that we’ve chosen to go with Solaris and use NFS for our filesystem, let’s consider how we’re going to use ZFS to best suit our needs.

So let’s say that I have four 1Tb disks lying around which I want to throw into my new file server. At some point in the future, I’d like to upgrade my capacity by replacing one or all of the disks with larger ones (say, 2Tb).

How do we go about it?

Before we start, there are a few things that we need to be mindful of. Rules of ZFS, if you will. They are:

  • A raidz vdev cannot have any additional devices added to it, but its devices can be replaced. (Single disks and mirrors can have mirror devices attached, as Scenario 2 shows.)
  • A vdev cannot be removed from a pool. You have to destroy the pool and start over.
  • A raidz vdev should contain disks of equal size. If the sizes differ, every disk is treated as being the size of the smallest one.
  • A raidz vdev can be expanded, but some trickery is required.

Essentially when you create a group of disks in a zpool, you’re creating what’s known as a vdev. A vdev can be a single disk, an N-way mirrored array, a RaidZ1 array or a RaidZ2 array. It can also be a logging disk, a cache disk or a spare disk, but we’ll get on to them down the track.

For my experiment, I’ll be using files instead of disks, but creating 1Tb files is plain silly, so I’ll use 1Gb files instead. So wherever you see a space reading in Gigabytes, assume that it’s Terabytes. To create the files, we’ll first create a folder for them to live in, then make a few 1Gb “disks” for each scenario we want to experiment with.

root@Zues:~# mkdir /zdev
root@Zues:~# cd /zdev
root@Zues:/zdev# mkfile 1g flat_1a && mkfile 2g flat_2a && mkfile 1g flat_1b ...

...and so on. 49Gb of disk images total!
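
The elided commands follow a pattern, so a loop could generate them. Here is a minimal sketch, assuming the <scenario>_<size><letter> naming that the zpool commands in the rest of this post use (flat_1a, mirror_2c and so on). The helper name is made up for illustration, and it only prints the mkfile commands rather than running them, so you can check the list first.

```shell
#!/bin/sh
# Print (not run) the mkfile commands for one scenario's disk images.
# Hypothetical helper; the naming scheme is assumed from the zpool commands in the post.
print_mkfile_cmds() {
    prefix=$1                      # e.g. flat, mirror, raidz1, raidz2
    for size in 1 2; do            # 1g "old" disks and 2g "new" disks
        for letter in a b c d; do  # four disks per set
            echo "mkfile ${size}g /zdev/${prefix}_${size}${letter}"
        done
    done
}

print_mkfile_cmds flat
```

Once the list looks right, the output could be piped to sh on a system that has mkfile.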

Scenario 1: Flat dynamically striped array (“Dynamic” RAID0)

The reason ZFS striped arrays are called “Dynamic” is because… well, they are. Adding or replacing disks is simple and frighteningly quick – most times involving only a single command, and no reboots to make it take effect.

Creating a dynamically striped array is as simple as:

root@Zues:/zdev# zpool create dynamic /zdev/flat_1a /zdev/flat_1b /zdev/flat_1c /zdev/flat_1d

Which gives us:

root@Zues:/zdev# zpool list dynamic
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
dynamic 3.97G 79.5K 3.97G 0% ONLINE -
root@Zues:/zdev# df -h |grep dynamic
dynamic 4.0G 21K 4.0G 1% /dynamic

Four gigs of space. Not bad. Replacing our disks in the future happens like this:

root@Zues:/zdev# zpool replace dynamic /zdev/flat_1a /zdev/flat_2a
root@Zues:/zdev# zpool replace dynamic /zdev/flat_1b /zdev/flat_2b
root@Zues:/zdev# zpool replace dynamic /zdev/flat_1c /zdev/flat_2c
root@Zues:/zdev# zpool replace dynamic /zdev/flat_1d /zdev/flat_2d

root@Zues:/zdev# zpool list dynamic
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
dynamic 3.97G 104K 3.97G 0% ONLINE -
root@Zues:/zdev# df -h |grep dynamic
dynamic 4.0G 21K 4.0G 1% /dynamic

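So it’s as easy as pie to replace the disks in your array, and although I did them all at the same time, it’s quite simple to do them one at a time. Note that the pool still reports 3.97G straight after the swap: as with the mirrored array in Scenario 2, the extra space only shows up once the pool has been exported and re-imported.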
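
Doing the replacements one at a time is just the same command repeated per disk. Here is a rough sketch that prints the four replace commands for the pool above. The helper name is hypothetical, the old/new file naming is taken from the commands shown earlier, and nothing is actually executed.

```shell
#!/bin/sh
# Print the zpool replace command for each disk in turn (dry run only).
# Assumes old disks are named <prefix>_1<letter> and new ones <prefix>_2<letter>.
print_replace_cmds() {
    pool=$1
    prefix=$2
    for letter in a b c d; do
        echo "zpool replace ${pool} /zdev/${prefix}_1${letter} /zdev/${prefix}_2${letter}"
    done
}

print_replace_cmds dynamic flat
```

Running the printed commands one by one, and checking zpool status in between, gives the one-at-a-time upgrade.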

Let’s see how it performs:

[Screenshot: MWSnap028]

Pros:

  • Maximum storage
  • Maximum simplicity
  • Great performance
  • Replace any disk at any time to expand the array

Cons:

  • Zero Redundancy

Conclusion:
Best if you want something really simple, with lots of storage space, and can live with no redundancy.

Scenario 2: Mirroring (RAID1)

Building a mirrored array is a little trickier to get your head around. The first way to do it is to add disks in pairs and let the system figure it out. My preferred way, however, is to create a standard ZFS array of two disks, then attach the other two disks as hot mirrors of the first two. I’ll have to call this array ‘reflect’ because pool names beginning with ‘mirror’ are reserved by ZFS.

root@Zues:/zdev# zpool create reflect /zdev/mirror_1a /zdev/mirror_1b

and now we’ll attach the mirror disks:

root@Zues:/zdev# zpool attach reflect /zdev/mirror_1a /zdev/mirror_1c
root@Zues:/zdev# zpool attach reflect /zdev/mirror_1b /zdev/mirror_1d
root@Zues:/zdev# zpool status reflect
  pool: reflect
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Oct 14 21:27:04 2009
config:

        NAME                 STATE     READ WRITE CKSUM
        reflect              ONLINE       0     0     0
          mirror             ONLINE       0     0     0
            /zdev/mirror_1a  ONLINE       0     0     0
            /zdev/mirror_1c  ONLINE       0     0     0
          mirror             ONLINE       0     0     0
            /zdev/mirror_1b  ONLINE       0     0     0
            /zdev/mirror_1d  ONLINE       0     0     0  55.5K resilvered

errors: No known data errors

The available space of this array, as you would imagine, is 2Gb, thanks to there being two mirrored pairs of 1Gb disks.
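
For comparison, the first approach (adding the disks in pairs) would be a single create command with the mirror keyword in front of each pair. Here is a sketch that builds that command line from a list of disks and prints it without running it. The function name is invented for this example, and the pairing order is an assumption chosen to match the layout shown above.

```shell
#!/bin/sh
# Build (and print, not run) a "zpool create" command that groups disks into 2-way mirrors.
# Usage: mirror_create_cmd POOL DISK1 DISK2 [DISK3 DISK4 ...]
# A leftover odd disk at the end is ignored by this sketch.
mirror_create_cmd() {
    cmd="zpool create $1"
    shift
    while [ $# -ge 2 ]; do
        cmd="$cmd mirror $1 $2"   # each "mirror" keyword starts a new mirrored vdev
        shift 2
    done
    echo "$cmd"
}

mirror_create_cmd reflect /zdev/mirror_1a /zdev/mirror_1c /zdev/mirror_1b /zdev/mirror_1d
```

The attach route used above gets to the same layout in three commands; this one just does it in a single step.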

To upgrade a mirrored array, you simply need to detach the mirror disk, replace the primary, then attach a new mirror disk. For example:

root@Zues:/zdev# zpool detach reflect /zdev/mirror_1c
root@Zues:/zdev# zpool replace reflect /zdev/mirror_1a /zdev/mirror_2a
root@Zues:/zdev# zpool attach reflect /zdev/mirror_2a /zdev/mirror_2c
root@Zues:/zdev# zpool status reflect
  pool: reflect
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Oct 14 21:30:16 2009
config:

        NAME                 STATE     READ WRITE CKSUM
        reflect              ONLINE       0     0     0
          mirror             ONLINE       0     0     0
            /zdev/mirror_2a  ONLINE       0     0     0
            /zdev/mirror_2c  ONLINE       0     0     0  34K resilvered
          mirror             ONLINE       0     0     0
            /zdev/mirror_1b  ONLINE       0     0     0
            /zdev/mirror_1d  ONLINE       0     0     0

errors: No known data errors

But hang on a second! Even though we’ve upgraded one of the mirrors, we’re still only showing 2Gb of disk space! We should have at least 3Gb available!

root@Zues:/zdev# zpool list reflect
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
reflect 1.98G 96K 1.98G 0% ONLINE -

Don’t panic, this is standard operating procedure. In order to make the new disk space available to the pool, we need to export it and re-import it.

root@Zues:/zdev# zpool export reflect
root@Zues:/zdev# zpool import -d /zdev/ reflect
root@Zues:/zdev# zpool list reflect
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
reflect 2.98G 94.5K 2.98G 0% ONLINE -

Note that the -d option and the directory path are only necessary when you’re working with files. Were these actual disks, I could just type “zpool import <poolname>” and get the same result.

As for performance, mirroring is definitely the way to go if you’re concerned about throughput. Over a 100 megabit network, this array gives perfect performance:

[Screenshot: MWSnap029]

Pros:

  • Excellent Redundancy
  • Excellent Performance
  • Still fairly simple
  • Can upgrade the array one mirror at a time
  • Can roll back to dynamic array at any time by detaching mirrors and adding them to the pool

Cons:

  • Very expensive

Conclusion:
Great mix of redundancy and flexibility, but not for the faint of wallet.

Scenario 3: RaidZ1 (RAID5)

A Z1 array is essentially a RAID5 array, but much, much sexier. Why? Because it has a Z in front of it!

Seriously, though, the RAIDZ array sits pretty much halfway between naked disks and mirrored disks: you give up one disk’s worth of space to parity (n-1 capacity) in exchange for surviving a single disk failure, so while you’re secure, you can still have a good amount of storage without breaking the bank. Let’s go ahead and set one up.

root@Zues:/zdev# zpool create z1 raidz /zdev/raidz1_1a /zdev/raidz1_1b /zdev/raidz1_1c /zdev/raidz1_1d
root@Zues:/zdev# zpool status z1
  pool: z1
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        z1                   ONLINE       0     0     0
          raidz1             ONLINE       0     0     0
            /zdev/raidz1_1a  ONLINE       0     0     0
            /zdev/raidz1_1b  ONLINE       0     0     0
            /zdev/raidz1_1c  ONLINE       0     0     0
            /zdev/raidz1_1d  ONLINE       0     0     0

errors: No known data errors

So now that the RaidZ array has been created (note that the vdev specifies that it’s raidz1), let’s look at how much space we’ve actually got to play with.

root@Zues:/zdev# zfs list z1
NAME USED AVAIL REFER MOUNTPOINT
z1 96.5K 2.92G 31.4K /z1

Remember, n-1 spacing means that even though we’ve got 4Gb of disks, we only get 3Gb of space, and then 1/64th of that space is reserved by ZFS.
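
That arithmetic can be sketched as a quick estimate. This is a back-of-the-envelope helper, not something ZFS provides: the function name is hypothetical, and the formula is just the n-1 rule plus the 1/64th reservation described above.

```shell
#!/bin/sh
# Rough usable-space estimate for a raidz vdev built from equal-sized disks.
# usable = (disks - parity) * disk_size * 63/64
# Hypothetical helper; it models only the two effects mentioned in the text.
raidz_usable() {
    disks=$1; parity=$2; size=$3
    awk -v n="$disks" -v p="$parity" -v s="$size" \
        'BEGIN { printf "%.2f\n", (n - p) * s * 63 / 64 }'
}

raidz_usable 4 1 1   # four 1Gb disks, single parity
```

The estimate comes out a little above the 2.92G that zfs list reports, presumably because of other pool overhead that this sketch doesn’t model.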

Now let’s say it’s a year down the track, we’ve mostly filled our array with files, and we want to replace these disks with larger ones. First, you’ll have to make sure that the autoexpand property (referred to in the commands below by its short name, expand) is enabled on the pool, otherwise it won’t automatically grow when we add the new disks.

root@Zues:/zdev# zpool get expand z1
NAME PROPERTY VALUE SOURCE
z1 autoexpand off default
root@Zues:/zdev# zpool set expand=on z1
root@Zues:/zdev# zpool get expand z1
NAME PROPERTY VALUE SOURCE
z1 autoexpand on local

Excellent. Now we can go ahead and swap out our disks for the new 2Gb models we just bought.

root@Zues:/zdev# zpool replace z1 /zdev/raidz1_1a /zdev/raidz1_2a
root@Zues:/zdev# zpool replace z1 /zdev/raidz1_1b /zdev/raidz1_2b
root@Zues:/zdev# zpool replace z1 /zdev/raidz1_1c /zdev/raidz1_2c
root@Zues:/zdev# zpool replace z1 /zdev/raidz1_1d /zdev/raidz1_2d
root@Zues:/zdev# zpool status z1
  pool: z1
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Oct 14 21:54:17 2009
config:

        NAME                 STATE     READ WRITE CKSUM
        z1                   ONLINE       0     0     0
          raidz1             ONLINE       0     0     0
            /zdev/raidz1_2a  ONLINE       0     0     0
            /zdev/raidz1_2b  ONLINE       0     0     0
            /zdev/raidz1_2c  ONLINE       0     0     0
            /zdev/raidz1_2d  ONLINE       0     0     0  31K resilvered

errors: No known data errors

root@Zues:/zdev# zfs list z1
NAME USED AVAIL REFER MOUNTPOINT
z1 123K 5.87G 31.4K /z1

Excellent. Just under 6Gb of space to play with, although I shudder to think how much four of those disks cost. Let’s see how it performs.

[Screenshot: MWSnap030]

Ok, so performance is a little bit degraded because the system needs to calculate parity blocks during the write process. CPU utilization goes up for the same reason.
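
To get a feel for what that parity work is, here is a toy illustration using XOR, which is how single parity is typically computed. It is only a sketch with made-up numbers standing in for data blocks, and it says nothing about how ZFS actually lays out its stripes.

```shell
#!/bin/sh
# Toy single-parity example: three data values plus one XOR parity value.
# The values are arbitrary stand-ins for blocks on three data disks.
d1=23; d2=42; d3=99
parity=$(( d1 ^ d2 ^ d3 ))          # extra work done on every write

# Pretend the disk holding d2 died: rebuild it from the survivors and the parity.
rebuilt=$(( parity ^ d1 ^ d3 ))
echo "parity=$parity rebuilt_d2=$rebuilt"
```

The rebuilt value matches the lost one, which is the whole trick: the price is one extra calculation and one extra write per stripe.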

Pros:

  • Single disk Redundancy
  • Can also add spares which can kick in if a drive fails
  • Halfway between naked drives and mirroring for price

Cons:

  • Not so stellar performance
  • To upgrade, all drives must be replaced

Conclusion:
Not for the faint of heart when it comes to upgrade paths, but definitely one of the most versatile options.

Scenario 4: RaidZ2 (RAID6)

Essentially this is the ZFS equivalent of RAID6, and offers two disk redundancy – i.e. two disks can fail before you need to start worrying about data. The only downside is that you’ll lose two disks to parity information, so it tends to get rather pricey. First, let’s create a RaidZ2 array:

root@Zues:/zdev# zpool create z2 raidz2 /zdev/raidz2_1a /zdev/raidz2_1b /zdev/raidz2_1c /zdev/raidz2_1d
root@Zues:/zdev# zpool status z2
  pool: z2
 state: ONLINE
 scrub: none requested
config:

        NAME                 STATE     READ WRITE CKSUM
        z2                   ONLINE       0     0     0
          raidz2             ONLINE       0     0     0
            /zdev/raidz2_1a  ONLINE       0     0     0
            /zdev/raidz2_1b  ONLINE       0     0     0
            /zdev/raidz2_1c  ONLINE       0     0     0
            /zdev/raidz2_1d  ONLINE       0     0     0

errors: No known data errors

and the available space?

root@Zues:/zdev# zfs list z2
NAME USED AVAIL REFER MOUNTPOINT
z2 101K 1.95G 31.4K /z2

Ew. Not much space to play with, considering we just threw four 1Gb disks at it, although we had the same sort of thing with mirroring. Upgrading this array is the same as for a RaidZ1 array, in that we need to enable expand, then upgrade all four disks.

root@Zues:/zdev# zpool get expand z2
NAME PROPERTY VALUE SOURCE
z2 autoexpand off default
root@Zues:/zdev# zpool set expand=on z2
root@Zues:/zdev# zpool replace z2 /zdev/raidz2_1a /zdev/raidz2_2a
root@Zues:/zdev# zpool replace z2 /zdev/raidz2_1b /zdev/raidz2_2b
root@Zues:/zdev# zpool replace z2 /zdev/raidz2_1c /zdev/raidz2_2c
root@Zues:/zdev# zpool replace z2 /zdev/raidz2_1d /zdev/raidz2_2d
root@Zues:/zdev# zpool status z2
  pool: z2
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Oct 14 22:17:12 2009
config:

        NAME                 STATE     READ WRITE CKSUM
        z2                   ONLINE       0     0     0
          raidz2             ONLINE       0     0     0
            /zdev/raidz2_2a  ONLINE       0     0     0
            /zdev/raidz2_2b  ONLINE       0     0     0
            /zdev/raidz2_2c  ONLINE       0     0     0
            /zdev/raidz2_2d  ONLINE       0     0     0  56.5K resilvered

errors: No known data errors
root@Zues:/zdev# zfs list z2
NAME USED AVAIL REFER MOUNTPOINT
z2 126K 3.91G 31.4K /z2

So with four 2Gb disks, we now have just shy of 4Gb of space which is double redundant. Let’s see what its performance is like:

[Screenshot: MWSnap031]

Another small drop thanks to the extra parity we have to write, but nothing out of the ordinary.

Pros:

  • Double disk Redundancy
  • Can also add spares which can kick in if a drive fails

Cons:

  • Underwhelming performance
  • To upgrade, all drives must be replaced
  • Lots of disk space lost to Parity

Conclusion:
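In a massive array, RaidZ2 would be very worthwhile, but in our four-disk example, mirroring gives the same usable space and has the added bonus of being much easier to expand. Bear in mind, though, that RaidZ2 survives any two disk failures, whereas a pair of two-way mirrors only survives two failures if they land in different mirrors.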
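
To put rough numbers on that comparison, here is a small sketch that works out the raw usable space of each layout for a given number of equal-sized disks. The function and layout names are made up for this example, two-way mirrors are assumed, and ZFS’s own overhead is ignored.

```shell
#!/bin/sh
# Raw usable space for N equal-sized disks in each layout, ignoring ZFS overhead.
# Hypothetical helper: layout_usable LAYOUT DISKS SIZE
layout_usable() {
    layout=$1; disks=$2; size=$3
    case $layout in
        stripe) echo $(( disks * size )) ;;
        mirror) echo $(( disks / 2 * size )) ;;      # 2-way mirrors assumed
        raidz1) echo $(( (disks - 1) * size )) ;;
        raidz2) echo $(( (disks - 2) * size )) ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac
}

# Four 1Gb disks, as in the scenarios above.
for l in stripe mirror raidz1 raidz2; do
    echo "$l: $(layout_usable "$l" 4 1)"
done
```

With four disks, mirroring and RaidZ2 come out level. Try a larger disk count (say, 8) and RaidZ2 pulls ahead, which is why it makes more sense in a big array.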

So there you have it, folks. I’ll admit that this look into ZFS and its various options isn’t exactly in-depth, but hopefully it gives you an idea of what is available.

Final Conclusion.

I can’t tell you which style of zpool layout to go for. For most people, a flat dynamically striped array is probably all they need, as all they really care about is the size of the array. Others might baulk at the idea of putting so much data in the hands of the gods – after all, a flat array is horribly susceptible to disk failure.

Having weighed up all the options, I can say that the most likely one I’ll go for is a four-disk mirrored array. The fact that I can upgrade two of the disks and leave the other two alone is a massive bonus, as it means that while I enjoy the luxury of mirrored disks, I also don’t have to buy four new disks when it comes time to upgrade. Two new 2Tb disks, replacing one mirrored pair, will bump my storage array to 3Tb – more than enough to keep me trucking. (A single new disk won’t add space on its own, since a mirror is only as big as its smaller disk.)
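
The upgrade arithmetic for mirrors can be sketched like this. The helper name is hypothetical, and it rests on the assumption that each two-way mirror contributes the size of its smaller disk, which is worth keeping in mind when buying disks one at a time.

```shell
#!/bin/sh
# Capacity of a pool of 2-way mirrors: each pair contributes the size of its smaller disk.
# Usage: mirror_pool_size A1 B1 [A2 B2 ...]   (whole-number sizes in Tb, two arguments per pair)
mirror_pool_size() {
    total=0
    while [ $# -ge 2 ]; do
        if [ "$1" -le "$2" ]; then small=$1; else small=$2; fi
        total=$(( total + small ))
        shift 2
    done
    echo "$total"
}

mirror_pool_size 1 1 1 1   # today: two pairs of 1Tb disks
mirror_pool_size 2 2 1 1   # one pair upgraded to 2Tb disks
mirror_pool_size 2 1 1 1   # only one disk of a pair upgraded
```

Under that assumption, the pool only grows once both disks in a pair have been swapped for the larger size.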

If you’d like to simulate any of these options, all you need to do is grab a copy of the OpenSolaris LiveCD, whack it into your computer (or your virtual machine software of choice) and give it a go.

Enjoy!
