ZFS Fun
This tutorial is under heavy revision to switch it from ZFS FUSE to ZFS on Linux.
Introduction
ZFS features and limitations
ZFS offers an impressive number of features, even putting aside its hybrid nature (it is both a filesystem and a volume manager, see zvol), which is covered in detail on Wikipedia. One of the most fundamental points to keep in mind about ZFS is that it aims at legendary reliability in terms of preserving data integrity. ZFS uses several techniques to detect and repair (self-heal) corrupted data. Simply put, it makes aggressive use of checksums and relies on data redundancy; the price to pay is a bit more CPU processing power. However, the Wikipedia article about ZFS also mentions that using ZFS on top of classic RAID arrays is strongly discouraged, because ZFS then cannot control the data redundancy, which ruins most of its benefits.
In short, ZFS has the following features (not exhaustive):
- Storage pool dividable in one or more logical storage entities.
- Plenty of space:
  - 256 zettabytes per storage pool (2^64 storage pools max in a system).
  - 16 exabytes max for a single file
  - 2^48 entries max per directory
- Virtual block-device support over a ZFS pool (zvol) - (extremely cool when used jointly with a RAID-Z volume)
- Read-only snapshot support (it is possible to get a read-write copy of them, those are named clones)
- Encryption support (supported only from ZFS version 30 onwards; ZFS version 31 is shipped with Oracle Solaris 11, so that version is mandatory if you plan to encrypt your ZFS datasets/pools)
- Built-in RAID-5-like capabilities on steroids, known as RAID-Z, and RAID-6-like capabilities on steroids, known as RAID-Z2. RAID-Z3 (triple parity) also exists.
- Copy-on-Write transactional filesystem
- Meta-attribute support (properties) allowing you to easily drive the show, like "That directory is encrypted", "That directory is limited to 5GiB", "That directory is exported via NFS" and so on. Depending on what you define, ZFS does the job for you!
- Dynamic striping to optimize data throughput
- Variable block length
- Data deduplication
- Automatic pool re-silvering
- Transparent data compression
- Transparent encryption (Solaris 11 and later only)
Most notable limitations are:
- Lack of a feature that ZFS developers call "Block Pointer rewrite functionality" (planned to be developed). Without it, ZFS currently does not support:
  - Pool defragmentation (the COW techniques used in ZFS mitigate the problem)
  - Pool resizing
  - Re-applying data compression
  - Adding a device to a RAID-Z/Z2/Z3 pool to increase its size (however, it is possible to replace each of the disks composing a RAID-Z/Z2/Z3 in sequence)
- NOT A CLUSTERED FILESYSTEM like Lustre, GFS or OCFS2
- No data healing if used on a single device (corruption can still be detected); the workaround is to force data duplication on the drive
- No TRIM support (SSD devices)
ZFS on well known operating systems
Linux
Although the source code of ZFS is open, its license (Sun CDDL) is incompatible with the license governing the Linux kernel (GNU GPL v2), which prevents its direct integration. A couple of ports exist, however, but they suffer from a lack of maturity and features. As of writing (February 2014), two implementations are known:
- ZFS-fuse: a purely userland implementation relying on FUSE. This implementation can be considered defunct as of February 2014. The original ZFS FUSE site seems to have disappeared; nevertheless, the source code is still available at http://freecode.com/projects/zfs-fuse. ZFS FUSE stalled at version 0.7.0 in 2011 and has not really evolved since.
- ZFS on Linux: a kernel-mode implementation of ZFS which supports a lot of ZFS features. The implementation is not as complete as it is under Solaris and its siblings like OpenIndiana (e.g. SMB integration is still missing, there is no encryption support...) but a lot of functionality is there. This is the implementation used for this article. As ZFS on Linux is an out-of-tree Linux kernel implementation, you must wait for updated patches after each Linux kernel release. ZFS on Linux currently supports zpool version 28 and has been considered ready for production since version 0.6.2.
Solaris/OpenIndiana
- Oracle Solaris: remains the de facto reference platform for ZFS: ZFS on this platform is now considered mature and usable on production systems. Solaris 11 uses ZFS even for its "system" pool (aka rpool). A great advantage of this: it is now quite easy to revert the effect of a patch, provided a snapshot was taken just before applying it. In the "good old" days of Solaris 10 and before, reverting a patch could be tricky and complex, when it was possible at all. ZFS is far from new in Solaris: it has its roots in 2005 and was integrated in Solaris 10 6/06, released in June 2006.
- OpenIndiana: is based on the illumos kernel (a derivative of the now defunct OpenSolaris), which aims to provide absolute binary compatibility with Sun/Oracle Solaris. It is worth mentioning that the Solaris kernel and the illumos kernel once shared the same code base; however, they have followed different paths since Oracle announced the discontinuation of OpenSolaris (August 13th 2010). Like Oracle Solaris, OpenIndiana uses ZFS for its system pool. ZFS support in the illumos kernel lags a bit behind Oracle: it supports zpool version 28 whereas Oracle Solaris 11 supports zpool version 31, data encryption being supported from zpool version 30.
*BSD
- FreeBSD: ZFS has been present in FreeBSD since FreeBSD 7 (zpool version 6) and FreeBSD can boot from a ZFS volume (zfsboot). ZFS support has been vastly enhanced in FreeBSD 8.x (8.2 supports zpool version 15, 8.3 supports version 28), FreeBSD 9 and FreeBSD 10 (both support zpool version 28). ZFS in FreeBSD is now considered fully functional and mature. FreeBSD derivatives such as the popular FreeNAS take advantage of ZFS and have integrated it in their tools. FreeNAS, for example, supports zvols through its Web management interface (FreeNAS >= 8.0.1).
- NetBSD: porting ZFS started as a GSoC project in 2007 and ZFS has been present in the NetBSD mainline since 2009 (zpool version 13).
- OpenBSD: No ZFS support yet, and none is planned until Oracle changes some policies, according to the project FAQ.
ZFS alternatives
- WAFL seems to have severe limitations [1] (the document is not dated); an interesting article is also available here
- BTRFS is advancing every week but it still lacks features such as the capability of emulating a virtual block device over a storage pool (zvol), and built-in support for RAID-5/6 is not complete yet (cf. Btrfs mailing list). At the date of writing, it is still experimental whereas ZFS is used on big production servers.
- VxFS has also been targeted by comparisons like this one (a bit controversial). VxFS has been known in the industry since 1993 and is known for its legendary flexibility. Symantec acquired VxFS and offers a basic version of it (no clustering, for example) under the name Veritas Storage Foundation Basic
- An interesting discussion about modern filesystems can be found on OSNews.com
ZFS vs BTRFS at a glance
Some key features in no particular order of importance between ZFS and BTRFS:
Feature | ZFS | BTRFS | Remarks |
---|---|---|---|
Transactional filesystem | YES | YES | |
Journaling | NO | YES | Not a design flaw, but ZFS is robust by design... See page 7 of "ZFS The last word on filesystems". |
Dividable pool of data storage | YES | YES | |
Read-only snapshot support | YES | YES | |
Writable snapshot support | YES | YES | |
Sending/Receiving a snapshot over the network | YES | YES | |
Rollback capabilities | YES | YES | While ZFS knows where and how to rollback the data (on-line), BTRFS requires a bit more work from the system administrator (off-line). |
Virtual block-device emulation | YES | NO | |
Data deduplication | YES | YES | Built-in in ZFS, third party tool (bedup) in BTRFS |
Data blocks reoptimization | NO | YES | ZFS is missing a "block pointer rewrite functionality", true on all known implementations so far. This does not cripple performance much, however. BTRFS can do on-line data defragmentation. |
Built-in data redundancy support | YES | YES | ZFS has a sort of RAID-5/6 capability (but better: RAID-Z{1,2,3}). BTRFS only fully supports data mirroring at this point; some work remains to be done on parity handling in BTRFS. |
Management by attributes | YES | NO | Nearly everything touching ZFS management is related to attribute manipulation (quotas, sharing over NFS, encryption, compression...). BTRFS also retains the concept but it is used less aggressively. |
Production quality code | NO | NO | ZFS support in Linux is not considered production quality (yet) although it is very robust. Several operating systems like Solaris/OpenIndiana have a production quality implementation; Solaris/OpenIndiana is now installed in ZFS datasets by default. |
Integrated within the Linux kernel tree | NO | YES | ZFS is released under the CDDL license... |
ZFS resource naming restrictions
Before going further, you must be aware of the restrictions concerning the names you can use on a ZFS filesystem. The general rule is: you can use all of the alphanumeric characters plus the following special characters:
- Underscore (_)
- Hyphen (-)
- Colon (:)
- Period (.)
The name used to designate a ZFS pool has no particular restriction except:
- it can't be one of the following reserved words:
  - mirror
  - raidz (raidz2, raidz3 and so on)
  - spare
  - cache
  - log
- names must begin with an alphanumeric character (same for ZFS datasets).
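The rules listed above can be turned into a quick pre-flight check. The following is a minimal sketch that only encodes the restrictions as stated in this section; `is_valid_pool_name` is a hypothetical helper written for this example, and the real `zpool create` remains the authority (it may apply stricter rules than the ones listed here).

```shell
#!/bin/sh
# Sketch: check a candidate pool name against the rules listed above.
# is_valid_pool_name is a hypothetical helper, not part of the ZFS tools.
is_valid_pool_name() {
    name=$1
    # Reserved words; "raidz*" also covers raidz2, raidz3 and so on
    # (deliberately strict: it rejects any name starting with "raidz").
    case $name in
        mirror|raidz*|spare|cache|log) return 1 ;;
    esac
    # The name must begin with an alphanumeric character...
    case $name in
        [A-Za-z0-9]*) ;;
        *) return 1 ;;
    esac
    # ...and contain only alphanumerics, underscore, hyphen, colon, period.
    case $name in
        *[!A-Za-z0-9_.:-]*) return 1 ;;
    esac
    return 0
}

for candidate in myfirstpool tank raidz2 _hidden 'my pool'; do
    if is_valid_pool_name "$candidate"; then
        echo "$candidate: ok"
    else
        echo "$candidate: rejected"
    fi
done
```

With the candidates above, the first two names are accepted and the last three are rejected (reserved word, leading underscore, embedded space).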
Some ZFS concepts
Once again with no particular order of importance:
ZFS | What it is | Counterparts examples |
---|---|---|
zpool | A group of one or many physical storage media (hard drive partition, file...). A zpool has to be divided into at least one ZFS dataset or at least one zvol to hold any data. Several zpools can coexist in a system provided each has a unique name. Also note that zpools can never be mounted; the only things that can be mounted are the ZFS datasets they hold. | |
dataset | A logical subdivision of a zpool, mounted in your host's VFS, where your files and directories reside. Several ZFS datasets can coexist in a single system provided each has a unique name within its zpool. | |
snapshot | A read-only photo of the state of a ZFS dataset, taken at a precise moment in time. ZFS has no way to cooperate on its own with applications that read and write data on ZFS datasets: if they still hold unflushed data at the moment the snapshot is taken, only what has been flushed will be included in the snapshot. It is worth mentioning that snapshots take no disk space, aside from some metadata, at the exact time they are created; their size grows as more and more data blocks (i.e. files) are deleted or changed on their corresponding live ZFS dataset. | |
clone | Exactly what the name says: a writable physical clone of a snapshot. | |
zvol | An emulated block device whose data is held behind the scenes in the zpool the zvol has been created in. | No known equivalent even in BTRFS |
Your first contact with ZFS
Requirements
- ZFS userland tools installed (package sys-fs/zfs)
- ZFS kernel modules built and installed (package sys-fs/zfs-kmod). There is a known issue with the kernel 3.13 series, see this thread on Funtoo's forum
- Disk size of 64 Mbytes as a bare minimum (128 Mbytes is the minimum size of a pool). Multiple disks will be simulated through the use of several raw images accessed via the Linux loopback devices.
- At least 512 MB of RAM
Preparing
Once you have emerged sys-fs/zfs and sys-fs/zfs-kmod, you have two options to start using ZFS:
- Either you start /etc/init.d/zfs (it will load all of the zfs kernel modules for you plus a couple of other things)
- Or you load the zfs kernel modules by hand
So:
root # rc-service zfs start
Or:
root # modprobe zfs
root # lsmod | grep zfs
zfs                   874072  0
zunicode              328120  1 zfs
zavl                   12997  1 zfs
zcommon                35739  1 zfs
znvpair                48570  2 zfs,zcommon
spl                    58011  5 zfs,zavl,zunicode,zcommon,znvpair
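If you want a script to verify that the module is loaded before going further, a check along these lines may help. This is only a sketch: `module_loaded` is a hypothetical helper, and it assumes the usual /proc/modules layout where the module name is the first field of each line.

```shell
#!/bin/sh
# Sketch: report whether a kernel module appears in a /proc/modules-style file.
# module_loaded is a hypothetical helper; the optional second argument defaults
# to /proc/modules and exists only so the check can be tried on a sample file.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded zfs; then
    echo "zfs module is loaded"
else
    echo "zfs module is not loaded, try: modprobe zfs"
fi
```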
Your first ZFS pool
To start with, four raw disks (2 GB each) are created:
root # for i in 0 1 2 3; do dd if=/dev/zero of=/tmp/zfs-test-disk0${i}.img bs=2G count=1; done
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 40.3722 s, 53.2 MB/s
...
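Note that the dd output above reports a partial record ("0+1") and 2147479552 bytes, slightly less than 2 GiB, and that writing the zeros takes a while. As an alternative sketch (not the method used in this tutorial), sparse image files of exactly 2 GiB can be created almost instantly with truncate from GNU coreutils. The temporary directory below is an assumption made for this example; the tutorial itself places the images directly in /tmp.

```shell
#!/bin/sh
# Sketch: create four sparse 2 GiB image files with truncate (GNU coreutils)
# instead of writing zeros with dd. The directory is a throwaway temp dir
# chosen for this example; the tutorial itself uses /tmp directly.
set -eu
imgdir=$(mktemp -d)

for i in 0 1 2 3; do
    truncate -s 2G "$imgdir/zfs-test-disk0${i}.img"
done

# The apparent size is 2 GiB each, while almost no disk space is allocated yet.
ls -l "$imgdir"
du -sh "$imgdir"
```

Sparse files only consume real disk space as data gets written to them, which is convenient for a test setup like this one.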
Then let's see which loopback devices are in use and which one is the first free:
root # losetup -a
root # losetup -f
/dev/loop0
In the above example nothing is in use and the first available loopback device is /dev/loop0. Now associate each of the disks with a loopback device (/tmp/zfs-test-disk00.img -> /dev/loop0, /tmp/zfs-test-disk01.img -> /dev/loop1 and so on):
root # for i in 0 1 2 3; do losetup /dev/loop${i} /tmp/zfs-test-disk0${i}.img; done
root # losetup -a
/dev/loop0: [000c]:781455 (/tmp/zfs-test-disk00.img)
/dev/loop1: [000c]:806903 (/tmp/zfs-test-disk01.img)
/dev/loop2: [000c]:807274 (/tmp/zfs-test-disk02.img)
/dev/loop3: [000c]:781298 (/tmp/zfs-test-disk03.img)
ZFS literature often names zpools "tank". This is not a requirement: you can use whatever name you choose (as we did here...).
Every story in ZFS starts with the very first ZFS command you will come in touch with: zpool. As you might have guessed, zpool manages all ZFS aspects connected with the physical devices underlying your ZFS storage spaces, and the very first task is to use this command to make what is called a pool (if you have used LVM before, volume groups can be seen as a counterpart). Basically, what you do here is tell ZFS to take a collection of physical storage items, which can take several forms like a hard drive partition, a USB key partition or even a file, and consider all of them as a single pool of storage (we will subdivide it in the following paragraphs). No black magic here: ZFS writes some metadata on them behind the scenes to be able to track which physical device belongs to which pool of storage.
root # zpool create myfirstpool /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
And... nothing! Nada! The command returned silently, but it did do something; the next section explains what.
Your first ZFS dataset
root # zpool list
NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
myfirstpool  7.94G   130K  7.94G     0%  1.00x  ONLINE  -
What does this mean? Several things. First, your zpool is here and has a size of roughly 8 GB, minus some space eaten by metadata. Second, it is actually usable because the HEALTH column says ONLINE. The other columns are not meaningful for us at the moment, just ignore them. If you want more details you can use the zpool command like this:
root # zpool status
  pool: myfirstpool
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        myfirstpool   ONLINE       0     0     0
          loop0       ONLINE       0     0     0
          loop1       ONLINE       0     0     0
          loop2       ONLINE       0     0     0
          loop3       ONLINE       0     0     0
The information is quite intuitive: your pool is seen as usable (state is similar to HEALTH) and is composed of several devices, each one listed as being in a healthy state... at least for now, because they will be deliberately damaged for demonstration purposes in a later section. For your information, the columns READ, WRITE and CKSUM list the number of operation failures on each of the devices, respectively:
- READ for reading failures. Having a non-zero value is not a good sign... the device is clunky and will soon fail.
- WRITE for writing failures. Having a non-zero value is not a good sign... the device is clunky and will soon fail.
- CKSUM for mismatches between the checksum of the data at the time it was written and the checksum recomputed when it is read again (yes, ZFS uses checksums in an aggressive manner). Having a non-zero value is not a good sign... corruption happened. ZFS will do its best to recover the data on its own, but this is definitely not the sign of a healthy system.
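Watching these three counters is easy to automate. The sketch below flags every device whose READ, WRITE or CKSUM counter is not zero. It works on a made-up sample: the layout is copied from the status output shown earlier, but the checksum error count on loop2 is invented for illustration, and `report_errors` is a hypothetical helper. On a live system, the output of `zpool status myfirstpool` could be piped into the function instead.

```shell
#!/bin/sh
# Sketch: flag devices whose READ, WRITE or CKSUM counter is not zero.
# The sample fed to the function is made up for illustration: it reuses the
# layout of the status output above with an invented error count on loop2.
report_errors() {
    awk '
        # The header row marks the start of the device table.
        $1 == "NAME" && $3 == "READ" { in_config = 1; next }
        # A blank line after the table ends it.
        in_config && NF == 0 { in_config = 0 }
        # Device rows have five fields: name, state and the three counters.
        in_config && NF == 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s: read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

report_errors <<'EOF'
  pool: myfirstpool
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        myfirstpool   ONLINE       0     0     0
          loop0       ONLINE       0     0     0
          loop1       ONLINE       0     0     0
          loop2       ONLINE       0     0     7
          loop3       ONLINE       0     0     0
EOF
```

With this sample, only the loop2 line is reported.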
Cool! So far so good: you have a brand new 8 GB of usable storage space on your system. Has it been mounted somewhere?
root # mount | grep myfirstpool
/myfirstpool on /myfirstpool type zfs (rw,xattr)
Remember the tables in the section above? A zpool in itself can never be mounted, never ever. It is just a container where ZFS datasets are created and then mounted. So what happened here? Obscure black magic? No, of course not! A ZFS dataset named after the zpool must have been created automatically for us and then mounted. Is it true? We will check this shortly. For the moment, let us introduce the second command you will deal with when using ZFS: zfs. While the zpool command is used for anything related to zpools, the zfs command is used for anything related to ZFS datasets (a ZFS dataset always resides in a zpool, no exception to that).
zfs and zpool are the only two commands you will need to remember when dealing with ZFS.
So how can we check which ZFS datasets are currently known to the system? As you might have already guessed, like this:
root # zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
myfirstpool   114K  7.81G    30K  /myfirstpool
Tada! The mystery is solved! The zfs command tells us that not only has a ZFS dataset named myfirstpool been created, but it has also been mounted in the system's VFS for us. If you check with the df command, you should also see something like this:
root # df -h
Filesystem      Size  Used Avail Use% Mounted on
(...)
myfirstpool     7.9G     0  7.9G   0% /myfirstpool
The $100 question: "what to do with this brand new ZFS /myfirstpool dataset?". Copy some files onto it of course! We used a Linux kernel source tree but you can of course use whatever you want:
root # cp -a /usr/src/linux-3.13.5-gentoo /myfirstpool
root # ln -s /myfirstpool/linux-3.13.5-gentoo /myfirstpool/linux
root # ls -lR /myfirstpool
/myfirstpool:
total 3
lrwxrwxrwx  1 root root 32 Mar  2 14:02 linux -> /myfirstpool/linux-3.13.5-gentoo
drwxr-xr-x 25 root root 50 Feb 27 20:35 linux-3.13.5-gentoo

/myfirstpool/linux-3.13.5-gentoo:
total 31689
-rw-r--r--   1 root root  18693 Jan 19 21:40 COPYING
-rw-r--r--   1 root root  95579 Jan 19 21:40 CREDITS
drwxr-xr-x 104 root root    250 Feb 26 07:39 Documentation
-rw-r--r--   1 root root   2536 Jan 19 21:40 Kbuild
-rw-r--r--   1 root root    277 Feb 26 07:39 Kconfig
-rw-r--r--   1 root root 268770 Jan 19 21:40 MAINTAINERS
(...)
A ZFS dataset behaves like any other filesystem: you can create regular files, symbolic links, pipes, special devices nodes, etc. Nothing mystic here.
Now that we have some data in the ZFS dataset, let's see what various commands report:
root # df -h
Filesystem      Size  Used Avail Use% Mounted on
(...)
myfirstpool     7.9G  850M  7.0G  11% /myfirstpool
root # zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
myfirstpool   850M  6.98G   850M  /myfirstpool
root # zpool list
NAME          SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
myfirstpool  7.94G   850M  7.11G    10%  1.00x  ONLINE  -
Notice the various sizes reported by the zpool and zfs commands. In this case they agree, but they can differ; this is especially true with zpools configured in RAID-Z.
Unmounting/remounting a ZFS dataset
Only ZFS datasets can be mounted inside your host's VFS, no exception to that! Zpools cannot be mounted, never, never, never... please pay attention to the terminology and keep things clear by not mixing up the terms. We will introduce ZFS snapshots and ZFS clones later, but those are basically ZFS datasets, so they can also be mounted and unmounted.
If a ZFS dataset behaves just like any other filesystem, can we unmount it?
root # umount /myfirstpool
root # mount | grep myfirstpool
No more /myfirstpool in the line of sight! So yes, it is possible to unmount a ZFS dataset just like you would do with any other filesystem. Is the ZFS dataset still present on the system even though it is unmounted? Let's check:
root # zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
myfirstpool   850M  6.98G   850M  /myfirstpool
Fortunately, and obviously, it is; otherwise ZFS would not be very useful. Your next concern will certainly be: "How can we remount it then?" Simple! Like this:
root # zfs mount myfirstpool
root # mount | grep myfirstpool
myfirstpool on /myfirstpool type zfs (rw,xattr)
The ZFS dataset is back! :-)
Your first contact with ZFS management by attributes or the end of /etc/fstab
At this point you might be curious about how the zfs command knows what it has to mount and where it has to mount it. You might be familiar with the following syntax of the mount command which, behind the scenes, scans the file /etc/fstab and mounts the specified entry:
root # mount /boot
Does /etc/fstab contain something related to our ZFS dataset?
root # cat /etc/fstab | grep myfirstpool
root #
Doh!!!... Obviously nothing there. Another mystery? Surely not! The answer lies in an extremely powerful feature of ZFS: attributes. Simply put, an attribute is a named property of a ZFS dataset that holds a value. Attributes govern various aspects of how datasets are managed, like: "Does the data have to be compressed?", "Does the data have to be encrypted?", "Does the data have to be exposed to the rest of the world via NFS or SMB/Samba?" and of course... "Where does the dataset have to be mounted?". The answer to that last question is given by the following command:
root # zfs get mountpoint myfirstpool
NAME         PROPERTY    VALUE         SOURCE
myfirstpool  mountpoint  /myfirstpool  default
Bingo! When you remounted the dataset a few paragraphs ago, ZFS automatically inspected the mountpoint attribute and saw that this dataset has to be mounted in the directory /myfirstpool.
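When an attribute value is needed inside a script, the tabular output above is not very convenient. As a small sketch, `zfs get` can be asked for the bare value with -H (no header line) and -o value (print only the VALUE column). `dataset_property` is a hypothetical wrapper written for this example, and the pool name is the one used throughout this tutorial.

```shell
#!/bin/sh
# Sketch: read a single property value in script-friendly form.
# -H drops the header line and -o value prints only the VALUE column.
# dataset_property is a hypothetical wrapper written for this example.
dataset_property() {
    zfs get -H -o value "$1" "$2"
}

# Only query the pool when the zfs command is actually available.
if command -v zfs >/dev/null 2>&1; then
    dataset_property mountpoint myfirstpool ||
        echo "could not query myfirstpool" >&2
fi
```

On the system used in this tutorial, the call would be expected to print the same /myfirstpool value as the VALUE column above.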
A step forward with ZFS datasets
So far you have been given a quick tour of what ZFS can do for you. It is very important at this point to distinguish a zpool from a ZFS dataset and to call a dataset what it is (a dataset) and not what it is not (a zpool). It is a bit confusing, and choosing a confusing name was an editorial choice made just to get you familiar with both.
Creating datasets
Obviously it is possible to have more than one ZFS dataset within a single zpool. Quiz: which command would you use to subdivide a zpool into datasets, zfs or zpool? Stop reading for two seconds and try to figure out this little question. Seriously.
The answer is... zfs! Although you want to operate on the zpool to logically subdivide it into several datasets, in the end you manage datasets, thus you use the zfs command. It is not always easy at the beginning, but do not worry too much: you will soon get into the habit of knowing when to use one or the other. Creating a dataset in a zpool is easy: just give the zfs command the name of the pool you want to divide and the name of the dataset you want to create in it. So let's create three datasets named myfirstDS, mysecondDS and mythirdDS in myfirstpool (observe how we use the zpool's and datasets' names):
root # zfs create myfirstpool/myfirstDS
root # zfs create myfirstpool/mysecondDS
root # zfs create myfirstpool/mythirdDS
What happened? Let's check:
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool              850M  6.98G   850M  /myfirstpool
myfirstpool/myfirstDS     30K  6.98G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS    30K  6.98G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS     30K  6.98G    30K  /myfirstpool/mythirdDS
Obviously we got what we asked for. Moreover, if we inspect the contents of /myfirstpool we notice three new directories with the same names as the datasets just created:
root # ls -l /myfirstpool
total 8
lrwxrwxrwx  1 root root 32 Mar  2 14:02 linux -> /myfirstpool/linux-3.13.5-gentoo
drwxr-xr-x 25 root root 50 Feb 27 20:35 linux-3.13.5-gentoo
drwxr-xr-x  2 root root  2 Mar  2 15:26 myfirstDS
drwxr-xr-x  2 root root  2 Mar  2 15:26 mysecondDS
drwxr-xr-x  2 root root  2 Mar  2 15:26 mythirdDS
No surprise here! As you might have guessed, those three new directories serve as mountpoints:
root # mount | grep myfirstpool
myfirstpool on /myfirstpool type zfs (rw,xattr)
myfirstpool/myfirstDS on /myfirstpool/myfirstDS type zfs (rw,xattr)
myfirstpool/mysecondDS on /myfirstpool/mysecondDS type zfs (rw,xattr)
myfirstpool/mythirdDS on /myfirstpool/mythirdDS type zfs (rw,xattr)
As we did before, we can copy some files in the newly created datasets just like they were regular directories:
root # cp -a /usr/portage /myfirstpool/mythirdDS
root # ls -l /myfirstpool/mythirdDS/*
total 697
drwxr-xr-x  48 root root  49 Aug 18  2013 app-accessibility
drwxr-xr-x 238 root root 239 Jan 10 06:22 app-admin
drwxr-xr-x   4 root root   5 Dec 28 08:54 app-antivirus
drwxr-xr-x 100 root root 101 Feb 26 07:19 app-arch
drwxr-xr-x  42 root root  43 Nov 26 21:24 app-backup
drwxr-xr-x  34 root root  35 Aug 18  2013 app-benchmarks
drwxr-xr-x  66 root root  67 Oct 16 06:39 app-cdr
(...)
Nothing really too exciting here: we have files in mythirdDS. A bit more interesting output:
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool             1.81G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS     30K  6.00G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS    30K  6.00G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS   1002M  6.00G  1002M  /myfirstpool/mythirdDS
root # df -h
Filesystem              Size  Used Avail Use% Mounted on
(...)
myfirstpool             6.9G  850M  6.1G  13% /myfirstpool
myfirstpool/myfirstDS   6.1G     0  6.1G   0% /myfirstpool/myfirstDS
myfirstpool/mysecondDS  6.1G     0  6.1G   0% /myfirstpool/mysecondDS
myfirstpool/mythirdDS   7.0G 1002M  6.1G  15% /myfirstpool/mythirdDS
Noticed the size given in the 'AVAIL' column? At the very beginning of this tutorial we had slightly less than 8 GB of available space; it now has a value of roughly 6 GB. The datasets are just subdivisions of the zpool: they compete with each other for the available storage within the zpool, no miracle here. Up to what limit? The pool itself, as we never imposed a quota on the datasets. Fortunately, df and zfs list give coherent results.
Second contact with attributes: quota management
Remember how painful quota management is under Linux? Now you can say goodbye to setquota, edquota and other quotacheck commands: ZFS handles this in a snap of the fingers! Guess with what? A ZFS dataset attribute of course! ;-) Just to make you drool, here is how a 2 GB limit can be set on myfirstpool/mythirdDS:
root # zfs set quota=2G myfirstpool/mythirdDS
Et voila! The zfs command is a bit silent; however, if we check, we can see that myfirstpool/mythirdDS is now capped at 2 GB (forget about 'REFER' for the moment): around 1 GB of data has been copied into this dataset, thus leaving about 1 GB of available space.
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool             1.81G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS     30K  6.00G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS    30K  6.00G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS   1002M  1.02G  1002M  /myfirstpool/mythirdDS
Using the df command:
root # df -h
Filesystem              Size  Used Avail Use% Mounted on
(...)
myfirstpool             6.9G  850M  6.1G  13% /myfirstpool
myfirstpool/myfirstDS   6.1G     0  6.1G   0% /myfirstpool/myfirstDS
myfirstpool/mysecondDS  6.1G     0  6.1G   0% /myfirstpool/mysecondDS
myfirstpool/mythirdDS   2.0G 1002M  1.1G  49% /myfirstpool/mythirdDS
Of course you can use this technique for the home directories of your users under /home. It also has the advantage of being much less forgiving than a soft/hard user quota: when the limit is reached, it is reached, period, and no more data can be written to the dataset. The user must do some cleanup and cannot procrastinate anymore :-)
To remove the quota:
root # zfs set quota=none myfirstpool/mythirdDS
none is simply the original value for the quota attribute (we did not demonstrate it, you can check by doing a zfs get quota myfirstpool/mysecondDS for example).
Destroying datasets
There is no way to resurrect a destroyed ZFS dataset and the data it contained! Once you destroy a dataset the corresponding metadata is cleared and gone forever so be careful when using zfs destroy notably with the -r option ...
We have three datasets, but the third is pretty useless and contains a lot of garbage. Is it possible to remove it with a simple rm -rf? Let's try:
root # rm -rf /myfirstpool/mythirdDS
rm: cannot remove `/myfirstpool/mythirdDS': Device or resource busy
This is perfectly normal: remember that datasets are something mounted in your VFS. ZFS might be ZFS and do a lot for you, but it cannot change the nature of a mounted filesystem under Linux/Unix. The "ZFS way" to remove a dataset is to use the zfs command like this, provided that no process has open files on it (once again, ZFS can do miracles for you but not that kind of miracle, as it has to unmount the dataset before deleting it):
root # zfs destroy myfirstpool/mythirdDS
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool              444M  7.38G   444M  /myfirstpool
myfirstpool/myfirstDS     21K  7.38G    21K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS    21K  7.38G    21K  /myfirstpool/mysecondDS
Et voila! No more myfirstpool/mythirdDS dataset. :-)
A slightly more subtle case is trying to destroy a ZFS dataset when another ZFS dataset is nested in it. Before doing that nasty experiment, myfirstpool/mythirdDS must be created again, this time with a nested dataset (myfirstpool/mythirdDS/nestedDS1):
root # zfs create myfirstpool/mythirdDS
root # zfs create myfirstpool/mythirdDS/nestedDS1
root # zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
myfirstpool                       851M  6.98G   850M  /myfirstpool
myfirstpool/myfirstDS              30K  6.98G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS             30K  6.98G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS             124K  6.98G    34K  /myfirstpool/mythirdDS
myfirstpool/mythirdDS/nestedDS1    30K  6.98G    30K  /myfirstpool/mythirdDS/nestedDS1
Now let's try to destroy myfirstpool/mythirdDS again:
root # zfs destroy myfirstpool/mythirdDS
cannot destroy 'myfirstpool/mythirdDS': filesystem has children
use '-r' to destroy the following datasets:
myfirstpool/mythirdDS/nestedDS1
The zfs command detected the situation and refused to proceed with the deletion without your consent to a recursive destruction (-r parameter). Before going any further, let's create some more nested datasets plus a couple of directories inside myfirstpool/mythirdDS (nestedDS1 already exists, so it is not created again):
root # zfs create myfirstpool/mythirdDS/nestedDS2
root # zfs create myfirstpool/mythirdDS/nestedDS3
root # zfs create myfirstpool/mythirdDS/nestedDS3/nestednestedDS
root # mkdir /myfirstpool/mythirdDS/dir1
root # mkdir /myfirstpool/mythirdDS/dir2
root # mkdir /myfirstpool/mythirdDS/dir3
root # zfs list
NAME                                             USED  AVAIL  REFER  MOUNTPOINT
myfirstpool                                      851M  6.98G   850M  /myfirstpool
myfirstpool/myfirstDS                             30K  6.98G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS                            30K  6.98G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS                            157K  6.98G    37K  /myfirstpool/mythirdDS
myfirstpool/mythirdDS/nestedDS1                   30K  6.98G    30K  /myfirstpool/mythirdDS/nestedDS1
myfirstpool/mythirdDS/nestedDS2                   30K  6.98G    30K  /myfirstpool/mythirdDS/nestedDS2
myfirstpool/mythirdDS/nestedDS3                   60K  6.98G    30K  /myfirstpool/mythirdDS/nestedDS3
myfirstpool/mythirdDS/nestedDS3/nestednestedDS    30K  6.98G    30K  /myfirstpool/mythirdDS/nestedDS3/nestednestedDS
Now what happens if myfirstpool/mythirdDS is destroyed again with '-r'?
root # zfs destroy -r myfirstpool/mythirdDS
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool              851M  6.98G   850M  /myfirstpool
myfirstpool/myfirstDS     30K  6.98G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS    30K  6.98G    30K  /myfirstpool/mysecondDS
myfirstpool/mythirdDS and everything it contained is now gone!
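Since a recursive destroy cannot be undone, it can be worth listing what sits below a dataset before running it. The following sketch filters a list of dataset names by prefix; `descendants_of` is a hypothetical helper, and the names fed to it are those from the listing shown before the destruction. On a live system the names could come from `zfs list -H -o name` instead.

```shell
#!/bin/sh
# Sketch: given dataset names on stdin, print the one named in $1 plus
# everything nested below it, which is what a recursive destroy would cover.
# descendants_of is a hypothetical helper for illustration only.
descendants_of() {
    awk -v top="$1" '$0 == top || index($0, top "/") == 1'
}

descendants_of myfirstpool/mythirdDS <<'EOF'
myfirstpool
myfirstpool/myfirstDS
myfirstpool/mysecondDS
myfirstpool/mythirdDS
myfirstpool/mythirdDS/nestedDS1
myfirstpool/mythirdDS/nestedDS2
myfirstpool/mythirdDS/nestedDS3
myfirstpool/mythirdDS/nestedDS3/nestednestedDS
EOF
```

Matching on the name followed by a slash avoids catching unrelated datasets that merely share a prefix. Note that plain directories such as dir1, dir2 and dir3 are not datasets and therefore never appear in such a list, although they disappear together with the dataset that contains them.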
Snapshotting and rolling back datasets
This is, by far, one of the coolest features of ZFS. You can:
- take a photo of a dataset (this photo is called a snapshot)
- do whatever you want with the data contained in the dataset
- restore (roll back) the dataset to the exact same state it was in before you made your changes, just as if nothing had ever happened in the meantime.
Single snapshot
Only ZFS datasets can be snapshotted and rolled back, not the zpool.
To start with, let's copy some files in mysecondDS:
root # cp -a /usr/portage /myfirstpool/mysecondDS root # ls /myfirstpool/mysecondDS/portage total 672 drwxr-xr-x 48 root root 49 Aug 18 2013 app-accessibility drwxr-xr-x 238 root root 239 Jan 10 06:22 app-admin drwxr-xr-x 4 root root 5 Dec 28 08:54 app-antivirus drwxr-xr-x 100 root root 101 Feb 26 07:19 app-arch drwxr-xr-x 42 root root 43 Nov 26 21:24 app-backup drwxr-xr-x 34 root root 35 Aug 18 2013 app-benchmarks (...) drwxr-xr-x 62 root root 63 Feb 20 06:47 x11-wm drwxr-xr-x 16 root root 17 Aug 18 2013 xfce-base drwxr-xr-x 64 root root 65 Dec 14 19:09 xfce-extra
Now, let's take a snapshot of mysecondDS. Which command should be used, zpool or zfs? In this case it is zfs because we are manipulating a ZFS dataset (this time you probably got it right!):
root # zfs snapshot myfirstpool/mysecondDS@Charlie
The syntax is always pool/dataset@snapshot. The snapshot's name is left to your discretion, but you must use an at sign (@) to separate it from the rest of the path.
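As a minimal sketch of the naming rule above, the helper below assembles a pool/dataset@snapshot name and rejects labels that would break it. The function name snapshot_name and the date-based label in the commented usage are illustrative assumptions, not something ZFS provides; the only ZFS rule relied on is that the snapshot part may contain neither '@' nor '/'.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): build a "pool/dataset@snapshot" name.
# Usage: snapshot_name <pool/dataset> <label>
snapshot_name() {
    dataset=$1
    label=$2
    case $label in
        ''|*@*|*/*)
            # An empty label, or one containing '@' or '/', cannot be a snapshot name.
            echo "invalid snapshot label: '$label'" >&2
            return 1
            ;;
    esac
    printf '%s@%s\n' "$dataset" "$label"
}

# Example (needs ZFS and root privileges, so it is left commented out):
#   zfs snapshot "$(snapshot_name myfirstpool/mysecondDS "before-change-$(date +%Y%m%d)")"
```

Keeping the name construction in one place makes it harder to mistype the separator when snapshots are taken from scripts.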
Let's check what /myfirstpool/mysecondDS contains after taking the snapshot:
root # ls -la /myfirstpool/mysecondDS total 9 drwxr-xr-x 3 root root 3 Mar 2 18:22 . drwxr-xr-x 5 root root 6 Mar 2 17:58 .. drwx------ 170 root root 171 Mar 2 18:36 portage
Nothing really new: the portage directory is there and, a priori, nothing more. If you used BTRFS before reading this tutorial you probably expected to see a @Charlie lying in /myfirstpool/mysecondDS. So where the heck is Charlie? In ZFS a dataset snapshot is not visible from within the VFS tree (if you are not convinced you can search for it with the find command, but it will never find it). Let's check with the zfs command:
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool             1.81G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS     30K  6.00G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS  1001M  6.00G  1001M  /myfirstpool/mysecondDS
Wow... No sign of the snapshot. What you must know is that zfs list shows only datasets by default and omits snapshots. If the command is invoked with the parameter -t set to all, it lists everything:
root # zfs list -t all
NAME                             USED  AVAIL  REFER  MOUNTPOINT
myfirstpool                     1.81G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS             30K  6.00G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS          1001M  6.00G  1001M  /myfirstpool/mysecondDS
myfirstpool/mysecondDS@Charlie      0      -  1001M  -
So yes, @Charlie is here! Also notice the power of copy-on-write filesystems: like any ZFS snapshot at the time it is taken, @Charlie takes almost no space (at most a few kilobytes of ZFS metadata). Snapshots occupy so little space because they share their data and metadata blocks with the dataset; no physical copy is made. As time goes on and more and more changes happen in the original dataset (myfirstpool/mysecondDS here), ZFS allocates new data and metadata blocks to accommodate the changes but leaves the blocks used by the snapshot untouched, so the snapshot tends to eat more and more pool space. It seems odd at first glance because a snapshot is a frozen-in-time copy of a ZFS dataset, but this is the way ZFS manages them. So caveat emptor: remove any unused snapshots so that they do not fill your zpool...
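Since forgotten snapshots can end up filling a pool, it may help to spot the hungriest ones. The filter below is only a sketch: big_snapshots is a hypothetical helper, and the commented usage assumes a zfs release whose list subcommand accepts -H (no header, tab-separated fields) and -p (exact byte values).

```shell
#!/bin/sh
# Hypothetical filter: reads "name<TAB>used-in-bytes" lines on stdin and prints
# only the snapshots whose USED value is at least the given number of bytes.
# Usage: big_snapshots <minimum-bytes>
big_snapshots() {
    awk -F '\t' -v min="$1" '$2 + 0 >= min + 0 { print $1 "\t" $2 }'
}

# Example (needs ZFS, so it is left commented out); 104857600 bytes = 100 MiB:
#   zfs list -Hp -t snapshot -o name,used -r myfirstpool | big_snapshots 104857600
```

Reading from stdin keeps the filter independent of how the listing was produced, so it can also be fed from a saved report.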
Now that we have found Charlie, let's make some changes in mysecondDS:
root # rm -rf /myfirstpool/mysecondDS/portage/[a-h]* root # echo "Hello, world" > /myfirstpool/mysecondDS/hello.txt root # cp /lib/firmware/radeon/* /myfirstpool/mysecondDS root # ls -l /myfirstpool/mysecondDS /myfirstpool/mysecondDS: total 3043 -rw-r--r-- 1 root root 8704 Mar 2 19:29 ARUBA_me.bin -rw-r--r-- 1 root root 8704 Mar 2 19:29 ARUBA_pfp.bin -rw-r--r-- 1 root root 6144 Mar 2 19:29 ARUBA_rlc.bin -rw-r--r-- 1 root root 24096 Mar 2 19:29 BARTS_mc.bin -rw-r--r-- 1 root root 5504 Mar 2 19:29 BARTS_me.bin (...) -rw-r--r-- 1 root root 60388 Mar 2 19:29 VERDE_smc.bin -rw-r--r-- 1 root root 13 Mar 2 19:28 hello.txt drwx------ 94 root root 95 Mar 2 19:28 portage /myfirstpool/mysecondDS/portage: total 324 drwxr-xr-x 16 root root 17 Oct 26 07:30 java-virtuals drwxr-xr-x 303 root root 304 Jan 21 06:53 kde-base drwxr-xr-x 117 root root 118 Feb 21 06:24 kde-misc drwxr-xr-x 2 root root 756 Feb 23 08:44 licenses drwxr-xr-x 20 root root 21 Jan 7 06:56 lxde-base (...)
Now let's check again what the zfs command gives:
root # zfs list -t all
NAME                             USED  AVAIL  REFER  MOUNTPOINT
myfirstpool                     1.82G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS             30K  6.00G    30K  /myfirstpool/myfirstDS
myfirstpool/mysecondDS          1005M  6.00G   903M  /myfirstpool/mysecondDS
myfirstpool/mysecondDS@Charlie   102M      -  1001M  -
Notice the size increase of myfirstpool/mysecondDS@Charlie? This is mainly due to the files we deleted from the live dataset: ZFS had to retain their original blocks of data because the snapshot still references them. Now it is time to roll this ZFS dataset back to its original state (if some processes have open files in the dataset to be rolled back, you should terminate them first):
root # zfs rollback myfirstpool/mysecondDS@Charlie root # ls -l /myfirstpool/mysecondDS total 6 drwxr-xr-x 164 root root 169 Aug 18 18:25 portage
Again, ZFS handled everything for you and you now have the contents of mysecondDS exactly as they were at the time the snapshot Charlie was taken. It is no more complicated than that. It is not illustrated here, but if you look at the output given by zfs list -t all at this point you will notice that the Charlie snapshot eats very little space. This is normal: the modified blocks have been dropped, so myfirstpool/mysecondDS and its myfirstpool/mysecondDS@Charlie snapshot are the same modulo some metadata (hence the few kilobytes of space taken).
The .zfs pseudo-directory or the secret passage to your snapshots
Any directory where a ZFS dataset is mounted (whether it has snapshots or not) secretly contains a pseudo-directory named .zfs (dot-ZFS), and you will not see it even with the option -a given to an ls command unless you name it explicitly. This contradicts the philosophy of Unix and Unix-like systems of not hiding anything from the system administrator. It is not a bug of the ZFS On Linux implementation: the Solaris implementation of ZFS exhibits the exact same behavior. So what is inside this little magic box?
root # cd /myfirstpool/mysecondDS root # ls -la | grep .zfs root # ls -lad .zfs dr-xr-xr-x 1 root root 0 Mar 2 15:26 .zfs
root # cd .zfs root # pwd /myfirstpool/mysecondDS/.zfs root # ls -la total 4 dr-xr-xr-x 1 root root 0 Mar 2 15:26 . drwxr-xr-x 3 root root 145 Mar 2 19:29 .. dr-xr-xr-x 2 root root 2 Mar 2 19:47 shares dr-xr-xr-x 2 root root 2 Mar 2 18:46 snapshot
We will focus on the snapshot directory, and since we have not dropped the Charlie snapshot (yet) let's see what lies there:
root # cd snapshot root # ls -l total 0 dr-xr-xr-x 1 root root 0 Mar 2 20:16 Charlie
Yes, we found Charlie here (also!). The snapshot is seen as a regular directory, but pay attention to its permissions:
- owning user (root) has read+execute
- owning group (root) has read+execute
- rest of the world has read+execute
Did you notice? There is not a single write permission on this directory; the only thing any user can do is enter the directory and list its contents. This is not a bug but the nature of ZFS snapshots: they are fundamentally read-only. The next question is naturally: can we change something in it? For that we have to enter the Charlie directory:
root # cd Charlie root # ls -la total 7 drwxr-xr-x 3 root root 3 Mar 2 18:22 . dr-xr-xr-x 3 root root 3 Mar 2 18:46 .. drwx------ 170 root root 171 Mar 2 18:36 portage
No surprise here: at the time we took the snapshot, myfirstpool/mysecondDS held a copy of the portage tree stored in a directory named portage. At first glance this one seems to be writable for the root user, so let's try to create a file in it:
root # cd portage root # touch test touch: cannot touch ‘test’: Read-only file system
Things are a bit tricky here: indeed nothing has been mounted (check with the mount command!), we are walking through a pseudo-directory exposed by ZFS that holds the Charlie snapshot. It is a pseudo-directory because .zfs has no physical existence, even in the ZFS metadata as it exists in the zpool. It is just a convenient way provided by the ZFS kernel modules to walk inside the various snapshots' content. You can see but you cannot touch :-)
Backtracking changes between a dataset and its snapshot
Is it possible to know the difference between a live dataset and its snapshot? The answer is yes, and the zfs command will help us in this task. Now that we have rolled the myfirstpool/mysecondDS ZFS dataset back to its original state, we have to botch it again:
root # cp -a /lib/firmware/radeon/C* /myfirstpool/mysecondDS
Now inspect the difference between the live ZFS dataset myfirstpool/mysecondDS and its snapshot Charlie. This is done via zfs diff, giving only the snapshot's name (you can also inspect the difference between two snapshots with that command, with a slight change in parameters):
root # zfs diff myfirstpool/mysecondDS@Charlie
M /myfirstpool/mysecondDS/
+ /myfirstpool/mysecondDS/CAICOS_mc.bin
+ /myfirstpool/mysecondDS/CAICOS_me.bin
+ /myfirstpool/mysecondDS/CAICOS_pfp.bin
+ /myfirstpool/mysecondDS/CAICOS_smc.bin
+ /myfirstpool/mysecondDS/CAYMAN_mc.bin
+ /myfirstpool/mysecondDS/CAYMAN_me.bin
(...)
So what do we have here? Two things: first, it shows that we have changed something in /myfirstpool/mysecondDS (notice the 'M' for Modified); second, it shows the addition of several files (CAICOS_mc.bin, CAICOS_me.bin, CAICOS_pfp.bin...) by putting a plus sign ('+') on their left.
If we botch myfirstpool/mysecondDS a bit more by removing the file /myfirstpool/mysecondDS/portage/sys-libs/glibc/Manifest:
root # rm /myfirstpool/mysecondDS/portage/sys-libs/glibc/Manifest root # zfs diff myfirstpool/mysecondDS@Charlie M /myfirstpool/mysecondDS/ M /myfirstpool/mysecondDS/portage/sys-libs/glibc - /myfirstpool/mysecondDS/portage/sys-libs/glibc/Manifest + /myfirstpool/mysecondDS/CAICOS_mc.bin + /myfirstpool/mysecondDS/CAICOS_me.bin + /myfirstpool/mysecondDS/CAICOS_pfp.bin + /myfirstpool/mysecondDS/CAICOS_smc.bin + /myfirstpool/mysecondDS/CAYMAN_mc.bin + /myfirstpool/mysecondDS/CAYMAN_me.bin (...)
Obviously deleted content is marked by a minus sign ('-').
Now a real butchery:
root # rm -rf /myfirstpool/mysecondDS/portage/sys-devel/gcc
root # zfs diff myfirstpool/mysecondDS@Charlie
M /myfirstpool/mysecondDS/
M /myfirstpool/mysecondDS/portage/sys-devel
- /myfirstpool/mysecondDS/portage/sys-devel/gcc
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk/fixlafiles.awk
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk/fixlafiles.awk-no_gcc_la
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/c89
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/c99
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-4.6.4-fix-libgcc-s-path-with-vsrl.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-spec-env.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-spec-env-r1.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-4.8.2-fix-cache-detection.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/fix_libtool_files.sh
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-configure-texinfo.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/gcc-4.8.1-bogus-error-with-int.patch
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.3.3-r2.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/metadata.xml
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.6.4-r2.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.6.4.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.1-r1.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.1-r2.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.6.2-r1.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.1-r3.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.2.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.1-r4.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/Manifest
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.7.3-r1.ebuild
- /myfirstpool/mysecondDS/portage/sys-devel/gcc/gcc-4.8.2-r1.ebuild
M /myfirstpool/mysecondDS/portage/sys-libs/glibc
- /myfirstpool/mysecondDS/portage/sys-libs/glibc/Manifest
+ /myfirstpool/mysecondDS/CAICOS_mc.bin
+ /myfirstpool/mysecondDS/CAICOS_me.bin
+ /myfirstpool/mysecondDS/CAICOS_pfp.bin
+ /myfirstpool/mysecondDS/CAICOS_smc.bin
+ /myfirstpool/mysecondDS/CAYMAN_mc.bin
+ /myfirstpool/mysecondDS/CAYMAN_me.bin
(...)
No need to explain that digital mayhem! What happens if, in addition, we change the contents of the file /myfirstpool/mysecondDS/portage/sys-devel/autoconf/Manifest?
root # zfs diff myfirstpool/mysecondDS@Charlie M /myfirstpool/mysecondDS/ M /myfirstpool/mysecondDS/portage/sys-devel M /myfirstpool/mysecondDS/portage/sys-devel/autoconf/Manifest - /myfirstpool/mysecondDS/portage/sys-devel/gcc - /myfirstpool/mysecondDS/portage/sys-devel/gcc/files - /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk - /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk/fixlafiles.awk - /myfirstpool/mysecondDS/portage/sys-devel/gcc/files/awk/fixlafiles.awk-no_gcc_la (...)
ZFS shows that the file /myfirstpool/mysecondDS/portage/sys-devel/autoconf/Manifest has changed. So ZFS can help track file deletions, creations and modifications. What it does not show is how a file's content differs between the live dataset and this dataset's snapshot. Not a big issue! You can explore a snapshot's content via the .zfs pseudo-directory and use a command like /usr/bin/diff to examine the difference with the file as it exists on the corresponding live dataset.
root # diff -u /myfirstpool/mysecondDS/.zfs/snapshot/Charlie/portage/sys-devel/autoconf/Manifest /myfirstpool/mysecondDS/portage/sys-devel/autoconf/Manifest --- /myfirstpool/mysecondDS/.zfs/snapshot/Charlie/portage/sys-devel/autoconf/Manifest 2013-08-18 08:52:01.742411902 -0400 +++ /myfirstpool/mysecondDS/portage/sys-devel/autoconf/Manifest 2014-03-02 21:36:50.582258990 -0500 @@ -4,7 +4,4 @@ DIST autoconf-2.62.tar.gz 1518427 SHA256 83aa747e6443def0ebd1882509c53f5a2133f50... DIST autoconf-2.63.tar.gz 1562665 SHA256 b05a6cee81657dd2db86194a6232b895b8b2606a... DIST autoconf-2.64.tar.bz2 1313833 SHA256 872f4cadf12e7e7c8a2414e047fdff26b517c7... -DIST autoconf-2.65.tar.bz2 1332522 SHA256 db11944057f3faf229ff5d6ce3fcd819f56545... -DIST autoconf-2.67.tar.bz2 1369605 SHA256 00ded92074999d26a7137d15bd1d51b8a8ae23... -DIST autoconf-2.68.tar.bz2 1381988 SHA256 c491fb273fd6d4ca925e26ceed3d177920233c... DIST autoconf-2.69.tar.xz 1214744 SHA256 64ebcec9f8ac5b2487125a86a7760d2591ac9e1d3... (...)
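The manual comparison above can be wrapped in a couple of helpers. This is a sketch under assumptions: the function names snapshot_file and diff_against_snapshot are made up for illustration, and the dataset is assumed to be mounted at the given mountpoint with its .zfs pseudo-directory reachable.

```shell
#!/bin/sh
# Hypothetical helper: print the path of a file as it exists inside a snapshot.
# Usage: snapshot_file <dataset-mountpoint> <snapshot-name> <relative-path>
snapshot_file() {
    # ${1%/} drops a trailing slash, ${3#/} drops a leading one.
    printf '%s/.zfs/snapshot/%s/%s\n' "${1%/}" "$2" "${3#/}"
}

# Hypothetical helper: unified diff between the snapshot copy and the live copy.
# Returns the exit status of diff (0 when both copies are identical).
diff_against_snapshot() {
    diff -u "$(snapshot_file "$1" "$2" "$3")" "${1%/}/${3#/}"
}

# Example (needs the dataset from this tutorial, so it is left commented out):
#   diff_against_snapshot /myfirstpool/mysecondDS Charlie portage/sys-devel/autoconf/Manifest
```

Combined with the paths reported by zfs diff, this gives a quick way to review what actually changed inside each modified file.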
Dropping a snapshot
A snapshot is no more than a dataset frozen in time and thus can be destroyed in the exact same way seen in the previous paragraphs. Now that we no longer need the Charlie snapshot, we can remove it. Simple:
root # zfs destroy myfirstpool/mysecondDS@Charlie root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.71G 6.10G 850M /myfirstpool myfirstpool/myfirstDS 30K 6.10G 30K /myfirstpool/myfirstDS myfirstpool/mysecondDS 903M 6.10G 903M /myfirstpool/mysecondDS
And Charlie is gone forever ;-)
The time travelling machine part 1: examining differences between snapshots
So far we only used a single snapshot just to keep things simple. However a dataset can hold several snapshots and you can do everything seen so far with them like rolling back, destroying them or examining the difference not only between a snapshot and its corresponding live dataset but also between two snapshots. For this part we will consider the myfirstpool/myfirstDS dataset which should be empty at this point.
root # ls -la /myfirstpool/myfirstDS total 3 drwxr-xr-x 2 root root 2 Mar 2 21:14 . drwxr-xr-x 5 root root 6 Mar 2 17:58 ..
Now let's generate some contents, take a snapshot (snapshot-1), add more content, take a snapshot again (snapshot-2), do some modifications again and take a third snapshot (snapshot-3):
root # echo "Hello, world" > /myfirstpool/myfirstDS/hello.txt root # cp -R /lib/firmware/radeon /myfirstpool/myfirstDS root # ls -l /myfirstpool/myfirstDS total 5 -rw-r--r-- 1 root root 13 Mar 3 06:41 hello.txt drwxr-xr-x 2 root root 143 Mar 3 06:42 radeon root # zfs snapshot myfirstpool/myfirstDS@snapshot-1
root # echo "Goodbye, world" > /myfirstpool/myfirstDS/goodbye.txt root # echo "Are you there?" >> /myfirstpool/myfirstDS/hello.txt root # cp /proc/config.gz /myfirstpool/myfirstDS root # rm /myfirstpool/myfirstDS/radeon/CAYMAN_me.bin root # zfs snapshot myfirstpool/myfirstDS@snapshot-2
root # echo "Still there?" >> /myfirstpool/myfirstDS/goodbye.txt root # mv /myfirstpool/myfirstDS/hello.txt /myfirstpool/myfirstDS/hello_new.txt root # cat /proc/version > /myfirstpool/myfirstDS/version.txt root # zfs snapshot myfirstpool/myfirstDS@snapshot-3
root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.81G 6.00G 850M /myfirstpool myfirstpool/myfirstDS 3.04M 6.00G 2.97M /myfirstpool/myfirstDS myfirstpool/myfirstDS@snapshot-1 47K - 2.96M - myfirstpool/myfirstDS@snapshot-2 30K - 2.97M - myfirstpool/myfirstDS@snapshot-3 0 - 2.97M - myfirstpool/mysecondDS 1003M 6.00G 1003M /myfirstpool/mysecondDS
You saw how to use zfs diff to compare a snapshot with its corresponding "live" dataset in the above paragraphs. Doing the same exercise with two snapshots is not much different: you just have to tell the command explicitly which snapshots are to be compared, and it will output the result in the exact same manner. So what are the differences between snapshots myfirstpool/myfirstDS@snapshot-1 and myfirstpool/myfirstDS@snapshot-2? Let's make the zfs command work for us:
root # zfs diff myfirstpool/myfirstDS@snapshot-1 myfirstpool/myfirstDS@snapshot-2 M /myfirstpool/myfirstDS/ M /myfirstpool/myfirstDS/hello.txt M /myfirstpool/myfirstDS/radeon - /myfirstpool/myfirstDS/radeon/CAYMAN_me.bin + /myfirstpool/myfirstDS/goodbye.txt + /myfirstpool/myfirstDS/config.gz
Before digging further, let's think about what we did between the time we created the first snapshot and the second snapshot:
- We modified the file /myfirstpool/myfirstDS/hello.txt hence the 'M' shown on left of the second line (thus we changed something under /myfirstpool/myfirstDS hence a 'M' is also shown on the left of the first line)
- We deleted the file /myfirstpool/myfirstDS/radeon/CAYMAN_me.bin hence the minus sign ('-') shown on the left of the fourth line (and the 'M' shown on left of the third line)
- We added two files which were /myfirstpool/myfirstDS/goodbye.txt and /myfirstpool/myfirstDS/config.gz hence the plus sign ('+') shown on the left of the fifth and sixth lines (also this is a change happening in /myfirstpool/myfirstDS hence another reason to show a 'M' on the left of the first line)
Now same exercise this time with snapshots myfirstpool/myfirstDS@snapshot-2 and myfirstpool/myfirstDS@snapshot-3:
root # zfs diff myfirstpool/myfirstDS@snapshot-2 myfirstpool/myfirstDS@snapshot-3 M /myfirstpool/myfirstDS/ R /myfirstpool/myfirstDS/hello.txt -> /myfirstpool/myfirstDS/hello_new.txt M /myfirstpool/myfirstDS/goodbye.txt + /myfirstpool/myfirstDS/version.txt
Try to interpret what you see, except for the second line where an "R" (standing for "Rename") is shown. ZFS is smart enough to show both the old and the new names!
Why not push the limits and try a few fancy things? First things first: what happens if we ask it to compare two snapshots in reverse order?
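To get a quick overview of a long zfs diff listing, the markers met so far (M, +, -, R) can be tallied. This is a small sketch: diff_summary is a hypothetical helper that only looks at the first field of each line it reads from stdin.

```shell
#!/bin/sh
# Hypothetical helper: count the change markers found in zfs diff output.
# Reads the listing on stdin and prints a single summary line.
diff_summary() {
    awk '
        $1 == "M" { modified++ }
        $1 == "+" { added++ }
        $1 == "-" { removed++ }
        $1 == "R" { renamed++ }
        END {
            # Counters that were never incremented print as 0.
            printf "modified=%d added=%d removed=%d renamed=%d\n",
                   modified, added, removed, renamed
        }
    '
}

# Example (needs ZFS, so it is left commented out):
#   zfs diff myfirstpool/myfirstDS@snapshot-2 myfirstpool/myfirstDS@snapshot-3 | diff_summary
```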
root # zfs diff myfirstpool/myfirstDS@snapshot-3 myfirstpool/myfirstDS@snapshot-2 Unable to obtain diffs: Not an earlier snapshot from the same fs
Would ZFS be a bit happier if we asked for the difference between two snapshots with a gap in between (so snapshot 1 with snapshot 3)?
root # zfs diff myfirstpool/myfirstDS@snapshot-1 myfirstpool/myfirstDS@snapshot-3 M /myfirstpool/myfirstDS/ R /myfirstpool/myfirstDS/hello.txt -> /myfirstpool/myfirstDS/hello_new.txt M /myfirstpool/myfirstDS/radeon - /myfirstpool/myfirstDS/radeon/CAYMAN_me.bin + /myfirstpool/myfirstDS/goodbye.txt + /myfirstpool/myfirstDS/config.gz + /myfirstpool/myfirstDS/version.txt
Amazing! Here again, take a couple of minutes to think about all the operations you did on the dataset between the time you took the first snapshot and the time you took the last one: this summary is the exact reflection of all your previous operations.
Just to put a conclusion on this subject, let's see the differences between the myfirstpool/myfirstDS dataset and its various snapshots:
root # zfs diff myfirstpool/myfirstDS@snapshot-1 M /myfirstpool/myfirstDS/ R /myfirstpool/myfirstDS/hello.txt -> /myfirstpool/myfirstDS/hello_new.txt M /myfirstpool/myfirstDS/radeon - /myfirstpool/myfirstDS/radeon/CAYMAN_me.bin + /myfirstpool/myfirstDS/goodbye.txt + /myfirstpool/myfirstDS/config.gz + /myfirstpool/myfirstDS/version.txt
root # zfs diff myfirstpool/myfirstDS@snapshot-2 M /myfirstpool/myfirstDS/ R /myfirstpool/myfirstDS/hello.txt -> /myfirstpool/myfirstDS/hello_new.txt M /myfirstpool/myfirstDS/goodbye.txt + /myfirstpool/myfirstDS/version.txt
root # zfs diff myfirstpool/myfirstDS@snapshot-3
Having nothing reported for the last zfs diff is normal, as nothing has changed in the dataset since the snapshot was taken.
The time travelling machine part 2: rolling back with multiple snapshots
Examining the differences between the various snapshots of a dataset, or the dataset itself, would be quite useless if we were not able to roll the dataset back to one of its previous states. Now that we have savaged myfirstpool/myfirstDS a bit, it is time to restore it as it was when the first snapshot was taken:
root # zfs rollback myfirstpool/myfirstDS@snapshot-1 cannot rollback to 'myfirstpool/myfirstDS@snapshot-1': more recent snapshots exist use '-r' to force deletion of the following snapshots: myfirstpool/myfirstDS@snapshot-3 myfirstpool/myfirstDS@snapshot-2
Err... Well, ZFS just tells us that several more recent snapshots exist and it refuses to proceed without dropping them. Unfortunately for us there is no way to circumvent that: once you jump backward you have no way to move forward again. We could demonstrate the rollback to myfirstpool/myfirstDS@snapshot-3, then myfirstpool/myfirstDS@snapshot-2, then myfirstpool/myfirstDS@snapshot-1, but it would be of very little interest since previous sections of this tutorial did that already. So, second attempt:
root # zfs rollback -r myfirstpool/myfirstDS@snapshot-1 root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.81G 6.00G 850M /myfirstpool myfirstpool/myfirstDS 2.96M 6.00G 2.96M /myfirstpool/myfirstDS myfirstpool/myfirstDS@snapshot-1 1K - 2.96M - myfirstpool/mysecondDS 1003M 6.00G 1003M /myfirstpool/mysecondDS
myfirstpool/myfirstDS effectively returned to the desired state (notice the size of myfirstpool/myfirstDS@snapshot-1) and the snapshots snapshot-2 and snapshot-3 vanished. Just to convince you:
root # zfs diff myfirstpool/myfirstDS@snapshot-1 root #
No differences at all!
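Because zfs rollback -r destroys every snapshot taken after the target, it can be worth listing the victims before running it. The sketch below assumes snapshot names arrive on stdin in creation order, for instance from zfs list with -s creation; newer_than is a hypothetical helper, and -d 1 in the commented usage is there on the assumption that only the dataset's own snapshots are wanted.

```shell
#!/bin/sh
# Hypothetical helper: given snapshot names on stdin, oldest first, print the
# ones that come after the given target, i.e. those a rollback -r would drop.
# Usage: newer_than <pool/dataset@snapshot>
newer_than() {
    awk -v target="$1" '
        found { print }                 # every line after the target
        $0 == target { found = 1 }      # the target itself is not printed
    '
}

# Example (needs ZFS, so it is left commented out):
#   zfs list -H -t snapshot -o name -s creation -d 1 myfirstpool/myfirstDS \
#       | newer_than myfirstpool/myfirstDS@snapshot-1
```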
Snapshots and clones
A clone and a snapshot are two very close things in ZFS:
- A clone appears as a mounted dataset (i.e. you can read and write data in it) while a snapshot stays apart and is always read-only
- A clone is always spawned from a snapshot
So it is fair to say that a clone is just a writable snapshot. The copy-on-write feature of ZFS plays its role even there: the data blocks held by the snapshot are only duplicated upon modification. So cloning a 20 GB snapshot does not lead to an additional 20 GB of data being eaten from the pool.
How to make a clone? Simple, once again with the zfs command used like this:
root # zfs clone myfirstpool/myfirstDS@snapshot-1 myfirstpool/myfirstDS_clone1
root # zfs list -t all
NAME                               USED  AVAIL  REFER  MOUNTPOINT
myfirstpool                       1.81G  6.00G   850M  /myfirstpool
myfirstpool/myfirstDS             2.96M  6.00G  2.96M  /myfirstpool/myfirstDS
myfirstpool/myfirstDS@snapshot-1     1K      -  2.96M  -
myfirstpool/myfirstDS_clone1         1K  6.00G  2.96M  /myfirstpool/myfirstDS_clone1
myfirstpool/mysecondDS            1003M  6.00G  1003M  /myfirstpool/mysecondDS
Notice the value of MOUNTPOINT for myfirstpool/myfirstDS_clone1? Now we have a dataset that is mounted! Let's check with the mount command:
root # mount | grep clone myfirstpool/myfirstDS_clone1 on /myfirstpool/myfirstDS_clone1 type zfs (rw,xattr)
In theory we can change or write additional data in the clone as it is mounted as being writable (rw). Let it be!
root # ls /myfirstpool/myfirstDS_clone1
hello.txt radeon
root # cp /proc/config.gz /myfirstpool/myfirstDS_clone1 root # echo 'This is a clone!' >> /myfirstpool/myfirstDS_clone1/hello.txt
root # ls /myfirstpool/myfirstDS_clone1 config.gz hello.txt radeon root # cat /myfirstpool/myfirstDS_clone1/hello.txt Hello, world This is a clone!
Unfortunately it is not possible to ask for the difference between a clone and a snapshot: zfs diff expects to see either one snapshot name or two snapshot names. Once spawned, a clone starts its own existence and the snapshot that served as a seed for it remains attached to its own original dataset.
Because a clone is nothing more than a ZFS dataset, it can be destroyed just like any ZFS dataset:
root # zfs destroy myfirstpool/myfirstDS_clone1 root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.81G 6.00G 850M /myfirstpool myfirstpool/myfirstDS 2.96M 6.00G 2.96M /myfirstpool/myfirstDS myfirstpool/myfirstDS@snapshot-1 1K - 2.96M - myfirstpool/mysecondDS 1003M 6.00G 1003M /myfirstpool/mysecondDS
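The link between a clone and its seed snapshot is recorded in the clone's origin property. As a sketch, the hypothetical helper clones_of below filters name/origin pairs, such as those produced by zfs list -H -o name,origin, to show which datasets were cloned from a given snapshot. This can be handy because ZFS refuses to destroy a snapshot that still has dependent clones.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>origin" lines on stdin and prints the
# names of the datasets whose origin is the given snapshot.
# Usage: clones_of <pool/dataset@snapshot>
clones_of() {
    awk -F '\t' -v snap="$1" '$2 == snap { print $1 }'
}

# Example (needs ZFS, so it is left commented out):
#   zfs list -H -o name,origin -r myfirstpool | clones_of myfirstpool/myfirstDS@snapshot-1
```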
Streaming ZFS datasets
A ZFS snapshot can not only be cloned or explored but also streamed to a local file or even over the network, which makes it possible to back up a ZFS dataset or simply to make an exact bit-for-bit copy of it between two machines, for example. Snapshots being differential (i.e. incremental) by nature, very little network overhead is induced when consecutive snapshots are streamed over the network. A nifty move from the designers was to use stdin and stdout as transmission/reception channels, thus allowing great flexibility in processing the ZFS stream. You can envisage, for instance, compressing your stream, then encrypting it, then encoding it in base64, then signing it and so on. It sounds a bit overkill but it is possible, and in the general case you can use in your plumbing any tool that swallows data from stdin and spits it out through stdout.
First things first, just to illustrate some basic concepts here is how to stream a ZFS dataset snapshot to a local file:
root # zfs send myfirstpool/myfirstDS@snapshot-1 > /tmp/myfirstpool-myfirstDS@snapshot-snap1
Now let's stream it back:
root # cat /tmp/myfirstpool-myfirstDS@snapshot-snap1 | zfs receive myfirstpool/myfirstDS@testrecv
cannot receive new filesystem stream: destination 'myfirstpool/myfirstDS' exists
must specify -F to overwrite it
Ouch... ZFS refuses to go any further because some data would be overwritten. We do not own any critical data on the dataset, so we could destroy it and try again, or use a different name. Nevertheless, just for the sake of the demonstration, let's create another zpool before restoring the dataset there:
root # dd if=/dev/zero of=/tmp/zfs-test-disk04.img bs=2G count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 6.35547 s, 338 MB/s
root # losetup -f
/dev/loop4
root # losetup /dev/loop4 /tmp/zfs-test-disk04.img
root # zpool create testpool /dev/loop4
root # zpool list
NAME          SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
myfirstpool  7.94G  1.81G  6.12G  22%  1.00x  ONLINE  -
testpool     1.98G  89.5K  1.98G   0%  1.00x  ONLINE  -
Take two:
root # cat /tmp/myfirstpool-myfirstDS@snapshot-snap1 | zfs receive testpool/myfirstDS@testrecv root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.81G 6.00G 850M /myfirstpool myfirstpool/myfirstDS 2.96M 6.00G 2.96M /myfirstpool/myfirstDS myfirstpool/myfirstDS@snapshot-1 1K - 2.96M - myfirstpool/mysecondDS 1003M 6.00G 1003M /myfirstpool/mysecondDS testpool 3.08M 1.95G 31K /testpool testpool/myfirstDS 2.96M 1.95G 2.96M /testpool/myfirstDS testpool/myfirstDS@testrecv 0 - 2.96M -
Very interesting things happened there! First, the data previously stored in the file /tmp/myfirstpool-myfirstDS@snapshot-snap1 has been copied as a snapshot into the destination zpool (testpool here), under exactly the name given on the command line. Second, the dataset testpool/myfirstDS has been created for you by ZFS, so the contents of the snapshot myfirstpool/myfirstDS@snapshot-1 now appear as a live ZFS dataset where data can be read and written! Think for two seconds about the error message we got just above: the reason ZFS protested becomes clear now.
An alternative would have been to use the original zpool but this time with a different name for the dataset:
root # cat /tmp/myfirstpool-myfirstDS@snapshot-snap1 | zfs receive myfirstpool/myfirstDS_copy@testrecv root # zfs list -t all NAME USED AVAIL REFER MOUNTPOINT myfirstpool 1.82G 6.00G 850M /myfirstpool myfirstpool/myfirstDS 2.96M 6.00G 2.96M /myfirstpool/myfirstDS myfirstpool/myfirstDS@snapshot-1 1K - 2.96M - myfirstpool/myfirstDS_copy 2.96M 6.00G 2.96M /myfirstpool/myfirstDS_copy myfirstpool/myfirstDS_copy@testrecv 0 - 2.96M - myfirstpool/mysecondDS 1003M 6.00G 1003M /myfirstpool/mysecondDS
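As mentioned at the start of this section, anything that reads stdin and writes stdout can sit in the pipeline. The sketch below repeats the file-based round trip with gzip in the middle; save_stream and restore_stream are hypothetical wrapper names, and the file path and target dataset in the commented usage are placeholders rather than names used elsewhere in this tutorial.

```shell
#!/bin/sh
# Hypothetical wrapper: stream a snapshot into a gzip-compressed file.
# Usage: save_stream <pool/dataset@snapshot> <output-file>
save_stream() {
    # In plain sh a failure of "zfs send" is masked by gzip's exit status,
    # so the resulting file should be checked before relying on it.
    zfs send "$1" | gzip -c > "$2"
}

# Hypothetical wrapper: restore a compressed stream under a new name.
# Usage: restore_stream <input-file> <pool/dataset@snapshot>
restore_stream() {
    gzip -dc "$1" | zfs receive "$2"
}

# Example (needs ZFS and root privileges, so it is left commented out):
#   save_stream myfirstpool/myfirstDS@snapshot-1 /tmp/myfirstDS-snapshot-1.gz
#   restore_stream /tmp/myfirstDS-snapshot-1.gz testpool/myfirstDS_gz@testrecv
```

The same pattern applies to any other filter: replace gzip on both sides with a matching pair of commands.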
Now something a bit more interesting: instead of using a local file, we will stream the dataset to a Solaris 11 machine (OpenIndiana can also be used) over the network, using the GNU flavour of netcat (net-analyzer/gnu-netcat) on port TCP/7000. In this case the Solaris host is an x86 machine, but a SPARC machine would have given the exact same result since ZFS, contrary to UFS, is platform agnostic.
On the Solaris machine:
root # nc -l -p 7000 | zfs receive nas/zfs-stream-test@s1
On the Linux machine:
root # zfs send myfirstpool/myfirstDS@snapshot-1 | netcat -c 192.168.1.13 7000
The nc command coming with the net-analyzer/netcat package does not automatically close the network connection when its input stream is closed (i.e. when the zfs send command terminates its job), thus its Solaris counterpart also waits "forever" at the other end of the "pipe". It is not possible to override this behaviour, hence the reason we use its GNU variant (package net-analyzer/gnu-netcat).
After the dataset has been received on the Solaris machine, the nas zpool contains the sent snapshot and its corresponding live dataset, the latter being automatically created:
root # zfs list -t snapshot NAME USED AVAIL REFER MOUNTPOINT (...) nas/zfs-stream-test 3.02M 6.17T 3.02M /nas/zfs-stream-test nas/zfs-stream-test@s1 0 - 3.02M -
A quick look in the /nas/zfs-stream-test directory on the same Solaris machine gives:
root # ls -lR /nas/zfs-stream-test /nas/zfs-stream-test/: total 12 -rw-r--r-- 1 root root 13 Mar 3 18:59 hello.txt drwxr-xr-x 2 root root 143 Mar 3 18:59 radeon /nas/zfs-stream-test/radeon: total 6144 -rw-r--r-- 1 root root 8704 Mar 3 18:59 ARUBA_me.bin -rw-r--r-- 1 root root 8704 Mar 3 18:59 ARUBA_pfp.bin -rw-r--r-- 1 root root 6144 Mar 3 18:59 ARUBA_rlc.bin -rw-r--r-- 1 root root 24096 Mar 3 18:59 BARTS_mc.bin -rw-r--r-- 1 root root 5504 Mar 3 18:59 BARTS_me.bin -rw-r--r-- 1 root root 4480 Mar 3 18:59 BARTS_pfp.bin (...)
The dataset is exactly the same as on the Linux machine!
We took only a simple case here: ZFS is able to handle snapshots in a very flexible way. You can ask, for example, to combine several consecutive snapshots and send them as a single snapshot, or you can choose to proceed in incremental steps. A man zfs will teach you the art of streaming your snapshots.
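As a sketch of the incremental approach mentioned above: zfs send accepts -i followed by an earlier snapshot, in which case only the changes between the two snapshots are streamed. The wrapper name send_incremental and the pool, dataset and snapshot names in the commented usage are placeholders, and the receiving side is assumed to already hold the earlier snapshot.

```shell
#!/bin/sh
# Hypothetical wrapper: stream only the changes between two snapshots of the
# same dataset. The stream goes to stdout, like any other zfs send.
# Usage: send_incremental <pool/dataset> <older-snapshot> <newer-snapshot>
send_incremental() {
    zfs send -i "$1@$2" "$1@$3"
}

# Example (needs ZFS on both ends, so it is left commented out):
#   send_incremental sourcepool/data monday tuesday | zfs receive backuppool/data
```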
Govern a dataset by attributes
In the ZFS world, many aspects are now managed by simply setting/clearing a property attached to a ZFS dataset through the now so well-known command zfs. You can, for example:
- put a size limit on a dataset
- control if new files are encrypted and/or compressed
- define a quota
- control checksum usage => never turn that property off unless having very good reasons you are likely to never have (no checksums = no silent data corruption detection)
- share a dataset by NFS/CIFS (Samba)
- control data deduplication
Not all of a dataset's properties are settable; some of them are set and managed by the operating system in the background for you and thus cannot be modified. Like any other action concerning datasets, properties are set and unset via the zfs command. Let's start by checking the value of all supported attributes for the dataset myfirstpool/myfirstDS:
root # zfs get all myfirstpool/myfirstDS
NAME                   PROPERTY              VALUE                   SOURCE
myfirstpool/myfirstDS  type                  filesystem              -
myfirstpool/myfirstDS  creation              Sun Mar 2 15:26 2014    -
myfirstpool/myfirstDS  used                  2.96M                   -
myfirstpool/myfirstDS  available             6.00G                   -
myfirstpool/myfirstDS  referenced            2.96M                   -
myfirstpool/myfirstDS  compressratio         1.00x                   -
myfirstpool/myfirstDS  mounted               yes                     -
myfirstpool/myfirstDS  quota                 none                    default
myfirstpool/myfirstDS  reservation           none                    default
myfirstpool/myfirstDS  recordsize            128K                    default
myfirstpool/myfirstDS  mountpoint            /myfirstpool/myfirstDS  default
myfirstpool/myfirstDS  sharenfs              off                     default
myfirstpool/myfirstDS  checksum              on                      default
myfirstpool/myfirstDS  compression           off                     default
myfirstpool/myfirstDS  atime                 on                      default
myfirstpool/myfirstDS  devices               on                      default
myfirstpool/myfirstDS  exec                  on                      default
myfirstpool/myfirstDS  setuid                on                      default
myfirstpool/myfirstDS  readonly              off                     default
myfirstpool/myfirstDS  zoned                 off                     default
myfirstpool/myfirstDS  snapdir               hidden                  default
myfirstpool/myfirstDS  aclinherit            restricted              default
myfirstpool/myfirstDS  canmount              on                      default
myfirstpool/myfirstDS  xattr                 on                      default
myfirstpool/myfirstDS  copies                1                       default
myfirstpool/myfirstDS  version               5                       -
myfirstpool/myfirstDS  utf8only              off                     -
myfirstpool/myfirstDS  normalization         none                    -
myfirstpool/myfirstDS  casesensitivity       sensitive               -
myfirstpool/myfirstDS  vscan                 off                     default
myfirstpool/myfirstDS  nbmand                off                     default
myfirstpool/myfirstDS  sharesmb              off                     default
myfirstpool/myfirstDS  refquota              none                    default
myfirstpool/myfirstDS  refreservation        none                    default
myfirstpool/myfirstDS  primarycache          all                     default
myfirstpool/myfirstDS  secondarycache        all                     default
myfirstpool/myfirstDS  usedbysnapshots       1K                      -
myfirstpool/myfirstDS  usedbydataset         2.96M                   -
myfirstpool/myfirstDS  usedbychildren        0                       -
myfirstpool/myfirstDS  usedbyrefreservation  0                       -
myfirstpool/myfirstDS  logbias               latency                 default
myfirstpool/myfirstDS  dedup                 off                     default
myfirstpool/myfirstDS  mlslabel              none                    default
myfirstpool/myfirstDS  sync                  standard                default
myfirstpool/myfirstDS  refcompressratio      1.00x                   -
myfirstpool/myfirstDS  written               1K                      -
myfirstpool/myfirstDS  snapdev               hidden                  default
The manual page of the zfs command gives a list and description of every attribute supported by a dataset.
Maybe something piqued your curiosity: "what does SOURCE mean?". SOURCE describes how the property value has been determined for the dataset and can have several values:
- local: the property has been explicitly set for this dataset
- default: a default value has been assigned by the operating system because none was explicitly set by the system administrator
- dash (-): immutable property (e.g. dataset creation time, whether the dataset is currently mounted or not...)
Of course, if you know its name, you can get the value of a single property instead of asking for all of them.
Compressing data
root # zfs get compression myfirstpool/myfirstDS
NAME                   PROPERTY     VALUE  SOURCE
myfirstpool/myfirstDS  compression  off    default
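In scripts, the header line and the extra columns get in the way. A minimal sketch, assuming the -H (no header, tab-separated) and -o (column selection) options of zfs get described in the manual page; the helper name get_prop is made up for this example:

```shell
# get_prop DATASET PROPERTY: print only the property's value.
# -H drops the header line, -o value keeps only the VALUE column.
get_prop() {
    zfs get -H -o value "$2" "$1"
}
```

At this point of the tutorial, get_prop myfirstpool/myfirstDS compression should print off.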
Let's activate compression on the dataset (notice the change in the SOURCE column). This is achieved through an attribute simply named compression, which can be changed by running the zfs command with the set sub-command followed by the attribute's name (compression here) and value (on here), like this:
root # zfs set compression=on myfirstpool/myfirstDS
root # zfs get compression myfirstpool/myfirstDS
NAME                   PROPERTY     VALUE  SOURCE
myfirstpool/myfirstDS  compression  on     local
The attribute's new value takes effect immediately; there is no need to unmount and remount anything. Setting compression to on only affects newly written data, not what already exists on the dataset. For your information, the lzjb compression algorithm is used when compression is set to on (more recent ZFS on Linux releases pick lz4 instead when the pool's lz4_compress feature is enabled). You can override this and select another compression algorithm explicitly. For example, if you want to activate LZ4 compression on the dataset:
root # zfs get compression myfirstpool/myfirstDS
NAME                   PROPERTY     VALUE  SOURCE
myfirstpool/myfirstDS  compression  off    default
root # zfs set compression=lz4 myfirstpool/myfirstDS
root # zfs get compression myfirstpool/myfirstDS
NAME                   PROPERTY     VALUE  SOURCE
myfirstpool/myfirstDS  compression  lz4    local
Assuming myfirstpool/myfirstDS is empty with no snapshots:
root # cp -a /usr/src/linux-3.13.5-gentoo /myfirstpool/myfirstDS
root # zfs get all myfirstpool/myfirstDS
NAME                   PROPERTY              VALUE                   SOURCE
myfirstpool/myfirstDS  type                  filesystem              -
myfirstpool/myfirstDS  creation              Sun Mar 2 15:26 2014    -
myfirstpool/myfirstDS  used                  584M                    -
myfirstpool/myfirstDS  available             5.43G                   -
myfirstpool/myfirstDS  referenced            584M                    -
myfirstpool/myfirstDS  compressratio         1.96x                   -        <<<< Compression ratio
myfirstpool/myfirstDS  mounted               yes                     -
myfirstpool/myfirstDS  quota                 none                    default
myfirstpool/myfirstDS  reservation           none                    default
myfirstpool/myfirstDS  recordsize            128K                    default
myfirstpool/myfirstDS  mountpoint            /myfirstpool/myfirstDS  default
myfirstpool/myfirstDS  sharenfs              off                     default
myfirstpool/myfirstDS  checksum              on                      default
myfirstpool/myfirstDS  compression           on                      local    <<<< LZJB compression active
myfirstpool/myfirstDS  atime                 on                      default
myfirstpool/myfirstDS  devices               on                      default
myfirstpool/myfirstDS  exec                  on                      default
myfirstpool/myfirstDS  setuid                on                      default
myfirstpool/myfirstDS  readonly              off                     default
myfirstpool/myfirstDS  zoned                 off                     default
myfirstpool/myfirstDS  snapdir               hidden                  default
myfirstpool/myfirstDS  aclinherit            restricted              default
myfirstpool/myfirstDS  canmount              on                      default
myfirstpool/myfirstDS  xattr                 on                      default
myfirstpool/myfirstDS  copies                1                       default
myfirstpool/myfirstDS  version               5                       -
myfirstpool/myfirstDS  utf8only              off                     -
myfirstpool/myfirstDS  normalization         none                    -
myfirstpool/myfirstDS  casesensitivity       sensitive               -
myfirstpool/myfirstDS  vscan                 off                     default
myfirstpool/myfirstDS  nbmand                off                     default
myfirstpool/myfirstDS  sharesmb              off                     default
myfirstpool/myfirstDS  refquota              none                    default
myfirstpool/myfirstDS  refreservation        none                    default
myfirstpool/myfirstDS  primarycache          all                     default
myfirstpool/myfirstDS  secondarycache        all                     default
myfirstpool/myfirstDS  usedbysnapshots       0                       -
myfirstpool/myfirstDS  usedbydataset         584M                    -
myfirstpool/myfirstDS  usedbychildren        0                       -
myfirstpool/myfirstDS  usedbyrefreservation  0                       -
myfirstpool/myfirstDS  logbias               latency                 default
myfirstpool/myfirstDS  dedup                 off                     default
myfirstpool/myfirstDS  mlslabel              none                    default
myfirstpool/myfirstDS  sync                  standard                default
myfirstpool/myfirstDS  refcompressratio      1.96x                   -
myfirstpool/myfirstDS  written               584M                    -
myfirstpool/myfirstDS  snapdev               hidden                  default
Notice the value of compressratio: it no longer shows 1.00x but a shiny 1.96x (a 1.96:1 ratio). The ratio is high here because we copied a lot of source code files; had we copied already-compressed data (images in JPEG or PNG format, for example), the ratio would have been much lower.
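To put the ratio in perspective, the uncompressed size can be estimated by multiplying the space used by compressratio. A rough sketch using the figures from the listing above (584M used, 1.96x); the helper name is hypothetical and the result is only an approximation, since the used value is rounded and also covers metadata:

```shell
# estimate_logical_mib USED_MIB RATIO: approximate uncompressed size in MiB.
estimate_logical_mib() {
    awk -v used="$1" -v ratio="$2" 'BEGIN { printf "%.0f\n", used * ratio }'
}

estimate_logical_mib 584 1.96   # about 1145 MiB, i.e. roughly 1.1 GiB of source code
```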
Changing the mountpoint
Let's change the mount point of myfirstpool/myfirstDS to something like /mnt/floppy instead of /myfirstpool/myfirstDS for the sake of demonstration. Changing a dataset's mountpoint is done via its mountpoint attribute:
root # zfs get mountpoint myfirstpool/myfirstDS
NAME                   PROPERTY    VALUE                   SOURCE
myfirstpool/myfirstDS  mountpoint  /myfirstpool/myfirstDS  default
root # zfs set mountpoint=/mnt/floppy myfirstpool/myfirstDS
root # zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
myfirstpool             2.38G  5.43G   850M  /myfirstpool
myfirstpool/myfirstDS    584M  5.43G   584M  /mnt/floppy
myfirstpool/mysecondDS  1003M  5.43G  1003M  /myfirstpool/mysecondDS
root # mount | grep floppy
myfirstpool/myfirstDS on /mnt/floppy type zfs (rw,xattr)
Notice that the dataset has been automatically unmounted and remounted at the new location for you and, once again, the change is effective immediately. If the indicated mountpoint is not empty, ZFS is smart enough to warn you and not remount it.
Sharing a dataset through NFS
Now that you are a bit more familiar with ZFS properties, you won't be that surprised to learn that sharing a dataset can be done by setting one of its properties. You can, of course, go the "traditional" way and edit Samba's or NFS's configuration files by hand, but why bother with manual editing when ZFS can do that for you? ZFS on Linux has support for both systems.
Next, let's share the myfirstpool/myfirstDS dataset over NFS with any host within the network 192.168.1.0/24 (read-write access). An important detail here: the zfs command will use NFS v4 by default, so any options related to NFS v4 can be passed on the command line; refer to your NFS server's documentation for further information on what is supported and how to use it. To share the dataset over NFS, you must change a property named sharenfs:
root # zfs set sharenfs='rw=@192.168.1.0/24' myfirstpool/myfirstDS
What happened? Simple:
root # zfs get sharenfs myfirstpool/myfirstDS
NAME                   PROPERTY  VALUE               SOURCE
myfirstpool/myfirstDS  sharenfs  rw=@192.168.1.0/24  local
root # cat /etc/dfs/sharetab
/myfirstpool/myfirstDS  -  nfs  rw=@192.168.1.0/24
The syntax and behaviour are similar to what is found under Solaris 11: zfs share reads and updates entries in the file /etc/dfs/sharetab (not /etc/exports). This is a Solaris touch: under Solaris 11 the zfs and share commands now act on /etc/dfs/sharetab, /etc/dfs/dfstab being no longer supported.
Checking with the showmount command:
root # showmount -e
Export list for .... :
/myfirstpool/myfirstDS 192.168.1.0/24
At this point it should be possible to mount the dataset from another host on the network (here a Solaris 11 machine) and write some data in it:
root # mkdir -p /mnt/myfirstDS
root # mount 192.168.1.19:/myfirstpool/myfirstDS /mnt/myfirstDS
root # mount | grep myfirst
/mnt/myfirstDS on 192.168.1.19:/myfirstpool/myfirstDS remote/read/write/setuid/devices/rstchown/xattr/dev=89c0002 on Sun Mar 9 14:28:55 2014
root # cp /kernel/amd64/genunix /mnt/myfirstDS
Et voilà! No sign of protest, so the file has been copied. If we check what the ZFS dataset looks like on the Linux host where it resides, the copied file (a Solaris kernel image here) is present:
root # ls -l /myfirstpool/myfirstDS/genunix
-rwxr-xr-x 1 root root 5769456 Mar 9 14:32 /myfirstpool/myfirstDS/genunix
$100 question: how to "unshare" the dataset? Simple: just set sharenfs to off! Be aware that the NFS server will cease to share the dataset regardless of whether it is still in use by client machines. Any NFS client still having the dataset mounted at this point will encounter RPC errors whenever an I/O operation is attempted on the share (Solaris NFS client here):
root # ls /mnt/myfirstDS
NFS compound failed for server 192.168.1.19: error 7 (RPC: Authentication error)
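For completeness, the unshare step described above as a small helper that also reads the property back. This is only a sketch; the function name is made up, and -H / -o value are the scripting options of zfs get:

```shell
# unshare_nfs DATASET: stop exporting the dataset over NFS,
# then print the resulting value of the sharenfs property.
unshare_nfs() {
    zfs set sharenfs=off "$1" && zfs get -H -o value sharenfs "$1"
}
```

Running unshare_nfs myfirstpool/myfirstDS should print off; unmounting the share on the clients first avoids the RPC errors shown above.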
Sharing a dataset through Samba/SMB
Let's push the limit a bit and use Samba instead of NFS. ZFS relies on Samba (net-fs/samba on Gentoo/Funtoo) to get the job done, as it does not implement an SMB server of its own. So Samba must be emerged first, making sure that:
- it has built-in ACL support (acl USE flag)
- client tools are built (client USE flag), as ZoL invokes the net command behind the scenes (i.e. net usershare ...)
- usershares are functional
Quoting the zfs command's manual page, your Samba server must also be configured like this:
- Samba will need to listen to 'localhost' (127.0.0.1) for the zfs utilities to communicate with samba. This is the default behaviour for most Linux distributions.
- Samba must be able to authenticate a user. This can be done in a number of ways, depending on if using the system password file, LDAP or the Samba specific smbpasswd file. How to do this is outside the scope of this manual. Please refer to the smb.conf(5) manpage for more information.
- See the USERSHARE section of the smb.conf(5) man page for all configuration options in case you need to modify any options to the share afterwards. Do note that any changes done with the 'net' command will be undone if the share is ever unshared (such as at a reboot etc). In the future, ZoL will be able to set specific options directly using sharesmb=<option>.
What you have to know at this point is that, once emerged on your Funtoo box, Samba has no configuration file and will therefore refuse to start. You can use the provided example file /etc/samba/smb.conf.example as a starting point for /etc/samba/smb.conf; just copy it:
root # cd /etc/samba
root # cp smb.conf.example smb.conf
Now create the directory /var/lib/samba/usershares, which will host the definitions of all usershares. For the context of this tutorial, leaving the default permissions (0755) and owner (root:root) untouched is acceptable unless you use ZFS delegation (but see the Samba requirements on this directory listed further down).
root # mkdir /var/lib/samba/usershares
Several important things to know unless you have hours to waste with your friend Google:
- When you set the sharesmb property to on, the zfs command invokes Samba's net command behind the scenes to create a usershare (a comment and an ACL are both specified). E.g. zfs set sharesmb=on myfirstpool/myfirstDS => net usershare add myfirstpool_myfirstDS /myfirstpool/myfirstDS "Comment:/myfirstpool/myfirstDS" "Everyone:F" guest_ok=n
- Under which user is the net usershare command invoked? Unless ZFS delegation is used, root owns the usershare it creates, which is described in a text file (named after the usershare) located in the directory /var/lib/samba/usershares. Samba imposes three very important requirements on the directory /var/lib/samba/usershares:
- Its owner must be root , the group is of secondary importance and left to your discretion
- Its permissions must be 1775 (so owner = rwx, group = rwx, others = r-x with sticky bit armed).
- If the directory is not set up as above, Samba simply ignores any usershares you define. So if you get errors like BAD_NETWORK_NAME when connecting to a usershare created by ZFS, double-check the owner and permissions of /var/lib/samba/usershares (or of whichever directory holds the usershare definitions on your Funtoo box).
- Unless explicitly overridden in /etc/samba/smb.conf:
- usershare max shares default value is zero so no usershare can be created. If you forget to set a value greater than zero for usershare max shares any zfs set sharesmb=on command will complain with the message cannot share (...) smb add share failed (also any net usershare add command will show the error message net usershare: usershares are currently disabled).
- usershare path = /var/lib/samba/usershares
- usershare owner only is set to true by default so Samba will refuse the share to any remote user not opening a session as root on the share
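The directory requirements listed above can be checked before blaming ZFS. A small sketch, assuming GNU stat (coreutils) is available; the function name is made up for this example:

```shell
# check_usershare_dir DIR: succeed only if DIR is owned by root with mode 1775.
check_usershare_dir() {
    dir=$1
    # %a = octal permissions (sticky bit included), %U = owner name
    mode=$(stat -c '%a' "$dir") || return 1
    owner=$(stat -c '%U' "$dir") || return 1
    if [ "$mode" = "1775" ] && [ "$owner" = "root" ]; then
        echo "$dir: OK"
    else
        echo "$dir: found mode $mode owner $owner, expected 1775 root"
        return 1
    fi
}
```

Typical use is check_usershare_dir /var/lib/samba/usershares; if it complains, chown root and chmod 1775 on that directory bring it in line with the list above.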
So basically a super-minimalistic configuration for Samba would be:
[global]
workgroup = MYGROUP
server string = Samba Server
security = user
log file = /var/log/samba/log.%m
max log size = 50
# Permits the usershares of being accessed by any other user than 'root' from a remote client machine
usershare owner only = False
# WARNING: default value for usershare max shares is 0 so No usershares possible...
usershare max shares = 10
This configuration is obviously for demonstration purposes within the scope of this tutorial; do not use it in the real world!
At this point, reload or restart Samba if you have altered /etc/samba/smb.conf. Now that usershares are possible, let's share a ZFS dataset over Samba:
root # zfs set sharesmb=on myfirstpool/myfirstDS
root # zfs get sharesmb myfirstpool/myfirstDS
NAME                   PROPERTY  VALUE  SOURCE
myfirstpool/myfirstDS  sharesmb  on     local
The command must return without any error message. If you get something like "cannot share myfirstpool/myfirstDS smb add share failed" then usershares are not functional on your machine (see the notes just above). Now a Samba usershare named after the zpool and the dataset should exist:
root # net usershare list
myfirstpool_myfirstDS
root # net usershare info myfirstpool_myfirstDS
[myfirstpool_myfirstDS]
path=/myfirstpool/myfirstDS
comment=Comment: /myfirstpool/myfirstDS
usershare_acl=Everyone:F,
guest_ok=n
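Judging from the output above, the usershare name is simply the dataset name with the slash replaced by an underscore. A one-function sketch of that mapping (an observation drawn from this tutorial's output, not a documented interface; the function name is made up):

```shell
# usershare_name DATASET: derive the usershare name as seen in 'net usershare list'.
usershare_name() {
    printf '%s\n' "$1" | tr '/' '_'
}

usershare_name myfirstpool/myfirstDS   # prints myfirstpool_myfirstDS
```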
So far so good! So let's try this on the machine itself:
root #
Data redundancy with ZFS
Nothing is perfect, and storage media (even in datacenter-class equipment) are prone to failures and fail on a regular basis. Having data redundancy is mandatory to help prevent single points of failure (SPoF). Over the past decades, RAID technologies have been powerful; however their power is precisely their weakness: operating at the block level, they do not care about what is stored in the data blocks and have no way to interact with the filesystems stored on them to ensure data integrity is properly handled.
Some statistics
It is no secret that a general trend in the IT industry is the exponential growth of data quantities. Just think about the amount of data YouTube, Google or Facebook generate every day. Taking the first as an example, some statistics:
- 24 hours of video is generated every minute in March 2010 (May 2009 - 20h / October 2008 - 15h / May 2008 - 13h)
- More than 2 billion views a day
- More video is produced on YouTube every 60 days than the 3 major US broadcasting networks did in the last 60 years
Facebook is also impressive (Facebook's own stats):
- over 900 million objects that people interact with (pages, groups, events and community pages)
- Average user creates 90 pieces of content each month (750 million active users)
- More than 2.5 million websites have integrated with Facebook
What is true for Facebook and YouTube is also true in many other cases (think for a minute about the amount of data stored in iTunes), especially with the growing popularity of cloud computing infrastructures. Despite the progress of technology, a "bottleneck" still exists: storage reliability has stayed nearly the same over the years. If only one organization in the world generates huge quantities of data, it would be CERN (Conseil Européen pour la Recherche Nucléaire, now officially known as the European Organization for Nuclear Research), as their experiments can generate spikes of many terabytes of data within a few seconds. A study done in 2007, quoted by a ZDNet article, reveals that:
- Even ECC memory cannot always help: 3 double-bit (uncorrectable) errors occurred in 3 months on 1300 nodes. Bad news: it should be zero.
- RAID systems cannot protect in all cases: monitoring 492 RAID controllers for 4 weeks showed an average error rate of 1 per ~10^14 bits, giving roughly 300 errors for every 2.4 petabytes
- Magnetic storage is still not reliable, even on high-end datacenter-class drives: 500 errors were found over 100 nodes while writing a 2 GB file to 3000+ nodes every 2 hours, then reading it again and again for 5 weeks.
Overall this means 22 corrupted files (1 in every 1500 files) out of a grand total of 33700 files holding 8.7TB of data. And this study is 5 years old....
Source of silent data corruption
http://www.zdnet.com/blog/storage/50-ways-to-lose-your-data/168
Not an exhaustive list but we can quote:
- Cheap controller or buggy driver that does not report errors/pre-failure conditions to the operating system;
- "bit-leaking": a hard drive consists of many concentric magnetic tracks. When the hard drive's magnetic head writes bits on the magnetic surface, it generates a very weak magnetic field that is nevertheless sufficient to "leak" onto the next track and change some bits. Drives can generally compensate for those situations because they also record some error correction data on the magnetic surface
- magnetic surface defects (weak sectors)
- Hard drive firmware bugs
- Cosmic rays hitting your RAM chips or hard drives cache memory/electronics
Building a mirrored pool
ZFS RAID-Z
ZFS/RAID-Z vs RAID-5
RAID-5 is very commonly used nowadays because of its simplicity, efficiency and fault-tolerance. Although the technology has proven itself over decades, it has a major drawback known as "the RAID-5 write hole". If you are familiar with RAID-5, you already know that it consists of spreading the stripes across all of the disks within the array and interleaving them with a special stripe called the parity. Several schemes for spreading stripes/parity between disks exist in the wild, each one with its own pros and cons; however the "standard" one (also known as left-asynchronous) is:
Disk_0  | Disk_1  | Disk_2  | Disk_3
[D0_S0] | [D0_S1] | [D0_S2] | [D0_P]
[D1_S0] | [D1_S1] | [D1_P]  | [D1_S2]
[D2_S0] | [D2_P]  | [D2_S1] | [D2_S2]
[D3_P]  | [D3_S0] | [D3_S1] | [D3_S2]
The parity is simply computed by XORing the stripes of the same "row", thus giving the general equation:
- [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] XOR [Dn_P] = 0
This equation can be rewritten in several ways:
- [Dn_S0] XOR [Dn_S1] XOR ... XOR [Dn_Sm] = [Dn_P]
- [Dn_S1] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S0]
- [Dn_S0] XOR [Dn_S2] XOR ... XOR [Dn_Sm] XOR [Dn_P] = [Dn_S1]
- ...and so on!
Because the equations are combinations of exclusive-ors, it is easy to compute a term if it is missing. Let's say we have 3 stripes plus one parity, each composed of 4 bits, but one of them is missing due to a disk failure:
- D0_S0 = 1011
- D0_S1 = 0010
- D0_S2 = <missing>
- D0_P = 0110
However we know that:
- D0_S0 XOR D0_S1 XOR D0_S2 XOR D0_P = 0000, which can be rewritten as:
- D0_S2 = D0_S0 XOR D0_S1 XOR D0_P
Applying boolean algebra gives: D0_S2 = 1011 XOR 0010 XOR 0110 = 1111. Proof: 1011 XOR 0010 XOR 1111 = 0110, which is the same as D0_P.
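The same reconstruction can be replayed in the shell. This is only an illustration of the XOR arithmetic on the 4-bit values above, not how a RAID controller is implemented; the helper functions are made up for this example:

```shell
# Convert a string of 0/1 characters into an integer.
bin_to_int() {
    bits=$1
    value=0
    while [ -n "$bits" ]; do
        rest=${bits#?}            # everything after the first character
        digit=${bits%"$rest"}     # the first character itself
        value=$(( value * 2 + digit ))
        bits=$rest
    done
    echo "$value"
}

# Print an integer as a 4-bit binary string.
int_to_bin4() {
    n=$1
    out=""
    for pos in 3 2 1 0; do
        out="$out$(( (n >> pos) & 1 ))"
    done
    echo "$out"
}

# XOR any number of binary strings together.
xor_bits() {
    acc=0
    for term in "$@"; do
        acc=$(( acc ^ $(bin_to_int "$term") ))
    done
    int_to_bin4 "$acc"
}

# Rebuild the missing stripe from the surviving stripes and the parity.
xor_bits 1011 0010 0110    # D0_S0 XOR D0_S1 XOR D0_P, prints 1111 (D0_S2)
# Check: XORing the three data stripes gives back the parity.
xor_bits 1011 0010 1111    # prints 0110 (D0_P)
```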
'So what's the deal?' Okay, now the funny part: forget the above hypothesis and imagine we have this:
- D0_S0 = 1011
- D0_S1 = 0010
- D0_S2 = 1101
- D0_P = 0110
Applying boolean algebra magic gives 1011 XOR 0010 XOR 1101 => 0100. Problem: this is different from D0_P (0110). Can you tell which one (or which ONES) of the four terms lies? If you find a mathematically acceptable solution, found your own company, because you have just solved a big computer science problem. If humans can't answer the question, imagine how hard it is for the poor little RAID-5 controller to determine which stripe is right and which one lies, and the resulting "datageddon" (i.e. massive data corruption on the RAID-5 array) when the RAID-5 controller detects an error and starts to rebuild the array.
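A small shell sketch of that check, using the four values above. It shows what a parity check can and cannot say: the row is flagged as inconsistent, but nothing in the result points at the term that is wrong. The helper names are hypothetical and this is not how a real controller is written:

```shell
# XOR two equal-length binary strings, character by character.
xor2() {
    a=$1 b=$2 out=""
    while [ -n "$a" ]; do
        ra=${a#?}; rb=${b#?}
        # Identical bits give 0, different bits give 1.
        if [ "${a%"$ra"}" = "${b%"$rb"}" ]; then out="${out}0"; else out="${out}1"; fi
        a=$ra; b=$rb
    done
    echo "$out"
}

# parity_check S0 S1 S2 P: compare the stored parity with the recomputed one.
parity_check() {
    computed=$(xor2 "$(xor2 "$1" "$2")" "$3")
    if [ "$computed" = "$4" ]; then
        echo "consistent: parity $4"
    else
        echo "MISMATCH: stored parity $4, computed $computed"
        return 1
    fi
}

parity_check 1011 0010 1101 0110 || echo "one of the four terms lies, but which one?"
```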
This is not science fiction, this is pure reality, and the weakness lies in the simplicity of RAID-5. Here is how it can happen: an urban legend about RAID-5 arrays is that they update stripes in an atomic transaction (all of the stripes plus parity are written, or none of them). Too bad, this is just not true: the data is written on the fly, and if for one reason or another the machine hosting the RAID-5 array suffers a power outage or a crash, the RAID-5 controller will simply have no idea what it was doing, nor which stripes are up to date and which are not. Of course, RAID controllers in servers do have a replaceable on-board battery, and most of the time the server they reside in is connected to an auxiliary power source like a battery-based UPS or a diesel/gas generator. However, Murphy's law or unpredictable hazards can sometimes strike....
Another funny scenario: imagine a machine with a RAID-5 array (on a UPS this time) but with non-ECC memory. The RAID-5 controller splits the data buffer into stripes, computes the parity stripe and starts to write them to the different disks of the array. But...but...but... for some odd reason, a single bit in one of the stripes flips (cosmic rays, RFI...) after the parity calculation. Too bad, too sad: one of the written stripes contains corrupted data and it is silently written to the array. Datageddon in sight!
Not to scare you: storage units have sophisticated error correction capabilities (a magnetic or optical recording surface is not perfect, and read/write errors do occur) that mask most cases. However, some established statistics estimate that even with error correction mechanisms, one bit out of every 10^16 bits transferred is incorrect. 10^16 is really huge, but unfortunately, in this beginning of the XXIst century, with datacenters brewing massive amounts of data on several hundreds if not thousands of servers, this number starts to give headaches: a big datacenter can face silent data corruption every 15 minutes (Wikipedia). No typo here: a potential disaster may silently appear 4 times an hour, every single day of the year. Detection techniques exist, but traditional RAID-5 arrays can themselves be a problem. Ironic for such a popular and widely used solution :)
If RAID-5 was an acceptable trade-off in the past decades, it has simply had its day. RAID-5 is dead? *Hooray!*
More advanced topics
Z-Volumes (ZVOLs)
ZFS Intention Log (ZIL)
Permission delegation
ZFS brings a feature known as delegated administration. Delegated administration enables ordinary users to handle administrative tasks on a dataset without being administrators. It is however not a sudo replacement, as it covers only ZFS-related tasks such as sharing/unsharing, disk quota management and so on. Permission delegation shines in flexibility because such delegation can be inherited through nested datasets. Permission delegation is handled via zfs through its allow and unallow subcommands.
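A minimal sketch of what delegation looks like on the command line, wrapped in helper functions. The user name alice and the function names are hypothetical; share and quota are permission names listed in the zfs manual page. How much an unprivileged user can actually do with them on Linux may depend on the ZoL version, so treat this as an outline rather than a recipe:

```shell
# delegate_to_user USER DATASET: let USER manage sharing and quota on DATASET.
delegate_to_user() {
    zfs allow "$1" share,quota "$2"
}

# show_delegations DATASET: list the permissions currently delegated on DATASET.
show_delegations() {
    zfs allow "$1"
}

# revoke_from_user USER DATASET: take the same permissions back.
revoke_from_user() {
    zfs unallow "$1" share,quota "$2"
}
```

For example, delegate_to_user alice myfirstpool/myfirstDS followed by show_delegations myfirstpool/myfirstDS should list alice with the two permissions.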
Final words and lessons learned
ZFS on Linux, while still in development, showed strong capabilities and supported many of the features found in the Solaris/OpenIndiana implementation. It also seems to be very stable, as no crashes or kernel oopses happened while writing this tutorial. Funtoo does not officially support installation over ZFS datasets; however, you can always read the ZFS Install Guide to set up a Funtoo box relying on ZFS!
Footnotes & references
Source: solaris-zfs-administration-guide