2009-11-15

FUSE protocol tutorial for Linux 2.6

Introduction and copyright

This is a tutorial on writing FUSE (Filesystem in UserSpacE for Linux and other systems) servers that speak the FUSE protocol and its wire format directly, as used on the /dev/fuse device for communication between the Linux 2.6 kernel (FUSE client) and the userspace filesystem implementation (FUSE server). This tutorial is useful for writing a FUSE server without using libfuse. This tutorial is not a reference: it doesn't contain everything, but you should be able to figure out the rest yourself.

This document has been written by Péter Szabó on 2009-11-15. It can be used and redistributed under a Creative Commons Attribution-Share Alike 2.5 Switzerland License.

This tutorial was based on versions 7.5 and 7.8 of the FUSE protocol (FUSE_KERNEL_VERSION . FUSE_KERNEL_MINOR_VERSION), but it should work with newer versions in the 7 major series.

This tutorial doesn't give all the details. For that see the sample Python source code for a FUSE server.

Further reading

There is no complete and up-to-date reference manual for the FUSE protocol. The best documents and sources are:

Requirements

  • Linux 2.6 (even though FUSE has been ported to many operating systems, this tutorial focuses on the default Linux implementation);
  • the fuse kernel module 7.5 or later compiled and loaded;
  • familiarity with writing a FUSE server (with libfuse or one of the Perl, Python, Ruby or other scripting language bindings);
  • support for running external programs (like with system(3)) in the programming language;
  • support for creating socketpairs (socketpair(2)) in the programming language;
  • support for receiving filehandles (with recvmsg(2)) in the programming language (this is tricky, see below).

Overview of the operation of a FUSE server

This overview assumes that the FUSE server is single-threaded.
  1. Fetch the mount point and the mount options from the command-line.
  2. Optionally, create the directory $MOUNT_POINT (libfuse doesn't do this).
  3. Optionally, do a fusermount -u $MOUNT_POINT to clean up an existing or stale FUSE filesystem in the mount point directory (libfuse doesn't do this).
  4. Run fusermount(1) to mount a filesystem.
  5. Receive the /dev/fuse filehandle from fusermount(1).
  6. Receive, process and reply to the FUSE_INIT message.
  7. In an infinite loop:
    1. Receive a message (with a buffer of ≥ 8192 bytes, recommended 65536 + 100 bytes) from the FUSE client on the file descriptor.
    2. If ENODEV is the result, or FUSE_DESTROY is received, break from the loop.
    3. Process the message.
    4. Send the reply on the file descriptor, except for FUSE_FORGET.
  8. Clean up so your backend filesystem remains in consistent state.
  9. Exit from the FUSE server process.
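
The loop in step 7 could be sketched in Python roughly as follows. This is only a minimal single-threaded sketch: serve and the handle_message callback are hypothetical names (not part of any FUSE library), and the only thing the loop itself decodes is the opcode, which is the second uint32 of the message header described later in this tutorial.

```python
import errno
import os
import struct

FUSE_FORGET = 2
FUSE_DESTROY = 38
BUFFER_SIZE = 65536 + 100  # the safe receive buffer size recommended in step 7.1


def serve(dev_fuse_fd, handle_message):
    """Run the request loop until unmount (ENODEV) or FUSE_DESTROY.

    handle_message(opcode, message) is a hypothetical callback returning the
    complete reply as bytes; its result is ignored for FUSE_FORGET.
    """
    while True:
        try:
            # Each read returns exactly one whole message.
            message = os.read(dev_fuse_fd, BUFFER_SIZE)
        except OSError as e:
            if e.errno == errno.ENODEV:  # the filesystem has been unmounted
                break
            raise
        # The opcode is the second uint32 of the input message header.
        opcode = struct.unpack_from("=I", message, 4)[0]
        if opcode == FUSE_DESTROY:
            break
        reply = handle_message(opcode, message)
        if opcode != FUSE_FORGET:  # FUSE_FORGET gets no reply
            os.write(dev_fuse_fd, reply)  # one whole reply per write
```

Cleanup (step 8) would go after the call to serve returns.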

fusermount(1), when called as above, does the following:

  1. Uses its setuid bit to run as root.
  2. Opens the character device /dev/fuse.
  3. Mounts the filesystem with the mount(2) system call, passing the file descriptor of /dev/fuse to it.

Steps of running fusermount(1) and obtaining the /dev/fuse file descriptor:

  1. Create a socketpair (AF_UNIX, SOCK_STREAM) with fd0 and fd1 as file descriptors.
  2. Run (with system(3)): export _FUSE_COMMFD=$FD0; fusermount -o $OPTS $MOUNT_POINT . Example: export _FUSE_COMMFD=3; fusermount -o ro /tmp/foo .
  3. Receive the /dev/fuse file descriptor from fd1. This is tricky. See receive_fd function in the sample receive_fd.c for this. The sample fuse0.py contains a Python implementation (using the ctypes or the dl module to call C code).
  4. Close fd0 and fd1.
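
The four steps above might look like this in Python 3, where the standard socket module can read the SCM_RIGHTS ancillary data directly, so no C helper is needed. This is a sketch under assumptions, not the code from fuse0.py: receive_fd and mount_and_get_fd are illustrative names, and it assumes that fusermount sends the descriptor before it exits.

```python
import array
import os
import socket
import subprocess


def receive_fd(sock):
    """Receive one file descriptor sent as SCM_RIGHTS ancillary data."""
    fds = array.array("i")
    # Read 1 byte of ordinary data plus room for one descriptor.
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any truncated trailing bytes before decoding.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise OSError("no file descriptor received")


def mount_and_get_fd(mount_point, opts="ro"):
    """Run fusermount and return the /dev/fuse file descriptor it passes back."""
    fd0, fd1 = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        env = dict(os.environ, _FUSE_COMMFD=str(fd0.fileno()))
        # pass_fds keeps fd0 open (under the same number) in the child.
        subprocess.check_call(["fusermount", "-o", opts, mount_point],
                              env=env, pass_fds=(fd0.fileno(),))
        return receive_fd(fd1)
    finally:
        fd0.close()
        fd1.close()
```

Closing fd0 and fd1 does not close the received descriptor, which is a separate file descriptor owned by the server process.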

Wire format and communication

Once you have received the /dev/fuse file descriptor, do a read(dev_fuse_fd, buf, 8192) on it to read the FUSE_INIT message, then send your reply. After that, keep reading messages and reply to each of them in sequence (except for FUSE_DESTROY and FUSE_FORGET messages, which don't require a reply). All input (FUSE client → server) message types share the same fixed-length header format, but a message may also contain optional, possibly variable-length parts, depending on the message type (opcode). Nevertheless, the whole message must be read in a single read(2), so you have to preallocate a buffer for that (at least 8192 bytes; it may have to be larger based on the FUSE_INIT negotiation, so preallocate 65536 + 100 bytes to be safe). All integers in messages are unsigned (except for the error field of the reply, which holds the negative of an errno value) and use the host's native byte order.

The input message header is:

  • uint32 size; size of the message in bytes, including the header;
  • uint32 opcode; one of the FUSE_* constants describing the message type and the interpretation of the rest of the header;
  • uint64 unique; unique identifier of the message, must be repeated in the reply;
  • uint64 nodeid; nodeid (describing a file or directory) this message applies to (can be FUSE_ROOT_ID == 1, or a larger number which you have returned in a previous FUSE_LOOKUP reply);
  • uint32 uid; the fsuid (user ID) of the process initiating the operation (use this for access control checks if needed);
  • uint32 gid; the fsgid (group ID) of the process initiating the operation (use this for access control checks if needed);
  • uint32 pid; the PID (process ID) of the process initiating the operation;
  • uint32 padding; zeroes to pad up to 64-bits.
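
As an illustration, the 40-byte header listed above can be decoded with Python's struct module. The names IN_HEADER, InHeader and parse_in_header are made up for this sketch; the "=" prefix selects native byte order with standard sizes, which matches how the kernel lays out the fields.

```python
import collections
import struct

# uint32 size, uint32 opcode, uint64 unique, uint64 nodeid,
# uint32 uid, uint32 gid, uint32 pid, uint32 padding  (40 bytes in total)
IN_HEADER = struct.Struct("=IIQQIIII")

InHeader = collections.namedtuple(
    "InHeader", "size opcode unique nodeid uid gid pid padding")


def parse_in_header(message):
    """Split a raw message into its header and the opcode-specific rest."""
    header = InHeader._make(IN_HEADER.unpack_from(message))
    return header, message[IN_HEADER.size:header.size]
```

For example, a FUSE_LOOKUP message for the name foo carries the 4 bytes b"foo\0" after the header, so its size field is 44.
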
The interpretation of the rest of the input message depends on the opcode. The most common input message types are:
  • FUSE_LOOKUP = 1: input is a '\0'-terminated filename without slashes (relative to nodeid), output is struct fuse_entry_out;
  • FUSE_FORGET = 2: input is a struct fuse_forget_in, there is no output message;
  • FUSE_GETATTR = 3: input is empty, output is struct fuse_attr_out;
  • FUSE_OPEN = 14: input is struct fuse_open_in, output is struct fuse_open_out;
  • FUSE_READ = 15: input is struct fuse_read_in, output is the byte sequence read;
  • FUSE_RELEASE = 18: input is struct fuse_release_in, output is empty;
  • FUSE_INIT = 26: input is struct fuse_init_in, output is struct fuse_init_out;
  • FUSE_OPENDIR = 27: input is struct fuse_open_in, output is struct fuse_open_out;
  • FUSE_READDIR = 28: input is struct fuse_read_in, output is the byte sequence read (serialized as FUSE-specific dirents);
  • FUSE_RELEASEDIR = 29: input is struct fuse_release_in, output is empty;
  • FUSE_DESTROY = 38: input is empty; there is no output message.
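
For reference in code, the opcodes listed above can be kept in a small enumeration. This is just one possible arrangement; Opcode, NO_REPLY and opcode_name are illustrative names, and only the numeric values come from the list above.

```python
import enum


class Opcode(enum.IntEnum):
    FUSE_LOOKUP = 1
    FUSE_FORGET = 2
    FUSE_GETATTR = 3
    FUSE_OPEN = 14
    FUSE_READ = 15
    FUSE_RELEASE = 18
    FUSE_INIT = 26
    FUSE_OPENDIR = 27
    FUSE_READDIR = 28
    FUSE_RELEASEDIR = 29
    FUSE_DESTROY = 38


# Opcodes for which, according to the list above, no output message is sent.
NO_REPLY = frozenset([Opcode.FUSE_FORGET, Opcode.FUSE_DESTROY])


def opcode_name(value):
    """Return a printable name for an opcode, useful for debug logging."""
    try:
        return Opcode(value).name
    except ValueError:
        return "unknown opcode %d" % value
```
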
For a read-only filesystem with some files and directories, it is enough to implement only the opcodes above. See more opcodes and their corresponding C structs in the table. The linked document contains more details about some of the message fields. The complete up-to-date opcodes and message structs can be found in fuse_kernel.h.

Each reply output message (FUSE server → client) starts with this header:
  • uint32 size; size of the message in bytes, including the header;
  • int32 error; zero for successful completion, a negative errno value (such as -EIO or -ENOENT) on failure; upon failure, only the reply header is sent;
  • uint64 unique; unique identifier copied from the input message;
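
A reply can be assembled with the struct module as sketched below. The helper names pack_reply and pack_error are invented for this example; the 16-byte layout is the one listed above.

```python
import errno
import struct

# uint32 size, int32 error, uint64 unique  (16 bytes in total)
OUT_HEADER = struct.Struct("=IiQ")


def pack_reply(unique, payload=b""):
    """Build a successful reply: header followed by the opcode-specific payload."""
    return OUT_HEADER.pack(OUT_HEADER.size + len(payload), 0, unique) + payload


def pack_error(unique, errno_value):
    """Build a failure reply: only the header, with a negative errno."""
    return OUT_HEADER.pack(OUT_HEADER.size, -errno_value, unique)
```

For instance, pack_error(unique, errno.ENOENT) is a plausible answer to a FUSE_LOOKUP for a name that does not exist.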

Please note that you have to write the whole reply at once (one write(2) call). Using any kind of buffered IO (such as stdio.h or C++ streams) can lead to problems, so don't do that.

Feel free to experiment: whatever junk you write as a reply, it won't make the kernel crash, but you'll get an EINVAL errno for the write(2) call.

Your FUSE server doesn't have to implement all possible operations (opcodes). By default, you can just return ENOSYS as errno for any operation (except for FUSE_INIT, FUSE_DESTROY and FUSE_FORGET) you don't want to implement.

Common errno values the FUSE server can return:

  • ENOSYS: The operation (opcode) is not implemented.
  • EIO: Generic I/O error, if other errno values are not appropriate.
  • EACCES: Permission denied.
  • EPERM: Operation not permitted. Most of the time you need EACCES instead.
  • ENOENT: No such file or directory.
  • ENOTDIR: Not a directory. Return it if a directory operation was attempted on a nodeid which is not a directory.
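
One way to apply these conventions is to let operation handlers raise OSError and to translate that into the error field of the reply, falling back to ENOSYS when no handler exists. The sketch below assumes a hypothetical handler(body) callable that returns the reply payload as bytes; reply_for is likewise an invented name.

```python
import errno
import struct

OUT_HEADER = struct.Struct("=IiQ")  # uint32 size, int32 error, uint64 unique


def reply_for(unique, handler, body):
    """Run handler(body) and build the reply; handler is None if unimplemented."""
    if handler is None:
        # Operation not implemented by this server.
        return OUT_HEADER.pack(OUT_HEADER.size, -errno.ENOSYS, unique)
    try:
        payload = handler(body)
    except OSError as e:
        # Fall back to the generic EIO if the exception carries no errno.
        return OUT_HEADER.pack(OUT_HEADER.size, -(e.errno or errno.EIO), unique)
    return OUT_HEADER.pack(OUT_HEADER.size + len(payload), 0, unique) + payload
```

With this arrangement a handler can simply raise OSError(errno.ENOTDIR, "not a directory") and the caller never has to build error replies by hand.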

The format of struct fuse_init_in used in FUSE_INIT:

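  • uint32 init_major; the FUSE_KERNEL_VERSION in the kernel; must be exactly the same as the major version your code supports;
  • uint32 init_minor; the FUSE_KERNEL_MINOR_VERSION in the kernel; must be at least what your code supports;
  • uint32 init_readahead; the maximum readahead size (in bytes) the kernel is prepared to use;
  • uint32 init_flags; bit mask of optional features the kernel supports (FUSE_* flag constants in fuse_kernel.h);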

The format of struct fuse_init_out reply used in FUSE_INIT:

  • uint32 major; the same as FUSE_KERNEL_VERSION in the input;
  • uint32 minor; at most FUSE_KERNEL_MINOR_VERSION (init_minor) in the input, feel free to set it to less if you don't support the newest version;
  • uint32 max_readahead; the maximum readahead size in bytes; set it to 65536;
  • uint32 flags; bit mask of the optional features you want to enable; set it to 0;
  • uint32 unused; set it to 0;
  • uint32 max_write; the maximum number of data bytes in a single write request; set it to 65536.
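
The FUSE_INIT exchange described by the two structs above could be handled as follows. This is a sketch with assumptions: build_init_out is an invented name, the server is assumed to understand protocol 7.8 at most (the newest version this tutorial was based on), and the constants 65536 and 0 are the values recommended in the list above.

```python
import struct

FUSE_KERNEL_VERSION = 7
FUSE_KERNEL_MINOR_VERSION = 8  # assumption: newest minor version this server understands

INIT_IN = struct.Struct("=IIII")     # init_major, init_minor, init_readahead, init_flags
INIT_OUT = struct.Struct("=IIIIII")  # major, minor, max_readahead, flags, unused, max_write


def build_init_out(init_in_body):
    """Build the struct fuse_init_out payload from a struct fuse_init_in body."""
    major, minor, _readahead, _flags = INIT_IN.unpack_from(init_in_body)
    if major != FUSE_KERNEL_VERSION:
        raise ValueError("unsupported FUSE major version %d" % major)
    # Never claim a newer minor version than the kernel offered.
    reply_minor = min(minor, FUSE_KERNEL_MINOR_VERSION)
    return INIT_OUT.pack(FUSE_KERNEL_VERSION, reply_minor, 65536, 0, 0, 65536)
```

The returned 24 bytes are the payload only; the reply header still has to be prepended before writing.
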
You have to implement FUSE_GETATTR to let the user do an ls -l (or stat(2)) on the mount point. It will be called with nodeid FUSE_ROOT_ID (== 1) for the mount point.

The format of struct fuse_attr_out reply used in FUSE_GETATTR:

  • uint64 attr_valid; number of seconds the kernel is allowed to cache the attributes returned, without issuing a FUSE_GETATTR call again; a zero value is OK; for non-networking filesystems you can set a very high value, since nobody else would change the attributes anyway;
  • uint32 attr_valid_nsec; number of nanoseconds to add to attr_valid;
  • uint32 padding; to 64 bits;
  • struct fuse_attr attr; node attributes (permissions, owners etc.).

The format of struct fuse_attr reply used in FUSE_GETATTR and FUSE_LOOKUP:

  • uint64 ino; inode number copied to st_ino; can be any positive integer, the kernel doesn't depend on its uniqueness; it has no relation to nodeids used in FUSE (except for the name);
  • uint64 size; file size in bytes (or 0 for devices); make sure you set it correctly, because the kernel would truncate reads at this size even if your FUSE_READ returns more; be aware of the size being cached (for the validity period returned in struct fuse_attr_out);
  • uint64 blocks; number of 512-byte blocks occupied on disk; you can safely set it to zero or any arbitrary value;
  • uint64 atime; the last access (read) time, in seconds since the Unix epoch;
  • uint64 mtime; the last content modification (write) time, in seconds since the Unix epoch;
  • uint64 ctime; the last attribute (inode) change time, in seconds since the Unix epoch;
  • uint32 atime_ns; nanoseconds part of atime;
  • uint32 mtime_ns; nanoseconds part of mtime;
  • uint32 ctime_ns; nanoseconds part of ctime;
  • uint32 mode; file type and permissions; example file: S_IFREG | 0644; example directory: S_IFDIR | 0755;
  • uint32 nlink; total number of hard links; set it to 1 for both files and directories by default; for directories, you can speed up some listing operations (such as find(1)) by setting it to 2 + the number of subdirectories;
  • uint32 uid; user ID of the owner;
  • uint32 gid; group ID of the owner;
  • uint32 rdev; device major and minor number for character devices (S_ISCHR(mode)) and block devices (S_ISBLK(mode)).
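
To make the two layouts above concrete, here is a sketch that packs a struct fuse_attr and wraps it in a struct fuse_attr_out. It follows the field list exactly as given in this tutorial (80 and 96 bytes); newer protocol minor versions extend struct fuse_attr, so treat the sizes as an assumption tied to the protocol versions covered here. pack_attr and pack_attr_out are invented names.

```python
import stat
import struct

# ino, size, blocks, atime, mtime, ctime (uint64 each), then
# atime_ns, mtime_ns, ctime_ns, mode, nlink, uid, gid, rdev (uint32 each)
ATTR = struct.Struct("=QQQQQQIIIIIIII")  # 80 bytes
ATTR_OUT_PREFIX = struct.Struct("=QII")  # validity seconds, nanoseconds, padding


def pack_attr(ino, size, mode, mtime=0, nlink=1, uid=0, gid=0, rdev=0):
    """Pack a struct fuse_attr, using mtime for all three timestamps."""
    blocks = (size + 511) // 512  # number of 512-byte blocks, rounded up
    return ATTR.pack(ino, size, blocks, mtime, mtime, mtime,
                     0, 0, 0, mode, nlink, uid, gid, rdev)


def pack_attr_out(attr, valid_seconds=0):
    """Pack a struct fuse_attr_out: cache validity followed by the attributes."""
    return ATTR_OUT_PREFIX.pack(valid_seconds, 0, 0) + attr
```

A FUSE_GETATTR reply payload for the mount point could then be pack_attr_out(pack_attr(1, 0, stat.S_IFDIR | 0o755, nlink=2)).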

Nodeid and generation number rules

In FUSE_LOOKUP you should return entry_nodeid and generation numbers. If I understand correctly, the following rules hold:
  1. When a (nodeid, name) pair is looked up which you have never returned before, you can return any entry_nodeid and generation number (except for those which are in use, see below). These two numbers together uniquely identify the node for the kernel.
  2. When called again for the same (nodeid, name) pair, you must return the same entry_nodeid and generation numbers. (So you must remember what numbers you have returned previously.)
  3. You should count the number of FUSE_LOOKUP requests on the same (nodeid, name). When you receive a FUSE_FORGET request for the specified entry_nodeid, you must decrement the counter by the nlookup field of the FUSE_FORGET request. Once the counter is 0, you may safely forget about the entry_nodeid (so it is no longer considered to be in use), and next time you may return the same or a different nodeid of your choice for the same (nodeid, name), but with an increased generation number.
  4. You must never return the same nodeid with the same generation number again for a different inode, even after FUSE_FORGET dropped the reference counter to 0. That is: nodeids that have been released by the kernel may be recycled with a different generation number (but not with the same one!).
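
The bookkeeping implied by these rules might be sketched as the small table below. It is only one possible reading of the rules (which the text above itself states with some uncertainty): NodeTable is an invented class, released nodeids are recycled with an increased generation number, and lookup is meant to be called once per successful FUSE_LOOKUP reply.

```python
FUSE_ROOT_ID = 1


class NodeTable:
    """Tracks (parent nodeid, name) -> (nodeid, generation) with lookup counts."""

    def __init__(self):
        self.by_key = {}  # (parent nodeid, name) -> nodeid
        self.info = {}    # nodeid -> [key, generation, lookup count]
        self.free = []    # (nodeid, last generation) pairs released by FUSE_FORGET
        self.next_nodeid = FUSE_ROOT_ID + 1

    def lookup(self, parent, name):
        """Return (nodeid, generation) for a FUSE_LOOKUP reply and count it."""
        key = (parent, name)
        nodeid = self.by_key.get(key)
        if nodeid is None:
            if self.free:
                # Recycle a released nodeid, but never with the same generation.
                nodeid, old_generation = self.free.pop()
                generation = old_generation + 1
            else:
                nodeid, generation = self.next_nodeid, 0
                self.next_nodeid += 1
            self.by_key[key] = nodeid
            self.info[nodeid] = [key, generation, 0]
        entry = self.info[nodeid]
        entry[2] += 1
        return nodeid, entry[1]

    def forget(self, nodeid, nlookup):
        """Handle FUSE_FORGET: drop nlookup references, release the nodeid at 0."""
        entry = self.info.get(nodeid)
        if entry is None:
            return
        entry[2] -= nlookup
        if entry[2] <= 0:
            del self.info[nodeid]
            del self.by_key[entry[0]]
            self.free.append((nodeid, entry[1]))
```

A simpler alternative is never to recycle nodeids at all (a 64-bit counter will not run out in practice), in which case the generation number can stay constant.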

How to list the entries in a directory

TODO

How to read the contents of a file

TODO

Other TODO

This tutorial is a work in progress. In the meantime, please see the sample Python source code for a FUSE server.

Open questions

  • How does nodeid and generation allocation and deallocation work?
  • How to run a multithreaded FUSE server?

1 comment:

Unknown said...

I've been looking at fuse and it's handling of extended file attributes (xattrs) in relation to SELinux. A RedHat guy created a kernel patch which during the mount make a getxattr call to see if the filesystem supports xattrs. However this doesn't work with fuse currently because of the serialization of the mount and the opening of /dev/fuse. Can you see anyway to decouple these actions so that the kernel could talk to the userspace fuse server before the mount was complete?