Ideas For A Semantic File System

My recent discovery of Tracker has prompted me to revisit what my design for a semantic, database-driven file system would be. Instead of writing about what a semantic file system (SFS) is, I thought I'd just launch into a description of what I think an SFS should look like and how it should work.

First of all, I believe the most logical thing to do is use a fully database-driven back end for the file system. Microsoft's experimental file system WinFS, which was originally supposed to be the default in Vista (then code-named Longhorn), used something like this but seemed to be built on top of NTFS. I don't think building on top of an existing file system is necessarily a bad thing, but I don't think it's strictly required and it might just make things more complex. I'd have to look at the details a bit more though, as an idea I had (which I'll go into later) might be easier if the SFS is a pseudo-file system.

Having a full database allows us to leverage the power of an existing database system straight away. Databases are good at storing lots of data and accessing it quickly. An SFS would be required to find files based on meta information about the desired files, so being able to do fast queries is essential. I believe a small, embedded database such as SQLite is preferable since large database management systems (DBMS) like MySQL take up a lot of room on disk and in memory. However, for the computers of the near future, being able to scale to multiple processors will be necessary to leverage all the power of a system, so I'd have to check if SQLite is capable of doing that efficiently.
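To make the back end a bit more concrete, here is a minimal sketch of what the storage could look like, using Python's built-in sqlite3 module. The table names, column names and helper functions (open_store, add_file, find_by_metadata) are all made up for illustration. Storing metadata as key/value rows is just one possible layout, not a settled design.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    content BLOB
);
CREATE TABLE IF NOT EXISTS metadata (
    file_id INTEGER NOT NULL REFERENCES files(id),
    key     TEXT NOT NULL,
    value   TEXT NOT NULL,
    PRIMARY KEY (file_id, key)
);
-- Index so that lookups by metadata key and value avoid a full table scan.
CREATE INDEX IF NOT EXISTS idx_metadata_kv ON metadata(key, value);
"""


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_file(conn, name, content=b"", metadata=None):
    """Insert a file and its metadata items, returning the new file id."""
    cur = conn.execute(
        "INSERT INTO files (name, content) VALUES (?, ?)", (name, content)
    )
    file_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metadata (file_id, key, value) VALUES (?, ?, ?)",
        [(file_id, k, str(v)) for k, v in (metadata or {}).items()],
    )
    conn.commit()
    return file_id


def find_by_metadata(conn, key, value):
    """Return the names of files whose metadata item `key` equals `value`."""
    rows = conn.execute(
        "SELECT f.name FROM files f JOIN metadata m ON m.file_id = f.id "
        "WHERE m.key = ? AND m.value = ? ORDER BY f.name",
        (key, value),
    )
    return [name for (name,) in rows]
```

The index on (key, value) is what keeps the "find files by meta information" case fast as the number of files grows.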

Well, now that I've explained the back end, I'll explain my thoughts on how I think everything should work. The main idea here that's different from older SFSes, and that Tracker seems to do, is to use common Web-based semantic concepts to organise files, not just metadata. The most important is tagging. Tagging is quite popular in the Web 2.0 world and, for my SFS, would basically work by attaching tags (descriptive text strings) to a file. Files could have an arbitrary number of tags attached to them. An example of this could be a PDF I download from IEEE Xplore about semantic file systems – I could add tags "IEEE", "semantic", "file system" or whatever else.

The other idea from the web is, of course, searching. As files are added to the system they are indexed for future searches and, at the same time, metadata is extracted from them. Metadata is simply information about a file. Examples of this are the ID3 tags for MP3s, the author, title and other fields for Word documents, and Exif data in images. On top of this you could also use creation and modification timestamps as metadata, as well as Unix file permissions. The file type itself could also be stored as metadata (maybe as a MIME type, to continue with the web-based concepts) so that we don't rely on things like the file extension to let us know what to do with the file.
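As a rough illustration of the extraction step, the sketch below gathers the generic metadata mentioned above (size, modification time, permissions) and guesses a MIME type from the first few bytes of the file instead of its extension. The function names and the tiny magic-number table are my own assumptions. A real indexer would recognise far more types and would also parse format-specific data such as ID3 or Exif.

```python
import os
import stat

# A few well-known magic numbers; this short list is for illustration only.
MAGIC = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"ID3", "audio/mpeg"),  # MP3 files that start with an ID3v2 header
]


def sniff_mime(path):
    """Guess a MIME type from the file's leading bytes, not its extension."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, mime in MAGIC:
        if head.startswith(magic):
            return mime
    return "application/octet-stream"  # generic fallback for unknown data


def basic_metadata(path):
    """Collect generic metadata that every file has."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        "modified": st.st_mtime,  # seconds since the epoch
        "permissions": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "mime": sniff_mime(path),
        # Creation time is left out: it is not portably available on Unix.
    }
```

Each key/value pair returned here could then be stored as a metadata item for the file.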

The idea behind all of these things (tags, indexing and metadata) is to let us find the files we want to deal with more efficiently. On the whole, it seems logical that accessing files in this manner would be better, but we immediately run into the problem of backwards compatibility with existing file systems and system software. I'm going to be talking about Unix systems in general but the same ideas should apply to Windows systems as well.

The main piece of software, or should I say group of software, is the shell and its related tools. The shell allows us to access and modify files so it's natural that we'd need a way to work with programs like ls, which expect the file system to be laid out in a hierarchical manner. I think the easiest way to do this is to allow tags and metadata to be used as directories. This is not a new idea and I know at least one SFS paper I read proposed the same thing, albeit with different syntax and lacking tags.

So, the way you list files with tags is by using each tag as a directory name. For example:

ls IEEE/paper/file\ system/

The above would list all files with "IEEE", "paper" and "file system" as tags. I don't believe order should matter at all so this would return the same files:

ls file\ system/IEEE/paper/

You can see that the use of tags is a way of narrowing down the number of files that we want to look at. Not using any arguments to ls would simply list all the files in our home directory.
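Here is a small sketch of how a tag path like the ones above could be resolved against a database. It assumes a hypothetical tags table that links file ids to tag strings, and it receives the path after the shell has already removed the backslash escapes. Since the path components are collected into a set, the order of the "directories" makes no difference.

```python
import sqlite3


def open_tag_store():
    """Create an in-memory store; table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tags (
            file_id INTEGER NOT NULL REFERENCES files(id),
            tag     TEXT NOT NULL,
            PRIMARY KEY (file_id, tag)
        );
        CREATE INDEX idx_tags_tag ON tags(tag);
    """)
    return conn


def add_tagged_file(conn, name, tags):
    """Insert a file and attach an arbitrary number of tags to it."""
    file_id = conn.execute(
        "INSERT INTO files (name) VALUES (?)", (name,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO tags (file_id, tag) VALUES (?, ?)",
        [(file_id, tag) for tag in set(tags)],
    )
    return file_id


def list_by_tag_path(conn, path):
    """List files carrying every tag named in `path` (order-independent)."""
    tags = sorted({part for part in path.split("/") if part})
    if not tags:
        # No tags given: list everything, like a bare `ls`.
        rows = conn.execute("SELECT name FROM files ORDER BY name")
        return [name for (name,) in rows]
    placeholders = ", ".join("?" for _ in tags)
    # Keep only the files that match all of the requested tags.
    rows = conn.execute(
        f"SELECT f.name FROM files f JOIN tags t ON t.file_id = f.id "
        f"WHERE t.tag IN ({placeholders}) "
        f"GROUP BY f.id HAVING COUNT(DISTINCT t.tag) = ? "
        f"ORDER BY f.name",
        (*tags, len(tags)),
    )
    return [name for (name,) in rows]
```

Each extra tag in the path adds one more condition a file has to meet, which is the narrowing-down behaviour described above.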

Metadata would be treated in a similar way but would allow for named arguments:

ls artist=metallica/

This would find all files that had a metadata item of "artist" equal to "Metallica" (I'm assuming case insensitivity). If we wanted to drill this down to something more specific:

ls artist=metallica/album=reload/

This concept could be extended to allow more advanced options for selecting metadata, such as similarity:

ls artist=metallica/album~load/

Comparisons (quoted, because an unquoted > would be read by the shell as output redirection):

ls 'artist=metallica/year>1995/'

And so on. Even though this isn't as powerful as a proper query it does allow us to use the existing infrastructure for accessing files and should be fairly simple for people to understand. I guess one question is whether or not to treat tags as another form of metadata, which might simplify the development process and allow us to use comparisons and such for tags as well:

ls tags~metal/album~load/

I think this might get a bit confusing though. Most likely you'd be using tags to find your files anyway, which brings up the question of whether or not tags should be automatically created from metadata ("auto-tagging"). I think automatically adding tags is a bad idea in general as you want to effectively separate tags, which are only created by the user, from metadata, which is usually created by the system as part of indexing and extraction.
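To show that the path syntax from the examples above is simple to handle, here is a sketch of a parser that turns each path component into either a tag filter or a metadata filter, plus a function that checks a file record against those filters. Everything in it is an assumption on my part: the operator set (=, ~, > and <), the substring match standing in for "similarity", the in-memory record layout and the sample data.

```python
import re

# key, operator, value. The operator set is an assumption for this sketch.
COMPONENT = re.compile(r"^([A-Za-z_]\w*)(=|~|>|<)(.+)$")


def parse_path(path):
    """Turn 'artist=metallica/metal/' into a list of filter tuples."""
    filters = []
    for part in path.split("/"):
        if not part:
            continue  # skip the empty pieces around leading/trailing slashes
        m = COMPONENT.match(part)
        if m:
            key, op, value = m.groups()
            filters.append(("meta", key.lower(), op, value))
        else:
            # Anything without an operator is treated as a plain tag.
            filters.append(("tag", None, "=", part))
    return filters


def _compare(op, actual, wanted):
    a, w = str(actual).lower(), wanted.lower()  # case-insensitive matching
    if op == "=":
        return a == w
    if op == "~":
        return w in a  # "similarity" approximated by a substring match
    try:
        a_num, w_num = float(actual), float(wanted)
    except (TypeError, ValueError):
        return False  # ordering comparisons only make sense for numbers here
    return a_num > w_num if op == ">" else a_num < w_num


def matches(record, filters):
    """Check a record like {'tags': {...}, 'meta': {...}} against filters."""
    for kind, key, op, value in filters:
        if kind == "tag":
            if value not in record["tags"]:
                return False
        elif key not in record["meta"]:
            return False
        elif not _compare(op, record["meta"][key], value):
            return False
    return True


# Sample record, made up for illustration.
track = {"tags": {"metal"}, "meta": {"artist": "Metallica", "album": "Reload"}}
found = matches(track, parse_path("metal/artist=metallica/album~load/"))
```

Keeping tags and metadata as separate kinds of filter in the parser also reflects the separation argued for above: a bare component is always a user tag, and only components with an operator touch metadata.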

Searching could be done through a purpose-built query program as it's not a common operation done in the shell. Alternatively, another pseudo-directory could be used for this:

ls search/haruhi\ suzumiya/

This would conflict with tags named "search" though so we might need to use something more neutral:

ls :/haruhi\ suzumiya/

Similar problems are going to be faced with this as well though (what if someone names a tag ":"?) and also with metadata (what if a tag is named "artist~john"?). These would be fringe cases though, as well as being generally silly, so it would probably be OK to simply disallow this behaviour. However, disallowing it requires us to decide where it's being prevented! Seeing as we're going for backwards compatibility, it would have to be done at the file system level which, unfortunately, would complicate things a little.
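Wherever the check ends up living, the rule itself could be quite small. The sketch below rejects tag names that would collide with the path syntax. The reserved name and character sets are assumptions based on the examples in this post, and validate_tag is a hypothetical helper, not part of any existing tool.

```python
# Assumed reserved sets, based on the syntax used in this post.
RESERVED_NAMES = {":"}             # the search pseudo-directory
RESERVED_CHARS = set("/=~<>")      # path separator and metadata operators


def validate_tag(name):
    """Return `name` if it is usable as a tag, otherwise raise ValueError."""
    if not name or name != name.strip():
        raise ValueError("tag must be non-empty with no surrounding spaces")
    if name in RESERVED_NAMES:
        raise ValueError(f"tag {name!r} is reserved")
    bad = sorted(RESERVED_CHARS & set(name))
    if bad:
        raise ValueError(f"tag {name!r} contains reserved characters: {bad}")
    return name
```

With this rule "file system" is accepted, while ":" and "artist~john" are rejected before they can ever be stored.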

Now, this is just the command line interface but hopefully you can see how backwards compatibility can be achieved whilst still allowing for a great deal of flexibility to use our database-driven SFS. I'll have to think about the graphical interface some more and how it can be integrated into programs like Nautilus.