As previously mentioned, I am in the process of moving all my online stuff (which basically comes down to web hosting, e-mail hosting, small file syncing and large file storage) to SDF. The main SDF server cluster is a bunch of NetBSD boxes with home directories, web directories and mailspools all mounted via NFS, with fairly tight quotas: as someone at the top level of membership, I only receive 250MB on each of the mounts.

However, as of today a new cluster has been opened with an absurd amount of storage, initially offering 100GB to every MetaARPA member, a figure that is set to rise with time:

-bash-4.1$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_ma-lv_root
                       50G  3.1G   44G   7% /
tmpfs                 3.9G     0  3.9G   0% /dev/shm
/dev/sda1             485M   69M  392M  15% /boot
/dev/mapper/vg_ma-lv_src
                       51G  6.8G   42G  15% /src
/dev/mapper/ma0-rd0   4.5T  6.4G  4.5T   1% /meta

…though as you can see, the disk space isn’t seeing much use yet. I imagine that eventually the MetaArray will become the cluster for MetaARPA users, but for now at least I am treating my data on the two clusters very differently and separately. Here’s a write-up of how I’m making use of my space and keeping it secure.

The MetaArray, by the way, is absurdly powerful: I just MD5’d 4GB in less than 10 seconds, whereas locally, 40 seconds on, my drive is still churning away. I imagine it will slow down as more people set it to work doing MD5s and the like.

My use of the main SDF cluster

Access

There is a round-robin address, tty.sdf.org, which I use to get a shell on one of the clustered machines. As a MetaARPA member I could use one of the private MetaARPA (or arpa) boxes and leave a screen session running there or something, but that would be wasting resources since I can just use the main cluster. Both telnet and SSH are available; the latter is not restricted to key-based logins, but given the existence of the telnet login I guess that restriction wouldn’t make much difference anyway.
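
A typical login, with my username replaced by a placeholder, is just the following; the -A flag forwards my SSH agent so that my private keys never need to live on the cluster (more on that in the security assessment below):

$ ssh -A username@tty.sdf.org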

Use

I read my e-mail here, and my Emacs setup updates my website and blog for me. I also run shell utilities to access other SDF resources, e.g. syncing cron jobs to the cron server.

The cluster also stores the master copies of my public git repositories: that is, config files, as opposed to ~/doc/. Once I set it up, these will be served via HTTP so that it’s easy for me to download config files into a new shell.
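
When I do set it up, something like the following, using git’s dumb HTTP transport, ought to do the trick (the repository layout and URL here are only illustrative):

$ git clone --bare ~/src/dotfiles ~/html/git/dotfiles.git
$ cd ~/html/git/dotfiles.git
$ mv hooks/post-update.sample hooks/post-update   # re-runs git update-server-info after each push
$ git update-server-info
$ # then, from any fresh shell elsewhere:
$ git clone http://username.sdf.org/git/dotfiles.git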

Data

The only data on the main NFS mounts that matters to me is the last month or so of e-mail. Each month my mail client renames all my mailboxes (yes, no Maildir; I’m using mboxes), appending e.g. -apr-11, and at some point I gzip these and put them in my git-annex repository, ~/var/. An encrypted copy of the mailboxes is sent to the MetaArray across the LAN; I then download this to my local machine, and drop the copy on the main SDF cluster. That’s the ideal, anyway: I may have to download from the main cluster and upload an encrypted copy to the MetaArray in the end, frustratingly.
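
In git-annex terms the monthly shuffle looks roughly like this (the mailbox name is illustrative, and metaarray is the encrypted special remote described below):

$ cd ~/var
$ git annex add INBOX-apr-11.gz
$ git commit -m "archive April 2011 mail"
$ git annex copy INBOX-apr-11.gz --to metaarray   # encrypted copy goes across the LAN
$ # then, on my local machine:
$ git annex get INBOX-apr-11.gz
$ # and back on the cluster, once copies exist elsewhere:
$ git annex drop INBOX-apr-11.gz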

duplicity backs up my data on the three NFS mounts to Amazon S3 each night. This saves the various SDF-specific helper scripts I have in ~/local/, and also that precious last month of e-mail.
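
The nightly job amounts to something like this; the bucket name, key file locations and GnuPG key ID are all invented for illustration, and the web and mail mounts get the same treatment:

#!/bin/sh
# nightly duplicity backup sketch
export AWS_ACCESS_KEY_ID=$(cat "$HOME/.s3-access-key")
export AWS_SECRET_ACCESS_KEY=$(cat "$HOME/.s3-secret-key")
duplicity --encrypt-key DEADBEEF "$HOME" s3+http://example-sdf-backup/home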

Security assessment

While I do trust the SDF admins, by using things like SSH agent forwarding I manage to avoid putting very much valuable data onto the cluster with this setup. The most valuable thing there is my Oxford login, used for POP3 pickup of my university e-mail. Of course, I care about my recent e-mail and my helper scripts a heck of a lot. The main flaw is that a set of S3 access keys has to be stored on the cluster so that duplicity can do its backups, but then, an attacker with those keys could just as easily log in and delete my backups. That’s the risk: not that the data be read, but that the backups be deleted so that I can’t get it back.

Of course this is not really a concern at all. No-one is going to attack SDF, with all the resources a successful attack would require, purely to delete Sean Whitton’s last month of e-mail. If the SDF were attacked for other reasons, chances are my home directory would be ignored for long enough that I could disable the access keys, so no problem.

My use of the MetaArray

Access

Standard SSH basically, for use with rsync etc.
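
For example (the hostname and paths are just for illustration):

$ rsync -avz ~/some/directory/ username@ma.sdf.org:backups/some-directory/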

Annex data

Using a git-annex special remote I am able to store my large files on the MetaArray, encrypted, with git-annex doing the hard work of keeping track of things. I cannot say enough about how useful git-annex is for this. It keeps track of how many copies are available and which machines they are on, and it can maintain a particular number of copies per subdirectory: e.g. at least two copies of my music library, whereas I can tolerate just one copy of, say, a rip of a DVD from home that I wanted to take to university with me.
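
Setting up the encrypted special remote and tweaking the copy counts looks roughly like this; the remote name, rsync URL and key ID are illustrative, and the exact encryption options depend on the git-annex version:

$ cd ~/annex
$ git annex initremote metaarray type=rsync rsyncurl=username@ma.sdf.org:annex encryption=hybrid keyid=DEADBEEF
$ git annex copy music/ --to metaarray
$ cat .gitattributes
music/* annex.numcopies=2
video/* annex.numcopies=1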

Right now I have about 150GB of annex’d files, and an additional 100GB or so written to DVD-Rs in my room at home. When standard hard drives get bigger and the MetaArray gets more space, I’d love to add these in too and destroy the discs, with everything safe in the cloud.

TrueCrypt volumes

I’m using a TrueCrypt volume to store my document git repositories. It’s only 4GB, because it will take a good long while for my documents to overspill that, and it would be very easy to add more TrueCrypt volumes for storing more stuff. It seems to work fine over sshfs; example usage:

$ # automount remote filesystems under ~/net/ on demand, via afuse + sshfs
$ afuse -o mount_template="sshfs %r: %m -ouid=1000,gid=1000,allow_root" -o unmount_template="fusermount -u -z %m" ~/net/
$ # map the TrueCrypt volume stored on the MetaArray onto a local mount point
$ truecrypt ~/net/ma/local/tc/gitcrypt.tc mnt/gitcrypt -p `cat ~/.file-containing-password`
< git pull/push/etc. >
$ # dismount the volume and then the sshfs automounts
$ truecrypt -d
$ fusermount -u -z ~/net/

Security assessment

All data on the MetaArray is encrypted (except basic shell config files, pulled via HTTP from the main SDF cluster, of course!). git-annex’d stuff is replicated locally, and the git repositories are obviously stored locally too. So everything’s fine.

Conclusion

It’s really great to have all my data stored safely on systems that aren’t under my control, so that I’m not responsible for keeping them up, but that I can trust; all I have to do is pay my MetaARPA membership dues once a year. The key is that almost everything is stored everywhere: if SDF disappears, or my computer and laptop blow up, I’m fine. The only catch is that the keys and passwords needed to access all this data have to be burned onto a DVD (encrypted with symmetric GPG encryption via a passphrase, of course) and kept somewhere, because if I lose my two computers I lose the keys needed to access SDF again and, whoops, then I’m lost. This matters because I don’t really trust my two local hard drives: one is very old, and the other has been abused by careless power management settings. Now to head to bed, leaving my local machine copying git-annex stuff up to the MetaArray.
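
For the record, producing that keys DVD is simple enough; the paths here are only illustrative:

$ tar czf keys.tar.gz ~/.ssh ~/.gnupg ~/.s3-access-key ~/.s3-secret-key
$ gpg --symmetric --cipher-algo AES256 keys.tar.gz   # prompts for the passphrase
$ growisofs -Z /dev/dvd -R -J keys.tar.gz.gpg
$ shred -u keys.tar.gz   # don't leave the unencrypted archive lying around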