Building a distributed data warehouse with FreeBSD
July 4, 2009

A data warehouse is, as the name implies, a place where large amounts of data are archived, usually for the purposes of future browsing and analysis.

The problem with a standard data warehouse is that it requires a fat server: in particular, a server with a large amount of disk space. This can be expensive, especially as consumer-level PCs rarely contain more than two drive bays. Meanwhile, whatever spare disk space is available is usually found on those same consumer-level PCs, not on the server.

Enter FreeBSD. With the right combination of tools, all of the consumer-level PCs on the LAN can be utilised to create a single, seamless, distributed data warehouse. Rather than concentrating the disk space in several large drives in the server, a distributed data warehouse spreads the chore of holding the data amongst the PCs. The server merely provides the interface and serves the files, although it could of course host data directories as well.

The required tools are as follows:

The distributed data warehouse is built by mounting, on the server, the shares exported by the workstations, and placing symlinks to the mountpoints in a directory that is browsable via Apache. This creates a website consisting of links to remote volumes. The fact that the volumes are remote is, however, transparent to the website user, who sees a single website. To the website user, the remote volumes look identical, and work identically, to directories on the server.

It is, in essence, a read-only SMB-to-HTTP gateway, which presents a unified interface to distributed resources. It allows each machine on the LAN to share files via HTTP, without needing to install a webserver on each machine.
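
For Apache to present the symlinked mountpoints as ordinary directories, the web directory generally needs both directory listings and symlink following enabled. The following is a minimal sketch, not a definitive configuration: it assumes the warehouse lives in /usr/local/www/warehouse and that Apache 2.4 access-control syntax applies, and the function name warehouse_directory_block is made up for illustration. It only prints a configuration block; where that block goes (httpd.conf or an included file) depends on your installation.

```shell
#!/bin/sh
# Print an Apache <Directory> block for the warehouse directory.
# Assumption: Apache 2.4 syntax ("Require all granted"); Apache 2.2
# uses "Order allow,deny" and "Allow from all" instead.
warehouse_directory_block() {
    docroot="$1"
    cat <<EOF
<Directory "$docroot">
    # Indexes: show a directory listing when there is no index page.
    # FollowSymLinks: let Apache descend into the symlinked mountpoints.
    Options +Indexes +FollowSymLinks
    Require all granted
</Directory>
EOF
}

warehouse_directory_block /usr/local/www/warehouse
```

Without FollowSymLinks (or SymLinksIfOwnerMatch), Apache refuses to serve content through the symlinks, so the remote volumes would not be reachable even though the mounts themselves work.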

Benefits:

Drawbacks:

How to build the warehouse:

Note: you will likely need root-level server access for the tasks below.

  1. Install FreeBSD and Apache and get them working. Ensure you can see, in your web browser, a directory listing of the root of the website you have created. You may wish to create an alias for your warehouse. If you want it to be accessible from the web, you'll probably need to use a dynamic DNS account and forward a port on your router; you can then give people an address such as http://warehouse.mydomain.dynamic-dns-provider.com/

  2. Go to each PC that will be a part of the warehouse, and create a shared directory, with permissions as appropriate. For example, create a world-readable share. Alternatively, first create a "warehouse" user on the PC and, when creating the share, give the warehouse user read permissions.

  3. On the server, log in as root, and create a mountpoint for each remote volume. For example, if you have two PCs called PC1 and PC2, each with two shared directories:
    mkdir /mnt/PC1-1
    mkdir /mnt/PC1-2
    mkdir /mnt/PC2-1
    mkdir /mnt/PC2-2
    

  4. On the server, change to the directory that will contain the symlinks (usually the root directory of the website), and create the symlinks. For example:
    cd /usr/local/www/warehouse
    ln -s /mnt/PC1-1/ PC1-1
    ln -s /mnt/PC1-2/ PC1-2
    ln -s /mnt/PC2-1/ PC2-1
    ln -s /mnt/PC2-2/ PC2-2
    

  5. On the server, mount the remote volumes (you may wish to create a script to do this). For example:
    mount_smbfs //warehouse@PC1/share-1 /mnt/PC1-1
    mount_smbfs //warehouse@PC1/share-2 /mnt/PC1-2
    mount_smbfs //warehouse@PC2/share-1 /mnt/PC2-1
    mount_smbfs //warehouse@PC2/share-2 /mnt/PC2-2
    

    Note: you'll need to enter the warehouse user's password for each mount_smbfs command. Also, the mounts are lost if the server is rebooted. To rebuild the warehouse after a reboot, re-run the mount_smbfs commands, entering the password for each share.

    Note: if mount_smbfs can't resolve the hostnames you provide, you can use IP addresses instead, enter the hostnames into /etc/hosts, or install Samba.

  6. (optional) Create an index page for the root of the website. If this is not done, Apache will show the directory listing in its usual way (directory listing must be permitted in .htaccess or httpd.conf for this to work).
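
Steps 3 to 5 can be collected into one script, as step 5 suggests. The following is a rough sketch rather than a tested tool: it assumes the PC1 and PC2 hosts, the share-1 and share-2 share names, and the warehouse user from the examples above. The function names run and add_share, and the DRY_RUN, WEBROOT and SMB_USER variables, are made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Rebuild the warehouse: create mountpoints, symlinks and SMB mounts.
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on the FreeBSD server) to actually run them.
DRY_RUN="${DRY_RUN:-1}"
WEBROOT="${WEBROOT:-/usr/local/www/warehouse}"
SMB_USER="${SMB_USER:-warehouse}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# add_share HOST N
# Example: "add_share PC1 2" mounts //warehouse@PC1/share-2 on /mnt/PC1-2.
add_share() {
    host="$1"
    n="$2"
    mnt="/mnt/$host-$n"
    run mkdir -p "$mnt"
    # -f and -n replace an existing symlink instead of nesting a new one.
    run ln -sfn "$mnt" "$WEBROOT/$host-$n"
    run mount_smbfs "//$SMB_USER@$host/share-$n" "$mnt"
}

for host in PC1 PC2; do
    for n in 1 2; do
        add_share "$host" "$n"
    done
done
```

Run as written, the script prints twelve commands (three per share) for review. With DRY_RUN=0, mount_smbfs still prompts for each password; its -N option, combined with a password entry in /etc/nsmb.conf or ~/.nsmbrc, is the usual way to avoid the prompts, but check mount_smbfs(8) on your release before relying on it.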

Notes:

So, you may ask, when would you want to serve large quantities of non-mission-critical data with consumer-level hardware?