How to create your own digital archive: DIY [a response to my own negativity]
A few months ago, I posted an article to this blog (“How to archive a digital file: Print it”) that some readers may have interpreted as a pessimistic screed concerning our futile attempts to archive digital materials for future generations [at least, I interpreted it that way].
Although my post was the product of considerable rumination (it hurt my brain and left me disheartened), I really hate to be the kind of person who relishes reporting what we can’t do. So I’m here to describe what we can do for better digital archiving [And for those souls who like DIY projects, I have included some notes so that you can do it yourself].
Besides the issues that I outlined in my last post, the main long-term problem with digital storage is “data degradation”. Things break down. Stuff rots. Entropy eats away at all order. In other words, your data decays and files won’t open.
How does this manifest? Bits of data disappear. This is sometimes referred to as “bit rot” [this term is used for other phenomena, so for disambiguation purposes, see this Wikipedia page], but I’ll just call it “data degradation”.
Data degradation occurs in all media (hard drives, floppy disks, compact discs, tapes, etc.), and it’s hard to detect. You usually don’t know it has happened until you find a file that doesn’t open. By then it may be too late to recover it, or you may have to spend a small fortune employing a data forensics expert to extract as much data as she can from the damaged file.
The solution is to have utilities that constantly check for “data integrity” on archival media. These data integrity verification systems employ, among other things, checksums and/or cyclic redundancy checks (CRCs). When they discover that a bit of data has been lost, they generate a report identifying which hard drive has been affected.
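To make the checksum idea concrete, here is a minimal Python sketch. It is not how ZFS or any particular product does it; the function names (sha256_of, build_manifest, find_damaged) are my own inventions for illustration, and I am assuming SHA-256 as the checksum. The idea is simply to record a fingerprint for every file while it is known to be good, then recompute the fingerprints later and report any file that no longer matches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (relative path -> digest)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose checksum no longer matches."""
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

You would run build_manifest once when the files are deposited, store the result somewhere safe, and then run find_damaged on a schedule. A single flipped bit anywhere in a file changes its digest, which is why this catches damage long before you try to open the file.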
But it’s not enough to know that a bit of data has disappeared. That lost bit of data has to be reinserted into the original file. That requires a backup copy.
In order to secure files from data degradation, digital archives require at least three components: 1) Multiple storage media; 2) Data integrity verification utilities; and 3) Backup systems. And all three must work in tandem. When a system detects that a bit of data has been lost, the archive must then retrieve the lost information from the backup system and reinsert it. But it must also advise the user whether the affected hard drive may be failing and that it needs to be replaced.
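The detect-then-restore loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real archive system: I assume the same file is kept as a full copy on several drives, that a trusted SHA-256 digest was recorded earlier, and the names digest and repair_copies are hypothetical. The function finds an intact copy, overwrites the damaged ones with it, and returns the list of copies it had to repair, so the user can see which drive is misbehaving.

```python
import hashlib
import shutil
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a whole file (fine for a sketch; read large files in chunks)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def repair_copies(copies: list[Path], expected: str) -> list[Path]:
    """Restore damaged or missing copies from an intact one.

    Returns the copies that needed repair, which doubles as a report
    of which drives may be failing.
    """
    good = [c for c in copies if c.is_file() and digest(c) == expected]
    if not good:
        raise RuntimeError("no intact copy left; cannot repair")
    bad = [c for c in copies if c not in good]
    for target in bad:
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(good[0], target)
    return bad
```

If the same drive keeps turning up in the returned list, that is the signal to replace it. Note the failure case: if every copy is damaged, nothing can be restored, which is why the number of independent copies matters.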
What if a hard drive fails completely? What if two hard drives fail completely at the same time? A good digital archive system will have enough redundant hard drives to survive such a calamity. Depending on the amount of storage needed, a user can configure a digital archive so that it survives even three or more hard drives failing at one time.
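For the curious, here is a small Python sketch of how redundancy can survive a failed drive without keeping a full second copy of everything. It uses simple XOR parity, which tolerates exactly one lost block; this is an illustration of the principle only, and surviving two or three simultaneous failures takes additional parity and more elaborate math than shown here. The function names (xor_blocks, rebuild_missing) are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the one lost block from the surviving blocks plus parity."""
    return xor_blocks(surviving + [parity])


# Three data "drives" plus one parity "drive".
drives = [b"ARCHIVE1", b"ARCHIVE2", b"ARCHIVE3"]
parity = xor_blocks(drives)

# Pretend the second drive died; rebuild its contents from the rest.
rebuilt = rebuild_missing([drives[0], drives[2]], parity)
assert rebuilt == drives[1]
```

The trade-off is capacity: each extra drive's worth of parity buys tolerance for one more failure, at the cost of one drive's worth of usable storage.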
FreeNAS (http://www.freenas.org/about/features.html) is an open-source operating system built for this type of need. It has the following components:
1) A FreeNAS machine contains multiple hard drives for storage and replication;
2) It employs the ZFS file system (https://www.freebsd.org/doc/handbook/zfs.html), which provides data integrity tools;
3) It employs a RAID-Z backup/replication structure for redundancy and restoration.
You can build your own FreeNAS machine. It isn’t any more complicated than building a regular computer. So for a fun DIY project:
1- Here is a guide for building your own FreeNAS machine: http://blog.brianmoses.net/2015/01/diy-nas-2015-edition.html
2- Here is a guide for configuring the operating system on a FreeNAS machine: https://drive.google.com/file/d/0BzHapVfrocfwbXYxcGgycEIzNG8/edit
Before you begin to build a FreeNAS machine, be sure to read the configuration guide. The configuration of the operating system is critical to protecting your data, especially if you plan to use the machine “at work”. The author of the configuration guide, someone who goes by the name of “Cyberjock”, is very good at enumerating all the pitfalls of poor configuration that will negate all the advantages of FreeNAS. In other words, if you don’t do it right, you may lose everything: Everything. All your data. Gone.
For those of you looking for solutions “at work” and who are too frightened to configure your own systems after reading Mr. Cyberjock’s guide, the FreeNAS website does advertise FreeNAS machines from a vendor. I have never used such a system, nor do I endorse one here. But they appear to exist.
I have read some literature from another vendor that sounds like its machines run FreeNAS, but the language is so vague that it was unclear whether the machines were “digital ARCHIVE” ready as described above. Data degradation is slow enough that most vendors aren’t concerned with providing tools to detect it. Most commercial storage systems are not created to store data for twenty, fifty, or a hundred years.
An alternative to FreeNAS is the software from the Data Conservancy Project begun at Johns Hopkins University: https://dataconservancy.org/. Its software runs on three Linux servers using Fedora (the archive application, not the Linux distribution of the same name) and Solr, both open-source applications.
Last summer I attempted to create my own system using the Data Conservancy software. I was not successful. I made it pretty far into the process, but in the end, there was a file from the Data Conservancy that created the library structure (or some such thing) in Solr that I just could not get to work.
For those interested in pursuing this as a DIY project, I have attached my step-by-step notes on the portion of the installation process that I completed successfully [I include my original document, an .odt file made using OpenOffice, which offers working hyperlinks that move you around to different portions of the process. Also attached will be a .doc version, but the hyperlinks don’t work in the Word version.] This includes many commands [for the dreaded Linux “command line”] as well as notes on discrepancies and other problems. If you follow the process and fail at the same spot I did, you still should have a working CentOS Linux machine running the Fedora archive application and Solr.
Recapitulation: Digital archiving can work if the system has multiple hard drives configured to take advantage of data integrity and of backup/restoration tools. Such systems are within the reach of even the most budget-beleaguered organizations, because you can do it yourself. With a little help from a Mr. Cyberjock.
Notes on Data Conservancy Project: OpenOffice version, with working hyperlinks