ECC/Checksumming support of file system for data

dan_mccurdy
V.I.P Member
Posts: 271
Joined: Wed Feb 15, 2012 2:34 am
Full Name: Daniel McCurdy
Company Details: Geometria Ltd.
Company Position Title: Technical Director
Country: New Zealand
Linkedin Profile: Yes
Location: Auckland, New Zealand

ECC/Checksumming support of file system for data

Post by dan_mccurdy »

This is possibly a bit esoteric, but I have recently been forced to think very hard about the limitations of some file systems for the long-term storage of laser scan data, and was wondering if other people have encountered this issue yet.

We have laser scan data dating back to 2003, which has been stored on a variety of file systems over the years - FAT32, NTFS, ReiserFS, EXT2, EXT3 and EXT4 - and almost all our old datasets from the early years show signs of bitrot (as per: http://en.wikipedia.org/wiki/Data_rot#D ... rage_media ). Pretty much every decade-old dataset has at least a few scans that now have errors in the database - generally only a few, but sometimes enough to take out entire scans.

This is caused by bits on the hard disk or in RAM flipping (going from a 1 to a 0 or vice versa) due to external sources, such as being hit by random cosmic rays. Most common file systems (such as those listed above) silently ignore the flipped bits. As time goes on you are pretty much guaranteed to have some bitrot in your datasets, and this will become more of an issue for everyone in the future as we collect more and more data (massively increasing the chance of bitrot).
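To make the failure mode concrete, here's a tiny illustration (nothing more than that): flip a single bit in a file and its SHA-256 digest changes completely, so a digest recorded at archive time immediately exposes corruption that the file system itself never reports.

```python
# Tiny illustration: a single flipped bit changes a file's checksum,
# so comparing against a previously stored digest exposes silent bitrot.
import hashlib

original = bytearray(b"laser scan payload ...")      # stand-in for scan data
good_digest = hashlib.sha256(original).hexdigest()   # digest recorded at archive time

corrupted = bytearray(original)
corrupted[5] ^= 0x01                                  # flip one bit, as a cosmic ray might

assert hashlib.sha256(corrupted).hexdigest() != good_digest
print("stored digest no longer matches: corruption detected")
```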

So how does everyone deal with this? Ignoring it is one option obviously - old projects are the most vulnerable and often we don't care too much about them now. Our solution is relatively simple - we've now moved all our server storage to ZFS (Sun/Oracle, primarily on Unix-like systems), which checksums data to detect and repair bitrot (among many other awesome features). A few other file systems can do this too - the only common ones I know of are BTRFS (Linux, still in development) and ReFS (Windows 8.1+ and Windows Server, still very new). RAID alone does not have the ability to repair silent bitrot (and can actually compound the problem).
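For anyone curious what that looks like in practice, here's a minimal sketch (the pool name is an assumption) of kicking off a ZFS scrub, which re-reads every block, verifies its checksum, and repairs it from a mirror or RAID-Z copy if verification fails:

```python
# Minimal sketch: trigger a ZFS scrub and report pool health.
# Assumes a pool named "tank" and the standard OpenZFS command-line
# tools; run with sufficient privileges to administer the pool.
import subprocess

POOL = "tank"  # hypothetical pool name


def scrub_and_report(pool: str) -> None:
    # Start a scrub; ZFS re-reads every block, verifies its checksum,
    # and repairs from a redundant copy if a block fails verification.
    subprocess.run(["zpool", "scrub", pool], check=True)

    # "zpool status -x" reports only pools with problems (or that the
    # pool is healthy). The scrub runs in the background, so a real
    # script would poll until it finishes before trusting this output.
    status = subprocess.run(
        ["zpool", "status", "-x", pool],
        capture_output=True, text=True, check=True,
    )
    print(status.stdout)


if __name__ == "__main__":
    scrub_and_report(POOL)
```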

Those of you in a corporate environment possibly have this all dealt with for you by IT folk, so you may not need to worry, but if you are storing all your data on an NTFS volume for long periods (for example) then you do. It is also worth checking with your IT people that your data really is being checksummed and protected from bitrot.
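If you are stuck on a file system without native checksumming (NTFS, ext4, etc.), a rough do-it-yourself safety net is sketched below: record a SHA-256 manifest when a project is archived and re-verify it periodically. The paths and manifest name are just placeholders, and note that this only detects bitrot - repair still depends on having a clean second copy.

```python
# Rough sketch: build and re-verify a SHA-256 manifest for an archive
# directory on a file system with no native checksumming.
# Paths and the manifest filename are illustrative assumptions.
import hashlib
import json
from pathlib import Path

MANIFEST = "checksums.json"


def sha256_of(path: Path) -> str:
    # Hash in chunks so multi-gigabyte scan files don't need to fit in RAM.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(archive: Path) -> None:
    digests = {
        str(p.relative_to(archive)): sha256_of(p)
        for p in archive.rglob("*")
        if p.is_file() and p.name != MANIFEST
    }
    (archive / MANIFEST).write_text(json.dumps(digests, indent=2))


def verify_manifest(archive: Path) -> list[str]:
    digests = json.loads((archive / MANIFEST).read_text())
    return [
        rel for rel, digest in digests.items()
        if sha256_of(archive / rel) != digest
    ]


if __name__ == "__main__":
    root = Path("D:/scan_archive/2004_excavation")  # hypothetical archive path
    # build_manifest(root)        # run once at archive time
    bad = verify_manifest(root)   # run periodically thereafter
    print("corrupted files:", bad or "none")
```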

I'm really just trying to start a conversation about this and raise awareness. For many industries a few flipped bits here and there aren't a serious problem, but for laser scanning it's a demonstrable issue. If you are looking at structural change over decades (which we have done), then those really old datasets really do matter! And obviously if you are offering clients a data retention guarantee then it's an issue you can't ignore.

PS - there are other (less reliable) options, such as archiving data in a format that includes a recovery record (RAR, for example) - but this has also failed us on at least one occasion.
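As a toy illustration of the parity idea behind recovery records (RAR and PAR2 actually use Reed-Solomon codes, which tolerate far more damage than this), a single XOR parity block lets you rebuild any one damaged block in a set:

```python
# Toy illustration of the recovery-record idea: an XOR parity block can
# rebuild any single damaged block in a set of equal-length blocks.
# Real recovery records (RAR, PAR2) use stronger Reed-Solomon codes.
from functools import reduce


def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))


data_blocks = [b"scan-block-A", b"scan-block-B", b"scan-block-C"]
parity = xor_blocks(data_blocks)          # stored alongside the data

# Later, suppose block B is lost or fails its checksum: XOR of the
# surviving blocks and the parity reconstructs it.
recovered_B = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert recovered_B == data_blocks[1]
```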
Dan McCurdy
Geometria Heritage Management
http://geometria.co.nz/
jedfrechette
V.I.P Member
Posts: 1235
Joined: Mon Jan 04, 2010 7:51 pm
Full Name: Jed Frechette
Company Details: Lidar Guys
Company Position Title: CEO and Lidar Supervisor
Country: USA
Linkedin Profile: Yes
Location: Albuquerque, NM
Has thanked: 61 times
Been thanked: 219 times

Re: ECC/Checksumming support of file system for data

Post by jedfrechette »

This is definitely a hard problem, and one I certainly don't have a perfect solution for. Judging from the studies I've seen, commissioned by everyone from the motion picture industry to various national libraries, I'm not sure anyone does. It seems like there are only a couple of practical options for any organization without a huge IT budget and a lot of storage experience.

1. Write your archive data to magnetic tape and figure out a way to store the tapes safely in multiple locations. I think you can get 100-year tapes, but most only have a life expectancy of 10-30 years.

2. Add another abstraction on top of the file system and use an object store rather than a file store. That way you can easily hide all of the parity tracking and redundancy required. You could build one yourself using something like OpenStack. However, even with a really good IT department, I suspect there aren't many companies whose core business isn't storage that can match the eleven nines of durability that someone like Amazon can provide for S3 and Glacier (a rough sketch of that hand-off follows this list).
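Nothing more than a rough sketch, but here is what the hand-off to something like S3/Glacier can look like, assuming the third-party boto3 library, AWS credentials already configured, and a made-up bucket name. The useful part is asking the service to verify an MD5 of the payload on upload, so a copy corrupted in transit never silently lands in the archive:

```python
# Rough sketch: push an archive file to S3's Glacier storage class and
# have the service verify an MD5 of the payload on upload. Assumes the
# third-party boto3 library, AWS credentials already configured, and an
# illustrative bucket name.
import base64
import hashlib

import boto3

BUCKET = "example-scan-archive"   # hypothetical bucket name


def upload_with_checksum(local_path: str, key: str) -> None:
    with open(local_path, "rb") as f:
        body = f.read()
    # S3 rejects the upload if the payload's MD5 doesn't match this header,
    # so only an intact copy is accepted into the archive.
    md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=body,
        ContentMD5=md5_b64,
        StorageClass="GLACIER",   # Glacier storage class for cold archives
    )


# upload_with_checksum("2004_excavation.e57", "archive/2004_excavation.e57")
```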

I guess one advantage of our inefficient workflows is that we're inadvertently making use of a modified version of the LOCKSS (Lots Of Copies Keep Stuff Safe) principle. By the time we deliver a project we will typically have generated 4-5 slightly different copies of each scan so if we lose one, chances are we could still recover from another.
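As a rough sketch of turning those accidental copies into a deliberate check (the paths are made up, and this only helps where the copies are supposed to be byte-identical, like mirrors or backups, rather than the slightly different working copies above): hash every copy and trust the digest the majority agree on.

```python
# Rough sketch of the LOCKSS idea applied deliberately: hash every
# byte-identical copy of a scan and trust the digest that the majority
# of copies agree on. Paths are purely illustrative.
import hashlib
from collections import Counter
from pathlib import Path


def pick_majority_copy(copies: list[Path]) -> Path:
    # read_bytes() keeps the sketch short; chunked hashing would be
    # kinder to memory for multi-gigabyte scan files.
    digests = {p: hashlib.sha256(p.read_bytes()).hexdigest() for p in copies}
    best_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no two copies agree; manual inspection needed")
    return next(p for p, d in digests.items() if d == best_digest)


copies = [Path(p) for p in (
    "server/projects/site42/scan01.imp",      # hypothetical locations
    "backup1/site42/scan01.imp",
    "delivery_drive/site42/scan01.imp",
)]
# good_copy = pick_majority_copy(copies)
```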

I'm pretty confident I'll be able to recover any needed data from our current projects 5-10 years from now. Will they still be usable in 20-50 years though? I'm not so sure about that.
Jed

Re: ECC/Checksumming support of file system for data

Post by dan_mccurdy »

jedfrechette wrote:Lots Of Copies Keep Stuff Safe
This is actually a very good point, and one that I neglected. It turns out that in the specific project that made me write the original post, I was saved by exactly that. Never have I been so glad to have an inefficient workflow! I recently discovered 5 separate copies of the IMP files from an excavation we did back in 2004 - all with different file sizes (of course, back then a huge project was 20 GB. Hah!)

This of course only works if you're saving in a format that is resilient to bit-flipping errors. It turns out that Cyclone is very resilient - despite the IMP file appearing to be a monolithic store, it is actually quite happy to accept and ignore errors within the file and move on to the next set of data. I am sure this was a deliberate design choice by Cyrax/Leica - but other vendors have not been so savvy. I have had many Faro scan files corrupted, as well as entire Scene projects effectively trashed, due to this lack of resiliency. There is no way I would ever store my archive of a Faro project in Faro's native formats, especially since their FLS format seemingly changes with every release of Scene (although, to be fair, so does the IMP database I guess...). I'm happy to be corrected on this point, but that's been my experience over two years with Scene.

I'm glad you pointed out Glacier, since it exists almost exclusively for this type of use case - storing truckloads of data, for a long time, on the off chance that you might need it someday, while needing to be sure that it is safe. I suspect they're using ZFS or something similar as their on-disk format of choice... but of course pushing data to the cloud is just handing off responsibility for your data to someone else and hoping they take good enough care of it. We use Crashplan for our cloud backups - but not long ago they accidentally, and unrecoverably, deleted a whole bunch of people's backups... due to "human error". Oops.

Obviously we're all supposed to be archiving everything in e57 format in theory, but I wonder how resilient that is to bitrot. I remember reading a white paper on exactly that subject some time ago, but I can't remember the conclusion :-( And of course not every vendor supports e57 completely - so there's data loss right at the start.

I worry about this issue a lot, because being in heritage, quite a few of our projects look at change over the 10-50 year range - so if our data isn't still usable in 50 years then that's a really big problem for our entire thesis.
Dan McCurdy
Geometria Heritage Management
http://geometria.co.nz/

Re: ECC/Checksumming support of file system for data

Post by dan_mccurdy »

As a side note - our solution is that we now have our data stored on mirrored ZFS servers, replicated over three geographic locations (separate cities) in real time, as well as having two independent cloud backups. I'm not sure what else we can do, but I am keen to hear suggestions.
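Not our actual scripts, but a minimal sketch of the snapshot-and-replicate step involved, assuming a dataset named tank/scans, key-based SSH access to the off-site box, and the standard OpenZFS tools (a first full send, without -i, is needed before incrementals):

```python
# Minimal sketch of incremental ZFS replication to an off-site server.
# Assumes a dataset named tank/scans, key-based SSH access to the remote
# host, and the standard OpenZFS command-line tools on both ends.
import subprocess
from datetime import datetime, timezone

DATASET = "tank/scans"                 # hypothetical dataset
REMOTE = "backup@akl-offsite.example"  # hypothetical replication target


def replicate(previous_snap: str) -> str:
    # Take a new snapshot named after the current UTC timestamp.
    snap = f"{DATASET}@{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    # Incremental send of everything since the last replicated snapshot,
    # piped into "zfs receive" on the remote pool over SSH.
    send = subprocess.Popen(
        ["zfs", "send", "-i", previous_snap, snap], stdout=subprocess.PIPE
    )
    subprocess.run(
        ["ssh", REMOTE, "zfs", "receive", "-F", DATASET],
        stdin=send.stdout, check=True,
    )
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")
    return snap   # becomes previous_snap for the next run
```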

We split our storage across geographic locations in the wake of the devastating Christchurch earthquake 3 years ago, when IT departments found themselves in a messy situation if they had all their data eggs in one geographic basket.
Dan McCurdy
Geometria Heritage Management
http://geometria.co.nz/

Re: ECC/Checksumming support of file system for data

Post by jedfrechette »

dan_mccurdy wrote:but of course pushing data to the cloud is just handing off responsibility for your data to someone else, and hoping that they take good enough care of it.
Very true, and if you read the fine print in any of the contracts for the online storage providers you'll be hard-pressed to find any actual guarantees about reliability. Nonetheless, the major vendors are still probably much, much better than throwing everything on a SAN in a single data center.
dan_mccurdy wrote:We have split over geographic locations in the wake of the devastating Christchurch earthquake 3 years ago - when IT departments found themselves in a messy situation if they had all their data eggs in one geographic basket.
I think this is pretty key. Beyond the obvious risk of natural disasters, there are other geographic factors that might make one location more reliable than another. For example, I have a colleague with data centers here in Albuquerque and in Los Alamos. Even though the equipment is identical and the environmental conditions are as close to optimal as possible, they see more file system corruption at the Los Alamos location. The current working hypothesis is that Los Alamos sits about 600 m higher than Albuquerque, so more cosmic rays are hitting the drives.
Jed