Verifying Object Integrity when copying to Cloud Platforms (Part 1)
January 11, 2019
Companies keep data in many places, and files are one of the most common. For this reason, the Concinnity Data Engineering platform supports traditional batch processing, in which files are uploaded into the system. Uploading files may seem trivial, but in systems where data integrity matters, you want to verify that objects copied into the system arrive at their destination exactly as they left the source.
Although most protocols perform error checking and resend data when errors are detected, the Concinnity Platform takes a “don’t trust anything” approach. Several factors can corrupt data as it is copied up to a cloud platform, such as:
- faulty hardware on the machine/device that is uploading the data
- faulty hardware on the machine/device that is receiving the data
- poor network connectivity between endpoints, or a bad network device
- attacks that change a file as it is transferred
If this corruption goes undetected, the bad data in the files can skew any decision process that depends on it. In this post, we’ll discuss two types of hashes used for checking object integrity: MD5 and CRC. In later posts, we’ll discuss how to use them with the different cloud providers to verify objects uploaded into storage.
MD5 is a way to verify data integrity by calculating a 128-bit hash value.
On Linux machines, an MD5 value can be calculated using the md5sum command.
nelsone@chromebook:~$ md5sum /etc/hosts
MD5 was originally developed as a cryptographic hash function. It is no longer considered secure for cryptographic use, but it remains useful for detecting whether a file changed during transfer. Because it was designed for security, calculating an MD5 sum is more resource intensive than calculating a simple checksum.
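The same digest can also be computed in code. The following is a minimal Python sketch using the standard hashlib module. The function name md5_of_file and the chunk size are illustrative assumptions, not part of the Concinnity Platform.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 digest of a file as a lowercase hex string.

    md5_of_file and chunk_size are hypothetical names chosen for this
    sketch. The file is read in chunks so that large files do not have
    to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("/etc/hosts") should return the same 32-character hexadecimal value that md5sum reports for that file.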
CRC stands for cyclic redundancy check and is another way of detecting changes to data. CRC was neither developed nor intended for security, so it is less resource intensive than MD5. To calculate a file’s CRC value, you can run the cksum command.
nelsone@chromebook:~$ cksum /etc/hosts
3289016063 140 /etc/hosts
The output of the command is the CRC value, the byte count of the file, and the file name.
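A CRC can be computed in code as well. The sketch below uses zlib.crc32 from the Python standard library. Note that this is a different CRC-32 variant from the one cksum implements, so the number it returns will generally not match the cksum output above. The function name crc32_of_file is an assumption made for illustration.

```python
import zlib


def crc32_of_file(path, chunk_size=1024 * 1024):
    """Return the zlib CRC-32 of a file as an unsigned 32-bit integer.

    crc32_of_file is a hypothetical helper. It uses the CRC-32 variant
    provided by zlib, which is not the variant used by POSIX cksum.
    """
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Passing the running value continues the same checksum
            # across chunks instead of starting over.
            crc = zlib.crc32(chunk, crc)
    return crc
```

Whichever variant is used, the sender and the receiver must agree on it, because two different CRC algorithms produce different values for the same bytes.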
You can verify the integrity of a transferred file by comparing the hash value generated before the transfer starts with the hash returned by the cloud service once the transfer has completed. If the two values differ, the object was altered in transit and should be uploaded again. In the next few weeks, we’ll be posting how this can be accomplished with various cloud providers.
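The comparison itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names local_md5_hex and verify_upload are hypothetical, and the sketch assumes the service reports its MD5 as a hexadecimal string, which may not hold for every provider.

```python
import hashlib


def local_md5_hex(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a local file as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_upload(local_path, reported_md5_hex):
    """Return True if the local MD5 matches the value reported after upload.

    reported_md5_hex stands in for whatever hash the cloud service
    returns. Its hex format is an assumption of this sketch.
    """
    expected = local_md5_hex(local_path)
    # Normalize whitespace and letter case before comparing.
    return expected == reported_md5_hex.strip().lower()
```

In practice, the local hash would be computed before the upload begins, and a False result would trigger a retry or an alert.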