giant file-rsync+dd+md5sum=no cry

I recently ran into a situation where:

  • I transferred a large file (in my case 32GB)
  • The md5 at the recipient didn’t match my source file (sadface)
  • Either the source or the destination machine did not have rsync installed, but both had md5sum and dd.

Resending the entire 32GB file would be a waste of time. Why not just resend the chunks that failed?

The correct answer in this situation is usually “just use rsync, that’s what it’s for”. But I couldn’t, since the target system didn’t have rsync installed and I couldn’t install it. If you can install rsync at both ends, use it to fix the broken file. Here’s a great example.

You can’t do that? Then do this:

Giant file-rsync+dd+md5sum=no cry

  1. Create a bash/cmd script at each end to break the file into pieces with dd.
  2. md5sum each piece at both ends and compare to figure out which chunks are bad.
  3. Transfer the bad chunks from source to target.
  4. dd the chunks back into the giant file.
  5. Recheck the md5sum of the file to make sure it matches.

Create a bash/cmd script at each end to break the file into pieces

Tip: rename the file to something which doesn’t require escape sequences, especially if your source/target are running different OSes. For example, spaces mean the name has to be enclosed in quotes on Windows and have a backslash prepended on Linux. So get those spaces out of there.

dd thinks in terms of blocks.

blocksize × count = chunk size

I set the blocksize to 1 megabyte (1048576 bytes) to make the math easier. I want each chunk to be 128MB. The size of the chunk is up to you, but there is a trade-off: bigger chunks mean resending more good data along with the bad, while smaller chunks mean more part files to deal with. Anyhow, we have bs=1048576 count=128.

To tell dd where to start when it’s copying data out of a file, supply the skip option. So the first chunk has skip=0, the second chunk has skip=128, the third has skip=256, and so on. Why?

dd thinks in terms of blocks.
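
To make the block arithmetic concrete, here is a minimal sketch of pulling a single chunk out by its index. The function name extract_chunk is made up for illustration, and the part%04d naming is just one way to get partXXXX files. The defaults match the bs=1048576 count=128 example.

```shell
#!/bin/bash
# Sketch: copy chunk number INDEX out of FILE into partNNNN.
# extract_chunk is a hypothetical helper name, not a standard tool.
# Usage: extract_chunk FILE INDEX [BLOCKS_PER_CHUNK] [BLOCK_SIZE]
extract_chunk() {
    local file="$1" index="$2" count="${3:-128}" bs="${4:-1048576}"
    # skip counts input blocks, not bytes: chunk 0 -> 0, chunk 1 -> 128, ...
    local skip=$(( index * count ))
    dd if="$file" of="$(printf 'part%04d' "$index")" \
        bs="$bs" count="$count" skip="$skip" 2>/dev/null
}
```

With the defaults, extract_chunk bigfile.bin 2 reads 128 blocks starting at block 256, which is the third chunk.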

I usually create an Excel workbook and use fill-down to create the correct skip numbers and then CONCATENATE() to create the actual dd command lines. Copy and paste them into a text document. Send it to both ends with the correct extensions/permissions/shebang line/etc.
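
If you would rather not open a spreadsheet, the same fill-down can be sketched in a few lines of shell. This is an alternative to the Excel approach, not a copy of it: print_split_commands is a hypothetical helper, and it assumes you already know the file size in bytes.

```shell
#!/bin/bash
# Sketch: print one dd command line per chunk, like the filled-down sheet.
# print_split_commands is a hypothetical helper name.
# Usage: print_split_commands FILE SIZE_IN_BYTES [BLOCKS_PER_CHUNK] [BLOCK_SIZE]
print_split_commands() {
    local file="$1" size="$2" count="${3:-128}" bs="${4:-1048576}"
    local chunk_bytes=$(( bs * count ))
    # Round up so a final partial chunk still gets its own line.
    local chunks=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    local i=0
    while [ "$i" -lt "$chunks" ]; do
        printf 'dd if=%s of=part%04d bs=%d count=%d skip=%d\n' \
            "$file" "$i" "$bs" "$count" "$(( i * count ))"
        i=$(( i + 1 ))
    done
}
```

For a 32GB file (34359738368 bytes), print_split_commands bigfile.bin 34359738368 > split.sh writes 256 lines, with skip running from 0 up to 32640 in steps of 128.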

[Figure: Excel sheet showing the formulas. Here’s how I set up my Excel sheet to create my batch/shell script.]
[Figure: Excel sheet with the values filled in. The formulas allow me to fill down to create the correct lines in my batch/shell script.]

Run the batch/shell script at each end to create corresponding partXXXX files. If you follow my example, the value in the K column shows you where to stop copying; it changes to false at the line where you’ve passed the final dd required.

md5sum the pieces at each end and compare

Pretty easy; use md5sum on all of the partXXXX files at each end. Save the output into an md5 file and then get both files in the same place so you can compare.

Using the command line diff tool will work, but if you have a GUI tool it should make it easier to see which files don’t match. Let’s hope there aren’t many.
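
If there are a lot of parts, a small filter can list the mismatches for you. The sketch below assumes each end ran something like md5sum part* > source.md5 (or target.md5); those file names and the helper name list_bad_parts are made up for illustration. It also assumes a list made on Windows may carry a leading * on file names and CRLF line endings, so it strips both before comparing.

```shell
#!/bin/bash
# Sketch: print the names of parts whose checksums differ between two lists.
# list_bad_parts is a hypothetical helper name.
# Usage: list_bad_parts SOURCE_MD5 TARGET_MD5
list_bad_parts() {
    # First file: remember the source hash for each part name.
    # Second file: print names whose hash differs or is missing at the source.
    awk '{ name = $2; sub(/^\*/, "", name); sub(/\r$/, "", name) }
         NR == FNR { sum[name] = $1; next }
         sum[name] != $1 { print name }' "$1" "$2"
}
```

Running list_bad_parts source.md5 target.md5 prints one part name per line, which is the list of chunks to resend.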

Transfer the bad chunks from source to target

This part should be easy; just send the good chunks from the source to the target to replace the bad chunks. To make sure you haven’t wasted your time, md5sum the replacement chunks once they reach the destination. Retransfer any that don’t match.
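
One way to check just the replacement chunks is to feed their lines from the source checksum list to md5sum -c. This is a sketch under assumptions: verify_parts and source.md5 are hypothetical names, the list is in the plain "hash  name" format that GNU md5sum writes, and the md5sum at the destination supports -c.

```shell
#!/bin/bash
# Sketch: re-check selected parts against the source's checksum list.
# verify_parts is a hypothetical helper name.
# Usage: verify_parts SOURCE_MD5 PART...
verify_parts() {
    local list="$1" part
    shift
    for part in "$@"; do
        # Print the source's checksum line for this part, unchanged.
        awk -v p="$part" '$2 == p' "$list"
    done | md5sum -c -
}
```

verify_parts source.md5 part0003 part0017 prints an OK or FAILED line per part and returns non-zero if any of them failed.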

dd the chunks back into the giant file

We will use dd again. Instead of redoing the whole process in reverse, we only need to dd in the fixed chunks.

Either redo your Excel sheet or just find and replace in your target batch/shell script.

The key things here are that the if and of have been swapped, we must add conv=notrunc, and we use seek instead of skip. We swap the input and output files because we’re outputting to the big file. We use conv=notrunc because by default dd will truncate the destination file at the point where you start writing, throwing away everything after it. We don’t want to destroy the file, so this is important. Finally, skip moves the starting point in the input file, while seek moves it in the output file, so when we need to write the destination file anywhere other than the start, we have to use seek instead of skip.

You only need the lines corresponding to the fixed chunks. So your final batch/shell script might end up looking like this:
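
A minimal sketch of such a script, assuming the partNNNN naming and the bs=1048576 count=128 chunking from earlier. The helper name patch_chunk, the file name bigfile.bin, and the chunk numbers in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch: write partNNNN back into FILE at the matching offset.
# patch_chunk is a hypothetical helper name.
# Usage: patch_chunk FILE INDEX [BLOCKS_PER_CHUNK] [BLOCK_SIZE]
patch_chunk() {
    local file="$1" index="$2" count="${3:-128}" bs="${4:-1048576}"
    # seek counts output blocks; conv=notrunc keeps the rest of FILE intact.
    dd if="$(printf 'part%04d' "$index")" of="$file" \
        bs="$bs" count="$count" seek="$(( index * count ))" \
        conv=notrunc 2>/dev/null
}
```

If chunks 3 and 17 were the bad ones, calling patch_chunk bigfile.bin 3 and patch_chunk bigfile.bin 17 runs the same thing as two dd lines with seek=384 and seek=2176.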

Recheck the md5sum of the file to make sure it matches

You’re all done, assuming it matches. (Cue spooky music)
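
For the final check, a tiny sketch that compares the repaired file against the hash you got from the source. same_md5 is a hypothetical helper name, and the expected hash is whatever md5sum printed for the original file.

```shell
#!/bin/bash
# Sketch: succeed only if FILE hashes to EXPECTED_HASH.
# same_md5 is a hypothetical helper name.
# Usage: same_md5 FILE EXPECTED_HASH
same_md5() {
    local actual
    # md5sum prints "hash  name"; keep only the hash.
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}
```

Something like same_md5 bigfile.bin "$source_hash" && echo "no cry" tells you whether you’re done.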

Hey nevermind, here’s my Excel Workbook. Just use that. That’s what I’m going to do from now on.