I recently ran into a situation where:
- I transferred a large file (in my case 32GB)
- The md5 at the recipient didn’t match my source file (sadface)
- One of the machines (source or destination) didn’t have rsync installed, but both had md5sum and dd.
Resending the entire 32GB file would be a waste of time. Why not just resend the chunks that failed?
The correct answer in this situation is usually “just use rsync, that’s what it’s for”. But I couldn’t, since the target system didn’t have rsync installed and I couldn’t install it. If you can install rsync at both ends, use it to fix the broken file. Here’s a great example.
You can’t do that? Then do this:
Giant file - rsync + dd + md5sum = no cry
- Create a bash/cmd script at each end to break the file into pieces with dd.
- md5sum each piece at both ends and compare to figure out which chunks are bad
- Transfer the bad chunks from source to target
- dd the chunks back into the giant file
- recheck the md5sum of the file to make sure it matches
Create a bash/cmd script at each end to break the file into pieces
Tip: rename the file to something which doesn’t require escape sequences, especially if your source/target are running different OSes. For example, spaces mean the name has to be enclosed in quotes on Windows and have a backslash prepended on Linux. So get those spaces out of there.
dd thinks in terms of blocks.
I set the block size to 1 megabyte (1048576 bytes) to make the math easier. I want each chunk to be 128MB. The size of the chunk is up to you, but the trade-off is retransferring more good data along with the bad (bigger chunks) versus juggling more part files (smaller chunks). Anyhow, we have bs=1048576 count=128.
To tell dd where to start when it’s copying data out of a file, supply the skip option. skip is counted in blocks of size bs, not in bytes, so the first chunk has skip=0, the second chunk has skip=128, the third has skip=256, and so on. Why?
dd thinks in terms of blocks.
I usually create an Excel workbook and use fill-down to create the correct skip numbers and then CONCATENATE() to create the actual dd command lines. Copy and paste them into a text document. Send it to both ends with the correct extensions/permissions/shebang line/etc.
Run the batch/shell script at each end to create corresponding partXXXX files. If you follow my example, the value in the K column shows you where to stop copying; it changes to false at the line where you’ve passed the final dd required.
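If a spreadsheet isn’t handy, the same dd commands can come from a loop. This is a minimal sketch for the Linux end, assuming a POSIX shell and a wc that reports the file size in bytes. The function name split_into_parts and its arguments are made up for illustration; the dd options are the ones described above.

```shell
#!/bin/sh
# split_into_parts FILE [CHUNK_MB]   (hypothetical helper, not part of dd)
# Writes part0000, part0001, ... into the current directory.
split_into_parts() (
    file=$1
    chunk_mb=${2:-128}                  # blocks per chunk; each block is 1 MiB
    bs=1048576
    size=$(( $(wc -c < "$file") ))      # file size in bytes
    chunk_bytes=$(( bs * chunk_mb ))
    # Round up so a short final chunk is still written out.
    chunks=$(( (size + chunk_bytes - 1) / chunk_bytes ))
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # skip is counted in blocks of bs, so chunk i starts at i * chunk_mb.
        dd if="$file" of="$(printf 'part%04d' "$i")" \
           bs="$bs" count="$chunk_mb" skip=$(( i * chunk_mb ))
        i=$(( i + 1 ))
    done
)
```

Calling split_into_parts Bigfile.iso in an empty directory should leave one partXXXX file per 128MB of input, numbered from part0000. The numbering has to be zero-based for the seek values later on to line up.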
md5sum the pieces at each end and compare
Pretty easy; use md5sum on all of the partXXXX files at each end. Save the output into an md5 file and then get both files in the same place so you can compare.
md5sum part* > part.md5
Using the command line diff tool will work, but if you have a GUI tool it should make it easier to see which files don’t match. Let’s hope there aren’t many.
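If you would rather have just the names of the bad chunks, a few lines of awk can pull them out of the two md5 files. This is only a sketch: list_bad_parts is a hypothetical name, and it assumes each line looks like the usual md5sum output (hash, whitespace, file name). It also tries to tolerate the leading "*" that binary mode adds to names, uppercase hashes, and Windows line endings, since the two ends may be running different OSes.

```shell
#!/bin/sh
# list_bad_parts SOURCE_MD5 TARGET_MD5   (hypothetical helper)
# Prints the name of every part in TARGET_MD5 whose hash differs from SOURCE_MD5.
list_bad_parts() (
    awk '
        {
            sub(/\r$/, "")              # drop a trailing CR from Windows-made files
            name = $2
            sub(/^\*/, "", name)        # drop the binary-mode marker in front of the name
            hash = tolower($1)
        }
        NR == FNR { sum[name] = hash; next }    # first file: remember the source hashes
        sum[name] != hash { print name }        # second file: report any mismatch
    ' "$1" "$2"
)
```

For example, list_bad_parts source.md5 target.md5 prints one part name per line. A part that is listed only in the source file is not reported, so check that both files have the same number of lines first.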
Transfer the bad chunks from source to target
This part should be easy; just send the good chunks from the source to the target to replace the bad chunks. To make sure you haven’t wasted your time, md5sum the replacement chunks once they reach the destination. Re-retransfer any that don’t match.
dd the chunks back into the giant file
We will use dd again. Instead of redoing the whole process in reverse, we only need to dd in the fixed chunks.
Either redo your Excel sheet or just find and replace in your target batch/shell script.
dd of=Bigfile.iso if=partXXXX bs=1048576 count=128 conv=notrunc seek=YYYY
The key things here are that the if and of have been swapped, we must add conv=notrunc, and we use seek instead of skip. We swap the input and output files because we’re outputting to the big file. We use conv=notrunc because by default dd will truncate the destination file at the point where you start writing, discarding everything from there onward before it writes the chunk. We don’t want to destroy the file, so this is important. Finally, skip only applies to the input file; to start writing anywhere other than the start of the output file, we have to use seek, which is also counted in blocks.
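To see what conv=notrunc protects you from, here is a throwaway demo on a 12-byte file instead of a 32GB one. The function name, file names, and 4-byte block size are all made up for illustration; it works in a fresh temporary directory so nothing real gets touched.

```shell
#!/bin/sh
# demo_notrunc: patch the middle "chunk" of a tiny file with and without conv=notrunc.
demo_notrunc() (
    cd "$(mktemp -d)" || exit 1
    printf 'AAAABBBBCCCC' > with_trunc.bin      # three 4-byte chunks
    cp with_trunc.bin no_trunc.bin
    printf 'XXXX' > fix.bin                     # replacement for the middle chunk

    # Default behaviour: the output is cut off at the seek offset before writing.
    dd if=fix.bin of=with_trunc.bin bs=4 count=1 seek=1 2>/dev/null
    # With conv=notrunc the bytes after the patched chunk survive.
    dd if=fix.bin of=no_trunc.bin bs=4 count=1 seek=1 conv=notrunc 2>/dev/null

    printf 'default: %s\n' "$(cat with_trunc.bin)"
    printf 'notrunc: %s\n' "$(cat no_trunc.bin)"
)
```

Running demo_notrunc prints "default: AAAAXXXX" and then "notrunc: AAAAXXXXCCCC". The first file has lost its last chunk, which on the real thing would be everything after the chunk you just fixed.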
You only need the lines corresponding to the fixed chunks. So your final batch/shell script might end up looking like this:
dd of=Bigfile.iso if=part0015 bs=1048576 count=128 conv=notrunc seek=1920
dd of=Bigfile.iso if=part0016 bs=1048576 count=128 conv=notrunc seek=2048
dd of=Bigfile.iso if=part0061 bs=1048576 count=128 conv=notrunc seek=7808
dd of=Bigfile.iso if=part0113 bs=1048576 count=128 conv=notrunc seek=14464
dd of=Bigfile.iso if=part0114 bs=1048576 count=128 conv=notrunc seek=14592
dd of=Bigfile.iso if=part0115 bs=1048576 count=128 conv=notrunc seek=14720
dd of=Bigfile.iso if=part0129 bs=1048576 count=128 conv=notrunc seek=16512
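The seek values are just the part number times 128, so these lines can also be generated rather than edited by hand. A minimal sketch, assuming parts are named partNNNN with zero-based numbers and 128 blocks per chunk as above; print_patch_commands is a hypothetical name. It only prints the commands, so you can read them over before running anything against the big file.

```shell
#!/bin/sh
# print_patch_commands BIGFILE PART...   (hypothetical helper)
# Prints one dd command per part; nothing is written to BIGFILE.
print_patch_commands() (
    target=$1
    shift
    chunk_mb=128                        # must match the count used when splitting
    for part in "$@"; do
        n=${part#part}                  # "part0015" -> "0015"
        n=$(printf '%s' "$n" | sed 's/^0*//')   # strip zeros so 0015 is not read as octal
        n=${n:-0}                       # part0000 strips to empty; treat it as 0
        printf 'dd of=%s if=%s bs=1048576 count=%d conv=notrunc seek=%d\n' \
            "$target" "$part" "$chunk_mb" "$(( n * chunk_mb ))"
    done
)
```

For example, print_patch_commands Bigfile.iso part0015 part0016 reproduces the first two lines of the script above. Once they look right, redirect the output to a file and run it, or pipe it straight into sh.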
Recheck the md5sum of the file to make sure it matches
You’re all done, assuming it matches. (Cue spooky music)
Hey nevermind, here’s my Excel Workbook. Just use that. That’s what I’m going to do from now on.