November 30th, 2012
I came home from Thanksgiving with a bloated stomach and a few enormous files that I planned to back up on my home computer. These files were on the order of hundreds of GB and left me only about 30GB free on my laptop hard drive. And now, fair readers, I will share the harrowing tale of… The File That Wouldn’t Fit Twice. Using a couple of electronics tricks learned in ME210, I ended up with a pretty neat solution to transfer these huge files to my server.
If you’re looking to transfer a big file that you don’t have room to split into little pieces first, please save yourself some time and skip to the bottom for my final solution!
Take one: plain ol’ rsync
Before this was even an issue, I just tried to rsync the file over to my home server.
The problem was that, for whatever reason, the transfer between the two machines would hiccup and rsync would die after some time. And when I would retry, rsync had to go and check all the bits it had already transferred before sending more data. And then it would die again. Gross. I tried this a few times before realizing that I needed a better solution.
Take two: Pulse-width modulation
OK, well, the solution to a file that’s too big is to split it into a bunch of little pieces using something like this:
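A minimal sketch of that kind of split, assuming a hypothetical file name (bigfile.img), a hypothetical piece prefix (piece.), and the 512 MB piece size mentioned further down. The helper name split_into_pieces is made up for illustration:

```shell
# split_into_pieces FILE [SIZE] [PREFIX]
# Cut FILE into SIZE-byte pieces named PREFIXaa, PREFIXab, ...
# The defaults (512m, "piece.") are assumptions for this example.
split_into_pieces() {
    split -b "${2:-512m}" "$1" "${3:-piece.}"
}

# Usage: split_into_pieces bigfile.img
```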
But the problem is that The File That Wouldn’t Fit Twice is 300GB and I only have 30GB free, so I wouldn’t be able to store both the original and the split copies! So I needed to come up with some way to transfer the pieces as they were coming out and then delete them to make space for more pieces to come through.
My next try was to slow the split command using PWM. You know how you can pause a process with Ctrl+Z and resume it with fg? That’s just using signals, so you could instead do something like kill -SIGSTOP <pid> and kill -SIGCONT <pid>. Or, if you’re only running one split, killall -SIGSTOP split and killall -SIGCONT split.
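Put in a loop, those two signals make a crude duty cycle. The sketch below is one way to write it, not necessarily the exact one-liner; the function name pwm and its arguments are made up for illustration, and it targets a single PID rather than using killall:

```shell
# pwm PID ON_SECONDS OFF_SECONDS
# Let the process run for ON_SECONDS, pause it for OFF_SECONDS,
# and repeat until the process no longer exists.
pwm() {
    while kill -0 "$1" 2>/dev/null; do
        kill -s CONT "$1" 2>/dev/null   # same signal as kill -SIGCONT
        sleep "$2"
        kill -s STOP "$1" 2>/dev/null   # same signal as kill -SIGSTOP
        sleep "$3"
    done
}

# Usage, one minute on and one minute off (a 50% duty cycle):
#   pwm "$(pgrep -x split)" 60 60
```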
With that line, split runs for a minute and then sleeps for a minute, basically slowing it down by 50%. This is a super-cheap way to do pulse-width modulation.
So, now, to transfer the parts (but not ALL the parts, only the parts that have completed!):
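A sketch of the selection step, assuming pieces named piece.* in the current directory and a hypothetical destination user@server:/backup/pieces/. The helper name complete_pieces is made up; the rsync loop is shown as a comment because it needs a real server:

```shell
# complete_pieces DIR [SIZE_BYTES]
# Print the pieces in DIR whose size is exactly SIZE_BYTES,
# i.e. the ones split has finished writing (default: 512 MB).
complete_pieces() {
    find "$1" -maxdepth 1 -type f -name 'piece.*' -size "${2:-536870912}c"
}

# Usage, re-run in a loop so newly finished pieces get picked up:
#   while sleep 60; do
#       complete_pieces . | rsync -av --files-from=- . user@server:/backup/pieces/
#   done
```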
In short, this finds all files that are of size 536870912 (blocksize 512m from our split command above) and sends them off to the server. We run this in a loop because rsync won’t find files that were created after the command started.
To get rid of the files that we’ve already transferred, I found comm, which is like the opposite of diff:
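Roughly, the idea looks like the sketch below. It assumes md5sum-style listings ("checksum  name" per line) saved to hypothetical files local.md5 and remote.md5, uses a made-up helper name transferred_pieces, and leaves out GNU parallel to keep things short:

```shell
# transferred_pieces LOCAL_LIST REMOTE_LIST
# Print the names of files whose "checksum  name" line appears in both lists.
# comm -12 keeps only lines common to both (sorted) inputs.
transferred_pieces() {
    sorted_remote=$(mktemp)
    sort "$2" > "$sorted_remote"
    sort "$1" | comm -12 - "$sorted_remote" | awk '{ print $2 }'
    rm -f "$sorted_remote"
}

# Usage (hypothetical host and paths):
#   md5sum piece.* > local.md5
#   ssh user@server 'cd /backup/pieces && md5sum piece.*' > remote.md5
#   transferred_pieces local.md5 remote.md5 | xargs rm --
```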
This takes the checksum of the individual pieces on the transferring machine and compares them to the checksum of the files on the server. All files that have the same checksum on the host and the server are deleted. It also uses GNU parallel, which allows you to execute multiple tasks in, well, parallel. Nice!
Writing those lines made me feel pretty awesome, but unfortunately, it didn’t work very well at all. First, the PWM duty cycle was completely arbitrary, which meant that split could either run too fast for rsync to catch up or not fast enough, where rsync is waiting for more data to transfer. Second, computing checksums is really expensive and likely unnecessary, but the sizes on the transferring computer and the server often didn’t agree. Whatever.
Take three: RTFM and hysteresis
As it turned out, this hacky way of figuring out which files had successfully transferred was completely unnecessary: rsync has a --remove-source-files option, which removes a source file once it’s done transferring to the server. Gold. Here’s our new rsync command:
Note that we still need to limit ourselves to files that are complete so we don’t end up transferring (and then deleting!) files that split hasn’t finished writing yet. Because the last piece is usually smaller than the others, the size check never matches it and it gets left behind at the very end. You can just transfer that one by hand, but in combination with the hysteresis below, you can actually just remove the size check.
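A sketch of what that can look like, assuming pieces named piece.* and a hypothetical destination user@server:/backup/pieces/. The helper name ship_complete_pieces is made up, and this version starts one rsync per piece for simplicity:

```shell
# ship_complete_pieces SRC_DIR DEST [SIZE_BYTES]
# Send every full-size piece in SRC_DIR to DEST; rsync deletes each
# local copy once it has been transferred successfully.
ship_complete_pieces() {
    find "$1" -maxdepth 1 -type f -name 'piece.*' -size "${3:-536870912}c" \
        -exec rsync -a --remove-source-files {} "$2" \;
}

# Usage, in a loop so new pieces keep flowing out:
#   while sleep 60; do ship_complete_pieces . user@server:/backup/pieces/; done
```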
Our previous PWM solution for split was pretty silly. In reality, we want to stop split if our disk is running out of space and start it once we’ve cleared out room for new files.
Basically, if we have more than 25GB free, we want to ensure that the split is running. If we have less than 15GB free, we want to ensure that the split isn’t running. Anything in between just stays the course. If we weren’t using hysteresis and instead had one cutoff to determine whether we should run split or not, we’d end up flipping it on and off rapidly. The split and rsync commands compete for disk I/O and slow each other down a lot, so in my very-rough analysis, it made more sense to let split run (or not) for a sustained period of time.
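The two-threshold logic can be sketched like this. The helper names free_gb and split_signal are made up for illustration, and the 15/25 thresholds are the ones described above:

```shell
# free_gb DIR: whole gigabytes free on the filesystem holding DIR.
# df -Pk prints 1024-byte blocks; column 4 of line 2 is "Available".
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# split_signal FREE_GB LOW HIGH
# Print SIGCONT above HIGH, SIGSTOP below LOW, and nothing in between,
# so split keeps doing whatever it was already doing.
split_signal() {
    if [ "$1" -gt "$3" ]; then
        echo SIGCONT
    elif [ "$1" -lt "$2" ]; then
        echo SIGSTOP
    fi
}

# Usage:
#   while sleep 10; do
#       sig=$(split_signal "$(free_gb .)" 15 25)
#       [ -n "$sig" ] && killall "-$sig" split
#   done
```

With a single cutoff, the same loop would flip between the two signals every time free space crossed that one value; the gap between 15 and 25 is what keeps each state sustained.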
Once the pieces were all transferred, it was just a matter of catting them together:
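A sketch, assuming the pieces still carry the piece. prefix from split and that bigfile.img is a hypothetical output name. The shell expands the glob in sorted order, which matches the aa, ab, ac… suffixes split generates:

```shell
# join_pieces PREFIX OUTPUT
# Concatenate PREFIXaa, PREFIXab, ... back into one file.
# OUTPUT must not itself start with PREFIX.
join_pieces() {
    cat "$1"* > "$2"
}

# Usage on the server: join_pieces piece. bigfile.img
```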