Reproducibility and Data

TL;DR → Version control for your data doesn’t mean s**t if you’re not versioning right up to the point you use it
(Cartoon via https://www.digital-science.com/blog/guest/digital-science-doodles-lab-life-and-experimental-reproducibility/)
I’ve written about the Reproducibility Crisis in Machine Learning before. Without going into gory details (just go read the post instead), the field is a perfect storm that comes from combining huge — nay massive — data sets, jobs that take forever to run, pipelined workflows within these jobs, workflows running in parallel, and poor version control.
“Poor Version Control → This one is the killer here. While you might (might! Admit it, you don’t!) have version control for your individual models and datasets, what you inevitably end up not doing is versioning all your tweaks in flight. It is, after all, easier to just tweak some of the data on the servers rather than copying all your assets up again. Or, it’s just faster to spawn a bunch of jobs in a loop with slightly modified parameters to see what the effect is going to be.”
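(A minimal sketch of one countermeasure for the loop-of-jobs habit, assuming your code lives in git: log each spawned job’s parameters next to the exact commit they ran against. The record_run helper and the runs.jsonl file are made up for illustration, not what we were running.)

```python
import json
import subprocess
import time

def record_run(params: dict, log_path: str = "runs.jsonl") -> None:
    """Append one job's parameters plus the exact code version,
    so even quick parameter sweeps stay reproducible."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    entry = {"time": time.time(), "commit": commit, "params": params}
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Hypothetical sweep:
# for lr in (0.1, 0.01, 0.001):
#     record_run({"learning_rate": lr})
#     train(lr)  # your actual training entry point
```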
Let’s posit, though, that you’re actually versioning everything — as we were. The thing is, even if you’re versioning everything, you can still mess up!
A large dataset had me chasing ghosts recently, and I’m still mad at myself about it. The dataset in question was publicly available property information in the US, which, alongside the usual stuff like pictures and whatnot, included a positively huge file containing names, addresses, and transaction records (sale date and price) for each property. Good stuff, and useful for all sorts of Deep Learning projects. For example:
  • Train against a random subset of the properties in Connecticut, and check the results against the other properties in the same state. (Of course, using a PRNG to select the “random” subset, so that you can reproduce the same subset; see the sketch after this list.)
  • Train against properties in Connecticut, and compare against properties in Texas. (Use only the Connecticut and Texas data.)
  • Predict whether a property will come up for sale, based on Valuations for other properties in the same neighborhood. (Use the IDs and transaction dates as prediction targets.)
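To make that first bullet concrete, here’s a minimal sketch of the seeded-subset idea, using pandas. The column names (state, and the properties frame) are made up for illustration; your schema will differ:

```python
import pandas as pd

def sample_state(df: pd.DataFrame, state: str, frac: float, seed: int):
    """Reproducibly split one state's rows into a training sample
    and a holdout: the same seed always yields the same split."""
    in_state = df[df["state"] == state]
    train = in_state.sample(frac=frac, random_state=seed)  # seeded PRNG
    holdout = in_state.drop(train.index)
    return train, holdout

# Hypothetical usage:
# train, holdout = sample_state(properties, "CT", frac=0.5, seed=20180312)
```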
Fairly straightforward stuff, this. And yet, yet, we kept running headlong into the Reproducibility Barrier. The models that worked fine for me barfed all over the place for my colleagues, and vice versa.
We went through the usual stuff — check that we’re using the same versions of the datasets, the same PRNG seeds, and so on. After a lot of unnecessary pain and suffering, we figured out that the issue was Excel.
Excel?
Yes, Excel. You see, the aforesaid huge file was not really the kind of thing you mess around with in vi. So, when testing our models, we’d compare the results against what we saw in Excel, and it wouldn’t even be close.
Until we realized that Excel will, happily, format your dates for you. And when it does so, it’ll switch them into your locale’s format. For example, when you see 3/8/1995, is that March 8th? Or August 3rd? (Having a multinational team really doesn’t help here…)
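If that sounds abstract, here’s the whole problem in a few lines of Python, no Excel required. The same string, two perfectly valid parses:

```python
from datetime import datetime

raw = "3/8/1995"
us = datetime.strptime(raw, "%m/%d/%Y")  # US reading: March 8th
eu = datetime.strptime(raw, "%d/%m/%Y")  # most-everywhere-else reading: August 3rd
print(us.date(), eu.date())  # 1995-03-08 vs. 1995-08-03
```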
(Image: “Comprehensive map of countries that only use MMDDYY”)
That was it. Just that one little tiny thing, where Excel helpfully auto-formatted our dates for us, successfully blowing about a week’s worth of debugging time. And yes, it was entirely our own fault.
The point behind all of this is that this step — opening up your datasets in Excel — is such default behavior that we don’t even think about it. And, in the meantime, Excel, quite merrily, might be doing the following:
  1. Formatting data. And it’s not just dates. Currencies, number formatting… oh, the list goes on.
  2. Converting data. You know, when Excel decides something is a date, and all of a sudden 3.2 becomes March 2nd?
  3. Encoding data. What’s the default character set your data is stored in? UTF-8? UTF-16? Excel usually silently converts this in the background. Yay!
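The defensive move, if you must pull a dataset into code, is to turn all the guessing off: read everything as strings with an explicit encoding, then convert only the columns you care about, strictly. A sketch with pandas, where the file name, column names, and date format are all assumptions rather than the actual dataset:

```python
import pandas as pd

# No type guessing (so "3.2" stays "3.2"), and the encoding is stated
# rather than silently assumed.
df = pd.read_csv("properties.csv", dtype=str, encoding="utf-8")

# Convert explicitly; a row that doesn't match the format fails loudly
# instead of becoming a locale-dependent surprise.
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%Y-%m-%d")
df["sale_price"] = pd.to_numeric(df["sale_price"])
```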
If you’re not paying attention, this is exactly the kind of issue that makes for reproducibility nightmares, because it happens after you pull data out of Version Control. And yes, if you process the data in Excel and check it back in, you now have other people trying to figure out why the data no longer behaves (e.g. they have different system defaults, so their “Excel” step gives different results!).
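One cheap way to enforce that last-mile guarantee (a sketch, not what we were running): record a checksum of the dataset alongside your code, and refuse to train if the file on disk doesn’t match it. That catches any quick tweak on the server, or a helpful Excel pass, before it eats a week of your life:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash the file in chunks, so huge datasets don't blow up memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The expected hash lives in version control, next to the code.
EXPECTED = "<hash recorded when the dataset was committed>"
assert sha256_of("properties.csv") == EXPECTED, "dataset changed since checkout!"
```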
So yes, be careful. Be very, very careful. Reproducibility is hard enough in general — don’t make it harder! Make sure that your data is versioned right up to the time you use it.
