Reproducibility and Machine Learning

I can say with my hand on my heart, that machine learning is by far the worst environment I’ve ever found for collaborating and keeping track of changes.
— Pete Warden

I’d actually quite agree. Mind you, it’s not because of something fundamentally bad about the world of Deep Learning (•); it’s more about a collection of things that add up to a lot of pain. To summarize a much longer writeup about this:

1. Size: The source data is large. This is a problem in and of itself, since the state-of-the-art in managing large amounts of data is still, well, sucky. Think “where do you put it”, “how do you get at it”, “how do others get at it”, etc. And yeah, Dropbox and its ilk are where most of this stuff lives.
2. Canonical Data: The data used isn’t canonical (Actually it’s worse, it is near-canonical). Your version of the data may be different by just a few records, or you tweaked just a few records. And this may be before you get your hands on the data. Think “I’m doing stuff with only cats, so am going to chuck all the images that have cats and dogs”.
3. Version Control: Keeping track of the changes in your software is a wee bit more complicated than “Just Use Git”, because you’re also feeding results (model weights, evaluation scores, etc.) from one run into the next one.
4. Deployment: You actually run your stuff on a GPU/TPU cluster somewhere. Which means you’ve got to get your local code/data there. And given turn-around times, etc., sometimes it’s just easier to tweak things on the fly if something barfs. Which makes -1- through -3- above that much more fun.
5. Runtime: Training can take forever (Sigh). Depending on what you’re doing, it can, literally, take days or weeks. So you have “mini” models that you use to test what you’re doing, which really don’t map all that well to the actual thing, but serve as your intuitive tool to see what might happen. It’s totally subjective, and you’re managing these models in parallel with your actual work.
6. Pipelines: And it’s not like you’re sitting still while the real runs are happening. You’re merrily banging away on your code, deploying new runs while the old ones are still ongoing, some of those runs cause you to preemptively terminate other runs, etc. etc. Basically, you’ve got a huge set of multi-stage pipelines going…
7. Version Control v2: Which “results” are you looking at? You’ve got all those runs going in your pipeline, and how do you go about keeping track of the combination of code / data / run each result maps to?
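
To make -2-, -3-, and -7- a bit more concrete, here is a minimal Python sketch of one way to pin a result to the code and data that produced it: hash the dataset, grab the git commit, and drop both next to the run's outputs. Every name in it (`fingerprint_dataset`, `write_manifest`, `manifest.json`, the `config` dict) is made up for illustration. It also assumes the data sits in a local directory small enough to hash and that the code lives in a git checkout. It is not a real tool, and it does nothing at all for the Size problem in -1-.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def fingerprint_dataset(root: Path) -> str:
    """Hash every file's relative path and contents, in sorted path order.

    Two copies of a "near-canonical" dataset that differ by even one
    record (or one renamed file) end up with different fingerprints.
    """
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return digest.hexdigest()


def _git(*args: str):
    """Run a git command; return its stdout, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def write_manifest(run_dir: Path, data_dir: Path, config: dict) -> dict:
    """Record which code, data, and settings a run used (hypothetical layout)."""
    status = _git("status", "--porcelain")
    manifest = {
        "code_commit": _git("rev-parse", "HEAD") or "unknown",
        # True means somebody tweaked things on the fly without committing;
        # None means git could not tell us either way.
        "uncommitted_changes": None if status is None else bool(status),
        "data_fingerprint": fingerprint_dataset(data_dir),
        "config": config,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

Calling something like `write_manifest(Path("runs/run-001"), Path("data"), {"lr": 0.1})` at the start of each run at least gives you a file to look at when you are wondering which “results” you are staring at. It only helps if everybody actually calls it, every time, including on the cluster, which is pretty much the point of this post.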

/via https://xkcd.com/1597/

Yes, yes, I know, every single one of the above is solvable. None of it is exactly rocket science; it’s all “just make sure that you are using the right tools”, “follow good programming practices”, etc.
Except.
Except, each of the above is something that even you, a highly trained, extraordinarily competent developer, nay, engineer, mess up every now and then. (And take shortcuts around on occasion. It’s ok, you can admit it.) And when you put them all together, and add in the unfortunate reality that a lot (well, most) of the people doing this are not hardcore engineers who have been living/breathing version control and CI/CD/CT forever, well, there is so much that can go wrong I can’t even…

So yeah, Pete Warden has the right of it: there is a reproducibility crisis in Machine Learning. It’s just that I’m a wee bit more pessimistic than him about it; I don’t think it’s going to be easily resolved, barring some kind of significant change.
What kind of change?
The mind boggles, but, at the very least, something like a Deep Learning version of JetBrains, or “ml-mode” in emacs, or some such. And, even then, we still have the whole Size issue in -1- above (“massive amounts of data to be managed”) to deal with…

(•) Mind you, there is a different set of issues involving randomness (SGD) and precision reduction that also comes into play, but this post is more about the software-development side of things…
