Reproducibility and Data

TL;DR → Version control for your data doesn’t mean s**t if you’re not versioning right up to the point you use it /via https://www.digital-science.com/blog/guest/digital-science-doodles-lab-life-and-experimental-reproducibility/ I’ve written about the Reproducibility Crisis in Machine Learning before. Without going into gory details (just go read the post instead), the field is a perfect storm that comes from combining huge — nay massive — data sets, jobs that take forever to run, pipelined workflows within these jobs, workflows running in parallel, and poor version control. “Poor Version Control → This one is the killer here. While you might (might! Admit it, you don’t!) have version control for your individual models and datasets, what you inevitably end up not doing is versioning all your tweaks in flight. It is, after all, easier to just tweak some of the data on the servers rather than copying all your assets up again...