Showing posts with the label Reproducibility

Reproducibility and Data

TL;DR → Version control for your data doesn’t mean s**t if you’re not versioning right up to the point you use it.

(Image via https://www.digital-science.com/blog/guest/digital-science-doodles-lab-life-and-experimental-reproducibility/)

I’ve written about the Reproducibility Crisis in Machine Learning before. Without going into the gory details (just go read the post instead), the field is a perfect storm that comes from combining huge, nay massive, datasets; jobs that take forever to run; pipelined workflows within these jobs; workflows running in parallel; and poor version control.

“Poor Version Control → This one is the killer here. While you might (might! Admit it, you don’t!) have version control for your individual models and datasets, what you inevitably end up not doing is versioning all your tweaks in flight. It is, after all, easier to just tweak some of the data on the servers rather than copying all your assets up again...
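To make "versioning right up to the point you use it" concrete, here is a minimal sketch, not anything from the original post. It assumes your data lives in a plain directory, and the names `fingerprint` and `check_before_use` are hypothetical helpers invented for illustration. The idea is to record a content hash when you version the data, then recompute it immediately before a job reads the data, so an in-flight tweak on the server gets caught instead of silently changing your results.

```python
import hashlib
from pathlib import Path


def fingerprint(data_dir: str) -> str:
    """Return a SHA-256 digest covering every file under data_dir.

    Hypothetical helper: hashes relative paths and file contents only,
    so timestamps and permissions do not affect the result.
    """
    digest = hashlib.sha256()
    root = Path(data_dir)
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # Read in 1 MiB chunks so large files are never loaded whole.
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def check_before_use(data_dir: str, recorded: str) -> None:
    """Raise if the data on disk no longer matches the recorded fingerprint."""
    actual = fingerprint(data_dir)
    if actual != recorded:
        raise RuntimeError(
            f"data changed since it was versioned: expected {recorded}, got {actual}"
        )
```

You would store the output of `fingerprint(...)` alongside the commit that versions your model, and call `check_before_use(...)` as the first step of the job. Hashing everything on every run is slow for truly massive datasets, so treat this as an illustration of the check rather than a production design.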