reproducibility
Some notes on different programming techniques / frameworks for reproducibility
containerization
- e.g. docker
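A container pins the OS, interpreter, and dependencies so a run can be reproduced elsewhere; a minimal Dockerfile sketch (image tag, file names, and entry point are assumptions, not from the notes):

```dockerfile
FROM python:3.11-slim   # pin the interpreter version
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # pin deps via a lockfile-style requirements.txt
COPY . .
CMD ["python", "train.py"]   # hypothetical entry point
```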
data version control
- dvc - data version control
  - .dvc folder keeps track of some internal stuff (like .git)
  - metafiles ending with .dvc are stored in git, tracking big things like data and models
  - also simple support for keeping track of metrics, displaying pipeline, making plots
  - keep track of old things using git checkout and dvc checkout
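As a concrete illustration, a .dvc metafile committed to git is just a small YAML stub pointing at the real file in DVC's cache (file name and hash below are made up):

```yaml
# data.csv.dvc - lives in git; the actual data.csv lives in DVC's cache / remote
outs:
- md5: d8e8fca2dc0f896fd7cb4cb0031ba249  # content hash identifying this version of the data
  path: data.csv
```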
- dagshub - built on dvc, like github (gives ~10GB free storage per project)
  - “Our recommendation is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.”
  - not open source :frowning_face:
- replicate.ai - version control for ml
  - lightweight, focuses on tracking model weights / sharing + dependencies
  - less about hyperparams
- mlflow (open-source) from databricks
  - API and UI for logging parameters, code versions, metrics and output files
- gigantum - like a closed-source dagshub
- codalab - good framework for reproducibility
- paid / closed-source
- weights and biases (free for academics, paid otherwise)
- neptune.ai
- h2o.ai
hyperparameter tuning
- weaker versions
  - tensorboard (mainly for deep learning)
  - pytorch-lightning + hydra
  - ray
weights and biases
- wandb.login() - log in to W&B at the start of your session
- wandb.init() - initialises a new W&B run; returns a “run” object
- wandb.log() - logs whatever you’d like to log
workflow management
- prefect
  - tasks are basically functions
  - flows are used to describe the dependencies between tasks, such as their order or how they pass data around