Machine learning projects deal with both data and code. DVC is a tool that provides data versioning. For more information, see the official DVC documentation.
The main purpose of this tutorial is to give a brief overview of DVC and how it addresses common problems in modern machine learning projects.
To start a project that uses both Git and DVC, follow these steps.
git clone https://github.com/ryokugyu/dvc_tutorial.git
After cloning the repository, change directory to dvc_tutorial.
dvc init
After initializing the DVC repository, pull the data onto the local machine:
dvc pull
Now both the data and the code are present locally. First, split the dataset into training and validation sets in a roughly 70-30 ratio:
dvc run -d data/mnist_train.csv -d code/split_test_train.py -d code/conf.py -o data/X_train.npy -o data/Y_train.npy -o data/X_val.npy -o data/Y_val.npy python code/split_test_train.py 0.33 2
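To make the split step more concrete, here is a minimal sketch of a reproducible split in plain Python. It is not the code in code/split_test_train.py, which the tutorial does not show. It assumes that the two command-line arguments, 0.33 and 2, are the validation fraction and the random seed; that reading is a guess. The function name `split_rows` is hypothetical.

```python
import random


def split_rows(rows, test_fraction, seed):
    """Shuffle rows reproducibly and split them into (train, validation)."""
    rng = random.Random(seed)      # a fixed seed gives the same shuffle on every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * test_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Assumed meaning of the arguments: 0.33 = validation fraction, 2 = seed.
train, val = split_rows(range(100), test_fraction=0.33, seed=2)
print(len(train), len(val))  # 67 33
```

A fixed seed matters here because DVC tracks the output files by their content: a split that changed on every run would make the outputs look modified even when neither the data nor the code had changed.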
Next, create a directory for the model output and train the model.
mkdir model
dvc run -v -d data/X_train.npy -d data/Y_train.npy -d data/X_val.npy -d data/Y_val.npy -d code/conf.py -d code/model_train.py -o model/model.json -o model/model.h5 python code/model_train.py 1 256
Then evaluate the model and record the result in a metric file:
dvc run -d data/mnist_train.csv -d code/conf.py -d code/model_test.py -M data/eval.txt -f Dvcfile python code/model_test.py
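The `-d` and `-o`/`-M` flags in the commands above declare what each stage reads and writes, and that is what lets the stages be chained into a pipeline. The sketch below illustrates the idea only and is not how DVC is implemented. The stage names `split`, `train`, and `evaluate` are hypothetical labels; the file paths are copied from the commands. It uses `graphlib`, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Stage names are hypothetical; deps and outs are copied from the dvc run commands.
stages = {
    "split": {
        "deps": ["data/mnist_train.csv", "code/split_test_train.py", "code/conf.py"],
        "outs": ["data/X_train.npy", "data/Y_train.npy",
                 "data/X_val.npy", "data/Y_val.npy"],
    },
    "train": {
        "deps": ["data/X_train.npy", "data/Y_train.npy", "data/X_val.npy",
                 "data/Y_val.npy", "code/conf.py", "code/model_train.py"],
        "outs": ["model/model.json", "model/model.h5"],
    },
    "evaluate": {
        "deps": ["data/mnist_train.csv", "code/conf.py", "code/model_test.py"],
        "outs": ["data/eval.txt"],  # declared with -M, i.e. a metric file
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce one of its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }


graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(graph["train"])  # {'split'}
```

In this sketch, `train` comes after `split` because it consumes the four .npy files. The evaluation command, as written, lists no model file among its `-d` dependencies, so nothing here ties it to the training stage.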
DVC's metrics feature displays the recorded evaluation result:
dvc metrics show
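For the metrics command to have something to show, the evaluation script has to write its result to the file declared with `-M` (data/eval.txt above). The tutorial does not show code/model_test.py, so the following is only a sketch of how such a script might write a metric. The helper `write_metric`, the metric name, and the value are all hypothetical placeholders, not real results. It writes into a temporary directory so that running it does not touch the project files.

```python
import tempfile
from pathlib import Path


def write_metric(path, name, value):
    """Write a single 'name: value' line to the metric file, creating parent dirs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"{name}: {value:.4f}\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    # 0.5 is a placeholder value, not a measured accuracy.
    metric_file = write_metric(Path(tmp) / "data" / "eval.txt", "accuracy", 0.5)
    content = metric_file.read_text()

print(content)
```

Keeping the metric in a small text file means it can be committed alongside the code, so each Git revision carries the score it produced.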
Moreover, you can link DVC to Google Drive File Stream as remote storage. This can be especially useful at large companies that rely on Google services.