Versioning Data Files For Machine Learning Projects
layout: Versioning data files for machine learning projects with DVC categories: tech date: 2020-10-09 published: true —
DVC Tutorial
Machine Learning projects deals with both the data and code. DVC is a tool which provides data versioning. For more information on DVC visit here.
The main purpose of this tutorial is to get a brief overview of the DVC and how it is solving modern machine learning project issues.
To start a project that both uses GIT and DVC follow these steps.
Clone this repository
git clone https://github.com/ryokugyu/dvc_tutorial.git
After cloning the repository, change directory to dvc_tutorial
.
Initialize DVC repository
dvc init
After initializing the DVC repository. Let’s pull the data into our machine locally:
dvc push
Now we have both the data and code present locally in our machine. First, split the dataset into 70-30% ratio.
Splitting the dataset:
dvc run -d data/mnist_train.csv -d code/split_test_train.py -d code/conf.py -o data/X_train.npy -o data/Y_train.npy -o data/X_val.npy -o data/Y_val.npy python code/split_test_train.py 0.33 2
- Now, lets create a folder named
model
.
mkdir model
compiling and training the model. also validating the performance.
storing the model matrix also
dvc run -v -d data/X_train.npy -d data/Y_train.npy -d data/X_val.npy -d data/Y_val.npy -d code/conf.py -d code/model_train.py -o model/model.json -o model/model.h5 python code/model_train.py 1 256
loading the model and testing the model performance with exxternal testing dataset
dvc run -d data/mnist_train.csv -d code/conf.py -d code/model_test.py -M data/eval.txt -f Dvcfile python code/model_test.py
DVC metric feature
dvc metrics show
Moreover you can link DVC to your Google Drive File Stream remote space, especially when working in large companies that are using google services, this feature may turn out to be extremely useful.