Understanding Tracking in MLOps: Code, Data, and Models
Tracking is a familiar concept for many, whether you are a researcher or an engineer working in machine learning, data science, or software development. Regardless of your role, tracking is crucial and unavoidable. In MLOps, we track several elements: code, data, and machine learning models. This article explains why tracking is necessary through a practical illustration, where we implement tracking at different stages of a machine learning workflow. The complete codebase related to this discussion can be found in the accompanying repository.
Table of contents:
- Introduction
- Project Setup
- Code Tracking
- Data Tracking
- ML Model Tracking
- Conclusion
1. Introduction
Tracking is fundamentally the act of documenting and monitoring the changes and state of a system's components to improve productivity, collaboration, and maintenance. Within MLOps, tracking is one of the core principles: it covers the history of the different phases of the machine learning workflow, including data, machine learning models, and code. In a prior tutorial on MLOps principles, I linked tracking with monitoring after a model has been deployed. Although these concepts are related, they are not identical: monitoring emphasizes real-time observation of system performance post-deployment, while tracking spans the entire project lifecycle.
Why is tracking important? Tracking code, data, and models enhances reproducibility by recording inputs, outputs, code executions, workflows, and models. Furthermore, it augments testing by identifying anomalies in system behavior and performance. The iterative nature of machine learning development further necessitates effective tracking.
When should tracking occur? As mentioned, tracking should be applied throughout the entire project lifecycle, simultaneously for code, data, and models due to their interconnected nature. The processing of data and the development of ML models rely on code, thus necessitating tracking across all these components.
What are the use cases for tracking? Consider a scenario within a handwritten digit classification project. Suppose the deployed machine learning model, which was trained on a public dataset and achieved a specific accuracy during development, begins to underperform once in production. Tracking the model's behavior allows for early detection of this decline. Additionally, tracking the individual MLOps components (code, data, and ML model) can help identify the reasons for this decline:
- By tracking the code, any bugs can be identified quickly, and the associated commit can be pinpointed. This enables a temporary rollback, correction of the issue, and reintegration into production.
- By tracking the incoming data, changes in its distribution and characteristics can be detected, revealing issues such as data drift over time.
- Tracking the ML model not only aids in detecting performance degradation but also allows for rollbacks and updates without disruption.
Although this article focuses on the concept of tracking, it is also part of my MLOps series. By engaging with my previous and forthcoming tutorials, you’ll be equipped to create your own end-to-end MLOps project, from workflow development to model deployment and tracking.
For those interested in MLOps, I encourage you to explore my articles:
- Tutorial 1: A Key Start to MLOps: Exploring Its Essential Components
- Tutorial 2: A Beginner-Friendly Introduction to MLOps Workflow
- Tutorial 3: Introduction to MLOps Principles
- Tutorial 4: Structuring Your Machine Learning Project with MLOps in Mind
- Tutorial 5: Version Controlling in Practice: Data, ML Model, and Code
- Tutorial 6: Testing in Practice: Code, Data, and ML Model
2. Project Setup
In this article, we will explore the handwritten digit classification project utilizing a Convolutional Neural Network (CNN). The model’s task is to identify a digit, ranging from 0 to 9, from an input image and output the corresponding label. The project's AI canvas is shown in the accompanying figure.
This project serves as a step-by-step tutorial in my MLOps series. Its structure adheres to a template suitable for MLOps, available as a cookiecutter project or GitHub template. More details on this project structure can be found in my previous article. We will utilize Git for code version control and DVC for data version control. The complete codebase for this project is accessible in the repository.
The remainder of this article will enhance this project by incorporating tracking for code, data, and the ML model, demonstrating how this can be accomplished.
3. Code Tracking
Code tracking is vital for maintaining machine learning projects. It involves documenting code versions, changes in dependencies, and all updates related to the code. To effectively track our code, we need to adhere to several best practices:
- Utilize a version control system like Git, leveraging its features such as tags, detailed commit messages, and other functionalities to maintain history and switch between different commits. For further insights into version control, refer to my article: Version Controlling in Practice: Data, ML Model, and Code.
- Implement a Git workflow tailored to the project’s needs, which aids in tracking code changes and feature development, ensuring changes are isolated before merging into the main branch for easier tracking. For more information on Git workflows, check out my articles: Mastering Git: The 3 Essential Workflows for Efficient Version Controlling or Git Workflow for Machine Learning Projects: the Git Workflow I use in my Projects.
- Manage dependencies and their versions using tools like pip for Python. I typically create a requirements.txt file prior to sharing or publishing the project and include it in the version control system to monitor dependencies (a minimal example follows this list).
- Integrate the repository with an MLOps platform that facilitates end-to-end machine learning lifecycle orchestration for improved tracking.
- Additional practices will be discussed in future tutorials, including continuous deployment (CD) and continuous integration (CI).
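As a minimal illustration of the dependency-pinning practice above, the following commands snapshot the active environment into a requirements.txt file and place it under version control (the commit message is just an example):
$ pip freeze > requirements.txt # record the exact versions of the installed packages
$ git add requirements.txt
$ git commit -m "build: pin project dependencies" # example commit message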
Now, let's detail some Git commands commonly used in code tracking:
To verify the repository’s status, the command git status displays the current branch status, lists file changes, and highlights untracked files.
$ git status
On branch feature/grad_cam
Your branch is up to date with 'origin/feature/grad_cam'.
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: src/models/cnn/train.py
modified: tests/grad_cam.py
Untracked files:
(use "git add <file>..." to include in what will be committed)
mlartifacts/
mlruns/
tests/mlruns/
no changes added to commit (use "git add" and/or "git commit -a")
To list branches, you can use one of the following commands:
$ git branch # to list all local branches
feature/data-test
feature/grad_cam
feature/integration_test
feature/model-test
feature/preprocessing_test
master
$ git branch -r # to list all remote branches
origin/HEAD -> origin/master
origin/feature/data
origin/feature/data-dvc
origin/feature/data-test
origin/feature/grad_cam
origin/feature/integration_test
# ...
$ git branch -a # to list all branches (remote and local branches)
feature/data-test
feature/grad_cam
feature/integration_test
feature/model-test
feature/preprocessing_test
master
remotes/origin/HEAD -> origin/master
remotes/origin/feature/data
remotes/origin/feature/data-dvc
remotes/origin/feature/data-test
# ...
$ git branch -vv # to list all branches with detailed information
feature/data-test a976d83 [origin/feature/data-test] test: add features domain validation.
feature/grad_cam f959be7 [origin/feature/grad_cam] Merge branch 'feature/integration_test'
# ...
To view the commit history, the command is:
$ git log # to display the detailed commit history
# ...
# Alternatively, use:
$ git log --pretty=format:"%h %s" # to show only the commit ID and message
f959be7 Merge branch 'feature/integration_test'
eca40ba fix: predict using the latest run.
aa53e29 feat: system integration testing.
Additional commands for simplicity and readability include:
$ git diff # to view changes between the working directory and the staging area.
$ git diff --staged # to see changes between the staging area and the last commit.
$ git reset <file> # to unstage changes
$ git checkout -- <file_name> # to discard local changes
$ git revert <commit_hash> # to undo a commit
$ git reset --soft <commit_hash> # to move HEAD to a specific commit while keeping changes in staging area
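Tags, mentioned among the best practices above, deserve a quick example as well: they mark the exact commit behind a released model or dataset so it can be recovered later (the tag name below is only an example):
$ git tag -a v1.0.0 -m "first production model" # annotate the current commit with a tag
$ git push origin v1.0.0 # share the tag with the remote repository
$ git checkout v1.0.0 # return to that exact state whenever needed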
4. Data Tracking
Data tracking is another crucial practice for managing machine learning projects. This involves documenting the data version in various forms, metadata, applied processes, and quality over time. To efficiently track our data, a series of actions must be taken:
- Implement data versioning to ensure changes can be monitored and reproduced.
- Establish data lineage to track the origin and transformations of data throughout the processing and ML pipeline (a sketch follows this list).
- Log metadata, including data sources, to ensure preprocessing steps and transformations are recorded.
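For data lineage in particular, DVC pipeline stages are one way to record how each artifact is produced from its inputs. The sketch below uses hypothetical script and data paths, not the actual paths of this project:
$ dvc stage add -n preprocess \
      -d src/data/preprocess.py -d data/raw \
      -o data/processed \
      python src/data/preprocess.py # hypothetical stage: raw data in, processed data out
$ dvc repro # run the pipeline and record the exact inputs and outputs in dvc.lock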
In a prior tutorial, we utilized DVC for data tracking. Here are some commonly used commands for data tracking:
To check the status of data files, whether they are current or need synchronization, use:
$ dvc status
To verify the correct data file versions based on the current Git commit, use:
$ dvc checkout
When data is stored in remote storage, you can use:
$ dvc pull # to retrieve the latest data files to the local workspace
$ dvc push # to upload the latest data files from the local workspace to remote storage
$ dvc fetch # to fetch data files from remote storage without checking them out to the workspace.
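Putting these commands together, a typical round trip for versioning a new snapshot of the data might look like the following (the data path is only an example):
$ dvc add data/raw # start tracking the dataset with DVC (hypothetical path)
$ git add data/raw.dvc .gitignore # version the small metadata file produced by DVC
$ git commit -m "data: add new raw data snapshot"
$ dvc push # upload the data itself to remote storage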
5. ML Model Tracking
If I were to prioritize, I would rank ML model tracking as the most critical aspect. Monitoring a model's performance allows for the early identification of any system issues and aids in decision-making and prompt resolution.
Tracking an ML model encompasses recording its name, architecture, parameters, weights, and experiments, as well as the code and data versions used during training. I understand that it can seem overwhelming! Before delving into MLOps, I struggled to save all my experiments effectively. I relied on basic file-based storage methods (like pickle and CSV files) which lacked scalability and required manual management, limiting reproducibility and collaboration. Thankfully, this led me to explore advanced approaches and learn about new technologies in MLOps. Numerous tools and platforms exist today to cater to the varying needs across different MLOps stages, but that is beyond the scope of this article.
For this discussion, we will utilize MLflow, which was introduced in a previous tutorial (Version Controlling in Practice: Data, ML Model, and Code) and employed for version control of the ML model.
First, we initiate a local MLflow Tracking Server:
mlflow server --host 127.0.0.1 --port 8080
Next, we set an MLflow experiment to manage and organize our training runs:
# Set tracking server URI for logging
mlflow.set_tracking_uri(tracking_uri)
# Create an MLflow Experiment
mlflow.set_experiment(experiment_name)
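For clarity, here is what these calls might look like once concrete values are filled in; this is a minimal sketch that assumes the local server started above and uses a hypothetical experiment name:
import mlflow

# Point the client at the local tracking server started earlier
mlflow.set_tracking_uri("http://127.0.0.1:8080")

# Create the experiment if it does not exist yet, then make it the active one (name is hypothetical)
mlflow.set_experiment("digit-classification-cnn")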
MLflow offers an impressive feature called mlflow.autolog(), which automatically logs metrics and parameters. This command should be called before the training code:
with mlflow.start_run():
    mlflow.autolog()
    # Train:
    model.compile(loss=loss, optimizer='adam', metrics=[metric])
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(x_val, y_val))
The MLflow auto-log captures approximately 29 parameters, including batch size, number of epochs, and optimizer name, along with training metrics like loss values. Another significant advantage of MLflow is its graphical interface, which allows you to view logs and display graphs.
Additionally, we often need to log other metrics and parameters, which can be accomplished using mlflow.log_metrics() and mlflow.log_params(). Here’s how to log the loss function name, precision, and F1 score:
with mlflow.start_run():
    mlflow.autolog()
    # Train:
    model.compile(loss=loss, optimizer='adam', metrics=[metric])
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(x_val, y_val))
    # Log additional parameters:
    mlflow.log_params({
        'loss': loss,
        'metric': metric,
    })
    # Log additional metrics (acc, precision, recall, f1, test_loss and test_metric are computed beforehand, e.g. on the test set)
    mlflow.log_metrics({
        'acc': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'training_loss': history.history['loss'][-1],
        'training_acc': history.history['accuracy'][-1],
        'val_loss': history.history['val_loss'][-1],
        'val_acc': history.history['val_accuracy'][-1],
        'test_loss': test_loss,
        'test_metric': test_metric
    })
As demonstrated, the additional metrics and parameters are accurately logged, and comparisons between different runs can be made to identify the best model.
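Such comparisons can also be made programmatically. The sketch below uses mlflow.search_runs() to rank the runs of an experiment by validation accuracy; the experiment name is the hypothetical one assumed earlier, and the metric keys match those logged above:
import mlflow

# Retrieve all runs of the experiment as a pandas DataFrame (experiment name is hypothetical)
runs = mlflow.search_runs(
    experiment_names=["digit-classification-cnn"],
    order_by=["metrics.val_acc DESC"],
)

# The first row now corresponds to the run with the highest validation accuracy
print(runs[["run_id", "metrics.val_acc", "metrics.f1"]].head())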
Moreover, MLflow automatically stores essential metadata, including the Git commit, user, training file source, model summary, and requirements file.
Additionally, we can effortlessly register and version our model using the MLflow UI, preparing it for deployment.
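The same registration can also be done from code. The following is a minimal sketch, assuming the model was logged under the default "model" artifact path and using placeholder values for the run ID and the registered-model name:
import mlflow

# Register the model logged by a given run in the model registry (run ID and name are placeholders)
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="digit-classifier-cnn",
)
print(result.version)  # the version number assigned by the model registry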
I cannot conclude this section without reiterating the importance of using tools like MLflow. They are vital for managing the complexities of the ML lifecycle, enhancing the structure and efficiency of development, experimentation, and deployment. With MLflow, ML model tracking becomes considerably more effective and advantageous, and its UI offers a dynamic, visual method for managing, tracking, and comparing models.
6. Conclusion
This article has reached its conclusion! We have introduced one of the most critical principles in MLOps: tracking. Tracking guarantees the quality, reliability, and reproducibility of machine learning workflows. It also plays a significant role in model selection for deployment, which we will delve into further in upcoming articles.
Through my writings, I aim to provide my readers with clear, organized, and easy-to-follow tutorials, offering a solid introduction to various topics while promoting sound coding and reasoning skills. My journey of self-improvement is ongoing, and I share my discoveries through these articles. I often refer back to my own writings as valuable resources when needed.
Thank you for reading! You can find all examples from my tutorials in my GitHub profile. If you find my tutorials helpful, please consider following me and subscribing to my mailing list for notifications on new articles. Feel free to leave comments with any questions or suggestions.
Image Credits
All images and figures in this article that do not have a source mentioned in the captions are by the author.