Demystifying Machine Learning: Comprehensive Guide to Development & Deployment
Introduction
My journey in the MLOps/ML space has been an enriching one as my experience has grown. My initial exposure to machine learning came in a previous role, working on recommendation systems while the team was exploring various feature stores. After grasping the basics, I recently seized the opportunity to standardize the ML development life cycle, which involved evaluating different products for the various stages of the development process.
While navigating the vast amount of literature and resources available on MLOps, I dedicated significant time to understanding the fundamental ML model development life cycle. Recently, I was listening to "Outsmart Your Brain" by Daniel T. Willingham, an insightful book on studying efficiently and effectively while enjoying the learning process. Willingham talks about SQRRR (Survey, Question, Read, Recite, and Review), a reading comprehension method emphasizing the benefits of retrieval in strengthening long-term memory.
This post serves as my recital of what I've learned about MLOps over the past year. I should warn you that this is a long post and not for the faint of heart. It unravels the secrets of the ML Development life cycle, covering fundamental components at each stage, best practices, and insights into third-party tools explored during this journey.
ML Pipeline: From Inception to Deployment
Developing, deploying, and operationalizing ML applications in production involves several stages, as shown in the diagram below. The initial phase encompasses data collection, preprocessing, and validation to address any potential data quality issues. Subsequent stages include feature engineering, model training, and hyperparameter tuning. It is crucial to orchestrate and monitor each step to identify data drift and facilitate automatic model retraining.
Data Organization
Just as we humans learn from books, ML models acquire knowledge from data. The richness and quality of the data shape the quality of the models, empowering them to make informed decisions and contribute meaningfully. Thus, meticulous attention to data collection and preprocessing is crucial to ensure quality and reliability. By effectively organizing our data, we proactively address potential problems, enabling a smooth transition through later stages like feature engineering, model training, and hyperparameter tuning. A well-structured data foundation not only streamlines the workflow but also directly influences the accuracy and efficiency of the entire ML development cycle.
For supervised machine learning, crafting a "gold dataset" takes center stage. This carefully chosen and meticulously validated dataset becomes the benchmark for model training and evaluation. By establishing a gold set, we can ensure our ML models learn from high-quality, representative data. This ultimately strengthens the entire ML pipeline, boosting its robustness and trustworthiness.
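As a minimal sketch of what curating a gold dataset might look like in practice (the input file, column names, and split sizes here are hypothetical placeholders, not from a specific project), the snippet below applies basic quality gates, deduplicates, and carves out a stratified hold-out split that every experiment can evaluate against:

```python
# Minimal sketch of curating a "gold dataset" -- the input file and
# column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.read_csv("labeled_examples.csv")  # hypothetical labeled data

# Basic quality gates: drop rows with missing labels/features and exact duplicates.
gold = (
    raw.dropna(subset=["label", "feature_a", "feature_b"])
       .drop_duplicates()
)

# Stratified split so the evaluation set mirrors the label distribution.
train_df, eval_df = train_test_split(
    gold, test_size=0.2, stratify=gold["label"], random_state=42
)

# Version the artifacts so every experiment trains and evaluates
# against the same benchmark data.
train_df.to_parquet("gold_train_v1.parquet")
eval_df.to_parquet("gold_eval_v1.parquet")
```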
Development Environments
In the context of AI/ML development environments, choosing the right platform is crucial, as it can significantly impact team productivity, cost-effectiveness, and collaborative efficiency. Balancing flexibility, scalability, and GPU utilization is key.
Here's a summary of the options:
Google Colab Pro - Cloud-based, cost-effective, but with limited GPU availability and no direct Git integration.
Self-hosting JupyterHub - Collaborative and scalable, but requires managing infrastructure and may be cost-prohibitive depending on team size.
Cloud Instance with GPU - Powerful for large-scale training, but requires careful cost management.
The best option depends on your specific needs and costs.
Feature Stores
Feature stores centralize curated and reusable features for machine learning pipelines. They offer an organized environment to store, manage, and share features across models, promoting reproducibility and faster development. By ensuring consistent and up-to-date feature data, they improve monitoring and observability. Ultimately, feature stores contribute to efficient, scalable, and reliable ML systems by promoting good feature management practices.
We self-hosted Feast on one of my prior teams, but this site maintains a list of all the feature stores available: https://www.featurestore.org/
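To make the idea concrete, here is a minimal Feast sketch; the entity, data source, and feature names are hypothetical placeholders, and it assumes the definitions have been registered in a feature repo with `feast apply`. The same definition backs both offline training data and consistent online lookups at serving time:

```python
# Minimal Feast sketch -- the entity, source path, and feature names
# are hypothetical placeholders.
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32

user = Entity(name="user", join_keys=["user_id"])

user_stats_source = FileSource(
    path="data/user_stats.parquet",   # hypothetical offline source
    timestamp_field="event_timestamp",
)

user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[Field(name="avg_session_length", dtype=Float32)],
    source=user_stats_source,
)

# At serving time, the same definition backs a consistent online lookup
# (assumes `feast apply` has registered these definitions in the repo).
store = FeatureStore(repo_path=".")
features = store.get_online_features(
    features=["user_stats:avg_session_length"],
    entity_rows=[{"user_id": 42}],
).to_dict()
```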
Model Training
Following initial model development, the pivotal next step is model training, encompassing data preparation, feature engineering, writing the ML model code, executing that code on the prepared data, and evaluating the results. Containerizing training steps is recommended for enhanced flexibility across different platforms. One can opt for a cloud provider or establish a self-managed training pipeline. Despite initial challenges, Google Vertex AI proved effective for setting up a training pipeline. Additionally, Dagster, known for modeling data pipelines, was successfully used to construct a training pipeline, showcasing its suitability for both data and training pipelines.
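To give a flavor of the Dagster route, below is a minimal sketch of a three-step training pipeline using Dagster's op/job API; the dataset and model choice are stand-ins for illustration:

```python
# Minimal Dagster training-pipeline sketch -- the dataset and model
# are hypothetical stand-ins.
from dagster import job, op
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


@op
def prepare_data():
    # Load and split the data; a real pipeline would pull from a
    # feature store or versioned gold dataset.
    X, y = load_iris(return_X_y=True)
    return train_test_split(X, y, test_size=0.2, random_state=42)


@op
def train_model(splits):
    X_train, X_test, y_train, y_test = splits
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    return model, X_test, y_test


@op
def evaluate_model(context, trained):
    model, X_test, y_test = trained
    acc = accuracy_score(y_test, model.predict(X_test))
    context.log.info(f"holdout accuracy: {acc:.3f}")


@job
def training_pipeline():
    evaluate_model(train_model(prepare_data()))
```

Calling `training_pipeline.execute_in_process()` runs the whole graph locally, which makes it easy to iterate on the steps before containerizing them.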
Next comes continuous learning, a stage crucial for keeping models relevant and adaptive. Much like software, models need to evolve to maintain their effectiveness. This iterative process involves real-time monitoring that triggers automatic retraining, scheduled updates based on new data, and incorporating human feedback through labeled data to address specific weaknesses. Continuous learning transforms our models into adaptable, robust entities, always prepared for the next challenge.
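At its simplest, the automatic-retraining trigger boils down to comparing a monitored metric against a floor; the sketch below illustrates the idea, with the metric, threshold, and retrain hook all being hypothetical:

```python
# Sketch of a drift-triggered retraining check -- the metric source,
# threshold, and retrain hook are hypothetical.
ACCURACY_FLOOR = 0.90  # hypothetical minimum acceptable live accuracy

def maybe_retrain(live_accuracy: float, retrain_fn) -> bool:
    """Kick off retraining when live performance drops below the floor."""
    if live_accuracy < ACCURACY_FLOOR:
        retrain_fn()  # e.g., launch the training pipeline sketched above
        return True
    return False

# Example: a scheduled job feeds in the latest monitored metric.
triggered = maybe_retrain(live_accuracy=0.87, retrain_fn=lambda: print("retraining..."))
```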
Model Management
ML Model Management involves systematically organizing, deploying, monitoring, and maintaining machine learning models throughout their life cycle. This process ensures reproducibility, allowing data scientists to recreate model results in different environments and facilitating collaboration between model builders and ML Engineers. Model management is crucial for monitoring performance, making timely updates, and addressing issues such as concept drift.
MLflow, an open-source platform, addresses these challenges by providing a flexible and reproducible ML life cycle management solution. Its components include Tracking for centralized metadata, Projects for reproducible code packaging, and Models for versatile deployment in various production environments. The platform's flexibility allows users to start with tracking/experiments and explore other features as needed. For detailed information, refer to https://mlflow.org/
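To illustrate the Tracking component, here is a minimal sketch using MLflow's Python API; the experiment name, parameters, and metric values are hypothetical:

```python
# Minimal MLflow tracking sketch -- experiment name, params, and
# metric values are hypothetical.
import mlflow

mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("val_accuracy", 0.93)
    # Artifacts (plots, model files, etc.) can be logged alongside metadata:
    # mlflow.log_artifact("confusion_matrix.png")
```

Everything logged this way lands in the tracking server (or a local mlruns directory by default), making runs directly comparable across experiments.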
Model Deployment/Inference
Model inference, the final stage of the machine learning life cycle, involves running a trained model on new data to make predictions or generate outputs. Several model inference hosting products are available, including Google Vertex AI Predictions, BentoML, AWS SageMaker, and NVIDIA Triton. The key considerations for selection are performance, cloud agnosticism, and cost-effectiveness. I wrote about my experience trying out Triton in an earlier post and recommend reading that if you're interested.
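For a flavor of what querying a deployed model looks like, here is a minimal Triton HTTP client sketch; the server URL, model name, and tensor names/shapes are hypothetical placeholders that would come from the actual model's configuration:

```python
# Minimal Triton inference sketch -- server URL, model name, and
# input/output names/shapes are hypothetical.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a single FP32 input tensor matching the (hypothetical) model config.
batch = np.random.rand(1, 4).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output__0")],
)
predictions = response.as_numpy("output__0")
```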
Observability
Monitoring tools such as MLflow or Weights & Biases provide insight into "what" is occurring with our model. ML observability goes further, answering the "why?" and "how?" questions by delving into the model's internal operations and decision-making process to provide a deeper understanding of its behavior. There are a few commercially available products from Arize and Fiddler AI, as well as an open-source solution, Evidently.ai.
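As a small sketch of the open-source route, the example below generates a data drift report with Evidently; it assumes the Report API from Evidently's pre-1.0 releases, and the dataframes are hypothetical stand-ins for reference (training) and current (production) data:

```python
# Minimal Evidently drift-report sketch -- dataframes are hypothetical
# stand-ins for reference (training) and current (production) data.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.DataFrame({"feature_a": [0.1, 0.2, 0.3, 0.4, 0.2, 0.3, 0.1, 0.4]})
current = pd.DataFrame({"feature_a": [0.9, 1.1, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # shareable drift summary
```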
Conclusion
In summary, this post provides a comprehensive overview of the ML model development life cycle, covering each stage with a brief exploration of the available options. The visual representation below illustrates the complete ML pipeline, from inception to deployment.
ML/AI platforms such as AWS SageMaker and Google Vertex AI offer comprehensive support across all stages of ML development and deployment. While these platforms provide convenience, it's crucial to be mindful of associated costs. As discussed earlier, opting for these platforms for specific ML lifecycle stages can be wise, but considering cost-effective self-hosting with open-source solutions for other aspects may also be a prudent strategy. After all, saving money on cloud bills is the real unsung hero of any successful ML project – your VP of Finance will thank you!