ML Cycle Architecture With Human in the Loop

Self Introduction

Hi everyone. I am Rohith Kumar Chandragiri, a Junior Machine Learning Engineer at Money Forward. This article is about post-deployment updates to ML models. To make it readable for software engineers, I have written it in a story format. I hope you like it.


This is the story of an ML serving setup that evolved, release by release, from a basic architecture into a state-of-the-art ML architecture.

Story Part I: The First ML Model in Production 

We put a lot of effort into training and evaluating multiple models. Eventually, we selected the model best suited to our users based on factors such as latency and infrastructure requirements. To make the model available to users, we wrapped it in an API.

After a month, some users told us that the predictions were relatively weak. So the engineers gathered the latest data, evaluated the existing model on it, and observed that the model's predictions were indeed poor. This happened even though we had made sure that the API was highly available and reliable.

Why was the model performing poorly after just one month post-release?

  1. Data drift
    1. The model was trained on data that was 6 months old. 
  2. Training – Production Discrepancies
    1. The pre-processing code used in the training and production environments might have been different.
    2. Research code is often patched together quickly by ML research engineers, which can introduce bugs.
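
One way to catch the first cause above is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI), a technique that this story does not mention and that is swapped in here purely for illustration. The function name, the bin count, and the floor value are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a training-time sample and a production sample of one numeric feature.

    Assumes both samples are non-empty. Bin edges come from the training sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the training range
            counts[idx] += 1
        # A small floor avoids log(0) when a bucket is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of zero means the two samples fall into the buckets in identical proportions. Larger values indicate a larger shift, and the threshold for raising an alert is a choice each team would have to make for its own data.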

What do you think we missed in the above deployment?

We made sure that the API was highly scalable and reliable, but we failed to maintain the model’s reliability. The reliability of a deployed model is closely tied to its accuracy, so we have to measure that accuracy continuously.

To measure the model’s accuracy, we need the ground truth for its predictions, which we can either obtain from users or produce by hiring manual annotators. With the ground truth, we can visualize the performance of the production model on a live dashboard.
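
As a minimal sketch of the number such a dashboard could plot, the code below joins each prediction with its ground truth and computes accuracy per day. The `LabeledPrediction` record and the `daily_accuracy` function are hypothetical names, not part of our actual system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable


@dataclass
class LabeledPrediction:
    day: date
    predicted: str
    ground_truth: str  # label supplied by a user or by an annotator


def daily_accuracy(records: Iterable[LabeledPrediction]) -> Dict[date, float]:
    """Return the share of correct predictions for each day, ordered by date."""
    hits: Dict[date, int] = defaultdict(int)
    totals: Dict[date, int] = defaultdict(int)
    for r in records:
        totals[r.day] += 1
        hits[r.day] += int(r.predicted == r.ground_truth)
    return {day: hits[day] / totals[day] for day in sorted(totals)}
```

Only predictions that already have a ground-truth label can be counted, so the curve lags behind live traffic by however long annotation takes.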

In the end, we set up a dashboard to monitor model accuracy, along with annotators who could correct our predictions.

Story Part II: The First Update of the ML Model in Production

After gathering the latest data, the engineers trained a new ML model and released the latest version of the model into production. After the release of the model, we noticed that there was a drop in model accuracy, but what could be the reason? The first thought that came to mind was that our model could be overfitting to the latest production data.

Releasing a worse-performing model into production is not only bad for our users, but also bad for us as ML engineers. As time passes, users expect a better-performing model, so if they suddenly see worse predictions, it erodes their trust in our services. Therefore, it is better to benchmark a new model in the background before using its predictions in the live product. One such release engineering technique is called “shadow deployment.”

Shadow deployment is a deployment technique where all incoming calls from users are also sent to the new model, but its predictions are not returned in the product; users keep receiving the predictions of the current production model. The new model’s predictions are still stored so that its performance can be tracked. After confirming that the newer model is consistently better, we can release it to production.

[Figure 1: Diagram explaining the deployment technology suitable for the respective ML pipeline. Source:]
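
The shadow deployment described above can be sketched as a small router. This is a sketch under stated assumptions, not our production code: `ShadowRouter`, its in-memory `shadow_log`, and the idea of treating a model as a plain callable are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Model = Callable[[Any], Any]  # assumption: a model is anything callable on a request


class ShadowRouter:
    """Serve the production model and silently record what the shadow model would say."""

    def __init__(self, production: Model, shadow: Model) -> None:
        self.production = production
        self.shadow = shadow
        self.shadow_log: List[Dict[str, Any]] = []  # stand-in for a real prediction store

    def predict(self, request: Any) -> Any:
        served = self.production(request)
        entry: Dict[str, Any] = {"request": request, "served": served}
        try:
            entry["shadow"] = self.shadow(request)
        except Exception as exc:
            # A failing shadow model must never affect what the user receives.
            entry["error"] = repr(exc)
        self.shadow_log.append(entry)
        return served  # only the production prediction reaches the product
```

For example, `ShadowRouter(old_model, new_model).predict(x)` returns the old model's answer while the log keeps both answers for later comparison against the ground truth. A real service would probably call the shadow model asynchronously so that it adds no latency to the user's request.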

Story Part III: The First Automatic Training Trigger

A new machine learning engineer joined the team and noticed that the previous engineers were just running the same training code again and again on different datasets. He realized that with a proper pipeline for versioning the data, we could automate the training and deployment processes.

The new engineer started versioning the dataset by moon phase: the waxing phase (the moon grows from new to almost full) and the waning phase (the moon shrinks from full to almost new). Every month, the waxing-phase data is added to the training dataset and the waning-phase data is set aside for evaluation. Once the whole dataset has been evaluated, the annotators trigger the training pipeline, and the new model is automatically deployed in shadow mode.
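
The waxing and waning split could look roughly like the sketch below. It approximates the moon's age from the mean synodic month and one reference new moon, which is an assumption made for illustration; the function names are hypothetical, and a real pipeline might use an ephemeris library or simply fixed calendar windows instead.

```python
from datetime import datetime, timezone
from typing import Any, Iterable, List, Tuple

# Assumptions for this sketch: mean length of a lunar cycle in days,
# and a reference new moon used as the start of the count.
SYNODIC_MONTH_DAYS = 29.530588853
REFERENCE_NEW_MOON = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)


def moon_phase(ts: datetime) -> str:
    """Return "waxing" for the first half of the lunar cycle, "waning" for the second.

    `ts` must be timezone-aware, otherwise the subtraction raises TypeError.
    """
    days_since_reference = (ts - REFERENCE_NEW_MOON).total_seconds() / 86400.0
    age = days_since_reference % SYNODIC_MONTH_DAYS
    return "waxing" if age < SYNODIC_MONTH_DAYS / 2 else "waning"


def split_by_phase(
    records: Iterable[Tuple[datetime, Any]]
) -> Tuple[List[Any], List[Any]]:
    """Split (timestamp, payload) pairs into training and evaluation payloads."""
    train: List[Any] = []
    evaluation: List[Any] = []
    for ts, payload in records:
        (train if moon_phase(ts) == "waxing" else evaluation).append(payload)
    return train, evaluation
```

Because the split depends only on each record's timestamp, rerunning it on the same data always produces the same training and evaluation sets, which is what makes the dataset versions reproducible.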

The shadow-deployed model is monitored for a few days, and once 5 days of results are available, it is automatically deployed to the production environment.
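
A promotion rule for this step could be sketched as follows. The 5-day window comes from the story; the extra condition that the shadow model must match or beat production on each of those days is an assumption added here, based on the earlier goal of releasing only a consistently better model. `should_promote` is a hypothetical name.

```python
from typing import Sequence

REQUIRED_DAYS = 5  # monitoring window taken from the story


def should_promote(
    shadow_daily_acc: Sequence[float],
    prod_daily_acc: Sequence[float],
    required_days: int = REQUIRED_DAYS,
) -> bool:
    """Decide whether the shadow model can replace the production model.

    Both sequences hold one accuracy value per day, oldest first.
    """
    if len(shadow_daily_acc) < required_days or len(prod_daily_acc) < required_days:
        return False  # not enough results yet, keep monitoring
    recent = zip(shadow_daily_acc[-required_days:], prod_daily_acc[-required_days:])
    # Assumption: "consistently better" means not worse on any of the recent days.
    return all(shadow >= prod for shadow, prod in recent)
```

The deployment pipeline can call this once a day and promote the shadow model the first time it returns `True`.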

Conclusion: Set up versioning of the data followed by continuous training, continuous evaluation, and continuous deployment.

With these three stories, the first version of the SOTA ML cycle is built. In every iteration, an annotator teaches the right answers to the ML model, hence the name ML Cycle with Human in the Loop.

This architecture is not robust to concept drift. If concept drift occurs, the machine learning engineer needs to identify it and find a new model architecture for the problem. Once the new architecture is identified, the training, evaluation, and deployment pipelines are updated to use it.

Post Credits

We at Money Forward are trying to create SOTA ML architectures to automate training and evaluation pipelines. The resources required for annotation, training, and inference are hosted on a Kubernetes cluster. I am thankful to have many experienced members on the team.

Happy Modeling!!!