In recent years, the term MLOps has become a buzzword in the world of AI, usually discussed in the context of tools and technology. But while much attention is given to the technical side of MLOps, the operations themselves are often overlooked, especially the operations needed to run machine learning (ML) in production and to monitor it. Accountability for AI performance, timely alerts to the relevant stakeholders, and the processes needed to resolve issues are routinely set aside in favor of discussions about specific tools and tech stacks.
ML teams have traditionally been research-oriented, focusing heavily on training models to achieve high test scores. But once a model is deployed in real business processes and applications, the culture around production-oriented operations is often lacking. As a consequence, it is unclear who is responsible for the model's outcomes and performance. Without the right operations in place, even the most advanced tools and technology won't be enough to ensure healthy governance for your AI-driven processes. In this blog post, we'll explore key tips to help you set up a robust monitoring operation that proactively addresses issues before they hurt your business KPIs.
As noted above, data science and ML teams have traditionally been research-oriented, measured on model evaluation scores rather than on real-world, business-related outcomes. In such an environment, monitoring will never be done well because, frankly, no one cares enough. To fix this, the team that builds an AI model must take ownership of, and feel accountable for, the model's success or failure in serving the business function it was designed for.
The best way to achieve this is to measure individual and team performance against production-oriented KPIs, and to create an environment that fosters a sense of ownership over the model's overall performance, not just its performance in controlled testing environments.
While some team members may remain focused on research, it’s important to recognize that achieving good test scores in experiments is not sufficient to ensure the model’s success in production. The ultimate success of the model lies in its effectiveness in real-world business processes and applications.
To ensure the ongoing success of an AI-driven application, planning how it is going to be monitored is a critical factor that should not be overlooked.
In healthy engineering organizations, the release checklist for any new component includes setting up a monitoring plan, and AI teams should follow the same pattern. The person or team responsible for building a model must have a clear understanding of how it fits into the overall system, be able to anticipate the issues that could arise, and know who needs to be alerted and what actions should be taken when an issue occurs.
While some potential issues may be more research-oriented, such as data or concept drift, there are many other factors to consider, such as a broken feature pipeline or a third-party data provider changing input formats. It is important to anticipate as many of these issues as possible and set up a plan to effectively deal with them should they arise.
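For example, a simple check on an upstream data feed can catch a format change before it silently degrades predictions. Below is a minimal sketch in Python of what such a check might look like; the field names, thresholds, and alerting channel are hypothetical stand-ins for whatever your own pipeline uses.

```python
# Minimal sketch of an input-format check for a third-party data feed.
# The expected schema and the alerting hook are hypothetical examples --
# substitute the fields and channels your own pipeline relies on.

EXPECTED_SCHEMA = {
    "user_id": str,
    "signup_date": str,   # e.g. "2023-01-31"
    "country_code": str,  # e.g. "US"
    "monthly_spend": float,
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in a single incoming record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems

def check_batch(records: list[dict], alert_threshold: float = 0.05) -> None:
    """Alert a human if too many records in a batch fail validation."""
    bad = [r for r in records if validate_record(r)]
    failure_rate = len(bad) / max(len(records), 1)
    if failure_rate > alert_threshold:
        # Replace with your real alerting channel (Slack, PagerDuty, email...)
        print(f"ALERT: {failure_rate:.1%} of records failed schema validation")
```

A check like this costs little to write, yet it covers exactly the kind of "boring" failure mode that pure model-evaluation thinking tends to miss.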
Although it’s very likely that there are potential issues that will remain unforeseen, it’s still better to do something rather than nothing, and typically, the first 80% of issues can be anticipated with 20% of the work.
Sharing the responsibility among team members may be necessary or helpful, depending on the size of your team and the number of models or systems under your control. By setting up an “on-call” rotation, everyone can have peace of mind knowing that there is at least one knowledgeable person available to handle any issues the moment they arise.
It’s important to note that taking care of an issue doesn’t necessarily mean solving the problem immediately. Sometimes, it might mean triaging and deferring it to a later time or waking up the person who is best equipped to solve the problem. Sharing an on-call rotation with pre-existing engineering teams can also be an option in some instances. However, this is use-case dependent and may not be possible for every team.
Regardless of the approach, it is imperative to establish a shared knowledge base that the on-call person can draw on, so that your team is well prepared to handle emerging issues.
To maintain healthy monitoring operations, it is essential to have accessible resources that detail how your system works and its main components. This is where wikis and playbooks come in. Wikis can provide a central location for documentation on your system, including its architecture, data sources, and model dependencies. Playbooks can be used to document specific procedures for handling common issues or incidents that may arise.
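To make this concrete, one approach is to keep each playbook entry as structured data alongside the monitoring code, so the on-call person sees symptoms, first checks, and escalation paths in one place. The sketch below is purely illustrative; the incident names, checks, and contacts are hypothetical placeholders, and a wiki page in the same shape works just as well.

```python
# Illustrative playbook entries kept as structured data -- the symptoms,
# checks, and contacts are hypothetical placeholders for your own system.
PLAYBOOK = {
    "input_format_change": {
        "symptoms": [
            "spike in schema-validation failures",
            "drop in the fraction of records reaching the model",
        ],
        "first_checks": [
            "compare a sample of raw provider records against the expected schema",
            "check the provider's status page and changelog",
        ],
        "escalation": "data-engineering on-call, then the provider's support contact",
    },
    "concept_drift_suspected": {
        "symptoms": ["gradual decline in the weekly business KPI tied to the model"],
        "first_checks": [
            "compare recent feature distributions to the training baseline",
            "review recent product or market changes with stakeholders",
        ],
        "escalation": "model owner; consider scheduling retraining",
    },
}
```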
Having these resources in place helps facilitate knowledge sharing and ensures that everyone on the team is equipped to troubleshoot and resolve issues quickly. It also smooths the onboarding of new team members, who can quickly get up to speed on the system. In addition, well-documented procedures and protocols can reduce downtime and improve response times when issues arise.
Monitoring is an iterative process, and it is impossible to predict everything that might go wrong in advance. But when an issue does occur and goes undetected or unresolved for too long, it is important to conduct a thorough analysis and identify the root cause. Once the root cause is understood, the monitoring plan can be amended and improved accordingly.
Post-mortems also help build a culture of accountability, which, as discussed earlier, is the key factor in successful monitoring operations.
Once you have established the need for healthy monitoring operations and addressed the cultural considerations, the next critical step is to equip your team members with the right tools, empowering them to be accountable for each model's performance in the business function it serves.
This means implementing tools that enable timely alerts for issues (difficult, because issues typically start small and hidden), along with capabilities for root cause analysis and troubleshooting. Integrations with your existing tools, such as ticketing systems, and issue tracking and management capabilities are also essential for seamless coordination and collaboration among team members. Investing in the right tools will empower your team to take full ownership and accountability, ultimately leading to better outcomes for the business.
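As a rough illustration of that last point, a minimal alerting hook might look like the sketch below, which compares live metrics to baselines and pushes any large deviation into a ticketing system via a webhook. The endpoint, payload fields, and thresholds are all hypothetical assumptions; dedicated monitoring platforms (including Mona) ship their own integrations for this.

```python
import json
import urllib.request

# Hypothetical webhook endpoint for your ticketing system (e.g. Jira, ServiceNow).
TICKETING_WEBHOOK = "https://tickets.example.com/api/issues"

def open_ticket(title: str, details: str, severity: str = "medium") -> None:
    """Create a ticket so the issue is tracked, triaged, and assigned."""
    payload = json.dumps({
        "title": title,
        "description": details,
        "severity": severity,
        "labels": ["ml-monitoring"],
    }).encode("utf-8")
    request = urllib.request.Request(
        TICKETING_WEBHOOK,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)  # fire-and-forget, for the sake of the sketch

def run_monitoring_cycle(metrics: dict[str, float], baselines: dict[str, float]) -> None:
    """Compare current metrics to baselines and open a ticket on large deviations."""
    for name, value in metrics.items():
        baseline = baselines.get(name)
        if baseline and abs(value - baseline) / baseline > 0.2:  # 20% deviation (arbitrary)
            open_ticket(
                title=f"Metric '{name}' deviates from baseline",
                details=f"current={value:.3f}, baseline={baseline:.3f}",
            )
```

The point is not the specific code, but that an alert should land in the same tracking workflow your team already uses, with enough context for whoever picks it up.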
By following these guidelines, you can be sure that your AI team will be set up for successful production-oriented operations. Monitoring is a crucial aspect of MLOps, involving accountability, timely alerts, troubleshooting and much more. Taking the time to set up healthy monitoring practices leads to continuous improvements. At Mona, we understand the importance of empowering AI teams with advanced monitoring capabilities to take end-to-end accountability for outcomes and performance. Request a demo today to see how we can help you set up effective monitoring operations!