To gain this visibility, we create dashboards with key metrics. But let’s be honest, dashboards that no one actively monitors offer little value. We either need people constantly watching dozens or even hundreds of metrics, or we need an automated alerting and monitoring system. And I strongly prefer the latter. So, in this article, I’ll walk you through a practical approach to building an effective monitoring system for your KPIs. You’ll learn about different monitoring approaches, how to build your first statistical monitoring system, and what challenges you’ll likely encounter when deploying it in production.
Setting up monitoring
Let’s start with the big picture of how to architect your monitoring system, then we’ll dive into the technical details. There are a few key decisions you need to make when setting up monitoring:
- Sensitivity. You need to find the right balance between missing important anomalies (false negatives) and getting bombarded with false alerts 100 times a day (false positives). We’ll talk about what levers you have to adjust this later on.
- Dimensions. The segments you choose to monitor also affect your sensitivity. If there’s a problem in a small segment (like a specific browser or country), your system is much more likely to catch it if you’re monitoring that segment’s metrics directly. But here’s the catch: the more segments you monitor, the more false positives you’ll deal with, so you need to find the sweet spot.
- Time granularity. If you have plenty of data and can’t afford delays, it might be worth looking at minute-by-minute data. If you don’t have enough data, you can aggregate it into 5–15 minute buckets and monitor those instead. Either way, it’s always a good idea to have higher-level daily, weekly, or monthly monitoring alongside your real-time monitoring to keep an eye on longer-term trends.
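As a quick illustration of the aggregation option, here is a minimal sketch of bucketing minute-level data into 15-minute buckets with pandas (the column names and data are made up for the example):

```python
import pandas as pd

# Hypothetical minute-level event counts; 'ts' and 'rides' are illustrative names
events = pd.DataFrame({
    'ts': pd.date_range('2024-01-01', periods=60, freq='min'),
    'rides': range(60),
})

# Aggregate minute-level data into 15-minute buckets for monitoring
buckets = (
    events.set_index('ts')['rides']
    .resample('15min').sum()
)
print(buckets)
```

The same resampling works for daily or weekly views, which makes it easy to keep the coarser-grained monitoring alongside the real-time one.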
However, monitoring isn’t just about the technical solution. It’s also about the processes you have in place:
- You need someone who’s responsible for monitoring and responding to alerts. We used to handle this with an on-call rotation in my team, where each week, one person would be in charge of reviewing all the alerts.
- Beyond automated monitoring, it’s worth doing some manual checks too. You can set up TV displays in the office, or at the very least, have a process where someone (like an on-call person) reviews the metrics once a day or week.
- You need to establish feedback loops. When you’re reviewing alerts and looking back at incidents you might have missed, take the time to fine-tune your monitoring system’s settings.
- The value of a change log (a record of all changes affecting your KPIs) can’t be overstated. It helps you and your team always have context about what happened to your KPIs and when. Plus, it gives you a valuable dataset for evaluating the real impact on your monitoring system when you make changes (like figuring out what percentage of past anomalies your new setup would actually catch).
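A change log doesn't need to be anything fancy. Here is a minimal sketch of one kept as a table, with a helper to check whether a timestamp falls into a logged anomalous period (the dates, columns, and function name are illustrative, not a prescribed schema):

```python
import pandas as pd

# Illustrative change log: one row per known event affecting the KPI
change_log = pd.DataFrame([
    {'start': '2024-07-14 18:00', 'end': '2024-07-14 23:59',
     'kind': 'anomaly', 'note': 'unusually high demand'},
    {'start': '2024-08-01 00:00', 'end': '2024-08-01 02:00',
     'kind': 'release', 'note': 'pricing service deploy'},
])
change_log['start'] = pd.to_datetime(change_log['start'])
change_log['end'] = pd.to_datetime(change_log['end'])

def is_in_known_anomaly(ts, log=change_log):
    """Check whether a timestamp falls inside a logged anomalous period."""
    anomalies = log[log.kind == 'anomaly']
    return bool(((anomalies.start <= ts) & (ts <= anomalies.end)).any())

print(is_in_known_anomaly(pd.Timestamp('2024-07-14 19:30')))  # True
```

With a table like this, you can both exclude anomalous periods when building confidence intervals and replay past incidents to evaluate new monitoring configurations.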
Now that we’ve covered the high-level picture, let’s move on and dig into the technical details of how to actually detect anomalies in time series data.
Frameworks for monitoring
There are many out-of-the-box frameworks you can use for monitoring. I’d break them down into two main groups.
The first group involves creating a forecast with confidence intervals. Here are some options:
- You can use statsmodels and the classical implementation of ARIMA-like models for time series forecasting.
- Another option that typically works pretty well out of the box is Prophet by Meta. It’s a simple additive model that returns uncertainty intervals.
- There’s also GluonTS, a deep learning-based forecasting framework from AWS.
The second group focuses on anomaly detection, and here are some popular libraries:
- PyOD: The most popular Python outlier/anomaly detection toolbox, with 50+ algorithms (including time series and deep learning methods).
- ADTK (Anomaly Detection Toolkit): Built for unsupervised/rule-based time series anomaly detection with easy integration into pandas dataframes.
- Merlion: Combines forecasting and anomaly detection for time series using both classical and ML approaches.
I’ve only mentioned a few examples here; there are way more libraries out there. You can absolutely try them out with your data and see how they perform. However, I want to share a much simpler approach to monitoring that I usually start with. Even though it’s so simple that you can implement it with a single SQL query, it works surprisingly well in many cases. Another significant advantage of this simplicity is that you can implement it in pretty much any tool, whereas deploying more complex ML approaches can be tricky in some systems.
Statistical approach to monitoring
The core idea behind monitoring is straightforward: use historical data to build a confidence interval (CI) and detect when current metrics fall outside of expected behaviour. We estimate this confidence interval using the mean and standard deviation of past data. It’s just basic statistics.
\[
\text{Confidence Interval} = (\text{mean} - \text{coef}_1 \times \text{std},\; \text{mean} + \text{coef}_2 \times \text{std})
\]

However, the effectiveness of this approach depends on several key parameters, and the choices you make here will significantly impact the accuracy of your alerts.
The first decision is how to define the data sample used to calculate your statistics. Typically, we compare the current metric to the same time period on previous days. This involves two main components:
- Time window: I usually take a window of ±10–30 minutes around the current timestamp to account for short-term fluctuations.
- Historical days: I prefer using the same weekday over the past 3–5 weeks. This method accounts for weekly seasonality, which is usually present in business data. However, depending on your seasonality patterns, you might choose different approaches (for example, splitting days into two groups: weekdays and weekends).
Another important parameter is the choice of coefficient used to set the width of the confidence interval. I usually use three standard deviations since it covers 99.7% of observations for distributions close to normal.
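The interval construction above is a couple of lines of code. Here is a minimal sketch with a synthetic sample of historical values (the function name and data are assumptions for illustration; note that it supports asymmetric coefficients, which we'll use later):

```python
import numpy as np

def confidence_interval(values, coef_lower=3, coef_upper=3):
    """Build a (possibly asymmetric) interval from historical observations."""
    mean, std = np.mean(values), np.std(values)
    return mean - coef_lower * std, mean + coef_upper * std

# synthetic historical observations of a metric
history = [100, 104, 98, 101, 97, 103, 99, 102]
lower, upper = confidence_interval(history)
print(lower, upper)
```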
As you can see, there are several decisions to make, and there’s no one-size-fits-all answer. The most reliable way to determine optimal settings is to experiment with different configurations using your own data and choose the one that delivers the best performance for your use case. So this is an ideal moment to put the approach into action and see how it performs on real data.
Example: monitoring the number of taxi rides
To test this out, we’ll use the popular NYC Taxi Data dataset.

Building the first version
So, let’s try our approach and build confidence intervals based on real data. I started with a default set of key parameters:
- A time window of ±15 minutes around the current timestamp,
- Data from the current day plus the same weekday from the previous three weeks,
- A confidence band defined as ±3 standard deviations.
Now, let’s create a couple of functions with the business logic to calculate the confidence interval and check whether our value falls outside of it.
import pandas as pd

# returns the dataset of historic data used to estimate the CI
# (reads the global df with 'pickup_datetime' and the metric column)
def get_distribution_for_ci(param, ts, n_weeks=3, n_mins=15):
    tmp_df = df[['pickup_datetime', param]].rename(
        columns={param: 'value', 'pickup_datetime': 'dt'})
    tmp = []
    # current day (n = 0) plus the same weekday from the previous n_weeks
    for n in range(n_weeks + 1):
        lower_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        upper_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=-n_mins)).strftime('%Y-%m-%d %H:%M:%S')
        tmp.append(tmp_df[(tmp_df.dt >= lower_bound) & (tmp_df.dt <= upper_bound)])
    base_df = pd.concat(tmp)
    # keep only points strictly before the current timestamp
    base_df = base_df[base_df.dt < ts]
    return base_df
# calculates the mean and std needed to build confidence intervals
def get_ci_statistics(param, ts, n_weeks=3, n_mins=15):
    base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
    std = base_df.value.std()
    mean = base_df.value.mean()
    return mean, std
import tqdm

# iterating through all the timestamps in the historic data
ci_tmp = []
for ts in tqdm.tqdm(df.pickup_datetime):
    mean, std = get_ci_statistics('values', ts, n_weeks=3, n_mins=15)
    ci_tmp.append(
        {
            'pickup_datetime': ts,
            'mean': mean,
            'std': std,
        }
    )
ci_df = df[['pickup_datetime', 'values']].copy()
ci_df = ci_df.merge(pd.DataFrame(ci_tmp), how='left', on='pickup_datetime')

# defining the CI
ci_df['ci_lower'] = ci_df['mean'] - 3 * ci_df['std']
ci_df['ci_upper'] = ci_df['mean'] + 3 * ci_df['std']

# flagging whether the value is outside of the CI
ci_df['outside_of_ci'] = (ci_df['values'] < ci_df['ci_lower']) | (ci_df['values'] > ci_df['ci_upper'])
Analysing results
Let’s look at the results. First, we’re seeing quite a few false positive triggers (one-off points outside the CI that seem to be due to normal variability).

There are two ways we can adjust our algorithm to account for this:
- The CI doesn’t need to be symmetric. We might be less concerned about increases in the number of trips, so we could use a higher coefficient for the upper bound (for example, use 5 instead of 3).
- The data is quite volatile, so there will be occasional anomalies where a single point falls outside the confidence interval. To reduce such false positive alerts, we can use more robust logic and only trigger an alert when multiple points are outside the CI (for example, at least 4 out of the last 5 points, or 8 out of 10).
However, there’s another potential problem with our current CIs. As you can see, there are quite a few cases where the CI is excessively wide. This looks off and could reduce the sensitivity of our monitoring.
Let’s look at one example to understand why this happens. The distribution we’re using to estimate the CI at this point is bimodal, which leads to a higher standard deviation and a wider CI. That’s because the number of trips on the evening of July 14th is significantly higher than in other weeks.


So we’ve encountered an anomaly in the past that’s affecting our confidence intervals. There are two ways to address this issue:
- If we’re doing constant monitoring, we know there was anomalously high demand on July 14th, and we can exclude these periods when constructing our CIs. This approach requires some discipline to track these anomalies, but it pays off with more accurate results.
- However, there’s always a quick-and-dirty approach too: we can simply drop or cap outliers when constructing the CI.
Improving the accuracy
So after the first iteration, we identified several potential improvements for our monitoring approach:
- Use a higher coefficient for the upper bound since we care less about increases. I used 6 standard deviations instead of 3.
- Deal with outliers to filter out past anomalies. I experimented with removing or capping the top 10–20% of outliers and found that capping at 20% alongside increasing the period to 5 weeks worked best in practice.
- Raise an alert only when 4 out of the last 5 points are outside the CI to reduce the number of false positive alerts caused by normal volatility.
Let’s see how this looks in code. We’ve updated the logic in get_ci_statistics to account for different strategies for handling outliers.
def get_ci_statistics(param, ts, n_weeks=3, n_mins=15,
                      filter_outliers_strategy='none', filter_outliers_perc=None):
    assert filter_outliers_strategy in ['none', 'clip', 'remove'], \
        "filter_outliers_strategy must be one of 'none', 'clip', 'remove'"
    base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
    if filter_outliers_strategy != 'none':
        p_upper = base_df.value.quantile(1 - filter_outliers_perc)
        p_lower = base_df.value.quantile(filter_outliers_perc)
        if filter_outliers_strategy == 'clip':
            # cap outliers at the chosen percentiles
            base_df['value'] = base_df['value'].clip(lower=p_lower, upper=p_upper)
        if filter_outliers_strategy == 'remove':
            # drop outliers entirely
            base_df = base_df[(base_df.value >= p_lower) & (base_df.value <= p_upper)]
    std = base_df.value.std()
    mean = base_df.value.mean()
    return mean, std
We also need to update the way we define the outside_of_ci parameter.
# raise an alert only when at least 4 of the last 5 points are outside the CI
anomalies = []
for ts in tqdm.tqdm(ci_df.pickup_datetime):
    tmp_df = ci_df[(ci_df.pickup_datetime <= ts)].tail(5).copy()
    tmp_df = tmp_df[~tmp_df.ci_lower.isna() & ~tmp_df.ci_upper.isna()]
    if tmp_df.shape[0] < 5:
        continue
    tmp_df['outside_of_ci'] = (tmp_df['values'] < tmp_df['ci_lower']) | (tmp_df['values'] > tmp_df['ci_upper'])
    if tmp_df.outside_of_ci.map(int).sum() >= 4:
        anomalies.append(ts)
ci_df['outside_of_ci'] = ci_df.pickup_datetime.isin(anomalies)
We can see that the CI is now significantly narrower (no more anomalously wide CIs), and we’re also getting far fewer alerts since we increased the upper bound coefficient.

Let’s investigate the two alerts we found. These two alerts from the last 2 weeks look plausible when we compare the traffic to previous weeks.

Practical tip: This chart also reminds us that ideally we should account for public holidays and either exclude them or treat them as weekends when calculating the CI.

So our new monitoring approach makes total sense. However, there’s a drawback: by only alerting when 4 out of the last 5 points fall outside the CI, we’re delaying alerts in situations where everything is completely broken. To address this problem, you can actually use two CIs:
- Doomsday CI: A broad confidence interval where even a single point falling outside means it’s time to panic.
- Incident CI: The one we built earlier, where we might wait 5–10 minutes before triggering an alert, since the drop in the metric isn’t as critical.
Let’s define 2 CIs for our case.
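Here is a minimal sketch of the two-band logic. The frame below is a toy stand-in for the ci_df built earlier, and the coefficients (10 for the doomsday band, 3 for the incident band) are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for ci_df from the earlier snippets
ci_df = pd.DataFrame({
    'values': [100, 80, 40],
    'mean':   [100, 100, 100],
    'std':    [5, 5, 5],
})

# Doomsday CI: very wide, a single breach triggers an immediate alert
ci_df['doomsday_alert'] = ci_df['values'] < ci_df['mean'] - 10 * ci_df['std']
# Incident CI: narrower; combine with the m-out-of-n rule before alerting
ci_df['incident_flag'] = ci_df['values'] < ci_df['mean'] - 3 * ci_df['std']
print(ci_df[['doomsday_alert', 'incident_flag']])
```

In this toy example, the moderate dip (80) only raises the incident flag, while the severe drop (40) breaches the doomsday band and would page someone immediately.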

It’s a balanced approach that gives us the best of both worlds: we can react quickly when something is completely broken while still keeping false positives under control. With that, we’ve achieved a good result and we’re ready to move on.
Testing our monitoring on anomalies
We’ve confirmed that our approach works well for business-as-usual cases. However, it’s also worth doing some stress testing by simulating anomalies we want to catch and checking how the monitoring performs. In practice, it’s worth testing against previously known anomalies to see how it would handle real-world examples.
In our case, we don’t have a change log of previous anomalies, so I simulated a 20% drop in the number of trips, and our approach caught it immediately.
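A stress test like this is easy to script. Here is a minimal sketch that injects a 20% drop into a toy series with precomputed CI bounds (the frame is a stand-in for ci_df; the numbers are made up):

```python
import pandas as pd

# Toy stand-in for ci_df with precomputed CI bounds
ci_df = pd.DataFrame({
    'values':   [1000.0] * 10,
    'ci_lower': [900.0] * 10,
    'ci_upper': [1100.0] * 10,
})

# Simulate a 20% drop over the last 5 points
ci_df.loc[ci_df.index[-5:], 'values'] *= 0.8

ci_df['outside_of_ci'] = (ci_df['values'] < ci_df['ci_lower']) | (ci_df['values'] > ci_df['ci_upper'])
print(ci_df['outside_of_ci'].sum())  # 5
```

Running the same injection against real historical data (and against logged incidents, if you have them) gives you a cheap regression test for any change to the monitoring configuration.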

These kinds of step changes can be tricky in real life. Imagine we lost one of our partners, and that lower level becomes the new normal for the metric. In that case, it’s worth adjusting our monitoring as well. If it’s possible to recalculate the historical metric based on the current state (for example, by filtering out the lost partner), that would be ideal since it would bring the monitoring back to normal. If that’s not feasible, we can either adjust the historical data (say, subtract 20% of traffic as our estimate of the change) or drop all data from before the change and use only the new data to construct the CI.
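As a small sketch of the "adjust the historical data" option: scale everything before the known change point by the estimated impact. The cut-off date and the 20% factor below are assumptions standing in for a real estimate of the lost partner's share:

```python
import pandas as pd

# Assumed date of the step change and estimated size of the lost traffic
CHANGE_TS = pd.Timestamp('2024-06-01')
history = pd.Series(
    [1000.0] * 10,
    index=pd.date_range('2024-05-25', periods=10, freq='D'),
)

adjusted = history.copy()
# scale pre-change data down by the estimated 20% impact
adjusted[adjusted.index < CHANGE_TS] *= 0.8
print(adjusted.iloc[0], adjusted.iloc[-1])  # 800.0 1000.0
```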

Let’s look at another tricky real-world example: gradual decay. If your metric is slowly dropping day after day, it likely won’t be caught by our real-time monitoring since the CI will be shifting along with it. To catch situations like this, it’s worth having less granular monitoring (like daily, weekly, or even monthly).
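To make the decay example concrete, here is a sketch of daily-level monitoring that compares each day against the same weekday over the previous four weeks. The data is synthetic (a metric losing roughly 2% per day), and the baseline window is an illustrative choice:

```python
import pandas as pd

# Synthetic daily metric decaying ~2% per day
daily = pd.Series(
    [1000 * (0.98 ** i) for i in range(35)],
    index=pd.date_range('2024-01-01', periods=35, freq='D'),
)

# baseline: average of the same weekday over the previous 4 weeks
baseline = (daily.shift(7) + daily.shift(14) + daily.shift(21) + daily.shift(28)) / 4
decay_pct = (daily / baseline - 1) * 100
print(decay_pct.dropna().round(1).iloc[-1])  # a large negative day-over-baseline change
```

The real-time CI would drift downwards along with the metric and stay quiet, while this day-over-baseline view makes the cumulative decline obvious.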

You can find the full code on GitHub.
Operational challenges
We’ve discussed the math behind alerting and monitoring systems. However, there are several other nuances you’ll likely encounter once you start deploying your system in production. So I’d like to cover these before wrapping up.
Lagging data. We don’t face this problem in our example since we’re working with historical data, but in real life, you need to deal with data lags. It usually takes some time for data to reach your data warehouse. So you need to learn how to distinguish between cases where data hasn’t arrived yet versus actual incidents affecting the customer experience. The most straightforward approach is to look at historical data, identify the typical lag, and filter out the last 5–10 data points.
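The "filter out the freshest points" idea is a one-liner. In this sketch, LAG_POINTS is an assumption; in practice you would derive it from your own historical lag distribution:

```python
import pandas as pd

# Number of most recent points that may still be incomplete due to ingestion lag
LAG_POINTS = 10

ts_index = pd.date_range('2024-01-01 00:00', periods=60, freq='min')
metric = pd.Series(range(60), index=ts_index)

# monitor only the points old enough to be considered complete
stable = metric.iloc[:-LAG_POINTS]
print(len(stable))  # 50
```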
Different sensitivity for different segments. You’ll likely want to monitor not just the main KPI (the number of trips), but also break it down by multiple segments (like partners, areas, etc.). Monitoring more segments helps you spot smaller changes in specific segments (for instance, that there’s a problem in Manhattan). However, as I mentioned above, there’s a downside: more segments mean more false positive alerts that you need to deal with. To keep this under control, you can use different sensitivity levels for different segments (say, 3 standard deviations for the main KPI and 5 for segments).
Smarter alerting system. Also, when you’re monitoring many segments, it’s worth making your alerting a bit smarter. Say you have monitoring for the main KPI and 99 segments. Now, imagine we have a global outage and the number of trips drops everywhere. Within the next 5 minutes, you’ll (hopefully) get 100 notifications that something is broken. That’s not an ideal experience. To avoid this situation, I’d build logic to filter out redundant notifications. For example:
- If we received the same notification within the last 3 hours, don’t fire another alert.
- If there’s a notification about a drop in the main KPI plus more than 3 segments, only alert about the main KPI change.
Overall, alert fatigue is real, so it’s worth minimising the noise.
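The two suppression rules above can be sketched in a few lines. The data structures, function names, and the 'main_kpi' key are all illustrative assumptions:

```python
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(hours=3)
last_fired = {}  # alert key -> last time it was sent

def should_fire(key, now):
    """Rule 1: skip an alert if the same one fired within the last 3 hours."""
    prev = last_fired.get(key)
    if prev is not None and now - prev < SUPPRESSION_WINDOW:
        return False
    last_fired[key] = now
    return True

def collapse_alerts(alerts):
    """Rule 2: if the main KPI plus more than 3 segments fired, keep only the main KPI."""
    segments = [a for a in alerts if a != 'main_kpi']
    if 'main_kpi' in alerts and len(segments) > 3:
        return ['main_kpi']
    return alerts

now = datetime(2024, 1, 1, 12, 0)
print(should_fire('main_kpi', now))                           # True
print(should_fire('main_kpi', now + timedelta(hours=1)))      # False: within 3h
print(collapse_alerts(['main_kpi', 's1', 's2', 's3', 's4']))  # ['main_kpi']
```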
And that’s it! We’ve covered the entire alerting and monitoring topic, and hopefully, you’re now fully equipped to set up your own system.
Summary
We’ve covered a lot of ground on alerting and monitoring. Let me wrap it up with a step-by-step guide on how to start monitoring your KPIs.
- The first step is to gather a change log of past anomalies. You can use this both as a set of test cases for your system and to filter out anomalous periods when calculating CIs.
- Next, build a prototype and run it on historical data. I’d start with the highest-level KPI, try out several possible configurations, and see how well it catches previous anomalies and whether it generates a lot of false alerts. At this point, you should have a viable solution.
- Then try it out in production, since this is where you’ll have to deal with data lags and see how the monitoring actually performs in practice. Run it for 2–4 weeks and tweak the parameters to make sure it’s working as expected.
- After that, share the monitoring with your colleagues and start expanding the scope to include other segments. Don’t forget to keep adding all anomalies to the change log and establish feedback loops to improve your system continuously.
And that’s it! Now you can rest easy knowing that automation is keeping an eye on your KPIs (but still check in on them from time to time, just in case).
Thank you for reading. I hope this article was insightful. Remember Einstein’s advice: “The important thing is not to stop questioning. Curiosity has its own reason for existing.” May your curiosity lead you to your next great insight.