Lookout Guide

Learning to Communicate with Lookout for Metrics

Intro

Amazon Lookout for Metrics uses machine learning (ML) to automatically detect and diagnose anomalies (i.e. outliers from the norm) in business and operational data, such as a sudden dip in sales revenue or customer acquisition rates. In a couple of clicks, you can connect Amazon Lookout for Metrics to popular data stores like Amazon S3, Amazon Redshift, and Amazon Relational Database Service (RDS), as well as third-party SaaS applications, such as Salesforce, Servicenow, Zendesk, and Marketo, and start monitoring metrics that are important to your business.

That said, it is still a computer system, and at the end of the day it responds according to the specific inputs it is given. This short guide is designed to help you think through the data you provide to Lookout for Metrics and to better understand what the results from Lookout for Metrics mean. It does so using a few real-world-esque scenarios.

What Exactly Is A Metric?

If you are going to be looking out for them, it helps to know what they are first. Within Lookout for Metrics there are 3 components to your datasets; together they shape your metrics.

They are:

  1. Timestamp - This is required. Every entry in this service must start with a timestamp indicating when the remaining columns occurred or were relevant.
  2. Dimensions - These are categorical columns; you can have up to 5 of them. Keep in mind they are combined to refer to a specific entity. For example, if your dimensions are location and repair_type, your data could look like this:
| timestamp           | location            | repair_type   |
| ------------------- | ------------------- | ------------- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation |

From this dataset we have 2 dimensions (location and repair_type), and when we start to think about the total number of possible metrics (full calculation to come) we can see there are 2 distinct locations and 2 distinct repair types.

  3. Measures - These are the numerical columns where real observable numbers are placed. These numbers are bound to a specific unique set of dimensions. You can also have up to 5 of these columns. Now let's expand the earlier dataset with 2 additional numerical columns, total and fixed:
| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

With this dataset, Lookout for Metrics has 8 metrics. How did we get this number?

A Metric is a unique combination of categorical entries and 1 numerical value.

How to Calculate the Total Number of Metrics

The formula to calculate the total number of metrics is: Unique(dimension1) * Unique(dimension2) * Number of measures. So in this case that would be:

2 * 2 * 2 or 8.

At present Lookout for Metrics can support a maximum of 50,000 metrics per Detector, which is the trained model assigned to a particular set of data. So if you wanted to track more than 50,000 metrics, you would simply segment your data into multiple Detectors.
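
If you want to sanity check that arithmetic against a real file before creating a Detector, the count is easy to reproduce. Here is a minimal sketch with pandas; the file name and column names are just the ones from the example above, not anything required by Lookout for Metrics:

```python
import pandas as pd

# Hypothetical input file shaped like the example dataset:
# timestamp, location, repair_type, total, fixed
df = pd.read_csv("repairs.csv")

dimensions = ["location", "repair_type"]  # up to 5 dimensions
measures = ["total", "fixed"]             # up to 5 measures

# Unique(dimension1) * Unique(dimension2) * ... * Number of measures
metric_count = len(measures)
for column in dimensions:
    metric_count *= df[column].nunique()

print(f"Estimated metrics: {metric_count}")  # 2 * 2 * 2 = 8 for the example data

# A single Detector currently supports up to 50,000 metrics,
# so beyond that the data would need to be split across Detectors.
if metric_count > 50_000:
    print("Consider segmenting this data into multiple Detectors.")
```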

What Can Lookout for Metrics Tell Me From My Data?

Like all great things powered by Machine Learning: It depends!

How Does Structure Impact Anomalies?

Specifically, it depends on how your data is structured and how your data is aggregated. Structure refers to the way we shape a dataset when determining the number of dimensions and the number of measures provided. For example, take the same dataset as earlier:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

This would identify the following types of issues:

  1. If the total number of oil_change type events at 123 Interesting Ave is anomalous
  2. If the total number of tire_rotation type events at 123 Interesting Ave is anomalous
  3. If the fixed number of oil_change type events at 123 Interesting Ave is anomalous
  4. If the fixed number of tire_rotation type events at 123 Interesting Ave is anomalous
  5. If the total number of oil_change type events at 745 Interesting Ave is anomalous
  6. If the total number of tire_rotation type events at 745 Interesting Ave is anomalous
  7. If the fixed number of oil_change type events at 745 Interesting Ave is anomalous
  8. If the fixed number of tire_rotation type events at 745 Interesting Ave is anomalous

That's 8 different things, mapping to the 8 metrics identified earlier. Additionally, with the causation features, if there is a relationship defined by reliable patterns in the data between the values of the total and fixed columns for a particular location and repair_type, then Lookout for Metrics could report how one anomaly may impact the other. Also, if there were a reliable pattern between locations, those anomalies could be linked in a cause-and-effect relationship as well. The last thing that could happen here is that anomalies that look similar in the same time period could be grouped together and shown on the same page at the same time.
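
If it helps to see those 8 things spelled out programmatically, the short sketch below enumerates every combination of dimension values plus a measure from the example dataset (again, the file and column names are just the ones used in the examples here):

```python
from itertools import product

import pandas as pd

df = pd.read_csv("repairs.csv")  # the example dataset shown above
measures = ["total", "fixed"]

locations = df["location"].unique()
repair_types = df["repair_type"].unique()

# Every unique combination of dimension values plus one measure is a metric.
for location, repair_type, measure in product(locations, repair_types, measures):
    print(f"{measure} of {repair_type} events at {location}")
# 2 locations * 2 repair types * 2 measures = 8 lines, matching the list above.
```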

Can Structure Go Bad?

YES!

With the simplified dataset earlier:

| timestamp           | location            | repair_type   |
| ------------------- | ------------------- | ------------- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation |

What if we changed that to:

| timestamp           | location            | repair_type    |
| ------------------- | ------------------- | -------------- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | excess_mileage |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change     |
| 01/10/2022 10:00:00 | 745 Interesting Ave | excess_mileage |

So far that seems fine... but what happens when we fill in data:

| timestamp           | location            | repair_type    | total | fixed |
| ------------------- | ------------------- | -------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change     | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | excess_mileage | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change     | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | excess_mileage | 10    | 7     |

Now we see the problem that arises when we add in the measures. Specifically, what exactly would we mean by the total of excess_mileage? It could be the number of vehicles seen with more miles on them than at their last service. That could be OK if we just took it as a regular reading, but in the context of the measure fixed, what could go there? In this case it does not make sense. Are we stating that we performed some service to alleviate it? That would just show up under the other relevant service types. Lookout for Metrics requires all columns to be filled, or the datapoint will be ignored, so here we can see that the structure we define may restrict what kind of data we can insert. As an exercise in cleaning it up, this might work:

| timestamp           | location            | repair_type   | total | fixed | out_of_interval |
| ------------------- | ------------------- | ------------- | ----- | ----- | --------------- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     | 6               |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    | 5               |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    | 2               |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     | 7               |

Here we now have an out_of_interval count for the total number of vehicles that are out of their service interval. We might expect to see patterns where a higher value for it indicates a lower volume of fixed items, due to the complexity of any additional maintenance. It could also have downstream impacts if it creates congestion inside the garage.
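
Since Lookout for Metrics ignores any datapoint with a missing column, it is also worth checking a candidate structure for holes before committing to it. A quick sketch with pandas, assuming the column names from the table above:

```python
import pandas as pd

df = pd.read_csv("repairs.csv")
required = ["timestamp", "location", "repair_type", "total", "fixed", "out_of_interval"]

# Rows missing any required value would be ignored by the service.
incomplete = df[df[required].isna().any(axis=1)]

if not incomplete.empty:
    print(f"{len(incomplete)} rows would be dropped:")
    print(incomplete)
```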

How Does Aggregation Impact Anomalies?

Lookout for Metrics detects anomalies only on structured, time series data. Anomalies are also detected against a regular interval (5 min, 10 min, 1 hr, or 1 day); if your data contains only a single entry for a given metric within your interval, then aggregation has no impact whatsoever.

To use the same dataset again:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

Here we have exactly 1 entry for each metric for a given hour; if the detector is hourly, then no aggregation is needed.
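
A quick way to tell whether aggregation will matter for your detector is to count how many rows each metric has per interval; anything above 1 will get aggregated. A rough sketch, assuming an hourly interval and the example columns:

```python
import pandas as pd

df = pd.read_csv("repairs.csv", parse_dates=["timestamp"])

# Floor each timestamp to the start of its (hourly) interval.
df["interval"] = df["timestamp"].dt.floor("1h")

rows_per_interval = df.groupby(["interval", "location", "repair_type"]).size()

# If every group has exactly one row, aggregation has no effect.
print(rows_per_interval.max() <= 1)
```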

Lookout for Metrics supports 2 aggregation functions:

  1. Sum - Add all the values together within the interval.
  2. Average - The average of the values within the interval.
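
Both the interval and the aggregation function are choices made when you configure the detector and its metric set. Below is a rough boto3 sketch of where those choices live; the names, ARN, and S3 path are placeholders, and the exact parameters should be checked against the boto3 lookoutmetrics documentation for your SDK version:

```python
import boto3

lookout = boto3.client("lookoutmetrics")

# The detector interval is one of: PT5M, PT10M, PT1H, P1D.
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName="repair-shop-detector",  # placeholder name
    AnomalyDetectorConfig={"AnomalyDetectorFrequency": "PT1H"},
)

# Each measure gets its own aggregation function (SUM or AVG).
lookout.create_metric_set(
    AnomalyDetectorArn=detector["AnomalyDetectorArn"],
    MetricSetName="repair-metrics",  # placeholder name
    MetricSetFrequency="PT1H",
    TimestampColumn={
        "ColumnName": "timestamp",
        # Assumes the example timestamps are month-first; adjust to your data.
        "ColumnFormat": "MM/dd/yyyy HH:mm:ss",
    },
    DimensionList=["location", "repair_type"],
    MetricList=[
        {"MetricName": "total", "AggregationFunction": "SUM"},
        {"MetricName": "fixed", "AggregationFunction": "SUM"},
    ],
    MetricSource={
        "S3SourceConfig": {
            "RoleArn": "arn:aws:iam::111122223333:role/LookoutMetricsRole",  # placeholder
            "HistoricalDataPathList": ["s3://example-bucket/repairs/"],  # placeholder
            "FileFormatDescriptor": {
                "CsvFormatDescriptor": {"ContainsHeader": True, "Delimiter": ","}
            },
        }
    },
)
```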

Now, expanding on the existing dataset:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:05:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:05:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:05:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |
| 01/10/2022 10:05:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

Sum

Here we can see that the number of metrics has not changed, but the number of entries in our dataset has doubled: there is an entry at the start of the hour and another at minute 5. If we selected SUM for our dataset, that would yield this:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 20    | 16    |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 20    | 20    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 20    | 20    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 20    | 14    |

Each numerical value has doubled (because the entries were exactly the same AND there were 2 of them).

Sum is useful when you want to pay attention to the specific total value of a metric!

Average

For Average the dataset would look like this:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

This is EXACTLY the same as the first dataset, because our secondary records were EXACTLY the same as well, and there were only 2 data points per aggregation window. If, however, we had 3 data points per interval:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:05:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:30:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:05:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:30:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:05:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:30:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |
| 01/10/2022 10:05:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |
| 01/10/2022 10:30:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

Interestingly, the Average table looks like this:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 10    | 8     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 10    | 10    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |

But the SUM table would look like this:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 30    | 24    |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 30    | 30    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 30    | 30    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 30    | 21    |

What if the values were not so uniform?

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 5     | 4     |
| 01/10/2022 10:05:00 | 123 Interesting Ave | oil_change    | 17    | 10    |
| 01/10/2022 10:30:00 | 123 Interesting Ave | oil_change    | 2     | 1     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 7     | 7     |
| 01/10/2022 10:05:00 | 123 Interesting Ave | tire_rotation | 7     | 5     |
| 01/10/2022 10:30:00 | 123 Interesting Ave | tire_rotation | 10    | 8     |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 2     | 2     |
| 01/10/2022 10:05:00 | 745 Interesting Ave | oil_change    | 3     | 2     |
| 01/10/2022 10:30:00 | 745 Interesting Ave | oil_change    | 10    | 9     |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 4     | 4     |
| 01/10/2022 10:05:00 | 745 Interesting Ave | tire_rotation | 10    | 7     |
| 01/10/2022 10:30:00 | 745 Interesting Ave | tire_rotation | 8     | 5     |

Then the Average table looks like:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 8     | 5     |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 8     | 6.66  |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 5     | 4.33  |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 7.33  | 5.33  |

With the SUM table:

| timestamp           | location            | repair_type   | total | fixed |
| ------------------- | ------------------- | ------------- | ----- | ----- |
| 01/10/2022 10:00:00 | 123 Interesting Ave | oil_change    | 24    | 15    |
| 01/10/2022 10:00:00 | 123 Interesting Ave | tire_rotation | 24    | 20    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | oil_change    | 15    | 13    |
| 01/10/2022 10:00:00 | 745 Interesting Ave | tire_rotation | 22    | 16    |
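
If you want to verify those two tables yourself, the arithmetic is just a group-by over the hourly interval. A small pandas sketch using the non-uniform values from above:

```python
import pandas as pd

# The non-uniform example dataset from above.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2022-01-10 10:00", "2022-01-10 10:05", "2022-01-10 10:30"] * 4),
    "location": ["123 Interesting Ave"] * 6 + ["745 Interesting Ave"] * 6,
    "repair_type": (["oil_change"] * 3 + ["tire_rotation"] * 3) * 2,
    "total": [5, 17, 2, 7, 7, 10, 2, 3, 10, 4, 10, 8],
    "fixed": [4, 10, 1, 7, 5, 8, 2, 2, 9, 4, 7, 5],
})

# Floor each timestamp to its hourly interval, then aggregate per metric.
df["interval"] = df["timestamp"].dt.floor("1h")
keys = ["interval", "location", "repair_type"]

print(df.groupby(keys)[["total", "fixed"]].mean().round(2))  # the Average table
print(df.groupby(keys)[["total", "fixed"]].sum())            # the Sum table
```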

Depending on your data:

Average is problematic if there are spikes within the interval that you want to be aware of!

Average is great if spikes within the interval are normal, and you want to iron them out!
