As data scientists find innovative ways to apply cognitive computing to asset maintenance, certain technical challenges keep recurring. This blog article is based on the collective experience of the Presenso data science team. It draws on real-world work with our customers, and the lessons apply to the wider Machine Learning community.
#1 Use a wide range of calculations for distributions
Summary metrics such as the mean, median and standard deviation are the basic distribution calculations. We recommend using a much richer representation of the distribution. Our preferred list includes histograms, CDFs and Q-Q plots. These representations reveal interesting features of the data, such as multi-modal behavior, and make a significant class of outliers visible so that you can decide how to summarize them.
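To illustrate why richer representations matter, here is a minimal sketch using only the Python standard library. The readings are made-up values for a hypothetical sensor with two operating regimes, and the helper names (`histogram`, `ecdf`, `qq_points`) are our own for this example, not part of any library.

```python
import bisect
import statistics
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; each key is the bin's left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def ecdf(values):
    """Return a function giving the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect.bisect_right(ordered, x) / n

def qq_points(values):
    """Pair each sorted value with the matching standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    normal = statistics.NormalDist()
    return [(normal.inv_cdf((i + 0.5) / n), v) for i, v in enumerate(ordered)]

# Hypothetical readings from two operating regimes (illustrative only).
readings = [12.1, 12.4, 11.8, 12.0, 12.3, 27.2, 26.9, 27.5, 27.1, 26.7]

print(statistics.mean(readings))  # about 19.6, a value the sensor never reports
print(histogram(readings, 5))     # {10.0: 5, 25.0: 5}: two separate modes
print(ecdf(readings)(19.6))       # 0.5: half of the readings lie below the mean
print(qq_points(readings)[:2])    # (normal quantile, observed value) pairs to plot
```

The mean alone suggests a typical reading near 19.6, while the histogram shows that no reading is anywhere near that value. Plotting the Q-Q pairs would show the same gap as a jump away from a straight line.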
#2 Consider the data outliers
Data outliers can be “canaries in the coal mine” that flag fundamental problems with your analysis. It is acceptable to exclude outliers from the data or to group them together into an “Unusual” category. However, it is important to understand the reason that data is in this category.
Let’s use an example from our experience with sensor data in Machine Learning for asset maintenance. Analyzing the signal/tag segments with the lowest values may reveal degradation indicators that one is failing to count. Analyzing the signal/tag segments with the highest values may reveal non-physical issues (such as communication problems) that one should not be counting.
Invariably, there are some outliers that one will never be able to explain. Our advice: be careful not to spend too much time on outliers.
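One possible way to group outliers into an “Unusual” category while keeping them available for inspection is sketched below. The 1.5 × IQR rule is an assumption chosen for illustration, not a method prescribed here, and the vibration values are invented.

```python
import statistics

def label_outliers(values, k=1.5):
    """Label each value Normal, Unusual-low or Unusual-high.

    Uses the k * IQR rule, an illustrative choice rather than the
    only reasonable definition of an outlier.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    labels = []
    for v in values:
        if v < low:
            labels.append((v, "Unusual-low"))
        elif v > high:
            labels.append((v, "Unusual-high"))
        else:
            labels.append((v, "Normal"))
    return labels

# Hypothetical vibration readings (illustrative only).
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.02, 0.49, 9.99]

for value, label in label_outliers(vibration):
    if label != "Normal":
        # Keep the flagged values so someone can ask why they are unusual.
        print(value, label)
```

Keeping the low and high groups separate matches the distinction above: the lowest values may be real degradation, while the highest may be non-physical artifacts.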
#3 Report noise/confidence
One needs to recognize that randomness exists and mistakes will happen. There is a danger of finding patterns in the noise.
Every estimate that is produced needs a notion of confidence attached to it. Sometimes this will be formal and precise (through techniques such as confidence intervals or credible intervals for estimators, and p-values or Bayes factors for conclusions). At other times a looser statement is enough.
For example, if a colleague asks how many contextual anomalous correlations you get on weekends, it is acceptable to conduct a quick analysis using data from a couple of weekends and report “typically between 10 and 12”.
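When a more formal statement is needed, a percentile bootstrap is one simple option. The sketch below is an assumption-laden illustration: the weekend counts are invented, and `bootstrap_ci` is a hypothetical helper written for this example.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    low = means[round(tail * n_resamples)]
    high = means[round((1 - tail) * n_resamples) - 1]
    return low, high

# Hypothetical anomaly counts from six weekends (illustrative only).
weekend_counts = [10, 12, 11, 10, 12, 11]

low, high = bootstrap_ci(weekend_counts)
print(f"mean {statistics.mean(weekend_counts)}, 95% interval {low:.2f} to {high:.2f}")
```

With only a handful of weekends the interval is rough, which is itself useful to report alongside the point estimate.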
#4 Look at examples
Any time you produce new analysis code, look at examples of the underlying data and at how the code interprets those examples. It is almost impossible to produce working analysis code of any complexity without this step.
Your analysis is removing lots of features from the underlying data to produce useful summaries. By looking at the full complexity of individual examples, you can gain confidence that your summarization is reasonable.
Use stratified sampling to look at a good sample across the distribution of values so you are not too focused on the most common cases.
For example, if you are calculating a prediction for machine failure, make sure you look at examples throughout your distribution, especially the extremes. If you do not have the right tools/visualization to look at your data, you need to work on those first.
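A minimal sketch of this kind of stratified sampling follows, assuming each record carries a numeric prediction. The record layout, the field name `failure_score` and the helper `stratified_examples` are hypothetical.

```python
import random

def stratified_examples(records, key, n_strata=4, per_stratum=2, seed=0):
    """Pick a few records from each equal-sized slice of the sorted values."""
    rng = random.Random(seed)
    ordered = sorted(records, key=key)
    size = len(ordered)
    picks = []
    for s in range(n_strata):
        # Stratum s covers one contiguous band of the sorted records.
        stratum = ordered[s * size // n_strata:(s + 1) * size // n_strata]
        picks.append(rng.sample(stratum, min(per_stratum, len(stratum))))
    return picks

# Hypothetical failure predictions for 20 assets (illustrative only).
records = [{"asset": f"pump-{i}", "failure_score": i / 20} for i in range(20)]

for band, examples in enumerate(stratified_examples(records, key=lambda r: r["failure_score"])):
    print(band, [r["asset"] for r in examples])
```

Because every band contributes examples, the lowest and highest predictions are always represented, not only the most common middle values.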
#5 Slice your data
Slicing means separating your data into subgroups and looking at the values of your metrics in each subgroup separately. In our analysis of turbine data, we commonly slice along dimensions such as wind versus geothermal, electric, hydro, etc.
If the underlying phenomenon is likely to work differently across subgroups, you must slice the data to verify this. Even if you do not expect a slice to be consequential, looking at a few slices for internal consistency gives you greater confidence that you are measuring the right thing. In some cases, a particular slice may have bad data, a broken experience, or in some way be fundamentally different.
Any time you slice your data to compare two groups (experiment versus control, or even time A versus time B), you need to be aware of mix shifts. A mix shift occurs when the relative amount of data in a slice differs across the groups you are comparing. Generally, if the relative amount of data in a slice is the same across your two groups, you can safely make a comparison.
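The effect of a mix shift can be shown with a small worked example. The numbers below are invented and `slice_report` is a hypothetical helper: the alert rate inside each slice is identical in both periods, yet the overall rate changes because the share of data per slice changes.

```python
def slice_report(rows):
    """rows: list of (slice_name, n_observations, n_alerts)."""
    total = sum(n for _, n, _ in rows)
    report = {
        name: {"share": n / total, "rate": alerts / n}
        for name, n, alerts in rows
    }
    overall_rate = sum(alerts for _, _, alerts in rows) / total
    return report, overall_rate

# Hypothetical alert counts per turbine type in two periods (illustrative only).
period_a = [("wind", 800, 80), ("hydro", 200, 60)]
period_b = [("wind", 500, 50), ("hydro", 500, 150)]

report_a, overall_a = slice_report(period_a)
report_b, overall_b = slice_report(period_b)

print(overall_a, overall_b)  # 0.14 versus 0.2
for name in report_a:
    # Per-slice rates are unchanged; only the shares moved.
    print(name, report_a[name], report_b[name])
```

Comparing only the overall rates would suggest that something got worse, while the per-slice view shows that nothing changed except the mix.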
#6 Consider practical significance
With a large volume of data, it is often tempting to focus solely on statistical significance or to home in on the details of every piece of data. Ask yourself: “even if it is true that value X is 0.1% more than value Y, does it matter?”
This can be especially important if you are unable to understand or categorize part of your data. If you are unable to make sense of some user agent strings in your logs, whether they make up 0.1% or 10% of the data makes a big difference in how much you should investigate those cases.
Alternatively, you sometimes have a small volume of data. Many changes will not look statistically significant, but that is different from claiming that the change is “neutral”. You must ask yourself: “How likely is it that there is still a practically significant change?”
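One way to keep both questions in view is to compare a confidence interval for the difference against the smallest effect you would care about. The sketch below is only an illustration: `interpret`, its labels and the threshold `min_effect` are hypothetical, and the right threshold depends entirely on the application.

```python
def interpret(ci_low, ci_high, min_effect):
    """Classify a confidence interval for a difference (value X minus value Y).

    min_effect is the smallest absolute difference that would matter
    in practice; it is a judgment call, not a statistical output.
    """
    statistically_significant = ci_low > 0 or ci_high < 0
    could_matter = max(abs(ci_low), abs(ci_high)) >= min_effect
    surely_matters = (
        statistically_significant
        and min(abs(ci_low), abs(ci_high)) >= min_effect
    )
    if surely_matters:
        return "practically significant"
    if statistically_significant and not could_matter:
        return "statistically significant but too small to matter"
    if not could_matter:
        return "neutral for practical purposes"
    return "inconclusive: a meaningful effect cannot be ruled out"

# Large data volume: a tight interval around a tiny difference.
print(interpret(0.0005, 0.0015, min_effect=0.01))
# Small data volume: a wide interval that spans zero.
print(interpret(-0.05, 0.08, min_effect=0.01))
```

The second case is the one that is easy to misreport: the interval includes zero, but it also includes differences well above the threshold, so calling the change neutral would not be justified.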
#7 Check for consistency over time
The one slicing you should almost always employ is slicing by units of time. We often use days, but other units may also be useful. This is because many disturbances to the underlying data happen as our systems evolve over time. Typically, the initial version of a feature or the initial data collection will be checked carefully, but it is not uncommon for something to break along the way.
Just because a particular day or set of days is an outlier does not mean you should discard it. Use the data as a hook to find a causal reason for that day being different before you discard it.
The other benefit of looking at day-over-day data is that it gives you a sense of the variation in the data that would eventually lead to confidence intervals or claims of statistical significance. This should not generally replace rigorous confidence interval calculation, but with large changes you can often see that they will be statistically significant just from the day-over-day graphs.
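A day-over-day check can be sketched as follows. The samples are invented, the helper names are hypothetical, and flagging days by their distance from the median in units of the median absolute deviation (MAD) is our illustrative choice of rule, with an arbitrary multiplier `k`.

```python
import statistics
from collections import defaultdict
from datetime import date

def daily_means(samples):
    """samples: iterable of (day, value). Returns {day: mean value}, sorted by day."""
    by_day = defaultdict(list)
    for day, value in samples:
        by_day[day].append(value)
    return {day: statistics.mean(vals) for day, vals in sorted(by_day.items())}

def days_to_investigate(means, k=5.0):
    """Flag days whose mean is more than k MADs away from the median day."""
    center = statistics.median(means.values())
    mad = statistics.median(abs(m - center) for m in means.values())
    # If mad is 0, any day that differs from the median at all is flagged.
    return [day for day, m in means.items() if abs(m - center) > k * mad]

# Hypothetical sensor samples, two per day (illustrative only).
samples = [
    (date(2024, 3, 1), 50.0), (date(2024, 3, 1), 51.0),
    (date(2024, 3, 2), 49.0), (date(2024, 3, 2), 50.0),
    (date(2024, 3, 3), 50.0), (date(2024, 3, 3), 52.0),
    (date(2024, 3, 4), 70.0), (date(2024, 3, 4), 72.0),
    (date(2024, 3, 5), 50.0), (date(2024, 3, 5), 50.0),
]

means = daily_means(samples)
print(days_to_investigate(means))  # [datetime.date(2024, 3, 4)]
```

The function only flags the day; in line with the advice above, the next step is to look for the cause, not to drop the data.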
If you have any questions about the 7 best practices we outlined in this article, please leave a comment below and we will respond.