
INTERPRETING DATA MINING RESULTS: THE INFLUENCE OF HEURISTICS
by Ed Colet


Data mining's promise is the ability to discover interesting patterns that are hidden in large amounts of data. Depending on the particulars of the underlying data mining algorithm, "interestingness" is formally defined in terms of probability theory: an interesting pattern is one that occurs even though its statistical probability is low. The data mining system outputs such patterns, and the next step is their interpretation by a user. A user's judgment of how interesting a pattern really is itself involves an implicit judgment of probability. Thus probability is assessed twice during the data mining process - first by the software in a formal, explicit manner, and second by the user in a more subtle and implicit way. This column addresses the latter aspect - what affects the user's judgment of probability, and therefore the interestingness of a result?

Reasoning and decision making are important topics in cognitive science. A classic paper in the field is Amos Tversky and Daniel Kahneman's "Judgment under Uncertainty: Heuristics and Biases", published in 1974 in Science, vol. 185, pages 1124-1131. They showed that people rely on certain heuristics when estimating probabilities, because in many circumstances it's simply not possible or practical to perform formal probability calculations. Heuristics are a useful strategy since they provide a "shortcut" to making a decision, but their use can result in systematic errors of judgment. Two heuristics are discussed here.

Two heuristics: Availability and Similarity

"Availability" is one of the heuristics illustrated in the referenced paper. Availability refers to the notion that what is easier to remember is thought to be more probable - even though it may not be. Tversky and Kahneman asked people to judge which are more numerous - words beginning with the letter 'k' or words in which the letter 'k' occurs in the third position? People thought that there are more words that begin with 'k' than words that have 'k' in the third position. But in fact, there are three times as many words with 'k' in the third position as words that begin with 'k'. Because words that begin with the letter 'k' are easier to recall (more available in memory), they are erroneously judged to be more probable.

In our development of Advanced Scout (data mining software for the NBA), this shows up in judgments about player shooting percentages. For example, who has a higher field goal percentage - the undisputed best player in the game, Michael Jordan, or an unheralded NBA player called Charles Outlaw? The answer is Charles Outlaw. In fact, of the top 50 players ranked by field goal percentage through the end of the 1998 season, Michael Jordan is a "lowly ranked" 49th (shooting 46.5%), while Charles Outlaw is a surprisingly highly ranked number 2 (shooting 55.4%). (Shaquille O'Neal leads all players at 58.4%.) Because Michael Jordan's made shots are far easier to recall than Charles Outlaw's, the availability heuristic leads people to judge Jordan the more accurate shooter.

"Similarity" is the other heuristic that people rely upon in their judgments (Tversky and Kahneman call it "representativeness"). The notion is that given a set of equally likely patterns, the pattern that is more similar to an expectation is thought to be more probable - even though it may not be. This is illustrated by the perception that a sequence of coin flips that comes up all heads is less probable than a particular sequence of the same length in which heads and tails alternate in some pattern. Yet because each coin flip in the sequence is independent, both sequences are in fact equally likely. The sequence with a mixture of heads and tails appears more "similar" to an expectation of variation and is therefore erroneously judged to be more probable.
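The coin-flip argument can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a fair coin and a sequence length of six flips; the function name and the two example sequences are chosen here for illustration and do not come from the referenced paper.

```python
from fractions import Fraction
from itertools import product


def sequence_probability(seq, p_heads=Fraction(1, 2)):
    """Probability of one specific sequence of independent coin flips."""
    prob = Fraction(1)
    for flip in seq:
        prob *= p_heads if flip == "H" else 1 - p_heads
    return prob


n = 6                     # illustrative sequence length (an assumption)
all_heads = "H" * n       # the sequence that "looks" improbable
alternating = "HTHTHT"    # one specific sequence that "looks" random

p_all_heads = sequence_probability(all_heads)
p_alternating = sequence_probability(alternating)

# Enumerate every possible sequence and count those containing both
# heads and tails. The *class* of mixed sequences is large, which is
# why a mixed outcome feels likely, but each member is no more
# probable than the all-heads sequence.
sequences = ["".join(s) for s in product("HT", repeat=n)]
mixed_count = sum(1 for s in sequences if "H" in s and "T" in s)

print("P(all heads)       =", p_all_heads)
print("P(this alternation) =", p_alternating)
print("mixed sequences:", mixed_count, "of", len(sequences))
```

The error lies in comparing one specific sequence (all heads) against a whole category of sequences (anything mixed) while treating them as the same kind of outcome.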

In our experience with Advanced Scout this has manifested itself as the question of a shooter having a "hot hand". A player who makes 5 baskets in a row and then misses the next 5 is thought to have displayed an interesting streak, while a player who makes some shots and misses others, but overall makes 5 of 10, isn't viewed as interesting. Yet the two specific sequences of ten shots are equally likely (if we assume that the probability of making a basket is close to 50% and that each shot is independent).
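The same arithmetic applies to shot sequences. The sketch below is illustrative only and is not how Advanced Scout works; the function name and the two ten-shot sequences are made up for the example, and shots are assumed to be independent with a constant make probability. It also suggests that, under those assumptions, two sequences with the same number of makes are equally likely whatever the make probability is, so the 50% figure mainly matters when comparing sequences with different numbers of makes.

```python
from fractions import Fraction
from math import comb


def shot_sequence_probability(seq, p_make):
    """Probability of one specific make/miss sequence ('M' = make, 'X' = miss),
    assuming independent shots with a constant make probability."""
    prob = Fraction(1)
    for shot in seq:
        prob *= p_make if shot == "M" else 1 - p_make
    return prob


streaky = "MMMMMXXXXX"   # 5 makes in a row, then 5 misses
mixed = "MXMMXXMXMX"     # also 5 makes and 5 misses, but interleaved

# Compare the two sequences for a 50% shooter and for a 46.5% shooter
# (the latter figure is borrowed from the article purely as an example).
for p in (Fraction(1, 2), Fraction(465, 1000)):
    p_streaky = shot_sequence_probability(streaky, p)
    p_mixed = shot_sequence_probability(mixed, p)
    print(f"p={float(p):.3f}  streaky={float(p_streaky):.6f}  mixed={float(p_mixed):.6f}")

# Number of distinct orderings with exactly 5 makes in 10 shots, and the
# probability of making exactly 5 of 10 at 50% in *any* order.
orderings_with_5_makes = comb(10, 5)
p_five_of_ten = orderings_with_5_makes * Fraction(1, 2) ** 10
print("orderings with 5 makes:", orderings_with_5_makes)
print("P(exactly 5 of 10 at 50%):", p_five_of_ten)
```

The streaky sequence is just one of the many orderings that produce 5 of 10, and it is no rarer than any other single ordering.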

Beyond the academic examples and our experiences with the NBA, these heuristics can manifest themselves in other, more typical data mining domains in which patterns have been discovered. For example, the availability heuristic leads people to associate IBM with hardware and Microsoft with software. Thus the following finding becomes quite interesting - IBM gains more revenue from software sales than any other company, and Microsoft has a large revenue component from hardware sales (via mice, keyboards, etc). The similarity heuristic comes into play in the analysis of financial data and stock market predictions. For example, continued and rapid escalations in the stock prices of companies (some of which have yet to turn a profit) are surprising because this pattern of continued escalations is very dissimilar to the more minor upward and downward fluctuations that we've come to expect. (This assumes that the stock market really isn't different from a random process akin to flipping a coin - stock prices can go up or down on a given day. Therefore this pattern isn't really any different from a pattern represented as a sequence of coin flips.)
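Under the coin-flip assumption stated in the parenthesis above, one can get a feel for how long an unbroken run of "up" days a purely random process tends to produce. The following is a rough simulation sketch, not a model of any real market: the 250-day year, the 1,000 trials, the fixed seed, and the function names are all assumptions made for illustration.

```python
import random


def longest_run(days, symbol="U"):
    """Length of the longest unbroken run of `symbol` in a sequence."""
    best = current = 0
    for day in days:
        current = current + 1 if day == symbol else 0
        best = max(best, current)
    return best


def simulate_longest_up_runs(n_days=250, n_trials=1000, seed=0):
    """Simulate n_trials 'years' of fair up/down coin flips and return
    the longest run of up days ('U') observed in each simulated year."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(n_trials):
        days = [rng.choice("UD") for _ in range(n_days)]
        results.append(longest_run(days))
    return results


runs = simulate_longest_up_runs()
share_5_plus = sum(r >= 5 for r in runs) / len(runs)

print("typical (median) longest up-run:", sorted(runs)[len(runs) // 2])
print("share of simulated years with a run of 5+ up days:", share_5_plus)
```

If the random-process assumption holds, sustained runs in one direction are a routine product of chance, even though they look very unlike the small alternating fluctuations people expect.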

Patterns that data mining software and analytical systems have reported as interesting can become more (or less) interesting depending on the interpretations of the user. And as research has shown, the user's interpretations are affected by a subtle reliance upon heuristics.

---

For more information, see http://www.virtualgold.com.

