The pros and cons of control charts versus data mining (November 17, 2007)
In a talk I gave in December 2006, I highlighted how control charts can
augment more complex statistical tools like data mining in the analysis of
adverse event data. Here's a summary of the pros and cons of each approach.
Advantages of control charts. Control charts were originally
proposed by Walter Shewhart in the 1920s. That long history has given
practitioners plenty of experience to prove their usefulness and
adaptability in a wide range of applications.
The long history of the control chart also makes it a tool that is familiar
and comfortable to a lot of people. While most of the applications are in
industrial areas, a book published more than a decade ago,
- Measuring Quality Improvement in Healthcare: A Guide to Statistical
Process Control Applications. Carey RG, Lloyd RC (1995) New York:
Quality Resources.
highlights numerous applications of control charts in health care.
Finally, the control chart is easy to use. Even with some of the recent
enhancements and extensions, control charts remain a relatively simple and
accessible tool. You don't need a lot of state-of-the-art statistical tools
like you do for a data mining project.
This means that you don't need a lot of statistical and computational
expertise to use control charts. Only a small number of people have the
qualifications and the expertise to do a good job with a data mining
model. By placing control charts in the hands of a larger number of people,
you increase the number of eyes that look at a problem and (in theory)
increase the chances that safety problems are found early.
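To give a sense of how little machinery a control chart needs, here is a minimal sketch of a c-chart (a Shewhart chart for event counts). It assumes the counts are roughly Poisson and that the opportunity for events is about the same in every period. The monthly counts and the function names are invented for illustration; they do not come from the talk.

```python
import math

def c_chart_limits(counts):
    """Return (center, lower, upper) three-sigma limits for a c-chart."""
    center = sum(counts) / len(counts)
    sigma = math.sqrt(center)  # Poisson assumption: variance equals the mean
    lower = max(0.0, center - 3 * sigma)  # a count cannot fall below zero
    upper = center + 3 * sigma
    return center, lower, upper

def out_of_control(counts, baseline):
    """Indices of points in `counts` outside limits computed from `baseline`."""
    _, lower, upper = c_chart_limits(baseline)
    return [i for i, c in enumerate(counts) if c < lower or c > upper]

# Hypothetical monthly adverse event counts (invented for illustration).
baseline = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5]
new_months = [6, 4, 14, 5]

center, lower, upper = c_chart_limits(baseline)
flagged = out_of_control(new_months, baseline)
print(f"center={center:.2f}, limits=({lower:.2f}, {upper:.2f}), flagged={flagged}")
```

The whole calculation is a mean and a square root, which is why a chart like this can be maintained by someone without specialized training.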
Disadvantages of control charts. The control chart is an exploratory
tool. If the control chart shows a point out of control, the chart won't
explain to you WHY it is out of control.
The control chart won't help to identify a subgroup at greater risk if you
did not have the foresight to monitor that group. It also won't identify an
adverse event that was unexpected. With a control chart, you have to know
what you're looking for.
While there are some adaptations of control charts for multivariate data,
seasonal data, and other complexities, the control chart is not easily
adapted to them.
Advantages of data mining models.
Data mining models excel in situations where the data streams are large and
complex. Some of the data mining methods are adept at handling ambiguous data
and missing data. They can also detect subtle non-linearities and
interactions that most other statistical methods might miss.
While the data mining methods are not as easy to use as their proponents
claim (the old saw "easy to use is easy to say" certainly applies here), the
researchers in this field go to great lengths to automate key components of
the data mining process. Many methods incorporate techniques like
cross-validation that help you quickly home in on a model that is neither too
complex nor overly simple.
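As a rough illustration of how cross-validation picks a level of complexity, here is a small sketch using nearest-neighbor regression on simulated data. The data, the candidate values, and the function names are all assumptions made for the example. A small number of neighbors gives a very flexible (complex) model, and a large number gives a very smooth (simple) one.

```python
import math
import random

def knn_predict(train, x, neighbors):
    """Average the y values of the training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, neighbors, folds=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    total = 0.0
    for f in range(folds):
        held_out = data[f::folds]  # every folds-th point, starting at f
        train = [p for i, p in enumerate(data) if i % folds != f]
        total += sum((y - knn_predict(train, x, neighbors)) ** 2
                     for x, y in held_out)
    return total / len(data)

# Simulated data: a sine curve plus noise (invented for illustration).
rng = random.Random(0)
data = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(60)]
rng.shuffle(data)

candidates = [1, 3, 5, 9, 15, 25, 45]
errors = {n: cv_error(data, n) for n in candidates}
best = min(errors, key=errors.get)
print("cross-validated error by number of neighbors:", errors)
print("chosen number of neighbors:", best)
```

Each candidate is scored only on points it did not see during fitting, so a model that merely memorizes the noise is penalized along with one that is too smooth to follow the curve.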
There is a wealth of data mining tools, each with its own particular
strengths, so a sophisticated modeler can apply a variety of data mining
methods to rapidly triangulate on an accurate solution.
Finally, data mining models are just a lot of fun. Or am I the only
one who thinks this sort of thing is cool?
Disadvantages of data mining models. While some of the disadvantages
of data mining models are highlighted above (the need for highly trained
personnel and specialized software), perhaps two additional disadvantages can
be summarized by a couple of personal anecdotes that I originally discussed
in a January 6, 2005 weblog entry.
The first story was told to me by a doctor here at Children's Mercy, Jay
Portnoy. He was describing a data mining model that was fed images of both
cars and trucks (a training set, in the parlance of data mining) to see if it
could develop a rule for identifying whether a future image was either a
car or a truck based just on mathematical properties of that image. It
did a pretty good job of finding factors in the training set that
distinguished between cars and trucks. But it failed miserably on the
first new image it was trying to classify. It was an image of a car on a
snow-covered highway. The data mining algorithm said that this was almost
certainly a truck. What the researchers then realized is that in the training
set, anytime there was snow in the background, it was a truck that was being
shown and never a car. I suppose it is the tendency of marketing to always
show trucks in rugged, primitive, and/or dangerous driving conditions. So the
data mining model seized on a key relationship (color of the background)
that existed only accidentally in the training set, rather than focusing
on those aspects, such as the shape and size of the vehicle, that most of us
would use to distinguish cars from trucks.
Moral from anecdote #1. Even the most sophisticated data mining
models cannot overcome deficiencies in your data.
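The snow problem is easy to reproduce in miniature. The sketch below is a toy stand-in, not the system Dr. Portnoy described: the two features, the eight training images, and the rule learner (which simply keeps the single feature that best matches the label in the training set) are all hypothetical.

```python
LABEL = "is_truck"
FEATURES = ["snow_background", "boxy_shape"]

def row(snow, boxy, truck):
    """One image, reduced to two yes/no features and its true label."""
    return {"snow_background": snow, "boxy_shape": boxy, LABEL: truck}

# Hypothetical training set: every truck is shown in snow, no car is.
training = (
    [row(1, 1, 1)] * 3 + [row(1, 0, 1)]    # trucks (one is not boxy)
    + [row(0, 0, 0)] * 3 + [row(0, 1, 0)]  # cars (one is boxy)
)

def training_accuracy(feature):
    """Share of training images where 'feature present' matches 'is a truck'."""
    hits = sum(r[feature] == r[LABEL] for r in training)
    return hits / len(training)

# The learner keeps whichever single feature looks best on the training set.
chosen = max(FEATURES, key=training_accuracy)

# A new image: a car on a snow-covered highway.
new_image = row(snow=1, boxy=0, truck=0)
prediction = new_image[chosen]  # rule: predict truck when the feature is present

print("feature chosen:", chosen)
print("predicted truck?", bool(prediction), "| actually truck?", bool(new_image[LABEL]))
```

The background is a perfect predictor in the training set, so the learner prefers it to the shape of the vehicle, and the first car photographed in snow is called a truck.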
The second story, which dealt with the question "so what?", is one I heard
in a training class on data mining models taught by Richard DeVeaux. He
mentioned that one of the earliest findings in the data mining world (though
he was uncertain whether this is a true story or an urban legend) was an
unusual association seen in sales patterns at convenience stores. It seemed
that people who came in to buy beer almost always ended up buying diapers on
the same visit. This is the classic sort of thing that data mining models are
supposed to find: unusual and unexpected associations in a very large data
set. So he posed this question to a group of managers: what would you do
with this information? A common response was: stock the shelves so
that the beer and the diapers are close together to make the trip for the
customer faster and more convenient. Another common response was: put the
beer and the diapers at opposite ends of the store so that customers
would have to spend more time in the store, increasing the chances for
impulse purchases. Another common response was a shrug of the shoulders.
In fact, we often don't know what to make of the associations found by data
mining models.
Moral from anecdote #2. Significant findings from a data mining
model are not guaranteed to provide appropriate clinical guidance.
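For readers who have not seen how an association like beer and diapers is measured, here is a minimal sketch of the three usual summary numbers: support, confidence, and lift. The ten shopping baskets are invented for illustration and are not data from the story.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(antecedent in t for t in transactions)
    n_c = sum(consequent in t for t in transactions)
    n_both = sum(antecedent in t and consequent in t for t in transactions)
    support = n_both / n            # share of baskets containing both items
    confidence = n_both / n_a       # share of antecedent baskets with consequent
    lift = confidence / (n_c / n)   # confidence relative to the base rate
    return support, confidence, lift

# Hypothetical shopping baskets (invented for illustration).
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"milk", "diapers"},
    {"bread", "chips"},
    {"milk"},
    {"soda", "chips"},
    {"bread", "milk"},
]

support, confidence, lift = rule_metrics(baskets, "beer", "diapers")
print(f"support={support:.3f}, confidence={confidence:.3f}, lift={lift:.3f}")
```

A lift above 1 says the two items appear together more often than their individual popularity would predict. None of the three numbers says whether to shelve the items side by side or at opposite ends of the store.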
The bottom line. No one statistical tool or method is going to
provide you with everything you need. The broader the range of methods that
you bring to bear on a problem, the better your chances of success.
Category: Adverse events in clinical trials