I attended an SPSS web seminar about their Clementine program, which performs data mining.
The talk was oriented to business applications, but still had some interesting general
insights. The speaker started with the claim that projects that incorporated data mining
technologies had a much greater return on investment than other projects
(www.spss.com/dk/IDC%20Predictive%20Analytics%20and%20ROI%20Report.pdf).
Data mining is not really new. Anytime people use information about the world to draw
conclusions, you can argue that they are data mining. Typically, though, data mining is
reserved for situations where the number of observations is large.
Data mining is used to
- predict category membership or a numeric value,
- group or cluster things together that have similar characteristics,
- associate events that occur together or in a sequence,
- find outliers that don't fit ordinary patterns or expected behavior.
The first two bullets represent supervised learning and unsupervised learning, respectively,
and when I have time I want to document some of the approaches used for supervised and
unsupervised learning. But for now, these web pages are painfully incomplete.
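To make the distinction concrete, here is a minimal sketch in Python using only the standard library. The supervised half learns from examples that already carry a label; the unsupervised half is given the same numbers with no labels and has to find the groups itself. The function names and the toy data are my own illustrations and have nothing to do with how Clementine or Weka implement these methods.

```python
from statistics import mean


def fit_centroids(values, labels):
    """Supervised: learn one average value per known label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}


def predict(centroids, value):
    """Assign the label whose learned average is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two clusters (k-means with k=2)."""
    centers = [min(values), max(values)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for value in values:
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            clusters[nearest].append(value)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


# Toy data (made up for illustration).
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(values, labels)
predicted = predict(centroids, 4.5)   # uses the labels seen during fitting
clusters = two_means(values)          # never sees the labels
```

The only real difference between the two halves is whether the labels are available when the model is built.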
Finding outliers is an interesting approach that I had not devoted much thought to.
Perhaps the outliers are observations that merit additional scrutiny. For example, in some
applications, outliers may be potentially fraudulent cases.
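One simple way to flag such observations, sketched below, is a modified z-score based on the median and the median absolute deviation (MAD). This is only one of many possible outlier rules and is not necessarily what Clementine does. The 3.5 cutoff is a commonly used convention, 0.6745 is the usual scaling constant for the MAD, and the amounts are made up.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread around the median, so this rule cannot rank anything.
        return []
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Hypothetical transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 250]
flagged = find_outliers(amounts)
```

A flagged value is only a candidate for closer inspection, not proof that anything is wrong.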
CRISP-DM is the process model that SPSS uses to structure data mining projects (www.crisp-dm.org).
The steps in CRISP-DM include
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
There are several feedback loops in this model. For example, data understanding is involved in
a feedback loop with business understanding. Modeling is involved in a feedback loop with
data preparation. Evaluation, of course, feeds back to business understanding.
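The process described above can be written down as a small graph. The sketch below encodes the six steps in order plus the three feedback loops mentioned here. The direction of each loop (the later step pointing back to the earlier one) is my assumption, and the official CRISP-DM diagram may show more connections than these.

```python
PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Normal forward flow: each phase leads to the one after it.
FORWARD = dict(zip(PHASES, PHASES[1:]))

# Feedback loops mentioned in the talk (direction is an assumption).
FEEDBACK = {
    "Data understanding": "Business understanding",
    "Modeling": "Data preparation",
    "Evaluation": "Business understanding",
}


def next_phases(phase):
    """List where a project can go next: forward first, then back."""
    options = []
    if phase in FORWARD:
        options.append(FORWARD[phase])
    if phase in FEEDBACK:
        options.append(FEEDBACK[phase])
    return options


after_modeling = next_phases("Modeling")
```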
Children's Memorial Hospital used Clementine and SPSS in research on treatments for pediatric
brain tumors, work that was recognized by the Computerworld Honors Foundation
(www.spss.com/press/template_view.cfm?PR_ID=636).
A nice feature of Clementine is that models generated by the software can be exported as C
code or XML, which allows you to automate the delivery of data mining solutions to other
computer platforms or on the web. Clementine also includes a module for mining data from
text fields and a module for extracting events from web logs.
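To illustrate what exporting a model as C code might mean, here is a rough sketch that turns a tiny linear scoring model into the source of a C function. I have not seen Clementine's actual export format, so the function name, the field names, and the coefficients are all hypothetical.

```python
def linear_model_to_c(name, intercept, coefficients):
    """Render a linear model as the source text of a C function.

    coefficients maps each input field name to its weight.
    """
    args = ", ".join(f"double {field}" for field in coefficients)
    terms = "".join(
        f" + {weight!r} * {field}" for field, weight in coefficients.items()
    )
    return (
        f"double {name}({args}) {{\n"
        f"    return {intercept!r}{terms};\n"
        f"}}\n"
    )


# Hypothetical model: the weights are made up for illustration.
c_source = linear_model_to_c("score", 0.5, {"age": 0.02, "income": -0.001})
print(c_source)
```

The appeal of this kind of export is that the generated function has no dependency on the data mining package, so it can be compiled into whatever system needs to score new cases.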
The speaker mentioned a new model (CARMA -- Continuous Association Rule
Mining Algorithm) which allows interactive pruning of rules in a decision tree (http://control.cs.berkeley.edu/carma.html).
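For readers unfamiliar with association rules, the sketch below computes the two standard measures used to rank them, support and confidence, over a handful of made-up shopping baskets. This is a plain batch calculation for illustration only. It is not the CARMA algorithm, which I have not studied.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})         # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of 3 bread baskets
```

Pruning rules usually means discarding those whose support or confidence falls below a chosen threshold.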
In a one-hour presentation, you can't get a good feel for how the software works. It looks
like a good comprehensive package that is easy to use. There are a lot of competing products
out there, of course. One of the more intriguing competitors is Weka, an open source system
for data mining. The main site for Weka is at the University of Waikato in Hamilton, New
Zealand.
Since Weka is open source, it is popular with data mining classes at universities where
you can't ask the students to go out and buy a thousand-dollar software program (the price of
college textbooks is already bad enough).
A good book about Weka is Data Mining: Practical Machine Learning Tools and Techniques
with Java Implementations, by Ian H. Witten and Eibe Frank (ISBN: 1558605525).
I have not had a chance to work with either Clementine or Weka.