Research (Overview)
From 1995 until 2009 my main research focus was how real-world issues
such as rarity impact the data mining process (see below). However, since 2009
my main focus has shifted and I have been working on WISDM (WIreless Sensor Data
Mining). I feel that this is an incredibly important emerging area that offers
tremendous opportunities for data mining. Today's smart phones (e.g., iPhones,
Droid phones, etc.) can provide us with accelerometer, GPS, light sensor,
audio, image, and user proximity data, all of which is potentially available
for data mining, from millions of users. There are many interesting things we
can do with such data and my WISDM team has already been able to reliably
identify a user's activity (walking, jogging, etc.) and the user's identity
just from accelerometer data. We are now working on the infrastructure for
collecting such sensor data from Android-based cellphones. For more information on the WISDM project, see my
WISDM project page.
Until recently, however, most of my research focused on
foundational issues in data mining that occur
when addressing complex real-world problems, and most specifically
on handling problems involving "rarity", in any of its many forms. This work
began with research into
small disjuncts,
which correspond to the
classification "rules" that cover very few training examples. This work
analyzed how rare and exceptional cases impact classification algorithms and
classification performance. Since small disjuncts often belong to the rare
(i.e., minority) class, this work naturally led me to focus on the relationship
between class distribution
and classification performance, and how one could find the optimal class
distribution for learning if the amount of training data has to be
limited. The rationale for limiting the amount of training data was that
there may be costs associated with data acquisition and computation
(i.e., model induction) and my research focus evolved as I began to think
more generally about how such costs, or utility values, impact the data
mining process. My research then began to focus squarely on these costs, and
specifically on how to maximize the effectiveness of the data mining process
when data acquisition and modelling are costly.
More recently I have returned to the issue of class distribution to handle
the case where the class distribution changes over time but only
unlabelled data is available from the new distribution. The resulting work
demonstrated that semi-supervised learning and quantification-based methods
can improve learning performance in this situation. I am interested in
pursuing other such cases where the data distribution changes over time.
In between my work on small disjuncts and class distribution I focused
on another issue related to rarity, motivated by my need at AT&T to predict
telecommunication switch failures from logs of alarm messages. I studied this
rare event prediction
problem, which involved predicting rare events from a time-series history of
events. This led to my genetic-algorithm-based learning system, called
Timeweaver, which is notable in that it can operate directly on
time-series data.
In addition to the original research on rarity and related areas just
described, I have also actively helped promote and shape work in the area. My
focus on rarity convinced me that the research literature lacked a
comprehensive paper on the area and this led me to publish an article that
described the problems that arise due to rarity and potential solutions to
each of these problems. Also, as my work shifted from a focus on rarity
to a related but more general focus on the costs and benefits that impact the
data mining process, it occurred to me and two of my colleagues that the
field could benefit from a focus on these issues. We thus introduced the
term Utility-Based Data Mining (UBDM) to cover all work that considers costs and
benefits (i.e., utilities) that impact the data mining process. Our efforts
in promoting work in this area included organizing two highly successful
KDD workshops and editing a special issue on this area for the Data Mining
and Knowledge Discovery journal.
I have also conducted research in other aspects of machine learning and
data mining, as well as in other areas of Computer Science. I have industry
experience working with object technology and expert systems and these
interests intersected when I was one of the first users of R++, a rule-based
extension to C++ that provides object-oriented rules. I used this language
to implement the ANSWER expert system for diagnosing telecommunication network
errors and this work received a AAAI innovative application of
artificial intelligence award. I also showed how object-oriented rules could
be used to implement design patterns. More recent work has focused on
feature construction and information fusion, semi-supervised learning, and
link mining. I have also published general works on data mining including a
brief survey of the field and real-world experiences and recommendations based
on my work as a data mining practitioner. In addition, I have edited the
proceedings from two data mining conferences. I have also been active in the
area of data mining in the telecommunications industry and have published four
book chapters on this subject.