Research (Overview)

From 1995 until 2009, my main research focus was on how real-world issues such as rarity impact the data mining process (see below). Since 2009, however, my main focus has shifted to WISDM (WIreless Sensor Data Mining). I feel that this is an incredibly important emerging area that offers tremendous opportunities for data mining. Today's smartphones (e.g., iPhones and Droid phones) can provide accelerometer, GPS, light sensor, audio, image, and user proximity data, all of which is potentially available for data mining from millions of users. There are many interesting things we can do with such data, and my WISDM team has already been able to reliably identify a user's activity (walking, jogging, etc.) and his or her identity just from accelerometer data. We are now working on the infrastructure for collecting such sensor data from Android-based cellphones. For more information on the WISDM project, see my WISDM project page.

Until recently, however, most of my research focused on foundational issues in data mining that arise when addressing complex real-world problems, and most specifically on handling problems involving "rarity" in any of its many forms. This work began with research into small disjuncts, which correspond to the classification "rules" that cover very few training examples. This work analyzed how rare and exceptional cases impact classification algorithms and classification performance. Since small disjuncts often belong to the rare (i.e., minority) class, this work naturally led me to focus on the relationship between class distribution and classification performance, and on how one could find the optimal class distribution for learning if the amount of training data has to be limited. The rationale for limiting the amount of training data was that there may be costs associated with data acquisition and computation (i.e., model induction), and my research focus evolved as I began to think more generally about how such costs, or utility values, impact the data mining process. My next line of research focused squarely on these costs, and in particular on how to maximize the effectiveness of the data mining process when there are data acquisition and modelling costs. More recently I have returned to the issue of class distribution to handle the case where the class distribution changes over time but only unlabelled data is available from the new distribution. The resulting work demonstrated that semi-supervised learning and quantification-based methods can improve learning performance in this situation. I am interested in pursuing other such cases where the data distribution changes over time.

Between my work on small disjuncts and my work on class distribution, I focused on another issue related to rarity, motivated by my need at AT&T to predict telecommunication switch failures from logs of alarm messages. I studied this rare event prediction problem, which involves predicting rare events from a time-series history of events. This led to my genetic-algorithm-based learning system, called Timeweaver, which is notable in that it can operate directly on time-series data.

In addition to the original research on rarity and related areas just described, I have also actively helped promote and shape work in the area. My focus on rarity convinced me that the research literature lacked a comprehensive paper on the area, and this led me to publish an article describing the problems that arise due to rarity and potential solutions to each of them. Also, as my work shifted from a focus on rarity to a related but more general focus on the costs and benefits that impact the data mining process, it occurred to me and two of my colleagues that the field could benefit from a focus on these issues. We thus introduced the term Utility-Based Data Mining (UBDM) to cover all work that considers costs and benefits (i.e., utilities) that impact the data mining process. Our efforts to promote work in this area included organizing two highly successful KDD workshops and editing a special issue on the topic for the Data Mining and Knowledge Discovery journal.

I have also conducted research in other aspects of machine learning and data mining, as well as in other areas of Computer Science. I have industry experience working with object technology and expert systems, and these interests intersected when I was one of the first users of R++, a rule-based extension to C++ that provides object-oriented rules. I used this language to implement the ANSWER expert system for diagnosing telecommunication network errors, and this work received an AAAI innovative application of artificial intelligence award. I also showed how object-oriented rules could be used to implement design patterns. More recent work has focused on feature construction and information fusion, semi-supervised learning, and link mining. I have also published general works on data mining, including a brief survey of the field and a set of real-world experiences and recommendations based on my work as a data mining practitioner. In addition, I have edited the proceedings of two data mining conferences. I have also been active in the area of data mining in the telecommunications industry and have published four book chapters on this subject.