Chapter I: Introduction to Data Mining
We live in what is often called the information age. Because we believe that information leads to
power and success, and thanks to sophisticated technologies such as computers and satellites, we have
been collecting tremendous amounts of information. Initially, with the advent of computers and the means
for mass digital storage, we started collecting and storing all sorts of data, counting on the power of
computers to help sort through this amalgam of information. Unfortunately, these massive collections of
data, stored in disparate structures, very rapidly became overwhelming. This initial chaos has
led to the creation of structured databases and database management systems (DBMS). Efficient
database management systems have been very important assets for managing a large corpus of data,
and especially for the effective and efficient retrieval of particular information from a large
collection whenever needed. The proliferation of database management systems has also contributed to
the recent massive gathering of all sorts of information. Today, we have far more information than we
can handle:
from business transactions and scientific data, to satellite pictures, text reports and military intelligence.
Information retrieval alone is no longer enough for decision-making. Confronted with huge collections
of data, we now have new needs if we are to make better managerial choices: automatic summarization
of data, extraction of the "essence" of the stored information, and the discovery of patterns in
raw data.
Data mining is a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. It has been defined as:
The automated analysis of large or complex data sets in order to discover significant patterns or trends that
would otherwise go unrecognised.
The key elements that make data mining tools a distinct form of software are:
Data mining automates the process of sifting through historical data in order to discover new
information. This is one of the main differences between data mining and statistics, where a model is
usually devised by a statistician to deal with a specific analysis problem. It also distinguishes data
mining from expert systems, where the model is built by a knowledge engineer from rules extracted
from the experience of an expert.
The emphasis on automated discovery also separates data mining from OLAP and simpler query and
reporting tools, which are used to verify hypotheses formulated by the user. Data mining does not rely
on a user to define a specific query, merely to formulate a goal - such as the identification of fraudulent