Unit 1
Data Mining Concepts

Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery consists of an iterative sequence of the following steps:

Data cleaning - Noise and inconsistent data are removed.
Data integration - Data from multiple data sources are combined.
Data selection - Data relevant to the analysis task are retrieved from the database.
Data transformation - Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data mining - An essential process in which intelligent methods are applied in order to extract data patterns.
Pattern evaluation - The truly interesting patterns representing knowledge are identified, based on some interestingness measures.
Knowledge presentation - Knowledge representation techniques are used to present the mined knowledge to the user.

Figure: Knowledge Discovery Process (Stages of KDD)
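The seven steps listed above can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not a real KDD system: the toy records, the function names (clean, integrate, select, transform, mine, evaluate, present), and the use of "fraction of customer baskets containing an item" as the interestingness measure are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical toy records from two sources; None marks a noisy/missing value.
store_a = [
    {"customer": "c1", "item": "milk", "amount": 2.0},
    {"customer": "c2", "item": "bread", "amount": None},  # noisy record
    {"customer": "c2", "item": "milk", "amount": 2.0},
]
store_b = [
    {"customer": "c1", "item": "bread", "amount": 1.5},
    {"customer": "c3", "item": "milk", "amount": 2.0},
    {"customer": "c3", "item": "eggs", "amount": 3.0},
]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine records from multiple sources."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: consolidate records into one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Data mining: count how many baskets contain each item."""
    counts = Counter()
    for items in baskets.values():
        counts.update(items)
    return counts

def evaluate(counts, n_baskets, min_fraction):
    """Pattern evaluation: keep items that meet the (assumed) threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_fraction}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{item}: in {frac:.0%} of baskets"
            for item, frac in sorted(patterns.items())]

cleaned = integrate(clean(store_a), clean(store_b))
baskets = transform(select(cleaned, ["customer", "item"]))
patterns = evaluate(mine(baskets), len(baskets), min_fraction=0.5)
report = present(patterns)
print("\n".join(report))
```

In practice the sequence is iterative: a poor result at the evaluation step would typically send the analyst back to cleaning, selection, or transformation with different choices.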
According to this view, data mining is only one step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data. Therefore, in this book, we choose to use the term data mining. Based on this view, the architecture of a typical data mining system may have the following major components:

Database, Data Warehouse, World Wide Web, or Other Information Repository - This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
Database or Data Warehouse Server - The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
Knowledge Base - This is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. It can be stored simply as a set of rules. Such knowledge can include concept hierarchies, which are used to organize attributes or attribute values into different levels of abstraction.
Data Mining Engine - This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Pattern Evaluation Module - This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
User Interface - This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
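As an illustration of how a pattern evaluation module might apply an interestingness threshold, here is a minimal sketch. The Pattern class, the numeric scores, and the filter_patterns function are hypothetical; the text does not specify which interestingness measure is used, so the score is treated as an opaque number between 0 and 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    description: str
    interestingness: float  # hypothetical score in [0, 1]

def filter_patterns(patterns, threshold):
    """Keep patterns whose score meets the threshold, most interesting first."""
    kept = [p for p in patterns if p.interestingness >= threshold]
    return sorted(kept, key=lambda p: p.interestingness, reverse=True)

# Hypothetical patterns handed over by the data mining engine.
candidates = [
    Pattern("milk -> bread", 0.72),
    Pattern("bread -> eggs", 0.35),
    Pattern("milk -> eggs", 0.90),
]

interesting = filter_patterns(candidates, threshold=0.5)
for p in interesting:
    print(f"{p.description} (score {p.interestingness:.2f})")
```

Filtering after mining is the simplest arrangement. Pushing the threshold into the mining engine itself, so that uninteresting patterns are never generated, is the tighter interaction between modules that the description above alludes to.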
Figure: Architecture of Data Mining System

Data Warehouse Concepts

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity.

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount. A multidimensional structure is a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data. The structure is broken into cubes, and data are stored and accessed within the confines of each cube. Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions.
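The idea that each cell of a cube holds aggregate measures for one combination of dimension values can be sketched in a few lines. The dimensions (quarter, city, item), the sample sales facts, and the build_cube function are assumptions made for this example; a real warehouse would use a dedicated storage engine rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, city, item, sales_amount).
facts = [
    ("Q1", "Chicago", "phone", 400.0),
    ("Q1", "Chicago", "phone", 350.0),
    ("Q1", "Toronto", "laptop", 900.0),
    ("Q2", "Chicago", "laptop", 950.0),
]

def build_cube(rows):
    """Map each (quarter, city, item) cell to its aggregate measures."""
    cube = defaultdict(lambda: {"count": 0, "sales": 0.0})
    for quarter, city, item, amount in rows:
        cell = cube[(quarter, city, item)]
        cell["count"] += 1       # measure 1: number of sales
        cell["sales"] += amount  # measure 2: total sales amount
    return dict(cube)

cube = build_cube(facts)
print(cube[("Q1", "Chicago", "phone")])
```

Each key is one cell address (one value per dimension), and the stored dictionary holds the two aggregate measures named in the text, count and sales amount.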
Figure: Multidimensional Database

By providing multidimensional data views and the pre-computation of summarized data, data warehouse systems are well suited for on-line analytical processing, or OLAP. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization. For example, we can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country.
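The two operations can be sketched on a tiny two-dimensional data set. The sales figures, the month-to-quarter and city-to-country mappings, and the roll_up and drill_down functions are hypothetical and chosen only to mirror the quarter/month and city/country examples above.

```python
from collections import defaultdict

# Hypothetical concept hierarchies: month -> quarter and city -> country.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}

# Hypothetical base data: sales amount per (month, city).
sales = {
    ("Jan", "Chicago"): 100.0,
    ("Feb", "Chicago"): 150.0,
    ("Feb", "New York"): 200.0,
    ("Mar", "Toronto"): 300.0,
    ("Apr", "Toronto"): 50.0,
}

def roll_up(cells, time_map=None, place_map=None):
    """Aggregate cells to a coarser level along one or both dimensions."""
    result = defaultdict(float)
    for (time, place), amount in cells.items():
        key = (time_map[time] if time_map else time,
               place_map[place] if place_map else place)
        result[key] += amount
    return dict(result)

def drill_down(cells, quarter, time_map):
    """Return the finer-grained month-level cells behind one quarter."""
    return {k: v for k, v in cells.items() if time_map[k[0]] == quarter}

by_quarter = roll_up(sales, time_map=MONTH_TO_QUARTER)  # month -> quarter
by_country = roll_up(sales, place_map=CITY_TO_COUNTRY)  # city -> country
q1_months = drill_down(sales, "Q1", MONTH_TO_QUARTER)   # quarter -> months

print(by_quarter)
print(by_country)
print(q1_months)
```

Note that drill-down only works here because the month-level data are still available. A summary cannot be split back into detail on its own, which is one reason warehouses keep or pre-compute data at several levels of summarization.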