UNIT I

1.1 What motivated data mining? Why is it important?

Data mining has attracted a great deal of attention in the information industry in recent years, mainly because of the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and science exploration.

The evolution of database technology

Data collection and database creation (1960s and earlier)
1) Primitive file processing

Database management systems (1970s-early 1980s)
1) Hierarchical and network database systems
2) Relational database systems
3) Data modeling tools: entity-relationship models, etc.
4) Indexing and accessing methods: B-trees, hashing, etc.
5) Query languages: SQL, etc. User interfaces, forms, and reports
6) Query processing and query optimization
7) Transactions, concurrency control, and recovery
8) Online transaction processing (OLTP)

Advanced database systems (mid 1980s-present)
1) Advanced data models: extended-relational, object-relational, etc.
2) Advanced applications: spatial, temporal, multimedia, active, stream and sensor, knowledge-based

Advanced data analysis: data warehousing and data mining (late 1980s-present)
1) Data warehouse and OLAP
2) Data mining and knowledge discovery: generalization, classification, association, clustering, frequent pattern, outlier analysis, etc.
3) Advanced data mining applications: stream data mining, bio-data mining, text mining, web mining, etc.

Web-based databases (1990s-present)
1) XML-based database systems
2) Integration with information retrieval
3) Data and information integration

New generation of integrated data and information systems (present-future)

1.2 What is data mining?

Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases", or KDD. Others view it as an essential step in the process of knowledge discovery in databases.

Knowledge discovery as a process is depicted in the following figure and consists of an iterative sequence of the following steps:
1. Data cleaning: to remove noise or irrelevant data.
2. Data integration: where multiple data sources may be combined.
3. Data selection: where data relevant to the analysis task are retrieved from the database.
4. Data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures.
7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

Architecture of a typical data mining system / Major components

Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Based on this view, the architecture of a typical data mining system may have the following major components:
1. A database, data warehouse, or other information repository, which consists of the set of databases, data warehouses, spreadsheets, or other kinds of information repositories containing the data to be mined (for example, student and course information).
2. A database or data warehouse server, which fetches the relevant data based on the user's data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. For example, the knowledge base may contain metadata that describes data from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining modules by employing interestingness measures to help focus the search toward interesting patterns.
6. A graphical user interface that lets the user interact with the data mining system.

How is a data warehouse different from a database? How are they similar?

• Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema, and used for data analysis and decision support. A database, by contrast, is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases, where the schema of one database may not agree with the schema of another. A database system supports ad-hoc queries and on-line transaction processing. For more details, please refer to the section “Differences between operational database systems and data warehouses.”
• Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.

1.3 Data mining: on what kind of data? / Describe the following advanced database systems and applications: object-relational databases, spatial databases, text databases, multimedia databases, the World Wide Web.

In principle, data mining should be applicable to any kind of information repository. This includes relational databases, data warehouses, transactional databases, advanced database systems, flat files, and the World Wide Web. Advanced database systems include object-oriented and object-relational databases, and specific application-oriented databases, such as