Every problem might not have a solution right now, but don’t forget that but every solution was once a problem.
--Your friends at LectureNotes

Note for Data Mining And Data Warehousing - DMDW By Abhishek Kumar

  • Data Mining And Data Warehousing - DMDW
  • Note
  • Biju Patnaik University of Technology BPUT - BPUT
  • 7 Topics
  • 1 Offline Downloads
  • Uploaded 2 months ago
Abhishek Kumar
Abhishek Kumar
0 User(s)
Download PDFOrder Printed Copy

Share it with your friends

Leave your Comments

Text from page-1

www.Vidyarthiplus.com UNIT I: DATA WAREHOUSING Introduction to Data warehouse A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis as a data warehouse. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports or queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company. The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process". He defined the terms in the sentence as follows: Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations. Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Time-variant: All data in the data warehouse is identified with a particular time period. Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. It is a single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context. It can be Used for decision Support Used to manage and control business Used by managers and end-users to understand the business and make judgments Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores Other important terminology Enterprise Data warehouse: It collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse. Decision Support System (DSS): Information technology to help the knowledge worker (executive, manager, and analyst) makes faster & better decisions Drill-down: Traversing the summarization levels from highly summarized data to the underlying current or old detail Metadata: Data about data. Containing location and description of warehouse system components: names, definition, structure… Benefits of data warehousing Data warehouses are designed to perform well with aggregate queries running on large amounts of data. The structure of data warehouses is easier for end users to navigate, understand and query against unlike the relational databases primarily designed to handle lots of transactions. Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures. www.Vidyarthiplus.com

Text from page-2

www.Vidyarthiplus.com Queries that would be complex in very normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems. Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non uniform and scattered throughout a company. Data warehousing is an efficient way to manage demand for lots of information from lots of users. Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage. Operational and informational Data • Operational Data: Focusing on transactional function such as bank card withdrawals and deposits Detailed Updateable Reflects current data • Informational Data: Focusing on providing answers to problems posed by decision makers Summarized Non updateable DATA WAREHOUSE ARCHITECTURE AND ITS SEVEN COMPONENTS 1. Data sourcing, cleanup, transformation, and migration tools 2. Metadata repository 3. Warehouse/database technology 4. Data marts 5. Data query, reporting, analysis, and mining tools 6. Data warehouse administration and management 7. Information delivery system 1 Data warehouse database This is the central part of the data warehousing environment. This is the item number 2 in the above arch. diagram. This is implemented based on RDBMS technology. 2 Sourcing, Acquisition, Clean up, and Transformation Tools This is item number 1 in the above arch diagram. They perform conversions, summarization, key changes, structural changes and condensation. The data transformation is required so that the information can by used by decision support tools. The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, and SQL DDL code etc., to move the data into data warehouse from multiple operational systems. The functionalities of these tools are listed below: To remove unwanted data from operational db Converting to common data names and attributes Calculating summaries and derived data Establishing defaults for missing data Accommodating source data definition changes 3 Meta data It is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two: Technical Meta data: It contains information about data warehouse data used by warehouse designer, administrator to carry out development and management tasks. It includes, www.Vidyarthiplus.com

Text from page-3

www.Vidyarthiplus.com Info about data stores Transformation descriptions. That is mapping methods from operational db to warehouse db Warehouse Object and data structure definitions for target data The rules used to perform clean up, and data enhancement Data mapping operations Access authorization, backup history, archive history, info delivery history, data acquisition history, data access etc., Business Meta data: It contains info that gives info stored in data warehouse to users. It includes, Subject areas, and info object type including queries, reports, images, video, audio clips etc. Internet home pages Info related to info delivery system Data warehouse operational info such as ownerships, audit trails etc., 4 Access tools Its purpose is to provide info to business users for decision making. There are five categories: Data query and reporting tools Application development tools Executive info system tools (EIS) OLAP tools Data mining tools 5 Data marts Departmental subsets that focus on selected subjects. They are independent used by dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users. Data mart is used in the following situation: Extremely urgent user requirement The absence of a budget for a full scale data warehouse strategy The decentralization of business needs The attraction of easy to use tools and mind sized project Data mart presents two problems: 1. Scalability: A small data mart can grow quickly in multi dimensions. So that while designing it, the organization has to pay more attention on system scalability, consistency and manageability issues 2. Data integration 6 Data warehouse admin and management The management of data warehouse includes, Security and priority management Monitoring updates from multiple sources Data quality checks Managing and updating meta data Auditing and reporting data warehouse usage and status Purging data Replicating, sub setting and distributing data Backup and recovery Data warehouse storage management which includes capacity planning, hierarchical storage management and purging of aged data etc., www.Vidyarthiplus.com

Text from page-4

www.Vidyarthiplus.com 7 Information delivery system • It is used to enable the process of subscribing for data warehouse info. • Delivery to one or more destinations according to specified scheduling algorithm Building a Data warehouse: There are two reasons why organizations consider data warehousing a critical need. In other words, there are two factors that drive you to build and use data warehouse. They are: Business factors: Business users want to make decision quickly and correctly using all available data. Technological factors: To address the incompatibility of operational data stores IT infrastructure is changing rapidly. Its capacity is increasing and cost is decreasing so that building a data warehouse is easy There are several things to be considered while building a successful data warehouse Business considerations: Organizations interested in development of a data warehouse can choose one of the following two approaches: Top - Down Approach (Suggested by Bill Inmon) Bottom - Up Approach (Suggested by Ralph Kimball) Top - Down Approach In the top down approach suggested by Bill Inmon, we build a centralized repository to house corporate wide business data. This repository is called Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy. The central repository for corporate wide data helps us maintain one version of truth of the data. The data in the EDW is stored at the most detail level. The reason to build the EDW on the most detail level is to leverage 1. Flexibility to be used by multiple departments. 2. Flexibility to cater for future requirements. The disadvantages of storing data at the detail level are 1. The complexity of design increases with increasing level of detail. 2. It takes large amount of space to store data at detail level, hence increased cost. Once the EDW is implemented we start building subject area specific data marts which contain data in a de normalized form also called star schema. The data in the marts are usually summarized based on the end users analytical requirements. The reason to de normalize the data in the mart is to provide faster access to the data for the end users analytics. If we were to have queried a normalized schema for the same analytics, we would end up in a complex multiple level joins that would be much slower as compared to the one on the de normalized schema. We should implement the top-down approach when 1. The business has complete clarity on all or multiple subject areas data warehosue requirements. 2. The business is ready to invest considerable time and money. The advantage of using the Top Down approach is that we build a centralized repository to cater for one version of truth for business data. This is very important for the data to be reliable, consistent across subject areas and for reconciliation in case of data related contention between subject areas. www.Vidyarthiplus.com

Lecture Notes