--Your friends at LectureNotes

Data Mining And Data Warehousing

by Vivek Chandrakar
Type: NoteInstitute: Chhattisgarh swami Vivekanand Technical University Offline Downloads: 106Views: 2296Uploaded: 9 months agoAdd to Favourite

Touch here to read

Data Mining And Data Warehousing by Vivek Chandrakar

Vivek Chandrakar
Vivek Chandrakar

/ 100

Share it with your friends

Suggested Materials

Leave your Comments


Vivek Chandrakar
Vivek Chandrakar
Introduction to Data Warehousing What is a Data Warehouse? Data Warehouse is a storage place for data. It is used to store current and historical information. According to Ralph Kimball, “Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model” According to BillInmon, “A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process” However, the definition by Bill Inmon is extensively accepted which is explained in detail below: Subject-Oriented: Data Warehouse concentrates on a particular subject area. For example, a data warehouse consisting the details of the sales as illustrated below Integrated: Data warehouse consists of data that has been combined from numerous sources. The data can be obtained from mainframes, flat files, SQL Server database and various other heterogeneous sources as illustrated. Time-variant: Data warehouse stores current and historical data. The data is used for decision-making and forecasting the business trends. Non-volatile: The data stored in a data warehouse is retained and never deleted. A different version of data is stored in the data warehouse indicating if any inserts, updates or deletes have occurred. Hence, the original data is never altered. Why do we need a data warehouse? Data warehouses are used extensively in the largest and most complex businesses around the world. In demanding situations, good decision making becomes critical. Significant and relevant data is required to make decisions. This is possible only with the help of a well-designed data warehouse.
Enhancing the turnaround time for analysis and reporting: Data warehouse allows business users to access critical data from a single source enabling them to take quick decisions. They need not waste time retrieving data from multiple sources. The business executives can query the data themselves with minimal or no support from IT which in turn saves money and time. Improved Business Intelligence: Data warehouse helps in achieving the vision for the managers and business executives. Outcomes that affect the strategy and procedures of an organization will be based on reliable facts and supported with evidence and organizational data. Benefit of historical data: Transactional data stores data on a day to day basis or for a very short period of duration without the inclusion of historical data. In comparison, a data warehouse stores large amounts of historical data which enables the business to include time-period analysis, trend analysis, and trend forecasts. Standardization of data: The data from heterogeneous sources are available in a single format in a data warehouse. This simplifies the readability and accessibility of data. For example, gender is denoted as Male/ Female in Source 1 and m/f in Source 2 but in a data warehouse the gender is stored in a format which is common across all the businesses i.e. M/F. Immense ROI (Return On Investment): Return On Investment refers to the additional revenues or reduces expenses a business will be able to realize from any project. According to a 2002 International Data Corporation (IDC) study “The Financial Impact of Business Analytics”, analytics projects have been achieving a substantial impact on a business’ financial status. What are the components of a Data warehouse? The components of a data warehouse are depicted in the figure below Data Sources  A flat file database stores data in a normal text format. Contrary to a relational database where the data is stored in the form of tables, in a flat file database the data stored does not have a folders or paths related to them. No manipulations are performed on the data. Delimiters are used in flat files to separate the data columns.  Excel spreadsheets are regularly used in data warehousing operations. They are impressive, lowpriced, and flexible tolls that many decision-makers find convenient to use. Excel also provides graphing features that allow the end-user to present the required data in chart and graph formats. These formats can be easily integrated into MS Word and Power Point presentations.  Operational systems of a business contain the day to day transactions of the data at a low-level. For example, the sales data, HR data, marketing data are used as input sources for a data warehouse.  Legacy systems are the applications of the yesteryear. They mirror the requirements of a business that might be twenty to twenty five year old. They are use till date since over years these systems have captured the business knowledge and rules that are exceptionally difficult to translate to a new platform/application. Staging Area  The first part of the staging area is the most challenging process of extraction. Depending on how accurately the data is extracted the subsequent operations succeed or fail. The source systems might be
complicated or poorly documented due to which the process becomes all the more difficult. The data may be extracted not only once but also periodically when changes occur at the source side.  The second stage is the transformation where the data is converted from one format to another. Since data often exists in different locations and formats across the enterprises, data conversion is mandatory to ensure that data from one application is comprehensible to other applications and databases.  The third stage is the loading where the extracted and transformed data is loaded into a data mart or a data warehouse depending on the business. The populated data is used for presentation applications by the end users. Data Repository The data is loaded into a data warehouse in the form of facts and dimensions Users The loaded data is accessed for reporting, analysis, and mining. The reporting tools like Business Objects and Cognos are used by users to generate reports. The data is also used for predicting trends What is the difference between OLTP and a Data Warehouse? OLTP  Online Transactional Processing constitutes of a large number of brief on-line transactions. It focuses on quicker query processing  The database queries are generally uncomplicated, need sub-second responses and return comparatively fewer records For example, when we withdraw cash from an ATM we get a mini statement issuing the withdrawal amount, balance amount, data of withdrawal and transaction number. This transaction falls under an OLTP. The mini statement consist of fewer records and takes sub-seconds to process. On the other hand, when we need a bank statement for a year we request the bank for an account statement. This statement falls under a data warehouse operation. We can view transactions occurred over a year with the dates and the updated amounts over each transaction and the closing balance. OLTP OLAP Origin of data Transactional data Data combined from several sources Objective Used for basis business tasks Used for predicting future trends and analysis Inserts and Update Operations Data is inserted and updated regularly by the end users Data is refreshed by longrunning batch jobs Nature of queries Simple queries with lesser number of records Complex queries required Processing speed Very fast Takes a longer time Evolution of Data Warehousing Data warehouse architecture is being influenced by business practices and goals that continue to evolve, notes Russom. The reason: a well-aligned data warehouse reflects the business it serves. Here are some of the main business drivers of today’s evolving data warehouse architectures, according to Russom:  Reporting is increasingly important to business operations: Reports are the primary way businesses distribute and consume information as well as monitor operations every day. That means businesses must protect their traditional data warehouse architectures, which are mainly aimed at providing data for reports as well as management dashboards, performance management, and online analytic processing.  Discovery analytics is to new business development what reporting is to established operations: Discovery-oriented analytics help companies discover new facts, trends, patterns, outliers, etc. about their business, customers, partners, and the competitive landscape. That is why businesses are
extending their average report-oriented data warehouse environments by adding new standalone data platforms that better help with discovery analytics—such as columnar databases, data appliances, NoSQL databases, and Hadoop.  Big Data isn’t about the ―bigness,‖ but rather about business analytics: The ideal way to get business value out of Big Data is through analytics. Therefore, satisfying data requirements of business analytics (either with Big Data or traditional enterprise data) is the ―leading driver for change in data warehouse architectures today.‖  Because each department has different requirements, they usually build their own ―shadow programs‖ for BI and analytics: To prevent the systems in each department from becoming data silos, data warehouse architectures are becoming more federated. Several databases appear to function as a single entity and all the data from multiple sources is presented as if it were stored in one place. This enables the architectural plan to extend across different systems in different departments.  Increasingly, businesses need access to real-time data: The leading edge now is event processing. Instead of storing data to find out what happened or what could have been, businesses need to act on events as they occur. Event processing allows businesses to proact instead of react to risk as well as create opportunities, not chase them. Although traditional data warehouse architectures are designed for ―data-at-rest,‖ real-time capabilities for ―data-in-motion‖ can retrofit into the architecture. Data Warehousing concepts What is Data Warehousing? Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidations. Using Data Warehouse Information There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives to use the warehouse quickly and effectively. They can gather data, analyze it, and take decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains:  Tuning Production Strategies - The product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing the sales quarterly or yearly.  Customer Analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles, etc.  Operations Analysis - Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations. Integrating Heterogeneous Databases To integrate heterogeneous databases, we have two approaches:  Query-driven Approach  Update-driven Approach Query-Driven Approach This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators. Process of Query-Driven Approach  When a query is issued to a client side, a metadata dictionary translates the query into an appropriate form for individual heterogeneous sites involved.  Now these queries are mapped and sent to the local query processor.  The results from heterogeneous sites are integrated into a global answer set. Disadvantages  Query-driven approach needs complex integration and filtering processes.  This approach is very inefficient.  It is very expensive for frequent queries.  This approach is also very expensive for queries that require aggregations. Update-Driven Approach

Lecture Notes