Data Mining And Data Warehousing

by Rishabh Pathak

DATA WAREHOUSING
UNIT 1, 2, 3, 4, 5, 6
TYBSC(IT) SEM 6
COMPILED BY: SZ, JP, IN
302 PARANJPE UDYOG BHAVAN, NEAR KHANDELWAL SWEETS, NEAR THANE STATION, THANE (WEST)
PHONE NO: 8097071144 / 8097071155 / 8655081002
TYBSC-IT (SEM 6) DATA WAREHOUSING
Prof: Siddhesh Zele’s We-IT Tutorials

UNIT 1

1.1 Introduction to Data Warehouse

What Is A Data Warehouse?

A data warehouse is a database model that significantly enhances the user’s ability to quickly analyze large, multidimensional data sets. It cleanses and organizes data so that users can make business decisions based on facts. Hence, the data in the data warehouse must have strong analytical characteristics. For data to be analytical, it must be subject-oriented, integrated, time-referenced, and non-volatile.

Subject-Oriented Data

In a data warehouse environment, information used for analysis is organized around subjects: employees, accounts, sales, products, and so on. This subject-specific design helps reduce query response time, because a query has to search through far fewer records to answer the user’s question.

Integrated Data

Integrated data refers to de-duplicating information and merging it from many sources into one consistent location. When shortlisting your top 20 customers, you must know that “HAL” and “Hindustan Aeronautics Limited” are one and the same. Much of the transformation and loading work that goes into the data warehouse is centered on integrating data and standardizing it.

Time-Referenced Data

Time-referenced data is data that carries a time value. For example, the user may ask, “What were the total sales of product ‘A’ for the past three years on New Year’s Day across region ‘Y’?” Analyzing time-referenced data can also help in spotting hidden trends between different associated data elements, which may not be obvious to the naked eye. This exploration activity is termed “data mining”.

Non-Volatile Data

The non-volatility of data, a characteristic of the data warehouse, enables users to dig deep into history and arrive at specific business decisions based on facts.

Why A Data Warehouse?

The Data Access Crisis

If there is a single key to survival in the 1990s and beyond, it is being able to analyze, plan, and react to changing business conditions much more rapidly. To do this, top managers, analysts, and knowledge workers in our enterprises need more and better information.

Every day, organizations large and small create billions of bytes of data about all aspects of their business: millions of individual facts about their customers, products, operations and people. But for the most part, this data is locked up in a maze of computer systems and is exceedingly difficult to get at. This phenomenon has been described as “data in jail”.

Data Warehousing

Data warehousing is a field that has grown from the integration of a number of different technologies and experiences over the past two decades. These experiences have allowed the IT industry to identify the key problems that need to be solved.

Operational vs. Informational Systems

Operational systems, as their name implies, are the systems that support the everyday operation of the enterprise. These are the backbone systems of any enterprise, and include order entry, inventory, manufacturing, payroll and accounting. Due to their importance to the organization, operational systems were almost always the first parts of the enterprise to be computerized.

Informational systems deal with analyzing data and making decisions, often major ones, about how the enterprise will operate now and in the future. Not only do informational systems have a different focus from operational ones, they often have a different scope. Where operational data needs are normally focused on a single area, informational data needs often span a number of different areas and require large amounts of related operational data.

Framework Of The Data Warehouse

One of the reasons that data warehousing has taken such a long time to develop is that it is actually a very comprehensive technology. In fact, it can best be represented as an enterprise-wide framework for managing informational data within the organization.
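
The difference between operational and informational data needs described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `sales` table, its columns, and the sample figures are hypothetical and do not come from these notes, and an in-memory SQLite database stands in for a real system. The first function is an operational-style lookup of one transaction; the second is an informational-style question of the kind quoted earlier (sales of a product on New Year's Day, per year, for one region).

```python
import sqlite3

# Hypothetical sales table; the schema and figures are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, product TEXT, region TEXT, "
    "sale_date TEXT, amount REAL)"
)
rows = [
    (1, "A", "Y", "2021-01-01", 100.0),
    (2, "A", "Y", "2022-01-01", 150.0),
    (3, "A", "Y", "2022-06-15", 80.0),
    (4, "B", "Y", "2023-01-01", 60.0),
    (5, "A", "Y", "2023-01-01", 200.0),
    (6, "A", "Z", "2023-01-01", 90.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)


def operational_lookup(order_id):
    """Operational style: fetch one well-defined transaction by its key."""
    return conn.execute(
        "SELECT product, amount FROM sales WHERE order_id = ?", (order_id,)
    ).fetchone()


def new_year_sales(product, region):
    """Informational style: aggregate many rows across several years."""
    return conn.execute(
        "SELECT substr(sale_date, 1, 4) AS year, SUM(amount) "
        "FROM sales "
        "WHERE product = ? AND region = ? "
        "AND substr(sale_date, 6, 5) = '01-01' "
        "GROUP BY year ORDER BY year",
        (product, region),
    ).fetchall()


print(operational_lookup(3))
print(new_year_sales("A", "Y"))
```

The operational query touches a single row through its key, while the informational query scans and summarizes history, which is why the two kinds of workload are usually served by differently organized data.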

To understand how all the components involved in a data warehousing strategy are related, it is essential to have a Data Warehouse Architecture.

Data Warehouse Architecture

A Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise. The architecture is made up of a number of interconnected parts:

Source system
Source data transport layer
Data quality control and data profiling layer
Metadata management layer
Data integration layer
Data processing layer
End user reporting layer

Source System

Operational systems process data to support critical operational needs. To do this, operational databases have historically been created to provide an efficient processing structure for a relatively small number of well-defined business transactions.

Clearly, the goal of data warehousing is to free the information locked up in the operational systems and to combine it with information from other, often external, sources of data. Increasingly, large organizations are acquiring additional data from outside databases. This information includes demographic, econometric, competitive and purchasing trends. The so-called information superhighway is providing access to more data resources every day.

Source Data Transport Layer

The data transport layer of the DWA largely constitutes data trafficking. In particular, it represents the tools and processes involved in transporting data from the source systems to the enterprise warehouse system. Since the data volume is huge, the interfaces with the source systems have to be robust and scalable enough to manage secure data transmission.

Data Quality Control and Data Profiling Layer

Data quality causes the most concern in any data warehousing solution. Incomplete and inaccurate data will jeopardize the success of the data warehouse. Data warehouses do not generate their own data; rather, they rely on the input data from the various source systems.
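
As a rough illustration of what profiling source data can involve, the sketch below counts missing values and duplicate keys in a small extract and applies a simple quality gate before loading. Everything in it is an assumption made for illustration: the function names `profile` and `passes_quality_gate`, the field names, the sample rows, and the 10% threshold are hypothetical, and real profiling tools check far more than this.

```python
from collections import Counter


def profile(records, required_fields, key_field):
    """Return simple quality metrics for a list of dict records."""
    missing = Counter()
    for rec in records:
        for field in required_fields:
            value = rec.get(field)
            # Treat None and blank strings as missing values.
            if value is None or str(value).strip() == "":
                missing[field] += 1
    key_counts = Counter(rec.get(key_field) for rec in records)
    duplicate_keys = [k for k, n in key_counts.items() if n > 1]
    return {
        "total": len(records),
        "missing": dict(missing),
        "duplicate_keys": duplicate_keys,
    }


def passes_quality_gate(report, max_missing_ratio=0.1):
    """Hypothetical gate: no duplicate keys, few missing values per field."""
    if report["duplicate_keys"]:
        return False
    total = report["total"] or 1  # avoid division by zero on empty input
    return all(n / total <= max_missing_ratio
               for n in report["missing"].values())


# Hypothetical source extract used only to exercise the functions.
source_rows = [
    {"customer_id": 1, "name": "HAL", "city": "Bengaluru"},
    {"customer_id": 2, "name": "", "city": "Thane"},
    {"customer_id": 2, "name": "Acme", "city": None},
    {"customer_id": 3, "name": "Hindustan Aeronautics Limited", "city": "Bengaluru"},
]
report = profile(source_rows, ["name", "city"], "customer_id")
print(report)
print(passes_quality_gate(report))
```

A report like this gives the team something concrete to act on at the source, rather than discovering the problems after the data has reached the warehouse.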

It is essential to measure the quality of the source data and take corrective action before the information is processed and loaded into the target warehouse.

Metadata Management Layer

Metadata is the information about data within the enterprise. Record descriptions in a COBOL program are metadata. So are DIMENSION statements in a FORTRAN program, or SQL CREATE statements. The information in an ERA diagram is also metadata.

To have a fully functional warehouse, it is necessary to have a variety of metadata available, as well as facts about the end-user views of data and information about the operational databases. Ideally, end users should be able to access data from the warehouse (or from the operational databases) without having to know where it resides or the form in which it is stored.

Data Integration Layer

The data integration layer is involved in scheduling the various tasks that must be accomplished to integrate data acquired from various source systems. A lot of formatting and cleansing happens in this layer so that the data is consistent across the enterprise. This layer is heavily driven by off-the-shelf tools and consists of high-level job control for the many processes (procedures) that must occur to keep the data warehouse up to date.

Data Processing Layer

The warehouse (core) is where the dimensionally modeled data resides. In some cases, one can think of the warehouse simply as a transformed view of the operational data, but modeled for analytical purposes. This layer consists of data staging and the enterprise warehouse. Data staging often involves complex programming, but increasingly warehousing tools are being created that help in this process. Staging
