UNIT-1 INTRODUCTION TO BIG DATA

Big Data: Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Big data is also used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to deal with adequately. Big data challenges include data capture, storage, analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data sourcing.

Characteristics of Big Data: Big data is often characterized by the 3Vs: Volume, Velocity and Variety.

1) Volume: Volume is the amount of data generated that must be understood to make data-based decisions; it is the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes. Example: Amazon handles 15 million customer click-stream records per day to recommend products. Extremely large volume is a major characteristic of big data.

2) Velocity: Velocity measures how fast data is produced and modified, and the speed with which it needs to be processed. An increased number of data sources, both machine- and human-generated, drives velocity: the data must be generated and processed fast enough to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Example: 72 hours of video are uploaded to YouTube every minute; this is velocity. Extremely high velocity is another major characteristic of big data.

3) Variety: Variety refers to data coming from new sources, both inside and outside an enterprise. It can be structured, semi-structured or unstructured; variety is the type and nature of the data.
Understanding variety helps those who analyze the data to use the resulting insight effectively. Big data draws on text, images, audio and video, and completes missing pieces through data fusion. Structured data is typically found in tables with columns and rows of data. The intersection of a row and a column in a cell has a value and is given a "key" by which it can be referred to in queries. Because there is a direct relationship between the columns and the rows, these databases are commonly referred to as relational databases. A retail outlet that stores its sales data (name of person, product sold, amount) in an Excel spreadsheet or CSV file is an example of structured data.
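A retail sales record like the one just described can be read row by row with Python's built-in csv module. The sketch below is illustrative: the column names and values are invented stand-ins for the outlet's real file, but they show how each cell is reached through its column "key", mirroring the row/column structure of a relational table.

```python
import csv
import io

# A small in-memory stand-in for the retail outlet's CSV file
# (columns: person, product sold, amount) -- illustrative data only.
sales_csv = io.StringIO(
    "person,product,amount\n"
    "Alice,Pen,5.95\n"
    "Bob,Paper,8.95\n"
)

# DictReader maps each column name (the "key") to that row's cell value.
rows = list(csv.DictReader(sales_csv))

# Because the structure is fixed, aggregation is straightforward.
total = sum(float(row["amount"]) for row in rows)
print(rows[0]["product"], total)
```

Because every row follows the same fixed schema, traditional tools can query and aggregate it with no extra preparation, which is exactly why structured data is the easy case.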
Example: A Product table in a database is an example of structured data:

Product_id  Product_name  Product_price
1           Pen           $5.95
2           Paper         $8.95

Semi-structured data also has an organization, but the rigid table structure is removed so the data can be more easily read and manipulated. XML files or an RSS feed for a webpage are examples of semi-structured data. Example: XML file

<product>
  <name>Pen</name>
  <price>$7.95</price>
</product>
<product>
  <name>Paper</name>
  <price>$8.95</price>
</product>

Unstructured data: Unstructured data generally has no organizing structure, and big data technologies use different ways to add structure to it. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos etc. Example: the output returned by a 'Google Search'.

Importance of Big Data: The importance of big data lies in how you utilize the data you own. Data can be fetched from any source and analyzed to enable:
1) Cost reductions.
2) Time reductions.
3) New product development and optimized offerings.
4) Smart decision making.
5) Businesses can utilize outside intelligence while taking decisions: access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
6) Improved customer service: Traditional customer feedback systems are being replaced by new systems designed with big data technologies. In these new systems, big data and natural language processing technologies are used to read and evaluate consumer responses.
7) Early identification of risk to the product/services, if any.
8) Better operational efficiency: Big data technologies can be used to create a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of big data technologies and the data warehouse helps organizations offload infrequently accessed data.

Combining big data with high-powered analytics can have a great impact on your business strategy, for example: finding the root cause of failures, issues and defects in real-time operations; generating coupons at the point of sale based on the customer's buying habits; recalculating entire risk portfolios in just minutes; and detecting fraudulent behavior before it harms your organization.

Patterns for Big Data Development: The simplest way to describe a pattern is that it provides a proven solution to a common problem, individually documented in a consistent format and usually as part of a larger collection. The notion of a pattern is already a fundamental part of everyday life: without acknowledging it each time, we naturally use proven solutions to solve common problems each day. Patterns in the IT world that revolve around the design of automated systems are referred to as design patterns. Each of the design patterns covered in this catalog is documented in a pattern profile comprised of the following parts:

Requirement – A requirement is a concise, single-sentence statement that presents the fundamental requirement addressed by the pattern in the form of a question. Every pattern description begins with this statement.
Problem – The issue causing a problem and the effects of the problem are described in this section, which may be accompanied by a figure that further illustrates the "problem state." It is this problem for which the pattern is expected to provide a solution. Part of the problem description includes common circumstances that can lead to the problem (also known as "forces").

Solution – This represents the design solution proposed by the pattern to solve the problem and fulfill the requirement. Often the solution is a short statement that may be followed by a diagram that concisely communicates the final solution state. "How-to" details are not provided in this section but are instead located in the Application section.

Application – This part is dedicated to describing how the pattern can be applied. It can include guidelines, implementation details, and sometimes even a suggested process.

Mechanisms – This section lists common mechanisms that can be implemented to apply the pattern. Usually, some of the mechanisms will have already been referenced in the Application section. The application of the pattern is not limited to the use of these mechanisms.

Data in the Warehouse and Data in Hadoop:

1. Processing structured data: This is something your traditional database is already very good at. After all, structured data, by definition, is easy to enter, store, query and analyze. It conforms nicely to a fixed schema model of neat columns and rows that can be manipulated with Structured Query Language (SQL) to establish relationships. As such, using Hadoop to process structured data would be comparable to running simple errands with a Formula One racecar. However, with the rise of big data, many of those simple errands have become quite complex, calling for a more powerful and streamlined solution than the data warehouse can offer.
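The SQL manipulation described in point 1 can be shown in a few lines. The sketch below uses Python's built-in sqlite3 module with an in-memory database and the Product table from the earlier structured-data example; the prices are stored as plain numbers here for simplicity.

```python
import sqlite3

# In-memory relational database holding the structured Product table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product (Product_id INTEGER, Product_name TEXT, Product_price REAL)"
)
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?)",
    [(1, "Pen", 5.95), (2, "Paper", 8.95)],
)

# Because the schema is fixed, SQL can query the data directly.
rows = conn.execute(
    "SELECT Product_name, Product_price FROM Product WHERE Product_price > 6"
).fetchall()
print(rows)
conn.close()
```

This is the kind of workload a relational database or data warehouse handles effortlessly; bringing Hadoop to bear on it is the "Formula One racecar on simple errands" situation described above.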
2. Storing, managing and analyzing massive volumes of semi-structured and unstructured data: This is what Hadoop was purpose-built to do. Unlike structured data, found within the tidy confines of records, spreadsheets and files, semi-structured and unstructured data is raw and complex, and pours in from multiple sources such as emails, text documents, videos, photos, social media posts, Twitter feeds, sensors and click streams.

3. Hadoop as a Service: Hadoop as a Service provides a scalable solution to meet ever-increasing data storage and processing demands that the data warehouse can no longer handle. With its virtually unlimited scale and on-demand access to compute and storage capacity, Hadoop as a Service is well suited to big data processing. Using tools found within the Hadoop ecosystem, such as Pig, Spark, Presto and others, Hadoop as a Service helps you obtain the deeper insights often hidden in unstructured data that can propel your business forward.

4. Running constant and predictable workloads: This is what your existing data warehouse has been all about. As a solution for meeting the demands of structured data (data that can be entered, stored, queried and analyzed in a simple and straightforward manner), the data warehouse will continue to be a viable solution. But when it comes to handling massive volumes of unstructured data, the warehouse falls short.

5. Running fluctuating workloads: Meeting growing big data demands with fluctuating workloads requires a scalable infrastructure that allows servers to be provisioned as needed. That's where Qubole's cloud-based Hadoop service comes in handy. With the ability to spin virtual servers up or down on demand within minutes, Hadoop in the cloud provides the flexible scalability needed to handle fluctuating workloads.

6. Keeping costs down: This is a concern for every business in today's ultra-competitive arena, and traditional relational databases are certainly cost effective.
If you are considering adding Hadoop to your data warehouse, it's important to make sure that your company's big data demands are genuine and that the potential benefits of implementing Hadoop will outweigh the costs. While on-premise Hadoop implementations save money by combining open source software with commodity servers, a cloud-based Hadoop platform can save you even more by eliminating the expense of physical servers and warehouse space entirely. Hybrid systems, which integrate Qubole's cloud-based Hadoop with traditional relational databases, are fast gaining popularity as a cost-effective way for companies to leverage the benefits of both platforms.

7. Running large distributed workloads: Workloads that touch every file in the database are something Hadoop handles very well, but not very fast; the tradeoff with this type of processing is slower time-to-insight.

8. Shorter time-to-insight: Shorter time-to-insight necessitates interactive querying via the analysis of smaller data sets in near real time or real time, a task the data warehouse has been well equipped to handle. However, thanks to a powerful processing engine called Spark, Hadoop (and in particular Qubole's Hadoop as a Service) can handle both batch and streaming workloads at lightning-fast speeds. Spark is designed for advanced, real-time analytics and has the framework and tools to deliver when shorter time-to-insight is critical.
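The batch model behind point 7 can be sketched in miniature. Hadoop's classic introductory job is a MapReduce word count; the pure-Python sketch below imitates the map, shuffle and reduce phases in a single process, so it illustrates the programming model only, not Hadoop's distributed execution, fault tolerance or scale.

```python
from collections import defaultdict

# Stand-in for a pile of unstructured text records split across a cluster.
documents = [
    "big data is big",
    "data in hadoop",
]

# Map phase: emit a (word, 1) pair for every word in every record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all pairs that share the same key (word).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
```

In a real Hadoop job the map and reduce phases run in parallel across many machines, touching every file in the data set; that is why such workloads scale well but, as noted above, trade away speed for thoroughness.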