Subject: Data and Web Mining
Semester: 8th
Subject Code: PECS5409
Applicable Branch(es): CSE & IT
Question Code: F380
Year of Examination: 2014
Exam Type (Special/Regular): Regular

Solution Prepared by
Sl. No.   Faculty Name    Email ID
1         Rashmita Jena   firstname.lastname@example.org

NATIONAL INSTITUTE OF SCIENCE & TECHNOLOGY
PALUR HILLS, BERHAMPUR, ORISSA – 761008, INDIA
1. a) Spatial data carries topological and/or distance information. It is often organized by spatial indexing structures and accessed by spatial access methods. These distinct features of a spatial database create opportunities for mining information from spatial data. Spatial data mining is the process of discovering interesting, useful, non-trivial patterns from large spatial or geographic datasets. It is widely applied in geographic information systems, remote sensing, medical imaging, robot navigation, etc.

b) Some examples of practical applications of text mining techniques include:
i. Spam filtering
ii. Creating suggestions and recommendations (as Amazon does)
iii. Monitoring public opinion (for example, in blogs or on review sites)
iv. Customer service and email support
v. Automatic labeling of documents in business libraries
vi. Measuring customer preferences by analyzing qualitative interviews
vii. Fraud detection by investigating notifications of claims

A typical application of text mining is the automatic classification of mail. For example, most undesirable "junk email" can be filtered out automatically based on terms or words that are unlikely to appear in legitimate messages and that instead identify undesirable mail. Such messages can then be discarded automatically. Automatic classification of electronic messages is also useful where messages need to be routed to the most appropriate department or agency. For example, email messages with complaints or petitions to a municipal authority can be routed automatically to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene content and, where it is found, returned automatically to the sender with a request to remove the offending words.

c. Web content mining is the process of extracting useful information from the contents of web documents.
Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables.

d. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅ (I being the set of all items). The sets of items (itemsets, for short) X and Y are called the antecedent (left-hand side, or LHS) and the consequent (right-hand side, or RHS) of the rule, respectively. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X∪Y) / supp(X). For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that the rule is correct for 100% of the transactions containing butter and bread (every time a customer buys butter and bread, milk is bought as well).

Be careful when reading the expression: here supp(X∪Y) means "support for occurrences of transactions where X and Y both appear", not "support for occurrences of transactions where either X or Y appears", the latter interpretation arising because set union is equivalent to logical disjunction. The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows.

e. A Bayesian classifier is based on the idea that the role of a (natural) class is to predict the values of features for members of that class. Examples are grouped in classes because they have common values for the features. Such classes are often called natural kinds. In this section, the target feature corresponds to a discrete class, which is not
necessarily binary. The idea behind a Bayesian classifier is that, if an agent knows the class, it can predict the values of the other features. If it does not know the class, Bayes' rule can be used to predict the class given (some of) the feature values. In a Bayesian classifier, the learning agent builds a probabilistic model of the features and uses that model to predict the classification of a new example.

f. Unlike conventional clustering, which identifies groups of objects, model-based clustering methods find characteristic descriptions for each group, where each group represents a concept or class. The most frequently used induction methods are decision trees and neural networks. In model-based clustering, the sample observations are assumed to arise from a distribution that is a mixture of two or more components. Each component is described by a density function and has an associated probability or "weight" in the mixture. In principle, we can adopt any probability model for the components, but typically we assume that the components are p-variate normal distributions. (This does not necessarily mean things are easy; inference is tractable, however.) Thus, the probability model for clustering will often be a mixture of multivariate normal distributions. Each component in the mixture is what we call a cluster.

g. The possible number of association rules from a dataset with d items is R = 3^d − 2^(d+1) + 1. For example, d = 6 items give R = 729 − 128 + 1 = 602 possible rules.

h. Web usage mining is the process of extracting useful information from server logs, that is, of finding out what users are looking for on the Internet. Web usage mining can be classified further depending on the kind of usage data considered.
Web Server Data: The user logs are collected by the Web server. Typical data includes IP address, page reference and access time.
Application Server Data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort.
A key feature is the ability to track various kinds of business events and log them in application server logs.
Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them, thus generating histories of these specially defined events.

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information on the World Wide Web, such as hypertext documents, have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize web-based information.

i. Data mining is the non-trivial discovery of implicit, previously unknown, and potentially useful information from data in large databases. It is therefore a core element of knowledge discovery, and the two terms are often used synonymously. The data is first integrated and cleaned so that only the relevant data is used. Data mining presents its findings in a form that is clear not only to data mining analysts but also to domain experts, who may use them to derive actionable recommendations.

Web mining describes the application of traditional data mining techniques to web resources, and has driven the further development of these techniques to take account of the specific structures of web data. The analyzed web resources comprise (1) the actual web sites, (2) the hyperlinks connecting these sites, and (3) the paths that online users take on the web to reach a particular site. Web usage mining then refers to the derivation of useful knowledge from these data inputs.
The content of the raw data for web usage mining on the one hand, and the expected knowledge to be derived from it on the other, pose a special challenge. While the input data are mostly web server logs and other primarily technically oriented data, the desired output is an understanding of user behavior in the domain of online information search, online shopping, online learning, etc.

j. Predictive modeling involves the development of a model based on existing data. The model is then used as a basis for the prediction of another variable that is relevant to the data reviewed. The term "predictive" indicates that this data mining tool can enable the user to predict some value based on what is known in the dataset. Predictive analysis may be used by marketers to determine what products customers are seeking. Based on current purchasing trends, marketers may be able to make predictions about which new products may be popular in the future.

Descriptive modeling is a data mining analysis tool used to collectively describe all of the data in a given dataset. Specifically, this approach synthesizes all of the data to provide information regarding trends, segments and clusters that are present in the information searched. Descriptive data mining analysis is commonly used in advertising. One
example of this is market segmentation, in which marketers take larger customer groups and segment them by homogeneous characteristics.

2. B. Following the original definition by Agrawal et al., the problem of association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (itemsets, for short) X and Y are called the antecedent (left-hand side, or LHS) and the consequent (right-hand side, or RHS) of the rule, respectively.

To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table. An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk.

To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).

The confidence of a rule is defined as conf(X ⇒ Y) = supp(X∪Y) / supp(X). For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that the rule is correct for 100% of the transactions containing butter and bread (every time a customer buys butter and bread, milk is bought as well).
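The support and confidence definitions above can be checked with a short Python sketch. The five transactions below are an assumption: they stand in for the example table, which is not reproduced in this text, and were chosen so that they give the figures quoted in the answer. The function names `support` and `confidence` are illustrative only.

```python
from typing import FrozenSet, Iterable, List

# Assumed stand-in for the example table: five supermarket transactions.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"milk", "bread"}),
    frozenset({"butter"}),
    frozenset({"beer"}),
    frozenset({"milk", "bread", "butter"}),
    frozenset({"bread"}),
]


def support(itemset: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Proportion of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)  # subset test
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions=TRANSACTIONS) -> float:
    """conf(X => Y) = supp(X u Y) / supp(X); undefined when supp(X) is 0."""
    x, y = frozenset(lhs), frozenset(rhs)
    return support(x | y, transactions) / support(x, transactions)


print(support({"milk", "bread", "butter"}))        # 1 of 5 transactions
print(confidence({"butter", "bread"}, {"milk"}))   # rule {butter, bread} => {milk}
```

Note that `support(x | y)` counts only transactions containing all items of both X and Y, so adding items to the argument can only keep the support the same or lower it. The sketch does not guard against an antecedent with zero support, which would raise a `ZeroDivisionError`.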
Be careful when reading the expression: here supp(X∪Y) means "support for occurrences of transactions where X and Y both appear", not "support for occurrences of transactions where either X or Y appears", the latter interpretation arising because set union is equivalent to logical disjunction. The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).

3. a) Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
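The tree structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method prescribed by the answer: it assumes categorical attributes, picks the test attribute by information gain (an ID3-style choice that the text does not specify), and uses a made-up four-row training set. The names `build_tree` and `classify` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or (attribute, {outcome: subtree})."""
    if len(set(labels)) == 1:              # all tuples agree: leaf node
        return labels[0]
    if not attributes:                     # no test left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def remainder(attr):
        # Weighted entropy of the partitions produced by testing `attr`;
        # the lowest remainder corresponds to the highest information gain.
        result = 0.0
        for value in set(r[attr] for r in rows):
            part = [l for r, l in zip(rows, labels) if r[attr] == value]
            result += len(part) / len(rows) * entropy(part)
        return result

    best = min(attributes, key=remainder)  # assumed selection criterion
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in set(r[best] for r in rows):  # one branch per test outcome
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in sub],
                                     [l for _, l in sub], rest)
    return (best, branches)


def classify(tree, row):
    """Follow the branches from the root node down to a leaf's class label."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical class-labeled training tuples.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

Here an internal node is a tuple of the tested attribute and a dictionary of branches, and a leaf is just the class label string. An attribute value that never occurred in training raises a `KeyError` in `classify`; a fuller implementation would need a fallback such as the majority class.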