
Note for Information Retrieval - IR By JNTU Heroes

  • Information Retrieval System - IR
  • Note
  • Jawaharlal Nehru Technological University Anantapur (JNTU) College of Engineering (CEP), Pulivendula, Pulivendula, Andhra Pradesh, India - JNTUACEP
  • 7 Topics
  • 221 Offline Downloads
  • Uploaded 1 year ago


INFORMATION RETRIEVAL SYSTEM

UNIT-1: Retrieval Strategies

INTRODUCTION:

  • An Information Retrieval System is a system capable of storing, maintaining, and retrieving information. The information may be in any form: audio, video, or text.
  • An Information Retrieval System focuses mainly on the electronic searching and retrieval of documents.
  • Information Retrieval is the activity of obtaining the documents relevant to a user's need from a document collection.

The figure shows a basic information retrieval system.

  • A static, or relatively static, document collection is indexed prior to any user query.
  • A query is issued, and the documents deemed relevant to the query are ranked by their computed similarity to the query and presented to the user.
  • Information Retrieval (IR) is devoted to finding relevant documents, not finding simple matches to patterns.
  • A related problem is document routing, or filtering. Here the queries are static and the document collection constantly changes. One example is an environment where corporate e-mail is routed to different parts of the organization based on predefined queries (e.g., e-mail about sales is routed to the sales department, marketing e-mail goes to marketing, etc.).

The figure illustrates document routing.
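The routing idea (static queries, changing documents) can be sketched in a few lines. This is only an illustration under stated assumptions: the department names, the keyword sets in `STANDING_QUERIES`, and the simple term-overlap score are all made up here and are not taken from the notes; a real router would use a proper similarity measure.

```python
# Hypothetical standing queries: one fixed set of terms per department.
STANDING_QUERIES = {
    "sales": {"order", "quote", "customer", "invoice"},
    "marketing": {"campaign", "brand", "advertising", "launch"},
}


def route(message, standing_queries):
    """Return the department whose standing query shares the most terms
    with the incoming message, or None if no query matches at all."""
    words = set(message.lower().split())
    # Score each static query against the new document (term overlap).
    scores = {dept: len(words & terms) for dept, terms in standing_queries.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# The queries stay fixed while new messages keep arriving.
print(route("Customer asked for a quote on a bulk order", STANDING_QUERIES))
print(route("Draft of the spring advertising campaign", STANDING_QUERIES))
print(route("Lunch at noon?", STANDING_QUERIES))
```

Note the reversal of roles compared with ordinary retrieval: here each arriving document is scored against every stored query, instead of one query being scored against every stored document.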


Fig: Document routing algorithms

PRECISION AND RECALL:

The figure illustrates the critical document categories that correspond to any issued query. Namely, in the collection there are documents that are retrieved and documents that are relevant. In a perfect system these two sets would be identical: we would retrieve all of the relevant documents and nothing else. In reality, systems retrieve many non-relevant documents.

To measure effectiveness, two ratios are used: precision and recall.

Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Precision indicates the quality of the answer set, but it does not consider the total number of relevant documents. A system might achieve good precision by retrieving ten documents of which nine are relevant (a precision of 0.9), but the total number of relevant documents also matters. If there were only nine relevant documents in the collection, the system would be a huge success. However, if millions of documents were relevant and desired, this would not be a good result set.

Recall considers the total number of relevant documents: it is the ratio of the number of relevant documents retrieved to the total number of documents in the collection that are believed to be relevant. Computing the total number of relevant documents is non-trivial.
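The two ratios can be checked with a small sketch. The function name `precision_recall` and the document identifiers are hypothetical, and the second scenario uses 1,000 relevant documents as a small stand-in for the "millions" case described above.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Ten documents retrieved; d1..d9 are relevant, d10 is not.
retrieved = [f"d{i}" for i in range(1, 11)]

# Scenario 1: only nine relevant documents exist in the collection.
relevant_small = {f"d{i}" for i in range(1, 10)}
print(precision_recall(retrieved, relevant_small))  # (0.9, 1.0)

# Scenario 2: the same nine, plus 991 relevant documents never retrieved.
relevant_large = relevant_small | {f"x{i}" for i in range(991)}
p, r = precision_recall(retrieved, relevant_large)
print(p, r)  # precision is unchanged, recall drops to 9/1000
```

The same answer set scores 0.9 precision in both scenarios; only recall reveals that the second result set misses almost everything the user wanted.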


Fig: PRECISION AND RECALL

1. RETRIEVAL STRATEGIES:

Retrieval strategies assign a measure of similarity between a query and a document. These strategies are based on the common notion that the more often terms are found in both the document and the query, the more "relevant" the document is deemed to be to the query. Some of these strategies employ countermeasures to alleviate problems caused by the ambiguities inherent in language: the same concept can often be described with many different terms.

A retrieval strategy is an algorithm that takes a query Q and a set of documents D1, D2, ..., Dn and identifies the Similarity Coefficient SC(Q, Di) for each of the documents, 1 ≤ i ≤ n.

The retrieval strategies identified are:

1.1 Vector Space Model

Both the query and each document are represented as vectors in the term space, and a measure of the similarity between the two vectors is computed.

The model is based on the idea that, in some rough sense, the meaning of a document is conveyed by the words used. If one can represent the words in the document by a vector, it is possible to compare documents with queries to determine how similar their content is. If a query is treated like a document, a similarity coefficient (SC) that measures the similarity between a document and a query can be computed. Documents whose content, as measured by the terms in the document, corresponds most closely to the content of the query are judged to be the most relevant.

The figure illustrates the basic notion of the vector space model, showing vectors that represent a query and three documents.
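A minimal sketch of a similarity coefficient SC(Q, Di) in the vector space model follows. The notes do not say which similarity measure or term weights are used, so two assumptions are made here: vector components are raw term counts, and the similarity is the cosine of the angle between the two vectors. The three toy documents and the helper names (`term_vector`, `cosine`, `rank`) are invented for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(q, d):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * d[term] for term, weight in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank(query, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)


docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}

for score, name in rank("gold silver truck", docs):
    print(name, round(score, 3))
```

With these toy documents, D2 ranks first because it shares two query terms and repeats one of them; D3 shares two terms and D1 only one. The query is handled exactly like a document, which is the point the model relies on.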
