
Note for Theme Detection - TD By Placement Factory



direct subjective frames and expressive subjective elements. To distinguish opinion-oriented material from factual material, objective speech event frames have also been defined.

Direct Subjective Frame: A private state containing a direct subjective element is called a direct subjective frame. For example, in the sentence “The U.S. fears a spill-over,” said Xirao-Nima, the word ‘fears’ represents a private state and is annotated as a ‘Direct Subjective Frame’.

Expressive Subjective Element: A private state containing no direct opinion but only subjective references to opinion is called an expressive subjective element frame. For example, in the sentence “The report is full of absurdities,” Xirao-Nima said, the phrase “full of absurdities” represents a private state and is annotated as an ‘Expressive Subjective Element’.

Objective Speech Event: This is the purely factual part of any event. For example, the following sentence does not carry any opinionated information, only a description of a fact or event: O’Leary said “the incident took place at 2:00pm.”

During the subjective tagging of MPQA sentences, the hypothesis is that if a sentence has any Direct or Expressive Subjective phrase then the sentence itself is subjective. Hence, sentences containing either of the two private states (i.e. Direct Subjective and Expressive Subjective) are extracted as subjective sentences, and sentences containing only objective speech events or no annotated private states are discarded. The MPQA annotation scheme also records attributes of an attitude frame, such as intensity, but these attributes are not considered for the present task.

2.2. Subjectivity-Annotation in Bengali

From the collected document set (Letters to the Editor section), some documents have been chosen for the annotation task. Documents that appeared within an interval of four months are chosen on the hypothesis that these letters to the editor will be on related events.
A simple annotation tool has been designed for annotating the subjective sentences. Three annotators participated in the present task. The documents with the annotated sentences are saved in XML format. The tool also highlights sentiment words (from SentiWordNet, see Section 3.2.5 for details) in four different colors within a document according to their POS categories (Noun, Adjective, Adverb and Verb). This technique helps to speed up the annotation process. Finally, 100 annotated documents have been produced. No inter-annotator agreement has been calculated for the present task. Some statistics about the Bengali news corpus are presented in Table 1.

Total number of documents in the corpus             100
Total number of sentences in the corpus             2234
Average number of sentences in a document           22
Total number of wordforms in the corpus             28807
Average number of wordforms in a document           288
Total number of distinct wordforms in the corpus    17176

Table 1. Bengali News Corpus Statistics

3. Theme Detection

Theme Detection is a rule based algorithm to identify subjective sentences in text documents. The algorithm takes the subjectivity decision on the basis of several features of the sentences in the text. These features are obtained using various machine learning algorithms. Various linguistic resources that are used to derive several binary features have been developed manually.

The theme detection process works in two stages. It first captures the discourse level opinion theme in terms of thematic expressions, which best describe the opinionated theme of a document. In the next stage the algorithm examines the presence of a thematic expression as an opinion constituent (Subject-Aspect-Evaluation) in any sentence. Subjectivity detection by lexicons like SentiWordNet or a subjectivity word list has been explored by other researchers. The Theme Detection technique, in contrast, works at the discourse level and takes the syntactic structure of sentences into account.
The challenge is to identify the most concise feature set and to construct effective rules for the two-stage identification problem. Experiments are carried out with an initial list of features, and some of the features are finally discarded as they are found not to contribute to system performance. The Theme Detection technique has been applied to both English and Bengali texts. The motivation for the technique is presented in Section 3.1. The various subjectivity clues or features, and how they can be obtained, are discussed in Section 3.2. The evaluation results presented in Section 6 show the effectiveness of the algorithm.

3.1. Motivation

Many supervised and unsupervised techniques have been explored for the subjectivity annotation task by many researchers over a long period of time. Several linguistic resources and tools such as dependency parsers, Named Entity Recognizers, morphological analyzers, stemmers, SentiWordNet and WordNet have been used repeatedly in the subjectivity detection task. But for morphologically rich Indian languages like Bengali, such resources and tools are not readily available. Inspired by Janyce Wiebe et al., 2005 [28], the present work is initiated to develop a subjectivity classifier that works on un-annotated text. Our aim is to design an automatic process that learns linguistically rich extraction patterns for subjective (opinionated) expressions and produces rich ontological language-specific (rather than domain dependent) knowledge. Subjective remarks come in a variety of forms, including opinions, rants, allegations, accusations, suspicions, humor and speculations.

3.2. Learning Subjective Clues

Existing methods for opinion extraction tend to rely on relatively simple proximity-based or pattern-based techniques. However, pattern-based techniques are not enough to extract opinions, because the patterns apply only to cases where all constituents of an opinion appear in a sentence. It has been observed that most opinion constituents do not have a direct syntactic dependency relation within a sentence, mostly due to elliptical arguments. Based on a corpus study, it is proposed to define an opinion unit as a quadruple, i.e., the opinion holder, the subject being evaluated (Subject), the part or the attribute in which it is evaluated (Aspect), and the evaluation that expresses a positive or negative assessment (Evaluation). The present subjectivity detection algorithm has been applied to a News corpus (both English and Bengali), where the name of the author of an article is rarely mentioned. Hence, the Opinion Holder information is not taken into consideration and only the Subject-Aspect-Evaluation constituents are identified for analysis. These constituents can be further defined as:

Subject: A named entity (Person, Location etc.) of a particular class of interest (e.g. a leader name or the name of a location where an incident occurred).

Aspect: An attribute of the subject with respect to which the evaluation is made (size, color, date etc.). The aspect can define a characteristic of the subject or an integral part of the subject.

Evaluation: An evaluative or subjective phrase used to express an evaluation or the opinion holder's mental/emotional attitude (good, poor, powerful, stylish etc.).
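The opinion unit defined above can be pictured as a small record type. The following is only an illustrative sketch: the class name `OpinionUnit`, its field names and the `missing` helper are assumptions introduced here, not part of the described system, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OpinionUnit:
    """One opinion unit; the names are illustrative, not from the source system."""
    subject: Optional[str]        # named entity being evaluated, None if elided
    aspect: Optional[str]         # attribute of the subject, None if elided
    evaluation: Optional[str]     # evaluative phrase, None if absent
    holder: Optional[str] = None  # opinion holder; ignored for news text

    def missing(self) -> List[str]:
        """Names of the three analysed constituents that are absent."""
        parts = {
            "subject": self.subject,
            "aspect": self.aspect,
            "evaluation": self.evaluation,
        }
        return [name for name, value in parts.items() if value is None]


# Hypothetical units: one complete, one with an elided aspect.
complete = OpinionUnit(subject="phone", aspect="size", evaluation="stylish")
elided = OpinionUnit(subject="leader", aspect=None, evaluation="powerful")
print(complete.missing(), elided.missing())
```

Keeping the holder as an optional field mirrors the text: the quadruple is the general definition, while only three constituents are analysed for news articles.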
Initially, a detailed analysis of the English MPQA and Bengali newspaper corpora has been done to identify the most concise and effective features for the identification of opinion (i.e., opinion constituents) and their characteristics in the corpus. In order to identify features we started with Part Of Speech (POS) categories and continued the exploration with other features like chunk, functional word, ontology list, SentiWordNet, stemming cluster, frequency, positional aspect (e.g. title, first paragraph, last two sentences, critical issues) and average distribution. Each of the features and the methods for their identification are discussed below.

3.2.1 Part Of Speech (POS)

Hatzivassiloglou et al., 2000 [29], Chesley et al., 2006 [30] and others have shown that the words carrying opinion in sentences are mainly adjectives, adverbs, nouns and verbs. Many opinion mining tasks, for example the one presented in [31], are mostly based on adjectives. This suggests that the adjective POS tag is more important for subjectivity detection than the others. The POS categories identified as carrying opinion information are Adjective, Adverb, Verb and Noun, while the opinion information for words of other POS categories is difficult to generalize. Thus, we concentrated on identifying the POS categories of the words, especially to see whether they are adjectives, adverbs, nouns or verbs.

The Stanford Parser¹ has been used for English text to get the word level POS category. The overall accuracy of this tool as reported in Klein et al., 2003 [32] is 86.7%. The POS tagging engine for Bengali text has been developed using statistical Conditional Random Fields (CRF)² [33]. The system makes use of the contextual information of the words along with a variety of features that are helpful in predicting the various part of speech (POS) classes.
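The restriction to the four opinion-bearing POS categories can be sketched as a simple filter over tagged tokens. This sketch assumes Penn Treebank style tags for the English text (the tag family names `JJ`, `RB`, `NN`, `VB`); the function names and the sample tagging are hypothetical and are not taken from the described system.

```python
from typing import List, Optional, Tuple

# Assumed mapping from Penn Treebank tag prefixes to the four categories.
PREFIX_TO_CATEGORY = {
    "JJ": "Adjective",
    "RB": "Adverb",
    "NN": "Noun",
    "VB": "Verb",
}


def opinion_category(tag: str) -> Optional[str]:
    """Return the opinion-bearing category of a tag, or None for other tags."""
    for prefix, category in PREFIX_TO_CATEGORY.items():
        if tag.startswith(prefix):
            return category
    return None


def opinion_candidates(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only words whose tag falls in one of the four categories."""
    kept = []
    for word, tag in tagged:
        category = opinion_category(tag)
        if category is not None:
            kept.append((word, category))
    return kept


# Hand-tagged sample sentence (the tagging is an assumption for illustration).
sample = [("The", "DT"), ("report", "NN"), ("is", "VBZ"),
          ("full", "JJ"), ("of", "IN"), ("absurdities", "NNS")]
candidates = opinion_candidates(sample)
print(candidates)
```

Note that a word such as "is" survives this filter as a verb; in the described pipeline such high frequency words are handled by the separate function word step.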
The training set for the Bengali POS tagger consists of 200K words and has been manually annotated with a POS tag-set³ of twenty-one tags developed by the International Institute of Information Technology, Hyderabad (IIIT-H). The system produces output in Shakti Standard Format⁴ (SSF), also developed by IIIT-H for Indian languages. Experimental test results show the effectiveness of the CRF based POS tagging system, with an overall average accuracy of 87.23%. Feature selection plays a crucial role in the CRF framework. Experiments were carried out to find the most suitable features for POS tagging in Bengali. The experimental results are shown in Table 2.

Training-Set    Test-Set    Accuracy
16397           4587        87.23%

Table 2. Experimental Result of POS Tagging.

3.2.2 Chunk

Identification of subjective features depends on the opinion constituents. A detailed empirical study reveals that Subject, Aspect or Evaluation expressions may be defined in terms of chunk tags.

• Subject phrases are generally noun phrases with noun-noun, adjective-noun or noun-other combinations. The nouns in the subject phrases are generally named entities, low frequency nouns or out-of-vocabulary words. In case of multiple noun phrases in a sentence, the head noun phrase is treated as the Subject phrase. Some empirical rules have been defined to identify the head noun phrase in a sentence: the distance of each NP (Noun Phrase) from the VP (Verb Phrase) is calculated in terms of characters. The NP which is situated at the farthest distance from the VP is selected as the head NP. In case of multiple VPs, only the phrases tagged as Verb Finite Phrases are considered.

¹ http://nlp.stanford.edu/software/lex-parser.shtml#Download
² http://crfpp.sourceforge.net
³ http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
⁴ http://ltrc.iiit.net/MachineTrans/research/tb/shakti-analy-ssf.pdf
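The head noun phrase rule for Subject phrases can be sketched as follows. Several details are assumptions made for this illustration only: chunks carry character offsets, the finite verb chunk is labelled `VGF`, distance is measured as the character gap between chunk spans, and with several finite verb chunks the nearest one is used. The text does not specify these points.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Chunk:
    label: str   # chunk tag, e.g. "NP" or "VGF" (assumed label for finite verbs)
    start: int   # character offset of the first character
    end: int     # character offset one past the last character
    text: str


def char_gap(a: Chunk, b: Chunk) -> int:
    """Number of characters lying between two non-overlapping chunks."""
    return max(a.start, b.start) - min(a.end, b.end)


def head_np(chunks: List[Chunk], finite_label: str = "VGF") -> Optional[Chunk]:
    """Pick the NP farthest (in characters) from the finite verb chunk(s)."""
    verbs = [c for c in chunks if c.label == finite_label]
    nps = [c for c in chunks if c.label == "NP"]
    if not verbs or not nps:
        return None
    # Assumption: distance to the nearest finite verb chunk is what counts.
    return max(nps, key=lambda np: min(char_gap(np, v) for v in verbs))


# Hypothetical chunking of "The old president quickly praised NASA".
chunks = [
    Chunk("NP", 0, 17, "The old president"),
    Chunk("RBP", 18, 25, "quickly"),
    Chunk("VGF", 26, 33, "praised"),
    Chunk("NP", 34, 38, "NASA"),
]
subject = head_np(chunks)
print(subject.text if subject else None)
```

Returning `None` when no finite verb chunk is present leaves room for the separate rules the text mentions for sentences without a finite verb.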


• Aspect phrases are an attribute of the Subject. But in the News domain an aspect is not only a sub-part of the Subject but can also be a conceptual sub-event or an actor's (i.e. the Subject's) activity in a particular event. More specifically, the aspects are factual information, which should be discarded by the subjectivity detector. As an example:

1) Saddam hanged up yesterday night.
2) USA president says “He deserved it”.
3) USA people’s general opinion this is a form of real cruelty.

In sentence 2, the aspect, Saddam Hussain’s death, is missing; it is an event reported in sentence 1. In sentence 3 the subject is the people of the USA, but the aspect is again an event, namely the president’s statement, which is a subsequent event of the main event (Saddam Hussain’s death), and it is missing as well. Aspects are either missing or present as elliptical arguments in the News corpus. In the other cases, when the aspect is present in a sentence, it has been observed that it is generally a noun phrase with some sentiment words. A small set of rules is defined for aspect identification in a sentence.

• Evaluation phrases are used to express an evaluation or the opinion holder's mental/emotional attitude. Detailed analysis of both the English and Bengali corpora has shown that Evaluation phrases are generally verb phrases with some evaluative sentiment phrases (adjectival or adverbial). When a sentence has no finite verb phrase (sentence 3 in the preceding example), the evaluative phrases are generally adjective, adverbial or noun phrases along with evaluative sentiment words (obtained from SentiWordNet). Rules have been defined accordingly to identify those phrases.

The common feature among the three constituents of opinion is that they could all be noun phrases. Positional information and language features have been used to disambiguate among the three constituents. The Stanford Parser was used for the English chunking task.
The parser produces a parse tree as output, which is converted into SSF for further processing. The chunker for Bengali texts is trained on feature templates for predicting the chunk boundary tags using CRF. The accuracy of the chunker is shown in Table 3.

Training-Set    Test-Set    Accuracy
16397           4587        79.51%

Table 3. Experimental Result of Chunker.

3.2.3 Functional Word

Function words in a language are high frequency words and generally do not help to identify subjectivity; hence these words are dropped by the system in the first stage. But function words often help to understand the syntactic pattern of an opinionated sentence, hence some rules are constructed based on functional words at the POS and chunk level instead of the word itself. An example illustrates the situation.

4) President and Army chief both congratulated NASA on the success of their expedition.

In sentence 4 the subject is “President and Army chief” and the clue is the function word “and”. Rules are defined to find two or more consecutive NPs (Noun Phrases) connected by a conjunct.

A list of 253 entries is collected from the Bengali corpus. First a unique high frequency word list is generated, where the assumed threshold frequency is 20. The list is then manually corrected, keeping in mind that a function word should not carry any opinion or sentiment feature. For English, the functional word list is collected from a website⁵. The English functional word list has 300 entries.

3.2.4 Ontology List

Four different ontology lists corresponding to the four POS categories (Noun, Verb, Adjective and Adverb) have been prepared by selecting the top 100 words in each category from POS tagged English and Bengali texts. During the second stage of subjectivity detection, i.e., Theme Sentence Identification, words with Noun POS categories were separated into Named Entities and Common nouns.
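The construction of the four ontology lists can be sketched with a frequency counter per category. This is a minimal sketch under one stated assumption: "top 100" is read as the 100 most frequent words per category, which the text does not say explicitly. The function name and the toy input are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

CATEGORIES = ("Noun", "Verb", "Adjective", "Adverb")


def build_ontology_lists(
    tagged_words: Iterable[Tuple[str, str]], top_n: int = 100
) -> Dict[str, List[str]]:
    """Return, for each category, its top_n most frequent words (assumed ranking)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, category in tagged_words:
        if category in CATEGORIES:
            counts[category][word.lower()] += 1
    return {
        category: [word for word, _ in counts[category].most_common(top_n)]
        for category in CATEGORIES
    }


# Toy input of (word, category) pairs; a real run would use the tagged corpus.
toy = [("good", "Adjective"), ("good", "Adjective"), ("poor", "Adjective"),
       ("leader", "Noun"), ("the", "Other")]
lists = build_ontology_lists(toy, top_n=1)
print(lists)
```

The same counting step, with a minimum frequency instead of a top-N cut, would give the candidate function word list described in Section 3.2.3 before manual correction.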
3.2.5 SentiWordNet in Bengali

Words that are present in SentiWordNet carry opinion information. SentiWordNet [34] is an automatically constructed lexical resource for English which assigns a positivity score, a negativity score and a neutrality score to each WordNet synset. Release 1.1 of SentiWordNet for English was obtained from the authors. SentiWordNet Release 1.1 consists of 115,341 words marked with positive and negative orientation scores ranging from 0 to 1. A subset of 8,427 opinionated words was extracted from SentiWordNet by selecting those whose orientation strength is above a threshold of 0.4.

As there was no such SentiWordNet for Bengali, a task has been initiated to develop a similar resource for Bengali. For this task, Samsad⁶, a widely used English-Bengali dictionary available in both offline and online versions, is selected. The Samsad English-Bengali dictionary has approximately 102,119 entries. A simple word-to-word lexical-transfer technique is applied to each entry of SentiWordNet. Each dictionary search produces a set of Bengali words for a particular English word. Instead of combining them into one entry, we separate them into multiple one-word entries to make the subsequent search process faster. The positive and negative opinion scores for the Bengali words are copied from their English equivalents. This process has resulted in 20,789 Bengali entries, which is a useful resource for the Bengali opinion mining task.

The words in the POS tagged English and Bengali texts with the POS tags Noun, Verb, Adjective and Adverb were looked up in the SentiWordNet for the respective language. Words for which the POS category obtained from the POS tagger does not match the category in SentiWordNet are discarded. The remaining words are considered important for subjectivity detection.

⁵ http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
⁶ http://dsal.uchicago.edu/dictionaries/biswas_bengali/

3.2.6 Stemming Cluster

Several words that carry opinion information may be present in a sentence in their inflected forms. Stemming is necessary for such inflected words before they can be searched in the appropriate lists. Due to the non-availability of good stemmers for Indian languages, especially Bengali, a stemmer based on a stemming cluster technique has been developed. This feature analyzes prefixes and suffixes of all the word forms present in a particular document. Words that are identified as having the same root form are grouped into a finite number of clusters, with the identified root word as cluster center. Here a prefix/suffix is a sequence of the first/last few characters of a word, which may not be linguistically meaningful. The use of prefix/suffix information works well for highly inflected languages like the Indian languages.

Experiments are carried out with two types of algorithms: a simple suffix stripping algorithm and a score based stemming cluster identification algorithm. A small list of 205 suffixes for Bengali has been manually generated. The suffix stripping algorithm simply checks whether a word has any suffixes (one or more) from the list, and the word is then assigned to the appropriate cluster, whose center is the assumed root word, i.e., the form obtained after deleting the suffix from the surface form. The suffix stripping algorithm works well for the Noun, Adjective and Adverb categories. In the case of Bengali verbs, the root form of the word changes when suffixes are added. Hence simple suffix stripping does not work well for Bengali verbs. The score based stemming technique has been designed to resolve the stem for inflected verbs.
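The simple suffix stripping step can be sketched as below. This is an illustration under stated assumptions rather than the described implementation: only the single longest matching suffix is removed (the text also allows more than one suffix per word), the two suffixes shown are the noun suffixes listed in Table 4 rather than the full list of 205, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def strip_suffix(word: str, suffixes: Iterable[str]) -> str:
    """Return the assumed root: the word minus its longest matching suffix."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Never strip the whole word; a root must keep at least one character.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def suffix_clusters(words: Iterable[str], suffixes: List[str]) -> Dict[str, Set[str]]:
    """Group surface forms under the root obtained by suffix stripping."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        clusters[strip_suffix(word, suffixes)].add(word)
    return dict(clusters)


# Noun forms and suffixes as listed in Table 4.
noun_forms = ["ভারত", "ভারতে", "ভারতের"]
noun_suffixes = ["ে", "ের"]
clusters = suffix_clusters(noun_forms, noun_suffixes)
print(clusters)
```

Trying the longest suffix first keeps "ের" from being shadowed by the shorter "ে"; the cluster key plays the role of the cluster center described in the text.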
The score based technique uses the Minimum Edit Distance method [35], well known from spelling error detection, to measure the cost of assigning a word to a particular class. It considers two standard operations of Minimum Edit Distance, i.e., insertion and deletion. The range considered for insertion and deletion in the present task is a maximum of three characters. The idea is that the present word matches an existing cluster center after insertion and/or deletion of at most three characters. The present word is assigned to the cluster that can be reached with the minimum number of insertions and/or deletions. This is an iterative clustering mechanism for assigning each word to a cluster. The system iterates 6 times, i.e. it starts from -3 (deletion of three characters) and ends with +3 (insertion of three characters), and finally generates a finite number of stemming clusters. A separate list of verb inflections (only 50 entries) has been maintained to validate the result of the score based technique. The standard K-means clustering technique has been used here. Each cluster center is treated as a root stem. For English, the standard Porter Stemmer⁷ algorithm has been used.

Type         Root               Surface Form                    Suffixes
Noun         ভারত               ভারতে, ভারতের                   ে, ের
Adjective    অমানব, দুর্ভাগ্য      অমানবিক, দুর্ভাগ্যবশত            িক, বশত
Adverb       ভারী, দূর, দূর       ভারিক্কি, দূরীভূত, দূরীকৃত        িক্কি, ীভূত, ীকৃত
Verb         খা                 খাচ্ছেন, খেয়েছিলেন               চ্ছেন, য়েছিলেন

Table 4. POS Category wise Variations.

3.2.7 Frequency

Frequency plays a crucial role in identifying the importance of a word in the document. After function word removal and POS annotation, the system generates four separate high frequency word lists for the four POS categories: Adjective, Adverb, Verb and Noun. The Theme Expression identification module then starts to recognize the most important Theme Expressions from the four categorical lists.
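The distance used by the score based stemming technique of Section 3.2.6 can be sketched as an edit distance restricted to insertions and deletions, followed by a nearest-center lookup. The sketch makes two assumptions that the text leaves open: the limit of three characters is applied to the total number of operations, and a word with no center within that limit is left unassigned. It does not reproduce the K-means step, and the English words are purely illustrative.

```python
from typing import Iterable, Optional


def indel_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions and deletions
    needed to turn source into target (no substitutions)."""
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nearest_center(word: str, centers: Iterable[str], max_ops: int = 3) -> Optional[str]:
    """Return the center reachable with the fewest operations, if within max_ops."""
    best, best_cost = None, max_ops + 1
    for center in centers:
        cost = indel_distance(word, center)
        if cost < best_cost:
            best, best_cost = center, cost
    return best


# Illustrative English forms; the described system applies this to Bengali verbs.
centers = ["walk", "talk"]
print(nearest_center("walked", centers), nearest_center("zebra", centers))
```

Because substitutions are excluded, replacing one character costs two operations (one deletion plus one insertion), which is why "walked" is four operations away from "talk" but only two from "walk".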
The four frequency lists contain both single word and multiword entities (the latter identified through chunk level information). The system makes several iterations to calculate the presence of each Theme Expression in the opinion constituents (Subject-Aspect-Evaluation) of a sentence and updates the associated score of the theme sentence. The system then proceeds to examine the valence of every sentence based on the presence of Thematic Expressions along with many other features and the rule set.

3.2.8 Positional Aspect

Depending upon the position of the subjectivity clue, every document is divided into a number of zones. The dependency factors of this feature are the title of the document, the first paragraph and the last two sentences. A detailed study was done on the MPQA and Bengali corpora to identify the role of the positional aspect in detecting the subjectivity of a sentence, and the results are shown in Table 5.

3.2.8.1 Title of the document

It has been observed that the title of a document always carries some meaningful subjective information. Thus a thematic expression containing title words (words that are present in the title of the document) always gets a higher score. The sentences that contain these thematic expressions also get higher scores.

3.2.8.2 First Paragraph

It has been observed that people generally give a brief idea of their beliefs and speculations in the first paragraph of the document and subsequently elaborate

⁷ http://tartarus.org/~martin/PorterStemmer/
