Lexical Approach for Sentiment Analysis in Hindi

  • Topic: Adjective, Adverb, Linguistics
  • Published : February 18, 2013
Santosh K
IIITH Hyderabad, India

Rahul Sharma
IIITH Hyderabad, India

Chiranjeev Sharma
IIITH Hyderabad, India

ABSTRACT
This paper presents a study of sentiment analysis and opinion mining on product reviews in Hindi. We experimented with several methods, focusing mainly on lexicon-based approaches. Different lexicons were applied to the same data set to analyse the significance of lexicon-based approaches.

2.1 Lexicon
Two different lexicons were used to test the efficiency of the lexicon-based approach for sentiment analysis. Each lexicon contains adjectives and adverbs and their corresponding positive and negative scores. The HSL lexicon has positive, negative and objective scores, whereas the HSWN lexicon has only positive and negative scores. The scores are the probabilities of a word being used in a positive, negative or objective (neutral) sense. For any given word in the lexicon, the sum of all the scores is 1. The total score of a word w is given by

    total score(w) = P(p) + P(n) + P(o)        (1)
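As an illustration of Equation (1), the following is a minimal sketch of how a lexicon entry could be represented and checked. The class name `LexiconEntry`, its field names and the two sample entries with their scores are assumptions made for this example; they are not taken from the HSL or HSWN files.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """Scores for one adjective or adverb (hypothetical layout)."""
    pos: float        # P(p): probability of a positive sense
    neg: float        # P(n): probability of a negative sense
    obj: float = 0.0  # P(o): probability of an objective (neutral) sense

    def total_score(self) -> float:
        # Equation (1): total score(w) = P(p) + P(n) + P(o)
        return self.pos + self.neg + self.obj

    def is_normalised(self, tol: float = 1e-9) -> bool:
        # The paper states that the scores of a word sum to 1.
        return math.isclose(self.total_score(), 1.0, abs_tol=tol)


# Invented scores for two common Hindi adjectives, for illustration only.
toy_lexicon = {
    "अच्छा": LexiconEntry(pos=0.75, neg=0.0, obj=0.25),    # "good"
    "बुरा": LexiconEntry(pos=0.0, neg=0.875, obj=0.125),   # "bad"
}

bad_entries = [w for w, e in toy_lexicon.items() if not e.is_normalised()]
```

A check of this kind is useful when loading a lexicon, since entries whose scores do not sum to 1 would distort any comparison between positive and negative totals.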

General Terms
Languages, Unsupervised

Keywords
Opinion Mining, Sentiment Analysis

1. INTRODUCTION
In view of the growing content on the web in various Indian languages, there is a need to analyse data from sources such as blogs, product reviews and social networking websites. Classifying this data by sentiment can be useful in product analysis, marketing strategies, advertisements and other user-specific recommendation systems. Sentiment analysis has been studied for English and other languages, but it is fairly new for Hindi and other Indian languages. In this paper we propose a method to classify reviews as either positive or negative using a lexicon. Two different lexicons, HSL (Hindi Subjective Lexicon; footnote 1) [1] and HSWN (Hindi SentiWordNet; footnote 2), were used. Each lexicon contains adjectives and adverbs and their corresponding scores.

In Equation (1), P(p), P(n) and P(o) are the probabilities of word w being used in a positive, negative and objective (neutral) sense, respectively. The sizes of the lexicons are given in Table 1.

Lexicon       HSL     HSWN
Adjectives    8108    4861
Adverbs        889     294

Table 1: Size of Lexicons

3. LEXICAL BASED APPROACH
A lexicon-based approach is followed, in which the data set is tested against two different lexicons [2]. Each review in the data set is classified according to a score computed from the adjectives and adverbs it contains. Two approaches were followed using the lexicon, and both were tested on the two lexicons:
• Using a Hindi part-of-speech (PoS) tagger (footnote 3), where only words tagged as JJ (adjective) or RB (adverb) are scored against the lexicon.
• Without a PoS tagger, where every word in the review is looked up among the adjectives and adverbs in the lexicon and the score is computed.
The scores for the adjectives and adverbs may be biased or domain dependent, so the reviews are ranked based on the presence (occurrence) of these words. For each of the above two approaches, the following four methods are followed.

3 http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
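The two word-selection strategies described in this section (with and without a PoS tagger) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the decision rule (compare summed positive and negative scores), the presence-based variant (one vote per matched word for its dominant polarity), the "undecided" label for ties, and the toy lexicon with its scores are all assumptions of this example, not the paper's exact method. The PoS tags are supplied by hand here; in practice they would come from a tagger.

```python
from typing import Dict, List, Optional, Tuple

Scores = Tuple[float, float]  # (positive score, negative score)

# Invented scores for illustration; not taken from HSL or HSWN.
TOY_LEXICON: Dict[str, Scores] = {
    "अच्छा": (0.75, 0.0),   # "good"
    "बुरा": (0.0, 0.875),   # "bad"
    "खराब": (0.0, 0.75),    # "poor, defective"
}


def select_words(tokens: List[str], tags: Optional[List[str]]) -> List[str]:
    """With tags: keep only JJ/RB words. Without tags: keep every word."""
    if tags is None:
        return list(tokens)
    return [tok for tok, tag in zip(tokens, tags) if tag in {"JJ", "RB"}]


def classify(tokens: List[str],
             lexicon: Dict[str, Scores],
             tags: Optional[List[str]] = None,
             use_presence: bool = False) -> str:
    """Label a review 'positive', 'negative' or 'undecided' (assumed rule)."""
    pos_total = 0.0
    neg_total = 0.0
    for word in select_words(tokens, tags):
        if word not in lexicon:
            continue  # words outside the lexicon contribute nothing
        pos, neg = lexicon[word]
        if use_presence:
            # Presence-based: ignore score magnitudes, count one vote
            # for whichever polarity dominates for this word.
            if pos > neg:
                pos_total += 1
            elif neg > pos:
                neg_total += 1
        else:
            pos_total += pos
            neg_total += neg
    if pos_total > neg_total:
        return "positive"
    if neg_total > pos_total:
        return "negative"
    return "undecided"


review = ["फोन", "बहुत", "अच्छा", "है"]   # "the phone is very good"
review_tags = ["NN", "RB", "JJ", "VM"]    # hand-assigned tags for the example
label_with_tagger = classify(review, TOY_LEXICON, tags=review_tags)
label_without_tagger = classify(review, TOY_LEXICON)
```

Passing `tags=None` corresponds to the approach without a PoS tagger, where every word is looked up in the lexicon; passing tags restricts scoring to words tagged JJ or RB.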

2. DATA SET
The data set consists of English product reviews that were translated into Hindi and validated manually. It contains 700 product reviews, of which 350 are classified as positive and 350 as negative. The length of each review varies from 2 to 30 words.

1 HSL (developed at IIIT Hyderabad)
2 HSWN (developed at IIT Bombay)


• Adjective presence in the lexicon.
• Adjective and...