Application of the Porter Stemmer Algorithm

  • Topic: Stemming, Consonant, Affix
  • Published : December 10, 2012
Using the Porter Stemmer Algorithm

Overview
The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. It is a context-sensitive suffix-removal algorithm. It is one of the most widely used stemmers, and implementations are available in many languages. This native functor creates a module that exports a function which performs stemming by means of the Porter stemming algorithm. Quoting Martin Porter himself: "The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems."

Algorithm
Porter's algorithm applies its rules conditionally on the measure of the stem, which is the number of vowel sequences that are followed by a consonant sequence. The measure must exceed a threshold (for example, be greater than one) for a rule to be applied. In detail, every word, or part of a word, is a combination of consonants and vowels. A consonant is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. For example, in the word boy the consonants are B and Y, but in try they are T and R, because the Y in try is preceded by a consonant. A vowel is any letter that is not a consonant. A single consonant will be denoted by c and a single vowel by v. A list of consonants of length one or more will be denoted by C, and a similar list of vowels by V.
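The consonant test, including the special case for Y, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the names `is_consonant` and `consonants` are hypothetical, and treating a word-initial Y as a consonant is an assumption that follows from the definition (it has no preceding consonant).

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under the definition above."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when preceded by a consonant.
        # Assumption: a word-initial Y counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def consonants(word: str) -> list:
    """List the letters of word that count as consonants, in upper case."""
    return [word[i].upper() for i in range(len(word)) if is_consonant(word, i)]


print(consonants("boy"))  # ['B', 'Y']
print(consonants("try"))  # ['T', 'R']
```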

A list ccc… of one or more consonants is denoted by C, and a list vvv… of one or more vowels is denoted by V. A word may be of any length and therefore has one of four forms:

CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form
[C]VCVC ... [V]
which can be written as [C](VC)^m[V].

The superscript m, called the measure, is the number of VC sequences. Square brackets denote an optional occurrence. Some examples of measures follow:

Measure   Words

m=0       TR, EE, TREE, Y, BY
m=1       TROUBLE, OATS, TREES, IVY
m=2       TROUBLES, PRIVATE, OATEN
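The measure can be computed by collapsing a word into alternating C and V runs and counting the VC pairs. The following is a minimal sketch under that reading of the definition; the function names are hypothetical, and the consonant helper is repeated so that the block runs on its own.

```python
def is_consonant(word: str, i: int) -> bool:
    """Consonant test; Y is a vowel only when preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word: str) -> int:
    """Return m, the number of VC sequences in [C](VC)^m[V]."""
    pattern = ""
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        # Collapse runs: only record a letter class when it changes.
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")


for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, measure(w))
# tree 0, by 0, trouble 1, ivy 1, troubles 2, private 2
```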

Whether the suffix of a word is removed depends on the value of m. All rules are of the form

(condition) S1 -> S2

which means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
The value of m can be worked out as follows:

• tree: C(VC)^0V, i.e. (tr)()(ee), so m = 0
• troubles: C(VC)^2, i.e. (tr)(ou)(bl)(e)(s), so m = 2
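To make the rule form (condition) S1 -> S2 concrete, here is a minimal sketch of applying a single measure-conditioned rule such as (m > 1) EMENT ->. The function `apply_rule` and its parameters are hypothetical names chosen for illustration, and input is assumed to be lower case.

```python
def is_consonant(word: str, i: int) -> bool:
    """Consonant test; Y is a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of VC sequences in the stem."""
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")


def apply_rule(word: str, s1: str, s2: str, min_m: int) -> str:
    """Apply (m > min_m) s1 -> s2; return the word unchanged otherwise."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]
    # The condition is tested on the stem, not on the whole word.
    if measure(stem) > min_m:
        return stem + s2
    return word


print(apply_rule("replacement", "ement", "", 1))  # replac
print(apply_rule("cement", "ement", "", 1))       # cement (stem "c" has m = 0)
```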

The 'condition' part may also contain the following:
*<X> -- the stem ends with a given letter X
*v* -- the stem contains a vowel
*d -- the stem ends in a double consonant
*o -- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y

Suffix conditions take the form: (current_suffix == pattern). Rule conditions take the form: (rule was used). Actions are rewrite rules of the form:

old_suffix -> new_suffix
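The stem conditions listed above can each be written as a small predicate. The sketch below assumes lower-case input, and all function names are hypothetical.

```python
def is_consonant(word: str, i: int) -> bool:
    """Consonant test; Y is a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with_letter(stem: str, letter: str) -> bool:
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem: str) -> bool:
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    """*d: the stem ends in a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem: str) -> bool:
    """*o: the stem ends consonant-vowel-consonant, final one not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


print(contains_vowel("bl"), contains_vowel("plaster"))            # False True
print(ends_double_consonant("hopp"), ends_double_consonant("hop"))  # True False
print(ends_cvc("hop"), ends_cvc("snow"))                          # True False
```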
The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply. The longest possible suffix is always removed because of the ordering of the rules within a step. The algorithm is as follows.

Step 1: Gets rid of plurals and -ed or -ing suffixes
Step 2: Turns terminal y to i when there is another vowel in the stem
Step 3: Maps double suffixes to single ones: -ization, -ational, etc.
Step 4: Deals with suffixes (indexes on the final letter of the stem and, if it matches, removes the ending): -full, ...
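The way a step examines its rules in sequence and applies at most one of them can be sketched as follows. The rule table used here is the plural-handling part of Step 1 (Step 1a in Porter's 1980 paper), listed longest suffix first; the -ed and -ing rules are omitted, and `apply_step` is a hypothetical name used only for this illustration.

```python
# Plural rules of Step 1a, ordered so that longer suffixes are tried first.
STEP_1A = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_step(word: str, rules) -> str:
    """Apply the first rule whose suffix matches; at most one rule fires."""
    for old_suffix, new_suffix in rules:
        if word.endswith(old_suffix):
            return word[: len(word) - len(old_suffix)] + new_suffix
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_step(w, STEP_1A))
# caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
```

Because "ss" is listed before "s", a word such as caress matches the SS -> SS rule and stops there, so its final s is not stripped.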