Part of Speech Recognizer

Improving Identifier Informativeness using Part of Speech Information
Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland, Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu, mthearn@loyola.edu
Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Abstract
Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that seek to improve structure-field naming.

[Figure: Source Code ⇒ Source Code Mark-up ⇒ Extract Field Names ⇒ Split Field Names ⇒ Apply Template ⇒ Part of Speech Tagging]

Figure 1. Process for POS tagging of field names.

The text available in source-code artifacts, in particular a program’s identifiers, has a very different structure from standard prose. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval.
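Before a field name can be tagged, it has to be broken into its constituent words. The following is a minimal sketch of that splitting step only, assuming names follow underscore or camelCase conventions. The function name `split_field_name` and the regular expression are illustrative assumptions and are not the splitter used in the study, which may handle abbreviations and unmarked word boundaries that this sketch does not.

```python
import re
from typing import List

# Word patterns for a simple, conservative splitter. This is an assumption of
# the sketch, not the splitting technique used in the study.
#   [A-Z]+(?=[A-Z][a-z])  an acronym followed by a capitalised word (XML|Parser)
#   [A-Z]?[a-z]+          a lower-case or capitalised word
#   [A-Z]+                a trailing run of capitals
#   [0-9]+                a run of digits
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_field_name(name: str) -> List[str]:
    """Split a field name into lower-case words at underscores,
    case changes, and letter/digit boundaries."""
    words: List[str] = []
    for part in name.split("_"):
        # Empty parts (from leading or doubled underscores) yield no words.
        words.extend(m.group(0).lower() for m in _WORD.finditer(part))
    return words


if __name__ == "__main__":
    for field in ("fileName", "XMLParser", "max_buf_size2"):
        print(field, "->", split_field_name(field))
```

The resulting word lists rarely read as sentences, which is the difficulty the paragraph above describes for off-the-shelf taggers.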



References:
[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. In 11th IEEE International Workshop on Program Comprehension, pages 134–143, 2003.
[6] E. Høst and B. Østvold. The programmer’s lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP 09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL 07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD 07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.

4 Related Work

This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or lookup table. First, Høst et al. study the naming of Java methods using a lookup table to assign POS tags [7]. Their aim is to find what they call “naming bugs” by checking whether a method’s name properly reflects its implementation. Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to formulate domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. Based on a case study, their approach improved concept searching. Finally, Shepherd et al. considered finding concepts in code using natural language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than the other tools and with less user effort. This is made possible by POS information applied to source code.
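To make the lookup-table approach mentioned above concrete, here is a minimal sketch of tagging by dictionary lookup. The lexicon entries, the tag names, and the function `tag_words` are hypothetical and chosen only for illustration; they are not taken from any of the cited projects.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping a word to a single tag. The entries and
# tag names are illustrative assumptions, not data from the cited work.
LEXICON: Dict[str, str] = {
    "get": "verb",
    "set": "verb",
    "is": "verb",
    "max": "adjective",
    "empty": "adjective",
    "size": "noun",
    "name": "noun",
}


def tag_words(words: List[str], default: str = "unknown") -> List[Tuple[str, str]]:
    """Tag each word by case-insensitive dictionary lookup.
    Words missing from the lexicon receive the `default` tag."""
    return [(word, LEXICON.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    print(tag_words(["get", "Name"]))
    print(tag_words(["buffer", "size"]))
```

A table like this assigns one tag per word regardless of context, so a word such as "set", which can be a verb or a noun, is always tagged the same way. A context-sensitive tagger does not have that restriction.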
