Educational Testing Service Rosedale Road, 18E Princeton, NJ 08541 firstname.lastname@example.org
Department of Psychology Hunter College 695 Park Avenue New York, NY 10021 email@example.com
Educational Testing Service Rosedale Road, 18E Princeton, NJ 08541 firstname.lastname@example.org
This paper describes a deployed educational technology application: the Criterion℠ Online Essay Evaluation Service, a web-based system that provides automated scoring and evaluation of student essays. Criterion has two complementary applications: e-rater®, an automated essay scoring system, and Critique Writing Analysis Tools, a suite of programs that detect errors in grammar, usage, and mechanics, identify discourse elements in the essay, and recognize elements of undesirable style. These evaluation capabilities give students feedback that is specific to their own writing, to help them improve their writing skills. Both applications employ natural language processing and machine learning techniques. All of these capabilities outperform baseline algorithms, and some of the tools agree with human judges as often as two judges agree with each other.
2. Application Description
Criterion contains two complementary applications that are based on natural language processing (NLP) methods. The scoring application, e-rater®, extracts linguistically based features from an essay and uses a statistical model of how these features are related to overall writing quality to assign a holistic score to the essay. The second application, Critique, comprises a suite of programs that evaluate and provide feedback for errors in grammar, usage, and mechanics, identify the essay’s discourse structure, and recognize undesirable stylistic features. See the Appendices for sample evaluations and feedback.
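The scoring pipeline described above (extract features, then map them to a holistic score with a statistical model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three toy features, the cue-term list, the weights, the intercept, and the 1 to 6 score range are hypothetical and are not e-rater's actual features or model. A linear combination is used here only as one simple kind of statistical model.

```python
import math
import re

# Hypothetical cue terms, for illustration only (not e-rater's lexicon).
CUE_TERMS = ("in summary", "for example", "however", "therefore")

# Illustrative weights and intercept; a real system would estimate these
# from essays that human readers have already scored.
WEIGHTS = {"log_length": 0.8, "type_token_ratio": 1.5, "cue_terms": 0.3}
INTERCEPT = -1.0


def extract_features(essay: str) -> dict:
    """Compute a few toy surface features from raw essay text."""
    text = essay.lower()
    words = re.findall(r"[a-z']+", text)
    n_words = len(words)
    return {
        # Log of essay length in words (+1 so an empty essay gives 0).
        "log_length": math.log(n_words + 1),
        # Distinct words divided by total words, a crude vocabulary measure.
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Number of cue-term occurrences, a crude discourse measure.
        "cue_terms": float(sum(text.count(term) for term in CUE_TERMS)),
    }


def holistic_score(essay: str, low: int = 1, high: int = 6) -> int:
    """Combine features linearly, then round and clip to the score scale."""
    features = extract_features(essay)
    raw = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return max(low, min(high, round(raw)))
```

Calling `holistic_score(text)` returns an integer on the assumed scale. The value of the sketch is the separation between feature extraction and the scoring model, which lets either part be replaced independently.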
2.1. The E-rater scoring engine
The e-rater scoring engine is designed to identify features in student essay writing that reflect characteristics specified in reader scoring guides. Human readers are told to read quickly for a total impression and to take into account syntactic variety, use of grammar, mechanics, and style, organization and development, and vocabulary usage. For example, the free-response section of the writing component of the Test of English as a Foreign Language (TOEFL) is scored on a 6-point scale where scores of 5 and 6 are given to essays that are “well organized,” “use clearly appropriate details to support a thesis,” “demonstrate syntactic variety,” and show “a range of vocabulary.” By contrast, 1’s and 2’s show “serious disorganization or underdevelopment” and may show “serious and frequent errors in sentence structure or usage.” (See www.toefl.org/educator/edtwegui.html for the complete list of scoring guide criteria.) E-rater uses four modules for identifying features relevant to the scoring guide criteria: syntax, discourse, topical content, and lexical complexity.
2.1.1. E-rater features
In order to evaluate syntactic variety, a parser identifies syntactic structures, such as subjunctive auxiliary verbs and a variety of clausal structures, such as complement, infinitive, and subordinate clauses. E-rater’s discourse analysis module contains a lexicon based on the conceptual framework of conjunctive relations in Quirk et al. (1985) in which cue terms, such as in summary, are classified. These classifiers indicate whether or not the term is a discourse development term (for exam-
The best way to improve one’s writing skills is to write, receive feedback from an instructor, revise based on the feedback, and then repeat the whole process as often as possible. Unfortunately, this puts an enormous load on the classroom teacher, who is faced with reading and providing feedback for perhaps 30 essays or more every time a topic is...