Using Artificial Neural Networks to Identify Image Spam

Published: January 30, 2013
Spam Image Identification Using an Artificial Neural Network
Jason R. Bowling, Priscilla Hope, Kathy J. Liszka
The University of Akron
Akron, Ohio 44325-4003
{bowling, ph11, liszka}@uakron.edu
Abstract
We propose a method for identifying image spam by
training an artificial neural network. A detailed process
for preprocessing spam image files is given, followed by a
description of how to train an artificial neural network to
distinguish between ham and spam. Finally, we exercise
the trained network by testing it against unknown images.

1. Introduction
Select – delete – repeat. It’s what we spend the first ten
minutes of every day doing: purging spam from our
inboxes. In the first month after the National Do Not Call
Registry went into effect, we noticed about a 30%
increase in spam. No, that’s not backed by scientific
process, just personal observation. And then, it got worse.
Clearly spam isn’t going away, at least not in the
foreseeable future. People still respond to it, buy products from it, and are scammed by it.
Filters are available to combat these unsolicited
nuisances. But spammers continually develop new
techniques to avoid detection by filters. See [1] for a
current and comprehensive list of spam techniques. This
paper focuses on one specific category of unsolicited bulk
email – image spam. This is a fairly recent phenomenon
that has appeared in the past few years. In 2005, it
comprised roughly 1% of all emails, then grew to an
estimated 21% by mid-2006 [2]. These messages arrive as
image attachments that contain text, with what looks like a
legitimate subject and From address. They successfully
get past traditional spam filters and optical character
recognition (OCR) systems. As a result, they are often
referred to as OCR-evading spam images. A common
example is shown in Figure 1. Spam images vary in file
type; some are multipart, with one picture split across
several image files, and some are angled or twisted.
A number of image spam identification and
classification techniques have been proposed [3, 4, 5]
including image processing and computer vision
techniques [6, 7]. We are studying the use of an artificial
neural network (ANN) to distinguish a spam image
from a “ham” (i.e., non-spam) image. ANNs
have been used in [8] to identify spam by looking at the
text-based header portion of spam email. In our research,
we are interested in spam that uses jpg, bmp, gif and png
images.

Figure 1. Simple Image Spam

An artificial neural network (ANN) is a computational
model based on biological neural networks. Given proper
inputs, such a network is meant to be adaptive, learning by
example. An ANN is defined by a set of input and output
variables and is then given a set of training examples.
The Fast Artificial Neural Network (FANN) [9] is an open
source library that implements an ANN. An excellent
primer on artificial neural nets and, specifically, the
FANN libraries can be found in [10].
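To make "learning by example" concrete, here is a minimal sketch in C of a single artificial neuron trained with the classic perceptron rule. It is a deliberately simpler stand-in for illustration, not the multi-layer network or the training algorithm that FANN provides. The names Neuron, neuron_predict and neuron_train are hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_INPUTS 2  /* assumed input count for this toy example */

/* One neuron: a weight per input plus a bias term. */
typedef struct {
    double weight[NUM_INPUTS];
    double bias;
} Neuron;

/* Step activation: returns 1 when the weighted sum is positive, else 0. */
int neuron_predict(const Neuron *n, const double *input)
{
    double sum = n->bias;
    for (size_t i = 0; i < NUM_INPUTS; i++)
        sum += n->weight[i] * input[i];
    return sum > 0.0 ? 1 : 0;
}

/* Perceptron rule: nudge the weights after every misclassified example.
   inputs is a flat array of count * NUM_INPUTS values, and targets holds
   one desired output (0 or 1) per example.
   Returns the number of epochs used, or -1 if max_epochs ran out. */
int neuron_train(Neuron *n, const double *inputs, const int *targets,
                 size_t count, double rate, int max_epochs)
{
    assert(n != NULL && inputs != NULL && targets != NULL);
    assert(count > 0 && rate > 0.0);

    for (int epoch = 1; epoch <= max_epochs; epoch++) {
        int errors = 0;
        for (size_t k = 0; k < count; k++) {
            const double *x = &inputs[k * NUM_INPUTS];
            int delta = targets[k] - neuron_predict(n, x);
            if (delta != 0) {
                errors++;
                for (size_t i = 0; i < NUM_INPUTS; i++)
                    n->weight[i] += rate * delta * x[i];
                n->bias += rate * delta;
            }
        }
        if (errors == 0)
            return epoch;  /* every training example is now classified correctly */
    }
    return -1;
}
```

Trained on the four rows of the logical AND truth table with a learning rate of 1.0, this neuron settles on weights that reproduce the table. A real spam classifier needs many more inputs and hidden layers, which is what a library such as FANN supplies.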
Our process has three steps. First is image preparation:
we create a file compatible with the
inputs of the ANN we are testing. Next, we train the
artificial neural network with our training data. Finally,
we test the network with “unknown” images to see how well
it has learned to identify spam versus ham. Sections 2
through 4 provide a detailed description of how we set up
our experiments. Code is included in the appendices for
the interested reader. We’ve provided everything one
needs to create and test an ANN except the spam. You’ll
need to raid your own inbox for that.
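As an illustration of the first step, the sketch below writes labelled feature vectors in the plain-text layout that FANN reads for training data: a header line with the number of training pairs, inputs and outputs, followed by alternating lines of inputs and desired outputs. This is not the paper's image2fann program. The function name write_fann_training_file, the flat feature array, and the labelling convention (1 for spam, 0 for ham) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes num_images training pairs in FANN's text training format.
   features holds num_images * num_inputs values, one image after another.
   labels holds one value per image: 1 = spam, 0 = ham (assumed convention).
   Returns 0 on success, -1 if a write fails. */
int write_fann_training_file(FILE *out, const double *features,
                             const int *labels, size_t num_images,
                             size_t num_inputs)
{
    assert(out != NULL && features != NULL && labels != NULL);
    assert(num_images > 0 && num_inputs > 0);

    /* Header: training pairs, inputs per pair, outputs per pair. */
    if (fprintf(out, "%zu %zu 1\n", num_images, num_inputs) < 0)
        return -1;

    for (size_t k = 0; k < num_images; k++) {
        /* One line of space-separated inputs for image k. */
        for (size_t i = 0; i < num_inputs; i++) {
            const char *sep = (i == 0) ? "" : " ";
            if (fprintf(out, "%s%g", sep, features[k * num_inputs + i]) < 0)
                return -1;
        }
        /* Next line: the single desired output for that image. */
        if (fprintf(out, "\n%d\n", labels[k]) < 0)
            return -1;
    }
    return 0;
}
```

How each image becomes a fixed-length vector of numbers (resizing, color reduction, scaling of pixel values) is left open here, since that depends on the preprocessing choices described in the next section.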

2. Image Preparation
The first step in the overall process is to prepare the
images in a standard format. We have a small C program
called image2fann (see Appendix A) that accepts images in
almost any common format by relying on a utility called
ImageMagick [11], an open source tool that
converts images from virtually any format to
another. In...