Access Restriction

Author Srikanth, K. ♦ Ramakrishna, S.
Source CiteSeerX
Content type Text
File Format PDF
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Email Spam ♦ Statistical Consideration ♦ Email Management ♦ Spam Ham Group ♦ Chat Room ♦ Unsolicited Mail ♦ Popular Communication Medium ♦ Virus-infected Computer ♦ Content-based Classifier ♦ Internet User ♦ Bayesian Approach ♦ Conditional Probability ♦ Recent Ten Year ♦ Statistical Property ♦ Botnets Network ♦ Nature Spam Mail Email ♦ Unsolicited Email ♦ Training Data ♦ Email Address ♦ Several Statistical Method ♦ Ham Basing ♦ Large Corpus ♦ Legitimate Mail ♦ Different User ♦ Rebecca Lieb ♦ Word Tokenization ♦ Non Spam Mail ♦ Data Set ♦ Significant Amount ♦ Unwanted Email
Abstract While email is one of the fastest forms of communication, users frequently receive unsolicited emails, called spam. Non-spam mails, known as ham, are legitimate mails. In practice it is very difficult to classify a mail perfectly as spam or ham based on its content or subject. Several statistical methods are available that classify mails with some chance of misclassification. The most popular is the Bayesian approach, which uses the conditional probability of occurrence of given words in the spam and ham groups of the training data. Most content-based classifiers are based on word tokenization, leading to a large corpus of words along with their probabilities of occurrence. In this paper, we discuss some statistical properties of data sets used as corpora for training classifiers.

I. THE NATURE OF SPAM MAILS

Email has become an efficient and popular communication medium as the number of internet users has increased over the past ten years. Email management is therefore an important and growing problem for individuals and organizations, because email is prone to misuse. Email spam is unsolicited, unwanted email that is sent indiscriminately, directly or indirectly, by a sender having no relationship with the recipient. Email spam has grown steadily since the 1990s. According to Rebecca Lieb (2002), botnets (networks of virus-infected computers) are used to send about 80% of spam. A significant amount of time and resources is wasted in examining and deleting spam, and the cost is borne by the recipient. Spammers are the people who send unsolicited mails to different users. Spammers collect email addresses from chat rooms, websites and
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
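
The abstract describes the Bayesian approach as using the conditional probability of words in the spam and ham groups of the training data, built on word tokenization. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a naive (word-independence) Bayes model with add-one (Laplace) smoothing; the function names and the four-message `TOY_TRAINING_DATA` corpus are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages):
    """Count word occurrences per class from (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in messages:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, class_counts


def word_prob(word, label, word_counts, vocab_size):
    """Estimate P(word | label) with add-one smoothing (an assumption)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + vocab_size)


def classify(text, word_counts, class_counts):
    """Return the label with the larger log posterior score."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    n_messages = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior, then add one log likelihood per token.
        score = math.log(class_counts[label] / n_messages)
        for word in tokenize(text):
            score += math.log(word_prob(word, label, word_counts, len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus; a real classifier needs a large labeled data set.
TOY_TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

word_counts, class_counts = train(TOY_TRAINING_DATA)
```

Calling `classify("free prize offer", word_counts, class_counts)` on this toy corpus scores the message against both groups and picks the more probable one. Working in log space avoids numeric underflow when many small word probabilities are multiplied, and smoothing keeps a word unseen in one group from forcing that group's probability to zero.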