Authorship Attribution using Content based Features and N-gram features
Raju Dara1, T. Raghunadha Reddy2

1Dr. Raju Dara, Computer Science and Engineering, Vignana Bharathi Institute of Technology, Hyderabad, (Telangana), India.
2Dr. T. Raghunadha Reddy*, Information Technology, Vardhaman College of Engineering, Hyderabad, (Telangana), India.
Manuscript received on September 15, 2019. | Revised Manuscript received on October 15, 2019. | Manuscript published on October 30, 2019. | PP: 1152-1156 | Volume-9 Issue-1, October 2019 | Retrieval Number: A9507109119/2019©BEIESP | DOI: 10.35940/ijeat.A9507.109119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Textual content on the internet is growing exponentially, primarily through social websites, and the problems caused by anonymous text are growing with it. Researchers are therefore seeking techniques to identify the author of an unknown document. Authorship Attribution is one such technique: it predicts the author of a document whose authorship is unknown. Researchers have extracted various classes of stylistic features, such as character, lexical, syntactic, structural, content, and semantic features, to distinguish authors' writing styles. In this work, experiments were performed with the most frequent content-specific features and with n-grams of characters, words, and POS tags. A standard dataset was used for experimentation, and the combination of content-based and n-gram features achieved the best accuracy for author prediction. Two standard classification algorithms were used for author prediction; the Random Forest classifier attained better accuracy than the Naïve Bayes Multinomial classifier. The results compare favorably with many existing solutions to Authorship Attribution.
Keywords: Authorship Attribution, Accuracy, N-grams, Author Prediction, Content-based features.
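
As an illustration of the feature types named in the abstract, the following is a minimal sketch of character and word n-gram extraction with selection of the most frequent features. It is not the authors' implementation: the function names, the whitespace tokenization, the toy corpus, and the choice of bigrams are all assumptions made for this example. POS-tag n-grams would additionally require a part-of-speech tagger, which is omitted here, and the resulting count vectors would be passed to a classifier such as Random Forest or Naïve Bayes Multinomial.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams over lowercase whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_features(documents, extractor, k):
    """Pick the k most frequent features across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(extractor(doc))
    return [feature for feature, _ in counts.most_common(k)]


def vectorize(text, vocabulary, extractor):
    """Represent one document as raw counts over a fixed feature vocabulary."""
    counts = Counter(extractor(text))
    return [counts[feature] for feature in vocabulary]


def word_bigrams(text):
    """Word 2-gram extractor used for the toy example below."""
    return word_ngrams(text, 2)


# Hypothetical toy corpus, for illustration only (not the paper's dataset).
docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = top_features(docs, word_bigrams, k=3)
vector = vectorize(docs[0], vocab, word_bigrams)
```

Fixing the vocabulary from the training collection and counting only those features in each document keeps every document vector the same length, which is what a standard classifier expects as input.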