A huge number of informal messages are posted every day on social network sites, blogs and discussion forums. Emotions seem to be frequently important in these texts for expressing friendship, showing social support or as part of online arguments. Algorithms to identify sentiment and sentiment strength are needed to help understand the role of emotion in this informal communication and also to identify inappropriate or anomalous affective utterances, potentially associated with threatening behaviour to the self or others. Nevertheless, existing sentiment detection algorithms tend to be commercially oriented, designed to identify opinions about products rather than user behaviours. This article partly fills this gap with a new algorithm, SentiStrength, to extract sentiment strength from informal English text, using new methods to exploit the de facto grammars and spelling styles of cyberspace. Applied to MySpace comments and with a lookup table of term sentiment strengths optimised by machine learning, SentiStrength is able to predict positive emotion with 60.6% accuracy and negative emotion with 72.8% accuracy, both based upon strength scales of 1-5. The former, but not the latter, is better than baseline and a wide range of general machine learning approaches.
Introduction
Most opinion mining algorithms attempt to identify the polarity of sentiment in text: positive, negative or neutral. Whilst this is sufficient for many applications, texts often contain a mix of positive and negative sentiment, and for some applications it is necessary to detect both simultaneously and also to detect the strength of sentiment expressed. For instance, programs to monitor sentiment in online communication, perhaps designed to identify and intervene when inappropriate emotions are used or to identify at-risk users (e.g., Huang, Goh, & Liew, 2007), would need to be sensitive to the strength of sentiment expressed and whether participants were appropriately balancing positive and negative sentiment. In addition, basic research to understand the role of emotion in online communication (e.g., Derks, Fischer, & Bos, 2008; Hancock, Gee, Ciaccio, & Lin, 2008; Nardi, 2005) would also benefit from fine-grained sentiment detection, as would the growing body of psychology and other social science research into the role of sentiment in various types of discussion or general discourse (Balahur, Kozareva, & Montoyo, 2009; Pennebaker, Mehl, & Niederhoffer, 2003; Short & Palmer, 2008).