A huge number of informal messages are posted every day on social
network sites, blogs and discussion forums. Emotions often seem to be
important in these texts, whether for expressing friendship, showing
social support or as part of online arguments. Algorithms to identify
sentiment and sentiment strength are needed to help understand the role
of emotion in this informal communication and also to identify
inappropriate or anomalous affective utterances, potentially associated
with threatening behaviour to the self or others. Nevertheless, existing
sentiment detection algorithms tend to be commercially oriented,
designed to identify opinions about products rather than user
behaviours. This article partly fills this gap with a new algorithm,
SentiStrength, to extract sentiment strength from informal English text,
using new methods to exploit the de facto grammars and spelling styles
of cyberspace. Applied to MySpace comments and with a lookup table of
term sentiment strengths optimised by machine learning, SentiStrength is
able to predict positive emotion with 60.6% accuracy and negative
emotion with 72.8% accuracy, both based upon strength scales of 1-5. The
former result, but not the latter, is better than the baseline and a wide
range of general machine learning approaches.
Introduction
Most opinion mining algorithms attempt to identify the polarity of
sentiment in text: positive, negative or neutral. Whilst for many
applications this is sufficient, texts often contain a mix of positive
and negative sentiment and for some applications it is necessary to
detect both simultaneously and also to detect the strength of sentiment
expressed. For instance, programs to monitor sentiment in online
communication, perhaps designed to identify and intervene when
inappropriate emotions are used or to identify at-risk users (e.g.,
Huang, Goh, & Liew, 2007), would need to be sensitive to the
strength of sentiment expressed and whether participants were
appropriately balancing positive and negative sentiment. In addition,
basic research to understand the role of emotion in online communication
(e.g., Derks, Fischer, & Bos, 2008; Hancock, Gee, Ciaccio,
& Lin, 2008; Nardi, 2005) would also benefit from fine-grained
sentiment detection, as would the growing body of psychology and other
social science research into the role of sentiment in various types of
discussion or general discourse (Balahur, Kozareva, & Montoyo, 2009;
Pennebaker, Mehl, & Niederhoffer, 2003; Short & Palmer,
2008).