SentiStrength is a sentiment analysis tool developed by Mike Thelwall et al. [^1] in 2010. This manual briefly introduces the tool, describes its functions, and shows how to use it. More details can be found in the original paper and on SentiStrength’s official website [^2].
Introduction
SentiStrength was developed from comments on a social networking site (MySpace [^3]). Its core function is dictionary-based sentiment analysis of text. Specifically, it first assigns a priori sentiment scores to words according to a sentiment dictionary, and then adjusts those scores with several heuristic rules. For each input text it returns a sentiment score pair $(\rho, \eta)$, where $\rho$ is the positive score of the text and $\eta$ is the negative score. The scale and meaning of $\rho$ and $\eta$ are as follows:
$$
\begin{aligned}
&[\text{no positive sentiment}] \; +1, +2, +3, +4, +5 \; [\text{very strong positive sentiment}] \\
&[\text{no negative sentiment}] \; -1, -2, -3, -4, -5 \; [\text{very strong negative sentiment}]
\end{aligned}
$$
Core Function
SentiStrength’s core function is to use a set of dictionaries and several heuristic rules to conduct sentiment analysis on text. The key elements of SentiStrength are listed below.
UC-1 Assigning Sentiment Scores for Words
The core of the algorithm is the sentiment word strength list (`EmotionLookupTable` in SentiStrength). `EmotionLookupTable` contains 2546 words or wildcards (hereinafter collectively referred to as items). Each item has a preset sentiment score, which is an integer ranging from -5 to 5. SentiStrength assigns a sentiment score to each word in a sentence based on `EmotionLookupTable`. If a word does not exist in `EmotionLookupTable`, it defaults to neutral. Notably, the word “miss” was allocated both a positive and a negative strength of 2. This was the only word classed as both positive and negative. It was typically used in the phrase “I miss you”, suggesting both sadness and love.
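The lookup step can be illustrated with a minimal sketch. It is not SentiStrength's own code: the table excerpt, the assumed wildcard item `amaz*` (stored here by its prefix), and the choice of 0 to stand for "neutral" are all assumptions made for illustration.

```python
# Hypothetical excerpt; the real EmotionLookupTable holds 2546 items.
EXACT_ITEMS = {"good": 2, "love": 3, "hate": -4}
# Assumed wildcard item "amaz*", stored by its prefix.
WILDCARD_ITEMS = {"amaz": 3}


def word_score(word):
    """Return the preset score of a word, or 0 (neutral) if no item matches."""
    w = word.lower()
    if w in EXACT_ITEMS:
        return EXACT_ITEMS[w]
    for prefix, score in WILDCARD_ITEMS.items():
        if w.startswith(prefix):
            return score
    return 0  # words missing from the table default to neutral


print(word_score("good"), word_score("amazing"), word_score("table"))
```

Exact items are checked before wildcards so that a specific entry can take precedence over a broader pattern; whether the real tool orders its matches this way is not stated in this manual.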
UC-2 Assigning Sentiment Scores for Phrases
`EmotionLookupTable` is used to assign sentiment scores to unigrams, while `IdiomLookupTable` is used to assign scores to phrases, which usually contain multiple words. When an idiom is recognized, its sentiment score overrides the scores of the individual words that make up the idiom. For example, in the text “It’s a killer feature.”, “killer feature” is a phrase in the dictionary with a positive score of 2. Although the word “killer” on its own carries negative sentiment, its effect is overridden by the score of the enclosing phrase, so the text is ultimately analysed as positive.
UC-3 Spelling Correction
An algorithm identifies the standard spellings of words that have been misspelled by the inclusion of repeated letters. For example, hellllloooo would be identified as “hello” by this algorithm. The algorithm (a) automatically deletes letters repeated more than twice (e.g., helllo $\to$ hello); (b) deletes letters occurring twice if they rarely occur twice in English (e.g., niice $\to$ nice); and (c) deletes letters occurring twice if the word is not a standard word but would form a standard word once the letter is deleted (e.g., nnice $\to$ nice, but not hoop $\to$ hop nor baaz $\to$ baz). `EnglishWordList` is used to check whether the spelling of a word is correct.
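The three steps (a) to (c) can be sketched as follows. This is only one possible reading of the rule, not SentiStrength's implementation. `ENGLISH_WORDS` is a tiny stand-in for `EnglishWordList`, and `RARE_DOUBLES` is a hypothetical set of rarely doubled letters chosen so that the examples in the text behave as described.

```python
import re

ENGLISH_WORDS = {"hello", "nice", "hoop", "hop"}  # stand-in for EnglishWordList
RARE_DOUBLES = set("hijkquvxyz")  # hypothetical list of rarely doubled letters


def correct_spelling(word):
    """Undo letter repetition following steps (a), (b) and (c)."""
    # (a) a letter repeated more than twice is cut back to two
    word = re.sub(r"(.)\1{2,}", r"\1\1", word)
    # (b) a doubled letter that is rarely doubled is cut back to one
    word = re.sub(
        r"(.)\1",
        lambda m: m.group(1) if m.group(1) in RARE_DOUBLES else m.group(0),
        word,
    )
    # (c) for a non-standard word, drop one letter of a doubled pair
    #     only if that yields a standard word
    if word not in ENGLISH_WORDS:
        for m in re.finditer(r"(.)\1", word):
            candidate = word[: m.start()] + word[m.start() + 1 :]
            if candidate in ENGLISH_WORDS:
                return candidate
    return word


for w in ["hellllloooo", "helllo", "niice", "nnice", "hoop", "baaz"]:
    print(w, "->", correct_spelling(w))
```

Step (c) is deliberately conservative: "hoop" is already a standard word and is left alone, and "baaz" is left alone because "baz" is not in the word list.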
UC-4 Booster Word Rule
A booster word list (`BoosterWordList`) contains words that boost or reduce the emotion of subsequent words, whether positive or negative. Each word increases emotion strength by 1 or 2 (e.g., very, extremely) or decreases it by 1 (e.g., some).
UC-5 Negating Word Rule
A negating word list (`NegatingWordList`) contains words that invert subsequent emotion words (including any preceding booster words). For example, if “very happy” had positive strength 4 then “not very happy” would have negative strength 4. The possibility that some negating terms do not negate was not incorporated, as this did not seem to occur often in the pilot data set.
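The booster and negating rules can be combined in a short sketch. It assumes the behaviour described in the two rules above (boosters move the strength away from or towards zero, negators flip the sign); the word lists and scores are hypothetical excerpts, and the released tool exposes options that change this behaviour.

```python
# Hypothetical excerpts of BoosterWordList, NegatingWordList, EmotionLookupTable.
BOOSTERS = {"very": 1, "extremely": 2, "some": -1}
NEGATORS = {"not"}
EMOTION = {"happy": 3, "good": 2, "sad": -3}


def phrase_strength(words):
    """Strength of the first emotion word after applying boosters and negators."""
    boost = 0
    negate = False
    for w in words:
        if w in NEGATORS:
            negate = True
        elif w in BOOSTERS:
            boost += BOOSTERS[w]
        elif w in EMOTION:
            s = EMOTION[w]
            # a booster strengthens the emotion in its own direction
            s = s + boost if s > 0 else s - boost
            s = max(-5, min(5, s))  # keep within the -5..5 scale
            return -s if negate else s
    return 0  # no emotion word found


print(phrase_strength(["very", "happy"]))         # 4
print(phrase_strength(["not", "very", "happy"]))  # -4
```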
UC-6 Repeated Letter Rule
Repeated letters beyond those needed for correct spelling give a strength boost of 1 to sentiment words, as long as there are at least two additional letters. Repeated letters are a common device for expressing emotion or energy in MySpace comments, but a single repeated letter often appeared to be a typing error.
UC-7 Emoji Rule
An emoticon list (`EmoticonLookupTable`) with associated strengths (positive or negative 2) supplements the sentiment word strength list. Punctuation included in emoticons is not processed further by the punctuation rules below.
UC-8 Exclamation Mark Rule
Any sentence with an exclamation mark was allocated a minimum positive strength of 2.
UC-9 Repeated Punctuation Rule
Repeated punctuation including at least one exclamation mark gives a strength boost of 1 to the immediately preceding emotion word (or sentence).
UC-10 Negative Sentiment Ignored in Questions
Negative sentiment is ignored in questions. For example, the question “are you angry?” would be classified as not containing sentiment, despite the presence of the word “angry”. This rule was not applied to positive sentiment because many question sentences appeared to contain mild positive sentiment. In particular, sentences like “whats up?” were typically classified as containing mild positive sentiment (strength 2). QuestionWord is used to identify question words.
The above factors were applied separately to each sentence, with each sentence being assigned both the most positive and the most negative sentiment identified in it. Each overall text was assigned the most positive of its sentence sentiments and the most negative of its sentence sentiments. Sentences were split either at line breaks in comments or after punctuation other than emoticons. Table 1 shows examples of how SentiStrength analyses text.
Sample | $\rho$ | $\eta$ | Dictionary/Rule Used | Explanation |
---|---|---|---|---|
It’s a good feature. | 2 | -1 | EmotionLookupTable | The sentiment score of the word ‘good’ is preset to 2, so the sentence is assigned a positive score of 2. |
It’s a very good feature. | 3 | -1 | EmotionLookupTable, BoosterWordList | Because the booster word ‘very’ precedes the sentiment word, the sentence is assigned a positive score of 3. |
It’s not good feature. | 1 | -2 | EmotionLookupTable, NegatingWordList | The polarity of the sentiment word is inverted by the negating word ‘not’ that precedes it. |
It’s a good feature! | 3 | -1 | EmotionLookupTable, "!" Rule | The “!” increases the sentiment strength. |
It’s a gooood feature. | 3 | -1 | Repeated Letter Rule | Repeated letters beyond those needed for correct spelling give a strength boost of 1 to sentiment words. |
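The way word-level strengths are combined into sentence and text scores can be sketched as below. The function names are illustrative. The input is assumed to be one list of already adjusted word strengths per sentence, with 0 for neutral words.

```python
def sentence_scores(word_strengths):
    """Most positive and most negative strength in a sentence.

    1 and -1 mean 'no positive' and 'no negative' sentiment respectively.
    """
    pos = max((s for s in word_strengths if s > 0), default=1)
    neg = min((s for s in word_strengths if s < 0), default=-1)
    return max(pos, 1), min(neg, -1)


def text_scores(sentences):
    """Most positive and most negative of the sentence-level scores."""
    pairs = [sentence_scores(ws) for ws in sentences]
    if not pairs:
        return 1, -1
    return max(p for p, _ in pairs), min(n for _, n in pairs)


# Two sentences: one mildly positive, one clearly negative.
print(text_scores([[0, 2, 0], [0, -3, -2]]))  # (2, -3)
```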
Other Functions
SentiStrength was initially released as a JAR package. This section introduces some non-core functions and explains how to set the options that enable them during analysis. For a more comprehensive description of the options, please refer to the manual on the official website.
Complete Different Classification Tasks (6)
SentiStrength can classify individual texts or multiple texts and can be invoked in many different ways. This section covers these methods although most users only need one of them.
UC-11 Classify a single text
The submitted text is classified and the result is returned in the form `+ve -ve`, i.e., the positive score, a space, then the negative score. If the classification method is trinary, binary or scale, the result has the form `+ve -ve overall`. A typical result is `3 -1`.
UC-12 Classify all lines of text in a file for sentiment [includes accuracy evaluations]
Each line of the input file is classified for sentiment. A new file is created with the sentiment classifications added to the end of each line.
To test the accuracy of SentiStrength, the file may have positive codes in the first column, negative codes in the second column and text in the last column. If using binary/trinary/scale classification, the first column can contain the human-coded values. Columns must be tab-separated. If human-coded sentiment scores are included in the file, the accuracy of SentiStrength is evaluated against them.
UC-13 Classify texts in a column within a file or folder
For each line, the text in the specified column is extracted and classified, with the result added as an extra column at the end of the line (all three parameters are compulsory).
If a folder is specified instead of a filename (i.e., an input parameter), all files in the folder are processed as above. If a `fileSubstring` value is specified, only files matching the substring are classified. The parameter `overwrite` must be specified to explicitly allow the input files to be modified. This is purely a safety feature.
UC-14 Listen at a port for texts to classify
This sets the program to listen at a port number for texts to classify, e.g., at port 81 for texts for trinary classification.
The texts must be URL-encoded and submitted as part of the URL. For example, if listening was set up on port 81, requesting the following URL would trigger classification of the text “love you”: http://127.0.0.1:81/love%20you
The result for this would be 3 -1 1, that is: (+ve classification) (-ve classification) (trinary classification).
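A client has to URL-encode the text before placing it in the request path. The helper below is a minimal sketch using Python's standard `urllib.parse.quote`. The function name `classification_url` is hypothetical, and the host and port defaults are taken from the example above.

```python
from urllib.parse import quote


def classification_url(text, host="127.0.0.1", port=81):
    """Build the request URL for a text, percent-encoding unsafe characters."""
    return f"http://{host}:{port}/{quote(text, safe='')}"


print(classification_url("love you"))
```

Fetching the URL (for instance with `urllib.request.urlopen`) only works while a SentiStrength instance is actually listening on that port.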
UC-15 Run interactively from the command line
The `cmd` option (which can be combined with other options and the `sentidata` folder) allows the program to classify texts from the command prompt. After starting it, every line you enter is classified for sentiment. To finish, enter `@end`.
UC-16 Process stdin and send to stdout
SentiStrength classifies all texts sent to it from stdin and then closes. This is probably the most efficient way of integrating SentiStrength with non-Java programs. The alternatives are the listen-at-a-port option (UC-14) or dumping the texts to be classified into a file and then running SentiStrength on the file.
The parameter `textCol` can be set (default 0 for the first column) if the data is sent in multiple tab-separated columns and one column contains the text to be classified. The results are appended to the end of the input data and sent to stdout. Internally, the Java code is essentially a loop that reads lines from stdin until it receives null, so for greatest efficiency null should not be sent to stdin, as this closes the program.
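The stdin/stdout contract described above can be mimicked with a small Python sketch. It is not SentiStrength's Java loop: `process_stream` and the stand-in `classify` callback are hypothetical, and appending the scores as extra tab-separated columns is an assumption about the output layout.

```python
import io


def process_stream(lines, classify, text_col=0):
    """Yield each input line with the classification appended.

    `classify` stands in for the sentiment classifier and must return
    a (positive, negative) pair for a text.
    """
    for line in lines:
        line = line.rstrip("\n")
        columns = line.split("\t")
        pos, neg = classify(columns[text_col])
        yield f"{line}\t{pos}\t{neg}"


# Usage with an in-memory stream instead of a real stdin.
fake_stdin = io.StringIO("id1\tlove you\n")
for out in process_stream(fake_stdin, lambda text: (3, -1), text_col=1):
    print(out)
```

The loop ends when the input stream is exhausted, which mirrors the point that the program closes once its input ends.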
Set Location of Data (4)
UC-17 Location of linguistic data folder
The `sentidata` option sets the folder in which the tool searches for the dictionaries needed for analysis (such as `EmotionLookupTable`, `IdiomLookupTable`, etc.).
UC-18 Location of sentiment term weights
This option sets which file is used as the core sentiment word strength list for the tool. The default value is `EmotionLookupTable.txt` or `SentimentLookupTable.txt`. This file must be in the directory specified by `sentidata`.
UC-19 Location of output folder
This option sets the name of the folder in which the output is placed.
UC-20 File name extension for output
This option sets the suffix used to name the output file. Its default value is `_out.txt`. For example, if you set `input.txt` as the input file, the first output file will be `input0_out.txt` (input file name + index + results extension).
Set Different Type of Output (4)
UC-21 Classify positive (1 to 5) and negative (-1 to -5) sentiment strength separately
This is the default and is used unless binary, trinary or scale is selected. Note that 1 indicates no positive sentiment and -1 indicates no negative sentiment. There is no output of 0.
UC-22 Use trinary classification (positive-negative-neutral)
With trinary classification the result looks like 3 -1 1, that is: (+ve classification) (-ve classification) (trinary classification).
UC-23 Use binary classification (positive-negative)
With binary classification the result looks like 3 -1 1, that is: (+ve classification) (-ve classification) (binary classification).
UC-24 Use a single positive-negative scale classification
With scale classification the result looks like 3 -4 -1, that is: (+ve classification) (-ve classification) (scale classification).
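One reading that is consistent with the example results above is that the overall value is derived from the sum of the positive and negative scores. The sketch below follows that assumption; the function name `overall`, and the `tie_breaker` used for binary ties, are hypothetical, and the real tool may apply further adjustments.

```python
def overall(pos, neg, mode, tie_breaker=1):
    """Derive the third output value from a (pos, neg) pair (assumed rule)."""
    total = pos + neg
    if mode == "scale":
        return total  # e.g. 3 + (-4) = -1
    sign = (total > 0) - (total < 0)
    if mode == "trinary":
        return sign  # 1 positive, 0 neutral, -1 negative
    if mode == "binary":
        # no neutral class, so ties fall back on an assumed tie breaker
        return sign if sign != 0 else tie_breaker
    raise ValueError(f"unknown mode: {mode}")


print(overall(3, -1, "trinary"), overall(3, -4, "scale"))
```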
UC-25 Explain the classification
Adding this parameter to most of the options results in an approximate explanation being given for the classification.
UC-26 Set Classification Algorithm Parameters
Please note that most of these options can be mapped to the core function of SentiStrength. They can change how the sentiment analysis algorithm works.
- alwaysSplitWordsAtApostrophes (split words when an apostrophe is met; important for languages that merge words with ', like French (e.g., t’aime $\to$ t ’ aime with this option, t’aime without))
- noBoosters (ignore sentiment booster words (e.g., very))
- noNegatingPositiveFlipsEmotion (don’t use negating words to flip +ve words)
- noNegatingNegativeNeutralisesEmotion (don’t use negating words to neutralise -ve words)
- negatedWordStrengthMultiplier (strength multiplier when negated (default=0.5))
- maxWordsBeforeSentimentToNegate (max words between negator & sentiment word (default 0))
- noIdioms (ignore idiom list)
- questionsReduceNeg (-ve sentiment reduced in questions)
- noEmoticons (ignore emoticon list)
- exclamations2 (count exclamation marks as +2 if not a -ve sentence)
- mood [-1,0,1] (interpretation of neutral emphasis (e.g., miiike; hello!!). -1 means neutral emphasis is interpreted as -ve; 1 means it is interpreted as +ve; 0 means emphasis is ignored)
- noMultiplePosWords (don’t allow multiple +ve words to increase +ve sentiment)
- noMultipleNegWords (don’t allow multiple -ve words to increase -ve sentiment)
- noIgnoreBoosterWordsAfterNegatives (don’t ignore boosters after negating words)
- noDictionary (don’t try to correct spellings using the dictionary by deleting duplicate letters from unknown words to make known words)
- noDeleteExtraDuplicateLetters (don’t delete extra duplicate letters in words even when they are impossible, e.g., heyyyy) [this option does not check if the new word is legal, in contrast to the above option]
- illegalDoubleLettersInWordMiddle [letters never duplicate in word middles] (a list of characters that never occur twice in succession in the middle of a word. For English the following list is used (default): ahijkquvxyz. Never include w in this list as it often occurs in www)
- illegalDoubleLettersAtWordEnd [letters never duplicate at word ends] (a list of characters that never occur twice in succession at the end of a word. For English the following list is used (default): achijkmnpqruvwxyz)
- noMultipleLetters (don’t use the presence of additional letters in a word to boost sentiment)
Improving the accuracy of SentiStrength(2)
Basic manual improvements
If you see a systematic pattern in the results, such as the term “disgusting” typically having a stronger or weaker sentiment strength in your texts than the one given by SentiStrength, you can edit the text files supplied with SentiStrength to change this. Please edit SentiStrength’s input files with a plain text editor, because SentiStrength may not be able to read a file after it has been edited with a word processor.
UC-27 Optimise sentiment strengths of existing sentiment terms
SentiStrength can suggest revised sentiment strengths for `EmotionLookupTable.txt` in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength then tries to adjust the `EmotionLookupTable.txt` term weights to be more accurate when classifying these texts. It should then also be more accurate when classifying similar texts.
The `optimise` command creates a new emotion lookup table with improved sentiment weights, based upon an input file with human-coded sentiment values for the texts. This feature allows SentiStrength term weights to be customised for new domains.
This is very slow (hours or days) if the input file is large (hundreds of thousands or millions of texts, respectively). The main optional parameter is `minImprovement` (default value 2). Set this to specify the minimum overall number of additional correct classifications needed to change a sentiment term weighting. For example, if increasing the sentiment strength of love from 3 to 4 improves the number of correctly classified texts from 500 to 502, this change is kept if `minImprovement` is 1 or 2 but rejected if `minImprovement` is greater than 2. Set this higher to make changes to the dictionary more robust. Higher settings are possible with larger input files.
To check the performance of the new dictionary, the file can be reclassified using it in place of the original `SentimentLookupTable.txt` (see UC-18).
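The acceptance rule behind `minImprovement` can be written as a one-line check. The function name `keep_change` is hypothetical; the numbers in the usage lines are the ones from the love example above.

```python
def keep_change(correct_before, correct_after, min_improvement=2):
    """Keep a term-weight change only if it adds enough correct classifications."""
    return correct_after - correct_before >= min_improvement


# Raising "love" from 3 to 4 lifts the correct count from 500 to 502.
print(keep_change(500, 502, min_improvement=1))  # True
print(keep_change(500, 502, min_improvement=2))  # True
print(keep_change(500, 502, min_improvement=3))  # False
```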
UC-28 Suggest new sentiment terms (from terms in misclassified texts)
SentiStrength can suggest a new set of terms to add to `EmotionLookupTable.txt` in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength then lists words not found in `EmotionLookupTable.txt` that may indicate sentiment. Adding some of these terms should make SentiStrength more accurate when classifying similar texts.
This option lists all terms in the data set and the proportion of times they occur in incorrectly classified positive or negative texts. Load the output into a spreadsheet and sort on PosClassAvDiff and NegClassAvDiff to identify terms that should be added to the sentiment dictionary because one of these two values is high. The option also lists words that are already in the sentiment dictionary. It must be used with a text file containing correct classifications.
This is very slow (hours or days) if the input file is large (tens of thousands or millions of texts, respectively).
Interpretation: In the output file, the column PosClassAvDiff is the average difference between the predicted sentiment score and the human-classified sentiment score for texts containing the word. For example, if the word “nasty” was in two texts and SentiStrength had classified both as (+1,-3) but the human classifiers had classified the texts as (+2,-3) and (+3,-5), then PosClassAvDiff would be the average of 2-1 (first text) and 3-1 (second text), which is 1.5. All the negative scores are ignored for PosClassAvDiff.
NegClassAvDiff is the same as PosClassAvDiff except that it uses the negative scores.
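The PosClassAvDiff calculation in the “nasty” example can be reproduced with a short sketch. The function name is hypothetical, and the sign convention (human score minus predicted score) is taken from the worked example.

```python
def pos_class_av_diff(predicted_pos, human_pos):
    """Average of (human - predicted) positive scores over texts containing a term."""
    diffs = [h - p for p, h in zip(predicted_pos, human_pos)]
    return sum(diffs) / len(diffs)


# "nasty" appears in two texts: predicted +1 and +1, human-coded +2 and +3.
print(pos_class_av_diff([1, 1], [2, 3]))  # 1.5
```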
UC-29 Machine learning evaluations
These are machine learning options to evaluate SentiStrength for academic research. The basic command is `train` (evaluate SentiStrength by training term strengths on the results in a file). An input file of 500+ human-classified texts is also needed.
This attempts to optimise the sentiment dictionary using a machine learning approach and 10-fold cross-validation. It is equivalent to using the command `optimise` on a random 90% of the data, then evaluating the results on the remaining 10%, and repeating this 9 more times with the remaining 9 sections of 10% of the data. The accuracy results reported are the average of the 10 attempts. This estimates the improved accuracy gained from using the `optimise` command to improve the sentiment dictionary.
The output is two files. The file ending in `_out.txt` reports various accuracy statistics (e.g., number and proportion correct; number and proportion within 1 of the correct value; correlation between SentiStrength and human-coded values). The file ending in `_out_termStrVars.txt` reports the changes to the sentiment dictionary in each of the folds. Both files also report the parameters used for the sentiment algorithm and machine learning. See the “What do the results mean?” section at the end for more information.
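The 90%/10% rotation described above is ordinary k-fold cross-validation. The sketch below shows only the data splitting, not the optimisation itself; the function name and the fixed random seed are illustrative assumptions.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k nearly equal sections
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


splits = list(k_fold_splits(range(100)))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In each round the term weights would be optimised on `train` and evaluated on `test`, and the ten accuracy figures averaged.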
Evaluation options
- all (test all option variations listed in Classification Algorithm Parameters above rather than use the default options)
- tot (optimise by the number of correct classifications rather than the sum of the classification differences)
- iterations [number of 10-fold iterations (default 1)] (sets the number of times that the training and evaluation is conducted. A value of 30 is recommended to help average out differences between runs)
- minImprovement [min extra correct class. to change sentiment weights (default 2)] (sets the minimum number of extra correct classifications necessary to adjust a term weight during the training phase)
- multi [# duplicate term strength optimisations to change sentiment weights (default 1)] (a kind of super-optimisation. Instead of being optimised once, term weights are optimised multiple times from the starting values, and then the average of these weights is taken, optimised and used as the final optimised term strengths. This should in theory give better values than optimising once)
Example: Using SentiStrength for 10-fold cross-validation
What is this? This estimates the accuracy of SentiStrength after it has optimised the term weights for the sentiment words (i.e., the values in the file `EmotionLookupTable.txt`).
What do I need for this test? You need an input file that is a list of texts with human-classified values for positive (1-5) and negative (1-5) sentiment. Each line of the file should be in the format:
Positive <tab> Negative <tab> text
How do I run the test? Run the `train` command described in UC-29, using your own file as the input file.
This should take up to one hour, and much longer for longer files. The output will be a list of accuracy statistics.
What does 10-fold cross-validation mean? See the k-fold section in http://en.wikipedia.org/wiki/Cross-validation_(statistics). Essentially, it means that the same data is used to identify the best sentiment strength values for the terms in `EmotionLookupTable.txt` as is used to evaluate the accuracy of the revised (trained) algorithm, but this isn’t cheating when it is done this way, because each fold is evaluated only on texts that were held out of the training for that fold.
The first line in the results file gives the accuracy of SentiStrength with the original term weights in `EmotionLookupTable.txt`.
What do the results mean? The easiest way to read the results is to copy and paste them into a spreadsheet like Excel. The table created lists the options used to classify the texts as well as the results. Here is an extract from the first two rows of the key results. It gives the total number correct for positive sentiment (Pos Correct) and the proportion correct (Pos Correct/Total). It also reports the number of predictions that are correct or within 1 of being correct (Pos Within1). The same information is given for negative sentiment.
Pos Correct | Pos Correct/ Total | Neg Correct | Neg Correct/ Total | Pos Within1 | Pos Within1/ Total | Neg Within1 | Neg Within1/ Total |
---|---|---|---|---|---|---|---|
653 | 0.627281 | 754 | 0.724304 | 1008 | 0.9683 | 991 | 0.951969 |
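The columns in this extract can be recomputed from paired predictions and human codes. The sketch below uses made-up scores and a hypothetical function name. As a consistency check on the table itself, 653 correct at a proportion of 0.627281 implies roughly 1041 texts in total.

```python
def accuracy_stats(predicted, human):
    """Count exact matches and predictions within 1 of the human-coded value."""
    total = len(human)
    correct = sum(p == h for p, h in zip(predicted, human))
    within1 = sum(abs(p - h) <= 1 for p, h in zip(predicted, human))
    return {
        "correct": correct,
        "correct/total": correct / total,
        "within1": within1,
        "within1/total": within1 / total,
    }


# Made-up positive scores for four texts.
print(accuracy_stats([1, 2, 3, 5], [1, 3, 3, 2]))
```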
Here is another extract of the first two rows of the key results. It gives the correlation between the positive sentiment predictions and the human coded values for positive sentiment (Pos Corr) and the Mean Percentage Error (PosMPEnoDiv). The same information is given for negative sentiment.
Pos Corr | NegCorr | PosMPE | NegMPE | PosMPEnoDiv | NegMPEnoDiv |
---|---|---|---|---|---|
0.638382 | 0.61354 | Ignore this | Ignore this | 0.405379 | 0.32853 |
If you specified 30 iterations then there will be 31 rows, one for the header and 1 for each iteration. Take the average of the rows as the value to use.
[^1]: THELWALL M, BUCKLEY K, PALTOGLOU G, et al. Sentiment strength detection in short informal text[J]. Journal of the American Society for Information Science and Technology, 2010, 61(12): 2544-2558.
[^2]: http://sentistrength.wlv.ac.uk
[^3]: https://myspace.com/