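The main goal of sentiment analysis is to identify a user's opinion of, or attitude toward, a thing or a person (attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons). The elements involved mainly include:
Holder (source) of attitude: the opinion holder
Target (aspect) of attitude: the object being evaluated
Type of attitude: the opinion expressed
From a set of types: like, love, hate, value, desire, etc.
Or (more commonly) weighted polarity: positive, negative, neutral, together with strength
Text containing the attitude: the evaluated text, usually a sentence or an entire document
Finer-grained analysis also covers the evaluated attributes, sentiment words (polarity words), evaluation collocations, and so on.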
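In practice, the sentiment analysis tasks we face usually fall into the following categories:
Simplest task: Is the attitude of this text positive or negative?
More complex: Rank the attitude of this text from 1 to 5
Advanced: Detect the target, source, or complex attitude types
The following sections use the simplest task as the running example.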
2) A Baseline Algorithm
This subsection uses sentiment analysis of movie reviews as an example to present a simple, practical sentiment analysis system. For details, see the following papers:
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. ACL, 271-278.
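The task is "Polarity detection: Is an IMDB movie review positive or negative?", and the dataset is "Polarity Data 2.0". The authors treat sentiment analysis as a classification task and split it into the following subtasks:
Tokenization: extract the main text; filter out times, phone numbers, and the like; preserve strings that begin with a capital letter; keep emoticons; split the text into words.
Feature Extraction: intuitively, we might expect adjectives to determine the sentiment of a text directly, but Pang and Lee's experiments show that using all words (unigrams) as features gives better sentiment classification results.
Negated sentences need special handling here. For example, "I didn't like this movie" and "I really like this movie" differ by only one unigram, yet they have completely different meanings. To handle this case, Das and Chen (2001) proposed the rule "Add NOT_ to every word between negation and following punctuation". Under this rule, the fragment "didn't like this movie , but I" becomes "didn't NOT_like NOT_this NOT_movie , but I".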
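The tokenization, negation-marking, and unigram steps can be sketched in a few lines of Python. This is only a minimal illustration, not the code used by Pang and Lee or by Das and Chen: the regular expression, the list of negation cues, the set of scope-ending punctuation marks, and the function names tokenize, mark_negation, and unigram_features are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed negation cues; the rule itself does not enumerate them.
NEGATION_CUES = {"not", "no", "never"}
# Assumed set of punctuation marks that close the negation scope.
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

# A word (optionally with one internal apostrophe) or a single punctuation mark.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?|[.,;:!?]")


def tokenize(text):
    """Split text into word and punctuation tokens (a deliberate simplification)."""
    return TOKEN_RE.findall(text)


def is_negation(token):
    """Return True for cue words such as "not" and for contractions ending in n't."""
    lowered = token.lower()
    return lowered in NEGATION_CUES or lowered.endswith(("n't", "n’t"))


def mark_negation(tokens):
    """Prefix NOT_ to every token between a negation cue and the next punctuation."""
    marked = []
    in_scope = False
    for token in tokens:
        if token in PUNCTUATION:
            in_scope = False          # punctuation ends the negation scope
            marked.append(token)
        elif in_scope:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            if is_negation(token):
                in_scope = True       # start marking from the next token
    return marked


def unigram_features(tokens):
    """Count every non-punctuation token (all words, not only adjectives)."""
    return Counter(t for t in tokens if t not in PUNCTUATION)


print(" ".join(mark_negation(tokenize("didn't like this movie , but I"))))
# prints: didn't NOT_like NOT_this NOT_movie , but I
```

With this marking, "I didn't like this movie" contributes the feature NOT_like rather than like, so the two example sentences no longer share their sentiment-bearing unigram.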