
Slope One (4)

电脑杂谈  Published: 2017-01-30 21:02:01  Source: web aggregation

Note the Map-Side Join hint used above: because the predict table is very small, the join can be finished by a single map-only job, with no need to shuffle data to reducers. This step also lets the user's own movie_ids take part in the computation; since Hive does not support IN, those pairs cannot be filtered out, so the results are slightly biased. This prediction step could instead be done with a dedicated MapReduce job.
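The idea behind the map-side join can be sketched in plain Python (a toy analogue with made-up table contents, not the actual Hive job): the small predict side is held entirely in memory and each record of the large side is joined by a hash lookup, so no shuffle or reduce phase is needed.

```python
# Small side of the join: fits in every mapper's memory.
# (user_id, movie_id) -> predicted score; contents are illustrative.
predict = {("u1", "m3"): 4.2, ("u2", "m1"): 3.8}

# Large side: streamed record by record, as a mapper would see it.
large_rows = [("u1", "m3", "extra"), ("u2", "m9", "extra")]

def map_side_join(rows, small):
    # Map-only join: emit a joined record whenever the hash lookup hits.
    for user_id, movie_id, payload in rows:
        score = small.get((user_id, movie_id))
        if score is not None:
            yield user_id, movie_id, score, payload

joined = list(map_side_join(large_rows, predict))
```

In Hive the same effect comes from broadcasting the small table to every mapper; the sketch only shows why no reducer is required.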

Finally, a SELECT ... ORDER BY tells you which movies this user is likely to enjoy.

Conclusions:

1. Using MapReduce and moving the computation to the reduce side, avoiding the map-side merge, effectively speeds up training.

2. Slope One is a simple, easy-to-implement recommendation algorithm, and it supports incremental training.

3. Combining the two points above with a distributed key-value store such as BigTable, HyperTable, Voldemort, or Cassandra, real-time recommendation is entirely feasible (not to mention HBase).
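As a sketch of points 2 and 3, here is a minimal weighted Slope One in Python (toy data and names are assumptions, not the article's Hive pipeline). `add_user` folds one user's ratings into the (count, diff) tables, which is exactly the kind of small increment that could be applied against a distributed key-value store:

```python
from collections import defaultdict

class SlopeOne:
    """Minimal weighted Slope One. The (count, diff) tables can be
    updated incrementally, one user at a time."""

    def __init__(self):
        self.count = defaultdict(int)    # (i, j) -> co-rating count
        self.diff = defaultdict(float)   # (i, j) -> sum of (rate_i - rate_j)

    def add_user(self, user_ratings):
        # Incremental training: fold one user's {movie_id: rate} dict
        # into the pairwise tables.
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    self.count[(i, j)] += 1
                    self.diff[(i, j)] += ri - rj

    def predict(self, user_ratings, item):
        # Weighted Slope One prediction for an item the user has not rated.
        num = den = 0.0
        for j, rj in user_ratings.items():
            c = self.count.get((item, j), 0)
            if c:
                num += (self.diff[(item, j)] / c + rj) * c
                den += c
        return num / den if den else None

# Toy usage: two users' ratings, then a prediction for a third user.
model = SlopeOne()
model.add_user({"m1": 5.0, "m2": 3.0})
model.add_user({"m1": 4.0, "m2": 2.0, "m3": 1.0})
pred = model.predict({"m1": 5.0}, "m2")
```

The per-pair counters and difference sums are plain additive state, which is why an update can be pushed as an increment to a key-value store instead of re-running the whole training job.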

-----------------------------------------------------------------------------------------------------

Appendix: the MapReduce job plan generated by Hive.

hive> explain
    > INSERT OVERWRITE TABLE freq_diff
    > SELECT
    >   nf1.movie_id, nf2.movie_id, count(1), sum(nf1.rate - nf2.rate)/count(1)
    > FROM
    >   netflix nf1
    > JOIN
    >   netflix nf2 ON nf1.user_id = nf2.user_id
    > WHERE nf1.movie_id > nf2.movie_id
    > GROUP BY nf1.movie_id, nf2.movie_id;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF netflix nf1) (TOK_TABREF netflix nf2) (= (. (TOK_TABLE_OR_COL nf1) user_id) (. (TOK_TABLE_OR_COL nf2) user_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB freq_diff)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL nf1) movie_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL nf2) movie_id)) (TOK_SELEXPR (TOK_FUNCTION count 1)) (TOK_SELEXPR (/ (TOK_FUNCTION sum (- (. (TOK_TABLE_OR_COL nf1) rate) (. (TOK_TABLE_OR_COL nf2) rate))) (TOK_FUNCTION count 1)))) (TOK_WHERE (> (. (TOK_TABLE_OR_COL nf1) movie_id) (. (TOK_TABLE_OR_COL nf2) movie_id))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL nf1) movie_id) (. (TOK_TABLE_OR_COL nf2) movie_id))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        nf2
          TableScan
            alias: nf2
            Reduce Output Operator
              key expressions:
                    expr: user_id
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: user_id
                    type: string
              tag: 1
              value expressions:
                    expr: movie_id
                    type: string
                    expr: rate
                    type: double
        nf1
          TableScan
            alias: nf1
            Reduce Output Operator
              key expressions:
                    expr: user_id
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: user_id
                    type: string
              tag: 0
              value expressions:
                    expr: movie_id
                    type: string
                    expr: rate
                    type: double
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col2}
          outputColumnNames: _col0, _col2, _col4, _col6
          Filter Operator
            predicate:
                expr: (_col0 > _col4)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col4
                    type: string
                    expr: _col2
                    type: double
                    expr: _col6
                    type: double
              outputColumnNames: _col0, _col4, _col2, _col6
              Group By Operator
                aggregations:
                      expr: count(1)
                      expr: sum((_col2 - _col6))
                keys:
                      expr: _col0
                      type: string
                      expr: _col4
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://xxx:9000/user/zhoumin/hive-tmp/22895032/10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
              sort order: ++
              Map-reduce partition columns:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
              tag: -1
              value expressions:
                    expr: _col2
                    type: bigint
                    expr: _col3
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: sum(VALUE._col1)
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: string
                  expr: _col2
                  type: bigint
                  expr: (_col3 / _col2)
                  type: double
            outputColumnNames: _col0, _col1, _col2, _col3
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
                    expr: UDFToDouble(_col2)
                    type: double
                    expr: _col3
                    type: double
              outputColumnNames: _col0, _col1, _col2, _col3
              File Output Operator
                compressed: true
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: freq_diff

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: freq_diff
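For reference, the aggregation this plan computes — the co-rating count and the average rating difference per movie pair — can be written out in a few lines of Python over toy data (table contents assumed purely for illustration):

```python
from collections import defaultdict

# Toy ratings standing in for the netflix table: user_id -> {movie_id: rate}.
ratings = {
    "u1": {"m1": 5.0, "m2": 3.0},
    "u2": {"m1": 4.0, "m2": 2.0, "m3": 1.0},
}

# Same aggregation as the HiveQL self-join + GROUP BY: for each pair
# (i, j) with i > j, accumulate a count and a sum of rating differences.
count = defaultdict(int)
diff_sum = defaultdict(float)
for user_ratings in ratings.values():
    for i, ri in user_ratings.items():
        for j, rj in user_ratings.items():
            if i > j:  # mirrors WHERE nf1.movie_id > nf2.movie_id
                count[(i, j)] += 1
                diff_sum[(i, j)] += ri - rj

# freq_diff: (movie_i, movie_j) -> (count, average difference).
freq_diff = {p: (c, diff_sum[p] / c) for p, c in count.items()}
```

Grouping by user (the self-join key) is what the Stage-1 shuffle in the plan above does; Stage-2 then merges the partial counts and sums and performs the final division.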

Hello, may I ask what per-node configuration you used for the Hive approach? I have 19 nodes, each with 8 cores, 16 GB of RAM, and a 160 GB disk, with mapred.tasktracker.map.tasks.maximum = 7, mapred.tasktracker.reduce.tasks.maximum = 7, mapred.reduce.tasks = 126, and mapred.child.java.opts = -Xmx1024m. The training SQL gets to 74%, then falls back to around 68% and dies. Did you do any other optimization? Urgent!


This article is from 电脑杂谈; when reposting, please cite the article URL:
http://www.pc-fly.com/a/jisuanjixue/article-29635-4.html
