
Slope One (4)

电脑杂谈  Published: 2017-01-30 21:02:01  Source: web aggregation

Note the Map-Side Join hint used above: because the predict table is very small, the join can be finished by a single map-only job, with no need to shuffle data to reducers. This step also lets the user's own movie_ids take part in the computation; since Hive does not support IN, those pairs cannot be filtered out, so the results are slightly biased. This prediction step could instead be done with a dedicated MapReduce job.
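The idea behind the map-side join can be sketched in plain Python (a toy analogue with made-up table contents, not the actual Hive job): the small predict side is held entirely in memory and each record of the large side is joined by a hash lookup, so no shuffle or reduce phase is needed.

```python
# Small side of the join: fits in every mapper's memory.
# (user_id, movie_id) -> predicted score; contents are illustrative.
predict = {("u1", "m3"): 4.2, ("u2", "m1"): 3.8}

# Large side: streamed record by record, as a mapper would see it.
large_rows = [("u1", "m3", "extra"), ("u2", "m9", "extra")]

def map_side_join(rows, small):
    # Map-only join: emit a joined record whenever the hash lookup hits.
    for user_id, movie_id, payload in rows:
        score = small.get((user_id, movie_id))
        if score is not None:
            yield user_id, movie_id, score, payload

joined = list(map_side_join(large_rows, predict))
```

In Hive the same effect comes from broadcasting the small table to every mapper; the sketch only shows why no reducer is required.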

Finally, a SELECT ... ORDER BY tells you which movies this user is likely to enjoy.

Conclusions:

1. Using MapReduce and moving the computation to the reduce side, avoiding the map-side merge, effectively speeds up training.

2. Slope One is a simple, easy-to-implement recommendation algorithm, and it supports incremental training.

3. Combining the two points above with a distributed key-value store such as BigTable, HyperTable, Voldemort, or Cassandra, real-time recommendation is entirely feasible (not to mention HBase).
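As a sketch of points 2 and 3, here is a minimal weighted Slope One in Python (toy data and names are assumptions, not the article's Hive pipeline). `add_user` folds one user's ratings into the (count, diff) tables, which is exactly the kind of small increment that could be applied against a distributed key-value store:

```python
from collections import defaultdict

class SlopeOne:
    """Minimal weighted Slope One. The (count, diff) tables can be
    updated incrementally, one user at a time."""

    def __init__(self):
        self.count = defaultdict(int)    # (i, j) -> co-rating count
        self.diff = defaultdict(float)   # (i, j) -> sum of (rate_i - rate_j)

    def add_user(self, user_ratings):
        # Incremental training: fold one user's {movie_id: rate} dict
        # into the pairwise tables.
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    self.count[(i, j)] += 1
                    self.diff[(i, j)] += ri - rj

    def predict(self, user_ratings, item):
        # Weighted Slope One prediction for an item the user has not rated.
        num = den = 0.0
        for j, rj in user_ratings.items():
            c = self.count.get((item, j), 0)
            if c:
                num += (self.diff[(item, j)] / c + rj) * c
                den += c
        return num / den if den else None

# Toy usage: two users' ratings, then a prediction for a third user.
model = SlopeOne()
model.add_user({"m1": 5.0, "m2": 3.0})
model.add_user({"m1": 4.0, "m2": 2.0, "m3": 1.0})
pred = model.predict({"m1": 5.0}, "m2")
```

The per-pair counters and difference sums are plain additive state, which is why an update can be pushed as an increment to a key-value store instead of re-running the whole training job.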

-----------------------------------------------------------------------------------------------------

Appendix: the MapReduce job plan generated by Hive.

hive> explain
    > INSERT OVERWRITE TABLE freq_diff
    > SELECT
    >   nf1.movie_id, nf2.movie_id, count(1), sum(nf1.rate - nf2.rate)/count(1)
    > FROM
    >   netflix nf1
    > JOIN
    >   netflix nf2 ON nf1.user_id = nf2.user_id
    > WHERE nf1.movie_id > nf2.movie_id
    > GROUP BY nf1.movie_id, nf2.movie_id;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF netflix nf1) (TOK_TABREF netflix nf2) (= (. (TOK_TABLE_OR_COL nf1) user_id) (. (TOK_TABLE_OR_COL nf2) user_id)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB freq_diff)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL nf1) movie_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL nf2) movie_id)) (TOK_SELEXPR (TOK_FUNCTION count 1)) (TOK_SELEXPR (/ (TOK_FUNCTION sum (- (. (TOK_TABLE_OR_COL nf1) rate) (. (TOK_TABLE_OR_COL nf2) rate))) (TOK_FUNCTION count 1)))) (TOK_WHERE (> (. (TOK_TABLE_OR_COL nf1) movie_id) (. (TOK_TABLE_OR_COL nf2) movie_id))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL nf1) movie_id) (. (TOK_TABLE_OR_COL nf2) movie_id))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        nf2
          TableScan
            alias: nf2
            Reduce Output Operator
              key expressions:
                    expr: user_id
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: user_id
                    type: string
              tag: 1
              value expressions:
                    expr: movie_id
                    type: string
                    expr: rate
                    type: double
        nf1
          TableScan
            alias: nf1
            Reduce Output Operator
              key expressions:
                    expr: user_id
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: user_id
                    type: string
              tag: 0
              value expressions:
                    expr: movie_id
                    type: string
                    expr: rate
                    type: double
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col2}
          outputColumnNames: _col0, _col2, _col4, _col6
          Filter Operator
            predicate:
                expr: (_col0 > _col4)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col4
                    type: string
                    expr: _col2
                    type: double
                    expr: _col6
                    type: double
              outputColumnNames: _col0, _col4, _col2, _col6
              Group By Operator
                aggregations:
                      expr: count(1)
                      expr: sum((_col2 - _col6))
                keys:
                      expr: _col0
                      type: string
                      expr: _col4
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://xxx:9000/user/zhoumin/hive-tmp/22895032/10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
              sort order: ++
              Map-reduce partition columns:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
              tag: -1
              value expressions:
                    expr: _col2
                    type: bigint
                    expr: _col3
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: sum(VALUE._col1)
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: string
                  expr: _col2
                  type: bigint
                  expr: (_col3 / _col2)
                  type: double
            outputColumnNames: _col0, _col1, _col2, _col3
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
                    expr: UDFToDouble(_col2)
                    type: double
                    expr: _col3
                    type: double
              outputColumnNames: _col0, _col1, _col2, _col3
              File Output Operator
                compressed: true
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: freq_diff

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: freq_diff
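For reference, the aggregation this plan computes — the co-rating count and the average rating difference per movie pair — can be written out in a few lines of Python over toy data (table contents assumed purely for illustration):

```python
from collections import defaultdict

# Toy ratings standing in for the netflix table: user_id -> {movie_id: rate}.
ratings = {
    "u1": {"m1": 5.0, "m2": 3.0},
    "u2": {"m1": 4.0, "m2": 2.0, "m3": 1.0},
}

# Same aggregation as the HiveQL self-join + GROUP BY: for each pair
# (i, j) with i > j, accumulate a count and a sum of rating differences.
count = defaultdict(int)
diff_sum = defaultdict(float)
for user_ratings in ratings.values():
    for i, ri in user_ratings.items():
        for j, rj in user_ratings.items():
            if i > j:  # mirrors WHERE nf1.movie_id > nf2.movie_id
                count[(i, j)] += 1
                diff_sum[(i, j)] += ri - rj

# freq_diff: (movie_i, movie_j) -> (count, average difference).
freq_diff = {p: (c, diff_sum[p] / c) for p, c in count.items()}
```

Grouping by user (the self-join key) is what the Stage-1 shuffle in the plan above does; Stage-2 then merges the partial counts and sums and performs the final division.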

Hello, may I ask what per-node configuration you used for the Hive approach? I have 19 nodes, each with 8 cores, 16 GB of RAM, and a 160 GB disk, with mapred.tasktracker.map.tasks.maximum = 7, mapred.tasktracker.reduce.tasks.maximum = 7, mapred.reduce.tasks = 126, and mapred.child.java.opts = -Xmx1024m. The training SQL gets to 74%, then falls back to around 68% and dies. Did you do any other optimization? Urgent!


This article is from 电脑杂谈; when reposting, please cite the article URL:
http://www.pc-fly.com/a/jisuanjixue/article-29635-4.html
