Next, create two tables: netflix holds the preprocessed Netflix training data, and freq_diff will hold the trained model matrix.
CREATE EXTERNAL TABLE netflix(
  movie_id STRING,
  user_id STRING,
  rate DOUBLE,
  rate_date STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/zhoumin/netflix-hive';

CREATE TABLE freq_diff (
  movie_id1 STRING,
  movie_id2 STRING,
  freq DOUBLE,
  diff DOUBLE
);
Okay, now run the training SQL:
INSERT OVERWRITE TABLE freq_diff
SELECT nf1.movie_id, nf2.movie_id,
       count(1),
       sum(nf1.rate - nf2.rate) / count(1)
FROM netflix nf1
JOIN netflix nf2 ON nf1.user_id = nf2.user_id
WHERE nf1.movie_id > nf2.movie_id
GROUP BY nf1.movie_id, nf2.movie_id;
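For intuition, here is a minimal Python sketch of the same computation the SQL performs: for every pair of movies, freq counts the users who rated both, and diff is the average rating difference. The toy ratings below are made up purely for illustration.

```python
from collections import defaultdict

# Toy ratings: user -> {movie_id: rate} (made-up data for illustration)
ratings = {
    "u1": {"m1": 5.0, "m2": 3.0},
    "u2": {"m1": 4.0, "m2": 2.0, "m3": 4.0},
}

# Accumulate, per movie pair, the co-rater count and the summed rating difference
pair_sum = defaultdict(float)
pair_cnt = defaultdict(int)
for user_rates in ratings.values():
    for m1, r1 in user_rates.items():
        for m2, r2 in user_rates.items():
            if m1 > m2:  # same dedup condition as WHERE nf1.movie_id > nf2.movie_id
                pair_cnt[(m1, m2)] += 1
                pair_sum[(m1, m2)] += r1 - r2

# freq_diff[(m1, m2)] = (freq, average diff), matching the GROUP BY aggregation
freq_diff = {pair: (pair_cnt[pair], pair_sum[pair] / pair_cnt[pair])
             for pair in pair_cnt}
print(freq_diff[("m2", "m1")])  # (2, -2.0): both users rated m2 two points below m1
```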
This SQL produces two MapReduce jobs, as the explain command shows. The first mainly performs the join and emits all the intermediate data on the reduce side. Hive adjusts the number of reducers automatically, but here it chose only 3, which ran quite slowly (over 9 hours). You can explicitly set a larger number of reducers; I set it to 160 and then re-ran the training SQL above.
set mapred.reduce.tasks=160;
Of the two jobs, the first took 33 min 35 s and the second 1 h 29 min 29 s, so training took about 2 hours in total, which is acceptable.
With training done, we can try out prediction. Suppose a user rated movie 1000 with 2 points: what will he rate other movies, and which movies will he like?
Okay, first some preparation:
CREATE TABLE predict(
  movie_id STRING,
  rate FLOAT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Create the input file from the shell, then load it:

echo "1000,2" > predict_data

LOAD DATA LOCAL INPATH './predict_data' OVERWRITE INTO TABLE predict;
Then the prediction itself can be run:
CREATE TABLE slopeone_result(
  movie_id STRING,
  freq DOUBLE,
  pref DOUBLE,
  rate DOUBLE
);

INSERT OVERWRITE TABLE slopeone_result
SELECT movie_id1 AS movie_id,
       sum(freq) AS freq,
       sum(freq * (diff + rate)) AS pref,
       sum(freq * (diff + rate)) / sum(freq) AS rate
FROM predict p
JOIN freq_diff fd ON fd.movie_id2 = p.movie_id
GROUP BY movie_id1;
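The SELECT above is the weighted Slope One prediction: for each candidate movie, the predicted rating is sum(freq * (diff + known_rate)) / sum(freq). A minimal Python sketch of the same arithmetic, with made-up model values (the movie IDs m500/m501 and their freq/diff numbers are hypothetical, chosen only to show the formula):

```python
# Hypothetical freq_diff rows: (movie_id1, movie_id2, freq, diff)
freq_diff = [
    ("m500", "1000", 10.0, 1.5),   # m500 rated on average 1.5 above movie 1000
    ("m501", "1000", 4.0, -0.5),   # m501 rated on average 0.5 below movie 1000
]
predict = {"1000": 2.0}  # the user's known rating, as loaded into table predict

pref = {}   # running sum(freq * (diff + known_rate)) per candidate movie
freq = {}   # running sum(freq) per candidate movie
for m1, m2, f, d in freq_diff:
    if m2 in predict:  # the JOIN ... ON fd.movie_id2 = p.movie_id
        freq[m1] = freq.get(m1, 0.0) + f
        pref[m1] = pref.get(m1, 0.0) + f * (d + predict[m2])

# Final predicted rating per candidate movie, matching the SELECT's rate column
rates = {m: pref[m] / freq[m] for m in pref}
print(rates)  # {'m500': 3.5, 'm501': 1.5}
```

A higher predicted rate means the user is more likely to enjoy the movie, so ordering slopeone_result by rate descending gives a recommendation list.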
This article is from 电脑杂谈 (pc-fly.com); when reposting, please cite the original URL:
http://www.pc-fly.com/a/jisuanjixue/article-29635-3.html