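Predicting PV with SparkML

Background

The company needs to forecast daily website traffic for an upcoming period from the daily traffic of a recent period, so that the operations and ops teams can be warned and can prepare before a traffic peak arrives.

This is a typical "time-series forecasting problem". There are many approaches to it: rule-based methods, machine learning, traditional statistical modeling, and so on.

This article covers the machine-learning approach.

Since I mainly use the Spark stack for data processing at work, I chose SparkML here. There are of course many machine-learning packages and libraries, and sklearn would work just as well. In fact, for the data-analysis stage I used pandas, numpy, and sklearn, which was more efficient.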

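Data analysis

The raw data is simple, with only two columns: PV and date.

Figure: sample of the data

Plot it as a line chart and take a look:

Figure: line chart

The chart shows a large overall difference before and after 2019-07, which was in fact caused by a business adjustment. Since the goal is to forecast PV for the next few days, the forecast has to be based on the current business. Data from too far back is just noise, so it is discarded.

Take the most recent six months of data and look again:

Figure: data for the most recent six months

This data is relatively stable.

Overall, the data is periodic with a period of one week, and weekdays have somewhat higher PV than weekends.

Locally, holidays are peaks (but not every holiday is a peak; this again depends on the specific business, so you need to compile a holiday table for your own business).

There are also peaks on non-holidays, possibly caused by trending events (in February 2020, during the epidemic, there were many trending topics). Traffic peaks caused by trending events cannot be predicted, so we try to reduce the influence of such samples. The data-processing step later therefore includes "removing hot spots".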

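Model selection

Linear regression is chosen here as the machine-learning model. This is not because it is optimal, but because it is a natural first choice for trend forecasting and can serve as a baseline. Other models can be tried on top of it later.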

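Feature extraction

1. Time features

From the analysis above, weekends, weekdays, and holidays have a large effect on PV, so these values can be used as features:

day_of_week // day of the week, values 1-7 (with Spark SQL's dayofweek, 1 = Sunday ... 7 = Saturday)
is_weekend // whether the day is a weekend, 0 or 1; Saturday and Sunday are the weekend
is_holiday // whether the day is a holiday, 0 or 1; the holiday list is maintained according to the business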
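
To make the three time features concrete, here is a minimal plain-Java sketch (no Spark) that derives them from a date. The class name TimeFeatures and the one-entry holiday set are assumptions for illustration only; the numbering follows Spark SQL's dayofweek convention (1 = Sunday ... 7 = Saturday) so that it lines up with the sample tables later in the article.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Set;

final class TimeFeatures {
    // Spark-style day of week: 1 = Sunday, 2 = Monday, ..., 7 = Saturday.
    // java.time uses 1 = Monday ... 7 = Sunday, hence the conversion.
    static int dayOfWeek(LocalDate day) {
        return day.getDayOfWeek().getValue() % 7 + 1;
    }

    // 1 if the day is Saturday or Sunday, otherwise 0.
    static int isWeekend(LocalDate day) {
        DayOfWeek d = day.getDayOfWeek();
        return (d == DayOfWeek.SATURDAY || d == DayOfWeek.SUNDAY) ? 1 : 0;
    }

    // 1 if the day is in the business-maintained holiday table, otherwise 0.
    static int isHoliday(LocalDate day, Set<LocalDate> holidays) {
        return holidays.contains(day) ? 1 : 0;
    }

    public static void main(String[] args) {
        // Hypothetical holiday table; a real one is maintained per business.
        Set<LocalDate> holidays = Set.of(LocalDate.parse("2020-05-01"));
        for (String s : new String[]{"2019-11-15", "2019-11-16", "2020-05-01"}) {
            LocalDate day = LocalDate.parse(s);
            System.out.println(s + " day_of_week=" + dayOfWeek(day)
                    + " is_weekend=" + isWeekend(day)
                    + " is_holiday=" + isHoliday(day, holidays));
        }
    }
}
```

Keeping the holiday table as an explicit input mirrors the point made earlier: which days count as holidays is a business decision, not a calendar fact.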

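2. Mean features

Since the data is periodic:
Monday's PV is related to the average over all Mondays
Tuesday's PV is related to the average over all Tuesdays
...
So the average for each day_of_week can be used as a feature.

Likewise, weekends and holidays have similar mean features.

day_of_week_avg // group by day_of_week, take the mean
is_weekend_avg // group by is_weekend, take the mean
is_holiday_avg // group by is_holiday, take the mean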

3. Median features

Similar to the mean features, there can be median features:

day_of_week_med // group by day_of_week, take the median
is_weekend_med // group by is_weekend, take the median
is_holiday_med // group by is_holiday, take the median
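
The grouped mean and median features can be sketched in plain Java as follows. This is only an illustration on made-up numbers: the class name GroupStats is hypothetical, and the median here is an exact lower median, whereas the article's own SQL uses the approximate percentile_approx function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupStats {
    // Groups values by key, e.g. key = day_of_week and value = pv.
    static Map<Integer, List<Long>> groupBy(int[] keys, long[] values) {
        Map<Integer, List<Long>> groups = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            groups.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        return groups;
    }

    // Arithmetic mean of one group.
    static double mean(List<Long> xs) {
        double sum = 0;
        for (long x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    // Lower median: the middle element of the sorted values
    // (the lower of the two middle elements when the size is even).
    static long median(List<Long> xs) {
        List<Long> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        return sorted.get((sorted.size() - 1) / 2);
    }

    public static void main(String[] args) {
        // Toy data, not the article's: day_of_week keys with made-up pv values.
        int[] dayOfWeek = {6, 7, 1, 6, 7, 1, 6};
        long[] pv = {100, 60, 50, 120, 80, 70, 500};
        groupBy(dayOfWeek, pv).forEach((k, v) ->
                System.out.println("day_of_week=" + k + " avg=" + mean(v) + " med=" + median(v)));
    }
}
```

In the toy data the value 500 pulls the mean of group 6 far above its median, which is one reason to keep both kinds of feature.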

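4. Lag (shift) features

The mean and median features reflect the overall picture. In practice, the PV on a given day likely depends on the PV of the most recent N days.

Which N should be used? That takes some experimentation. Here N ranges from 1 to 14, which gives a set of features:

lag_1 // shifted by 1 day, i.e. yesterday's PV
lag_2 // shifted by 2 days, i.e. the PV two days ago
...
lag_7 // shifted by 7 days, the PV on the same weekday last week
...
lag_14 // shifted by 14 days, the PV on the same weekday two weeks ago

What the data looks like after shifting: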

+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|      pv|       day|   lag_1|   lag_2|   lag_3|   lag_4|   lag_5|   lag_6|   lag_7|   lag_8|   lag_9|  lag_10|  lag_11|  lag_12|  lag_13|  lag_14|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|15156440|2019-11-01|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|12633297|2019-11-02|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|11818845|2019-11-03|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|15130911|2019-11-04|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|14332734|2019-11-05|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|15972959|2019-11-06|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|    null|
|16366371|2019-11-07|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|    null|
|16969708|2019-11-08|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|    null|
|12983425|2019-11-09|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|    null|
|11759009|2019-11-10|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|    null|
|13700888|2019-11-11|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|    null|
|15490684|2019-11-12|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|    null|
|15275479|2019-11-13|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|    null|
|14978239|2019-11-14|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|    null|
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|
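
The shifting shown in the table above amounts to the following, sketched in plain Java on the first four PV values. The class name LagFeatures is an assumption for illustration; the article itself builds these columns with the SQL lag window function.

```java
import java.util.Arrays;

final class LagFeatures {
    // Returns the series shifted by n positions; the first n entries are null
    // because no earlier observation exists for them.
    static Long[] lag(long[] pv, int n) {
        Long[] out = new Long[pv.length];
        for (int i = n; i < pv.length; i++) {
            out[i] = pv[i - n];
        }
        return out;
    }

    public static void main(String[] args) {
        // First four pv values from the table above.
        long[] pv = {15156440L, 12633297L, 11818845L, 15130911L};
        System.out.println("lag_1 = " + Arrays.toString(lag(pv, 1)));
        System.out.println("lag_2 = " + Arrays.toString(lag(pv, 2)));
    }
}
```

The leading nulls are why the first 14 days cannot be used as training samples once all 14 lag columns are required.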

Feature processing code:

        Dataset<Row> df = spark.read().schema(schema).option("header", "false").csv("file:///Users/sun/Downloads/pv_data.csv");
        df.createOrReplaceTempView("tmp");

        df = spark.sql("select * from tmp where day>='2019-11-01'");
        df.createOrReplaceTempView("tmp");

        // Append the dates to be predicted (as placeholder rows with pv = 0):
        int predDays = 7;
        String lastDay = spark.sql("select max(day) as day from tmp").first().getAs("day");
        Date lastDate = DateUtils.parseDate(lastDay, new String[]{"yyyy-MM-dd"});
        String sql = "select pv, day from tmp";
        for (int i=0; i<predDays; i++) {
            Date date = DateUtils.addDays(lastDate, (i + 1));
            String day = new SimpleDateFormat("yyyy-MM-dd").format(date);
            sql += " union (select 0, '" + day + "' from tmp limit 1)";
        }
        sql += " order by day asc";
        df = spark.sql(sql);
        df.createOrReplaceTempView("tmp");

        // Lag features:
        int lagStart = 1;
        int lagEnd = 14;
        sql = "select *, ";
        for (int i=lagStart; i<=lagEnd; i++) {
            sql += "lag(pv, " + i + ") over (partition by null order by day) as lag_" + i;
            if (i <= lagEnd - 1)
                sql += ", ";
        }
        sql += " from tmp";
        df = spark.sql(sql);
        df.createOrReplaceTempView("tmp");

        // Time features:
        sql = "select *, " +
                "dayofweek(day) as day_of_week, " +
                "case when dayofweek(day)==1 or dayofweek(day)==7 then 1 else 0 end as is_weekend, " +
                "case when day in (" + Arrays.asList(holidays.split(",")).stream().map(s -> "'" + s + "'").collect(Collectors.joining(",")) + ") then 1 else 0 end as is_holiday " +
                "from tmp";
        df = spark.sql(sql);
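        df.createOrReplaceTempView("tmp");

        // Mean features (the averages are computed over tmp, which still contains the pv = 0 placeholder rows):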
        sql = "select tmp.*, t1.day_of_week_avg, t2.is_weekend_avg, t3.is_holiday_avg from tmp " +
                "left join (select day_of_week, avg(pv) as day_of_week_avg from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
                "left join (select is_weekend, avg(pv) as is_weekend_avg from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
                "left join (select is_holiday, avg(pv) as is_holiday_avg from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
        df = spark.sql(sql);
        df.createOrReplaceTempView("tmp");

        // Median features:
        sql = "select tmp.*, t1.day_of_week_med, t2.is_weekend_med, t3.is_holiday_med from tmp " +
                "left join (select day_of_week, percentile_approx(pv, 0.5) as day_of_week_med from tmp group by day_of_week) as t1 on tmp.day_of_week = t1.day_of_week " +
                "left join (select is_weekend, percentile_approx(pv, 0.5) as is_weekend_med from tmp group by is_weekend) as t2 on tmp.is_weekend = t2.is_weekend " +
                "left join (select is_holiday, percentile_approx(pv, 0.5) as is_holiday_med from tmp group by is_holiday) as t3 on tmp.is_holiday = t3.is_holiday ";
        df = spark.sql(sql);
        df.createOrReplaceTempView("tmp");

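Removing hot spots (outlier handling)

As mentioned earlier, some samples are not holidays yet have very high PV, possibly because of trending events.

There are roughly two cases: 1. the operations team ran a campaign that caused a traffic spike; 2. trending social events (think Weibo hot search).

In fact, further data analysis shows that the main cause was PV fluctuation brought about indirectly by the "epidemic".

Unlike holidays, trending events follow no traceable pattern and are somewhat random and sudden. To keep things simple, we adopt a strategy for handling these outliers.

The strategy used here: if the PV of a non-holiday is more than 1.5 times the median for its day of the week, replace it with that median. The code is as follows: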

        // Outlier handling:
        // A non-holiday whose traffic exceeds 1.5 times the median is treated as an outlier (possibly caused by a trending event) and is replaced with the median
        df = spark.sql("select *, " +
                "if(is_holiday=0 and pv>day_of_week_med*1.5, day_of_week_med, pv) as y " +
                "from tmp order by day asc");
        df = df.na().drop();
        df.createOrReplaceTempView("tmp");

        // Lag features: treat 0 as a missing value and replace it with day_of_week_avg
        sql = "select *, ";
        for (int i=lagStart; i<=lagEnd; i++) {
            sql += "case when lag_" + i + ">0 then lag_"+i + " else day_of_week_avg end as lag_" + i + "_fix";
            if (i <= lagEnd - 1)
                sql += ", ";
        }
        sql += " from tmp";
        df = spark.sql(sql);
        df.createOrReplaceTempView("tmp");
        // Save the data:
        df.write().option("header", "true").csv("file:///Users/sun/Downloads/df");

Sample of the resulting data:

+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|      pv|       day|   lag_1|   lag_2|   lag_3|   lag_4|   lag_5|   lag_6|   lag_7|   lag_8|   lag_9|  lag_10|  lag_11|  lag_12|  lag_13|  lag_14|day_of_week|is_weekend|is_holiday|     day_of_week_avg|      is_weekend_avg|      is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med|       y|           lag_1_fix|           lag_2_fix|           lag_3_fix|           lag_4_fix|           lag_5_fix|           lag_6_fix|  lag_7_fix|  lag_8_fix|  lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+--------------------+---------------+--------------+--------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|          6|         0|         0|2.2269334259259257E7|2.1140308681818184E7| 2.047994621387283E7|       18144580|      17914823|      17128256|16900067|         1.4978239E7|         1.5275479E7|         1.5490684E7|         1.3700888E7|         1.1759009E7|         1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|          7|         1|         0|2.2007671444444444E7|2.1197627611111112E7| 2.047994621387283E7|       15728601|      15623119|      17128256|15668745|         1.6900067E7|         1.4978239E7|         1.5275479E7|         1.5490684E7|         1.3700888E7|         1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|          1|         1|         0|2.0387583777777776E7|2.1197627611111112E7| 2.047994621387283E7|       15245430|      15623119|      17128256|15102373|         1.5668745E7|         1.6900067E7|         1.4978239E7|         1.5275479E7|         1.5490684E7|         1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|          2|         0|         0|1.9976350222222224E7|2.1140308681818184E7| 2.047994621387283E7|       16614896|      17914823|      17128256|16475787|         1.5102373E7|         1.5668745E7|         1.6900067E7|         1.4978239E7|         1.5275479E7|         1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|          3|         0|         0|2.0061554769230768E7|2.1140308681818184E7| 2.047994621387283E7|       17121601|      17914823|      17128256|16946753|         1.6475787E7|         1.5102373E7|         1.5668745E7|         1.6900067E7|         1.4978239E7|         1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|
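
The two rules applied in this step (capping non-holiday spikes at the day-of-week median, and filling zero lags with the day-of-week mean) can be written as two small functions. This is a plain-Java sketch with a hypothetical class name OutlierCap; the article applies the same rules in SQL.

```java
final class OutlierCap {
    // Label y: a non-holiday pv above 1.5x its day-of-week median is replaced
    // by that median; every other pv is kept unchanged.
    static double capHotspot(double pv, double dayOfWeekMedian, boolean isHoliday) {
        if (!isHoliday && pv > dayOfWeekMedian * 1.5) {
            return dayOfWeekMedian;
        }
        return pv;
    }

    // Lag fix: a non-positive lag (the 0 placeholder of a future day)
    // falls back to the day-of-week mean.
    static double fixLag(double lag, double dayOfWeekAvg) {
        return lag > 0 ? lag : dayOfWeekAvg;
    }

    public static void main(String[] args) {
        double fridayMedian = 18144580; // day_of_week_med for day_of_week = 6 in the sample above
        // The spike value 30000000 is made up for illustration.
        System.out.println(capHotspot(30000000, fridayMedian, false)); // capped to the median
        System.out.println(capHotspot(30000000, fridayMedian, true));  // holiday, kept as-is
        System.out.println(capHotspot(16900067, fridayMedian, false)); // ordinary day, kept as-is
        System.out.println(fixLag(0, 2.2269334259259257E7));           // placeholder, replaced by the mean
    }
}
```

Note that only the label y is capped; the raw pv column is left untouched, so the lag features still carry the original spike values.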

Model training

Normalization

The lag_*, *_avg, and *_med features are PV values on the order of tens of millions, so they are rescaled:

        // Normalization: rescale the lag_*_fix, *_avg, and *_med features
        // (the lag_*_fix columns are used so that future rows do not carry the 0 placeholder)
        VectorAssembler vectorAssembler = new VectorAssembler()
                .setInputCols(new String[]{"lag_1_fix", "lag_2_fix", "lag_3_fix", "lag_4_fix", "lag_5_fix", "lag_6_fix", "lag_7_fix", "lag_8_fix", "lag_9_fix", "lag_10_fix", "lag_11_fix", "lag_12_fix", "lag_13_fix", "lag_14_fix", "day_of_week_avg", "is_weekend_avg", "is_holiday_avg", "day_of_week_med", "is_weekend_med", "is_holiday_med"})
                .setOutputCol("feature_vec");
        df = vectorAssembler.transform(df);
        MinMaxScaler scaler = new MinMaxScaler()
                .setInputCol("feature_vec")
                .setOutputCol("features"); // the training and prediction queries read this column
        df = scaler.fit(df).transform(df);
        df.createOrReplaceTempView("tmp"); // make the new columns visible to later SQL

VectorAssembler combines Dataset columns into a single Vector column (the algorithm APIs used later require a vector as input).
MinMaxScaler scales each feature to the [0, 1] range.

Result:

+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
|      pv|       day|   lag_1|   lag_2|   lag_3|   lag_4|   lag_5|   lag_6|   lag_7|   lag_8|   lag_9|  lag_10|  lag_11|  lag_12|  lag_13|  lag_14|day_of_week|is_weekend|is_holiday|     day_of_week_avg|      is_weekend_avg|     is_holiday_avg|day_of_week_med|is_weekend_med|is_holiday_med|       y|  lag_1_fix|  lag_2_fix|  lag_3_fix|  lag_4_fix|  lag_5_fix|  lag_6_fix|  lag_7_fix|  lag_8_fix|  lag_9_fix| lag_10_fix| lag_11_fix| lag_12_fix| lag_13_fix| lag_14_fix|         feature_vec|            features|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+
|16900067|2019-11-15|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|15156440|          6|         0|         0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7|       18144580|      17914823|      17128256|16900067|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7| 1.515644E7|[1.4978239E7,1.52...|[0.02878242285919...|
|15668745|2019-11-16|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|12633297|          7|         1|         0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7|       15728601|      15623119|      17128256|15668745|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|1.2633297E7|[1.6900067E7,1.49...|[0.05339190949495...|
|15102373|2019-11-17|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|11818845|          1|         1|         0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7|       15245430|      15623119|      17128256|15102373|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|1.1818845E7|[1.5668745E7,1.69...|[0.03762452432660...|
|16475787|2019-11-18|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|15130911|          2|         0|         0|1.9976350222222224E7|2.1140308681818184E7|2.047994621387283E7|       16614896|      17914823|      17128256|16475787|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|1.5130911E7|[1.5102373E7,1.56...|[0.03037198967476...|
|16946753|2019-11-19|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|14332734|          3|         0|         0|2.0061554769230768E7|2.1140308681818184E7|2.047994621387283E7|       17121601|      17914823|      17128256|16946753|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|1.4332734E7|[1.6475787E7,1.51...|[0.04795889832547...|
|17422016|2019-11-20|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|15972959|          4|         0|         0| 2.172384296153846E7|2.1140308681818184E7|2.047994621387283E7|       17928108|      17914823|      17128256|17422016|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|1.5972959E7|[1.6946753E7,1.64...|[0.05398973536338...|
|18010112|2019-11-21|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|16366371|          5|         0|         0|2.1671804769230768E7|2.1140308681818184E7|2.047994621387283E7|       17962984|      17914823|      17128256|18010112|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|1.6366371E7|[1.7422016E7,1.69...|[0.06007559655750...|
|17935725|2019-11-22|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|16969708|          6|         0|         0|2.2269334259259257E7|2.1140308681818184E7|2.047994621387283E7|       18144580|      17914823|      17128256|17935725|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|1.6969708E7|[1.8010112E7,1.74...|[0.06760631244495...|
|15623119|2019-11-23|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|12983425|          7|         1|         0|2.2007671444444444E7|2.1197627611111112E7|2.047994621387283E7|       15728601|      15623119|      17128256|15623119|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|1.2983425E7|[1.7935725E7,1.80...|[0.06665376836589...|
|14637174|2019-11-24|15623119|17935725|18010112|17422016|16946753|16475787|15102373|15668745|16900067|14978239|15275479|15490684|13700888|11759009|          1|         1|         0|2.0387583777777776E7|2.1197627611111112E7|2.047994621387283E7|       15245430|      15623119|      17128256|14637174|1.5623119E7|1.7935725E7|1.8010112E7|1.7422016E7|1.6946753E7|1.6475787E7|1.5102373E7|1.5668745E7|1.6900067E7|1.4978239E7|1.5275479E7|1.5490684E7|1.3700888E7|1.1759009E7|[1.5623119E7,1.79...|[0.03704027202242...|
+--------+----------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+-----------+----------+----------+--------------------+--------------------+-------------------+---------------+--------------+--------------+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------------+

features is the processed feature column.
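
For readers unfamiliar with min-max scaling, the per-column arithmetic is sketched below in plain Java. The class name MinMax and the toy numbers are assumptions for illustration. The constant-column case returns 0.5, which is the midpoint rule that Spark's MinMaxScaler documents for the default [0, 1] range.

```java
import java.util.Arrays;

final class MinMax {
    // Rescales one feature column to [0, 1]: (x - min) / (max - min).
    // A constant column has no spread, so every value maps to the midpoint 0.5.
    static double[] scale(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = (max == min) ? 0.5 : (column[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy column, not the article's data.
        double[] lag1 = {10.0, 20.0, 30.0};
        System.out.println(Arrays.toString(scale(lag1)));
    }
}
```

Because the minimum and maximum are taken over the whole column, a single extreme value compresses all other values toward 0, which is consistent with the small scaled values visible in the features column above.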

Training

        // Training: train on the data up to and including lastDay
        Dataset<Row> trainDataset = spark.sql("select day, features, pv, y from tmp where day<='" + lastDay + "' order by day asc");
        double maxR2 = Double.NEGATIVE_INFINITY; // so that the first model always becomes the initial best
        double bestParam = 0.0D;
        LinearRegressionModel bestModel = null;
        // Search for the best parameter:
        for (int i=1; i<=10; i++) {
            LinearRegression lr = new LinearRegression()
                    .setLabelCol("y")
                    .setFeaturesCol("features")
                    .setMaxIter(10000)
                    .setRegParam(0.03) // regularization strength
                    .setElasticNetParam(0.1 * i);
            LinearRegressionModel model = lr.fit(trainDataset);
            LinearRegressionTrainingSummary trainingSummary = model.summary();
            System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
            System.out.println("r2: " + trainingSummary.r2());
            if(trainingSummary.r2() > maxR2) {
                bestParam = 0.1 * i;
                maxR2 = trainingSummary.r2();
                bestModel = model;
            }
        }
        System.out.println("best param -> " + bestParam);
        System.out.println("best r2 -> " + maxR2);

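LinearRegression is used here, and the main parameter being tuned is setElasticNetParam. See the documentation for detailed parameter descriptions.

The search runs from 0.1 to 1.0 and looks for the value that gives the highest r2 on the training set; the model at that value is taken as the best model.

In the end, elasticnet = 0.5 was best, with an r2 of 0.7008858790650143.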
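
For reference, the two metrics printed during the search can be computed by hand as below. This is a plain-Java sketch using the usual textbook definitions of RMSE and r2, with a hypothetical class name RegressionMetrics and toy numbers. It is not Spark's implementation.

```java
final class RegressionMetrics {
    // Root mean squared error: sqrt(mean((y - pred)^2)).
    static double rmse(double[] y, double[] pred) {
        double sumSq = 0;
        for (int i = 0; i < y.length; i++) {
            double e = y[i] - pred[i];
            sumSq += e * e;
        }
        return Math.sqrt(sumSq / y.length);
    }

    // Coefficient of determination: 1 - SS_res / SS_tot,
    // where SS_tot is the squared deviation of y from its own mean.
    static double r2(double[] y, double[] pred) {
        double mean = 0;
        for (double v : y) {
            mean += v;
        }
        mean /= y.length;
        double ssRes = 0;
        double ssTot = 0;
        for (int i = 0; i < y.length; i++) {
            ssRes += (y[i] - pred[i]) * (y[i] - pred[i]);
            ssTot += (y[i] - mean) * (y[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        // Toy labels and predictions, not the article's data.
        double[] y = {1.0, 2.0, 3.0};
        double[] alwaysMean = {2.0, 2.0, 2.0};
        System.out.println("rmse=" + rmse(y, alwaysMean) + " r2=" + r2(y, alwaysMean));
    }
}
```

A model that always predicts the mean scores an r2 of 0, so the reported 0.70 means the features explain a substantial share of the variance in the training labels. It says nothing yet about held-out days.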

Prediction

Predict the data for the next 7 days:

        Dataset<Row> predDataset = spark.sql("select day, features, pv, y from tmp where day>'" + lastDay + "' order by day asc");
        bestModel.setPredictionCol("pv_pred");
        bestModel.transform(predDataset).show();

The results:

+----------+--------------------+---+---+--------------------+
|       day|            features| pv|  y|             pv_pred|
+----------+--------------------+---+---+--------------------+
|2020-04-28|[0.03159798985245...|  0|  0|1.7333708553490087E7|
|2020-04-29|[0.11516156320975...|  0|  0|1.7833363920196097E7|
|2020-04-30|[0.11449520118456...|  0|  0|1.7624262847742468E7|
|2020-05-01|[0.12214671526351...|  0|  0|3.6077728160918914E7|
|2020-05-02|[0.11879605768944...|  0|  0| 1.518647529881512E7|
|2020-05-03|[0.09805043124337...|  0|  0|1.5407320504048364E7|
|2020-05-04|[0.09278448304737...|  0|  0|  3.56043256732697E7|
+----------+--------------------+---+---+--------------------+

May 1 and May 4 are holidays, and traffic peaks are expected on those two days.

References

Summary of time-series forecasting methods
SparkML documentation
