Clickhouse(流量分析(三).路徑分析案例)

1. 路徑分析定義

神策用戶分析模型——路徑分析的使用方法

漏斗分析是固化了具體的分析過程或者業(yè)務環(huán)節(jié),然后分析幾個大的業(yè)務環(huán)節(jié)的轉化;而路徑分析,是固化了用戶的路徑順序,在每個路徑次序中,都包含了各個主要業(yè)務環(huán)節(jié),因此在每一步中,出現(xiàn)的業(yè)務環(huán)節(jié)很有可能都是類似的。漏斗分析看重的是業(yè)務環(huán)節(jié)之間的留存關系,而路徑分析看重的是用戶在不同業(yè)務環(huán)節(jié)中的順序及流失關系。

路徑分析:如何將用戶的網(wǎng)站行為軌跡可視化呈現(xiàn)?
BI分析系統(tǒng)——路徑分析及產(chǎn)品化 )

2. 關鍵路徑分析

  1. sequenceMatch函數(shù)檢查是否有事件鏈滿足輸入的模式
  2. sequenceCount函數(shù)則統(tǒng)計滿足輸入模式的事件鏈的數(shù)量(如果兩個事件發(fā)生在同一秒時,是無法準確區(qū)分事件的發(fā)生先后關系的,所以會存在一定的誤差)
sequenceCount(pattern)(timestamp, cond1, cond2, ...)

pattern支持3中匹配模式:

(?N):表示時間序列中的第N個事件,從1開始,最長支持32個條件輸入;如,(?1)對應的是cond1

(?t op secs):插入兩個事件之間,表示它們發(fā)生時需要滿足的時間條件(單位為秒),支持 >=, >, <, <= 。例如上述SQL中,(?1)(?t<=15)(?2)即表示事件1和2發(fā)生的時間間隔在15秒以內,期間可能會發(fā)生若干次非指定事件。

.*:表示任意的非指定事件。

create table  nonodb.adm_path_tf_demo1
(
    eventTime DateTime,
    uid UInt32,
    event_type String,
    expflag int
)
ENGINE = MergeTree
ORDER BY (eventTime,uid)
SETTINGS index_granularity = 8192;

-- 在10分鐘內完成瀏覽收藏操作.之后下單的人
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:16:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:16:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:26:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:26:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:36:00',1,'點擊',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:56:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:57:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:57:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:58:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',2,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',2,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:56:00',2,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',3,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',3,'收藏',1);

insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:15:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:24:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:56:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:56:00',2,'下單',1);
select uid,
      sequenceMatch('(?1)(?t<=600)(?2).*(?3)')(
          eventTime,
          event_type = '瀏覽',
          event_type = '收藏' and expflag =1 ,
          event_type = '下單'
          ) as is_match
       from nonodb.adm_path_tf_demo1
       where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02'
       and uid in (1,2,3)
       group by uid;

??看結果鏈是不允許重疊的,確實要是允許重疊實現(xiàn)起來是真的非常麻煩啊,業(yè)務上有需求的話條件限制嘍

select uid,
      sequenceCount('(?1)(?t<=600)(?2).*(?3)')(
          eventTime,
          event_type = '瀏覽',
          event_type = '收藏' and expflag =1 ,
          event_type = '下單'
          ) as is_se
       from nonodb.adm_path_tf_demo1
       where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02'
       and uid in (1,2,3)
       group by uid;

3. 智能路徑檢測

智能路徑分析模型比較復雜,但同時支持的分析需求也會更加復雜,如分析給定期望的路徑終點、途經(jīng)點和最大事件時間間隔,統(tǒng)計出每條路徑的用戶數(shù),并按照用戶數(shù)對路徑進行倒序排列

???網(wǎng)上給的例子怎么都是一樣的,算了吸收一下吧,雖說數(shù)組函數(shù)非常靈活,但是不怕查詢慢嗎??具體還得結合業(yè)務過濾大部分的數(shù)據(jù)

方案一

1

-- 用groupArray函數(shù)整理成<時間, <事件名, expflag>>的元組
select uid ,groupArray((toUInt32(eventTime),(event_type,expflag)))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

2

-- arrayFilter 只保留數(shù)組中滿足條件的數(shù)據(jù)
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

3

-- arraySort 對數(shù)組中的數(shù)據(jù)按照指定列進行升序排列
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

4

--  arrayCompact對事件按照時間維度排序后進行相鄰去重
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

5

-- arrayEnumerate獲取下標數(shù)組
-- 切割點:過濾出原始行為鏈中的分界點下標。分界點的條件是路徑終點或者時間差大于最大間隔
-- arrayFilter如下返回第一個參數(shù)的值,滿足x,y,z的值

with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

6

-- 利用arrayMap和has函數(shù)獲取下標數(shù)組的掩碼(由0和1組成的序列),用于最終切分,1表示分界點
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
       arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,  --如果不加1的話上一個事件鏈的結尾事件會成為下個事件鏈的開始事件
       arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks  --標記切割點
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

7

-- 調用arraySplit函數(shù)將原始行為鏈按分界點切分成單次訪問的行為鏈。注意該函數(shù)會將分界點作為新鏈的起始點,所以前面要將分界點的下標加1
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
       arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
       arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
       arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events  --把用戶的訪問數(shù)據(jù)切割成多個事件鏈
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;

8

-- 調用arrayJoin和arrayCompact函數(shù)將事件鏈的數(shù)組打平成多行單列,并去除相鄰重復項
select uid,
       arrayJoin(split_events) AS event_chain_,
       arrayCompact(event_chain_.2) AS event_chain  --相鄰去重
       from (
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
       arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
       arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
       arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events  --把用戶的訪問數(shù)據(jù)切割成多個事件鏈
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單';  -- 事件鏈最后一個事件必須是目標事件;

9

-- 調用hasAll函數(shù)確定是否全部存在指定的途經(jīng)點。如果要求有任意一個途經(jīng)點存在即可,就換用hasAny函數(shù)。當然,也可以修改WHERE謂詞來排除指定的途經(jīng)點
-- select  hasAll([('瀏覽',1),('收藏',1),('下單',1)], [('下單',1),('收藏',1)]) ??這能判斷,為什么下面的不行???
select uid,
       arrayJoin(split_events) AS event_chain_,
       arrayCompact(event_chain_.2) AS event_chain,
       hasAll(event_chain, [('瀏覽',1)]) AS has_midway_hit -- 調用hasAll函數(shù)確定是否全部存在指定的途經(jīng)點
       from (
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
       arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
       arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
       arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單' AND has_midway_hit = 1;  -- 必須包含途經(jīng)點;

10

-- 將最終結果整理成可讀的字符串
 SELECT
  result_chain,
  uniqCombined(uid) AS user_count
FROM (
select uid,
       arrayJoin(split_events) AS event_chain_,
       arrayCompact(event_chain_.2) AS event_chain,
       hasAll(event_chain, [('瀏覽','1')]) AS has_midway_hit,
       arrayStringConcat(arrayMap( x -> concat(x.1, '#', toString(x.2)),event_chain ), ' -> ') AS result_chain
       from (
with  (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
       arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,toString(expflag))))))) AS sorted_events,
       arrayEnumerate(sorted_events) AS event_idxs,
       arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
       arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
       arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
       arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單'
                AND has_midway_hit = 1
     ) group by result_chain order by user_count desc limit  10 ;

bug的解決,hasAll這塊判斷

 select  hasAll([('瀏覽',1),('收藏',1),('下單',1)], [('下單',1),('收藏',1)]

雖然上面的是可以的,但是例子中SQL太長了,好像沒有轉換的問題,導致無法匹配上,最好都轉成string類型的....

方案二

不設置途經(jīng)點,且僅以用戶最后一次到達目標事件作為參考

-- 這個有個問題3號沒有下單的記錄...過濾一下
select uid,
       arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
       arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單'))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 找到目標節(jié)點前1小時內的所有事件
select uid,
       arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
       arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 進行排序和映射
select uid,
       arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
       arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 相鄰事件去重
select uid,
       arrayCompact(arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
       arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單'))))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 拼接路徑 arrayStringConcat
select result,uniqCombined(uid) AS user_count
       from (
select uid,
       arrayStringConcat(arrayCompact(arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
       arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))))),'->') result
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02'  and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid )where result <> ''  group by result;
上面用到的幾個高階函數(shù)

1. arrayCompact對數(shù)組中的數(shù)據(jù)進行相鄰去重,用戶重復操作的事件只記錄一次(頁面去重)

SELECT arrayCompact([1, 2, 3, 3, 1, 1, 1, 4]) AS data

[1,2,3,1,4]

2. arraySort 對數(shù)組中的數(shù)據(jù)按照指定列進行升序排列;降序排列參考arrayReverseSort

 SELECT arraySort(x -> (x.1), [(1, 'a'), (4, 'd'), (2, 'b'), (3, 'c')]) AS data

[(1,'a'),(2,'b'),(3,'c'),(4,'d')]

3. arrayEnumerate 取數(shù)組的下標掩碼序列

SELECT arrayEnumerate([1, 2, 3, 3, 1, 1, 4, 2]) AS data

[1,2,3,4,5,6,7,8]

4. arrayMap 對數(shù)組中的每一列進行處理,并返回長度相同的新數(shù)組

SELECT arrayMap(x -> concat(toString(x.1), ':', x.2), [(1, 'a'), (4, 'a'), (3, 'a'), (2, 'c')]) AS data

['1:a','4:a','3:a','2:c']

5. arrayStringConcat將數(shù)組元素按照給定分隔符進行拼接,返回拼接后的字符串

SELECT arrayStringConcat( ['a','b','c'] , '-');

a-b-c

6. arraySplit 按照規(guī)則對數(shù)組進行分割(遇到下標為1時進行分割,分割點為下一個 數(shù)組的起始點;注意,首項為1還是0不影響結果)

SELECT arraySplit((x, y) -> y, ['a', 'b', 'c', 'd', 'e'], [1, 1, 1, 0, 0]) AS data

[['a'],['b'],['c','d','e']]

7. arrayDifference參數(shù)必須是數(shù)值類型;計算數(shù)組中相鄰數(shù)字的差值,第一個值為0

SELECT arrayDifference([3, 1, 1, 4, 2]) AS data

[0,-2,0,3,-2]

8. arrayFilter 只保留數(shù)組中滿足條件的數(shù)據(jù)

SELECT arrayFilter((x,y) -> (x > 2 and y >3), [12, 3, 4, 1, 0],[12, 3, 4, 1, 0]) AS data

[12,4]

SELECT arrayFilter(x -> (x > 2), [12, 3, 4, 1, 0]) AS data

[12,3,4]

4. VIVO方案

VIVO路徑分析模型

通常用戶在需要進行路徑分析的場景時關注的主要問題:

  1. 按轉換率從高至低排列在APP內用戶的主要路徑是什么?
  2. 用戶在離開預想的路徑后,實際走向是什么?
  3. 不同特征的用戶行為路徑有什么差異?
1. 一些概念
  1. Session和Session Time
    不同于WEB應用中的Session,在數(shù)據(jù)分析中的Session會話,是指在指定的時間段內在網(wǎng)站上發(fā)生的一系列互動。本模型中的Session Time的含義是,當兩個行為間隔時間超過Session Time,我們便認為這兩個行為不屬于同一條路徑。
  2. ?;鶊D


    桑基圖
2. 技術關鍵點
  1. Session切分
    其實這里我們可以從前端那拿到會話id,這樣就不用切分了
  2. 相鄰頁面去重
  3. 滑動數(shù)組
  4. 轉化率計算
    頁面轉化率,路徑轉化率
  5. 鄰接表和剪枝

5. Clickhouse問題解決

1. dataGrip連接clickhouse時,時間字段顯示差八小時問題
改一下的
2. 遠程連接clickhouse

ClickHouse安裝和使用

注意關閉防火墻

systemctl status firewalld 

systemctl stop firewalld.service

systemctl disable firewalld.service
3. 遇到的問題

clickhouse單節(jié)點報錯 Code: 210. DB::NetException: Connection refused (localhost:9000)

<-- <listen_host>0.0.0.0</listen_host> -->

這個注釋千萬不能去掉....暫時不知道怎么處理..卸載重裝吧

clickhouse卸載重裝

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。

相關閱讀更多精彩內容

友情鏈接更多精彩內容