1. 路徑分析定義
漏斗分析是固化了具體的分析過程或者業(yè)務環(huán)節(jié),然后分析幾個大的業(yè)務環(huán)節(jié)的轉化;而路徑分析,是固化了用戶的路徑順序,在每個路徑次序中,都包含了各個主要業(yè)務環(huán)節(jié),因此在每一步中,出現(xiàn)的業(yè)務環(huán)節(jié)很有可能都是類似的。漏斗分析看重的是業(yè)務環(huán)節(jié)之間的留存關系,而路徑分析看重的是用戶在不同業(yè)務環(huán)節(jié)中的順序及流失關系。
路徑分析:如何將用戶的網(wǎng)站行為軌跡可視化呈現(xiàn)?
BI分析系統(tǒng)——路徑分析及產(chǎn)品化 )
2. 關鍵路徑分析
- sequenceMatch函數(shù)檢查是否有事件鏈滿足輸入的模式
- sequenceCount函數(shù)則統(tǒng)計滿足輸入模式的事件鏈的數(shù)量(如果兩個事件發(fā)生在同一秒時,是無法準確區(qū)分事件的發(fā)生先后關系的,所以會存在一定的誤差)
sequenceCount(pattern)(timestamp, cond1, cond2, ...)
pattern支持3中匹配模式:
(?N):表示時間序列中的第N個事件,從1開始,最長支持32個條件輸入;如,(?1)對應的是cond1
(?t op secs):插入兩個事件之間,表示它們發(fā)生時需要滿足的時間條件(單位為秒),支持 >=, >, <, <= 。例如上述SQL中,(?1)(?t<=15)(?2)即表示事件1和2發(fā)生的時間間隔在15秒以內,期間可能會發(fā)生若干次非指定事件。
.*:表示任意的非指定事件。
create table nonodb.adm_path_tf_demo1
(
eventTime DateTime,
uid UInt32,
event_type String,
expflag int
)
ENGINE = MergeTree
ORDER BY (eventTime,uid)
SETTINGS index_granularity = 8192;
-- 在10分鐘內完成瀏覽收藏操作.之后下單的人
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:16:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:16:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:26:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:26:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:36:00',1,'點擊',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:56:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:57:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:57:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:58:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',2,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',2,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:56:00',2,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:15:00',3,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 11:25:00',3,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:15:00',1,'瀏覽',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:24:00',1,'收藏',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:56:00',1,'下單',1);
insert into nonodb.adm_path_tf_demo1 values ('2022-01-02 12:56:00',2,'下單',1);
select uid,
sequenceMatch('(?1)(?t<=600)(?2).*(?3)')(
eventTime,
event_type = '瀏覽',
event_type = '收藏' and expflag =1 ,
event_type = '下單'
) as is_match
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02'
and uid in (1,2,3)
group by uid;
??看結果鏈是不允許重疊的,確實要是允許重疊實現(xiàn)起來是真的非常麻煩啊,業(yè)務上有需求的話條件限制嘍
select uid,
sequenceCount('(?1)(?t<=600)(?2).*(?3)')(
eventTime,
event_type = '瀏覽',
event_type = '收藏' and expflag =1 ,
event_type = '下單'
) as is_se
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02'
and uid in (1,2,3)
group by uid;
3. 智能路徑檢測
智能路徑分析模型比較復雜,但同時支持的分析需求也會更加復雜,如分析給定期望的路徑終點、途經(jīng)點和最大事件時間間隔,統(tǒng)計出每條路徑的用戶數(shù),并按照用戶數(shù)對路徑進行倒序排列
???網(wǎng)上給的例子怎么都是一樣的,算了吸收一下吧,雖說數(shù)組函數(shù)非常靈活,但是不怕查詢慢嗎??具體還得結合業(yè)務過濾大部分的數(shù)據(jù)
方案一
1
-- 用groupArray函數(shù)整理成<時間, <事件名, expflag>>的元組
select uid ,groupArray((toUInt32(eventTime),(event_type,expflag)))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
2
-- arrayFilter 只保留數(shù)組中滿足條件的數(shù)據(jù)
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
3
-- arraySort 對數(shù)組中的數(shù)據(jù)按照指定列進行升序排列
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
4
-- arrayCompact對事件按照時間維度排序后進行相鄰去重
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
5
-- arrayEnumerate獲取下標數(shù)組
-- 切割點:過濾出原始行為鏈中的分界點下標。分界點的條件是路徑終點或者時間差大于最大間隔
-- arrayFilter如下返回第一個參數(shù)的值,滿足x,y,z的值
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
6
-- 利用arrayMap和has函數(shù)獲取下標數(shù)組的掩碼(由0和1組成的序列),用于最終切分,1表示分界點
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_, --如果不加1的話上一個事件鏈的結尾事件會成為下個事件鏈的開始事件
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks --標記切割點
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
7
-- 調用arraySplit函數(shù)將原始行為鏈按分界點切分成單次訪問的行為鏈。注意該函數(shù)會將分界點作為新鏈的起始點,所以前面要將分界點的下標加1
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events --把用戶的訪問數(shù)據(jù)切割成多個事件鏈
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
8
-- 調用arrayJoin和arrayCompact函數(shù)將事件鏈的數(shù)組打平成多行單列,并去除相鄰重復項
select uid,
arrayJoin(split_events) AS event_chain_,
arrayCompact(event_chain_.2) AS event_chain --相鄰去重
from (
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events --把用戶的訪問數(shù)據(jù)切割成多個事件鏈
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單'; -- 事件鏈最后一個事件必須是目標事件;
9
-- 調用hasAll函數(shù)確定是否全部存在指定的途經(jīng)點。如果要求有任意一個途經(jīng)點存在即可,就換用hasAny函數(shù)。當然,也可以修改WHERE謂詞來排除指定的途經(jīng)點
-- select hasAll([('瀏覽',1),('收藏',1),('下單',1)], [('下單',1),('收藏',1)]) ??這能判斷,為什么下面的不行???
select uid,
arrayJoin(split_events) AS event_chain_,
arrayCompact(event_chain_.2) AS event_chain,
hasAll(event_chain, [('瀏覽',1)]) AS has_midway_hit -- 調用hasAll函數(shù)確定是否全部存在指定的途經(jīng)點
from (
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,expflag)))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單' AND has_midway_hit = 1; -- 必須包含途經(jīng)點;
10
-- 將最終結果整理成可讀的字符串
SELECT
result_chain,
uniqCombined(uid) AS user_count
FROM (
select uid,
arrayJoin(split_events) AS event_chain_,
arrayCompact(event_chain_.2) AS event_chain,
hasAll(event_chain, [('瀏覽','1')]) AS has_midway_hit,
arrayStringConcat(arrayMap( x -> concat(x.1, '#', toString(x.2)),event_chain ), ' -> ') AS result_chain
from (
with (select toUInt32(maxIf(eventTime, event_type = '下單')) AS end_event_max from nonodb.adm_path_tf_demo1) as end_event_max
select uid ,
arrayCompact(arraySort(x -> (x.1),arrayFilter( x -> x.1 <= end_event_max,groupArray((toUInt32(eventTime),(event_type,toString(expflag))))))) AS sorted_events,
arrayEnumerate(sorted_events) AS event_idxs,
arrayFilter( (x, y, z) -> z.1 <= end_event_max AND (z.2.1 = '下單' OR y > 600),event_idxs,arrayDifference(sorted_events.1),sorted_events ) AS gap_idxs,
arrayMap(x -> x + 1, gap_idxs) AS gap_idxs_,
arrayMap(x -> if(has(gap_idxs_, x), 1, 0), event_idxs) AS gap_masks,
arraySplit((x, y) -> y, sorted_events, gap_masks) AS split_events
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid ) WHERE event_chain[length(event_chain)].1 = '下單'
AND has_midway_hit = 1
) group by result_chain order by user_count desc limit 10 ;
bug的解決,hasAll這塊判斷
select hasAll([('瀏覽',1),('收藏',1),('下單',1)], [('下單',1),('收藏',1)]
雖然上面的是可以的,但是例子中SQL太長了,好像沒有轉換的問題,導致無法匹配上,最好都轉成string類型的....
方案二
不設置途經(jīng)點,且僅以用戶最后一次到達目標事件作為參考
-- 這個有個問題3號沒有下單的記錄...過濾一下
select uid,
arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單'))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 找到目標節(jié)點前1小時內的所有事件
select uid,
arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 進行排序和映射
select uid,
arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 相鄰事件去重
select uid,
arrayCompact(arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單'))))))
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid;
-- 拼接路徑 arrayStringConcat
select result,uniqCombined(uid) AS user_count
from (
select uid,
arrayStringConcat(arrayCompact(arrayMap(b -> b.1,arraySort(y -> y.2,arrayFilter((x,y) -> y - x.2 > 3600,arrayMap((x, y) -> (x, y),groupArray(event_type),groupArray(eventTime)),
arrayWithConstant(length(groupArray(eventTime)),maxIf(eventTime, event_type = '下單')))))),'->') result
from nonodb.adm_path_tf_demo1
where toDate(eventTime) >= '2022-01-02' and toDate(eventTime) <= '2022-01-02' and uid in (1,2,3)
group by uid )where result <> '' group by result;
上面用到的幾個高階函數(shù)
1. arrayCompact對數(shù)組中的數(shù)據(jù)進行相鄰去重,用戶重復操作的事件只記錄一次(頁面去重)
SELECT arrayCompact([1, 2, 3, 3, 1, 1, 1, 4]) AS data
[1,2,3,1,4]
2. arraySort 對數(shù)組中的數(shù)據(jù)按照指定列進行升序排列;降序排列參考arrayReverseSort
SELECT arraySort(x -> (x.1), [(1, 'a'), (4, 'd'), (2, 'b'), (3, 'c')]) AS data
[(1,'a'),(2,'b'),(3,'c'),(4,'d')]
3. arrayEnumerate 取數(shù)組的下標掩碼序列
SELECT arrayEnumerate([1, 2, 3, 3, 1, 1, 4, 2]) AS data
[1,2,3,4,5,6,7,8]
4. arrayMap 對數(shù)組中的每一列進行處理,并返回長度相同的新數(shù)組
SELECT arrayMap(x -> concat(toString(x.1), ':', x.2), [(1, 'a'), (4, 'a'), (3, 'a'), (2, 'c')]) AS data
['1:a','4:a','3:a','2:c']
5. arrayStringConcat將數(shù)組元素按照給定分隔符進行拼接,返回拼接后的字符串
SELECT arrayStringConcat( ['a','b','c'] , '-');
a-b-c
6. arraySplit 按照規(guī)則對數(shù)組進行分割(遇到下標為1時進行分割,分割點為下一個 數(shù)組的起始點;注意,首項為1還是0不影響結果)
SELECT arraySplit((x, y) -> y, ['a', 'b', 'c', 'd', 'e'], [1, 1, 1, 0, 0]) AS data
[['a'],['b'],['c','d','e']]
7. arrayDifference參數(shù)必須是數(shù)值類型;計算數(shù)組中相鄰數(shù)字的差值,第一個值為0
SELECT arrayDifference([3, 1, 1, 4, 2]) AS data
[0,-2,0,3,-2]
8. arrayFilter 只保留數(shù)組中滿足條件的數(shù)據(jù)
SELECT arrayFilter((x,y) -> (x > 2 and y >3), [12, 3, 4, 1, 0],[12, 3, 4, 1, 0]) AS data
[12,4]
SELECT arrayFilter(x -> (x > 2), [12, 3, 4, 1, 0]) AS data
[12,3,4]
4. VIVO方案
通常用戶在需要進行路徑分析的場景時關注的主要問題:
- 按轉換率從高至低排列在APP內用戶的主要路徑是什么?
- 用戶在離開預想的路徑后,實際走向是什么?
- 不同特征的用戶行為路徑有什么差異?
1. 一些概念
- Session和Session Time
不同于WEB應用中的Session,在數(shù)據(jù)分析中的Session會話,是指在指定的時間段內在網(wǎng)站上發(fā)生的一系列互動。本模型中的Session Time的含義是,當兩個行為間隔時間超過Session Time,我們便認為這兩個行為不屬于同一條路徑。 -
?;鶊D
桑基圖
2. 技術關鍵點
- Session切分
其實這里我們可以從前端那拿到會話id,這樣就不用切分了 - 相鄰頁面去重
- 滑動數(shù)組
- 轉化率計算
頁面轉化率,路徑轉化率 - 鄰接表和剪枝
5. Clickhouse問題解決
1. dataGrip連接clickhouse時,時間字段顯示差八小時問題

2. 遠程連接clickhouse
注意關閉防火墻
systemctl status firewalld
systemctl stop firewalld.service
systemctl disable firewalld.service
3. 遇到的問題
clickhouse單節(jié)點報錯 Code: 210. DB::NetException: Connection refused (localhost:9000)
<-- <listen_host>0.0.0.0</listen_host> -->
這個注釋千萬不能去掉....暫時不知道怎么處理..卸載重裝吧
