Translation: "Data Science for Business"

Chapter 2: Business Problems and Data Science Solutions

P24-P27

Supervised Versus Unsupervised Methods

Consider two similar questions we might ask about a customer population. The first is: “Do our customers naturally fall into different groups?” Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised. Contrast this with a slightly different question: “Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?” Here there is a specific target defined: will a customer leave when her contract expires? In this case, segmentation is being done for a specific reason: to take action based on likelihood of churn. This is called a supervised data mining problem.

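Consider the following two data questions about a customer population. The first is: “Do our customers naturally fall into different groups?” Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is called an unsupervised problem. In contrast, consider a slightly different question: “Can we find groups of customers who are especially likely to cancel their service soon after their contracts expire?” Here the grouping has a definite target: will a customer cancel when the contract expires? In this case the segmentation is done for a specific reason: to act on the likelihood of churn. This is called a supervised problem.

The difference between these questions is subtle but important. If a specific target can be provided, the problem can be framed as a supervised one. Supervised tasks require different techniques from unsupervised tasks, and their results are often much more useful. A supervised technique is given a specific purpose for the grouping: predicting the target. Clustering is an unsupervised task that produces groupings based on similarity, with no guarantee that the similarities are meaningful or useful for any particular purpose.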

The difference between these questions is subtle but important. If a specific target can be provided, the problem can be phrased as a supervised one. Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful. A supervised technique is given a specific purpose for the grouping—predicting the target. Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities are meaningful or will be useful for any particular purpose.
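
The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the field names (`usage`, `calls`, `churned`), and the two helper functions are invented here and do not come from the book. The unsupervised helper groups customers by usage alone, while the supervised helper uses the churn target to choose a split.

```python
from statistics import mean

# Hypothetical toy customers; fields and values are invented for illustration.
customers = [
    {"usage": 2, "calls": 0, "churned": False},
    {"usage": 3, "calls": 5, "churned": True},
    {"usage": 4, "calls": 1, "churned": False},
    {"usage": 20, "calls": 6, "churned": True},
    {"usage": 22, "calls": 0, "churned": False},
    {"usage": 25, "calls": 4, "churned": True},
]

def two_means(values, iters=10):
    """Unsupervised: group values by similarity only (1-D k-means, k=2)."""
    lo, hi = min(values), max(values)  # initial centers; assumes two distinct values
    for _ in range(iters):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = mean(near_lo), mean(near_hi)
    return near_lo, near_hi

def best_split(rows, feature):
    """Supervised: pick the threshold on `feature` that best predicts the target."""
    scored = []
    for t in sorted({r[feature] for r in rows}):
        # Predict churn when the feature value reaches the threshold.
        correct = sum((r[feature] >= t) == r["churned"] for r in rows)
        scored.append((correct, t))
    return max(scored)  # (number of correct predictions, threshold)

usage_groups = two_means([c["usage"] for c in customers])
calls_split = best_split(customers, "calls")
usage_split = best_split(customers, "usage")
```

In this toy data the clustering finds tidy low-usage and high-usage groups, yet a usage threshold classifies at most four of the six customers correctly. The target-driven split on support calls classifies all six, which is the sense in which a supervised grouping has a purpose.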

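Technically, supervised data mining must satisfy one more condition: there must be data on the target (that is, a suitable training set is required). It is not enough for the target information to exist in principle; it must also exist in the data (the training set needs explicit labels). For example, it may be useful to know whether a given customer will stay for at least six months, but if the retained historical information is missing or incomplete (say, the data are kept for only two months), the target values cannot be provided: there is no way to know which past customers stayed for six months, so the training set is incomplete. Acquiring data on the target is often an important step in a data science project. The value of the target variable for an individual is usually called the individual's label, and these labels often have to be produced by processing and annotating the data before analysis.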
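
A small sketch of why the label must exist in the data, not just in principle. The function name, its arguments, and the 182-day approximation of six months are assumptions made for illustration. It returns `None` when the retained history is too short to decide the label.

```python
from datetime import date

HORIZON_DAYS = 182  # assumed approximation of "six months"

def stayed_six_months(signup, cancelled, observed_until):
    """Return the label if the data can support it, otherwise None.

    signup, observed_until: datetime.date
    cancelled: datetime.date, or None if no cancellation was recorded
    """
    if cancelled is not None:
        # Cancellation recorded: the label follows from the tenure.
        return (cancelled - signup).days >= HORIZON_DAYS
    if (observed_until - signup).days >= HORIZON_DAYS:
        # Still active after a full six months of observation.
        return True
    # Not enough history: the target value cannot be provided.
    return None

left_early = stayed_six_months(date(2023, 1, 1), date(2023, 3, 1), date(2023, 12, 31))
stayed = stayed_six_months(date(2023, 1, 1), None, date(2023, 12, 31))
unknown = stayed_six_months(date(2023, 11, 1), None, date(2023, 12, 31))
```

Customers whose label comes back as `None` cannot be used as training examples for this target, which is the situation described above when only two months of data are kept.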

Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised. The fundamental principles of data mining that we will present underlie all these types of technique.

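Classification, regression, and causal modeling are generally solved with supervised methods. Similarity matching, link prediction, and data reduction can be either supervised or unsupervised. Clustering, co-occurrence grouping, and profiling are generally unsupervised. The fundamental principles of data mining that we will present underlie all of these techniques.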

Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric target while classification involves a categorical (often binary) target. Consider these similar questions we might address with supervised data mining: “Will this customer purchase service S1 if given incentive I?” This is a classification problem because it has a binary target (the customer either purchases or does not). “Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?” This is also a classification problem, with a three-valued target. “How much will this customer use the service?” This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.

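The two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric (continuous) target, while classification involves a categorical (often binary) target. Consider the following questions we might address with supervised data mining:
“Will this customer purchase service S1 if given incentive I?”
This is a classification problem because it has a binary target (the customer either purchases or does not).
“Which service package (S1, S2, or none) will a customer likely purchase if given incentive I?”
This is also a classification problem, with a three-valued target.
“How much will this customer use the service?”
This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.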
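
The three questions above differ only in the type of their target values, which a short heuristic can make concrete. The helper `task_type` and the sample targets are hypothetical. The rule is deliberately crude: categories that happen to be stored as numeric codes would be misread as a regression target.

```python
def task_type(target_values):
    """Guess the supervised task from the kind of target values (rough heuristic)."""
    distinct = set(target_values)
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if numeric:
        return "regression"
    return f"classification ({len(distinct)} classes)"

buys_s1 = task_type([True, False, True])              # purchase S1 or not
package = task_type(["S1", "S2", "none", "S1"])       # which package
usage = task_type([12.5, 3.0, 40.2])                  # amount of usage
```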

There are subtleties among these questions that should be brought out. For business applications we often want a numerical prediction over a categorical target. In the churn example, a basic yes/no prediction of whether a customer is likely to continue to subscribe to the service may not be sufficient; we want to model the probability that the customer will continue. This is still considered classification modeling rather than regression because the underlying target is categorical. Where necessary for clarity, this is called “class probability estimation.”

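There are some subtleties in these questions that deserve attention. In business applications we often want a numerical prediction over a categorical target. In the churn example, a basic yes/no prediction of whether a customer is likely to keep subscribing may not be sufficient; we want to model the probability that the customer will continue. This is still considered classification modeling rather than regression, because the underlying target is categorical. Where clarity requires it, this is called “class probability estimation.”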
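
As a minimal sketch of class probability estimation, the snippet below estimates the probability that a customer stays from historical frequencies within a segment. The segment names, the data, and the use of Laplace (add-one) smoothing are choices made for this illustration, not something this passage prescribes. The target stays categorical (stayed or not); only the output becomes a number.

```python
# Hypothetical history: (contract segment, whether the customer stayed).
history = [
    ("monthly", True), ("monthly", False), ("monthly", False),
    ("annual", True), ("annual", True), ("annual", True),
]

def stay_probability(records, segment):
    """Estimate P(stay | segment) with add-one (Laplace) smoothing."""
    outcomes = [stayed for seg, stayed in records if seg == segment]
    return (sum(outcomes) + 1) / (len(outcomes) + 2)

p_monthly = stay_probability(history, "monthly")  # (1 + 1) / (3 + 2)
p_annual = stay_probability(history, "annual")    # (3 + 1) / (3 + 2)

# A plain yes/no prediction discards how confident the estimate is.
will_stay_monthly = p_monthly >= 0.5
will_stay_annual = p_annual >= 0.5
```

Keeping the probability lets the business rank customers or weigh the cost of an intervention, which a bare yes/no answer cannot support.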

A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining (and for which we can obtain values for some example data). We will return to this in Chapter 3.

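An important part of the early stages of the data mining process is:
(i) to decide whether the problem under study is supervised or unsupervised, and
(ii) if it is supervised, to produce a precise definition of the target variable. This variable must be a specific quantity that will be the focus of the supervised analysis (and for which we can obtain values for some example data). We will return to this in detail in Chapter 3.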

Data Mining and Its Results

There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.

There is another important distinction in data mining: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.

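Figure 2-1. Data mining versus the use of data mining results. The upper half of the figure illustrates mining historical data to produce a model. Importantly, the historical data have the target (“class”) value specified. The bottom half shows the result of the data mining in use, where the model is applied to new data for which we do not know the class value. The model predicts both the class value and the probability that the class variable will take on that value.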

In our churn example, consider the deployment scenario in which the results will be used. We want to use the model to predict which of our customers will leave. Specifically, assume that data mining has created a class probability estimation model M. Given each existing customer, described using a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of attrition. This is the use of the results of data mining. The data mining produces the model M from some other, often historical, data.

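In our churn example, consider the deployment scenario in which the data mining results will be used. We want to use the model to predict which of our customers will leave. Specifically, assume that data mining has already created a class probability estimation model M. Given each existing customer, described by a set of characteristics, M takes these characteristics as input and produces a score or probability estimate of churn. This is the use of the results of data mining. The model M itself is produced by mining other, often historical, data.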

Figure 2-1 illustrates these two phases. Data mining produces the probability estimation model, as shown in the top half of the figure. In the use phase (bottom half), the model is applied to a new, unseen case and it generates a probability estimate for it.

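Figure 2-1 illustrates these two phases. Data mining produces the probability estimation model, as shown in the top half of the figure. In the use phase (bottom half), the model is applied to a new, previously unseen case and generates a probability estimate for it.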
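
The two phases can be kept apart in code as well. The sketch below is an assumption-laden toy: the `plan` attribute, the historical records, and the per-plan churn-rate "model" are all invented to show the separation and are not the book's model M. `mine_model` is the mining phase; calling the returned function on a new customer is the use phase.

```python
def mine_model(historical):
    """Mining phase: learn per-plan churn rates from labeled historical data."""
    counts = {}  # plan -> (customers seen, customers who churned)
    for customer, churned in historical:
        n, k = counts.get(customer["plan"], (0, 0))
        counts[customer["plan"]] = (n + 1, k + int(churned))
    total = sum(n for n, _ in counts.values())
    overall = sum(k for _, k in counts.values()) / total

    def model(customer):
        """Use phase: score a customer whose outcome is not yet known."""
        n, k = counts.get(customer["plan"], (0, 0))
        return k / n if n else overall  # fall back to the overall rate

    return model

# Hypothetical historical data with known target values.
historical = [
    ({"plan": "basic"}, True), ({"plan": "basic"}, True),
    ({"plan": "basic"}, True), ({"plan": "basic"}, False),
    ({"plan": "premium"}, False), ({"plan": "premium"}, False),
    ({"plan": "premium"}, False), ({"plan": "premium"}, True),
]

M = mine_model(historical)            # top half of Figure 2-1
p_basic = M({"plan": "basic"})        # bottom half: a new, unseen case
p_unknown = M({"plan": "student"})    # plan never seen during mining
```

The historical data are touched only inside `mine_model`. Once M exists, scoring a new customer needs the customer's characteristics but no target value.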

The Data Mining Process

Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well. But as with many mature crafts, there is a well-understood process that places a structure on the problem, allowing reasonable consistency, repeatability, and objectiveness. A useful codification of the data mining process is given by the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000), illustrated in Figure 2-2.

Data mining is a craft. It involves applying a substantial amount of science and technology, but applying it properly is still an art. As with many mature crafts, however, there is a well-understood process that places a structure on the problem, allowing reasonable consistency, repeatability, and objectiveness. A useful codification of the data mining process is given by the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000), shown in Figure 2-2:

Figure 2-2. The CRISP data mining process.

This process diagram makes explicit the fact that iteration is the rule rather than the exception. Going through the process once without having solved the problem is, generally speaking, not a failure. Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more. The next iteration can be much more well-informed. Let’s now discuss the steps in detail.

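This process diagram makes explicit that iteration is the rule rather than the exception. Going through the process once without solving the problem is, generally speaking, not a failure, because the whole process is often an exploration of the data. After the first iteration the data science team knows much more, and the next iteration can be better informed. Let us now discuss the steps in detail.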
