Single-cell clustering

Key Point
- Choosing a clustering strategy for scRNA-seq data analysis
- Technical, biological and computational challenges in clustering
- Biological interpretation of the clusters
Preface



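- Flow cytometry is also a single-cell technology. The difference is that flow cytometry identifies cell populations by their surface proteins, whereas scRNA-seq quantifies the expression profile of each individual cell and identifies cell populations by their top expressed genes.
- Why cluster? Clustering on expression profiles is an unsupervised, data-driven and unbiased approach. It partitions cells into cell types, which helps a great deal in studying cell heterogeneity, development and evolution.
- Many clustering methods implicitly assume that the data contain discrete clusters. For some developmental lineages, however, the trajectory of the cells may need to be considered, because the clusters are related to one another in time.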
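Main text of the paper
Clustering strategies
Characteristics of the scRNA-seq expression matrix:
- High-dimensional (expression values for more than ten thousand genes)
- Sparse (many expression values are zero or close to zero)
Computing distances for clustering:
- Using all features, that is, all genes, easily runs into the 'curse of dimensionality': distances between cells tend to become small and similar to one another, so they separate cells poorly
- Feature selection and dimensionality reduction: work in a feature space built from a subset or combination of genes, for example via PCA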
Possible measures include Euclidean distance, cosine similarity, Pearson's correlation and Spearman's correlation. The last three consider relative differences between values, which makes them more robust to differences in library size or cell size.
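The robustness to library size can be illustrated with a small sketch. The two toy cells below are assumptions made up for illustration: `cell_b` has the same relative profile as `cell_a` but three times the counts. The helper functions are plain-Python implementations written for this note, not calls into any scRNA-seq package.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    # Spearman's correlation is Pearson's correlation computed on ranks.
    return pearson(ranks(x), ranks(y))

# Toy expression values (assumed, not real data).
cell_a = [4.0, 0.0, 10.0, 2.0, 1.0]
cell_b = [3 * v for v in cell_a]  # same profile, 3x the library size

print("Euclidean:", euclidean(cell_a, cell_b))
print("cosine:   ", cosine_similarity(cell_a, cell_b))
print("Pearson:  ", pearson(cell_a, cell_b))
print("Spearman: ", spearman(cell_a, cell_b))
```

The Euclidean distance is large even though the two cells differ only in sequencing depth, while the three scale-invariant measures report them as essentially identical.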
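A commonly used clustering method is k-means, whose computational cost grows linearly with the number of points. However, ① k-means is usually run as a greedy algorithm that easily gets stuck in a local optimum, so it has to be repeated several times with different initial conditions, or, as in SC3, the runs are combined into a consensus; ② it is biased towards identifying equal-sized clusters, which can cause rare cell types to be missed.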
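The "repeat with different initial conditions" idea can be sketched as follows. This is a minimal plain-Python version of Lloyd-style k-means with random restarts, keeping the run with the lowest within-cluster sum of squares. The toy points, the function names and the defaults (`n_restarts=20`, `seed=0`) are assumptions for illustration. It is not the SC3 consensus procedure, which combines runs rather than picking one.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def kmeans_once(points, k, rng, n_iter=100):
    # Initialise with k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:  # converged, possibly to a local optimum
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia

def kmeans_restarts(points, k, n_restarts=20, seed=0):
    """Run k-means several times and return (best run, all runs)."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1]), runs

# Toy data (assumed): one large group and two small, "rare" groups.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
          [0.2, 0.4], [0.4, 0.1], [0.1, 0.1], [0.3, 0.4],
          [5.0, 5.0], [5.1, 5.2], [9.0, 0.0], [9.2, 0.1]]

(best_labels, best_inertia), runs = kmeans_restarts(points, k=3)
print("inertia of each run:", sorted(round(r[1], 3) for r in runs))
print("best labels:", best_labels)
```

Different restarts can end at different inertia values, which is the local-optimum problem in practice; taking the minimum over restarts only reduces the risk, it does not remove it.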
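Another common method is hierarchical clustering, either top-down (divisive) or bottom-up (agglomerative). It is time and memory consuming, however: the cost grows at least quadratically with the number of data points.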
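The quadratic growth comes from the pairwise distance table that bottom-up clustering starts from. The sketch below is a naive single-linkage implementation written for this note (the linkage choice, the toy points and the function names are assumptions); it is meant to show the mechanics, not to be efficient.

```python
import math
from itertools import combinations

def n_pairs(n):
    """Number of pairwise distances among n cells."""
    return n * (n - 1) // 2

def pairwise_distances(points):
    # One entry per unordered pair (i, j) with i < j: n * (n - 1) / 2 entries.
    return {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def single_linkage(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    dist = pairwise_distances(points)
    clusters = [{i} for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(dist[(min(i, j), max(i, j))]
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

# Toy 2-D points (assumed).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
clusters = single_linkage(points, n_clusters=3)
print(sorted(sorted(c) for c in clusters))
print("pairs for 10,000 cells:", n_pairs(10_000))
```

Six points need only 15 distances, but 10,000 cells already need about 50 million, which is why hierarchical methods become impractical for large data sets.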
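A third common family is community-detection-based algorithms, in other words graph-based algorithms. They first build a k-nearest-neighbours graph, and the choice of k strongly affects the size and number of the final clusters. Most graph-based clustering methods return only a single best solution, and they do not require the number of clusters to be specified.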
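The influence of k can be seen even on a tiny example. The sketch below builds a symmetrised k-nearest-neighbours graph and then counts connected components. Connected components are used here only as a crude stand-in for community detection (real tools use modularity-based methods such as Louvain on a weighted graph); the toy points and function names are assumptions.

```python
import math

def knn_graph(points, k):
    """Undirected graph linking every point to its k nearest neighbours."""
    n = len(points)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency

def connected_components(adjacency):
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

# Two toy groups of four cells each (assumed), far apart from each other.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1)]

for k in (3, 4):
    groups = connected_components(knn_graph(points, k))
    print(f"k={k}: {len(groups)} group(s)")
```

With k = 3 every neighbour lies inside the cell's own group, so the two groups stay separate. With k = 4 each cell is forced to link to the other group and the graph collapses into one piece: a larger k yields fewer, larger clusters.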
| Name | Year | Method type | Strengths | Limitations |
|---|---|---|---|---|
| scanpy [4] | 2018 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
| Seurat (latest) [3] | 2016 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
| PhenoGraph [32] | 2015 | PCA + graph-based | Very scalable | May not be accurate for small data sets |
| SC3 [22] | 2017 | PCA + k-means | High accuracy through consensus, provides estimation of k | High complexity, not scalable |
| SIMLR [24] | 2017 | Data-driven dimensionality reduction + k-means | Concurrent training of the distance metric improves sensitivity in noisy data sets | Adjusting the distance metric to make cells fit the clusters may artificially inflate quality measures |
| CIDR [25] | 2017 | PCA + hierarchical | Implicitly imputes dropouts when calculating distances | |
| GiniClust [75] | 2016 | DBSCAN | Sensitive to rare cell types | Not effective for the detection of large clusters |
| pcaReduce [27] | 2016 | PCA + k-means + hierarchical | Provides hierarchy of solutions | Very stochastic, does not provide a stable result |
| Tasic et al. [28] | 2016 | PCA + hierarchical | Cross validation used to perform fuzzy clustering | High complexity, no software package available |
| TSCAN [41] | 2016 | PCA + Gaussian mixture model | Combines clustering and pseudotime analysis | Assumes clusters follow multivariate normal distribution |
| mpath [45] | 2016 | Hierarchical | Combines clustering and pseudotime analysis | Uses empirically defined thresholds and a priori knowledge |
| BackSPIN [26] | 2015 | Biclustering (hierarchical) | Multiple rounds of feature selection improve clustering resolution | Tends to over-partition the data |
| RaceID [23], RaceID2 [115], RaceID3 | 2015 | k-means | Detects rare cell types, provides estimation of k | Performs poorly when there are no rare cell types |
| SINCERA [5] | 2015 | Hierarchical | Method is intuitively easy to understand | Simple hierarchical clustering is used, may not be appropriate for very noisy data |
| SNN-Cliq [80] | 2015 | Graph-based | Provides estimation of k | High complexity, not scalable |
- DBSCAN, density-based spatial clustering of applications with noise; PCA, principal component analysis; scRNA-seq, single-cell RNA sequencing.
Discrete versus continuous cell grouping
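Most partitioning clustering algorithms ignore the question of whether biologically meaningful groups exist at all. If the data contain no discrete groups, these methods may not be appropriate. This applies in particular when cells lie along a continuum of states, as in differentiation; in that case a one-dimensional manifold ('pseudotime') is commonly used to order the cells.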
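As a rough illustration of ordering cells along a one-dimensional axis, the sketch below projects cells onto their first principal component and sorts them by that score. This is a deliberately simplified stand-in: dedicated pseudotime methods fit more flexible trajectories. The toy cells, the hidden `true_time` values and the power-iteration helper are all assumptions made for this example.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, per-row scores) of PC1 via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # assumes this start vector is not orthogonal to PC1
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy cells (assumed): three genes that change linearly with a hidden time t,
# listed in shuffled order as they would come off the sequencer.
true_time = [3, 0, 5, 1, 4, 2]
cells = [[t, 2 * t, 5 - t] for t in true_time]

_, scores = first_principal_component(cells)
order = sorted(range(len(cells)), key=lambda i: scores[i])
recovered = [true_time[i] for i in order]
print("cells ordered by PC1 score, shown as hidden time:", recovered)
```

The sign of a principal component is arbitrary, so the recovered order may run from early to late or from late to early; deciding which end is the start needs outside knowledge.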

Technical challenges
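- More dropouts. Possible causes: the gene is not expressed; sequencing depth is low; the transcript was not captured during library preparation. Some statistical methods now exist to impute zeros.
- Estimating technical noise: use exogenous spike-in RNA as a positive control.
- Batch effects: the best way to avoid them is a balanced experimental design.
- RNA degradation during library preparation also needs to be considered.
- Doublets (droplets containing two cells).
- Some highly expressed genes, such as ribosomal genes, can also affect clustering.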
Biological challenges
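- Cell cycle: scLVM and cyclone can handle this.
- Identifying rare cell types: a divide-and-conquer strategy can be used, but whether large clusters should be split further is a separate question.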
Computational challenges
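- High dimensionality
  - Linear dimensionality reduction: PCA
  - Non-linear dimensionality reduction: t-SNE and UMAP
- Parameter choice, for example k in k-means and k in the k-nearest-neighbours graph of graph-based algorithms
- How to validate a method, and how to establish gold-standard data sets
  - Use tissues that are very well studied and understood, or consider cells taken from the earliest stages of embryonic development
  - Many of the suitable data sets are quite small, making it difficult to test methods at the kinds of scale that are relevant for current experiments
- Experimental methods can also serve as validation, for example spatial methods such as FISH and RNAscope.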
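When a gold-standard labelling is available, agreement between it and a clustering result has to be scored somehow. One widely used score is the adjusted Rand index (ARI); the notes above do not name a specific measure, so its use here is an assumption. The sketch computes the ARI from the contingency table of the two labellings, with toy labels invented for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI: 1.0 for identical partitions, around 0 for random agreement."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy labels (assumed): six cells with a known two-type gold standard.
gold = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
over_partitioned = [0, 0, 1, 1, 2, 2]

print(adjusted_rand_index(gold, same_up_to_renaming))
print(adjusted_rand_index(gold, over_partitioned))
```

The score ignores the cluster names themselves, so a result that matches the gold standard up to relabelling still scores 1.0, while over-partitioning is penalised.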
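Biological interpretation and annotation
How to label the resulting clusters is a hard problem. Much as flow cytometry relies on cell-surface proteins, scRNA-seq takes the genes that are highly expressed in a cluster as marker genes, and the cluster is then labelled by consulting the literature, databases and similar resources.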
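Picking candidate marker genes for a cluster can be sketched as ranking genes by how much higher their mean expression is inside the cluster than outside it. This is a simplification: real pipelines usually apply a statistical test rather than a raw difference of means. The gene names, expression values and the function `top_marker_genes` are hypothetical.

```python
def top_marker_genes(expr, labels, genes, cluster, n_top=2):
    """Rank genes by mean(in cluster) - mean(all other cells)."""
    inside = [row for row, lab in zip(expr, labels) if lab == cluster]
    outside = [row for row, lab in zip(expr, labels) if lab != cluster]
    scores = {}
    for j, gene in enumerate(genes):
        mean_in = sum(row[j] for row in inside) / len(inside)
        mean_out = sum(row[j] for row in outside) / len(outside)
        scores[gene] = mean_in - mean_out
    return sorted(genes, key=lambda g: scores[g], reverse=True)[:n_top]

# Toy normalised expression matrix (assumed): rows are cells, columns are genes.
genes = ["gene_A", "gene_B", "gene_C", "gene_D"]
expr = [[5, 0, 1, 2],
        [6, 1, 0, 2],
        [0, 4, 1, 2],
        [1, 5, 0, 2]]
labels = [0, 0, 1, 1]

print("cluster 0:", top_marker_genes(expr, labels, genes, cluster=0, n_top=1))
print("cluster 1:", top_marker_genes(expr, labels, genes, cluster=1, n_top=1))
```

Note that `gene_D` is expressed in every cell at the same level, so it never ranks as a marker even though it is detected everywhere; high expression alone is not enough, the gene has to be specific to the cluster.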
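GO enrichment analysis can also help; a Cell Ontology database is urgently needed here.
How should new scRNA-seq data be integrated with existing data? Batch effects have to be taken into account here.
Integration can be done by ① merging the expression matrices first and then clustering, or ② providing a BLAST-like function: given the expression profile of a cell, find its nearest neighbours.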
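The BLAST-like option can be sketched as a nearest-centroid lookup: summarise each annotated reference cluster by its mean profile, then assign a query cell to the most similar centroid, or leave it unassigned when nothing is similar enough. The reference values, the cluster names, the function names and the `min_similarity=0.7` cut-off are all assumptions for illustration, and batch effects between query and reference are ignored.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)  # undefined for an all-zero profile

def cluster_centroids(expr, labels):
    """Mean expression profile of every annotated reference cluster."""
    groups = {}
    for row, lab in zip(expr, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: [sum(col) / len(rows) for col in zip(*rows)]
            for lab, rows in groups.items()}

def project_cell(query, centroids, min_similarity=0.7):
    """Label of the most similar centroid, or 'unassigned' below the cut-off."""
    best_label, best_sim = max(
        ((lab, cosine_similarity(query, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1])
    return best_label if best_sim >= min_similarity else "unassigned"

# Toy annotated reference (assumed): two cell types, three genes.
reference = [[5, 0, 1], [6, 1, 0], [0, 4, 1], [1, 5, 0]]
reference_labels = ["type_1", "type_1", "type_2", "type_2"]
centroids = cluster_centroids(reference, reference_labels)

print(project_cell([10, 1, 1], centroids))  # deeper-sequenced type_1-like cell
print(project_cell([0, 0, 7], centroids))   # profile unlike any reference type
```

Allowing an "unassigned" outcome matters because a query data set may contain cell types that the reference never saw; forcing every cell onto its nearest reference cluster would hide them.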
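Beyond the RNA level there are data from other levels, that is, multi-omics data, which can help us identify cell types more reliably. Experimental spatial staining methods can also help verify how good the clustering is.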