10X單細胞(10X空間轉錄組)數據分析之約束非負矩陣分解(cNMF)

hello,大家好,之前在文章10X單細胞(10X空間轉錄組)數據分析之NMF(非負矩陣分解)中介紹了NMF在單細胞數據分析中的運用,簡單總結一下就是NMF算法的基本思想是將原始非負矩陣分解為兩個非負矩陣的乘積從而對原始高維矩陣進行降維表示。

我們在做WGCNA的時候,得到的模塊未必與細胞類型或者樣本關聯(lián)度很高,但是我們又想得到基因模塊與某一種細胞類型或者一種樣本類型高度關聯(lián)而與其他類型幾乎沒關系,怎么做呢? cNMF就是為了這個而來。

python代碼如下:

import numpy as np
import random

def nmf(X, r, k, e):
    '''
    X是原始矩陣
    r是分解的兩個非負矩陣的隱變量維度,要遠小于原始矩陣的維度
    k是迭代次數
    e是理想誤差

    input X
    output U,V
    '''
    d, n = X.shape
    #print(d,n)
    #U = np.mat(random.random((d, r)))
    U = np.mat(np.random.rand(d, r))
    #V = np.mat(random.random((n, r)))
    V = np.mat(np.random.rand(n, r))
    #print(U, V)

    x = 1
    for x in range(k):
        print('---------------------------------------------------')
        print('開始第', x, '輪迭代')
        #error
        X_pre = U * V.T
        E = X - X_pre
        #print E
        err= 0.0
        for i in range(d):
            for j in range(n):
                err += E[i,j] * E[i,j]
        print('誤差:', err)

        if err < e:
            break
        #update U
        a_u = U * (V.T) * V
        b_u = X * V
        for i_1 in range(d):
            for j_1 in range(r):
                if a_u[i_1, j_1] != 0:
                    U[i_1, j_1] = U[i_1, j_1] * b_u[i_1, j_1] / a_u[i_1, j_1]
        #print(U)
        
        #update V
        a_v = V * (U.T) * U
        b_v = X.T * U
        print(r,n)
        for i_2 in range(n):
            for j_2 in range(r):
                if a_v[i_2,j_2] != 0:
                    V[i_2,j_2] = V[i_2,j_2] * b_v[i_2,j_2] / a_v[i_2,j_2]
        #print(V)
        print('第', x, '輪迭代結束')

    return U, V

if __name__ == "__main__":
    X = [[5, 3, 2, 1, 2, 3],
         [4, 2, 2, 1, 1, 5],
         [1, 1, 2, 5, 2, 3],
         [1, 2, 2, 4, 3, 2],
         [2, 1, 5, 4, 1, 1],
         [1, 2, 2, 5, 3, 2],
         [2, 5, 3, 2, 2, 5],
         [2, 1, 2, 5, 1, 1], ]
    X = np.mat(X)
    #print(X)
    U, V = nmf(X, 2, 100, 0.001)
    print(U*V.T)

其實在單細胞數據分析的過程中,細胞基因的表達模式受多種因素干擾,基因通過協(xié)同作用維持細胞類型特有的生物學特征,而相互協(xié)同的基因作為基因表達模塊(GEP)一起誘導和相應內外部信號,執(zhí)行復雜的細胞功能。功能基因模塊可以出現(xiàn)在多個不同的細胞類型中,而細胞類型基因模塊代表一個單一的細胞類型,因此可利用這一事實來區(qū)分細胞類型基因模塊和功能基因模塊。通過cNMF分析,可同時推斷出細胞類型相關和功能相關的GEPs,從而改進marker基因的推斷,使得細胞類型鑒定更加準確。

什么是cNMF??

約束非負矩陣分解(CNMF)算法,該算法將標簽信息作為附加的硬約束,使得具有相同類標簽信息的數據在新的低維空間中仍然保持一致。
但是,CNMF算法對于無標簽數據樣本沒有任何約束,因此在很少的標簽信息時它的性能受限,并且對于類中只有一個樣本有標簽的情形,CNMF算法中構建的約束矩陣將退化為單位矩陣,失去其意義。

實現(xiàn)代碼

import numpy as np
import random

def cnmf(X, C, r, k, e):
    '''
    X是原始矩陣,維度為d*n
    C是有標簽樣本指示矩陣,維度為l*c(l——有標簽的樣本數量,c——類別數量)
    r是分解的兩個非負矩陣的隱變量維度,要遠小于原始矩陣的維度
    k是迭代次數
    e是理想誤差
  
    input X,C
    output U,V
    '''
    d, n = X.shape
    l, c = C.shape

    #計算A矩陣
    I = np.mat(np.identity(n-l))
    A = np.zeros((n, n + c - l))

    for i in range(l):
        for j in range(c):
            A[i,j] = C[i,j]

    for i2 in range(n-l):
        A[l+i2, c+i2] = I[i2, i2]
    A = np.mat(A)
    U = np.mat(np.random.rand(d, r))
    Z = np.mat(np.random.rand(n + c - l, r))

    #print(A)

    x = 1
    for x in range(k):
        print('---------------------------------------------------')
        print('開始第', x, '輪迭代')
        #error
        X_pre = U * (A*Z).T
        E = X - X_pre
        #print E
        err= 0.0
        for i in range(d):
            for j in range(n):
                err += E[i,j] * E[i,j]
        print('誤差:', err)

        if err < e:
            break
        #update U
        a_u = U * Z.T * A.T * A * Z
        b_u = X * A * Z
        for i_1 in range(d):
            for j_1 in range(r):
                if a_u[i_1, j_1] != 0:
                    U[i_1, j_1] = U[i_1, j_1] * b_u[i_1, j_1] / a_u[i_1, j_1]
        #print(U)

        #update Z
        #print(Z.shape,n,r)
        a_z = A.T * A * Z * U.T * U
        b_z = A.T* X.T * U
        for i_2 in range(n + c - l):
            for j_2 in range(r):
                #print(i_2, j_2, a_z[i_2,j_2])
                if a_z[i_2,j_2] != 0:
                    Z[i_2,j_2] = Z[i_2,j_2] * b_z[i_2,j_2] / a_z[i_2,j_2]
        #print(V)
        print('第', x, '輪迭代結束')

    V = (A*Z).T
    return U, V

if __name__ == "__main__":
    X = [[5, 3, 2, 1, 2, 3],
         [4, 2, 2, 1, 1, 5],
         [1, 1, 2, 5, 2, 3],
         [1, 2, 2, 4, 3, 2],
         [2, 1, 5, 4, 1, 1],
         [1, 2, 2, 5, 3, 2],
         [2, 5, 3, 2, 2, 5],
         [2, 1, 2, 5, 1, 1],]#8*6,6個樣本
    X = np.mat(X)
    C = [[0, 0, 1],
         [0, 1, 0],
         [0, 1, 0],
         [1, 0, 0],]#4*3,假設有4個樣本有標簽,總共有三類標簽
    #print(X)
    C = np.mat(C)
    U, V = cnmf(X, C, 2, 100, 0.01)
    print(U.shape, V.shape)
    print(U * V)

cNMF在單細胞數據分析中的運用(文獻用法)

cNMF的目的

To better characterize sample heterogeneity in whole-tumor samples
圖片.png
圖注: Heatmap showing the correlation between signatures obtained for all 13 whole-tumour samples.

用法

The cNMF algorithm was applied individually to each whole-tumor sample with some modifications。In brief, nonnegative matrix factorization was run 100 times for k from 2 to 15 signatures. For each k, the 100 repetitions are clustered in k groups. We expect a stable clustering solution would produce tight clusters with one signature per cluster for each of the 100 repetitions. The proportion of repetitions with one signature per cluster was called reproducibility. Clustering of the signatures was done by k means with a constraint of uniform cluster sizes, prioritizing higher correlations. The largest k with a reproducibility above 0.9 was chosen. For a chosen k, we confirmed the clustering solution was appropriate by running tSNE on the signatures it generated. The final signatures for a given sample was obtained by averaging the signature repetitions within a cluster, excluding repetitions with poor reproducibility (the ones which did not produce a signature per cluster). We obtained between five and nine final signatures per sample, 79 signatures in total. From the inter-signature Pearson correlations, we used hierarchical clustering to find trends of signatures (Fig. 1e, hierarchical tree). Six main groups emerged. In one of these groups, important variations in gene weights were observed: signatures characterized by OLIG2 and ASCL1, for example, had less DCX and STMN1, and vice versa. We reclustered this group in two, yielding the final seven groups (Fig. 1e and Supplementary Fig. 2f,就是上圖).
We identified the most characteristic genes of each signature group by ranking genes according to their relative signal to noise ratio (snr) and chose the top 40 for the heatmap.
圖片.png

關于信噪比(SNR),大家可以參考我的文章單細胞數據信噪比(Signal-to-noise ratio,SNR)。

We scored each signature according to the TCGA using the method described above (Classifying cells by TCGA subtype). A given signature was labeled with the subtype yielding the highest score.(下圖)
圖片.png

看看文獻的原始代碼(參照上述cNMF的實現(xiàn)方式)

% cNMF_seperate.m
% use NMR to find signatures in scRNA data sample by sample. Based on cNMF (Kotliar et al.)  ###看來是樣本作為主要標簽。
% Author: Charles Couturier
% Date: August 3, 2018

%% prepare data and initialize parameters
% genes: list of gene names and ensembl code; logm: log of raw counts; sample: vector 
%  identifying cells by sample; sample_id: a list of the samples
load('gbm.mat','genes','logm','sample','sample_id')
kvals = 2:15;
L = length(kvals);
nreps = 100;

allHs = cell(L,1);
allHc = cell(L,1);
allSil = cell(L,1);
allE = cell(L,1);
allR = cell(L,1);
allCL = cell(L,1);

for s = 1:length(sample_id)
fprintf('Sample %s in progress, %i of %i\n',sample_id{s},s,length(sample_id))
    
data = logm(sample==s,:);

% initialize
m = size(data,1);
n = size(data,2);

E = zeros(L,1);
R = zeros(L,1);
Sil = zeros(L,1);
consensus_H = cell(L,1);
allH = cell(L,1);
CL = cell(L,1);

%% NMF
opt = statset('MaxIter',1e6);
parfor kid = 1:L
    k = kvals(kid);
    hi = zeros(k*nreps,n);
    consensus_hi = zeros(k,n);
    D = zeros(nreps,1);
    
    for i = 1:nreps
        [~,hi(i*k-k+1:i*k,:),D(i)]=nnmf(data,k,'replicates',20,'options',opt,'algorithm','mult');
    end
    
    % k clusters
    cl = uniform_kmeans(hi,k,'Replicates',10); 
    % find replicates with 1 of each cluster
    goodrep = all(sort(reshape(cl,k,nreps),1) == [1:k]',1); 
    goodrep = repmat(goodrep,k,1);
    goodrep = goodrep(:);
    
    % consensus hi finds median of each component cluster, removing
    % bad replicates (see above)
    for i = 1:k
        consensus_hi(i,:) = median(hi(cl==i & goodrep,:),1); 
    end
    
    allH{kid} = hi;
    consensus_H{kid} = consensus_hi;
    Sil(kid) = mean(silhouette(hi,cl));
    E(kid) = mean(D);
    R(kid) = sum(goodrep)/(k*nreps);
    CL{kid} = cl;
    fprintf('Run for %i components completed\n',k)
end

allHs{s} = allH;
allHc{s} = consensus_H;
allSil{s} = Sil;
allE{s} = E;
allR{s} = R;
allCL{s} = CL;

end

save 'NMF_results_separate.mat' allHs allHc allSil allE allR allCL

%% find best ks

best_k = zeros(length(sample_id),1);
th = 0.9; % threshold in reproducibility to reach to select k

for s = 1:length(sample_id)
    kid = find(kvals==best_k(s));
    E = allE{s};
    R = allR{s};
    Sil = allSil{s};
    CL = allCL{s};
    allH = allHs{s};

    kid = max(find(R>=th));
    best_k(s) = kvals(kid);
    
    figure;
    subplot(1,2,1)
    %subplot(length(sample_id),2,2*s-1)
    yyaxis left
    plot(kvals,E)
    yyaxis right
    plot(kvals,R)
    hold on
    plot(kvals,Sil)
    %legend({'E','Reproducibility of distinct clusters','Silhouette score'})
    
    y = tsne(allH{kid});
    subplot(1,2,2)
    gscatter(y(:,1),y(:,2),CL{kid},lines)
    legend('off')
    title(sprintf('%s',sample_id{s}))
    
    print('-depsc2',sprintf('%s.eps',sample_id{s}))
end


%% get W from H

H = [];
sig_id = [];
for s = 1:length(sample_id)
    
    data = logm(sample==s,:);
    consensus_H = allHc{s};
    kid = find(kvals==best_k(s));
    H0 = consensus_H{kid};
    W0 = max(0,data/H0);

    opt = statset('Maxiter',1e6,'Display','final');
    [~,H1] = nnmf(data,best_k(s),'h0',H0,'w0',W0,...
                     'options',opt,'replicates',100,...
                     'algorithm','als');
    H = [H; H1];
    sig_id = [sig_id; ones(size(H1,1),1)*s];
             
end

save all_signatures.mat H sig_id

相對還是很簡單的,大家不妨試一試,找一找細胞類型相關協(xié)同性很高的gene set。

生活很好,有你更好

最后編輯于
?著作權歸作者所有,轉載或內容合作請聯(lián)系作者
【社區(qū)內容提示】社區(qū)部分內容疑似由AI輔助生成,瀏覽時請結合常識與多方信息審慎甄別。
平臺聲明:文章內容(如有圖片或視頻亦包括在內)由作者上傳并發(fā)布,文章內容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務。
禁止轉載,如需轉載請通過簡信或評論聯(lián)系作者。

相關閱讀更多精彩內容

友情鏈接更多精彩內容