scdiffeq | 建模多時(shí)間點(diǎn)單細(xì)胞數(shù)據(jù)的發(fā)育軌跡

單細(xì)胞測(cè)序技術(shù)通過(guò)捕獲單個(gè)細(xì)胞的瞬時(shí)分子圖譜,為重建動(dòng)態(tài)生物學(xué)過(guò)程提供了可能。在發(fā)育和疾病過(guò)程中,細(xì)胞命運(yùn)決定是由確定性調(diào)控事件和隨機(jī)性調(diào)控事件之間復(fù)雜的平衡所協(xié)調(diào)的。漂移-擴(kuò)散方程在從高維單細(xì)胞測(cè)量數(shù)據(jù)建模單細(xì)胞動(dòng)態(tài)方面具有顯著效果。雖然現(xiàn)有的解決方案能夠在細(xì)胞狀態(tài)水平上描述這些方程中漂移項(xiàng)所關(guān)聯(lián)的確定性動(dòng)態(tài),但擴(kuò)散項(xiàng)在所有細(xì)胞狀態(tài)中都被建模為常數(shù)。為了充分理解發(fā)育和疾病中的動(dòng)態(tài)調(diào)控邏輯,需要建立能夠明確關(guān)注確定性與隨機(jī)性生物學(xué)之間平衡的模型。

為解決這些局限性,作者開(kāi)發(fā)了scDiffEq,一個(gè)生成式框架,用于學(xué)習(xí)能夠近似生物學(xué)確定性和隨機(jī)性動(dòng)態(tài)的神經(jīng)隨機(jī)微分方程。通過(guò)對(duì)多能祖細(xì)胞進(jìn)行計(jì)算機(jī)模擬擾動(dòng),我們發(fā)現(xiàn)scDiffEq能夠準(zhǔn)確重現(xiàn)CRISP 擾動(dòng)造血的動(dòng)態(tài)過(guò)程。這一方法可以從單一時(shí)間點(diǎn)的單細(xì)胞數(shù)據(jù)的動(dòng)態(tài)變化進(jìn)行建模,推廣到譜系示蹤或多時(shí)間點(diǎn)數(shù)據(jù)集。利用scDiffEq,可以模擬高分辨率的發(fā)育細(xì)胞軌跡,并對(duì)其漂移和擴(kuò)散進(jìn)行建模,從而幫助研究這些軌跡的時(shí)間依賴性基因水平動(dòng)態(tài)。

import scanpy as sc
import scdiffeq as sdq
import pandas as pd
import umap

adata_ref = sc.read_h5ad('scdiffeq_data/larry/larry.h5ad')
adata_ref
AnnData object with n_obs × n_vars = 130887 × 2492
    obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y', 'clone_idx', 'fate_observed', 't0_fated', 'train', 'ct_score', 'ct_pseudotime', 'ct_num_exp_genes'
    var: 'gene_ids', 'hv_gene', 'must_include', 'exclude', 'use_genes', 'ct_gene_corr', 'ct_correlates'
    uns: 'fate_counts', 'h5ad_path', 'time_occupance'
    obsm: 'X_clone', 'X_pca', 'X_scaled', 'X_umap', 'cell_fate_df'
    layers: 'X_scaled'

UMAP = umap.UMAP(n_components=2)
adata_ref.obsm["X_umap"] = UMAP.fit_transform(adata_ref.obsm["X_pca"])

nm_clones = adata_ref.uns["fate_counts"][["Monocyte", "Neutrophil"]].dropna().index
adata_ref.obs['nm_clones'] = adata_ref.obs["clone_idx"].isin(nm_clones)
MASK = adata_ref.obs["Cell type annotation"].isin(["Monocyte", "Neutrophil", "Undifferentiated"]) & adata_ref.obs['nm_clones']

adata = adata_ref[MASK].copy()
del adata.obsm['X_clone']
del adata.obsm["cell_fate_df"]
adata.obs.index = adata.obs.reset_index(drop=True).index.astype(str)

準(zhǔn)備好數(shù)據(jù),建模過(guò)程下面幾行代碼即可,雖然建??梢栽?code>cpu環(huán)境運(yùn)行,但速度太慢,還是用gpu來(lái)加速吧:

model = sdq.scDiffEq(adata)
model.fit(train_epochs=1500)
model.drift()
model.diffusion()

不過(guò),train_epochs參數(shù)會(huì)影響模型的擬合效果,可以用下面的方式來(lái)查看模型的損失情況來(lái)確定:

import matplotlib.pyplot as plt

train_loss = model.metrics[['epoch', 'epoch_train_loss']].dropna().reset_index(drop=True)
val_loss = model.metrics[['epoch', 'epoch_validation_loss']].dropna().reset_index(drop=True)

# Plot raw values as scatter points
plt.scatter(train_loss['epoch'], train_loss['epoch_train_loss'], color='blue', s=8, alpha=0.15)
plt.scatter(val_loss['epoch'], val_loss['epoch_validation_loss'], color='orange', s=8, alpha=0.15)

# Compute moving averages
train_loss_ma = train_loss['epoch_train_loss'].rolling(window=10, min_periods=1, center=True).mean()
val_loss_ma = val_loss['epoch_validation_loss'].rolling(window=10, min_periods=1, center=True).mean()

# Plot moving averages as lines
plt.plot(train_loss['epoch'], train_loss_ma, color='blue', linewidth=2, label='Train Loss')
plt.plot(val_loss['epoch'], val_loss_ma, color='orange', linewidth=2, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

建模完成,接下來(lái)看看細(xì)胞的擬時(shí)序:

sdq.tl.velocity_graph(model.adata)
cmap = {"Undifferentiated": "dimgrey", "Neutrophil": "#023047", "Monocyte": "#F08700"}
sdq.pl.velocity_stream(model.adata, c="diffusion", scatter_kwargs={"vmax": 3})

做細(xì)胞發(fā)育軌跡推斷需要先選擇起始細(xì)胞:

progenitor = (model.adata.obs.loc[model.adata.obs["Time point"] == model.adata.obs["Time point"].min()].loc[model.adata.obs["Cell type annotation"] == "Undifferentiated"].sample(3))
progenitor
       Library       Cell barcode  Time point Starting population Cell type annotation  Well  SPRING-x  SPRING-y  clone_idx  ...       ct_pseudotime  ct_num_exp_genes  nm_clones KEGG   test fit_train  fit_val      drift  diffusion
6135     LK_d2  AATACATC-ATCCGCTA         2.0       Lin-Kit+Sca1-     Undifferentiated     0   549.977   144.213     4897.0  ...  0.8786757914387843               352       True    1  False      True    False  43.883648   0.990604
4951  LSK_d2_3  GGGCATCA-TTTATCAC         2.0       Lin-Kit+Sca1+     Undifferentiated     0   298.903   336.970     4445.0  ...  0.9406133190993315               148       True    1  False      True    False  39.128658   1.169227
2407      d2_2  AGTCACAA-TAGAAATG         2.0       Lin-Kit+Sca1-     Undifferentiated     0   353.290   719.613     4061.0  ...  0.9542395878430592               273       True    1  False      True    False  49.687836   1.136787

[3 rows x 22 columns]

然后,利用模型模擬細(xì)胞的發(fā)育過(guò)程,其中N這個(gè)參數(shù)比較特別,決定了每個(gè)起始細(xì)胞在各個(gè)時(shí)間生成的細(xì)胞數(shù)量。

adata_sim = sdq.tl.simulate(adata, idx=progenitor.index, N=512, diffeq=model.DiffEq, time_key="Time point")
sdq.tl.annotate_cell_state(adata_sim, kNN=model.kNN, obs_key="Cell type annotation")
sdq.tl.annotate_cell_fate(adata_sim, state_key="Cell type annotation")
adata_sim.obs
         t z0_idx  sim_i      sim Cell type annotation              fate
0      2.0   6135      0    61350     Undifferentiated        Neutrophil
1      2.0   4951      0    49510     Undifferentiated          Monocyte
2      2.0   2407      0    24070     Undifferentiated  Undifferentiated
3      2.0   6135      1    61351     Undifferentiated        Neutrophil
4      2.0   4951      1    49511     Undifferentiated  Undifferentiated
...    ...    ...    ...      ...                  ...               ...
62971  6.0   4951    510  4951510     Undifferentiated  Undifferentiated
62972  6.0   2407    510  2407510     Undifferentiated  Undifferentiated
62973  6.0   6135    511  6135511     Undifferentiated  Undifferentiated
62974  6.0   4951    511  4951511     Undifferentiated  Undifferentiated
62975  6.0   2407    511  2407511           Neutrophil        Neutrophil

[62976 rows x 6 columns]

模擬完成后,就可以清晰地看到隨著時(shí)間推移過(guò)程中細(xì)胞的發(fā)育軌跡:

pd.crosstab(adata_sim.obs.t, adata_sim.obs['Cell type annotation'])
t    Monocyte  Neutrophil  Undifferentiated
2.0         0           0              1536
2.1         0           0              1536
2.2         0           0              1536
2.3         0           0              1536
2.4         0           0              1536
2.5         0           0              1536
2.6         0           0              1536
2.7         0           0              1536
2.8         0           0              1536
2.9         0           1              1535
3.0         0           1              1535
3.1         0           5              1531
3.2         0          14              1522
3.3         0          54              1482
3.4         0          80              1456
3.5         1         142              1393
3.6         1         170              1365
3.7         2         235              1299
3.8         3         278              1255
3.9         5         336              1195
4.0         4         369              1163
4.1         4         403              1129
4.2         4         430              1102
4.3         5         466              1065
4.4         6         506              1024
4.5         8         530               998
4.6        14         555               967
4.7        17         571               948
4.8        22         618               896
4.9        32         641               863
5.0        35         666               835
5.1        47         687               802
5.2        49         709               778
5.3        53         731               752
5.4        59         746               731
5.5        67         764               705
5.6        74         781               681
5.7        81         800               655
5.8        88         814               634
5.9       100         826               610
6.0       108         846               582

接著,可以研究這些模擬細(xì)胞的基因表達(dá)情況。由于模擬過(guò)程使用的是PCA數(shù)據(jù),所以需要將其逆轉(zhuǎn)為基因表達(dá)值,這個(gè)過(guò)程需要提供StandardScaler和PCA模型,可以用sklearn生成。

scaler_model = sdq.io.read_pickle("scdiffeq_data/larry/scaler.pkl")
PCA = sdq.io.read_pickle("scdiffeq_data/larry/pca.pkl")
sdq.tl.annotate_gene_features(adata_sim, adata, PCA=PCA, gene_id_key="gene_ids")
sdq.tl.invert_scaled_gex(adata_sim, scaler_model = scaler_model)
adata_sim.obsm["X_gene_inv"]
gene_ids  1110002J07Rik  1110032F04Rik  1500002F19Rik  1500026H17Rik  1600010M07Rik  1700001C19Rik  1700001O22Rik  1700011I03Rik  1700016F12Rik  ...    Zfp963     Zfpm2   Zkscan2  Zmiz1os1  Zmpste24   Zmynd15     Znfx1    Zscan2       Zyx
0              0.000935       0.000000       0.000922       0.001447       0.003434       0.000488       0.000000       0.004353       0.000000  ...  0.022585  0.001477  0.000240  0.000000  0.079937  0.003646  0.133242  0.001753  0.118178
1              0.000524       0.000925       0.000347       0.000000       0.008591       0.003485       0.001161       0.003041       0.000000  ...  0.012729  0.000346  0.000589  0.000000  0.236309  0.000000  0.033923  0.001078  0.095812
2              0.000000       0.000000       0.000982       0.004394       0.009979       0.005414       0.000640       0.000962       0.000000  ...  0.012827  0.000000  0.001514  0.000268  0.235095  0.002141  0.040394  0.002379  0.100079
3              0.000935       0.000000       0.000922       0.001447       0.003434       0.000488       0.000000       0.004353       0.000000  ...  0.022585  0.001477  0.000240  0.000000  0.079937  0.003646  0.133242  0.001753  0.118178
4              0.000524       0.000925       0.000347       0.000000       0.008591       0.003485       0.001161       0.003041       0.000000  ...  0.012729  0.000346  0.000589  0.000000  0.236309  0.000000  0.033923  0.001078  0.095812
...                 ...            ...            ...            ...            ...            ...            ...            ...            ...  ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
62971          0.001237       0.000000       0.001550       0.001806       0.001409       0.000161       0.002135       0.000567       0.000000  ...  0.011184  0.000052  0.000041  0.000352  0.317569  0.000000  0.092694  0.001254  0.113291
62972          0.000011       0.000874       0.000582       0.000515       0.002702       0.001036       0.000380       0.001900       0.000000  ...  0.012742  0.000551  0.000453  0.001684  0.238807  0.002792  0.123798  0.000524  0.114049
62973          0.000006       0.004426       0.008713       0.047629       0.000000       0.000000       0.000000       0.002204       0.142455  ...  0.023694  0.000000  0.000000  0.004168  0.000000  0.000000  0.452125  0.014248  0.387427
62974          0.001503       0.004170       0.000572       0.000000       0.004613       0.000000       0.002392       0.000000       0.001085  ...  0.019704  0.000207  0.002183  0.000000  0.269295  0.000000  0.002888  0.000000  0.069778
62975          0.000383       0.000039       0.001099       0.003847       0.003830       0.000936       0.000366       0.001295       0.015110  ...  0.013344  0.000756  0.000180  0.000000  0.693576  0.010397  0.160073  0.001989  0.230127

[62976 rows x 2492 columns]

下面就可以看基因的表達(dá)情況了:

import cellplots

def mean_and_std_expr(df, adata_sim, gene):
    x = adata_sim[df.index].obsm["X_gene_inv"][gene]
    return pd.Series({'mean': x.mean(), 'std': x.std()})

genes = ["Gfi1", "Elane", "Mpo", "Gstm1", "Mmp8", "Gata2"]
means = []
stds = []
for gene in genes:
    res = adata_sim.obs.groupby(["t", "fate"]).apply(mean_and_std_expr, adata_sim=adata_sim, gene=gene)
    mean_df = res['mean'].unstack()
    std_df = res['std'].unstack()
    means.append(mean_df)
    stds.append(std_df)

fig, axes = cellplots.plot(6, 3, height=0.65, width=0.8, wspace=0.4, hspace=0.4, x_label=["t (d)"], y_label=["norm. expr"], title=genes, delete=[["top", "right"]] * 3)

for en, (mean_df, std_df) in enumerate(zip(means, stds)):
    for col in mean_df:
        if col != "Undifferentiated":
            color = cmap[col]
            # Plot mean with line
            axes[en].plot(mean_df.index, mean_df[col], label=col, c=color)
            lower = mean_df[col] - std_df[col]
            upper = mean_df[col] + std_df[col]
            axes[en].fill_between(mean_df.index, lower, upper, color=color, alpha=0.25)
    axes[en].legend(facecolor="None", edgecolor="None")

雖然用了軟件文檔里面的數(shù)據(jù),但結(jié)果還是差別有點(diǎn)大,可見(jiàn)有時(shí)候重現(xiàn)別人的結(jié)果也不是一件容易的事。

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書(shū)系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容