單細(xì)胞測(cè)序技術(shù)通過(guò)捕獲單個(gè)細(xì)胞的瞬時(shí)分子圖譜,為重建動(dòng)態(tài)生物學(xué)過(guò)程提供了可能。在發(fā)育和疾病過(guò)程中,細(xì)胞命運(yùn)決定是由確定性調(diào)控事件和隨機(jī)性調(diào)控事件之間復(fù)雜的平衡所協(xié)調(diào)的。漂移-擴(kuò)散方程在從高維單細(xì)胞測(cè)量數(shù)據(jù)建模單細(xì)胞動(dòng)態(tài)方面具有顯著效果。雖然現(xiàn)有的解決方案能夠在細(xì)胞狀態(tài)水平上描述這些方程中漂移項(xiàng)所關(guān)聯(lián)的確定性動(dòng)態(tài),但擴(kuò)散項(xiàng)在所有細(xì)胞狀態(tài)中都被建模為常數(shù)。為了充分理解發(fā)育和疾病中的動(dòng)態(tài)調(diào)控邏輯,需要建立能夠明確關(guān)注確定性與隨機(jī)性生物學(xué)之間平衡的模型。
為解決這些局限性,作者開(kāi)發(fā)了scDiffEq,一個(gè)生成式框架,用于學(xué)習(xí)能夠近似生物學(xué)確定性和隨機(jī)性動(dòng)態(tài)的神經(jīng)隨機(jī)微分方程。通過(guò)對(duì)多能祖細(xì)胞進(jìn)行計(jì)算機(jī)模擬擾動(dòng),我們發(fā)現(xiàn)scDiffEq能夠準(zhǔn)確重現(xiàn)CRISP 擾動(dòng)造血的動(dòng)態(tài)過(guò)程。這一方法可以從單一時(shí)間點(diǎn)的單細(xì)胞數(shù)據(jù)的動(dòng)態(tài)變化進(jìn)行建模,推廣到譜系示蹤或多時(shí)間點(diǎn)數(shù)據(jù)集。利用scDiffEq,可以模擬高分辨率的發(fā)育細(xì)胞軌跡,并對(duì)其漂移和擴(kuò)散進(jìn)行建模,從而幫助研究這些軌跡的時(shí)間依賴性基因水平動(dòng)態(tài)。
import scanpy as sc
import scdiffeq as sdq
import pandas as pd
import umap
adata_ref = sc.read_h5ad('scdiffeq_data/larry/larry.h5ad')
adata_ref
AnnData object with n_obs × n_vars = 130887 × 2492
obs: 'Library', 'Cell barcode', 'Time point', 'Starting population', 'Cell type annotation', 'Well', 'SPRING-x', 'SPRING-y', 'clone_idx', 'fate_observed', 't0_fated', 'train', 'ct_score', 'ct_pseudotime', 'ct_num_exp_genes'
var: 'gene_ids', 'hv_gene', 'must_include', 'exclude', 'use_genes', 'ct_gene_corr', 'ct_correlates'
uns: 'fate_counts', 'h5ad_path', 'time_occupance'
obsm: 'X_clone', 'X_pca', 'X_scaled', 'X_umap', 'cell_fate_df'
layers: 'X_scaled'
UMAP = umap.UMAP(n_components=2)
adata_ref.obsm["X_umap"] = UMAP.fit_transform(adata_ref.obsm["X_pca"])
nm_clones = adata_ref.uns["fate_counts"][["Monocyte", "Neutrophil"]].dropna().index
adata_ref.obs['nm_clones'] = adata_ref.obs["clone_idx"].isin(nm_clones)
MASK = adata_ref.obs["Cell type annotation"].isin(["Monocyte", "Neutrophil", "Undifferentiated"]) & adata_ref.obs['nm_clones']
adata = adata_ref[MASK].copy()
del adata.obsm['X_clone']
del adata.obsm["cell_fate_df"]
adata.obs.index = adata.obs.reset_index(drop=True).index.astype(str)
準(zhǔn)備好數(shù)據(jù),建模過(guò)程下面幾行代碼即可,雖然建??梢栽?code>cpu環(huán)境運(yùn)行,但速度太慢,還是用gpu來(lái)加速吧:
model = sdq.scDiffEq(adata)
model.fit(train_epochs=1500)
model.drift()
model.diffusion()
不過(guò),train_epochs參數(shù)會(huì)影響模型的擬合效果,可以用下面的方式來(lái)查看模型的損失情況來(lái)確定:
import matplotlib.pyplot as plt
train_loss = model.metrics[['epoch', 'epoch_train_loss']].dropna().reset_index(drop=True)
val_loss = model.metrics[['epoch', 'epoch_validation_loss']].dropna().reset_index(drop=True)
# Plot raw values as scatter points
plt.scatter(train_loss['epoch'], train_loss['epoch_train_loss'], color='blue', s=8, alpha=0.15)
plt.scatter(val_loss['epoch'], val_loss['epoch_validation_loss'], color='orange', s=8, alpha=0.15)
# Compute moving averages
train_loss_ma = train_loss['epoch_train_loss'].rolling(window=10, min_periods=1, center=True).mean()
val_loss_ma = val_loss['epoch_validation_loss'].rolling(window=10, min_periods=1, center=True).mean()
# Plot moving averages as lines
plt.plot(train_loss['epoch'], train_loss_ma, color='blue', linewidth=2, label='Train Loss')
plt.plot(val_loss['epoch'], val_loss_ma, color='orange', linewidth=2, label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

建模完成,接下來(lái)看看細(xì)胞的擬時(shí)序:
sdq.tl.velocity_graph(model.adata)
cmap = {"Undifferentiated": "dimgrey", "Neutrophil": "#023047", "Monocyte": "#F08700"}
sdq.pl.velocity_stream(model.adata, c="diffusion", scatter_kwargs={"vmax": 3})

做細(xì)胞發(fā)育軌跡推斷需要先選擇起始細(xì)胞:
progenitor = (model.adata.obs.loc[model.adata.obs["Time point"] == model.adata.obs["Time point"].min()].loc[model.adata.obs["Cell type annotation"] == "Undifferentiated"].sample(3))
progenitor
Library Cell barcode Time point Starting population Cell type annotation Well SPRING-x SPRING-y clone_idx ... ct_pseudotime ct_num_exp_genes nm_clones KEGG test fit_train fit_val drift diffusion
6135 LK_d2 AATACATC-ATCCGCTA 2.0 Lin-Kit+Sca1- Undifferentiated 0 549.977 144.213 4897.0 ... 0.8786757914387843 352 True 1 False True False 43.883648 0.990604
4951 LSK_d2_3 GGGCATCA-TTTATCAC 2.0 Lin-Kit+Sca1+ Undifferentiated 0 298.903 336.970 4445.0 ... 0.9406133190993315 148 True 1 False True False 39.128658 1.169227
2407 d2_2 AGTCACAA-TAGAAATG 2.0 Lin-Kit+Sca1- Undifferentiated 0 353.290 719.613 4061.0 ... 0.9542395878430592 273 True 1 False True False 49.687836 1.136787
[3 rows x 22 columns]
然后,利用模型模擬細(xì)胞的發(fā)育過(guò)程,其中N這個(gè)參數(shù)比較特別,決定了每個(gè)起始細(xì)胞在各個(gè)時(shí)間生成的細(xì)胞數(shù)量。
adata_sim = sdq.tl.simulate(adata, idx=progenitor.index, N=512, diffeq=model.DiffEq, time_key="Time point")
sdq.tl.annotate_cell_state(adata_sim, kNN=model.kNN, obs_key="Cell type annotation")
sdq.tl.annotate_cell_fate(adata_sim, state_key="Cell type annotation")
adata_sim.obs
t z0_idx sim_i sim Cell type annotation fate
0 2.0 6135 0 61350 Undifferentiated Neutrophil
1 2.0 4951 0 49510 Undifferentiated Monocyte
2 2.0 2407 0 24070 Undifferentiated Undifferentiated
3 2.0 6135 1 61351 Undifferentiated Neutrophil
4 2.0 4951 1 49511 Undifferentiated Undifferentiated
... ... ... ... ... ... ...
62971 6.0 4951 510 4951510 Undifferentiated Undifferentiated
62972 6.0 2407 510 2407510 Undifferentiated Undifferentiated
62973 6.0 6135 511 6135511 Undifferentiated Undifferentiated
62974 6.0 4951 511 4951511 Undifferentiated Undifferentiated
62975 6.0 2407 511 2407511 Neutrophil Neutrophil
[62976 rows x 6 columns]
模擬完成后,就可以清晰地看到隨著時(shí)間推移過(guò)程中細(xì)胞的發(fā)育軌跡:
pd.crosstab(adata_sim.obs.t, adata_sim.obs['Cell type annotation'])
t Monocyte Neutrophil Undifferentiated
2.0 0 0 1536
2.1 0 0 1536
2.2 0 0 1536
2.3 0 0 1536
2.4 0 0 1536
2.5 0 0 1536
2.6 0 0 1536
2.7 0 0 1536
2.8 0 0 1536
2.9 0 1 1535
3.0 0 1 1535
3.1 0 5 1531
3.2 0 14 1522
3.3 0 54 1482
3.4 0 80 1456
3.5 1 142 1393
3.6 1 170 1365
3.7 2 235 1299
3.8 3 278 1255
3.9 5 336 1195
4.0 4 369 1163
4.1 4 403 1129
4.2 4 430 1102
4.3 5 466 1065
4.4 6 506 1024
4.5 8 530 998
4.6 14 555 967
4.7 17 571 948
4.8 22 618 896
4.9 32 641 863
5.0 35 666 835
5.1 47 687 802
5.2 49 709 778
5.3 53 731 752
5.4 59 746 731
5.5 67 764 705
5.6 74 781 681
5.7 81 800 655
5.8 88 814 634
5.9 100 826 610
6.0 108 846 582
接著,可以研究這些模擬細(xì)胞的基因表達(dá)情況。由于模擬過(guò)程使用的是PCA數(shù)據(jù),所以需要將其逆轉(zhuǎn)為基因表達(dá)值,這個(gè)過(guò)程需要提供StandardScaler和PCA模型,可以用sklearn生成。
scaler_model = sdq.io.read_pickle("scdiffeq_data/larry/scaler.pkl")
PCA = sdq.io.read_pickle("scdiffeq_data/larry/pca.pkl")
sdq.tl.annotate_gene_features(adata_sim, adata, PCA=PCA, gene_id_key="gene_ids")
sdq.tl.invert_scaled_gex(adata_sim, scaler_model = scaler_model)
adata_sim.obsm["X_gene_inv"]
gene_ids 1110002J07Rik 1110032F04Rik 1500002F19Rik 1500026H17Rik 1600010M07Rik 1700001C19Rik 1700001O22Rik 1700011I03Rik 1700016F12Rik ... Zfp963 Zfpm2 Zkscan2 Zmiz1os1 Zmpste24 Zmynd15 Znfx1 Zscan2 Zyx
0 0.000935 0.000000 0.000922 0.001447 0.003434 0.000488 0.000000 0.004353 0.000000 ... 0.022585 0.001477 0.000240 0.000000 0.079937 0.003646 0.133242 0.001753 0.118178
1 0.000524 0.000925 0.000347 0.000000 0.008591 0.003485 0.001161 0.003041 0.000000 ... 0.012729 0.000346 0.000589 0.000000 0.236309 0.000000 0.033923 0.001078 0.095812
2 0.000000 0.000000 0.000982 0.004394 0.009979 0.005414 0.000640 0.000962 0.000000 ... 0.012827 0.000000 0.001514 0.000268 0.235095 0.002141 0.040394 0.002379 0.100079
3 0.000935 0.000000 0.000922 0.001447 0.003434 0.000488 0.000000 0.004353 0.000000 ... 0.022585 0.001477 0.000240 0.000000 0.079937 0.003646 0.133242 0.001753 0.118178
4 0.000524 0.000925 0.000347 0.000000 0.008591 0.003485 0.001161 0.003041 0.000000 ... 0.012729 0.000346 0.000589 0.000000 0.236309 0.000000 0.033923 0.001078 0.095812
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
62971 0.001237 0.000000 0.001550 0.001806 0.001409 0.000161 0.002135 0.000567 0.000000 ... 0.011184 0.000052 0.000041 0.000352 0.317569 0.000000 0.092694 0.001254 0.113291
62972 0.000011 0.000874 0.000582 0.000515 0.002702 0.001036 0.000380 0.001900 0.000000 ... 0.012742 0.000551 0.000453 0.001684 0.238807 0.002792 0.123798 0.000524 0.114049
62973 0.000006 0.004426 0.008713 0.047629 0.000000 0.000000 0.000000 0.002204 0.142455 ... 0.023694 0.000000 0.000000 0.004168 0.000000 0.000000 0.452125 0.014248 0.387427
62974 0.001503 0.004170 0.000572 0.000000 0.004613 0.000000 0.002392 0.000000 0.001085 ... 0.019704 0.000207 0.002183 0.000000 0.269295 0.000000 0.002888 0.000000 0.069778
62975 0.000383 0.000039 0.001099 0.003847 0.003830 0.000936 0.000366 0.001295 0.015110 ... 0.013344 0.000756 0.000180 0.000000 0.693576 0.010397 0.160073 0.001989 0.230127
[62976 rows x 2492 columns]
下面就可以看基因的表達(dá)情況了:
import cellplots
def mean_and_std_expr(df, adata_sim, gene):
x = adata_sim[df.index].obsm["X_gene_inv"][gene]
return pd.Series({'mean': x.mean(), 'std': x.std()})
genes = ["Gfi1", "Elane", "Mpo", "Gstm1", "Mmp8", "Gata2"]
means = []
stds = []
for gene in genes:
res = adata_sim.obs.groupby(["t", "fate"]).apply(mean_and_std_expr, adata_sim=adata_sim, gene=gene)
mean_df = res['mean'].unstack()
std_df = res['std'].unstack()
means.append(mean_df)
stds.append(std_df)
fig, axes = cellplots.plot(6, 3, height=0.65, width=0.8, wspace=0.4, hspace=0.4, x_label=["t (d)"], y_label=["norm. expr"], title=genes, delete=[["top", "right"]] * 3)
for en, (mean_df, std_df) in enumerate(zip(means, stds)):
for col in mean_df:
if col != "Undifferentiated":
color = cmap[col]
# Plot mean with line
axes[en].plot(mean_df.index, mean_df[col], label=col, c=color)
lower = mean_df[col] - std_df[col]
upper = mean_df[col] + std_df[col]
axes[en].fill_between(mean_df.index, lower, upper, color=color, alpha=0.25)
axes[en].legend(facecolor="None", edgecolor="None")

雖然用了軟件文檔里面的數(shù)據(jù),但結(jié)果還是差別有點(diǎn)大,可見(jiàn)有時(shí)候重現(xiàn)別人的結(jié)果也不是一件容易的事。