pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data
R環(huán)境中操作流程見(jiàn):http://m.itdecent.cn/p/b0fd795ad05c
分析步驟:
0. Preliminary work
1. Inference of co-expression modules
2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)
3. Cellular regulon enrichment matrix (aka AUCell)
0. 前期參數(shù)準(zhǔn)備
import os
import glob
import pickle
import pandas as pd
import numpy as np
from dask.diagnostics import ProgressBar
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2
from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
# 這個(gè)包我在導(dǎo)入的時(shí)候出現(xiàn)了如下報(bào)錯(cuò):(Segmentation fault)
# 因?yàn)樵摵瘮?shù)只涉及到出圖,因此沒(méi)有導(dǎo)入的必要(后期結(jié)果直接導(dǎo)入到R中進(jìn)行出圖)
# import seaborn as sns
DATA_FOLDER="~/tmp"
RESOURCES_FOLDER="~/resources"
DATABASE_FOLDER = "~/databases/"
SCHEDULER="123.122.8.24:8786"
DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")
# 讀入表達(dá)矩陣,表達(dá)矩陣的格式:橫坐標(biāo)是基因,縱坐標(biāo)是細(xì)胞
ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0).T
ex_matrix.shape
# 導(dǎo)入轉(zhuǎn)錄因子
tf_names = load_tf_names(MM_TFS_FNAME)
# 導(dǎo)入數(shù)據(jù)庫(kù)
db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
return os.path.basename(fname).split(".")[0]
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs
1. Inference of co-expression modules
# 兩條命令解決
adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True) #耗時(shí)
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))
2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)
# Calculate a list of enriched motifs and the corresponding target genes for all modules.
with ProgressBar():
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
# Create regulons from this table of enriched motifs.
regulons = df2regulons(df)
# Save the enriched motifs and the discovered regulons to disk.
df.to_csv(MOTIFS_FNAME)
with open(REGULONS_FNAME, "wb") as f:
pickle.dump(regulons, f)
3. Cellular regulon enrichment matrix (aka AUCell)
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
# 這一步出圖
# sns.clustermap(auc_mtx, figsize=(8,8))
將結(jié)果導(dǎo)入到R
首先將pySCENIC中得到的結(jié)果保存成.loom格式
# 還是在剛才的環(huán)境中(Python)
from pyscenic.export import export2loom
export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = regulons, out_fname = "your path way/xxx.loom")
# 這里會(huì)有提示(見(jiàn)圖1),按照提示改一下
export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons], out_fname = "your path way/xxx.loom")
# 這一句話運(yùn)行完畢后,會(huì)在指定目錄下生成xxx.loom文件,這就是導(dǎo)入R所需要的文件

然后進(jìn)入R環(huán)境
# 保存xxx.loom文件的路徑
pyScenicDir <- "pySCENIC_example/output"
library(SCENIC)
library(SCopeLoomR)
pyScenicLoomFile <- file.path(pyScenicDir, "SCENIC.loom")
loom <- open_loom(pyScenicLoomFile, mode="r")
# Read information from loom file:
regulons_incidMat <- get_regulons(loom)
regulons <- regulonsToGeneLists(regulons_incidMat)
regulonsAUC <- get_regulonsAuc(loom)
regulonsAucThresholds <- get_regulonThresholds(loom)
embeddings <- get_embeddings(loom)
exprMat <- get_dgem(loom)
cellInfo <- get_cellAnnotation(loom)
clusterings <- get_clusterings_withName(loom)
close_loom(loom)
以上過(guò)程完成之后,pySCENIC中的結(jié)果就成功導(dǎo)入到R環(huán)境中了,接下來(lái)就可以在R中出圖(可參考:SCENIC—Single Cell rEgulatory Network Inference and Clustering)
所用數(shù)據(jù)集
# Transcription factors:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/allTFs_hg38.txt
# Motif to TF annotation database:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/motifs.tbl
# Ranking databases:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/genome-ranking.feather
# Finally, get a small sample expression matrix (loom format):
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.loom
# Alternatively, in tsv format:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.tsv
參考鏈接:
Importing pySCENIC results
pyscenic
數(shù)據(jù)集
補(bǔ)充
當(dāng)我用真實(shí)數(shù)據(jù)跑這一套流程的時(shí)候,在“將pySCENIC中得到的結(jié)果保存成.loom格式”中出現(xiàn)了如下報(bào)錯(cuò):

網(wǎng)上搜到比較靠譜的解釋是HDF5文件類型大小有限制(HDF5 has a header limit of 64kb for all metadata of the columns. This include name, types, etc. When you go about roughly 2000 columns, you will run out of space to store all the metadata.)
因?yàn)椴皇呛芏趺唇鉀Q這個(gè)問(wèn)題,再加上我只需要regulonAUC這個(gè)對(duì)象(對(duì)應(yīng)著pyscenic當(dāng)中的auc_mtx),所以我后面就沒(méi)有用到上述提到的方法導(dǎo)入R
# 接上面的 Cellular regulon enrichment matrix (aka AUCell)
AUC_FNAME=os.path.join(DATA_FOLDER, "auc.tsv")
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
auc_mtx.to_csv(AUC_FNAME)
# 進(jìn)入R環(huán)境
# cellInfo就是SeuratObject@metadata
library(SCENIC)
library(SCopeLoomR)
library(AUCell)
pyScenicDir=DATA_FOLDER
regulonAUC <- importAUCfromText(file.path(pyScenicDir, "auc..tsv"))
regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
regulonActivity_byCellType <- sapply(split(rownames(cellInfo), cellInfo$CellType),
function(cells) rowMeans(getAUC(regulonAUC)[,cells]))
regulonActivity_byCellType_Scaled <- t(scale(t(regulonActivity_byCellType), center = T, scale=T))
pheatmap::pheatmap(regulonActivity_byCellType_Scaled, #fontsize_row=3,
color=colorRampPalette(c("blue","white","red"))(100), breaks=seq(-3, 3, length.out = 100),
treeheight_row=10, treeheight_col=10, border_color=NA)

還有一種方法,因?yàn)檫@個(gè)包里面限速的步驟是GENIE3(R里面),也可以單獨(dú)把這一步在Python中運(yùn)行(GRNBoost,其他步驟可在R中運(yùn)行),生成的結(jié)果保存到R中的/int目錄下。具體操作:
https://github.com/aertslab/SCENIC/issues/40
https://arboreto.readthedocs.io/en/latest/examples.html