Paper sharing: "AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model"

AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model


title.png

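About the author: Žiga Avsec, Ph.D.
His move from physics to computational genomics marks an important step in the convergence of artificial intelligence and genetics research. He went from Slovenia to Munich, where he explored the workings of DNA under the supervision of Julien Gagneur and contributed to tools such as Kipoi and BPNet, which have advanced our understanding of genomics.
At Google DeepMind, Žiga's work on Enformer and AlphaMissense is breaking new ground in identifying genetic variants and in the fight against genetic disease. His story offers a glimpse of the future of healthcare, in which AI-driven genomic discoveries reshape personalized medicine and disease treatment.
A more detailed profile is available at: https://blog.superbio.ai/superbio-scientist-spotlight-%C5%BEiga-avsec-ph-d-2225dacc2b9b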

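1. Background

With the rise of large language models, the Transformer has given us a better tool for deciphering the genomic code. Earlier approaches based on k-mers and short sequences are gradually being replaced by long-sequence deep learning models, which can learn more genomic information from longer stretches of DNA: modeling enhancers that interact with promoters over long distances; judging whether a single-base mutation disrupts a key regulatory site; and tracing a variant's effect across all relevant layers of regulation to reconstruct the full causal chain of disease.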

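2. Abstract

  • Goal
    Develop a deep learning model that predicts functional genomics measurements (e.g. gene expression, chromatin accessibility) from DNA sequence, in order to decode the gene regulatory code.
  • Existing problem
    Current models face a key trade-off: they either take long input sequences but predict at low resolution, or predict at high resolution but only handle short sequence fragments. This limits both the number of functional modalities (data types) they can predict and their predictive performance.
  • Proposed solution: AlphaGenome
    AlphaGenome resolves this length-versus-resolution trade-off. It takes DNA input of up to 1 megabase (1 Mb), roughly 1/3000 of the human genome, which covers a much broader regulatory context (distal enhancers, topologically associating domain boundaries, and so on). From this long input it predicts thousands of functional genomics tracks at up to single-base-pair resolution.
  • Predictions cover a highly diverse set of functional modalities, including:
    1. Gene expression levels 2. Transcription start sites
    3. Chromatin accessibility (e.g. ATAC-seq) 4. Histone modifications (e.g. H3K27ac, H3K4me3)
    5. Transcription factor binding sites 6. 3D chromatin conformation (chromatin contact maps, e.g. Hi-C)
    7. Splice site usage 8. Splice junction coordinates and their strength
  • Model training and performance evaluation:
    Training data: human and mouse genomic data.
    Evaluation metric: primarily the model's ability to predict the effects of genetic variants (e.g. SNPs), a key test of whether the model has really learned the sequence-to-function relationship.
    Results: across 26 separate evaluations against the strongest existing external models (e.g. Enformer, Basenji2), AlphaGenome matched or exceeded them on 24.
  • Key applications and value:
    Multimodal variant effect scoring: AlphaGenome's core strength is that it predicts the effect of a single variant (e.g. a pathogenic SNP) on all of the thousands of functional tracks above at the same time.
    Revealing disease mechanisms: taking clinically relevant variants near the TAL1 oncogene as an example, AlphaGenome accurately recapitulates how the variant acts across several functional layers (e.g. altering a transcription factor binding site, changing the chromatin state, and in turn affecting gene expression). This gives an integrated view of the genetic basis of complex disease.
  • Availability:
    Tool release: to encourage wider use, the authors provide tooling that lets users run genome track prediction and variant effect scoring with AlphaGenome.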

3. Model construction

3.1 Data preparation

  • Genome data

Input sequences were extracted from the hg38 (human) and mm10 (mouse) reference genomes. For sequence intervals that extended beyond chromosomal boundaries, padding with ‘N’ characters was used to ensure consistent input length.
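The padding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' data pipeline: the function name `extract_padded` and the 0-based, half-open coordinate convention are assumptions made for the example.

```python
def extract_padded(chrom_seq: str, start: int, end: int, pad: str = "N") -> str:
    """Return chrom_seq[start:end] (0-based, half-open; an assumed convention),
    filling any positions that fall outside the chromosome with `pad`."""
    if end < start:
        raise ValueError("end must be >= start")
    n = len(chrom_seq)
    # Number of requested positions that lie before the chromosome start.
    left_pad = min(end, 0) - min(start, 0)
    # Number of requested positions that lie past the chromosome end.
    right_pad = max(end, n) - max(start, n)
    # Clamp the slice bounds so negative indices never wrap around.
    lo = min(max(start, 0), n)
    hi = min(max(end, 0), n)
    return pad * left_pad + chrom_seq[lo:hi] + pad * right_pad


# A toy 4-bp "chromosome": the interval overhangs both ends by 2 bp.
window = extract_padded("ACGT", -2, 6)
```

Whatever the position of the interval, the returned string always has length `end - start`, which is what keeps the model input length constant.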

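  • Track details

Category | Data type (source) | Human | Mouse
All tracks | Total | 5930 | 1128
Gene expression | RNA-seq (ENCODE and GTEx) | 667 | 173
Gene expression | CAGE (FANTOM5) | 546 | 188
Gene expression | PRO-cap (ENCODE) | 12 | 0
Detailed splicing patterns | Splice sites (ENCODE and GTEx, realigned using STAR) | 4 | 4
Detailed splicing patterns | Splice site usage (computed by formula) | 734 | 180
Detailed splicing patterns | Splice junctions (splicemap package) | 734 | 180
Chromatin state | DNase (ENCODE) | 305 | 67
Chromatin state | ATAC-seq (ENCODE) | 167 | 18
Chromatin state | Histone modifications (ENCODE) | 1116 | 183
Chromatin state | TF binding (ENCODE) | 1617 | 127
Chromatin contact maps | Hi-C / micro-C (4D Nucleome) | 28 | 8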
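As a quick consistency check on the track counts, the per-assay numbers can be summed and compared with the stated totals. The dictionary below only restates the figures listed in this section; the variable names are illustrative.

```python
# (human, mouse) track counts per data type, as listed in this section.
track_counts = {
    "RNA-seq": (667, 173),
    "CAGE": (546, 188),
    "PRO-cap": (12, 0),
    "Splice sites": (4, 4),
    "Splice site usage": (734, 180),
    "Splice junctions": (734, 180),
    "DNase": (305, 67),
    "ATAC-seq": (167, 18),
    "Histone modifications": (1116, 183),
    "TF binding": (1617, 127),
    "Contact maps": (28, 8),
}

human_total = sum(human for human, _ in track_counts.values())
mouse_total = sum(mouse for _, mouse in track_counts.values())
print(f"{len(track_counts)} data types, human={human_total}, mouse={mouse_total}")
```

The sums reproduce the totals of 5930 human and 1128 mouse tracks, and the eleven rows match the eleven output data types mentioned later in the architecture description.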

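3.2 Model design

model1.jpg

3.2.1 Model architecture (Fig. 1a)

Core design: U-Net-style hierarchical processing

①. Input processing:
  • Sequence input: 1 Mb of DNA sequence (2^20 = 1,048,576 bp)
  • Organism identifier: distinguishes the human and mouse genomes
  • Parallelization strategy: the 1 Mb sequence is split into 131 kb chunks (131,072 bp, eight per input) that are processed across multiple accelerator devices (GPU/TPU)
②. Three-stage processing pipeline:
Stage | Function | Key techniques
Encoder | Downsamples the sequence and extracts local features (e.g. transcription factor binding sites) | Convolutional layers (capture motif features) + pooling (downsampling)
Transformer | Models long-range dependencies: distal enhancer-promoter interactions, chromatin domain structure | Attention with cross-device communication (covers the full 1 Mb context)
Decoder | Upsamples the sequence and rebuilds high-resolution output | Convolutional blocks with upsampling + skip connections (preserve detail)
③. Task-specific output heads:
  • Multi-task heads: attached to the end of the decoder, producing predictions for 11 experimental data types
  • Per-task resolution: the output resolution is set independently for each data type (e.g. single base or 128 bp bins)
  • Prediction scale: 5,930 human tracks or 1,128 mouse tracks are output simultaneously

Technical significance: the U-Net structure resolves the conflict between long input and high resolution. The encoder extracts abstract features, the Transformer models global interactions, and the decoder restores positional detail.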
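A minimal sketch of the chunking step follows, assuming the 1 Mb input is exactly 2^20 = 1,048,576 bp so that 131,072-bp chunks divide it evenly into eight. The function name `split_into_chunks` is hypothetical. The real model also exchanges information between devices in the Transformer stage, which this sketch does not attempt to show.

```python
SEQ_LEN = 2 ** 20      # assumed exact length of a "1 Mb" input: 1,048,576 bp
CHUNK_LEN = 131_072    # assumed exact length of a "131 kb" chunk: 2**17 bp


def split_into_chunks(seq: str, chunk_len: int = CHUNK_LEN) -> list[str]:
    """Split a sequence into equal, non-overlapping chunks (one per device)."""
    if len(seq) % chunk_len != 0:
        raise ValueError("sequence length must be a multiple of chunk_len")
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]


# Placeholder input; a real input would be genomic sequence.
chunks = split_into_chunks("N" * SEQ_LEN)
print(len(chunks), "chunks of", len(chunks[0]), "bp")
```

Splitting along the sequence axis keeps the per-device activation size fixed; only the attention stage needs to see all chunks at once.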

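3.2.2 Training strategy (Fig. 1b-c)

Stage ①: Teacher model training (Fig. 1b)
  • Data preparation:
    Sampled regions: 1 Mb intervals drawn from the cross-validation folds of the human and mouse genomes
    Data augmentation: random shifts (mimicking variation in regulatory element position) and reverse complementation (encouraging strand-orientation invariance)
  • Training objective:
    Directly predict experimentally measured functional genomics signals (e.g. ChIP-seq peaks, RNA expression levels)
    Two kinds of model are produced:
    Fold-specific: models trained on a single cross-validation split, with held-out regions kept for evaluation
    All-folds: a general model trained on all of the data
Stage ②: Student model distillation (Fig. 1c)
  • Distillation procedure:
    Frozen teacher: the parameters of the all-folds teacher model are fixed
    Student input: the original sequence with mutational perturbations added (mimicking natural variation)
    Learning target: the student reproduces the teacher's predictions on the perturbed sequence
  • Key advantages:
    Specialization for variant prediction: the student concentrates on the mapping from sequence changes to functional changes
    A lighter model: the result is a single efficient inference model (avoiding the cost of ensembling several teachers)

Biological significance: the teacher-student framework distills the ability to predict function into the ability to predict variant effects, which improves accuracy for clinical applications.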
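The two augmentations and the student-side perturbation could look roughly like the following. This is a sketch under stated assumptions: the function names, the maximum shift and the number of mutations are illustrative choices, not values taken from the paper.

```python
import random

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def random_shift(chrom_seq: str, start: int, length: int,
                 max_shift: int, rng: random.Random) -> str:
    """Extract a window whose start is jittered by up to +/- max_shift bp,
    clamped so the window stays inside the chromosome."""
    shift = rng.randint(-max_shift, max_shift)
    new_start = min(max(start + shift, 0), len(chrom_seq) - length)
    return chrom_seq[new_start:new_start + length]


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Substitute n_mutations distinct positions with a different base."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mutations):
        alternatives = [b for b in "ACGT" if b != bases[pos].upper()]
        bases[pos] = rng.choice(alternatives)
    return "".join(bases)


rng = random.Random(0)          # fixed seed so the example is repeatable
chromosome = "ACGT" * 10        # toy 40-bp "chromosome"
window = random_shift(chromosome, start=10, length=8, max_shift=4, rng=rng)
student_input = mutate(reverse_complement(window), n_mutations=2, rng=rng)
```

In a distillation loop, `student_input` would be passed to both the frozen teacher and the student, and the student would be trained to match the teacher's output on it.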

3.2.3 Performance evaluation (Fig. 1d-e)

①. Genome track prediction performance (Fig. 1d)
  • Evaluation metric:
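    $$\text{Relative improvement (\%)} = \frac{\text{AlphaGenome score} - \text{best baseline score}}{\text{random classifier score}}$$ (classification tasks require normalization)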

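  • Key results:

Modality | Representative task | Improvement | Technical significance
Transcriptional regulation | RNA expression prediction | Significant | Captures long-range enhancer interactions
Chromatin conformation | Hi-C contact map prediction | Largest | Models 3D structure at the 1 Mb scale
Epigenetics | H3K27ac histone modification prediction | Moderate | Marks active regulatory regions (enhancers, promoters)
RNA processing | Polyadenylation (PA) site identification | Significant | Pinpoints post-transcriptional regulatory sites

Note: improvements on 128 bp resolution tasks are generally smaller than on single-base tasks, because baseline models already perform well at that resolution.

②. Variant effect prediction performance (Fig. 1e)
  • Evaluation settings:
    Functional variants: predicting the effect of non-coding SNPs on molecular phenotypes
    Causal inference: assessing the causality and effect direction of quantitative trait loci (ds/caQTL)
  • Key results:
    Matches or exceeds the baseline on 24 of 26 tasks, spanning chromatin accessibility (ATAC), transcription factor binding (ChIP), gene expression (eQTL) and others
    Causal direction: accuracy in judging whether a variant changes a molecular phenotype improves by 15-25%

Case study: the mechanism of clinical variants near the TAL1 oncogene. Predictions across modalities together suggest the chain: variant → new TF binding motif (e.g. MYB) → gain of active chromatin → increased TAL1 expression.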

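3.2.4 Summary of technical advances

Dimension | Innovation | Core problem addressed
Architecture | U-Net + cross-device Transformer | Reconciling 1 Mb input with single-base resolution
Training strategy | Two-stage teacher-student distillation | Specializing the model for variant effect prediction
Multimodal output | 11 data types / thousands of tracks predicted in parallel | Systematic dissection of how variants cause disease
Engineering | Parallel computation over 131 kb chunks | Working around per-device memory limits to process megabase inputs
Evaluation | 26 rigorous evaluations (plus reproduction of clinical variant mechanisms) | Demonstrating general applicability to basic research and clinical use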

3.3 AlphaGenome model architecture

model2.jpg

Extended Data Figure 1 | AlphaGenome model architecture. (a) Overview schematic illustrating the flow of activations through the model. The architecture follows a U-Net-like structure with an Encoder, a central Transformer Tower, and a Decoder processing a 1Mb DNA input sequence. The Encoder uses convolutional blocks and max pooling to progressively downsample the sequence resolution (from 1 bp to 128 bp) while increasing feature channels. The Transformer Tower operates at 128 bp resolution, iteratively refining sequence representations and generating pairwise (2D) representations. The Decoder uses convolutional blocks and upsampling, incorporating skip connections (dashed lines) from corresponding Encoder stages, to restore sequence resolution up to 1 bp. An Output Embedder performs final processing before feeding representations to task-specific output heads. (b) Internal structure of key component blocks used repeatedly within the architecture overview shown in (a). Diagrams detail the layers within the convolutional blocks (Conv block, Upres block), the Transformer blocks, and the blocks responsible for generating and updating pairwise representations (Pair update block, Sequence to pair block). Tensor shapes are shown excluding the batch dimension. Abbreviations: r = log-resolution, c = channels.
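To make the resolution change in the caption concrete, the sketch below lists the sequence length at each resolution between 1 bp and 128 bp. It assumes a 2^20 bp input and that each stage halves the resolution; the halving is suggested by the log-resolution notation but is an assumption here, and the function name is illustrative.

```python
def resolution_ladder(seq_len: int = 2 ** 20, final_bin: int = 128):
    """Return (bin size in bp, number of positions) for each resolution level,
    assuming every downsampling stage pools by a factor of 2."""
    ladder = []
    bin_size = 1
    while bin_size <= final_bin:
        ladder.append((bin_size, seq_len // bin_size))
        bin_size *= 2
    return ladder


ladder = resolution_ladder()
for bin_size, positions in ladder:
    print(f"{bin_size:>3} bp bins -> {positions:>9,} positions")
```

Under these assumptions the Transformer Tower attends over 8,192 positions rather than over a million, which is what makes global attention across the whole input affordable, while the skip connections carry the fine-grained detail back to the decoder.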

4. Results

Here I go into detail on the two Results sections I find most interesting.

4.1 AlphaGenome enables state-of-the-art enhancer-gene linking

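AlphaGenome needs no task-specific training for enhancer-promoter (E-P) linking; it is evaluated zero-shot. Its Transformer module uses self-attention to pick up long-range regulatory dependencies in the sequence automatically. For example:

  • Transcription factor binding motifs typical of enhancers (e.g. MYB, CTCF) are captured by the local convolutional layers;
  • The Transformer relates these local signals to distal promoters, forming a hypothesis of a functional link.

Zero-shot performance rivals supervised models:
  • For enhancers more than 10 kb from the TSS, AlphaGenome clearly outperforms Borzoi (relative auPRC gain of 17-25%);
  • Compared with the ENCODE-rE2G-extended model, which is trained specifically for E-P linking, the gap is under 1% auPRC.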


    restlt1.jpg

Figure 4 | AlphaGenome predicts the effect of variants on gene expression. (j) Enhancer-gene linking performance (ENCODE-rE2G CRISPRi dataset [17]). Zero-shot evaluation: Performance (auPRC) comparison stratified by enhancer-TSS distance for AlphaGenome (distilled) vs Borzoi vs TSS distance baseline. Supervised evaluation: AlphaGenome input gradient score integrated into ENCODE-rE2G-extended vs ENCODE-rE2G models.
Extended Data Figure 7 | AlphaGenome improves enhancer-gene linking using input gradients and shows enhanced sensitivity to distal enhancers. (b) Impact of incorporating AlphaGenome's input gradient score as a feature in the ENCODE-rE2G extended logistic regression model, evaluated on the ENCODE-rE2G benchmark. ENCODE-rE2G is a logistic regression model trained to predict enhancer-gene interactions from features [2]. Precision-recall curves are shown, colored by the feature sets used for training the regression model (auPRC values indicated in the legend). Feature sets are:
  • rE2G extended with AlphaGenome features: All ENCODE-rE2G extended model features plus a single AlphaGenome input × gradient score.
  • AlphaGenome features only: The AlphaGenome input × gradient score alone.
  • TSS distance with AlphaGenome features: AlphaGenome input × gradient score plus the distance to TSS feature.
  • rE2G extended: All features from the ENCODE-rE2G extended model [2].
  • TSS distance: Distance to TSS feature from [2].
  • ABC features only: Subset of 'rE2G extended', with only features related to the Activity-By-Contact (ABC) model [2].
(c) Precision-recall curves for the ENCODE-rE2G benchmark, similar to panel (b), evaluating the ENCODE-rE2G extended regression model with different feature sets. Area under the precision-recall curve (auPRC) values for the different feature sets are indicated in the legend. In this configuration, 'AlphaGenome features' consist of a more comprehensive set of K562 cell line-specific variant effect scores. These include Allele-Specific Activity Scores (AAS) and variant effect scores calculated as the difference between alternate (ALT) and reference (REF) allele predictions (ALT-REF Diff scores). These scores were derived from AlphaGenome for the following genomic assays:
  • RNA-seq of the target gene
  • ChIP-TF EP300
  • ChIP-Histone H3K27ac
  • CAGE
  • PRO-cap
  • H1-ESC contact maps
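The input × gradient score mentioned in the captions can be illustrated with a toy model. Everything in this sketch is hypothetical: `toy_expression` is a stand-in linear model rather than AlphaGenome, the gradient is estimated by finite differences instead of automatic differentiation, and summing absolute attributions over the enhancer interval is just one possible way to reduce the per-base values to a single score.

```python
BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string as rows of 4 floats (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def toy_expression(x, weights):
    """Hypothetical stand-in for a model's predicted expression of one gene."""
    return sum(xi * wi for row, wrow in zip(x, weights)
               for xi, wi in zip(row, wrow))


def numerical_gradient(f, x, eps=1e-3):
    """Central finite-difference gradient of f with respect to every entry of x."""
    grad = [[0.0] * len(row) for row in x]
    for i, row in enumerate(x):
        for j in range(len(row)):
            original = row[j]
            row[j] = original + eps
            up = f(x)
            row[j] = original - eps
            down = f(x)
            row[j] = original            # restore the input
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def input_x_gradient(x, grad):
    """Per-position attribution: sum over bases of input * gradient."""
    return [sum(xi * gi for xi, gi in zip(row, grow))
            for row, grow in zip(x, grad)]


def region_score(attributions, start, end):
    """Assumed aggregation: total absolute attribution inside [start, end)."""
    return sum(abs(a) for a in attributions[start:end])


weights = [[0.5, 0, 0, 0], [0, -1.0, 0, 0], [0, 0, 2.0, 0], [0, 0, 0, 0.25]]
x = one_hot("ACGT")
grad = numerical_gradient(lambda inp: toy_expression(inp, weights), x)
attributions = input_x_gradient(x, grad)
enhancer_score = region_score(attributions, 1, 3)   # toy "enhancer" at positions 1-2
```

Because one-hot input is zero everywhere except at the observed base, multiplying by the input keeps only the gradient for the base actually present, which is why the score can be read as "how much this enhancer's sequence contributes to the predicted expression of the target gene".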

4.2 AlphaGenome improves on predicting variant effects on chromatin accessibility and transcription factor binding

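It addresses two key problems:

  • QTL effect prediction:
    Determine whether a non-coding variant (e.g. a SNP) affects chromatin accessibility (caQTL), DNase sensitivity (dsQTL) or transcription factor binding (bQTL)
    Quantify the effect size of the variant on these molecular phenotypes
  • MPRA activity prediction:
    Predict the regulatory activity of short DNA sequences (reporter gene expression level)
    Work out how local sequence variants act through chromatin state to regulate gene expression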
result2.png

Figure 5 | AlphaGenome accurately predicts variant effects on chromatin accessibility and SPI1 transcription factor binding. (a) Schematic of the center-mask variant scoring strategy. This approach, detailed in Methods, is used for accessibility (DNase-seq, ATAC-seq) and ChIP-seq predictions. (b) Performance comparison on QTL causality prediction. Average Precision (AP) for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (c) Performance comparison on QTL effect size prediction. Pearson r is shown for AlphaGenome, Borzoi, and ChromBPNet across QTL types (caQTL, dsQTL, bQTL) and ancestries. (d) AlphaGenome's predicted versus observed effect sizes for causal caQTLs (African ancestry). Scatterplot displays predictions using the DNase track for the GM12878 cell line. Signed Pearson r = 0.74; unsigned Pearson r = 0.45. Signed Pearson r correlation uses raw values; unsigned Pearson r uses absolute values. Red and blue circles highlight variants detailed in (e, f). (e) Example AlphaGenome predictions for selected caQTLs. Shown are ALT-REF differences in predicted DNase track (GM12878) around the variants highlighted in (d). (f) ISM-derived sequence logos for REF and ALT alleles of example caQTLs from (e). The examples suggest variant disruption or modulation of TF binding motifs. Putative binding factors and JASPAR [39] matrix IDs (MA0105.1, MA0105.3) are indicated on the right. (g) AlphaGenome's predicted versus observed effect sizes for causal SPI1 bQTLs. Scatterplot displays predictions using the SPI1 ChIP-seq track for the GM12878 cell line. Signed Pearson r = 0.55; unsigned Pearson r = 0.12. Red and blue circles highlight variants detailed in (h, i). (h) Example AlphaGenome predictions for selected SPI1 bQTLs. Shown are ALT-REF differences in predicted SPI1 ChIP-TF track (GM12878) around the variants highlighted in (g). (i) ISM-derived sequence logos for REF and ALT alleles of example SPI1 bQTLs from (h). Examples indicate potential motif impacts such as creation or disruption of SPI1 or related motifs. Putative binding factors and JASPAR matrix IDs (MA0081.2, MA0080.5) are indicated on the right. (j) CAGI5 MPRA challenge performance (average across loci). Top: Average zero-shot Pearson r performance, using cell type-matched raw DNase model outputs. Middle: Average Pearson r from LASSO regression using cell type-matched or cell type-agnostic DNase outputs. Bottom: LASSO regression Pearson r performance using features from multiple modalities and the full set of cell types (DNase + RNA + ChIP-Histone output types for AlphaGenome and Borzoi; DNase + CAGE output types for Enformer).
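The caption points to the paper's Methods for the details of center-mask scoring, which are not reproduced in this post. As a rough sketch of the general idea (compare REF and ALT predictions inside a window centered on the variant), the code below assumes a 501-bp window and a log2 ratio of summed signal with a pseudocount of 1. Both choices, and the function name, are illustrative assumptions.

```python
import math


def center_mask_score(ref_track, alt_track, variant_index,
                      width=501, pseudocount=1.0):
    """Log2 fold change (ALT vs REF) of the predicted signal summed over a
    window of `width` positions centered on the variant (assumed scheme)."""
    half = width // 2
    lo = max(0, variant_index - half)
    hi = min(len(ref_track), variant_index + half + 1)
    ref_sum = sum(ref_track[lo:hi])
    alt_sum = sum(alt_track[lo:hi])
    return math.log2(alt_sum + pseudocount) - math.log2(ref_sum + pseudocount)


# Toy predicted accessibility tracks: the ALT allele raises the local signal.
ref_track = [1.0] * 11
alt_track = [1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0]
score = center_mask_score(ref_track, alt_track, variant_index=5, width=5)
```

A positive score means the ALT allele is predicted to increase the signal around the variant; restricting the sum to a central window keeps distant, unrelated differences in the 1 Mb prediction from diluting the local effect.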

result2supp.png

Supplementary Figure 9 | Additional accessibility variant analysis. Extended evaluation of variant effect prediction on chromatin accessibility across diverse contexts. AP = average precision (auPRC). Signed Pearson R correlation uses raw values; unsigned Pearson R uses absolute values first. (a) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on caQTL causality prediction in European ancestry. (b) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal caQTL variants in European ancestry. (c) Precision-Recall curves comparing AlphaGenome, Borzoi, and ChromBPNet performance on dsQTL causality prediction in Yoruba ancestry. (d) Scatterplot comparing AlphaGenome’s predicted versus observed effect sizes (Coefficient) for causal dsQTL variants in Yoruba ancestry. (e) Precision-Recall curves comparing model performance for caQTL causality prediction (African ancestry). (f) Effect size prediction for microglia causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted DNase effects in a closely-related available cell type (suppressor macrophage). (g) Effect size prediction for cardiac smooth muscle cell (SMC) causal caQTL variants. Scatterplot compares observed effects versus AlphaGenome’s predicted ATAC effects in a closely-related available cell type (left cardiac atrium ATAC). (h) Precision-Recall curves comparing model performance for SPI1 bQTL causality prediction.
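The two metrics defined in these captions, signed versus unsigned Pearson r and average precision, can be written out in plain Python. The helper names are illustrative, the toy numbers are made up for the example, and tied scores are not handled specially in `average_precision`.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def signed_and_unsigned_r(predicted, observed):
    """Signed r uses raw values; unsigned r uses absolute values first."""
    signed = pearson_r(predicted, observed)
    unsigned = pearson_r([abs(p) for p in predicted],
                         [abs(o) for o in observed])
    return signed, unsigned


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy effect sizes: direction is predicted well, magnitude is not.
signed_r, unsigned_r = signed_and_unsigned_r([-2, -1, 1, 2], [-1, -2, 2, 1])
# Toy causality labels (1 = causal variant) with model scores.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

The toy data show why both correlations are reported: a model can get the direction of every effect right (high signed r) while ranking effect magnitudes poorly (low or even negative unsigned r).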

More details on AlphaGenome are available from Google DeepMind: https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/
AlphaGenome GitHub repository:
https://github.com/google-deepmind/alphagenome
