A 2023 Beginner's Guide to Deep Learning (4): Running Large Models on Your Own Computer
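The previous installment covered the foundation of large models: the self-attention mechanism and the Transformer module that implements it. Because Transformer is supported by frameworks such as PyTorch and TensorFlow, getting it to run only requires configuring the framework's support for a GPU or other accelerator.
Running a large model is not so easy, and you may well need a Linux machine. Today's popular AI software usually depends on a large number of open-source tools, and once optimization is involved you will often have to build them from source. As soon as open-source software and compilation enter the picture, Windows becomes hard mode.
Most developers work on open-source systems themselves and pay little or no attention to Windows compatibility. From Cygwin, MinGW, and CMake to WSL, many parties have worked hard to bring Linux open-source libraries to Windows, but just as Linux does not have as many games as Windows, this is an ecosystem problem.
We start with a few projects whose Windows compatibility is somewhat better, so that Windows users can also try a large model on their own machine.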
Nomic AI gpt4all (based on LLaMA)
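After ChatGPT burst onto the scene at the end of 2022, Meta took the view that OpenAI had drifted away from the "open" in its name, and released its own large model, LLaMA, on a semi-open basis. It is only semi-open because the network weights must be requested from Meta.
LLaMA was trained mainly on English material, along with some other languages written in Latin and Cyrillic scripts. Its tokenizer can handle Chinese and Japanese, but no Chinese or Japanese material was used in training.
Because the weights are not open to everyone, walking through LLaMA itself would be of little use here. We can, however, try projects built on LLaMA, such as Nomic AI's gpt4all.
A thoughtful touch in gpt4all is that it ships builds for Windows, M1 Mac, and Intel Mac, and of course it supports Linux by default. Better still, inference runs on the CPU alone.
Let's get it running.
Step one, download the gpt4all code: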
git clone https://github.com/nomic-ai/gpt4all
Step two, download the quantized network weights file: https://the-eye.eu/public/AI/models/nomic-ai/gpt4all/gpt4all-lora-quantized.bin
Step three, put the downloaded gpt4all-lora-quantized.bin in the chat directory of the gpt4all checkout.
Step four, run the gpt4all-lora-quantized executable. On Windows, that is gpt4all-lora-quantized-win64.exe. You can launch it from PowerShell or simply double-click it.
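Before launching, it can help to confirm that steps two and three left the files where the executable expects them. The following is a minimal sketch, not part of gpt4all: the helper name check_gpt4all_setup is made up for illustration, and it only checks the two file names mentioned in the steps above, assuming the Windows build.

```python
from pathlib import Path

# File names taken from the steps above (Windows build assumed).
WEIGHTS_NAME = "gpt4all-lora-quantized.bin"
WINDOWS_EXE = "gpt4all-lora-quantized-win64.exe"


def check_gpt4all_setup(repo_dir):
    """Hypothetical helper: return a list of problems; an empty list means ready."""
    chat_dir = Path(repo_dir) / "chat"
    if not chat_dir.is_dir():
        return [f"missing directory: {chat_dir}"]
    problems = []
    for name in (WEIGHTS_NAME, WINDOWS_EXE):
        if not (chat_dir / name).is_file():
            problems.append(f"missing file: {chat_dir / name}")
    return problems


if __name__ == "__main__":
    # "gpt4all" is the directory created by `git clone` in step one.
    for problem in check_gpt4all_setup("gpt4all"):
        print(problem)
```

If the script prints nothing, both files are in place and the executable can be started from the chat directory.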
Once it has started and the model has finished loading, we can chat with gpt4all:

Here is the C++ quicksort code that gpt4all generated:
Please generate quick sort algorithm in C++ language
Here's how you can implement Quick Sort Algorithm using C++ :
// Implementation of QuickSort() function for array A[] with size N and pivot value P[N/2]
void QS(int *A, int n) { // Function prototype declaration
if (n <=1 ) return; /* If there is only one element in the input Array */
// QuickSort() function for array A[] with size N and pivot value P[N/2]
else{ //else statement to check condition of quick sort algorithm.
int i = n-1, j=0; /* Indexes used in the loop*/
while (i>= 0) { /* Loop for sorting elements */
if(A[j]> A[i]){ // Swap condition to check element's position.
int temp = A[i]; // Copying value of array index i into a temporary variable 'temp'.
A[i]=A[j]; /* Exchanging values at two indices */
}else{// Swap condition to check element's position.
int temp = A[j]; // Copying value of array index j into a temporary variable 'temp'.
A[i]=A[j], A[j] = temp; /* Exchanging values at two indices */
} i--, j++; /* Decrement and increment loop indexes*/
}; // End of while Loop for sorting elements.
}//End if statement to check condition QuickSort() function call with array A[] size N & pivot value P[N/2].
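Congratulations! You have finally got a pretrained large model, of the same remarkable kind as ChatGPT, running on your own computer.
This model is considerably weaker than ChatGPT, but its base model comes from LLaMA, part of its training data is drawn from ChatGPT conversation logs, and it uses low-rank adaptation (LoRA: Low-Rank Adaptation) and quantization to shrink the model. Each of these is a current frontier technique.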
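To make the two techniques more concrete, here is a toy sketch in plain Python. It is an illustration under stated assumptions, not gpt4all's actual code: the layer size (4096 x 4096) and rank (8) are example values, and the quantizer is a simple symmetric absmax scheme rather than whatever format the released .bin file uses. LoRA replaces a full d x k weight update with two thin factors B (d x r) and A (r x k); 4-bit quantization stores each weight as an integer in [-8, 7] plus a shared scale.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters: full d x k update vs. LoRA factors B (d x r), A (r x k)."""
    full = d * k
    lora = r * (d + k)
    return full, lora


def quantize_int4(values):
    """Toy symmetric absmax quantization of floats to integers in [-8, 7]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    full, lora = lora_param_counts(d=4096, k=4096, r=8)
    print(f"full update: {full:,} params; LoRA with r=8: {lora:,} params")

    weights = [0.12, -0.70, 0.33, 0.04, -0.21]
    codes, scale = quantize_int4(weights)
    print(codes, dequantize(codes, scale))
```

With these example sizes the LoRA factors hold 65,536 parameters against 16,777,216 for the full update, and each dequantized weight differs from the original by at most half the scale.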
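The Tsinghua team's single-GPU large model: ChatGLM-6B
ChatGLM-6B from the Tsinghua team is likewise a large model designed to run on a single GPU card. Its base is the 130-billion-parameter ChatGLM.
The figure below shows Stanford University's October 2022 evaluation of the major large models of the time: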

Judging from the figure, ChatGLM trails InstructGPT, the model behind ChatGPT, but it holds up well against the other models.
Let's see how the compressed model based on ChatGLM performs.
First, download the code:
git clone https://github.com/THUDM/ChatGLM-6B
Then install the libraries it depends on:
pip install -r requirements.txt
The Python interface of gpt4all above does not support Windows, whereas ChatGLM-6B is built on the Hugging Face Transformers library and can be scripted in Python on Windows. Let's write some code to call it:
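from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True lets transformers run the model code shipped in the model repository.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Load the 4-bit quantized weights, convert to FP16, and move the model to the GPU.
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
model = model.eval()  # inference mode

# The prompt means "Implement quicksort in C++".
response, history = model.chat(tokenizer, "用C++实现快速排序", history=[])
print(response)
The output is as follows: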

#include <iostream>
using namespace std;
void swap(int& a, int& b) {
    int temp = a;
    a = b;
    b = temp;
}
int partition(int arr[], int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j < high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(arr[i], arr[j]);
        }
    }
    swap(arr[i+1], arr[high]);
    return i+1;
}
void quicksort(int arr[], int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quicksort(arr, low, pi - 1);
        quicksort(arr, pi + 1, high);
    }
}
int main() {
    int arr[] = {5, 2, 9, 1, 6, 3, 8};
    int n = sizeof(arr) / sizeof(arr[0]);
    quicksort(arr, 0, n-1);
    cout << arr[0] << endl;
    return 0;
}
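Not bad, is it? It is starting to feel a bit like ChatGPT. One caveat: the loop condition j < high - 1 in partition stops one element early, so the element just before the pivot is never compared with it. The condition should be j < high; as generated, the program does not fully sort the sample array.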
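For comparison, here is a short Python sketch of the same Lomuto partition scheme the generated C++ follows, with a loop that visits every index from low up to high - 1. It is written for this article as an illustration, not taken from ChatGLM's output.

```python
def partition(arr, low, high):
    """Lomuto partition: move the pivot arr[high] to its sorted position."""
    pivot = arr[high]
    i = low - 1
    # Visit every index in [low, high - 1]; stopping one short would leave
    # arr[high - 1] unexamined.
    for j in range(low, high):
        if arr[j] < pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


def quicksort(arr, low=0, high=None):
    """Sort arr in place between indices low and high (inclusive)."""
    if high is None:
        high = len(arr) - 1
    if low < high:
        pi = partition(arr, low, high)
        quicksort(arr, low, pi - 1)
        quicksort(arr, pi + 1, high)


if __name__ == "__main__":
    data = [5, 2, 9, 1, 6, 3, 8]
    quicksort(data)
    print(data)  # [1, 2, 3, 5, 6, 8, 9]
```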
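If GPU support for your PyTorch or TensorFlow installation is set up, this inference runs on the GPU. I chose 4-bit quantization, which needs the least GPU memory; if you have a better card, you can pick a less heavily compressed model.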
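A rough way to see why the quantization level matters is to estimate the memory taken by the weights alone: parameter count times bits per parameter. The sketch below assumes about 6.2 billion parameters for a "6B" model; activations, caches, and framework overhead come on top, so treat the results as lower bounds rather than actual requirements.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    # Assumption: roughly 6.2 billion parameters for a "6B" model.
    num_params = 6.2e9
    for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{label}: {weight_memory_gib(num_params, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is why the 4-bit variant fits on cards that cannot hold the FP16 model.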
This brings us to the hub of the Transformer era: Hugging Face. The transformers library we imported in the code above is a Hugging Face product:
from transformers import AutoTokenizer, AutoModel

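As the figure above shows, Hugging Face is essentially the distribution hub for all kinds of Transformer models. Through the Hugging Face interface you can use practically every open-source large model.
How large models are built
Although the network weights must be requested, the model code of Meta's LLaMA is open source. Let's see how LLaMA's Transformer differs from the standard Transformer we built in the previous installment: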
class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers
        self.tok_embeddings = ParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )
        self.freqs_cis = precompute_freqs_cis(
            self.params.dim // self.params.n_heads, self.params.max_seq_len * 2
        )
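We can see that, to support model-parallel training, Meta replaces the plain fully connected layer with ColumnParallelLinear from its fairscale library. The token embedding layer is likewise a parallel version, ParallelEmbedding.
As n_layers dictates, it stacks that many TransformerBlock layers.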
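The idea behind a column-parallel linear layer can be shown without any GPU: split the weight matrix by output columns, let each worker compute its slice, and concatenate the slices. The sketch below is a plain-Python illustration of that idea only. The function names are made up, and it does not reproduce the ColumnParallelLinear implementation, which also handles communication between devices.

```python
def matmul(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w, num_workers):
    """Give each worker a contiguous slice of the output columns."""
    d_out = len(w[0])
    assert d_out % num_workers == 0, "columns must divide evenly"
    width = d_out // num_workers
    return [[row[k * width:(k + 1) * width] for row in w] for k in range(num_workers)]


def column_parallel_linear(x, w, num_workers):
    """Each worker computes its slice; concatenating the slices gives the full output."""
    out = []
    for shard in split_columns(w, num_workers):
        # In a real setup each shard would live on its own device.
        out.extend(matmul(x, shard))
    return out


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    print(matmul(x, w))                     # [11, 14, 17, 20]
    print(column_parallel_linear(x, w, 2))  # [11, 14, 17, 20]
```

Because each output column depends only on its own column of weights, the split result matches the unsplit product exactly.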
Now let's look at the block itself:
class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim, hidden_dim=4 * args.dim, multiple_of=args.multiple_of
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x: torch.Tensor, start_pos: int, freqs_cis: torch.Tensor, mask: Optional[torch.Tensor]):
        h = x + self.attention.forward(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out
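The forward method above applies the same pattern twice: normalize the input with RMSNorm, run a sublayer, and add the result back to the unnormalized input. Here is a plain-Python sketch of that pattern on a single vector. The function names and the default eps are illustrative choices, and the RMSNorm formula (divide by the root mean square, then multiply by a learned weight) is the usual definition rather than a copy of LLaMA's class.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def pre_norm_residual(x, sublayer, weight):
    """Compute x + sublayer(norm(x)), the pattern used twice in the block's forward."""
    y = sublayer(rms_norm(x, weight))
    return [a + b for a, b in zip(x, y)]


if __name__ == "__main__":
    x = [3.0, 4.0]
    weight = [1.0, 1.0]
    print(rms_norm(x, weight))
    # A stand-in sublayer that doubles its input; attention or the
    # feed-forward network would take its place in the real block.
    print(pre_norm_residual(x, lambda h: [2.0 * v for v in h], weight))
```

Unlike LayerNorm, this normalization does not subtract the mean, so it needs only one pass over the squared values.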
We find that it does not use a standard multi-head attention module, but implements an attention class of its own:
class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_local_heads = args.n_heads // fs_init.get_model_parallel_world_size()
        self.head_dim = args.dim // args.n_heads
        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )
        self.cache_k = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()
        self.cache_v = torch.zeros(
            (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
        ).cuda()
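After all that, it turns out to be multi-head attention with model-parallel support and a key/value cache added. Q, K, and V are dressed up differently, but at its core this is still multi-head self-attention.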
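The role of cache_k and cache_v can be illustrated with a small single-head sketch in plain Python. It assumes the query, key, and value vectors have already been projected, and it leaves out multiple heads, rotary embeddings, and batching, so it is a simplified picture rather than LLaMA's implementation. During generation, each new token appends its key and value to the cache and attends over everything stored so far, which gives the same result as recomputing causal attention from scratch.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class CachedAttention:
    """Keeps the keys and values of earlier positions so they are not recomputed."""

    def __init__(self):
        self.cache_k = []
        self.cache_v = []

    def step(self, q, k, v):
        # Append this position's key/value, then attend over everything so far.
        self.cache_k.append(k)
        self.cache_v.append(v)
        return attend(q, self.cache_k, self.cache_v)


def full_causal_attention(qs, ks, vs):
    """Reference: position t attends over positions 0..t, recomputed from scratch."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ks = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]
    vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    cached = CachedAttention()
    incremental = [cached.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
    print(incremental == full_causal_attention(qs, ks, vs))
```

The cache trades memory for time: each step does work proportional to the current sequence length instead of reprocessing the whole prefix.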
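Other interesting projects
LMFlow
LMFlow is another project that has drawn a lot of attention recently. Built by the Hong Kong University of Science and Technology on top of LLaMA, it open-sources the whole pipeline and can be trained on a single 3090 GPU.
Its repository is at: https://github.com/OptimalScale/LMFlow
What currently sets LMFlow apart is that the pipeline it provides is fairly complete.
For example, among today's open-source projects, LMFlow is one of the few that provides Instruction Tuning.
Let's look at an Instruction Tuning example: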
{"id": 0, "instruction": "The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words.", "input": "If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.", "infer30b_before_item": " Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n Output: Output: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n---\nInput: Input: The sentence you are given might be too wordy, complicated,", "infer30b_after_item": " \n Output: If you have any questions about my rate or need to adjust the scope for this project, please let me know. \n\n", "infer13b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. 
Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n", "infer13b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n", "infer7b_before_item": " The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nOutput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. 
If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\nInput: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by", "infer7b_after_item": " \n Output: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. \n\n"}
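This shows how the correction is actually done: in the "before" fields the models mostly echo the prompt back, while in the "after" fields they answer the instruction. This step is what plain LLaMA lacks.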
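To show how a record like the one above might be turned into model input, here is a minimal sketch that reads the instruction and input fields and joins them into one prompt string. The template (instruction, then "Input:", then "Output:") and the helper name build_prompt are assumptions made for illustration; LMFlow's own data pipeline may format prompts differently.

```python
import json


def build_prompt(record):
    """Hypothetical template: instruction, optional input, then an output marker."""
    parts = [record["instruction"].strip()]
    if record.get("input", "").strip():
        parts.append("Input: " + record["input"].strip())
    parts.append("Output:")
    return "\n".join(parts)


if __name__ == "__main__":
    # A shortened stand-in for the record shown above.
    line = '{"id": 0, "instruction": "Rewrite the sentence concisely.", "input": "Please let me know."}'
    print(build_prompt(json.loads(line)))
```

During instruction tuning, the model is trained to continue such a prompt with the desired answer instead of repeating it.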
HuggingGPT
Recently a team from Zhejiang University and Microsoft released the Jarvis project, which takes full advantage of Hugging Face's position as the central hub.

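Unfortunately, the two projects above, along with the advanced uses of the earlier projects, are hard to get working on Windows. We will cover the experiments that need a Linux environment together in a later installment.
Summary
- By pruning, rank reduction, quantization, and similar techniques, we can run large-model inference on resource-constrained computers, at some cost in quality. The right trade-off depends on the use case; if prompt engineering alone solves the problem, so much the better
- Hugging Face is both the programming interface and the distribution hub for pretrained large models
- The underlying principle of large models is still the self-attention model we studied in the previous installment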