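Python bytecode
- Although Python is an interpreted language, it does not interpret source code directly
- The interpreter first compiles the source code into bytecode, and the Python virtual machine then executes that bytecode
- The built-in dis module can disassemble a target function into bytecode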
import dis

def fun(x, y, z):
    a = 1
    a += 1
    print("aaa")
    fun(1, 2, 3)
    return

dis.dis(fun)
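- The console output is shown below
  - The first column is the line number in the source code
  - The second column is the byte offset of the instruction within the bytecode
  - The third column is the instruction (opcode) name
  - The fourth column is the instruction's argument, with the resolved value in parentheses
- Take source line 6 as an example
  - The interpreter first loads the global object print and pushes it onto the stack
  - It then pushes the string 'aaa' onto the stack
  - CALL_FUNCTION 1 states that the call has 1 argument: the interpreter takes the top 1 value from the stack, passes it to the function, and calls it
  - Note that multiple arguments are pushed from left to right, so the rightmost argument ends up on top of the stack
  - CALL_FUNCTION pops the arguments and the function object itself, then pushes the return value; that value is not used here, so a POP_TOP follows to discard it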
> python3 -u "/Users/biox/NutStore/Codes/VSCode/python/py1.py"
4 0 LOAD_CONST 1 (1)
2 STORE_FAST 3 (a)
5 4 LOAD_FAST 3 (a)
6 LOAD_CONST 1 (1)
8 INPLACE_ADD
10 STORE_FAST 3 (a)
6 12 LOAD_GLOBAL 0 (print)
14 LOAD_CONST 2 ('aaa')
16 CALL_FUNCTION 1
18 POP_TOP
7 20 LOAD_GLOBAL 1 (fun)
22 LOAD_CONST 1 (1)
24 LOAD_CONST 3 (2)
26 LOAD_CONST 4 (3)
28 CALL_FUNCTION 3
30 POP_TOP
8 32 LOAD_CONST 0 (None)
34 RETURN_VALUE
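The call sequence described above can be modelled with a toy stack machine. This is a minimal sketch, not CPython's evaluation loop: `run`, `record` and the tuple-based instruction format are hypothetical names chosen for illustration, and only the four instructions from the example are handled.

```python
def run(instructions, global_names):
    """Execute a tiny subset of CPython-style stack instructions."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_GLOBAL":
            stack.append(global_names[arg])
        elif opname == "LOAD_CONST":
            stack.append(arg)
        elif opname == "CALL_FUNCTION":
            # The rightmost argument is on top, so reverse after popping.
            args = [stack.pop() for _ in range(arg)][::-1]
            func = stack.pop()          # the callable sits below its arguments
            stack.append(func(*args))   # the return value replaces them
        elif opname == "POP_TOP":
            stack.pop()                 # discard the unused return value
        else:
            raise ValueError("unsupported instruction: " + opname)
    return stack


calls = []

def record(*args):
    """Stand-in for the called function; remembers what it received."""
    calls.append(args)
    return None


# Mirrors the listing for `fun(1, 2, 3)` on source line 7.
program = [
    ("LOAD_GLOBAL", "fun"),
    ("LOAD_CONST", 1),
    ("LOAD_CONST", 2),
    ("LOAD_CONST", 3),
    ("CALL_FUNCTION", 3),
    ("POP_TOP", None),
]
final_stack = run(program, {"fun": record})
print(calls)
print(final_stack)
```

The function receives its arguments in source order, and the stack is empty again once POP_TOP has run.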
Common instructions
See the official documentation for details
General and unary instructions
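| Instruction | Effect |
|---|---|
| NOP | Does nothing; used as a placeholder |
| POP_TOP | Pops the top-of-stack item |
| LOAD_CONST | Pushes the constant selected by the argument onto the stack |
| LOAD_GLOBAL | Pushes the named global object onto the stack |
| STORE_FAST | Stores the top-of-stack value into the given local variable |
| COMPARE_OP | Performs a comparison; the argument selects the operator |
| CALL_FUNCTION | Calls a function; the argument is the number of positional arguments |
| BUILD_SLICE | Builds a slice; the argument is the number of values (2 or 3), pushed in the order Val1, Val2, Val3 for [Val1:Val2:Val3] |
| JUMP_ABSOLUTE | Jumps to the bytecode offset given by the argument (an absolute target, not a relative distance) |
| UNARY_POSITIVE | Implements Val1 = +Val1 |
| UNARY_NEGATIVE | Implements Val1 = -Val1 |
| UNARY_NOT | Implements Val1 = not Val1 |
| UNARY_INVERT | Implements Val1 = ~Val1 |
| FOR_ITER | Drives a for loop: fetches the next item from the iterator on top of the stack, or jumps past the loop once it is exhausted |
| GET_ITER | Gets an iterator (usually followed by a loop) |
| GET_YIELD_FROM_ITER | Gets the iterator used by yield from |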
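Binary instructions
| Instruction | Effect |
|---|---|
| BINARY_POWER | Exponentiation; the top of the stack is the exponent |
| BINARY_MULTIPLY | Multiplication |
| BINARY_MATRIX_MULTIPLY | Matrix multiplication, introduced in 3.5 |
| BINARY_FLOOR_DIVIDE | Division, result rounded down |
| BINARY_TRUE_DIVIDE | Division |
| BINARY_MODULO | Remainder |
| BINARY_ADD | Addition |
| BINARY_SUBTRACT | Subtraction |
| BINARY_SUBSCR | Subscript lookup; the top of the stack is the index |
| BINARY_LSHIFT | Left shift (shifting by one bit multiplies by 2) |
| BINARY_RSHIFT | Right shift (shifting by one bit divides by 2, rounding down) |
| BINARY_AND | Bitwise AND |
| BINARY_XOR | Bitwise XOR |
| BINARY_OR | Bitwise OR |
| STORE_SUBSCR | Stores by subscript, e.g. Val1[Val2] = Val3 |
| DELETE_SUBSCR | Deletes by subscript, e.g. del Val1[Val2] |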
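In-place instructions (for example the one behind `b += 1`) are the arithmetic and bitwise BINARY instructions above with the BINARY prefix replaced by INPLACE
See the official documentation for the other instructions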
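To see the BINARY and INPLACE pairing on whatever interpreter is at hand, the two forms can be disassembled side by side. This is a small sketch; `binary`, `inplace` and `arithmetic_ops` are hypothetical helper names. The instruction names depend on the CPython version: older releases print BINARY_ADD and INPLACE_ADD as in the tables above, while newer ones fold both into a single BINARY_OP whose argument distinguishes `+` from `+=`.

```python
import dis


def binary(b):
    return b + 1


def inplace(b):
    b += 1
    return b


def arithmetic_ops(func):
    """Return (opname, argrepr) for the arithmetic instructions of func."""
    return [
        (ins.opname, ins.argrepr)
        for ins in dis.get_instructions(func)
        if ins.opname.startswith(("BINARY_", "INPLACE_"))
    ]


binary_ops = arithmetic_ops(binary)
inplace_ops = arithmetic_ops(inplace)
print("b + 1  ->", binary_ops)
print("b += 1 ->", inplace_ops)
```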
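Parsing pyc files
- A pyc file is the persisted form of a PyCodeObject
- You may also come across pyo files, which hold the bytecode generated when the interpreter runs with optimization enabled (pyo files were dropped in Python 3.5)
- This optimization only shrinks the file; the code runs at about the same speed as from a pyc
- In particular, for imported files the interpreter generates a corresponding pyc file to speed up loading the next time the file is imported
- On later imports the interpreter looks for the persisted object first
- When Python source code is run, the interpreter first compiles it into a PyCodeObject held in memory
- Apart from the step of building the PyCodeObject, pyc files, pyo files and source files run at practically the same speed
- Note that a pyc file only runs on the interpreter version that generated it
  - Python stores a magic number in each pyc file to mark the version it belongs to
  - `./lib/python3.7/importlib/_bootstrap_external.py` in the interpreter directory keeps an explicit record of these version numbers
    - The version here is the version of the interpreter's bytecode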
# Magic word to reject .pyc files generated by other Python versions.
# It should change for each incompatible change to the bytecode.
#
# The value of CR and LF is incorporated so if you ever read or write
# a .pyc file in text mode the magic number will be wrong; also, the
# Apple MPW compiler swaps their values, botching string constants.
#
# There were a variety of old schemes for setting the magic number.
# The current working scheme is to increment the previous value by
# 10.
#
# Starting with the adoption of PEP 3147 in Python 3.2, every bump in magic
# number also includes a new "magic tag", i.e. a human readable string used
# to represent the magic number in __pycache__ directories. When you change
# the magic number, you must also set a new unique magic tag. Generally this
# can be named after the Python major version of the magic number bump, but
# it can really be anything, as long as it's different than anything else
# that's come before. The tags are included in the following table, starting
# with Python 3.2a0.
#
# Known values:
# Python 1.5: 20121
# Python 1.5.1: 20121
# Python 1.5.2: 20121
# Python 1.6: 50428
# Python 2.0: 50823
# Python 2.0.1: 50823
# Python 2.1: 60202
# Python 2.1.1: 60202
# Python 2.1.2: 60202
# Python 2.2: 60717
# Python 2.3a0: 62011
# Python 2.3a0: 62021
# Python 2.3a0: 62011 (!)
# Python 2.4a0: 62041
# Python 2.4a3: 62051
# Python 2.4b1: 62061
# Python 2.5a0: 62071
# Python 2.5a0: 62081 (ast-branch)
# Python 2.5a0: 62091 (with)
# Python 2.5a0: 62092 (changed WITH_CLEANUP opcode)
# Python 2.5b3: 62101 (fix wrong code: for x, in ...)
# Python 2.5b3: 62111 (fix wrong code: x += yield)
# Python 2.5c1: 62121 (fix wrong lnotab with for loops and
# storing constants that should have been removed)
# Python 2.5c2: 62131 (fix wrong code: for x, in ... in listcomp/genexp)
# Python 2.6a0: 62151 (peephole optimizations and STORE_MAP opcode)
# Python 2.6a1: 62161 (WITH_CLEANUP optimization)
# Python 2.7a0: 62171 (optimize list comprehensions/change LIST_APPEND)
# Python 2.7a0: 62181 (optimize conditional branches:
# introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE)
# Python 2.7a0 62191 (introduce SETUP_WITH)
# Python 2.7a0 62201 (introduce BUILD_SET)
# Python 2.7a0 62211 (introduce MAP_ADD and SET_ADD)
# Python 3000: 3000
# 3010 (removed UNARY_CONVERT)
# 3020 (added BUILD_SET)
# 3030 (added keyword-only parameters)
# 3040 (added signature annotations)
# 3050 (print becomes a function)
# 3060 (PEP 3115 metaclass syntax)
# 3061 (string literals become unicode)
# 3071 (PEP 3109 raise changes)
# 3081 (PEP 3137 make __file__ and __name__ unicode)
# 3091 (kill str8 interning)
# 3101 (merge from 2.6a0, see 62151)
# 3103 (__file__ points to source file)
# Python 3.0a4: 3111 (WITH_CLEANUP optimization).
# Python 3.0b1: 3131 (lexical exception stacking, including POP_EXCEPT
#3021)
# Python 3.1a1: 3141 (optimize list, set and dict comprehensions:
# change LIST_APPEND and SET_ADD, add MAP_ADD #2183)
# Python 3.1a1: 3151 (optimize conditional branches:
# introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE
#4715)
# Python 3.2a1: 3160 (add SETUP_WITH #6101)
# tag: cpython-32
# Python 3.2a2: 3170 (add DUP_TOP_TWO, remove DUP_TOPX and ROT_FOUR #9225)
# tag: cpython-32
# Python 3.2a3 3180 (add DELETE_DEREF #4617)
# Python 3.3a1 3190 (__class__ super closure changed)
# Python 3.3a1 3200 (PEP 3155 __qualname__ added #13448)
# Python 3.3a1 3210 (added size modulo 2**32 to the pyc header #13645)
# Python 3.3a2 3220 (changed PEP 380 implementation #14230)
# Python 3.3a4 3230 (revert changes to implicit __class__ closure #14857)
# Python 3.4a1 3250 (evaluate positional default arguments before
# keyword-only defaults #16967)
# Python 3.4a1 3260 (add LOAD_CLASSDEREF; allow locals of class to override
# free vars #17853)
# Python 3.4a1 3270 (various tweaks to the __class__ closure #12370)
# Python 3.4a1 3280 (remove implicit class argument)
# Python 3.4a4 3290 (changes to __qualname__ computation #19301)
# Python 3.4a4 3300 (more changes to __qualname__ computation #19301)
# Python 3.4rc2 3310 (alter __qualname__ computation #20625)
# Python 3.5a1 3320 (PEP 465: Matrix multiplication operator #21176)
# Python 3.5b1 3330 (PEP 448: Additional Unpacking Generalizations #2292)
# Python 3.5b2 3340 (fix dictionary display evaluation order #11205)
# Python 3.5b3 3350 (add GET_YIELD_FROM_ITER opcode #24400)
# Python 3.5.2 3351 (fix BUILD_MAP_UNPACK_WITH_CALL opcode #27286)
# Python 3.6a0 3360 (add FORMAT_VALUE opcode #25483)
# Python 3.6a1 3361 (lineno delta of code.co_lnotab becomes signed #26107)
# Python 3.6a2 3370 (16 bit wordcode #26647)
# Python 3.6a2 3371 (add BUILD_CONST_KEY_MAP opcode #27140)
# Python 3.6a2 3372 (MAKE_FUNCTION simplification, remove MAKE_CLOSURE
# #27095)
# Python 3.6b1 3373 (add BUILD_STRING opcode #27078)
# Python 3.6b1 3375 (add SETUP_ANNOTATIONS and STORE_ANNOTATION opcodes
# #27985)
# Python 3.6b1 3376 (simplify CALL_FUNCTIONs & BUILD_MAP_UNPACK_WITH_CALL
#27213)
# Python 3.6b1 3377 (set __class__ cell from type.__new__ #23722)
# Python 3.6b2 3378 (add BUILD_TUPLE_UNPACK_WITH_CALL #28257)
# Python 3.6rc1 3379 (more thorough __class__ validation #23722)
# Python 3.7a1 3390 (add LOAD_METHOD and CALL_METHOD opcodes #26110)
# Python 3.7a2 3391 (update GET_AITER #31709)
# Python 3.7a4 3392 (PEP 552: Deterministic pycs #31650)
# Python 3.7b1 3393 (remove STORE_ANNOTATION opcode #32550)
# Python 3.7b5 3394 (restored docstring as the first stmt in the body;
# this might affected the first line number #32911)
#
# MAGIC must change whenever the bytecode emitted by the compiler may no
# longer be understood by older implementations of the eval loop (usually
# due to the addition of new opcodes).
#
# Whenever MAGIC_NUMBER is changed, the ranges in the magic_values array
# in PC/launcher.c must also be updated.
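The magic number of the running interpreter can be inspected through `importlib.util.MAGIC_NUMBER`. The sketch below decodes it the same way the table above is organised; the value 3394 is simply the last entry of that table, used to show how a listed number maps to the four bytes on disk.

```python
import importlib.util

# Four bytes: a 16-bit little-endian version number followed by CR LF.
magic = importlib.util.MAGIC_NUMBER
version = int.from_bytes(magic[:2], "little")
print("this interpreter:", version, magic)

# Rebuild the on-disk bytes for a value from the table (Python 3.7b5: 3394).
table_magic = (3394).to_bytes(2, "little") + b"\r\n"
print("3394 on disk:", table_magic.hex())
```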
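- The C struct behind the code object stored in a pyc file can be found in `./include/python2.7/code.h`, or in the corresponding file of other versions
  - The code is as follows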
/* Bytecode object */
typedef struct {
    PyObject_HEAD
    int co_argcount;        /* #arguments, except *args */
    int co_nlocals;         /* #local variables */
    int co_stacksize;       /* #entries needed for evaluation stack */
    int co_flags;           /* CO_..., see below */
    PyObject *co_code;      /* instruction opcodes */
    PyObject *co_consts;    /* list (constants used) */
    PyObject *co_names;     /* list of strings (names used) */
    PyObject *co_varnames;  /* tuple of strings (local variable names) */
    PyObject *co_freevars;  /* tuple of strings (free variable names) */
    PyObject *co_cellvars;  /* tuple of strings (cell variable names) */
    /* The rest doesn't count for hash/cmp */
    PyObject *co_filename;  /* string (where it was loaded from) */
    PyObject *co_name;      /* string (name, for reference) */
    int co_firstlineno;     /* first source line number */
    PyObject *co_lnotab;    /* string (encoding addr<->lineno mapping) See
                               Objects/lnotab_notes.txt for details. */
    void *co_zombieframe;   /* for optimization only (see frameobject.c) */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
} PyCodeObject;
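Most of these struct fields are visible from Python as attributes of a function's `__code__` object. The sketch below prints a few of them for a function shaped like the earlier `fun` example (trimmed so it can be defined without recursion); the selection of fields is arbitrary.

```python
def fun(x, y, z):
    a = 1
    a += 1
    print("aaa")
    return a


code = fun.__code__
fields = (
    "co_argcount", "co_nlocals", "co_stacksize", "co_flags",
    "co_names", "co_varnames", "co_filename", "co_name", "co_firstlineno",
)
for field in fields:
    # Each attribute mirrors the struct member of the same name.
    print(field, "=", getattr(code, field))
```

Here `co_varnames` lists the three parameters followed by the local `a`, which is why `co_nlocals` is one larger than `co_argcount`.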
- Compile Python source code into a pyc file; the version used here is Python 2.7.6
  - Here we compile a file named `py1.py`
import py_compile
py_compile.compile('./py1.py')
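- The compiled file appears in the directory of the source file
- Opening it in 010 Editor brings up a prompt to install a pyc bytecode template, although that template only supports 2.4 - 2.7
- Nothing from Python 3 onward can use it...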

- The example used here is py&flower.pyc, a reverse-engineering challenge from the 2020 Central South University campus CTF, opened in 010 Editor

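- The first 4 bytes are the magic number; the first two of them are the interpreter's bytecode version number
  - Here the first two bytes encode 62211, i.e. the bytecode interpreter of Python 2.7a0
  - Note that the value is little-endian, with the high byte last: the file holds `03 F3`, which reads as 0xF303
- The four bytes after the magic number are a timestamp, here 0x5EC652B0, and the Python code object follows
- The code object starts with one byte giving the object type, here TYPE_CODE, with value 0x63
- The next four bytes give the number of arguments, i.e. co_argcount
- The next four bytes are the number of local variables, co_nlocals
- The next four bytes are the stack size, co_stacksize
- The next four bytes are co_flags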
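- Next comes co_code, the compiled bytecode itself
  - co_code also starts with one byte giving the object type, here TYPE_STRING, 0x73
  - The next four bytes give the length of co_code, followed by the bytecode; here the length is 0xA7
  - So the following 167 bytes (0xA7) are all bytecode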

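- After the bytecode of co_code comes co_consts, the constants used by the code
  - It starts with TYPE_TUPLE, marking it as a tuple
  - The next four bytes are the number of elements, here 0x23; after that each element is a type byte followed by its value, 0x23 entries in total
  - The first byte of each entry is the element type, e.g. 0x69 for TYPE_INT, followed by the value
  - The rest of the file maps onto the remaining struct fields in the same way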

Bytecode obfuscation
Anti-uncompyle6
- For a normal pyc file, the uncompyle6 tool decompiles the bytecode and recovers the original source code
> uncompyle6 ./py2.pyc
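- To make uncompyle6 fail, prepend `0x71 0x03 0x00` to co_code and add 3 to the field that records the length of co_code
  - These bytes are `JUMP_ABSOLUTE 3`, a jump to offset 3, which is the instruction immediately after it, so the program logic does not change
  - uncompyle6's reconstruction logic, however, cannot recognise what this bytecode originally meant and fails to parse it
  - Any absolute jump targets already present in co_code would also have to be shifted by 3, which matters once the code contains loops or branches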
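The patch described above is plain byte manipulation and can be sketched as follows. `prepend_jump` is a hypothetical helper that operates on an isolated co_code field in the Python 2.7 layout (type byte `s`, 4-byte little-endian length, raw bytes); it assumes the existing code has no absolute jumps that would need adjusting, and the 4-byte sample code is made up.

```python
import struct

JUMP_ABSOLUTE = 0x71  # opcode value quoted in the text (CPython 2.7)


def prepend_jump(code_field):
    """Insert JUMP_ABSOLUTE 3 at the head of a marshalled co_code string."""
    if code_field[:1] != b"s":
        raise ValueError("expected TYPE_STRING (0x73)")
    (length,) = struct.unpack_from("<I", code_field, 1)
    body = code_field[5:5 + length]
    if len(body) != length:
        raise ValueError("length field does not match the data")
    patch = bytes([JUMP_ABSOLUTE, 0x03, 0x00])  # jump to offset 3
    # Same type byte, length bumped by 3, patch placed in front of the code.
    return b"s" + struct.pack("<I", length + 3) + patch + body


original = b"s" + struct.pack("<I", 4) + b"d\x00\x00S"
patched = prepend_jump(original)
print(original.hex())
print(patched.hex())
```

In a real file the bytes after co_code (co_consts and so on) stay where they are, just three bytes further along.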
Anti-dis
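- The change above breaks uncompyle6, but all it really does is add one bytecode instruction
- Python can use its built-in dis and marshal modules to parse the contents of a pyc binary; a simple piece of code serves as the example here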
def fun1():
    enc = "Ua`|{f.4V}$l4h4Vx{s.4|``dg.;;vx{s:v}$l:wz;4h4Dxqugq4}zp}wu`q4`|q4g{afwq4ur`qf4m{af4pqf}bu`}{z"
    flag = ""
    for i in enc:
        flag += chr(ord(i) ^ 0x14)
    print flag
fun1()
- After compiling it to a pyc file, try adding `JUMP_ABSOLUTE 3` at the head of the code
  - The bytes shown in orange are the edited ones

- uncompyle6 now fails with a Parse error, but the file still runs normally

- Try parsing the bytecode with the marshal module together with the dis module
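import marshal, dis
fp = open("./py1.pyc", "rb")  # Binary mode, so the header bytes are read unchanged
fp.read(8)  # Skip the magic number and timestamp (Python 2.7 header)
co_code = marshal.load(fp)  # Despite the name, this is the whole code object
dis.dis(co_code)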
- The script prints the complete bytecode, from which the source code can still be reconstructed without trouble
  - The `JUMP_ABSOLUTE 3` we inserted shows up at the head
> python2 -u "/Users/biox/NutStore/Codes/VSCode/python/py2.py"
1 0 JUMP_ABSOLUTE 3
>> 3 LOAD_CONST 0 (<code object fun1 at 0x1010486b0, file "./py1.py", line 1>)
6 MAKE_FUNCTION 0
7 9 STORE_NAME 0 (fun1)
12 LOAD_NAME 0 (fun1)
15 CALL_FUNCTION 0
18 POP_TOP
19 LOAD_CONST 1 (None)
22 RETURN_VALUE
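The script above targets Python 2.7, where the header is 8 bytes. On Python 3.7 and newer the header is 16 bytes (magic number, a flags word, then two more 4-byte fields), so a port has to skip further before calling `marshal.load`. The sketch below is such a port under that assumption; it compiles its own small `py1.py` in a temporary directory instead of using the file from the text.

```python
import dis
import importlib.util
import io
import marshal
import os
import py_compile
import tempfile
import types

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "py1.py")
    with open(src, "w") as fh:
        fh.write("def fun1():\n    return 'flag'\n\nfun1()\n")
    pyc_path = py_compile.compile(
        src, cfile=os.path.join(tmp, "py1.pyc"), doraise=True
    )
    with open(pyc_path, "rb") as fp:
        header = fp.read(16)      # magic, flags, and two 4-byte fields
        code = marshal.load(fp)   # the module-level code object

# Capture the listing in a string so it can be inspected as well as printed.
listing = io.StringIO()
dis.dis(code, file=listing)
print(header[:4])
print(listing.getvalue())
```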
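- If we don't want dis to dump the bytecode cleanly, certain byte sequences can make the dis module raise an exception
  - For example, overlap instructions by inserting a data-loading opcode in the middle: `0x71 0x04 0x00 0x64 0x71 0x08 0x00 0x00`
    - `0x64` is the interpreter's LOAD_CONST opcode; decoded in sequence it would read as `LOAD_CONST 0x0871`
    - dis cannot make sense of that, but the leading `0x71 0x04 0x00` jumps over this byte, so it is never executed
    - The target of the later `0x71 0x08 0x00` is counted from the first `0x71` at offset 0, which is why its operand is `0x08`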

- Disassembling this with dis raises an IndexError straight away, and uncompyle6 fails with an IndexError as well
> python2 -u "/Users/biox/NutStore/Codes/VSCode/python/py2.py" python
1 0 JUMP_ABSOLUTE 4
3 LOAD_CONST 2161
Traceback (most recent call last):
File "/Users/biox/NutStore/Codes/VSCode/python/py2.py", line 5, in <module>
dis.dis(co_code)
File "/usr/local/Cellar/python@2/2.7.16_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/dis.py", line 43, in dis
disassemble(x)
File "/usr/local/Cellar/python@2/2.7.16_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/dis.py", line 98, in disassemble
print '(' + repr(co.co_consts[oparg]) + ')',
IndexError: tuple index out of range
- To get back a pyc file that dis can handle, the only option is to patch it by hand
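Why the overlapping bytes confuse a disassembler but not the interpreter can be shown with two toy decoders for the Python 2.7 byte format. This is a sketch under stated assumptions: `linear_sweep` and `trace` are hypothetical helpers, the opcode table covers only the four opcodes needed here, and the 4-byte `original` stands in for real code.

```python
HAVE_ARGUMENT = 90  # CPython 2.7: opcodes >= 90 carry a 2-byte LE argument
NAMES = {0x00: "STOP_CODE", 0x53: "RETURN_VALUE",
         0x64: "LOAD_CONST", 0x71: "JUMP_ABSOLUTE"}


def decode(code, i):
    """Decode one instruction at offset i: (offset, name, arg, size)."""
    op = code[i]
    if op >= HAVE_ARGUMENT:
        arg = code[i + 1] | (code[i + 2] << 8)
        return (i, NAMES.get(op, hex(op)), arg, 3)
    return (i, NAMES.get(op, hex(op)), None, 1)


def linear_sweep(code):
    """Decode byte after byte, the way a simple disassembler walks code."""
    out, i = [], 0
    while i < len(code):
        off, name, arg, size = decode(code, i)
        out.append((off, name, arg))
        i += size
    return out


def trace(code):
    """Follow JUMP_ABSOLUTE targets, the way execution proceeds."""
    out, i = [], 0
    while i < len(code):
        off, name, arg, size = decode(code, i)
        out.append((off, name, arg))
        if name == "RETURN_VALUE":
            break
        i = arg if name == "JUMP_ABSOLUTE" else i + size
    return out


junk = bytes([0x71, 0x04, 0x00, 0x64, 0x71, 0x08, 0x00, 0x00])
original = bytes([0x64, 0x00, 0x00, 0x53])  # LOAD_CONST 0; RETURN_VALUE
code = junk + original

swept = linear_sweep(code)
executed = trace(code)
print(swept)
print(executed)
```

The sweep reports a LOAD_CONST with argument 2161 at offset 3, the same bogus operand that appears in the traceback above, while the traced path never touches that byte.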
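Dynamically creating built-in types
- Python ships another library, types, which can be used to create Python's built-in types
- For example it can create code objects; the Python code in the last layer of the reverse-engineering challenge Polyglot from XCTF-CyBRICS 2020 used exactly this approach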
import types
def define_func(argcount, nlocals, code, consts, names):
    #PYTHON3.8!!!
    def inner():
        return 0
    fn_code = inner.__code__
    # Positional arguments follow the types.CodeType signature of Python 3.8
    cd_new = types.CodeType(argcount,
                            0,
                            fn_code.co_kwonlyargcount,
                            nlocals,
                            1024,
                            fn_code.co_flags,
                            code,
                            consts,
                            names,
                            tuple(["v%d" % i for i in range(nlocals)]),
                            fn_code.co_filename,
                            fn_code.co_name,
                            fn_code.co_firstlineno,
                            fn_code.co_lnotab,
                            fn_code.co_freevars,
                            fn_code.co_cellvars)
    inner.__code__ = cd_new
    return inner
f1 = define_func(2,2,b'|\x00|\x01k\x02S\x00', (None,), ())
f2 = define_func(1,1,b't\x00|\x00\x83\x01S\x00', (None,), ('ord',))
f3 = define_func(0,0,b't\x00d\x01\x83\x01S\x00', (None, 'Give me flag: '), ('input',))
f4 = define_func(1, 3, b'd\x01d\x02d\x03d\x04d\x05d\x01d\x06d\x07d\x08d\td\x03d\nd\x0bd\x0cd\rd\x08d\x0cd\x0ed\x0cd\x0fd\x0ed\x10d\x11d\td\x12d\x03d\x10d\x03d\x0ed\x13d\x0bd\nd\x14d\x08d\x13d\x01d\x01d\nd\td\x01d\x12d\x0bd\x10d\x0fd\x14d\x03d\x0bd\x15d\x16g1}\x01t\x00|\x00\x83\x01t\x00|\x01\x83\x01k\x03r\x82t\x01d\x17\x83\x01\x01\x00d\x18S\x00t\x02|\x00|\x01\x83\x02D\x00]$}\x02t\x03|\x02d\x19\x19\x00t\x04|\x02d\x1a\x19\x00\x83\x01\x83\x02d\x18k\x02r\x8c\x01\x00d\x18S\x00q\x8cd\x1bS\x00',
(None, 99, 121, 98, 114, 105, 115, 123, 52, 97, 100, 51, 101, 55, 57, 53, 54, 48, 49, 50, 56, 102, 125, 'Length mismatch!', False, 1, 0, True),
('len', 'print', 'zip', 'f1', 'f2'))
f5 = define_func(0, 1,b't\x00\x83\x00}\x00t\x01|\x00\x83\x01d\x01k\x08r\x1ct\x02d\x02\x83\x01\x01\x00n\x08t\x02d\x03\x83\x01\x01\x00d\x00S\x00',(None, False, 'Nope!', 'Yep!'), ('f3', 'f4', 'print'))
f5()
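- A CodeType object is built from the given bytecode and turned directly into a callable function
- As long as the values that belong in the PyCodeObject are constructed correctly, it executes without problems
- If no junk instructions were added, such a function can be disassembled with dis directly
If you spot any mistakes, corrections are welcome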
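The challenge code calls `types.CodeType` with a positional signature that is specific to Python 3.8, so it will not run unchanged on other versions. As a swapped-in, more portable illustration of the same idea (not the challenge's method), the sketch below derives a new code object with `CodeType.replace`, available from Python 3.8, and wraps it with `types.FunctionType`. The names `greet` and `patched` are hypothetical.

```python
import dis
import types


def greet():
    return "hello"


old_code = greet.__code__

# Copy the code object, swapping one constant and leaving the bytecode alone.
new_consts = tuple("bye" if c == "hello" else c for c in old_code.co_consts)
new_code = old_code.replace(co_consts=new_consts)

# Turn the code object into a callable that shares greet's globals.
patched = types.FunctionType(new_code, greet.__globals__, "patched")

print(greet(), patched())
dis.dis(patched)  # no junk instructions were added, so this disassembles fine
```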