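I was reading a UTF-8 encoded file in Python whose first line is empty, and used line.strip() == '' to test for a blank line. The test gave the wrong result.
After line.strip(), the value printed as '', so why was it not equal to ''? len(line.strip()) turned out to be 3! Very strange, and clearly not an empty value. I then passed the result through repr() to see the escaped form and found the bytes \xef\xbb\xbf. What do they mean?
EF BB BF is a file marker known as the Byte order mark (BOM); here it indicates that the file is UTF-8 encoded.
For how to handle it, see the first answer to "Reading Unicode file data with BOM chars in Python", quoted below: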
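The symptom can be reproduced without a file. The following is a minimal Python 3 sketch, assuming the line is handled as raw bytes (which is how a Python 2 `str` behaves); the variable names are illustrative only:

```python
import codecs

# A "blank" first line as it sits on disk in a BOM-prefixed UTF-8 file.
raw_line = codecs.BOM_UTF8 + b"\n"

# strip() removes the trailing newline but not the BOM bytes.
stripped = raw_line.strip()
print(repr(stripped))   # the three BOM bytes remain
print(len(stripped))    # 3

# Decoded as plain UTF-8, the BOM becomes the single character U+FEFF,
# which strip() does not treat as whitespace either.
text = raw_line.decode("utf-8").strip()
print(repr(text), len(text))
```

In both cases the comparison with an empty string fails because the BOM survives the call to strip().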
There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:
1. # Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
2. # BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see utf-8-sig correctly decodes the given string regardless of the existence of BOM. If you think there is even a small chance that a BOM character might exist in the files you are reading, just use utf-8-sig and not worry about it.
So I now read the file with the utf-8-sig codec. In Python 2.7 the code looks like this:
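import codecs
# 'utf-8-sig' drops a leading BOM, if present, while decoding
with codecs.open(file_path, 'r', 'utf-8-sig') as fh:
    lines = fh.readlines()  # illustrative body; process the lines as needed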
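For Python 3, the same idea can be sketched with the built-in `open`, which accepts an `encoding` argument directly. The helper name `read_lines` and the temporary demo file are assumptions made for illustration and are not part of the original code:

```python
import codecs
import os
import tempfile


def read_lines(path):
    """Return the lines of a UTF-8 file, dropping a leading BOM if present."""
    # 'utf-8-sig' behaves like 'utf-8' when the file has no BOM.
    with open(path, "r", encoding="utf-8-sig") as fh:
        return [line.rstrip("\n") for line in fh]


# Demo: write a BOM-prefixed file whose first line is empty.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(codecs.BOM_UTF8 + b"\nhello\n")

lines = read_lines(demo_path)
os.remove(demo_path)

# With the BOM removed during decoding, the blank-line check behaves as expected.
print(lines[0].strip() == "")
```

Because the codec only strips a BOM at the very start of the stream, files saved without one are read exactly as they would be with plain utf-8.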