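Outline
Approach
Unicode encoding of Chinese characters
Getting the Unicode code point from a character's UTF-8 encoding
Special characters to filter out
Implementation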

Outline

Approach

Chinese Unicode

The relationship between Unicode and UTF-8

Common special characters

Filtering special characters

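Approach

There are many kinds of special characters. After going through a lot of material I could not find a Unicode range that covers them, and even if I had, there would be no guarantee that it covered all of them. The only workable option is to approach the problem from the opposite direction: the goal is to keep only the characters that the operating system accepts in file names.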

Unicode encoding of Chinese characters

Excerpted from https://www.qqxiuzi.cn/zh/hanzi-unicode-bianma.php

<table>
<colgroup>
<col class="org-left" />
<col class="org-left" />
<col class="org-left" />
</colgroup>
<thead>
<tr>
<th scope="col" class="org-left">Character set</th>
<th scope="col" class="org-left">Number of characters</th>
<th scope="col" class="org-left">Unicode range</th>
</tr>
</thead>
<tbody>
<tr>
<td class="org-left">Basic Chinese characters, supplement</td>
<td class="org-left">74</td>
<td class="org-left">9FA6-9FEF</td>
</tr>

<tr>
<td class="org-left">PUA supplement</td>
<td class="org-left">207</td>
<td class="org-left">E600-E6CF</td>
</tr>
</tbody>
</table>

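Of these, only the basic Chinese character set needs to be considered (U+4E00 to U+9FA5, the range checked in the implementation).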
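The range check itself is a single comparison. The following is a minimal sketch; the name isBasicHan is hypothetical, and the bounds 0x4E00 and 0x9FA5 are the ones used in the implementation in this article.

```lua
-- Hypothetical helper, not part of the original article.
-- Returns true when a code point lies in the basic Chinese range.
local function isBasicHan(codepoint)
  return codepoint >= 0x4E00 and codepoint <= 0x9FA5
end

print(isBasicHan(0x4E2D)) -- a code point inside the basic range
print(isBasicHan(0x9FA6)) -- first code point of the supplement row
print(isBasicHan(0xE600)) -- first code point of the PUA row
```

Only the first call reports true; the supplement and PUA rows from the table fall outside the range and are therefore filtered.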

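Getting the Unicode code point from a character's UTF-8 encoding

There is plenty of material online about how UTF-8 relates to Unicode, so I will not repeat it here. In short, every character in the basic Chinese range is encoded in UTF-8 as three bytes, 1110xxxx 10xxxxxx 10xxxxxx. The 16 x bits hold exactly the two bytes of the Unicode code point, so extracting those 16 bits yields the character's code point.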

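Lua versions before 5.3 have no bitwise operators, so the snippet uses arithmetic instead. For a lead byte in the range 0xE0 to 0xEF, b1 % 0xe0 gives the same result as b1 & 0x0f (it strips the 1110 prefix). For a continuation byte in the range 0x80 to 0xBF, b % 0x80 gives the same result as b & 0x3f (it strips the 10 prefix). Multiplying by 2 ^ 12 stands for a left shift by 12 bits, and so on.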

local b1 = string.byte(str, curIndex)
local b2 = string.byte(str, curIndex + 1)
local b3 = string.byte(str, curIndex + 2)
local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80);
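As a worked example of the formula, the sketch below decodes the character 中, whose UTF-8 bytes are E4 B8 AD. The function name decodeThreeBytes is hypothetical, and the input is assumed to be a well-formed three-byte sequence.

```lua
-- Hypothetical helper: decode one three-byte UTF-8 sequence starting at index i.
-- Assumes the three bytes are present and well formed.
local function decodeThreeBytes(str, i)
  local b1, b2, b3 = string.byte(str, i, i + 2)
  -- b1 % 0xe0 keeps the low 4 bits of a lead byte in 0xE0..0xEF;
  -- b % 0x80 keeps the low 6 bits of a continuation byte in 0x80..0xBF.
  return (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)
end

-- '\228\184\173' is the byte sequence E4 B8 AD.
local unic = decodeThreeBytes('\228\184\173', 1)
print(string.format('U+%04X', unic)) -- prints U+4E2D
```

Step by step: 0xE4 % 0xE0 = 4, 0xB8 % 0x80 = 56, 0xAD % 0x80 = 45, and 4 * 4096 + 56 * 64 + 45 = 20013 = 0x4E2D.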

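Special characters to filter out

ASCII characters that Windows does not allow in file names, written as a Lua pattern: [\\\\/:*?\"<>|%s+ ] (inside the brackets + is a literal, so plus signs are stripped as well)
Characters encoded in UTF-8 as two bytes
Characters encoded in UTF-8 as four or more bytes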
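The three categories above can be sketched as follows. The helper name sequenceLength is hypothetical; its byte ranges are the ones used in the implementation in this article, and the sample file name is made up for illustration.

```lua
-- Hypothetical helper: length in bytes of a UTF-8 sequence, judged from its lead byte.
-- Returns nil for NUL, continuation bytes, and lead bytes outside the ranges used here.
local function sequenceLength(leadByte)
  if leadByte > 0 and leadByte <= 127 then return 1 end
  if leadByte >= 192 and leadByte <= 223 then return 2 end
  if leadByte >= 224 and leadByte <= 239 then return 3 end
  if leadByte >= 240 and leadByte <= 247 then return 4 end
  return nil
end

-- The ASCII pattern from the list above, applied to a sample name.
local pattern = '[\\\\/:*?\"<>|%s+ ]'
local cleaned = (string.gsub('re port+v1?.txt', pattern, ''))
print(cleaned) -- prints reportv1.txt

print(sequenceLength(string.byte('a')))        -- 1, single-byte ASCII
print(sequenceLength(string.byte('\195\169'))) -- 2, lead byte 0xC3
```

Sequences of length 2 and 4 are dropped outright, length 1 goes through the ASCII pattern, and length 3 is checked against the basic Chinese range.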

The special characters on this page can be used for testing: https://wenku.baidu.com/view/fddf6408844769eae009ed14.html?re=view

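Implementation

-- Filter out special characters, keeping ASCII and basic Chinese characters.
-- logger is assumed to be provided by the host application.
function filterInvalidChars(str)
  local result = ''
  local curIndex = 1
  -- Check one character at a time; append it to result if it is acceptable.
  -- A while loop also handles the empty string and the last character.
  while curIndex <= #str do
    local curByte = string.byte(str, curIndex)
    if curByte > 0 and curByte <= 127 then
      -- Single-byte ASCII: keep for now, forbidden ones are stripped at the end
      result = result..string.sub(str, curIndex, curIndex)
      curIndex = curIndex + 1
    elseif curByte >= 192 and curByte <= 223 then
      -- Two-byte sequence: drop
      curIndex = curIndex + 2
    elseif curByte >= 224 and curByte <= 239 then
      -- Three-byte sequence: keep only basic Chinese characters
      local b1 = curByte
      local b2 = string.byte(str, curIndex + 1)
      local b3 = string.byte(str, curIndex + 2)
      -- b2 or b3 is nil when the string ends in a truncated sequence
      if b2 and b3 then
        local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)
        if unic >= 0x4e00 and unic <= 0x9FA5 then
          result = result..string.sub(str, curIndex, curIndex + 2)
        end
      end
      curIndex = curIndex + 3
    elseif curByte >= 240 and curByte <= 247 then
      -- Four-byte sequence: drop
      curIndex = curIndex + 4
    else
      -- NUL, a stray continuation byte, or an invalid lead byte
      logger:error('filter invalid chars error: '..str)
      return str
    end
  end
  -- Parentheses discard gsub's second return value (the substitution count)
  return (string.gsub(result, '[\\\\/:*?\"<>|%s+ ]', ''))
end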
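For comparison, the same filtering rules can be expressed with a Lua pattern that matches one UTF-8 sequence at a time instead of stepping an index by hand. This is only a sketch of an alternative, not the method used above: the name keepBasicHan is hypothetical, and unlike filterInvalidChars it silently skips stray continuation bytes rather than logging an error.

```lua
-- Hypothetical pattern-based variant; not part of the original article.
local function keepBasicHan(str)
  local kept = {}
  -- Each match is a lead byte followed by its continuation bytes (0x80..0xBF).
  for ch in string.gmatch(str, '[\1-\127\194-\244][\128-\191]*') do
    if #ch == 1 then
      -- Single-byte ASCII: keep for now, forbidden ones are stripped at the end
      kept[#kept + 1] = ch
    elseif #ch == 3 then
      local b1, b2, b3 = string.byte(ch, 1, 3)
      if b1 >= 224 and b1 <= 239 then
        local unic = (b1 % 0xe0) * 2 ^ 12 + (b2 % 0x80) * 2 ^ 6 + (b3 % 0x80)
        if unic >= 0x4e00 and unic <= 0x9FA5 then
          kept[#kept + 1] = ch
        end
      end
    end
    -- Sequences of any other length are dropped.
  end
  return (string.gsub(table.concat(kept), '[\\\\/:*?\"<>|%s+ ]', ''))
end

-- Input bytes: 中 (E4 B8 AD), a full-width comma (EF BC 8C), then 'a b?.txt'.
local cleaned = keepBasicHan('\228\184\173\239\188\140a b?.txt')
print(cleaned == '\228\184\173ab.txt') -- prints true
```

Collecting the pieces in a table and joining them once with table.concat avoids building a new string on every append, which matters for long inputs.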

Special-character filtering in Lua (UTF-8 encoding)
