本篇主要匯總數(shù)據(jù)清洗過程經(jīng)常用到的一些方法,持續(xù)更新,以備后用。
- 查找?guī)е形淖址臄?shù)據(jù)
SELECT * FROM t_table1 WHERE length(name) != char_length(name);
- 去除多余空格
# 去除兩端空格
UPDATE jigou SET name = TRIM(name)
# 去除換行符char(10)和回車符char(13)
-- UPDATE jigou SET name = REPLACE(REPLACE(name, CHAR(10), ''), CHAR(13), '');
# 去除水平制表符char(9)
-- UPDATE jigou SET name = REPLACE(name, CHAR(9), '');
- 查詢重復(fù)數(shù)據(jù)
# 單列
select brand_name, count(*) as count from matchji_map group by brand_name having count > 1;
# 多列
SELECT * FROM t_table1 a group by name,avatar HAVING count(1) > 1 ;
- 復(fù)制數(shù)據(jù)到另外一張表
# 復(fù)制表結(jié)構(gòu)
CREATE TABLE newuser LIKE user;
# 導(dǎo)入數(shù)據(jù)
INSERT INTO newauser SELECT * FROM user;
- 多表關(guān)聯(lián)更新數(shù)據(jù)
UPDATE t_table1 t2 LEFT JOIN t_table2 t1 ON t2.name= t1.name set t2.avatar = t1.avatar
- 刪除重復(fù)數(shù)據(jù)
DELETE FROM t_table1 WHERE id in (SELECT id FROM (
SELECT id FROM t_table1 a
WHERE ((SELECT COUNT(*) FROM t_table1 WHERE id = a.id) > 1) and id not in (
select min(id) from t_table1 group by id having count(*)>1
) ORDER BY id DESC ) as temp
);
- 替換字符串
update t_table1 set name = replace(name,'/','+') where name LIKE '%/%';
- 正則處理
# 匹配以小寫字母a開頭,以o結(jié)尾的數(shù)據(jù)(`BINARY區(qū)分大小寫`)
select * from data_cleaning_demo where `name` regexp BINARY '^a(\w+)$'
# 匹配指定字符串(多個用|分隔)
select * from data_cleaning_trace_result where origin_name REGEXP '專業(yè)|之一|在冊|團隊'
# 匹配指定字符中的任意一個
select * from data_cleaning_demo where `name` regexp '[0-9]'
# 查詢指定字符之外的字符
select "abc" REGEXP "[^a-z]"; // 返回0
# 括號屬于特殊字符要轉(zhuǎn)義
select '中國高科(600730)' REGEXP '\([0-9]+\)'
#
SELECT *
FROM users
WHERE email NOT REGEXP '^[A-Z0-9._%-]+@[A-Z0-9.-]+.[A-Z]{2,4}$'
正則表達式常用字符匹配列表
| 選項 | 說明 | 例子 | 匹配值示例 |
|---|---|---|---|
| ^ | 匹配開頭 | ^a | a1, a2 |
| $ | 匹配結(jié)尾 | b$ | 1b, 2b |
| . | 匹配任何單個字符,包括回車、換行 | a.b | a1b, a2b |
| ? | 匹配0次或1次 | a? | b, ab |
| * | 匹配0次或多次 | a*b | b, ab |
| + | 匹配1次或多次 | a+b | ab, aab |
\ (豎線)
|
匹配a或b | a\b | ac, bc |
| <字符串> | 匹配含指定的字符串 | <bc> | abc, abcd |
| [字符集合] | 匹配集合中的任一字符 | [ab] | ac, bc |
| [^] | 匹配不在括號中的任何字符 | [^a] | bc, bcd |
| 字符串{n,} | 字符串出現(xiàn)至少n次 | a{2} | aa, aaa |
| 字符串{n,m} | 匹配字符串至少n次,最多m次 | a{2,4} | aa, aaa |
參考:
https://blog.csdn.net/moguxiansheng1106/article/details/44258499