2020-03-18

大數(shù)據(jù)技術(shù)之DataX


版本:V1.0

第1章概述

1.1 什么是DataX


?????? DataX是阿里巴巴開源的一個異構(gòu)數(shù)據(jù)源離線同步工具,致力于實現(xiàn)包括關(guān)系型數(shù)據(jù)庫(MySQL、Oracle等)、HDFS、Hive、ODPS、HBase、FTP等各種異構(gòu)數(shù)據(jù)源之間穩(wěn)定高效的數(shù)據(jù)同步功能。

1.2 DataX的設(shè)計

為了解決異構(gòu)數(shù)據(jù)源同步問題,DataX將復(fù)雜的網(wǎng)狀的同步鏈路變成了星型數(shù)據(jù)鏈路,DataX作為中間傳輸載體負(fù)責(zé)連接各種數(shù)據(jù)源。當(dāng)需要接入一個新的數(shù)據(jù)源的時候,只需要將此數(shù)據(jù)源對接到DataX,便能跟已有的數(shù)據(jù)源做到無縫數(shù)據(jù)同步。

1.3 框架設(shè)計

1.4 運行原理

第2章快速入門

2.1 官方地址

下載地址:http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

源碼地址:https://github.com/alibaba/DataX

2.2 前置要求

- Linux

- JDK(1.8以上,推薦1.8)

- Python(推薦Python2.6.X)

2.3 安裝

1)將下載好的datax.tar.gz上傳到hadoop102的/opt/softwarez

[atguigu@hadoop102

software]$ ls

datax.tar.gz

2)解壓datax.tar.gz到/opt/module

[atguigu@hadoop102

software]$ tar -zxvf datax.tar.gz -C /opt/module/

3)運行自檢腳本

[atguigu@hadoop102

bin]$ cd /opt/module/datax/bin/

[atguigu@hadoop102

bin]$ python datax.py /opt/module/datax/job/job.json

第3章使用案例

3.1 從stream流讀取數(shù)據(jù)并打印到控制臺

1)查看配置模板

[atguigu@hadoop102

bin]$ python datax.py -r streamreader -w streamwriter


DataX

(DATAX-OPENSOURCE-3.0), From Alibaba !

Copyright

(C) 2010-2017, Alibaba Group. All Rights Reserved.



Please

refer to the streamreader document:

???? https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md


Please

refer to the streamwriter document:

????https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md


Pleasesave the following configuration as a json file and? use

???? python {DATAX_HOME}/bin/datax.py{JSON_FILE_NAME}.json

to run

the job.


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"streamreader",

??????????????????? "parameter": {

??????????????????????? "column": [],

???????????????????????"sliceRecordCount": ""

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"streamwriter",

??????????????????? "parameter": {

??????????????????????? "encoding":"",

??????????????????????? "print": true

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":""

??????????? }

??????? }

??? }

}

2)根據(jù)模板編寫配置文件

[atguigu@hadoop102

job]$ vim stream2stream.json


填寫以下內(nèi)容:


{

? "job": {

??? "content": [

????? {

??????? "reader": {

????????? "name":"streamreader",

????????? "parameter": {

??????????? "sliceRecordCount": 10,

??????????? "column": [

????????????? {

??????????????? "type":"long",

??????????????? "value":"10"

????????????? },

????????????? {

??????????????? "type":"string",

??????????????? "value": "hello,DataX"

????????????? }

??????????? ]

????????? }

????? ??},

??????? "writer": {

????????? "name":"streamwriter",

????????? "parameter": {

??????????? "encoding":"UTF-8",

??????????? "print": true

????????? }

??????? }

????? }

??? ],

??? "setting": {

????? "speed": {

??????? "channel": 1

?????? }

??? }

? }

}

3)運行

[atguigu@hadoop102

job]$

/opt/module/datax/bin/datax.py

/opt/module/datax/job/stream2stream.json

3.2 讀取MySQL中的數(shù)據(jù)存放到HDFS

3.2.1 查看官方模板

[atguigu@hadoop102

~]$ python /opt/module/datax/bin/datax.py -r mysqlreader -w hdfswriter


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"mysqlreader",

??????????????????? "parameter": {

??????????????????????? "column": [],

??????????????????????? "connection":[

??????????????????????????? {

?????????????? ?????????????????"jdbcUrl": [],

???????????????????????????????"table": []

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password":"",

??????????????????????? "username":"",

??????????????????????? "where":""

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"hdfswriter",

??????????????????? "parameter": {

??????????????????????? "column": [],

??????????????????????? "compress":"",

??????????????????????? "defaultFS":"",

???????????????????????"fieldDelimiter": "",

??????????????????????? "fileName":"",

??????????????????????? "fileType":"",

??????????????????????? "path":"",

??????????????????????? "writeMode":""

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":""

??????????? }

??????? }

??? }

}

mysqlreader參數(shù)解析:

hdfswriter參數(shù)解析:

3.2.2 準(zhǔn)備數(shù)據(jù)

1)創(chuàng)建student表

mysql>

create database datax;

mysql>

use datax;

mysql>

create table student(id int,name varchar(20));

2)插入數(shù)據(jù)

mysql>

insert into student values(1001,'zhangsan'),(1002,'lisi'),(1003,'wangwu');

3.2.3 編寫配置文件

[atguigu@hadoop102

datax]$ vim /opt/module/datax/job/mysql2hdfs.json


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"mysqlreader",

??????????????????? "parameter": {

??????????????????????? "column": [

??????????????????????????? "id",

??????????????????????????? "name"

??????????????????????? ],

??????????????????????? "connection":[

??????????????????????????? {

???????????????????????????????"jdbcUrl": [

???????????????????????????????????"jdbc:mysql://hadoop102:3306/datax"

??????????????????????????????? ],

???????????????????????????????"table": [

???????????????????????????????????"student"

??????????????????????????????? ]

???????????????????????? ???}

??????????????????????? ],

??????????????????????? "username":"root",

??????????????????????? "password":"000000"

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"hdfswriter",

??????????????????? "parameter": {

??????????????????????? "column": [

??????????????????????????? {

???????????????????????????????"name": "id",

???????????????????????????????"type": "int"

??????????????????????????? },

??????????????????????????? {

???????????????????????????????"name": "name",

???????????????????????????????"type": "string"

??????????????????????????? }

??????????????????????? ],?

??????????????????????? "defaultFS":"hdfs://hadoop102:9000",

?????? ?????????????????"fieldDelimiter":"\t",

??????????????????????? "fileName":"student.txt",

??????????????????????? "fileType":"text",

??????????????????????? "path":"/",

??????????????????????? "writeMode":"append"

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel": "1"

??????????? }

??????? }

??? }

}

3.2.4 執(zhí)行任務(wù)

[atguigu@hadoop102

datax]$ bin/datax.py job/mysql2hdfs.json


2019-05-1716:02:16.581 [job-0] INFO? JobContainer -

任務(wù)啟動時刻??????????????????? : 2019-05-17 16:02:04

任務(wù)結(jié)束時刻??????????????????? : 2019-05-17 16:02:16

任務(wù)總計耗時??????????????????? :???????????????? 12s

任務(wù)平均流量??????????????????? :??????????????? 3B/s

記錄寫入速度?????????????? ?????:????????????? 0rec/s

讀出記錄總數(shù)??????????????????? :?????????????????? 3

讀寫失敗總數(shù)??????????????????? :?????????????????? 0

3.2.5 查看hdfs

注意:HdfsWriter實際執(zhí)行時會在該文件名后添加隨機(jī)的后綴作為每個線程寫入實際文件名。

3.3 讀取HDFS數(shù)據(jù)寫入MySQL

1)將上個案例上傳的文件改名

[atguigu@hadoop102

datax]$ hadoop fs -mv /student.txt* /student.txt

2)查看官方模板

[atguigu@hadoop102

datax]$ python bin/datax.py -r hdfsreader -w mysqlwriter


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"hdfsreader",

??????????????????? "parameter": {

??????????????????????? "column": [],

??????????????????????? "defaultFS":"",

??????????????????????? "encoding":"UTF-8",

???????????????????????"fieldDelimiter": ",",

??????????????????????? "fileType":"orc",

??????????????????????? "path":""

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"mysqlwriter",

??????????????????? "parameter": {

??????????????????????? "column": [],

??????????????? ????????"connection": [

??????????????????????????? {

???????????????????????????????"jdbcUrl": "",

???????????????????????????????"table": []

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password":"",

??????????????????????? "preSql": [],

??????????????????????? "session":[],

??????????????????????? "username":"",

??????????????????????? "writeMode":""

??????????????????? }

??????????????? }

??????????? }

???????],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":""

??????????? }

??????? }

??? }

}

3)創(chuàng)建配置文件

[atguigu@hadoop102

datax]$ vim job/hdfs2mysql.json


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"hdfsreader",

??????????????????? "parameter": {

??????????????????????? "column":["*"],

??????????????????????? "defaultFS":"hdfs://hadoop102:9000",

??????????????????????? "encoding":"UTF-8",

???????????????????????"fieldDelimiter": "\t",

??????????????????????? "fileType":"text",

??????????????????????? "path":"/student.txt"

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"mysqlwriter",

??????????????????? "parameter": {

??????????????????????? "column": [

??????????????????????????? "id",

??????????????????????????? "name"

??????????????????????? ],

??????????????????????? "connection":[

?? ?????????????????????????{

???????????????????????????????"jdbcUrl": "jdbc:mysql://hadoop102:3306/datax",

???????????????????????????????"table": ["student2"]

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password": "000000",

??????????????????????? "username":"root",

??????????????????????? "writeMode":"insert"

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":"1"

??????????? }

??????? }

??? }

}

4)在MySQL的datax數(shù)據(jù)庫中創(chuàng)建student2

mysql>

use datax;

mysql>

create table student2(id int,name varchar(20));

5)執(zhí)行任務(wù)

[atguigu@hadoop102

datax]$ bin/datax.py job/hdfs2mysql.json


2019-05-1716:21:53.616 [job-0] INFO? JobContainer -

任務(wù)啟動時刻??????????????????? : 2019-05-17 16:21:41

任務(wù)結(jié)束時刻??????????????????? : 2019-05-17 16:21:53

任務(wù)總計耗時??????????????????? :???????????????? 11s

任務(wù)平均流量???????????????? ???:??????????????? 3B/s

記錄寫入速度??????????????????? :????????????? 0rec/s

讀出記錄總數(shù)??????????????????? :?????????????????? 3

讀寫失敗總數(shù)??????????????????? :?????????????????? 0

6)查看student2表

mysql>

select * from student2;

+------+----------+

|id?? | name???? |

+------+----------+

| 1001

| zhangsan |

| 1002| lisi???? |

| 1003| wangwu?? |

+------+----------+

3 rows

in set (0.00 sec)

第4章 Oracle數(shù)據(jù)庫

以下操作使用root賬號。

4.1 oracle數(shù)據(jù)庫簡介

Oracle Database,又名Oracle RDBMS,或簡稱Oracle。是甲骨文公司的一款關(guān)系數(shù)據(jù)庫管理系統(tǒng)。它是在數(shù)據(jù)庫領(lǐng)域一直處于領(lǐng)先地位的產(chǎn)品。可以說Oracle數(shù)據(jù)庫系統(tǒng)是目前世界上流行的關(guān)系數(shù)據(jù)庫管理系統(tǒng),系統(tǒng)可移植性好、使用方便、功能強(qiáng),適用于各類大、中、小、微機(jī)環(huán)境。它是一種高效率、可靠性好的、適應(yīng)高吞吐量的數(shù)據(jù)庫解決方案。

4.2 安裝前的準(zhǔn)備

4.2.1 檢查依賴

[root@hadoop102~]# rpm -q binutils compat-libstdc++-33 elfutils-libelf elfutils-libelf-develgcc gcc-c++ glibc glibc-common glibc-devel glibc-headers kernel-headers kshlibaio? libaio-devel libgcc libgomplibstdc++ libstdc++-devel make numactl-devel sysstat unixODBC unixODBC-devel

4.2.2 安裝依賴

哪個沒有安裝哪個,如:

[root@hadoop102

~]# yum -y install elfutils-libelf-devel gcc gcc-c++ libaio-devel

libstdc++-devel numactl-devel unixODBC unixODBC-devel

4.2.3 上傳安裝包并解壓

[root@hadoop102

software]# ls

Oracle_Database_12c_Release2_linuxx64.zip

[root@hadoop102

software]# unzip Oracle_Database_12c_Release2_linuxx64.zip -d /opt/module/

4.2.4 配置用戶組

Oracle安裝文件不允許通過root用戶啟動,需要為oracle配置一個專門的用戶。

1)創(chuàng)建sql 用戶組

[root@hadoop102

software]#groupadd sql

2)創(chuàng)建oracle 用戶并放入sql組中

[root@hadoop102

software]#useradd oracle -g sql

3)修改oracle用戶登錄密碼,輸入密碼后即可使用oracle用戶登錄系統(tǒng)

[root@hadoop102

software]#passwd oracle

4)修改所屬用戶和組

[root@hadoop102

module]# chown -R oracle:sql /opt/module/database/

[root@hadoop102

module]# chgrp -R sql /opt/module/database/

4.2.5 修改配置文件sysctl.conf

[root@hadoop102

module]# vim /etc/sysctl.conf

刪除里面的內(nèi)容,添加如下內(nèi)容

net.ipv4.ip_local_port_range

= 9000 65500

fs.file-max

= 6815744

kernel.shmall

= 10523004

kernel.shmmax

= 6465333657

kernel.shmmni

= 4096

kernel.sem

= 250 32000 100 128

net.core.rmem_default=262144

net.core.wmem_default=262144

net.core.rmem_max=4194304

net.core.wmem_max=1048576

fs.aio-max-nr

= 1048576

參數(shù)解析:

net.ipv4.ip_local_port_range :可使用的IPv4端口范圍

fs.file-max :該參數(shù)表示文件句柄的最大數(shù)量。文件句柄設(shè)置表示在linux系統(tǒng)中可以打開的文件數(shù)量。

kernel.shmall :該參數(shù)表示系統(tǒng)一次可以使用的共享內(nèi)存總量(以頁為單位)

kernel.shmmax :該參數(shù)定義了共享內(nèi)存段的最大尺寸(以字節(jié)為單位)

kernel.shmmni :這個內(nèi)核參數(shù)用于設(shè)置系統(tǒng)范圍內(nèi)共享內(nèi)存段的最大數(shù)量

kernel.sem :該參數(shù)表示設(shè)置的信號量。

net.core.rmem_default:默認(rèn)的TCP數(shù)據(jù)接收窗口大小(字節(jié))。

net.core.wmem_default:默認(rèn)的TCP數(shù)據(jù)發(fā)送窗口大?。ㄗ止?jié))。

net.core.rmem_max:最大的TCP數(shù)據(jù)接收窗口(字節(jié))。

net.core.wmem_max:最大的TCP數(shù)據(jù)發(fā)送窗口(字節(jié))。

fs.aio-max-nr :同時可以擁有的的異步IO請求數(shù)目。

4.2.6 修改配置文件limits.conf

[root@hadoop102 module]# vim /etc/security/limits.conf

在文件末尾添加:

oracle

soft nproc 2047

oracle

hard nproc 16384

oracle

soft nofile 1024

oracle

hard nofile 65536

重啟機(jī)器生效。

4.3 安裝Oracle數(shù)據(jù)庫

4.3.1 進(jìn)入虛擬機(jī)圖像化頁面操作

[oracle@hadoop102 ~]# cd /opt/module/database

?[oracle@hadoop102 database]# ./runInstaller

4.3.2 安裝數(shù)據(jù)庫

1)去掉紅框位置對勾

2)選擇僅安裝數(shù)據(jù)庫軟件

3)選擇單實例數(shù)據(jù)庫安裝


4)默認(rèn)下一步

5)設(shè)置安裝位置

6)創(chuàng)建產(chǎn)品清單

7)操作系統(tǒng)組設(shè)置

8)等待安裝

9)查看驗證,按提示修改,如果不能修改再點擊忽略

10)概要,直接點擊安裝

11)按提示操作

12)安裝完成

4.4 設(shè)置環(huán)境變量

[oracle@hadoop102

dbhome_1]# vim /home/oracle/.bash_profile

添加:

#ORACLE_HOME

export

ORACLE_HOME=/home/oracle/app/oracle/product/12.2.0/dbhome_1

export

PATH=$PATH:$ORACLE_HOME/bin

export

ORACLE_SID=orcl

export

NLS_LANG=AMERICAN_AMERICA.ZHS16GBK


[oracle@hadoop102

~]$ source /home/oracle/.bash_profile

4.5 設(shè)置Oracle監(jiān)聽

4.5.1 命令行輸入以下命令

[oracle@hadoop102

~]$ netca


4.5.2 選擇添加

4.5.3 設(shè)置監(jiān)聽名,默認(rèn)即可

4.5.4 選擇協(xié)議,默認(rèn)即可

4.5.5 設(shè)置端口號,默認(rèn)即可

4.5.6 配置更多監(jiān)聽,默認(rèn)

4.5.7 完成

4.6 創(chuàng)建數(shù)據(jù)庫

4.6.1 進(jìn)入創(chuàng)建頁面

[oracle@hadoop102

~]$ dbca

4.6.2 選擇創(chuàng)建數(shù)據(jù)庫

4.6.3 選擇高級配置

4.6.4 選擇數(shù)據(jù)倉庫

4.6.5 將圖中所示對勾去掉

4.6.6 存儲選項

4.6.7 快速恢復(fù)選項

4.6.8 選擇監(jiān)聽程序

4.6.9 如圖設(shè)置

4.6.10 使用自動內(nèi)存管理

4.6.11 管理選項,默認(rèn)

4.6.12 設(shè)置統(tǒng)一密碼

4.6.13 創(chuàng)建選項,選擇創(chuàng)建數(shù)據(jù)庫

4.6.14 概要,點擊完成

4.6.15 等待安裝

4.7 簡單使用

4.7.1 開啟,關(guān)閉監(jiān)聽服務(wù)

開啟服務(wù):

[oracle@hadoop102

~]$ lsnrctl start

關(guān)閉服務(wù):

[oracle@hadoop102

~]$ lsnrctl stop

4.7.2 進(jìn)入命令行

[oracle@hadoop102

~]$ sqlplus


SQL*Plus:

Release 12.2.0.1.0 Production on Wed May 29 17:08:05 2019


Copyright(c) 1982, 2016, Oracle.? All rightsreserved.


Enter

user-name: system

Enter

password: (這里輸入之前配置的統(tǒng)一密碼)

Last

Successful login time: Wed May 29 2019 13:03:39 +08:00


Connected

to:

Oracle

Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production


SQL>

4.7.3 創(chuàng)建用戶并授權(quán)

SQL>

create user atguigu identified by 000000;

User created.

SQL>

grant create session,create table,create view,create sequence,unlimited

tablespace to atguigu;

Grant

succeeded.

4.7.4 進(jìn)入atguigu賬號,創(chuàng)建表

SQL>create

TABLE student(id INTEGER,name VARCHAR2(20));

SQL>insert

into student values (1,'zhangsan');

SQL>

select * from student;

??????? ID? NAME

----------

----------------------------------------

???????? 1? zhangsan

注意:安裝完成后重啟機(jī)器可能出現(xiàn)ORACLE not

available錯誤,解決方法如下:

[oracle@hadoop102

~]$ sqlplus / as sysdba

SQL>startup

SQL>conn

atguigu

Enter

password:

4.8 Oracle與MySQL的SQL區(qū)別

類型OracleMySQL

整型number(N)/integerint/integer

浮點型floatfloat/double

字符串類型varchar2(N)varchar(N)

NULL''null和''不一樣

分頁rownumlimit

""限制很多,一般不讓用與單引號一樣

價格閉源,收費開源,免費

主鍵自動增長×√

if? not exists×√

auto_increment×√

create? database×√

select? * from table as t×√

4.9 DataX案例

4.9.1 從Oracle中讀取數(shù)據(jù)存到MySQL

1)MySQL中創(chuàng)建表

[oracle@hadoop102

~]$ mysql -uroot -p000000

mysql>

create database oracle;

mysql>

use oracle;

mysql>

create table student(id int,name varchar(20));

2)編寫datax配置文件

[oracle@hadoop102

~]$ vim /opt/module/datax/job/oracle2mysql.json


{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"oraclereader",

??????????????????? "parameter": {

??????????????????????? "column":["*"],

??????????????????????? "connection":[

??????????????????????????? {

???????????????????????????????"jdbcUrl": ["jdbc:oracle:thin:@hadoop102:1521:orcl"],

???????????????????????????????"table": ["student"]

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password":"000000",

??????????????????????? "username":"atguigu"

??????????????????? }

??????????? ????},

??????????????? "writer": {

??????????????????? "name":"mysqlwriter",

??????????????????? "parameter": {

??????????????????????? "column":["*"],

??????????????????????? "connection":[

??????????????????????????? {

??????????????????????????????? "jdbcUrl":"jdbc:mysql://hadoop102:3306/oracle",

???????????????????????????????"table": ["student"]

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password":"000000",

??????????????????????? "username":"root",

??????????????????????? "writeMode":"insert"

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":"1"

??????????? }

??????? }

? ??}

}

3)執(zhí)行命令

[oracle@hadoop102

~]$

/opt/module/datax/bin/datax.py

/opt/module/datax/job/oracle2mysql.json


查看結(jié)果:

mysql>

select * from student;

+------+----------+

|id?? | name???? |

+------+----------+

|??? 1 | zhangsan |

+------+----------+

4.9.2 讀取Oracle的數(shù)據(jù)存入HDFS中

1)編寫配置文件

[oracle@hadoop102

datax]$ vim job/oracle2hdfs.json

{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"oraclereader",

??????????????????? "parameter": {

??????????????????????? "column":["*"],

??????????????????????? "connection":[

??????????????????????????? {

???????????????????????????????"jdbcUrl":["jdbc:oracle:thin:@hadoop102:1521:orcl"],

????????????????????? ??????????"table":["student"]

??????????????????????????? }

??????????????????????? ],

??????????????????????? "password":"000000",

??????????????????????? "username":"atguigu"

??????????????????? }

??????????????? },

??????????????? "writer": {

??? ????????????????"name":"hdfswriter",

??????????????????? "parameter": {

??????????????????????? "column": [

??????????????????????????? {

???????????????????????????????"name": "id",

???????????????????????????????"type": "int"

??????????????????????? ????},

??????????????????????????? {

???????????????????????????????"name": "name",

???????????????????????????????"type": "string"

??????????????????????????? }


??????????????????????? ],

??????????????????????? "defaultFS":"hdfs://hadoop102:9000",

???????????????????????"fieldDelimiter": "\t",

??????????????????????? "fileName":"oracle.txt",

??????????????????????? "fileType":"text",

??????????????????????? "path":"/",

??????????????????????? "writeMode":"append"

????? ??????????????}

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":"1"

??????????? }

??????? }

??? }

}

2)執(zhí)行

[oracle@hadoop102

datax]$ bin/datax.py job/oracle2hdfs.json

3)查看HDFS結(jié)果

第5章MongoDB

5.1 什么是MongoDB

MongoDB 是由C++語言編寫的,是一個基于分布式文件存儲的開源數(shù)據(jù)庫系統(tǒng)。MongoDB 旨在為WEB應(yīng)用提供可擴(kuò)展的高性能數(shù)據(jù)存儲解決方案。MongoDB 將數(shù)據(jù)存儲為一個文檔,數(shù)據(jù)結(jié)構(gòu)由鍵值(key=>value)對組成。MongoDB 文檔類似于 JSON 對象。字段值可以包含其他文檔,數(shù)組及文檔數(shù)組。

5.2 MongoDB優(yōu)缺點

5.3 基礎(chǔ)概念解析

SQL術(shù)語/概念MongoDB術(shù)語/概念解釋/說明

databasedatabase數(shù)據(jù)庫

tablecollection數(shù)據(jù)庫表/集合

rowdocument數(shù)據(jù)記錄行/文檔

columnfield數(shù)據(jù)字段/域

indexindex索引

table? joins?不支持表連接,MongoDB不支持

primary? keyprimary? key主鍵,MongoDB自動將_id字段設(shè)置為主鍵

通過下圖實例,我們也可以更直觀的了解Mongo中的一些概念:

5.4 安裝

5.4.1 下載地址

https://www.mongodb.com/download-center#community

5.4.2 安裝

1)上傳壓縮包到虛擬機(jī)中

[atguigu@hadoop102

software]$ ls

mongodb-linux-x86_64-4.0.10.tgz

2)解壓

[atguigu@hadoop102

software]$ tar -zxvf mongodb-linux-x86_64-4.0.10.tgz -C /opt/module/

3)重命名

[atguigu@hadoop102

module]$ mv mongodb-linux-x86_64-4.0.10/ mongodb

4)創(chuàng)建數(shù)據(jù)庫目錄

MongoDB的數(shù)據(jù)存儲在data目錄的db目錄下,但是這個目錄在安裝過程不會自動創(chuàng)建,所以需要手動創(chuàng)建data目錄,并在data目錄中創(chuàng)建db目錄。

[atguigu@hadoop102

module]$ sudo mkdir -p /data/db

[atguigu@hadoop102

mongodb]$ sudo chmod 777 -R /data/db/

5)啟動MongoDB服務(wù)

[atguigu@hadoop102

mongodb]$ bin/mongod

6)進(jìn)入shell頁面

[atguigu@hadoop102

~]$ cd /opt/module/mongodb/

[atguigu@hadoop102

mongodb]$ bin/mongo


MongoDB

shell version v4.0.10

connecting

to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb

Implicit

session: session { "id" :

UUID("66e2b331-e3c3-4b81-b676-1ee4bac87abf") }

MongoDB

server version: 4.0.10

Welcome

to the MongoDB shell.

…………..

>?

5.5 基礎(chǔ)概念詳解

5.5.1 數(shù)據(jù)庫

一個mongodb中可以建立多個數(shù)據(jù)庫。MongoDB的默認(rèn)數(shù)據(jù)庫為"db",該數(shù)據(jù)庫存儲在data目錄中。MongoDB的單個實例可以容納多個獨立的數(shù)據(jù)庫,每一個都有自己的集合和權(quán)限,不同的數(shù)據(jù)庫也放置在不同的文件中。

1)顯示所有數(shù)據(jù)庫

>

show dbs

admin?? 0.000GB

config? 0.000GB

local?? 0.000GB

admin:從權(quán)限的角度來看,這是"root"數(shù)據(jù)庫。要是將一個用戶添加到這個數(shù)據(jù)庫,這個用戶自動繼承所有數(shù)據(jù)庫的權(quán)限。一些特定的服務(wù)器端命令也只能從這個數(shù)據(jù)庫運行,比如列出所有的數(shù)據(jù)庫或者關(guān)閉服務(wù)器。

local:這個數(shù)據(jù)永遠(yuǎn)不會被復(fù)制,可以用來存儲限于本地單臺服務(wù)器的任意集合

config:當(dāng)Mongo用于分片設(shè)置時,config數(shù)據(jù)庫在內(nèi)部使用,用于保存分片的相關(guān)信息。

2)顯示當(dāng)前使用的數(shù)據(jù)庫

>

db

test

3)切換數(shù)據(jù)庫

>

use local

switched

to db local

>

db

local

5.5.2 文檔(Document)

文檔是一組鍵值(key-value)對組成。MongoDB 的文檔不需要設(shè)置相同的字段,并且相同的字段不需要相同的數(shù)據(jù)類型,這與關(guān)系型數(shù)據(jù)庫有很大的區(qū)別,也是 MongoDB 非常突出的特點。

一個簡單的例子:

{"name":"atguigu"}

注意:

1.文檔中的鍵/值對是有序的。

2.MongoDB區(qū)分類型和大小寫。

3.MongoDB的文檔不能有重復(fù)的鍵。

4.文檔的鍵是字符串。除了少數(shù)例外情況,鍵可以使用任意UTF-8字符。

5.5.3 集合

集合就是 MongoDB 文檔組,類似于MySQL中的table。

集合存在于數(shù)據(jù)庫中,集合沒有固定的結(jié)構(gòu),這意味著你在對集合可以插入不同格式和類型的數(shù)據(jù),但通常情況下我們插入集合的數(shù)據(jù)都會有一定的關(guān)聯(lián)性。

下面我們來看看如何創(chuàng)建集合:

MongoDB 中使用 createCollection() 方法來創(chuàng)建集合。

語法格式:

db.createCollection(name,

options)

參數(shù)說明:


name: 要創(chuàng)建的集合名稱

options: 可選參數(shù), 指定有關(guān)內(nèi)存大小及索引的選項

options 可以是如下參數(shù):

字段類型描述

capped布爾(可選)如果為 true,則創(chuàng)建固定集合。固定集合是指有著固定大小的集合,當(dāng)達(dá)到最大值時,它會自動覆蓋最早的文檔。當(dāng)該值為 true 時,必須指定 size 參數(shù)。

autoIndexId布爾(可選)如為 true,自動在 _id 字段創(chuàng)建索引。默認(rèn)為 false。

size數(shù)值(可選)為固定集合指定一個最大值(以字節(jié)計)。如果 capped true,也需要指定該字段。

max數(shù)值(可選)指定固定集合中包含文檔的最大數(shù)量。

案例1:在test庫中創(chuàng)建一個atguigu的集合

>

use test

switched

to db test

>

db.createCollection("atguigu")

{

"ok" : 1 }

>

show collections

Atguigu


//插入數(shù)據(jù)

>

db.atguigu.insert({"name":"atguigu","url":"www.atguigu.com"})

WriteResult({

"nInserted" : 1 })

//查看數(shù)據(jù)

>

db.atguigu.find()

{

"_id" : ObjectId("5d0314ceecb77ee2fb2d7566"),

"name" : "atguigu", "url" :

"www.atguigu.com" }

說明:

ObjectId 類似唯一主鍵,可以很快的去生成和排序,包含 12 bytes,含義是:

前 4 個字節(jié)表示創(chuàng)建 unix 時間戳

接下來的 3 個字節(jié)是機(jī)器標(biāo)識碼

緊接的兩個字節(jié)由進(jìn)程 id 組成PID

最后三個字節(jié)是隨機(jī)數(shù)

案例2:創(chuàng)建一個固定集合mycol

>

db.createCollection("mycol",{ capped : true,autoIndexId : true,size :

6142800, max : 1000})

>

show tables;

atguigu

mycol

案例3:自動創(chuàng)建集合

在 MongoDB 中,你不需要創(chuàng)建集合。當(dāng)你插入一些文檔時,MongoDB會自動創(chuàng)建集合。

>

db.mycol2.insert({"name":"atguigu"})

WriteResult({

"nInserted" : 1 })

>

show collections

atguigu

mycol

mycol2

案例4:刪除集合

>

db.mycol2.drop()

True

>

show tables;

atguigu

mycol

5.6 DataX導(dǎo)入導(dǎo)出案例

5.6.1 讀取MongoDB的數(shù)據(jù)導(dǎo)入到HDFS

1)編寫配置文件

[atguigu@hadoop102

datax]$ vim job/mongdb2hdfs.json

{

??? "job": {

?? ?????"content": [

??????????? {

??????????????? "reader": {

??????????????????? "name":"mongodbreader",

??????????????????? "parameter": {

??????????????????????? "address":["127.0.0.1:27017"],

???????????????????????"collectionName": "atguigu",

??????????????????????? "column": [

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"name",

??????????????????????? ????????????? "type":"string"

??????????????????????? ?????? },

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"url",

?????????????? ?????????????????????? "type":"string"

??????????????????????? ?????? }

??????????????????????? ],

??????????????????????? "dbName":"test",

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"hdfswriter",

????????????????? ??"parameter": {

??????????????????????? "column": [

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"name",

??????????????????????? ????????????? "type":"string"

??????????????????????? ?????? },

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"url",

??????????????????????? ????????????? "type":"string"

??????????????????????? ?????? }

??????????????????????? ],

??????????????????????? "defaultFS":"hdfs://hadoop102:9000",

???????????????????????"fieldDelimiter": "\t",

??????????????????????? "fileName":"mongo.txt",

??????????????????????? "fileType":"text",

??????????????????????? "path":"/",

??????????????????????? "writeMode":"append"

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

??????????????? "channel":"1"

??????????? }

??????? }

??? }

}

2)mongodbreader參數(shù)解析

address: MongoDB的數(shù)據(jù)地址信息,因為MonogDB可能是個集群,則ip端口信息需要以Json數(shù)組的形式給出?!颈靥睢?/p>

userName:MongoDB的用戶名?!具x填】

userPassword: MongoDB的密碼?!具x填】

collectionName: MonogoDB的集合名?!颈靥睢?/p>

column:MongoDB的文檔列名?!颈靥睢?/p>

name:Column的名字?!颈靥睢?/p>

type:Column的類型?!具x填】

splitter:因為MongoDB支持?jǐn)?shù)組類型,但是Datax框架本身不支持?jǐn)?shù)組類型,所以mongoDB讀出來的數(shù)組類型要通過這個分隔符合并成字符串?!具x填】

3)執(zhí)行

[atguigu@hadoop102

datax]$ bin/datax.py job/mongdb2hdfs.json

4)查看結(jié)果

5.6.2 讀取MongoDB的數(shù)據(jù)導(dǎo)入MySQL

1)在MySQL中創(chuàng)建表

mysql>

create table atguigu(name varchar(20),url varchar(20));

2)編寫DataX配置文件

[atguigu@hadoop102

datax]$ vim job/mongodb2mysql.json

{

??? "job": {

??????? "content": [

??????????? {

??????????????? "reader": {

??????????????????? "name": "mongodbreader",

??????????????????? "parameter": {

??????????????????????? "address":["127.0.0.1:27017"],

???????????????????????"collectionName": "atguigu",

??????????????????????? "column": [

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"name",

??????????????????????? ????????????? "type":"string"

??????????????????????? ?????? },

??????????????????????? ?????? {

??????????????????????? ????????????? "name":"url",

??????????????????????? ????????????? "type":"string"

??????????????????????? ?????? }

??????????????????????? ],

?????????? ?????????????"dbName":"test",

??????????????????? }

??????????????? },

??????????????? "writer": {

??????????????????? "name":"mysqlwriter",

??????????????????? "parameter": {

??????????????????????? "column": ["*"],

??????????????????????? "connection":[

??????????????????????????? {

???????????????????????????????"jdbcUrl": "jdbc:mysql://hadoop102:3306/test",

???????????????????????????????"table": ["atguigu"]

??????????????????????????? }

?????????????????????? ?],

??????????????????????? "password":"000000",

??????????????????????? "username":"root",

??????????????????????? "writeMode":"insert"

??????????????????? }

??????????????? }

??????????? }

??????? ],

??????? "setting": {

??????????? "speed": {

?? ?????????????"channel": "1"

??????????? }

??????? }

??? }

}

3)執(zhí)行

[atguigu@hadoop102

datax]$ bin/datax.py job/mongodb2mysql.json

4)查看結(jié)果

mysql>

select * from atguigu;

+---------+-----------------+

|name??? | url???????????? |

+---------+-----------------+

|

atguigu | www.atguigu.com |

+---------+-----------------+

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點,簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容