Serializing and Deserializing Data with Apache Avro, and Reading Avro Data with Spark

Introduction

This article shows how to serialize data with Apache Avro, and how to use Spark to turn the serialized data into a DataSet and a DataFrame for further processing.

What is Apache Avro?


Apache Avro is a data serialization system. It provides:

  1. Rich data structures
  2. A compact, fast binary data format
  3. A container file for storing persistent data
  4. Remote procedure call (RPC)
  5. Simple integration with dynamic languages
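
To make the "compact binary format" point concrete: the Avro specification encodes `int` and `long` values with a variable-length zig-zag encoding, so small values of either sign take only one or two bytes. The class below is a minimal standalone sketch of that encoding for illustration only. The class and method names are made up for this example, and this is not the Avro library's own code.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch of the zig-zag varint encoding described in the Avro specification. */
final class ZigZagVarint {

    /** Encodes a long so that values of small magnitude, positive or negative, use few bytes. */
    static byte[] encodeLong(long value) {
        // Zig-zag step: maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
        long n = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits per byte, lowest bits first; a set high bit means more bytes follow.
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    /** Decodes a byte sequence produced by encodeLong. */
    static long decodeLong(byte[] bytes) {
        long n = 0;
        int shift = 0;
        for (byte b : bytes) {
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        // Undo the zig-zag step.
        return (n >>> 1) ^ -(n & 1);
    }

    public static void main(String[] args) {
        for (long v : new long[] {0L, -1L, 1L, 64L, 1234567890123L}) {
            System.out.println(v + " -> " + encodeLong(v).length + " byte(s)");
        }
    }
}
```

For example, the value 64 is encoded as the two bytes `0x80 0x01`, whereas a fixed-width long always takes eight bytes.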

Avro provides APIs for Java, Python, C, C++, C# and other languages. Below, a Java example illustrates how Avro serializes and deserializes data.


Avro website: http://avro.apache.org/
Avro version: 1.8.1
Avro jar to download: avro-tools-1.8.1.jar. This jar is mainly used to generate the corresponding Java files from a schema file.

Define a schema file named CustomerAddress.avsc with the following content:

```json
{
  "namespace":"com.peach.arvo",
  "type": "record",
  "name": "CustomerAddress",
  "fields": [
    {"name":"ca_address_sk","type":"long"},
    {"name":"ca_address_id","type":"string"},
    {"name":"ca_street_number","type":"string"},
    {"name":"ca_street_name","type":"string"},
    {"name":"ca_street_type","type":"string"},
    {"name":"ca_suite_number","type":"string"},
    {"name":"ca_city","type":"string"},
    {"name":"ca_county","type":"string"},
    {"name":"ca_state","type":"string"},
    {"name":"ca_zip","type":"string"},
    {"name":"ca_country","type":"string"},
    {"name":"ca_gmt_offset","type":"double"},
    {"name":"ca_location_type","type":"string"}
  ]
}
```
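* namespace: the package of the generated Java class, and therefore the path used to import it
* type: one of the complex types (record, enum, array, map, union, and fixed)
* name: the class name of the generated Java file
* fields: the fields defined in the schema, with their types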

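Once the schema file is defined, use the avro-tools-1.8.1.jar downloaded above to generate the Java code with the following command:
```
java -jar avro-tools-1.8.1.jar compile schema CustomerAddress.avsc .
```
The trailing "." means the Java code is generated in the current directory. When the command succeeds, the output looks like this:
![Generated Java code](http://upload-images.jianshu.io/upload_images/4861551-84cc805ecf1dfa48.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
The corresponding CustomerAddress.java file is generated under com/peach/arvo/ in the current directory (the directory follows the schema's namespace, com.peach.arvo). It will be used once the project has been created.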
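
The relationship between the schema and the generated code can be sketched in a few lines. The helper below is hypothetical (it is not part of avro-tools). It only mirrors the naming visible in this article: the namespace becomes the directory, the record name becomes the file name, and a snake_case field such as `ca_address_sk` shows up in the later code as the setter `setCaAddressSk`.

```java
/** Illustrative helper: mirrors the naming seen in this article, not avro-tools' implementation. */
final class GeneratedNames {

    /** Relative path of the generated source file for a schema namespace and record name. */
    static String sourcePath(String namespace, String name) {
        return namespace.replace('.', '/') + "/" + name + ".java";
    }

    /** Setter name for a snake_case field, e.g. "ca_city" becomes "setCaCity". */
    static String setterName(String field) {
        StringBuilder sb = new StringBuilder("set");
        for (String part : field.split("_")) {
            if (part.isEmpty()) {
                continue; // skip empty pieces caused by doubled underscores
            }
            sb.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sourcePath("com.peach.arvo", "CustomerAddress"));
        System.out.println(setterName("ca_address_sk"));
    }
}
```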
#### Create a Java project with Maven; the project directory structure is shown below
<p>Add the Maven dependency:</p>

    <dependency>
        <groupId>org.apache.avro</groupId>  
        <artifactId>avro</artifactId>  
        <version>1.8.1</version>  
    </dependency>  

![Maven project directory structure](http://upload-images.jianshu.io/upload_images/4861551-45804c10c2d94871.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Write the code that generates the Avro data file. Code snippet:

    package com.peach;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.StringTokenizer;

    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.specific.SpecificDatumWriter;

    import com.peach.arvo.CustomerAddress;

    /**
     * Generates the Avro data file.
     *
     * @author peach
     * 2017-03-02
     */
    public class GenerateDataApp {
        // Alternative: locate the schema file on the classpath and parse it at runtime.
        // private static String customerAddress_avsc_path;
        //
        // static {
        //     customerAddress_avsc_path = GenerateDataApp.class.getResource("/CustomerAddress.avsc").getPath();
        // }

        // Backslashes must be escaped inside Java string literals.
        private static String source_data_path = "F:\\data\\customer_address.dat"; // source data file path
        private static String dest_avro_data_path = "F:\\data\\customeraddress.avro"; // generated Avro data file path

        public static void main(String[] args) {
            // if (customerAddress_avsc_path != null) {
            //     File file = new File(customerAddress_avsc_path);
            //     Schema schema = new Schema.Parser().parse(file);
            // }
            DatumWriter<CustomerAddress> caDatumWriter = new SpecificDatumWriter<>(CustomerAddress.class);
            // try-with-resources closes the writer even if loading fails.
            try (DataFileWriter<CustomerAddress> dataFileWriter = new DataFileWriter<>(caDatumWriter)) {
                dataFileWriter.create(new CustomerAddress().getSchema(), new File(dest_avro_data_path));
                loadData(dataFileWriter);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Loads the source data file and appends each parsed record to the writer.
         * @param dataFileWriter writer for the Avro data file
         */
        private static void loadData(DataFileWriter<CustomerAddress> dataFileWriter) {
            File file = new File(source_data_path);
            if (!file.isFile()) {
                return;
            }
            // Closing the BufferedReader also closes the wrapped streams.
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    CustomerAddress address = getCustomerAddress(line);
                    if (address != null) {
                        dataFileWriter.append(address);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /**
         * Builds a CustomerAddress object from one line of the source file.
         * @param line one "|"-delimited record
         * @return the populated object, or null if the line cannot be parsed
         */
        private static CustomerAddress getCustomerAddress(String line) {
            CustomerAddress ca = null;
            try {
                // Compare string content with isEmpty(); != would compare references.
                if (line != null && !line.isEmpty()) {
                    // StringTokenizer silently drops empty tokens when splitting.
                    StringTokenizer token = new StringTokenizer(line, "|");
                    if (token.countTokens() >= 13) {
                        ca = new CustomerAddress();
                        ca.setCaAddressSk(Long.parseLong(token.nextToken()));
                        ca.setCaAddressId(token.nextToken());
                        ca.setCaStreetNumber(token.nextToken());
                        ca.setCaStreetName(token.nextToken());
                        ca.setCaStreetType(token.nextToken());
                        ca.setCaSuiteNumber(token.nextToken());
                        ca.setCaCity(token.nextToken());
                        ca.setCaCounty(token.nextToken());
                        ca.setCaState(token.nextToken());
                        ca.setCaZip(token.nextToken());
                        ca.setCaCountry(token.nextToken());
                        ca.setCaGmtOffset(Double.parseDouble(token.nextToken()));
                        ca.setCaLocationType(token.nextToken());
                    } else {
                        System.err.println(line); // too few fields: report and skip
                    }
                }
            } catch (NumberFormatException e) {
                System.err.println(line); // malformed number: report and skip
                ca = null;
            }

            return ca;
        }
    }
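
The comment about `StringTokenizer` dropping empty tokens matters for delimited data: a record with an empty field yields fewer tokens than it has columns, so the `countTokens() >= 13` check rejects it. The small sketch below (the class name and sample line are invented for illustration) compares `StringTokenizer` with `String.split` using a negative limit, which keeps empty fields.

```java
import java.util.StringTokenizer;

/** Illustrative comparison of two ways to split a "|"-delimited line. */
final class DelimiterDemo {

    /** Number of tokens StringTokenizer reports; empty fields are skipped. */
    static int tokenizerCount(String line) {
        return new StringTokenizer(line, "|").countTokens();
    }

    /** Number of fields from split with limit -1; empty and trailing fields are kept. */
    static int splitCount(String line) {
        return line.split("\\|", -1).length;
    }

    public static void main(String[] args) {
        String line = "1|AAAA||Main St"; // the third field is empty
        System.out.println("StringTokenizer: " + tokenizerCount(line)); // prints 3
        System.out.println("split(..., -1):  " + splitCount(line));     // prints 4
    }
}
```

Which behavior is appropriate depends on whether empty columns are legal in the source data. With `split`, each field keeps its position, so empty values have to be handled explicitly.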


The Avro file can also be generated dynamically: wrap each row in a GenericRecord object and append it to the Avro file. Code snippet:

    // sourcePath, logger and SchemaUtil are defined elsewhere in the project and are not shown here.
    private static void loadData(DataFileWriter<GenericRecord> dataFileWriter, Schema schema) {
        File file = new File(sourcePath);
        // new File(...) never returns null, so check that the file actually exists.
        if (!file.isFile()) {
            logger.error("[peach], source data not found");
            return;
        }

        // Closing the BufferedReader also closes the wrapped streams.
        try (BufferedReader bufferedReader =
                 new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                // Compare string content with isEmpty(); != would compare references.
                if (!line.isEmpty()) {
                    String[] values = line.split("\\|");
                    GenericRecord genericRecord = SchemaUtil.convertRecord(values, schema);
                    if (genericRecord != null) {
                        dataFileWriter.append(genericRecord);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
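
`SchemaUtil.convertRecord` is not shown in this article. As a rough idea of what such a helper has to do, the sketch below converts each text value to the Java type matching its field's Avro primitive type. It is an assumption-laden stand-in: it uses a plain `Map` in place of `GenericRecord` and an array of name/type pairs in place of `Schema`, so that it runs without the Avro library. All names in it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in for a values-to-record converter; not the article's SchemaUtil. */
final class RecordSketch {

    /** Converts one text value to the Java type for an Avro primitive type name. */
    static Object convert(String value, String avroType) {
        switch (avroType) {
            case "long":   return Long.parseLong(value);
            case "int":    return Integer.parseInt(value);
            case "double": return Double.parseDouble(value);
            case "string": return value;
            default: throw new IllegalArgumentException("unsupported type: " + avroType);
        }
    }

    /**
     * Pairs each value with its field, given fields as {name, type} pairs.
     * Returns null when the column count differs or a number is malformed,
     * matching the null check in the calling code above.
     */
    static Map<String, Object> convertRecord(String[] values, String[][] fields) {
        if (values.length != fields.length) {
            return null;
        }
        Map<String, Object> record = new LinkedHashMap<>();
        try {
            for (int i = 0; i < fields.length; i++) {
                record.put(fields[i][0], convert(values[i], fields[i][1]));
            }
        } catch (NumberFormatException e) {
            return null;
        }
        return record;
    }

    public static void main(String[] args) {
        String[][] fields = {{"ca_address_sk", "long"}, {"ca_city", "string"}, {"ca_gmt_offset", "double"}};
        System.out.println(convertRecord("1|Fairview|-5.0".split("\\|"), fields));
    }
}
```

With the real Avro API, the same loop would read field names and types from the `Schema` and put the converted values into a `GenericRecord`.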

After the Avro file has been generated, create a Scala project and use the Spark API to read the Avro file. Add the Spark Maven dependencies:

    <dependency>  
        <groupId>com.peach</groupId>  
        <artifactId>generatedata</artifactId>  
        <version>1.0-SNAPSHOT</version>  
    </dependency>  
    <dependency>  
        <groupId>com.databricks</groupId>  
        <artifactId>spark-avro_2.10</artifactId>  
        <version>2.1.0</version>  
    </dependency>  
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10 -->
    <dependency>  
        <groupId>org.apache.spark</groupId>  
        <artifactId>spark-sql_2.10</artifactId>  
        <version>2.1.0</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.spark</groupId>  
        <artifactId>spark-core_2.10</artifactId>  
        <version>2.1.0</version>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.avro</groupId>  
        <artifactId>avro</artifactId>  
        <version>1.8.1</version>  
    </dependency>  
![Maven Scala project](http://upload-images.jianshu.io/upload_images/4861551-4c3e36494e8b6f36.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Write the Scala code that reads the file. Code snippet:

    import com.peach.arvo.CustomerAddress
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Declared outside main so that an implicit Encoder can be derived for it.
    case class CustomerAddressData(ca_address_sk: Long,
                                   ca_address_id: String,
                                   ca_street_number: String,
                                   ca_street_name: String,
                                   ca_street_type: String,
                                   ca_suite_number: String,
                                   ca_city: String,
                                   ca_county: String,
                                   ca_state: String,
                                   ca_zip: String,
                                   ca_country: String,
                                   ca_gmt_offset: Double,
                                   ca_location_type: String)
    // org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

    // The enclosing object's name is illustrative; the original snippet does not show it.
    object ReadAvroApp {
      def main(args: Array[String]): Unit = {
        val path = "/Users/zoulihan/Desktop/customeraddress.avro"
        val conf = new SparkConf().setAppName("test").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)
        // Brings the implicit Encoders into scope that createDataset needs for the case class.
        import sqlContext.implicits._

        val _rdd = sc.hadoopFile[AvroWrapper[CustomerAddress], NullWritable, AvroInputFormat[CustomerAddress]](path)
        // Copy each Avro record into the case class; Avro strings are CharSequences, hence toString.
        val ddd = _rdd.map { line =>
          val c = line._1.datum()
          CustomerAddressData(
            c.getCaAddressSk,
            c.getCaAddressId.toString,
            c.getCaStreetNumber.toString,
            c.getCaStreetName.toString,
            c.getCaStreetType.toString,
            c.getCaSuiteNumber.toString,
            c.getCaCity.toString,
            c.getCaCounty.toString,
            c.getCaState.toString,
            c.getCaZip.toString,
            c.getCaCountry.toString,
            c.getCaGmtOffset,
            c.getCaLocationType.toString)
        }
        val ds = sqlContext.createDataset(ddd)
        ds.show()
        val df = ds.toDF()
        df.createTempView("customer_address")

        // sqlContext.sql("select count(*) from customer_address").show()
        sqlContext.sql("select * from customer_address limit 10").show()
      }
    }


<p>Spark run result</p>

![Paste_Image.png](http://upload-images.jianshu.io/upload_images/4861551-b539947108706374.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Source code:
https://github.com/javaxsky/avrotospark

Extensions:
1. How to load Avro data files into Hive
2. Loading aggregated data into Hive with Spark SQL