Data Quality

Data quality problems include values that are

  • missing
  • inconsistent
  • invalid
  • implausible

Data preparation workflow

  • 1: How to use data profiling methods to
    Characterise data and provide high-level insights
    Investigate data quality so that the data can be cleaned

  • The data preparation workflow includes three steps

  • Firstly, Discover
    What data sources exist, and at what level of detail
    What spatio-temporal coverage they offer, and at what cost

  • Secondly, Wrangle
    Read in data, reformat, transform, link

  • Thirdly, Profile
    Rigorous investigation of data quality

Subset of Data preparation

  • I: Look at your data
    Number of rows
    Example values
    Data format
    Data type
    How is it encoded?

    1. Why you must care about data encoding
      Explanation: if your data contain anything other than the most basic English (ASCII) text, people may not be able to read them unless you state the character encoding
    2. File size & number of rows
    3. Check the data types
      Check the format yourself
      Don’t rely on heuristics
      Don’t assume that all your data files use the same format, even if the files come from one source
    4. Example values
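
The checks in step I can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the sample bytes, the column names and the helper names `first_look` and `is_number` are hypothetical and do not come from the notes.

```python
import csv
import io

# Hypothetical sample bytes; in practice: raw = open(path, "rb").read()
raw = "id,city,temp\n1,Zürich,21.5\n2,Oslo,n/a\n3,São Paulo,30\n".encode("utf-8")

def first_look(raw_bytes, encoding):
    """Decode with a stated encoding, then summarise size, rows and examples."""
    text = raw_bytes.decode(encoding)
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        "size_bytes": len(raw_bytes),
        "n_rows": len(rows),
        "columns": list(rows[0].keys()) if rows else [],
        "examples": rows[:2],
        "rows": rows,
    }

def is_number(value):
    """Explicit type check instead of trusting a reader's guess."""
    try:
        float(value)
        return True
    except ValueError:
        return False

summary = first_look(raw, "utf-8")

# Values in the 'temp' column that do not parse as numbers.
bad_temps = [r["temp"] for r in summary["rows"] if not is_number(r["temp"])]

# Decoding with the wrong encoding raises no error here; the text is silently garbled.
garbled = first_look(raw, "latin-1")["examples"][0]["city"]

print(summary["n_rows"], summary["columns"], bad_temps, garbled)
```

The last two lines show why the encoding has to be stated: the wrong choice can succeed silently and corrupt every non-ASCII value.
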
  • II: Read your data correctly: watch out for special values
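
As a rough sketch of what "special values" can do, the snippet below treats a few placeholder codes as missing before computing a mean. The codes in `SENTINELS` and the sample readings are assumptions made for illustration; the real codes have to come from the documentation of the data.

```python
# Hypothetical placeholder codes; check the data documentation for the real ones.
SENTINELS = {"", "NA", "N/A", "n/a", "null", "-999"}

def parse_measurement(raw):
    """Return a float, or None when the field holds a special value."""
    cleaned = raw.strip()
    if cleaned in SENTINELS:
        return None
    return float(cleaned)

readings = ["21.5", "-999", " 19.0 ", "NA", ""]

# Naive reading: only the obviously non-numeric fields are skipped,
# so the -999 code is treated as a real measurement.
naive = [float(r) for r in readings if r.strip() not in ("", "NA")]
naive_mean = sum(naive) / len(naive)

# Careful reading: every known special value becomes None first.
parsed = [parse_measurement(r) for r in readings]
valid = [v for v in parsed if v is not None]
clean_mean = sum(valid) / len(valid)

print(naive_mean, clean_mean)
```

A single unrecognised code is enough to drag the naive mean far away from the plausible range.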

  • III: Is all the data there?

  • 1: Missing values
    The statistical terminology is confusing
    Visualization helps to reveal patterns of missingness

  • 1.1: Missing at random (MAR)
    – Related to other variables
    – Term is misleading!

  • 1.2: Missing completely at random (MCAR)
    – Haphazard
    – Unrelated to values of variable, or other variables

  • 1.3: Missing not at random (MNAR)
    – Related to values of the variable itself
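
The three mechanisms can be made concrete with a small simulation. Everything here is synthetic and assumed for illustration: the link between age and income, the 0.3 and 0.6 probabilities, and the 50000 threshold are arbitrary choices, not figures from the notes.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic records in which income rises with age (an assumption).
people = []
for _ in range(1000):
    age = rng.randint(20, 70)
    income = 20000 + 500 * age + rng.gauss(0, 5000)
    people.append({"age": age, "income": income})

# True means "the income value is missing for this person".
masks = {
    # MCAR: haphazard, unrelated to any variable.
    "MCAR": [rng.random() < 0.3 for _ in people],
    # MAR: depends on another variable (age), not on income itself.
    "MAR": [p["age"] >= 50 and rng.random() < 0.6 for p in people],
    # MNAR: depends on the value of income itself.
    "MNAR": [p["income"] > 50000 for p in people],
}

true_mean = sum(p["income"] for p in people) / len(people)
observed_means = {
    name: sum(p["income"] for p, gone in zip(people, mask) if not gone)
    / (len(mask) - sum(mask))
    for name, mask in masks.items()
}

print(round(true_mean), {k: round(v) for k, v in observed_means.items()})
```

Comparing each observed mean with `true_mean` gives a feel for how much each mechanism can distort a simple summary; under MNAR the high incomes are removed by construction, so the observed mean is necessarily too low.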

  • 2: Coverage (e.g. temporal or geographic)

  • 2.1: Temporal coverage

  • 2.2: Spatial coverage
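
One simple way to check coverage is to compare what was observed with what was expected. The sketch below assumes daily data over a known date range and a known list of regions; the dates and region names are made up for illustration.

```python
from datetime import date, timedelta

# Temporal coverage: which expected days have no observations?
observed_days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5), date(2024, 1, 6)]
start, end = date(2024, 1, 1), date(2024, 1, 7)
expected_days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
missing_days = sorted(set(expected_days) - set(observed_days))

# Spatial coverage: which expected regions never appear in the data?
expected_regions = {"north", "south", "east", "west"}
observed_regions = {"north", "south", "west"}
missing_regions = expected_regions - observed_regions

print(missing_days, missing_regions)
```

The same set-difference idea extends to other grids, such as hours, postcodes or sensor IDs, as long as the expected set can be written down.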

  • 3: Duplicates
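
A minimal sketch of duplicate detection follows, separating exact copies from records that collide on a normalised key. The records and the choice of email as the key are hypothetical; which fields identify a record depends on the data set.

```python
from collections import Counter

records = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "name": "Alan Turing", "email": "alan@example.com"},
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},   # exact copy
    {"id": 3, "name": "ada lovelace ", "email": "ADA@example.com"},  # near-duplicate
]

# Exact duplicates: every field is identical.
exact_counts = Counter(tuple(sorted(r.items())) for r in records)
exact_duplicates = [dict(items) for items, n in exact_counts.items() if n > 1]

# Key duplicates: records that share a key after simple normalisation.
def email_key(record):
    return record["email"].strip().lower()

key_counts = Counter(email_key(r) for r in records)
duplicated_keys = [k for k, n in key_counts.items() if n > 1]

print(exact_duplicates, duplicated_keys)
```

Exact matching misses the fourth record; the normalised key catches it, at the price of having to decide which differences are meaningful.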

  • IV: Rigorously check data quality

  • How to write data validation rules
    1.1: Subject-matter specialists typically use free text to describe valid values and explain how to clean them
    1.2: Data scientists may need to write validation & cleaning rules as pseudocode
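
Once rules exist as pseudocode, they can be turned into small named checks. The sketch below is one possible shape, not a prescribed method: the patient records, the field names and the three rules are all invented for illustration.

```python
from datetime import date

# Each rule is a (description, check) pair; a check returns True when the record passes.
RULES = [
    ("age is present", lambda r: r.get("age") is not None),
    ("age is plausible (0-120)", lambda r: r.get("age") is None or 0 <= r["age"] <= 120),
    ("discharge is not before admission", lambda r: r["admitted"] <= r["discharged"]),
]

def validate(records):
    """Return (record index, rule description) for every failed check."""
    failures = []
    for i, rec in enumerate(records):
        for description, check in RULES:
            if not check(rec):
                failures.append((i, description))
    return failures

patients = [
    {"age": 34, "admitted": date(2024, 3, 1), "discharged": date(2024, 3, 4)},
    {"age": None, "admitted": date(2024, 3, 2), "discharged": date(2024, 3, 3)},
    {"age": 212, "admitted": date(2024, 3, 10), "discharged": date(2024, 3, 2)},
]

failures = validate(patients)
for index, description in failures:
    print(f"record {index}: failed '{description}'")
```

Keeping the description next to each check means the failure report can be read back by the subject-matter specialists who wrote the original free-text rules.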
