Pandas數(shù)據(jù)合并

merage

<em> 內(nèi)連接</em>

<em> 左外連接</em>

<em> 右外連接</em>

<em> 全外連接</em>

<em> 示例</em>

<B>join</B>

<B>concat</B>

merage#

<em>pandas提供了一個(gè)類似于關(guān)系數(shù)據(jù)庫(kù)的連接(join)操作的方法<Strong>merage</Strong>,可以根據(jù)一個(gè)或多個(gè)鍵將不同DataFrame中的行連接起來(lái)</em>

<em>語(yǔ)法如下</em>

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=True,
      suffixes=('_x', '_y'), copy=True, indicator=False)

參數(shù)說(shuō)明:

  1. left與right:兩個(gè)不同的DataFrame
  2. how:指的是合并(連接)的方式有inner(內(nèi)連接),left(左外連接),right(右外連接),outer(全外連接);默認(rèn)為inner
  3. on : 指的是用于連接的列索引名稱。必須存在右右兩個(gè)DataFrame對(duì)象中,如果沒(méi)有指定且其他參數(shù)也未指定則以兩個(gè)DataFrame的列名交集做為連接鍵
  4. left_on:左則DataFrame中用作連接鍵的列名;這個(gè)參數(shù)中左右列名不相同,但代表的含義相同時(shí)非常有用。
  5. right_on:右則DataFrame中用作 連接鍵的列名
  6. left_index:使用左則DataFrame中的行索引做為連接鍵
  7. right_index:使用右則DataFrame中的行索引做為連接鍵
  8. sort:默認(rèn)為True,將合并的數(shù)據(jù)進(jìn)行排序。在大多數(shù)情況下設(shè)置為False可以提高性能
  9. suffixes:字符串值組成的元組,用于指定當(dāng)左右DataFrame存在相同列名時(shí)在列名后面附加的后綴名稱,默認(rèn)為('_x','_y')
  10. copy:默認(rèn)為True,總是將數(shù)據(jù)復(fù)制到數(shù)據(jù)結(jié)構(gòu)中;大多數(shù)情況下設(shè)置為False可以提高性能
  11. indicator:在 0.17.0中還增加了一個(gè)顯示合并數(shù)據(jù)中來(lái)源情況;如只來(lái)自己于左邊(left_only)、兩者(both)

內(nèi)連接##

<em>符合連接條件和查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的jion,示例SQL語(yǔ)句 </em>

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;

SELECT *
FROM df1,df2 where df1.key=df2.key

<em>對(duì)應(yīng)pandas語(yǔ)句</em>

pd.merge(df1, df2, on='key')

左外連接##

<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的left outer join,示例SQL語(yǔ)句 </em>

-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;

<em>對(duì)應(yīng)pandas語(yǔ)句</em>

pd.merge(df1, df2, on='key', how='left')

右外連接##

<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回右表中不符合連接條件單符合查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的right outer join,示例SQL語(yǔ)句 </em>

-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;

<em>對(duì)應(yīng)pandas語(yǔ)句</em>

pd.merge(df1, df2, on='key', how='right')

全外連接##

<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回左表和左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行。全外連接相當(dāng)于左外連接與左外連接的合集(去掉重復(fù)),相當(dāng)于數(shù)據(jù)庫(kù)中的full outer join,示例SQL語(yǔ)句 </em>

-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;

<em>對(duì)應(yīng)pandas語(yǔ)句</em>

pd.merge(df1, df2, on='key', how='outer')

更詳細(xì)的參見(jiàn)官方網(wǎng)站

示例##

#coding=utf-8
from pandas import Series,DataFrame,merge
import numpy as np
data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}])
data1=DataFrame([{"id":100,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}])
data2=DataFrame([{"id":0,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}])

print "單個(gè)列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data1,on="name",suffixes=('_a','_b'))
print "多列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data2,on=("name","id"))
print '不指定on則以兩個(gè)DataFrame的列名交集做為連接鍵\r\n',merge(data,data2) #這里使用了id與name

#使用右邊的DataFrame的行索引做為連接鍵
##設(shè)置行索引名稱
indexed_data1=data1.set_index("name")
print "使用右邊的DataFrame的行索引做為連接鍵\r\n",merge(data,indexed_data1,left_on='name',right_index=True)


print '左外連接\r\n',merge(data,data1,on="name",how="left",suffixes=('_a','_b'))
print '左外連接1\r\n',merge(data1,data,on="name",how="left")
print '右外連接\r\n',merge(data,data1,on="name",how="right")
data3=DataFrame([{"mid":0,"mname":'lxh','cs':10},{"mid":101,"mname":'xiao','cs':40},{"mid":102,"mname":'hua2','cs':50}])

#當(dāng)左右兩個(gè)DataFrame的列名不同,當(dāng)又想做為連接鍵時(shí)可以使用left_on與right_on來(lái)指定連接鍵
print "使用left_on與right_on來(lái)指定列名字不同的連接鍵\r\n",merge(data,data3,left_on=["name","id"],right_on=["mname","mid"])

輸出

單個(gè)列名做為內(nèi)鏈接的連接鍵
   age  cp  id_a  name  cs  id_b
0   20  lm     0   lxh  10   100
1   40  ly     1  xiao  40   101
多列名做為內(nèi)鏈接的連接鍵
   age  cp  id name  cs
0   20  lm   0  lxh  10
不指定on則以兩個(gè)DataFrame的列名交集做為連接鍵
   age  cp  id name  cs
0   20  lm   0  lxh  10
使用右邊的DataFrame的行索引做為連接鍵
   age  cp  id_x  name  cs  id_y
0   20  lm     0   lxh  10   100
1   40  ly     1  xiao  40   101
左外連接
   age   cp  id_a  name  cs  id_b
0   20   lm     0   lxh  10   100
1   40   ly     1  xiao  40   101
2    4  yry     2   hua NaN   NaN
3   70  old     3    be NaN   NaN
左外連接1
   cs  id_x  name  age   cp  id_y
0  10   100   lxh   20   lm     0
1  40   101  xiao   40   ly     1
2  50   102  hua2  NaN  NaN   NaN
右外連接
   age   cp  id_x  name  cs  id_y
0   20   lm     0   lxh  10   100
1   40   ly     1  xiao  40   101
2  NaN  NaN   NaN  hua2  50   102
使用left_on與right_on來(lái)指定列名字不同的連接鍵
   age  cp  id name  cs  mid mname
0   20  lm   0  lxh  10    0   lxh

join#

<em>join方法提供了一個(gè)簡(jiǎn)便的方法用于將兩個(gè)DataFrame中的不同的列索引合并成為一個(gè)DataFrame</em>

join(self, other, on=None, how='left', lsuffix='', rsuffix='',
             sort=False):

其中參數(shù)的意義與merge方法基本相同,只是join方法默認(rèn)為左外連接how=left

示例

#coding=utf-8
from pandas import Series,DataFrame,merge

data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}],index=['a','b','c','d'])
data1=DataFrame([{"sex":0},{"sex":1},{"sex":2}],index=['a','b','e'])

print '使用默認(rèn)的左連接\r\n',data.join(data1)  #這里可以看出自動(dòng)屏蔽了data中沒(méi)有的index=e 那一行的數(shù)據(jù)
print '使用右連接\r\n',data.join(data1,how="right") #這里出自動(dòng)屏蔽了data1中沒(méi)有index=c,d的那行數(shù)據(jù);等價(jià)于data1.join(data)
print '使用內(nèi)連接\r\n',data.join(data1,how='inner')
print '使用全外連接\r\n',data.join(data1,how='outer')

輸出

使用默認(rèn)的左連接
   age   cp  id  name  sex
a   20   lm   0   lxh    0
b   40   ly   1  xiao    1
c    4  yry   2   hua  NaN
d   70  old   3    be  NaN
使用右連接
   age   cp  id  name  sex
a   20   lm   0   lxh    0
b   40   ly   1  xiao    1
e  NaN  NaN NaN   NaN    2
使用內(nèi)連接
   age  cp  id  name  sex
a   20  lm   0   lxh    0
b   40  ly   1  xiao    1
使用全外連接
   age   cp  id  name  sex
a   20   lm   0   lxh    0
b   40   ly   1  xiao    1
c    4  yry   2   hua  NaN
d   70  old   3    be  NaN
e  NaN  NaN NaN   NaN    2

concat#

<em>concat方法相當(dāng)于數(shù)據(jù)庫(kù)中的全連接(UNION ALL),可以指定按某個(gè)軸進(jìn)行連接,也可以指定連接的方式j(luò)oin(outer,inner 只有這兩種)。與數(shù)據(jù)庫(kù)不同的時(shí)concat不會(huì)去重,要達(dá)到去重的效果可以使用drop_duplicates方法</em>

 concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
           keys=None, levels=None, names=None, verify_integrity=False, copy=True):

參數(shù)說(shuō)明

Parameters

----------
objs : a sequence or mapping of Series, DataFrame, or Panel objects
    If a dict is passed, the sorted keys will be used as the `keys`
    argument, unless it is passed, in which case the values will be
    selected (see below). Any None objects will be dropped silently unless
    they are all None in which case a ValueError will be raised
axis : {0, 1, ...}, default 0
    The axis to concatenate along
join : {'inner', 'outer'}, default 'outer'
    How to handle indexes on other axis(es)
join_axes : list of Index objects
    Specific indexes to use for the other n - 1 axes instead of performing
    inner/outer set logic
verify_integrity : boolean, default False
    Check whether the new concatenated axis contains duplicates. This can
    be very expensive relative to the actual data concatenation
keys : sequence, default None
    If multiple levels passed, should contain tuples. Construct
    hierarchical index using the passed keys as the outermost level
levels : list of sequences, default None
    Specific levels (unique values) to use for constructing a
    MultiIndex. Otherwise they will be inferred from the keys
names : list, default None
    Names for the levels in the resulting hierarchical index
ignore_index : boolean, default False
    If True, do not use the index values along the concatenation axis. The
    resulting axis will be labeled 0, ..., n - 1. This is useful if you are
    concatenating objects where the concatenation axis does not have
    meaningful indexing information. Note the the index values on the other
    axes are still respected in the join.
copy : boolean, default True
    If False, do not copy data unnecessarily

示例

#coding=utf-8
from pandas import Series,DataFrame,concat

df1 = DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)})
df2 = DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], 'rank': [1, 4, 5]})
print '按軸進(jìn)行內(nèi)連接\r\n',concat([df1,df2],join="inner",axis=1)
print '進(jìn)行外連接并指定keys(行索引)\r\n',concat([df1,df2],keys=['a','b']) #這里有重復(fù)的數(shù)據(jù)
print '去重后\r\n',concat([df1,df2],ignore_index=True).drop_duplicates()

輸出

按軸進(jìn)行內(nèi)連接
            city  rank         city  rank
0        Chicago     1      Chicago     1
1  San Francisco     2       Boston     4
2  New York City     3  Los Angeles     5
進(jìn)行外連接并指定keys(行索引)
              city  rank
a 0        Chicago     1
  1  San Francisco     2
  2  New York City     3
b 0        Chicago     1
  1         Boston     4
  2    Los Angeles     5
去重后
            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3
4         Boston     4
5    Los Angeles     5

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容