<em> 內(nèi)連接</em>
<em> 左外連接</em>
<em> 右外連接</em>
<em> 全外連接</em>
<em> 示例</em>
<B>join</B>
<B>concat</B>
merage#
<em>pandas提供了一個(gè)類似于關(guān)系數(shù)據(jù)庫(kù)的連接(join)操作的方法<Strong>merage</Strong>,可以根據(jù)一個(gè)或多個(gè)鍵將不同DataFrame中的行連接起來(lái)</em>
<em>語(yǔ)法如下</em>
merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False)
參數(shù)說(shuō)明:
- left與right:兩個(gè)不同的DataFrame
- how:指的是合并(連接)的方式有inner(內(nèi)連接),left(左外連接),right(右外連接),outer(全外連接);默認(rèn)為inner
- on : 指的是用于連接的列索引名稱。必須存在右右兩個(gè)DataFrame對(duì)象中,如果沒(méi)有指定且其他參數(shù)也未指定則以兩個(gè)DataFrame的列名交集做為連接鍵
- left_on:左則DataFrame中用作連接鍵的列名;這個(gè)參數(shù)中左右列名不相同,但代表的含義相同時(shí)非常有用。
- right_on:右則DataFrame中用作 連接鍵的列名
- left_index:使用左則DataFrame中的行索引做為連接鍵
- right_index:使用右則DataFrame中的行索引做為連接鍵
- sort:默認(rèn)為True,將合并的數(shù)據(jù)進(jìn)行排序。在大多數(shù)情況下設(shè)置為False可以提高性能
- suffixes:字符串值組成的元組,用于指定當(dāng)左右DataFrame存在相同列名時(shí)在列名后面附加的后綴名稱,默認(rèn)為('_x','_y')
- copy:默認(rèn)為True,總是將數(shù)據(jù)復(fù)制到數(shù)據(jù)結(jié)構(gòu)中;大多數(shù)情況下設(shè)置為False可以提高性能
- indicator:在 0.17.0中還增加了一個(gè)顯示合并數(shù)據(jù)中來(lái)源情況;如只來(lái)自己于左邊(left_only)、兩者(both)
內(nèi)連接##
<em>符合連接條件和查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的jion,示例SQL語(yǔ)句 </em>
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
或
SELECT *
FROM df1,df2 where df1.key=df2.key
<em>對(duì)應(yīng)pandas語(yǔ)句</em>
pd.merge(df1, df2, on='key')
左外連接##
<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的left outer join,示例SQL語(yǔ)句 </em>
-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
<em>對(duì)應(yīng)pandas語(yǔ)句</em>
pd.merge(df1, df2, on='key', how='left')
右外連接##
<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回右表中不符合連接條件單符合查詢條件的數(shù)據(jù)行,相當(dāng)于數(shù)據(jù)庫(kù)中的right outer join,示例SQL語(yǔ)句 </em>
-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
<em>對(duì)應(yīng)pandas語(yǔ)句</em>
pd.merge(df1, df2, on='key', how='right')
全外連接##
<em>符合連接條件和查詢條件的數(shù)據(jù)行并返回左表和左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行。全外連接相當(dāng)于左外連接與左外連接的合集(去掉重復(fù)),相當(dāng)于數(shù)據(jù)庫(kù)中的full outer join,示例SQL語(yǔ)句 </em>
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
<em>對(duì)應(yīng)pandas語(yǔ)句</em>
pd.merge(df1, df2, on='key', how='outer')
更詳細(xì)的參見(jiàn)官方網(wǎng)站
示例##
#coding=utf-8
from pandas import Series,DataFrame,merge
import numpy as np
data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}])
data1=DataFrame([{"id":100,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}])
data2=DataFrame([{"id":0,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}])
print "單個(gè)列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data1,on="name",suffixes=('_a','_b'))
print "多列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data2,on=("name","id"))
print '不指定on則以兩個(gè)DataFrame的列名交集做為連接鍵\r\n',merge(data,data2) #這里使用了id與name
#使用右邊的DataFrame的行索引做為連接鍵
##設(shè)置行索引名稱
indexed_data1=data1.set_index("name")
print "使用右邊的DataFrame的行索引做為連接鍵\r\n",merge(data,indexed_data1,left_on='name',right_index=True)
print '左外連接\r\n',merge(data,data1,on="name",how="left",suffixes=('_a','_b'))
print '左外連接1\r\n',merge(data1,data,on="name",how="left")
print '右外連接\r\n',merge(data,data1,on="name",how="right")
data3=DataFrame([{"mid":0,"mname":'lxh','cs':10},{"mid":101,"mname":'xiao','cs':40},{"mid":102,"mname":'hua2','cs':50}])
#當(dāng)左右兩個(gè)DataFrame的列名不同,當(dāng)又想做為連接鍵時(shí)可以使用left_on與right_on來(lái)指定連接鍵
print "使用left_on與right_on來(lái)指定列名字不同的連接鍵\r\n",merge(data,data3,left_on=["name","id"],right_on=["mname","mid"])
輸出
單個(gè)列名做為內(nèi)鏈接的連接鍵
age cp id_a name cs id_b
0 20 lm 0 lxh 10 100
1 40 ly 1 xiao 40 101
多列名做為內(nèi)鏈接的連接鍵
age cp id name cs
0 20 lm 0 lxh 10
不指定on則以兩個(gè)DataFrame的列名交集做為連接鍵
age cp id name cs
0 20 lm 0 lxh 10
使用右邊的DataFrame的行索引做為連接鍵
age cp id_x name cs id_y
0 20 lm 0 lxh 10 100
1 40 ly 1 xiao 40 101
左外連接
age cp id_a name cs id_b
0 20 lm 0 lxh 10 100
1 40 ly 1 xiao 40 101
2 4 yry 2 hua NaN NaN
3 70 old 3 be NaN NaN
左外連接1
cs id_x name age cp id_y
0 10 100 lxh 20 lm 0
1 40 101 xiao 40 ly 1
2 50 102 hua2 NaN NaN NaN
右外連接
age cp id_x name cs id_y
0 20 lm 0 lxh 10 100
1 40 ly 1 xiao 40 101
2 NaN NaN NaN hua2 50 102
使用left_on與right_on來(lái)指定列名字不同的連接鍵
age cp id name cs mid mname
0 20 lm 0 lxh 10 0 lxh
join#
<em>join方法提供了一個(gè)簡(jiǎn)便的方法用于將兩個(gè)DataFrame中的不同的列索引合并成為一個(gè)DataFrame</em>
join(self, other, on=None, how='left', lsuffix='', rsuffix='',
sort=False):
其中參數(shù)的意義與merge方法基本相同,只是join方法默認(rèn)為左外連接how=left
示例
#coding=utf-8
from pandas import Series,DataFrame,merge
data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}],index=['a','b','c','d'])
data1=DataFrame([{"sex":0},{"sex":1},{"sex":2}],index=['a','b','e'])
print '使用默認(rèn)的左連接\r\n',data.join(data1) #這里可以看出自動(dòng)屏蔽了data中沒(méi)有的index=e 那一行的數(shù)據(jù)
print '使用右連接\r\n',data.join(data1,how="right") #這里出自動(dòng)屏蔽了data1中沒(méi)有index=c,d的那行數(shù)據(jù);等價(jià)于data1.join(data)
print '使用內(nèi)連接\r\n',data.join(data1,how='inner')
print '使用全外連接\r\n',data.join(data1,how='outer')
輸出
使用默認(rèn)的左連接
age cp id name sex
a 20 lm 0 lxh 0
b 40 ly 1 xiao 1
c 4 yry 2 hua NaN
d 70 old 3 be NaN
使用右連接
age cp id name sex
a 20 lm 0 lxh 0
b 40 ly 1 xiao 1
e NaN NaN NaN NaN 2
使用內(nèi)連接
age cp id name sex
a 20 lm 0 lxh 0
b 40 ly 1 xiao 1
使用全外連接
age cp id name sex
a 20 lm 0 lxh 0
b 40 ly 1 xiao 1
c 4 yry 2 hua NaN
d 70 old 3 be NaN
e NaN NaN NaN NaN 2
concat#
<em>concat方法相當(dāng)于數(shù)據(jù)庫(kù)中的全連接(UNION ALL),可以指定按某個(gè)軸進(jìn)行連接,也可以指定連接的方式j(luò)oin(outer,inner 只有這兩種)。與數(shù)據(jù)庫(kù)不同的時(shí)concat不會(huì)去重,要達(dá)到去重的效果可以使用drop_duplicates方法</em>
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False, copy=True):
參數(shù)說(shuō)明
Parameters
----------
objs : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the `keys`
argument, unless it is passed, in which case the values will be
selected (see below). Any None objects will be dropped silently unless
they are all None in which case a ValueError will be raised
axis : {0, 1, ...}, default 0
The axis to concatenate along
join : {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis(es)
join_axes : list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing
inner/outer set logic
verify_integrity : boolean, default False
Check whether the new concatenated axis contains duplicates. This can
be very expensive relative to the actual data concatenation
keys : sequence, default None
If multiple levels passed, should contain tuples. Construct
hierarchical index using the passed keys as the outermost level
levels : list of sequences, default None
Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys
names : list, default None
Names for the levels in the resulting hierarchical index
ignore_index : boolean, default False
If True, do not use the index values along the concatenation axis. The
resulting axis will be labeled 0, ..., n - 1. This is useful if you are
concatenating objects where the concatenation axis does not have
meaningful indexing information. Note the the index values on the other
axes are still respected in the join.
copy : boolean, default True
If False, do not copy data unnecessarily
示例
#coding=utf-8
from pandas import Series,DataFrame,concat
df1 = DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)})
df2 = DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], 'rank': [1, 4, 5]})
print '按軸進(jìn)行內(nèi)連接\r\n',concat([df1,df2],join="inner",axis=1)
print '進(jìn)行外連接并指定keys(行索引)\r\n',concat([df1,df2],keys=['a','b']) #這里有重復(fù)的數(shù)據(jù)
print '去重后\r\n',concat([df1,df2],ignore_index=True).drop_duplicates()
輸出
按軸進(jìn)行內(nèi)連接
city rank city rank
0 Chicago 1 Chicago 1
1 San Francisco 2 Boston 4
2 New York City 3 Los Angeles 5
進(jìn)行外連接并指定keys(行索引)
city rank
a 0 Chicago 1
1 San Francisco 2
2 New York City 3
b 0 Chicago 1
1 Boston 4
2 Los Angeles 5
去重后
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
4 Boston 4
5 Los Angeles 5