pandas is a powerful tool for working with structured data. This article will be updated over time.
1. DataFrame
1.1 Constructing from a dictionary
>>> import pandas as pd
>>> d = {'doc':['txt1'], 'nid':[100]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> df
   nid   doc
0  100  txt1
>>>
>>> d2 = {'doc':['txt1', 'txt2'], 'nid':[100, 200]}  # multiple rows; every value list in the dict must have the same length
>>> df2 = pd.DataFrame(data=d2, columns=('nid', 'doc'))
>>> df2
   nid   doc
0  100  txt1
1  200  txt2
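The comment above says the value lists in the dictionary must all have the same length. As a minimal sketch of what happens otherwise (the names `bad` and `error_message` are illustrative only and not from the original), pandas should refuse to build the frame and raise a `ValueError`:

```python
import pandas as pd

# Two 'doc' values but only one 'nid' value: the lengths do not match.
bad = {'doc': ['txt1', 'txt2'], 'nid': [100]}

try:
    pd.DataFrame(data=bad, columns=['nid', 'doc'])
    error_message = None
except ValueError as exc:
    # pandas cannot line up columns of unequal length into rows
    error_message = str(exc)

print(error_message)
```

The exact wording of the message may differ between pandas versions, so the sketch only records it instead of comparing it to a fixed string.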
1.2 add: element-wise addition (it does not append rows)
1.2.1 Adding a single-row DataFrame
>>> import pandas as pd
>>> d = {'doc':['txt1'], 'nid':[100]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> d2 = {'doc':['txt2'], 'nid':[200]}
>>> df
   nid   doc
0  100  txt1
>>> df = df.add(pd.DataFrame(d2))
>>> df
        doc  nid
0  txt1txt2  300
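As the output above shows, `add` does not append a row. It aligns the two frames on their index and column labels and adds the matching cells, so the numbers are summed and the strings are concatenated. A minimal sketch, with the illustrative names `other`, `summed` and `same`, showing that this matches the `+` operator:

```python
import pandas as pd

df = pd.DataFrame({'nid': [100], 'doc': ['txt1']})
other = pd.DataFrame({'nid': [200], 'doc': ['txt2']})

# add() aligns on index and column labels, then adds matching cells
summed = df.add(other)

# the + operator performs the same element-wise addition
same = summed.equals(df + other)

print(summed)
print(same)
```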
1.2.2 Adding multi-row DataFrames
>>> import pandas as pd
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> df
   nid    doc
0  100   txt1
1  300  text3
>>> d2 = {'doc':['txt2'], 'nid':[200]}
>>> df2 = df.add(pd.DataFrame(d2))
>>> df
   nid    doc
0  100   txt1
1  300  text3
>>> df2  # the row counts differ: rows without a matching index label in the other frame become NaN
        doc    nid
0  txt1txt2  300.0
1       NaN    NaN
>>> d3 = {'doc':['txt2', 'text4'], 'nid':[200, 400]}
>>> df3 = df.add(pd.DataFrame(d3))
>>> df3  # the row counts match: the values are added row by row
          doc  nid
0    txt1txt2  300
1  text3text4  700
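The `NaN` row in `df2` comes from index alignment: a label that exists in only one frame has nothing to be added to. If that is not wanted, `add` accepts a `fill_value` argument that stands in for the missing side. A minimal sketch on a numeric column only (the names `left`, `right`, `plain` and `filled` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'nid': [100, 300]})
right = pd.DataFrame({'nid': [200]})   # has index label 0 only

# Row 1 exists only in `left`, so the plain sum is NaN there
plain = left.add(right)

# fill_value=0 substitutes 0 for the missing side before adding
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

With `fill_value=0`, row 1 keeps its original value of 300 instead of turning into `NaN`.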
1.3 append
>>> import pandas as pd
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> d2 = {'doc':['txt2'], 'nid':[200]}
>>> df = df.append(pd.DataFrame(data=d2,
...                             columns=('nid', 'doc')),
...                ignore_index=True)
>>> df
   nid    doc
0  100   txt1
1  300  text3
2  200   txt2
>>> df.to_csv('p.txt', index=False)  # save as a CSV file
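Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the session above only runs on older versions. A minimal sketch of the same row append using `pd.concat`, which works on both old and new versions (the names `new_rows` and `combined` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
new_rows = pd.DataFrame({'nid': [200], 'doc': ['txt2']})

# Stack the frames vertically; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([df, new_rows], ignore_index=True)

print(combined)
```

When many rows arrive one at a time, collecting the pieces in a list and calling `pd.concat` once at the end is usually cheaper than concatenating inside a loop.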
1.4 merge
Method signature:
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
1.4.1 Merging DataFrames with identical columns
>>> import pandas as pd
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> d2 = {'doc':['txt2', 'txt1'], 'nid':[200, 500]}
>>> df2 = df.merge(pd.DataFrame(d2, columns=('nid', 'doc')))
>>> df2
Empty DataFrame
Columns: [nid, doc]
Index: []
>>> df
   nid    doc
0  100   txt1
1  300  text3
>>> df2 = df.merge(pd.DataFrame(d2,
...                             columns=('nid', 'doc')),
...                how='outer')  # outer join
>>> df2
   nid    doc
0  100   txt1
1  300  text3
2  200   txt2
3  500   txt1
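The first merge in this section comes back empty because, with no `on` argument, `merge` joins on every column the two frames share (here both `nid` and `doc`) and the default `how='inner'` keeps only rows that match on all of them. A minimal sketch, with the illustrative names `other`, `on_all` and `on_doc`, of joining on `doc` alone instead:

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
other = pd.DataFrame({'nid': [200, 500], 'doc': ['txt2', 'txt1']})

# Default: join on every shared column (nid AND doc), so nothing matches
on_all = df.merge(other)

# Join on doc only; the clashing nid columns get the _x / _y suffixes
on_doc = df.merge(other, on='doc')

print(on_all.empty)
print(on_doc)
```

Here `txt1` is the only `doc` value present in both frames, so `on_doc` has a single row carrying `nid_x` from `df` and `nid_y` from `other`.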
1.4.2 Merging DataFrames that share only some columns
>>> import pandas as pd
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> d2 = {'nid':[200]}  # only one column in common
>>> df2 = df.merge(pd.DataFrame(d2, columns=('nid',)), how='outer')
>>> df2
   nid    doc
0  100   txt1
1  300  text3
2  200    NaN
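In an outer join, it can be hard to tell afterwards which frame a row came from. The `indicator` parameter in the signature above adds a `_merge` column that records this. A minimal sketch (the names `other`, `merged` and `origin` are illustrative); the result is read back as a dictionary because the row order of an outer join can differ between pandas versions:

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
other = pd.DataFrame({'nid': [200]})

# indicator=True adds a _merge column: left_only, right_only or both
merged = df.merge(other, how='outer', indicator=True)

# Map each nid to the side it came from, independent of row order
origin = dict(zip(merged['nid'].tolist(), merged['_merge'].astype(str)))

print(merged)
print(origin)
```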
1.4.3 No columns in common
>>> import pandas as pd
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]}
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc'))
>>> df2 = pd.DataFrame()
>>> df3 = df2.merge(df, how='outer')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 4607, in merge
    copy=copy, indicator=indicator)
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 61, in merge
    copy=copy, indicator=indicator)
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 538, in __init__
    self._validate_specification()
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 883, in _validate_specification
    raise MergeError('No common columns to perform merge on')
pandas.tools.merge.MergeError: No common columns to perform merge on
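The error above can be anticipated by checking for shared columns before merging. The sketch below assumes a recent pandas, where the exception is exposed as `pd.errors.MergeError` instead of the `pandas.tools.merge` path in the traceback; `shared_columns` is a hypothetical helper written for this example, not a pandas function:

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
empty = pd.DataFrame()

def shared_columns(left, right):
    """Hypothetical helper: list the column labels both frames have."""
    return list(left.columns.intersection(right.columns))

# An empty list means merge() has no key to join on
shared = shared_columns(empty, df)

try:
    empty.merge(df, how='outer')
    raised = False
except pd.errors.MergeError:
    # Same failure as in the traceback: no common columns
    raised = True

print(shared, raised)
```

If the goal is to accumulate rows starting from an empty frame, stacking the pieces with `pd.concat` is usually a better fit than `merge`, since no join key is needed.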
Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/9340.html