pandas DataFrame Operations Explained

pandas is a powerful tool for working with structured, tabular data. This article will be updated over time.

1. DataFrame

1.1 Constructing from a dictionary

>>> import pandas as pd 
>>> d = {'doc':['txt1'], 'nid':[100]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> df 
   nid   doc 
0  100  txt1 
>>> 
>>> d2 = {'doc':['txt1', 'txt2'], 'nid':[100, 200]}  # multiple rows; all dict values must have the same length 
>>> df2 = pd.DataFrame(data=d2, columns=('nid', 'doc')) 
>>> df2 
   nid   doc 
0  100  txt1 
1  200  txt2
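
As a minimal sketch of the rule noted in the comment above (all dictionary values must have the same length), the following builds a DataFrame from a dictionary and then tries one with mismatched lengths. The names bad and length_error are illustrative only, and the exact wording of the error message varies between pandas versions, so it is printed rather than quoted here.

```python
import pandas as pd

# Each key becomes a column; each list holds that column's values.
d = {'doc': ['txt1', 'txt2'], 'nid': [100, 200]}
df = pd.DataFrame(data=d, columns=['nid', 'doc'])  # columns= fixes the column order
print(df)

# Hypothetical bad input: 'doc' has two values but 'nid' has only one.
bad = {'doc': ['txt1', 'txt2'], 'nid': [100]}
try:
    pd.DataFrame(data=bad)
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(length_error)
```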

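1.2 add: element-wise addition, not row appending

Note that add does not append rows. It adds two DataFrames element by element, aligned on index and column labels: numbers are summed and strings are concatenated.

1.2.1 Adding a single-row DataFrame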

 
>>> import pandas as pd 
>>> d = {'doc':['txt1'], 'nid':[100]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> d2 = {'doc':['txt2'], 'nid':[200]} 
>>> df 
   nid   doc 
0  100  txt1 
>>> df = df.add(pd.DataFrame(d2)) 
>>> df 
        doc  nid 
0  txt1txt2  300

1.2.2 Adding multi-row DataFrames

>>> import pandas as pd 
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> df 
   nid    doc 
0  100   txt1 
1  300  text3 
>>> d2 = {'doc':['txt2'], 'nid':[200]} 
>>> df2 = df.add(pd.DataFrame(d2)) 
>>> df 
   nid    doc 
0  100   txt1 
1  300  text3 
>>> df2  # row counts differ, so index 1 has no counterpart and becomes NaN 
        doc    nid 
0  txt1txt2  300.0 
1       NaN    NaN 
>>> d3 = {'doc':['txt2', 'text4'], 'nid':[200, 400]}  
>>> df3 = df.add(pd.DataFrame(d3)) 
>>> df3   # row counts match, so values are combined row by row 
          doc  nid 
0    txt1txt2  300 
1  text3text4  700
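
The NaN row above comes from index alignment: add pairs rows by index label, and a label present in only one frame has nothing to be added to. A small sketch, assuming numeric-only frames (the names a, b, summed and filled are illustrative), shows the default behavior next to the fill_value parameter, which substitutes a value for the missing side:

```python
import pandas as pd

a = pd.DataFrame({'nid': [100, 300]})   # index labels 0 and 1
b = pd.DataFrame({'nid': [200]})        # index label 0 only

# Default: label 1 exists only in a, so the result there is NaN.
summed = a.add(b)

# fill_value=0 treats the missing entry in b as 0 before adding.
filled = a.add(b, fill_value=0)

print(summed)
print(filled)
```

fill_value is only sensible when the substitute is compatible with the column type, so this sketch leaves out the string column used in the sessions above.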

1.3 append

>>> import pandas as pd 
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> d2 = {'doc':['txt2'], 'nid':[200]} 
>>> df = df.append(pd.DataFrame(data=d2,  
                                columns=('nid', 'doc')),            
                                ignore_index=True) 
>>> df 
   nid    doc 
0  100   txt1 
1  300  text3 
2  200   txt2 
>>> df.to_csv('p.txt', index=False)  # save as a CSV file
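
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the session above only runs on older versions. On newer versions, pd.concat can be used for the same purpose. The sketch below mirrors the example above; the names new_rows, combined and csv_text are illustrative, and the CSV is rendered to a string instead of being written to p.txt.

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
new_rows = pd.DataFrame({'nid': [200], 'doc': ['txt2']})

# ignore_index=True renumbers the rows 0..n-1, as in the append call above.
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = combined.to_csv(index=False)
print(csv_text)
```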

1.4 merge

Method signature:
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

1.4.1 Merging DataFrames with identical columns

>>> import pandas as pd 
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> d2 = {'doc':['txt2', 'txt1'], 'nid':[200, 500]} 
>>> df2 = df.merge(pd.DataFrame(d2, columns=('nid', 'doc')))  # default: inner join on all shared columns (nid and doc), and no row matches on both 
>>> df2 
Empty DataFrame 
Columns: [nid, doc] 
Index: [] 
>>> df 
   nid    doc 
0  100   txt1 
1  300  text3 
>>> df2 = df.merge(pd.DataFrame(d2,  
                                columns=('nid', 'doc')),  
                                how='outer')  # outer join 
>>> df2 
   nid    doc 
0  100   txt1 
1  300  text3 
2  200   txt2 
3  500   txt1
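
When on is not given, merge joins on every column the two frames share, which is why the inner join above is empty. The following sketch, using the same data, shows two options from the signature above: on to restrict the join key, and indicator to record where each row came from. The variable names are illustrative. Row order in an outer join can differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

left = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
right = pd.DataFrame({'nid': [200, 500], 'doc': ['txt2', 'txt1']})

# Join on 'doc' only; the non-key 'nid' columns get the _x / _y suffixes.
on_doc = left.merge(right, on='doc')
print(on_doc)

# indicator=True adds a '_merge' column: left_only, right_only, or both.
outer = left.merge(right, how='outer', indicator=True)
origin_counts = outer['_merge'].astype(str).value_counts().to_dict()
print(outer)
print(origin_counts)
```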

1.4.2 Merging DataFrames with partially overlapping columns

>>> import pandas as pd 
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> d2 = {'nid':[200]}  # only one column in common 
>>> df2 = df.merge(pd.DataFrame(d2, columns=('nid',)), how='outer') 
>>> df2 
   nid    doc 
0  100   txt1 
1  300  text3 
2  200    NaN
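
To see how the how parameter changes the result for this partially overlapping case, the sketch below runs the same merge with each of the four common join types and records the number of rows returned. The dictionary name sizes is illustrative.

```python
import pandas as pd

left = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
right = pd.DataFrame({'nid': [200]})  # shares only the 'nid' column

# inner: keys in both frames; left/right: all keys of that side;
# outer: the union of keys from both sides.
sizes = {
    how: len(left.merge(right, how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
print(sizes)

# In the right join, nid 200 has no matching row in left, so doc is NaN.
right_join = left.merge(right, how='right')
print(right_join)
```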

1.4.3 No columns in common

>>> import pandas as pd 
>>> d = {'doc':['txt1', 'text3'], 'nid':[100, 300]} 
>>> df = pd.DataFrame(data=d, columns=('nid', 'doc')) 
>>> df2 = pd.DataFrame() 
>>> df3 =df2.merge(df, how='outer') 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
  File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 4607, in merge 
    copy=copy, indicator=indicator) 
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 61, in merge 
    copy=copy, indicator=indicator) 
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 538, in __init__ 
    self._validate_specification() 
  File "/Library/Python/2.7/site-packages/pandas/tools/merge.py", line 883, in _validate_specification 
    raise MergeError('No common columns to perform merge on') 
pandas.tools.merge.MergeError: No common columns to perform merge on
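
The traceback above was produced on Python 2.7 with an old pandas release. In current releases the exception is available as pd.errors.MergeError. One way to handle the situation, sketched below with an illustrative helper named merge_if_possible (not part of pandas), is to check for shared columns before merging, or to catch the exception.

```python
import pandas as pd

df = pd.DataFrame({'nid': [100, 300], 'doc': ['txt1', 'text3']})
empty = pd.DataFrame()  # no columns at all


def merge_if_possible(left, right, **kwargs):
    """Hypothetical helper: merge only when the frames share a column."""
    if left.columns.intersection(right.columns).empty:
        return None
    return left.merge(right, **kwargs)


# Option 1: check first and skip the merge.
result = merge_if_possible(empty, df, how='outer')

# Option 2: attempt the merge and catch the error.
try:
    empty.merge(df, how='outer')
    message = None
except pd.errors.MergeError as exc:
    message = str(exc)

print(result)
print(message)
```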

Original article by ItWorker. If you republish it, please credit the source: https://blog.ytso.com/9340.html
