pandas (Part 1): Basic Operations on Series and DataFrame Explained

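reindex: reindexing

pandas objects have an important method, reindex, which creates a new object whose data conforms to a new index. Labels that were not present in the old index get missing values (NaN).

Taking a Series as an example: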

>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> series_obj = Series([4.5, 1.3, 5, -5.5], index=('a', 'b', 'c', 'd'))
>>> series_obj
a    4.5
b    1.3
c    5.0
d   -5.5
dtype: float64
>>> obj2 = series_obj.reindex(['a', 'b', 'c', 'e', 'f'])
>>> obj2
a    4.5
b    1.3
c    5.0
e    NaN
f    NaN
dtype: float64

When reindexing, the NaN values introduced for new labels can be filled automatically by passing fill_value:

>>> obj3 = series_obj.reindex(['a', 'b', 'c', 'e', 'f'], fill_value=0)
>>> obj3
a    4.5
b    1.3
c    5.0
e    0.0
f    0.0
dtype: float64
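To make the effect of fill_value concrete, here is a minimal sketch on the same data. It assumes a reasonably recent pandas imported as pd; the names numeric_fill and string_fill are made up for illustration. It suggests that a numeric fill value keeps the float64 dtype, while a string such as '0' turns the result into an object Series.

```python
import pandas as pd

series_obj = pd.Series([4.5, 1.3, 5, -5.5], index=["a", "b", "c", "d"])
new_index = ["a", "b", "c", "e", "f"]

# A numeric fill value fits into the existing float64 dtype.
numeric_fill = series_obj.reindex(new_index, fill_value=0)

# A string fill value cannot be stored as a float, so the result is upcast to object.
string_fill = series_obj.reindex(new_index, fill_value="0")

print(numeric_fill.dtype)
print(string_fill.dtype)

# reindex returns a new object; the original index is untouched.
print(series_obj.index.tolist())
```

In most cases the numeric form is what you want, since an object Series no longer supports fast vectorised arithmetic.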

 

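For ordered data such as time series, reindexing may require some interpolation. The method parameter of reindex provides this.

The available options for method are:

ffill or pad: fill forward (carry the previous value forward)

bfill or backfill: fill backward (carry the next value backward)

Rows that have no preceding value (for ffill) or no following value (for bfill) are filled with NaN.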

>>> obj4 = Series(['red', 'blue', 'green'], index=[0, 2, 4])
>>> obj4
0      red
2     blue
4    green
dtype: object
>>> obj4.reindex(range(6), method='ffill')
0      red
1      red
2     blue
3     blue
4    green
5    green
dtype: object
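The two fill directions can be compared side by side. The following is a small sketch, assuming pandas imported as pd; the names forward and backward are illustrative only. With bfill, the last label has no following value, so it stays missing.

```python
import pandas as pd

obj4 = pd.Series(["red", "blue", "green"], index=[0, 2, 4])

# Forward fill: each new label takes the last value seen before it.
forward = obj4.reindex(range(6), method="ffill")

# Backward fill: each new label takes the next value after it.
# Label 5 has nothing after it, so it remains missing.
backward = obj4.reindex(range(6), method="bfill")

print(forward.tolist())
print(backward.isna().tolist())
```

Both options require the original index to be sorted (monotonic); otherwise pandas raises an error.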

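Reindexing a DataFrame

When only a single sequence is passed, the rows are reindexed by default. The keyword arguments index and columns can be used to specify the row index and the column index explicitly.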

>>> frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'Cali'])
>>> frame2 = frame.reindex(['a', 'b', 'c', 'd'])
>>> frame2
   Ohio  Texas  Cali
a   0.0    1.0   2.0
b   3.0    4.0   5.0
c   6.0    7.0   8.0
d   NaN    NaN   NaN

>>> frame3 = frame.reindex(columns=['Ohio', 'Texas', 'Cali', 'Wile'], index=['a', 'b', 'c', 'd'], fill_value=4)
>>> frame3
   Ohio  Texas  Cali  Wile
a     0      1     2     4
b     3      4     5     4
c     6      7     8     4
d     4      4     4     4
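As a small sketch of the two keyword forms (assuming pandas as pd and NumPy as np; the column label 'Utah' and the names rows_only and cols_only are invented for the example), reindexing rows and reindexing columns can be done separately:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "Cali"],
)

# A single positional sequence is interpreted as the new row labels.
rows_only = frame.reindex(["a", "b", "c", "d"])

# The columns keyword reindexes (and reorders) the columns instead.
# 'Utah' is a hypothetical label that does not exist in frame.
cols_only = frame.reindex(columns=["Texas", "Utah", "Cali"])

print(rows_only)
print(cols_only)
```

The new row 'd' and the new column 'Utah' contain only NaN, because no fill_value was given.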

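When both the rows and the columns of a DataFrame are reindexed, interpolation via method was applied only along the rows (axis 0) in the pandas version used for this article. Later releases may behave differently, so check the documentation of your version.

With the label-indexing capability of ix, reindexing becomes more concise. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the example below only runs on older versions.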

>>> frame5 = frame.ix[['a', 'b', 'c', 'd'], ['Ohio', 'Texas', 'Cali', 'Wile']]
>>> frame5
   Ohio  Texas  Cali  Wile
a   0.0    1.0   2.0   NaN
b   3.0    4.0   5.0   NaN
c   6.0    7.0   8.0   NaN
d   NaN    NaN   NaN   NaN
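On current pandas the ix accessor no longer exists, so the same result has to be obtained another way. The sketch below assumes pandas 1.0 or later; the flag loc_raised is just an illustrative name. It uses reindex with both keywords, and also shows that loc is not a drop-in replacement here, because loc raises KeyError when a requested label is missing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "Cali"],
)

rows = ["a", "b", "c", "d"]
cols = ["Ohio", "Texas", "Cali", "Wile"]

# reindex tolerates labels that do not exist yet and fills them with NaN.
frame5 = frame.reindex(index=rows, columns=cols)

# loc refuses missing labels in pandas 1.0 and later.
try:
    frame.loc[rows, cols]
    loc_raised = False
except KeyError:
    loc_raised = True

print(frame5)
print(loc_raised)
```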

drop: dropping entries from a given axis

>>> obj = Series(np.arange(5),index=['a','b','c','d','e']) 
>>> obj 
a    0 
b    1 
c    2 
d    3 
e    4 
dtype: int32 
>>> new_obj = obj.drop('b') 
>>> new_obj
a    0
c    2
d    3
e    4
dtype: int32

>>> new_obj2 = obj.drop(['b', 'c'])
>>> new_obj2 
a    0 
d    3 
e    4 
dtype: int32

# DataFrame
>>> frame = DataFrame(np.arange(16).reshape((4, 4)), index=['a', 'b', 'c', 'd'], columns=['one', 'two', 'three', 'four'])
>>> frame
   one  two  three  four
a    0    1      2     3
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame = frame.drop('a')
>>> new_frame
   one  two  three  four
b    4    5      6     7
c    8    9     10    11
d   12   13     14    15
>>> new_frame2 = frame.drop(['two', 'four'], axis=1)
>>> new_frame2
   one  three
a    0      2
b    4      6
c    8     10
d   12     14
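A short sketch of drop on a DataFrame, assuming pandas 0.21 or later (where the columns keyword is available); the names by_axis and by_keyword are illustrative. It shows that axis=1 and columns= are two spellings of the same operation, and that drop returns a new object rather than modifying the original.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# Drop a row by label (axis 0 is the default).
no_a = frame.drop("a")

# Two equivalent ways to drop columns.
by_axis = frame.drop(["two", "four"], axis=1)
by_keyword = frame.drop(columns=["two", "four"])

print(by_axis.equals(by_keyword))
print(frame.shape)  # the original frame still has all rows and columns
```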

Indexing, selection, and filtering

A Series can be indexed by position, like a NumPy array, or by the labels of its own index.

>>> obj 
a    0 
b    1 
c    2 
d    3 
e    4 
dtype: int32 
>>> obj['a'] 
0 
>>> obj[1] 
1
# Note: slicing by label is closed on the right, i.e. the end label is included.
>>> obj['a':'c']
a    0
b    1
c    2
dtype: int32
>>> obj[3:4]
d    3
dtype: int32
>>> obj[2:3]
c    2
dtype: int32
>>> obj[[3,1]]
d    3
b    1
dtype: int32
>>> obj[['a','c']]
a    0
c    2
dtype: int32
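The difference between label slicing and positional slicing is easy to get wrong, so here is a minimal sketch that makes the intent explicit with loc (labels) and iloc (positions). It assumes pandas as pd and NumPy as np; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

# Label slice: both end points are included.
by_label = obj.loc["a":"c"]

# Positional slice: the end position is excluded, as in plain Python.
by_position = obj.iloc[0:2]

print(by_label.tolist())
print(by_position.tolist())
```

Using loc and iloc also avoids the ambiguity of plain square brackets when the index itself contains integers.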

Modifying values through indexing

>>> obj[['b','d']] *=2 
>>> obj 
a    0 
b    2 
c    2 
d    6 
e    4 
dtype: int32

 

Indexing a DataFrame:

Direct indexing with a label or a list of labels selects columns only.

>>> frame 
   one  two  three  four 
a    0    1      2     3 
b    4    5      6     7 
c    8    9     10    11 
d   12   13     14    15 
>>> frame['a'] 
KeyError: 'a' 
>>> frame['one'] 
a     0 
b     4 
c     8 
d    12 
Name: one, dtype: int32 
>>> frame[['one','four']] 
   one  four 
a    0     3 
b    4     7 
c    8    11 
d   12    15 

Indexing with a slice or a boolean array selects rows.

>>> frame[1:3]  # half-open interval: the end position is excluded
   one  two  three  four 
b    4    5      6     7 
c    8    9     10    11 
>>> frame[frame['three'] > 8] 
   one  two  three  four 
c    8    9     10    11 
d   12   13     14    15 
>>>
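Boolean selection extends naturally to several conditions. The sketch below assumes pandas as pd and NumPy as np; the names mask, selected and both are illustrative. Conditions are combined with & (and) or | (or), and each condition needs its own parentheses.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# A boolean Series aligned on the row index acts as a row filter.
mask = frame["three"] > 8
selected = frame[mask]

# Combine conditions element-wise; the parentheses are required.
both = frame[(frame["three"] > 5) & (frame["one"] < 12)]

print(selected.index.tolist())
print(both.index.tolist())
```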

The ix indexing field of a DataFrame

>>> frame.ix['a']  # index by row label
one      0 
two      1 
three    2 
four     3 
Name: a, dtype: int32 
>>> frame.ix[['b','d']] 
   one  two  three  four 
b    4    5      6     7 
d   12   13     14    15
>>> frame.ix[1]  # also indexes rows, this time by position
one      4 
two      5 
three    6 
four     7 
Name: b, dtype: int32 
>>> frame.ix[1:3] 
   one  two  three  four 
b    4    5      6     7 
c    8    9     10    11

 

>>> frame.ix[1:2,[2,3,1]] 
   three  four  two 
b      6     7    5 
>>> frame.ix[1:3,[2,3,1]] 
   three  four  two 
b      6     7    5 
c     10    11    9 
>>> frame.ix[['b','d'],['one','three']] 
   one  three 
b    4      6 
d   12     14 
>>> frame.ix[['b','d'],[3,1,2]] 
   four  two  three 
b     7    5      6 
d    15   13     14 
>>> frame.ix[:,[2,3,1]]  # select all rows
   three  four  two 
a      2     3    1 
b      6     7    5 
c     10    11    9 
d     14    15   13

>>> frame.ix[frame.three > 5, :3]
   one  two  three
b    4    5      6
c    8    9     10
d   12   13     14
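Since ix was removed in pandas 1.0, the selections above need loc (labels) or iloc (positions) on current versions. The following is one possible translation, a sketch rather than the only way to write it; the variable names are invented for the example. Mixed label and position selections are handled by looking the column labels up through frame.columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_a = frame.loc["a"]              # was frame.ix['a']
second_row = frame.iloc[1]          # was frame.ix[1]
pos = frame.iloc[1:3, [2, 3, 1]]    # was frame.ix[1:3,[2,3,1]]

# Mixed case: row labels with column positions.
# Convert the positions to labels first, then use loc.
mixed = frame.loc[["b", "d"], frame.columns[[3, 1, 2]]]

# Boolean row mask with the first three columns.
masked = frame.loc[frame["three"] > 5, frame.columns[:3]]

print(pos)
print(mixed)
print(masked)
```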

 

Arithmetic and data alignment

>>> s1 = Series([1.3,4.5,6.6,3.4],index=['a','b','c','d']) 
>>> s2 = Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','g']) 
>>> s1+s2 
a    2.3 
b    6.5 
c    9.6 
d    7.4 
e    NaN 
f    NaN 
g    NaN 
dtype: float64 
# Labels that do not overlap get missing values in the result.
# The same applies to DataFrame.
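A small sketch for inspecting the result of an aligned addition, assuming pandas as pd; the names total and missing are illustrative. It lists the labels that ended up as NaN because they exist in only one operand.

```python
import pandas as pd

s1 = pd.Series([1.3, 4.5, 6.6, 3.4], index=["a", "b", "c", "d"])
s2 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=["a", "b", "c", "d", "e", "f", "g"])

# The result index is the union of both indexes.
total = s1 + s2

# Labels present in only one operand become NaN.
missing = total[total.isna()].index.tolist()

print(total.index.tolist())
print(missing)
```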

Filling missing values in arithmetic methods

>>> df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd')) 
>>> df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde')) 
>>> df1 + df2  # the plain operator produces missing values
      a     b     c     d   e 
0   0.0   2.0   4.0   6.0 NaN 
1   9.0  11.0  13.0  15.0 NaN 
2  18.0  20.0  22.0  24.0 NaN 
3   NaN   NaN   NaN   NaN NaN 
# With the arithmetic method, a value missing from one operand can be filled in.
>>> df1.add(df2,fill_value=0) 
      a     b     c     d     e 
0   0.0   2.0   4.0   6.0   4.0 
1   9.0  11.0  13.0  15.0   9.0 
2  18.0  20.0  22.0  24.0  14.0 
3  15.0  16.0  17.0  18.0  19.0 
>>>

The arithmetic methods are:

add: addition

sub: subtraction

div: division

mul: multiplication
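All four methods accept fill_value. The sketch below uses two small Series invented for the example and assumes pandas as pd. The fill value stands in for the operand that is missing at a label, so 0 is a natural choice for add and sub, while 1 is a natural choice for mul and div.

```python
import pandas as pd

# Hypothetical data: label 'c' exists only in s2.
s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

added = s1.add(s2, fill_value=0)       # c: 0 + 30
subtracted = s1.sub(s2, fill_value=0)  # c: 0 - 30
multiplied = s1.mul(s2, fill_value=1)  # c: 1 * 30
divided = s1.div(s2, fill_value=1)     # c: 1 / 30

print(added.tolist())
print(subtracted.tolist())
print(multiplied.tolist())
print(divided.tolist())
```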

Operations between a DataFrame and a Series

>>> frame 
   one  two  three  four 
a    0    1      2     3 
b    4    5      6     7 
c    8    9     10    11 
d   12   13     14    15 
>>> series = frame.ix[0] 
>>> series 
one      0 
two      1 
three    2 
four     3 
Name: a, dtype: int32 
>>> frame - series 
   one  two  three  four 
a    0    0      0     0 
b    4    4      4     4 
c    8    8      8     8 
d   12   12     12    12 
>>>

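Arithmetic between the two matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows.

If an index value is found in only one of the two, that is, it is missing from either the DataFrame's columns or the Series's index, both objects are reindexed to form the union.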

>>> series2 = Series(range(3),index = ['two','four','five']) 
>>> frame +series2 
   five  four  one  three   two 
a   NaN   4.0  NaN    NaN   1.0 
b   NaN   8.0  NaN    NaN   5.0 
c   NaN  12.0  NaN    NaN   9.0 
d   NaN  16.0  NaN    NaN  13.0
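If the NaN columns from the union are not wanted, one option is to restrict both operands to the labels they share before adding. This is a sketch of that idea, not something the original example does; it assumes pandas as pd and NumPy as np, and the names shared and restricted are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)
series2 = pd.Series(range(3), index=["two", "four", "five"])

# Default behaviour: the result columns are the union of both label sets.
result = frame + series2

# Keep only the labels present in both operands.
shared = frame.columns.intersection(series2.index)
restricted = frame[shared] + series2[shared]

print(sorted(result.columns))
print(restricted)
```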

To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0:

>>> series3 = frame['two'] 
>>> frame.sub(series3,axis = 0) 
   one  two  three  four 
a   -1    0      1     2 
b   -1    0      1     2 
c   -1    0      1     2 
d   -1    0      1     2 
>>>
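A common use of row-wise matching is subtracting each row's mean from that row. The sketch below is an illustration under that assumption, not part of the original example; it assumes pandas as pd and NumPy as np, and the names row_means, centered and misaligned are invented. It also shows what happens when the plain operator is used by mistake: the row labels are matched against the column labels, nothing overlaps, and the result is all NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# One mean per row, indexed by the row labels a..d.
row_means = frame.mean(axis=1)

# axis=0 matches the Series index against the row labels.
centered = frame.sub(row_means, axis=0)

# The plain operator matches against the columns instead,
# so no labels overlap and every cell is NaN.
misaligned = frame - row_means

print(centered)
print(misaligned.shape)
```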

 

Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/9221.html
