Data Analysis with Python

Python for Data Analysis: Getting Started with pandas (5)(3)

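1. Dropping entries from an axis

Dropping one or more entries from an axis is simple: all you need is an array or list of index labels. Because doing this by hand takes some data munging and set logic, the drop method does it for you and returns a new object with the specified values removed from the given axis: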

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [4]: new_obj = obj.drop('c')

In [5]: new_obj
Out[5]:
a    0
b    1
d    3
e    4
dtype: float64
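Since drop returns a new object, the original Series should be left untouched. The following is a minimal sketch of that behavior; the variable names without_c and without_d_and_c are illustrative and not from the original text.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop returns a new Series; obj itself still holds all five labels.
without_c = obj.drop('c')
# A list of labels drops several entries at once.
without_d_and_c = obj.drop(['d', 'c'])

print(list(obj.index))              # ['a', 'b', 'c', 'd', 'e']
print(list(without_c.index))        # ['a', 'b', 'd', 'e']
print(list(without_d_and_c.index))  # ['a', 'b', 'e']
```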

With a DataFrame, index values can be deleted from either axis:

In [6]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ...:                     columns=['one', 'two', 'three', 'four'])

In [7]: data.drop(['Colorado', 'Ohio'])
Out[7]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
[2 rows x 4 columns]

In [8]: data.drop('two', axis=1)
Out[8]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
[4 rows x 3 columns]

In [9]: data.drop(['two', 'four'], axis=1)
Out[9]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
[4 rows x 2 columns]
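As a small sketch, and assuming a reasonably recent pandas release, drop also accepts index= and columns= keywords that name the axis directly instead of passing axis=1. The names by_axis, by_keyword, and rows_dropped are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Two spellings of the same column drop.
by_axis = data.drop(['two', 'four'], axis=1)
by_keyword = data.drop(columns=['two', 'four'])

# Row labels can be named explicitly as well.
rows_dropped = data.drop(index=['Colorado', 'Ohio'])

print(by_axis.equals(by_keyword))  # True
print(list(rows_dropped.index))    # ['Utah', 'New York']
print(data.shape)                  # (4, 4), the original is unchanged
```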

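2. Indexing, selection, and filtering

Series indexing (obj[...]) works much like NumPy array indexing, except that the index values of a Series are not limited to integers. Here are a few examples: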

In [10]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [11]: obj['b']
Out[11]: 1.0

In [12]: obj[1]
Out[12]: 1.0

In [13]: obj[2:4]
Out[13]:
c    2
d    3
dtype: float64

In [14]: obj[['b', 'a', 'd']]
Out[14]:
b    1
a    0
d    3
dtype: float64

In [15]: obj[[1, 3]]
Out[15]:
b    1
d    3
dtype: float64

In [16]: obj[obj < 2]
Out[16]:
a    0
b    1
dtype: float64
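The transcript mixes label keys and integer positions inside plain obj[...]. Newer pandas releases discourage using integer keys as positions on a label-indexed Series, so a hedged sketch of the same selections with the explicit loc (labels) and iloc (positions) accessors may be useful. Variable names here are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b']                # label-based lookup
by_position = obj.iloc[1]              # position-based lookup
several_positions = obj.iloc[[1, 3]]   # positions 1 and 3
filtered = obj[obj < 2]                # boolean mask, same as the transcript

print(by_label)                        # 1.0
print(by_position)                     # 1.0
print(list(several_positions.index))   # ['b', 'd']
print(list(filtered.index))            # ['a', 'b']
```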

Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive:

In [17]: obj['b':'c']
Out[17]:
b    1
c    2
dtype: float64

Assigning through a label slice is just as simple:

In [18]: obj['b':'c'] = 5

In [19]: obj
Out[19]:
a    0
b    5
c    5
d    3
dtype: float64
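To make the inclusive endpoint concrete, here is a short sketch that compares a label slice with a positional slice and then repeats the assignment on a copy. The copy (updated) is an illustrative choice so that the original Series stays intact.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: 'c' is included
by_position = obj.iloc[1:2]      # positional slice: position 2 is excluded

print(list(by_label.index))      # ['b', 'c']
print(list(by_position.index))   # ['b']

# Assignment through a label slice covers both endpoints.
updated = obj.copy()
updated.loc['b':'c'] = 5
print(updated.tolist())          # [0.0, 5.0, 5.0, 3.0]
```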

As you have seen, indexing into a DataFrame retrieves one or more columns:

In [20]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   ....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

In [21]: data
Out[21]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]

In [22]: data['two']
Out[22]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [23]: data[['three', 'one']]
Out[23]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12
[4 rows x 2 columns]

This style of indexing has a few special cases. The first is selecting rows with a slice or a boolean array:

In [24]: data[:2]
Out[24]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
[2 rows x 4 columns]

In [25]: data[data['three'] > 5]
Out[25]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[3 rows x 4 columns]
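A brief sketch of these two row-selecting special cases next to their explicit equivalents: the slice corresponds to iloc, and the boolean array to loc. The name mask is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A slice inside data[...] selects rows, like iloc.
print(data[:2].equals(data.iloc[:2]))      # True

# A boolean Series inside data[...] also selects rows, like loc.
mask = data['three'] > 5
print(data[mask].equals(data.loc[mask]))   # True
print(list(data[mask].index))              # ['Colorado', 'Utah', 'New York']
```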

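Some readers may find this inconsistent, but the syntax grew out of practical use. Another use case is indexing with a boolean DataFrame, such as the one below produced by a scalar comparison: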

In [26]: data < 5
Out[26]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
[4 rows x 4 columns]

In [27]: data[data < 5] = 0

In [28]: data
Out[28]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
[4 rows x 4 columns]

Note:

The purpose of this behavior is to make DataFrame syntactically more like an ndarray.
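The assignment above modifies data in place. As a hedged alternative that is not part of the original text, DataFrame.mask returns a new object with the cells where the condition holds replaced, leaving the source unchanged. The name zeroed is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Replace every value below 5 with 0 in a new DataFrame.
zeroed = data.mask(data < 5, 0)

print(zeroed.loc['Ohio'].tolist())      # [0, 0, 0, 0]
print(zeroed.loc['Colorado'].tolist())  # [0, 5, 6, 7]
print(data.loc['Ohio'].tolist())        # [0, 1, 2, 3], original untouched
```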

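For label indexing on the rows of a DataFrame, I introduced the special indexing field ix. It lets you select a subset of rows and columns from a DataFrame using NumPy-like notation together with axis labels. (ix was deprecated in pandas 0.20 and removed in pandas 1.0 in favor of loc and iloc; the transcript below reflects the older release.) As mentioned earlier, this is also a simple way to do reindexing: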

In [29]: data.ix['Colorado', ['two', 'three']]
Out[29]:
two      5
three    6
Name: Colorado, dtype: int32

In [30]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[30]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9
[2 rows x 3 columns]

In [31]: data.ix[2]
Out[31]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [32]: data.ix[:'Utah', 'two']
Out[32]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [33]: data.ix[data.three > 5, :3]
Out[33]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
[3 rows x 3 columns]

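Note:

When designing pandas, I felt that having to type frame[:, col] just to select a column was too cumbersome, since column selection is one of the most common operations. So I put all of the label-indexing functionality into ix.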
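Because ix is no longer available in current pandas, the ix calls in the transcript can be approximated with loc (labels and boolean arrays) and iloc (integer positions). This is a sketch of one possible translation, not the author's code; where ix mixed labels with positions, the selection is split into a loc step followed by an iloc step.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0  # same zeroing step as in the transcript

# data.ix['Colorado', ['two', 'three']]: labels only, so loc
row_subset = data.loc['Colorado', ['two', 'three']]

# data.ix[['Colorado', 'Utah'], [3, 0, 1]]: label rows, positional columns
mixed = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]

# data.ix[2]: positional row, so iloc
third_row = data.iloc[2]

# data.ix[:'Utah', 'two']: label slice with inclusive endpoint
up_to_utah = data.loc[:'Utah', 'two']

# data.ix[data.three > 5, :3]: boolean rows, first three columns by position
filtered = data.loc[data['three'] > 5].iloc[:, :3]

print(row_subset.tolist())     # [5, 6]
print(mixed.values.tolist())   # [[7, 0, 5], [11, 8, 9]]
print(third_row.name)          # Utah
print(up_to_utah.tolist())     # [0, 5, 9]
print(list(filtered.index))    # ['Colorado', 'Utah', 'New York']
```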

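3. Arithmetic and data alignment

One of the most important pandas features is that it supports arithmetic between objects with different indexes. When objects are added together, if any index pairs are not the same, the index of the result is the union of the index pairs.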

In [34]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [35]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [36]: s1
Out[36]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [37]: s2
Out[37]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

Adding them together yields:

In [39]: s1 + s2
Out[39]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

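Note:

Automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through arithmetic computations.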
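The alignment rule can be checked directly: the result index is the union of both indexes, and exactly the labels present in only one Series come out as NaN. A minimal sketch, with total as an illustrative name:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2

# The result index is the union of the two indexes.
print(list(total.index))        # ['a', 'c', 'd', 'e', 'f', 'g']

# Labels found in only one Series ('d', 'f', 'g') are NaN.
print(total.isna().tolist())    # [False, False, True, False, True, True]

# dropna keeps only the labels that overlapped.
print(list(total.dropna().index))  # ['a', 'c', 'e']
```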

For a DataFrame, alignment is performed on both the rows and the columns:

In [40]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   ....:                    index=['Ohio', 'Texas', 'Colorado'])

In [41]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [42]: df1
Out[42]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8
[3 rows x 3 columns]

In [43]: df2
Out[43]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
[4 rows x 3 columns]

Adding them together returns a new DataFrame whose index and columns are the unions of those of the two original DataFrames:

In [44]: df1 + df2
Out[44]:
           b   c   d   e
Colorado NaN NaN NaN NaN
Ohio       3 NaN   6 NaN
Oregon   NaN NaN NaN NaN
Texas      9 NaN  12 NaN
Utah     NaN NaN NaN NaN
[5 rows x 4 columns]
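A short sketch that verifies the two-axis alignment: both the row labels and the column labels of the sum are unions, and only cells whose row and column exist in both frames hold a number. The name total is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

total = df1 + df2

print(list(total.columns))  # ['b', 'c', 'd', 'e']
print(list(total.index))    # ['Colorado', 'Ohio', 'Oregon', 'Texas', 'Utah']

# Only rows Ohio/Texas and columns b/d are shared, so 4 cells are non-NaN.
print(int(total.notna().sum().sum()))                   # 4
print(total.loc['Ohio', 'b'], total.loc['Texas', 'd'])  # 3.0 12.0
```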

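4. Arithmetic methods with fill values

In arithmetic between differently indexed objects, you may want to fill in a special value, such as 0, when an axis label is found in one object but not in the other: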

In [45]: df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [46]: df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [47]: df1
Out[47]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
[3 rows x 4 columns]

In [48]: df2
Out[48]:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
[4 rows x 5 columns]

Adding them together produces NA values in the locations that do not overlap:

In [49]: df1 + df2
Out[49]:
    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN
[4 rows x 5 columns]

Using the add method on df1, pass df2 along with a fill_value argument:

In [50]: df1.add(df2, fill_value=0)
Out[50]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
[4 rows x 5 columns]
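The effect of fill_value can be read off the cells that exist only in df2: with a fill of 0 they simply equal df2's own values. The sketch below also applies the same keyword to sub, one of the sibling arithmetic methods; summed and difference are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain + leaves NaN wherever df1 has no matching cell (row 3, column 'e').
print(int((df1 + df2).isna().sum().sum()))  # 8

# With fill_value=0 the missing df1 cells are treated as 0 before adding.
summed = df1.add(df2, fill_value=0)
print(summed.loc[3].tolist())   # [15.0, 16.0, 17.0, 18.0, 19.0]
print(summed['e'].tolist())     # [4.0, 9.0, 14.0, 19.0]

# The same keyword works for subtraction: 0 - 15 at row 3, column 'a'.
difference = df1.sub(df2, fill_value=0)
print(difference.loc[3, 'a'])   # -15.0
```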

Similarly, when reindexing a Series or DataFrame, you can also specify a fill value:

In [51]: df1.reindex(columns=df2.columns, fill_value=0)
Out[51]:
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0
[3 rows x 5 columns]

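5. Operations between DataFrame and Series

As with NumPy arrays, arithmetic between a DataFrame and a Series is well defined. First, as a motivating example, consider the difference between a two-dimensional array and one of its rows: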

In [52]: arr = np.arange(12.).reshape((3, 4))

In [53]: arr
Out[53]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [54]: arr[0]
Out[54]: array([ 0.,  1.,  2.,  3.])

In [55]: arr - arr[0]
Out[55]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is called broadcasting. Operations between a DataFrame and a Series work in much the same way:

In [56]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [57]: series = frame.ix[0]

In [58]: frame
Out[58]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
[4 rows x 3 columns]

In [59]: series
Out[59]:
b    0
d    1
e    2
Name: Utah, dtype: float64

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:

In [60]: frame - series
Out[60]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9
[4 rows x 3 columns]
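The sketch below compares the pandas result with plain NumPy broadcasting and then shows that the match is made by label rather than by position. It uses iloc[0] in place of the older ix[0] to pick the first row; shuffled is an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]  # first row, labelled b, d, e

diff = frame - series

# Same numbers as NumPy broadcasting of a row across a 2-D array.
print(np.array_equal(diff.to_numpy(),
                     frame.to_numpy() - series.to_numpy()))  # True

# Reordering the Series does not change the result: labels are matched.
shuffled = series[['e', 'b', 'd']]
print((frame - shuffled)['e'].tolist())  # [0.0, 3.0, 6.0, 9.0]
print(diff['e'].tolist())                # [0.0, 3.0, 6.0, 9.0]
```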

If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union:

In [61]: series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [62]: frame + series
Out[62]:
        b   d   e
Utah    0   2   4
Ohio    3   5   7
Texas   6   8  10
Oregon  9  11  13
[4 rows x 3 columns]

In [63]: frame + series2
Out[63]:
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN
[4 rows x 4 columns]

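If you instead want to match on the rows and broadcast over the columns, you have to use one of the arithmetic methods. For example: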

In [64]: series3 = frame['d']

In [65]: frame
Out[65]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
[4 rows x 3 columns]

In [66]: series3
Out[66]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [67]: frame.sub(series3, axis=0)
Out[67]:
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1
[4 rows x 3 columns]

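The axis number you pass is the axis to match on. In this example, the goal is to match on the DataFrame's row index and broadcast across the columns.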
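To see why the axis argument matters, the sketch below first subtracts the column with the plain operator, which tries to match row labels against column names and finds no overlap, and then repeats the subtraction with axis=0. The names column_d, misaligned, and matched are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
column_d = frame['d']

# Default behavior: the Series labels (Utah, Ohio, ...) are matched against
# the column names (b, d, e). Nothing overlaps, so every cell is NaN.
misaligned = frame - column_d
print(misaligned.shape)                      # (4, 7)
print(bool(misaligned.isna().all().all()))   # True

# axis=0 matches on the row index and broadcasts across the columns.
matched = frame.sub(column_d, axis=0)
print(matched.loc['Oregon'].tolist())        # [-1.0, 0.0, 1.0]

# axis='index' is an equivalent spelling of axis=0.
print(matched.equals(frame.sub(column_d, axis='index')))  # True
```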

6. Function application and mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [68]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [69]: frame
Out[69]:
               b         d         e
Utah   -1.477719 -1.530953 -0.913435
Ohio    0.285921  0.337583 -0.114854
Texas   0.977180  0.803043  1.179746
Oregon  1.121824  1.111941 -1.532408
[4 rows x 3 columns]

In [70]: np.abs(frame)
Out[70]:
               b         d         e
Utah    1.477719  1.530953  0.913435
Ohio    0.285921  0.337583  0.114854
Texas   0.977180  0.803043  1.179746
Oregon  1.121824  1.111941  1.532408
[4 rows x 3 columns]

Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. DataFrame's apply method does exactly this:

In [71]: f = lambda x: x.max() - x.min()

In [72]: frame.apply(f)
Out[72]:
b    2.599543
d    2.642894
e    2.712154
dtype: float64

In [73]: frame.apply(f, axis=1)
Out[73]:
Utah      0.617518
Ohio      0.452436
Texas     0.376704
Oregon    2.654232
dtype: float64
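Because the transcript uses random data, its numbers cannot be reproduced exactly. The following sketch repeats the same apply calls on a deterministic frame (an assumption made here for reproducibility) and compares the column-wise result with the built-in max and min methods. The name spread is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

spread = lambda x: x.max() - x.min()

# Default axis=0: the function receives each column as a Series.
by_column = frame.apply(spread)
# axis=1: the function receives each row as a Series.
by_row = frame.apply(spread, axis=1)

print(by_column.tolist())  # [9.0, 9.0, 9.0]
print(by_row.tolist())     # [2.0, 2.0, 2.0, 2.0]

# The same column-wise result from built-in reductions, without apply.
print(by_column.equals(frame.max() - frame.min()))  # True
```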

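Many of the most common array statistics (such as sum and mean) are implemented as DataFrame methods, so apply is not needed for them. The function passed to apply does not have to return a scalar; it can also return a Series with multiple values: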

In [75]: def f(x):
   ....:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
   ....:

In [76]: frame.apply(f)
Out[76]:
            b         d         e
min -1.477719 -1.530953 -1.532408
max  1.121824  1.111941  1.179746
[2 rows x 3 columns]

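Element-wise Python functions can be used as well. Suppose you want a formatted string for each floating-point value in frame. You can do this with applymap (deprecated in pandas 2.1 in favor of DataFrame.map; the transcript reflects the older release):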

In [77]: format = lambda x: '%.2f' % x

In [78]: frame.applymap(format)
Out[78]:
            b      d      e
Utah    -1.48  -1.53  -0.91
Ohio     0.29   0.34  -0.11
Texas    0.98   0.80   1.18
Oregon   1.12   1.11  -1.53
[4 rows x 3 columns]

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [79]: frame['e'].map(format)
Out[79]:
Utah      -0.91
Ohio      -0.11
Texas      1.18
Oregon    -1.53
Name: e, dtype: object
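Since the element-wise DataFrame method has been renamed across pandas releases, one version-independent way to get the same effect is to combine the two tools shown here: apply over the columns, and Series.map within each column. This is a sketch on a small hand-made frame (the values are illustrative, not from the original), and fmt is used instead of format to avoid shadowing the Python built-in.

```python
import pandas as pd

frame = pd.DataFrame([[1.234, -5.678], [0.5, 2.0]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

fmt = lambda x: '%.2f' % x

# Series.map applies fmt to every element of one column.
column_b = frame['b'].map(fmt)
print(column_b.tolist())                # ['1.23', '0.50']

# apply hands each column to the lambda, which maps fmt over its elements.
formatted = frame.apply(lambda column: column.map(fmt))
print(formatted.loc['Utah'].tolist())   # ['1.23', '-5.68']
print(formatted.loc['Ohio'].tolist())   # ['0.50', '2.00']
```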
