Pandas Chinese Documentation ~ Basic Usage 1

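呆鸟 says: "While learning data analysis with Python, 呆鸟 found that reading the official docs directly is simply the best way to go: they are comprehensive, rich, and detailed. Nothing is more central to Python data analysis than pandas, so I wanted to translate the pandas docs, so I found the pypandas.cn project, so I joined the pandas Chinese documentation translation group, so I ran out of time to update this account, so I got lazy and decided to post the pandas docs I translated and proofread as articles here, so from now on everyone can read them here."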

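This section covers the basic usage of pandas data structures. The code below creates the example objects; it assumes numpy and pandas have already been imported as np and pd: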

In [1]: index = pd.date_range('1/1/2000', periods=8)

In [2]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [3]: df = pd.DataFrame(np.random.randn(8, 3), index=index,
   ...:                   columns=['A', 'B', 'C'])
   ...:

Head and Tail

head() and tail() give a quick preview of a Series or DataFrame. They show 5 rows by default; pass a number to show a different count.

In [4]: long_series = pd.Series(np.random.randn(1000))

In [5]: long_series.head()

Out[5]:

0 -1.157892

1 -1.344312

2 0.844885

3 1.075770

4 -0.109050

dtype: float64

In [6]: long_series.tail(3)

Out[6]:

997 -0.289388

998 -1.020544

999 0.589993

dtype: float64

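Attributes and Underlying Data

pandas objects expose their metadata through several attributes:

shape: the axis dimensions of the object, consistent with ndarray

Axis labels:

Series: index (the only axis)

DataFrame: index (rows) and columns

Note that these attributes can be safely assigned to!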

In [7]: df[:2]

Out[7]:

A B C

2000-01-01 -0.173215 0.119209 -1.044236

2000-01-02 -0.861849 -2.104569 -0.494929

In [8]: df.columns = [x.lower() for x in df.columns]

In [9]: df

Out[9]:

a b c

2000-01-01 -0.173215 0.119209 -1.044236

2000-01-02 -0.861849 -2.104569 -0.494929

2000-01-03 1.071804 0.721555 -0.706771

2000-01-04 -1.039575 0.271860 -0.424972

2000-01-05 0.567020 0.276232 -1.087401

2000-01-06 -0.673690 0.113648 -1.478427

2000-01-07 0.524988 0.404705 0.577046

2000-01-08 -1.715002 -1.039268 -0.370647
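The attribute access and label assignment described above can be tried on a small frame with fixed values. This is only a sketch: the name `frame` and its contents are made up for illustration and are not part of the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with fixed values, for illustration only.
frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['A', 'B'])

print(frame.shape)          # axis dimensions, as with ndarray
print(list(frame.columns))  # column labels before renaming

# Axis labels can be replaced by plain assignment.
frame.columns = [name.lower() for name in frame.columns]
print(list(frame.columns))
```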

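pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas and third-party libraries may extend NumPy's type system to add support for custom arrays (see Data Types).

To get the actual data inside an Index or Series, use the .array property.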

In [10]: s.array

Out[10]:

[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,

-1.1356323710171934, 1.2121120250208506]

Length: 5, dtype: float64

In [11]: s.index.array

Out[11]:

['a', 'b', 'c', 'd', 'e']

Length: 5, dtype: object

array is always an ExtensionArray. What an ExtensionArray is, and why pandas uses them, is beyond the scope of this section. See Data Types for more.
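A quick way to see what .array hands back is to check it against the public ExtensionArray base class. The Series below uses made-up values, and the concrete array class that gets printed may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data, for illustration only.
s = pd.Series([0.5, -0.25, 1.0], index=['a', 'b', 'c'])

values = s.array        # array wrapper around the Series values
labels = s.index.array  # the same for the Index

print(type(values).__name__, values.dtype)
print(isinstance(values, pd.api.extensions.ExtensionArray))
print(isinstance(labels, pd.api.extensions.ExtensionArray))
```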

To get a NumPy array instead, use to_numpy() or numpy.asarray().

In [12]: s.to_numpy()

Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

In [13]: np.asarray(s)

Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])

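When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing values. See Data Types for details.

to_numpy() gives some control over the dtype of the resulting numpy.ndarray. Take datetimes with timezones as an example: NumPy has no dtype for timezone-aware datetimes, so pandas offers two possible representations:

An object-dtype numpy.ndarray of Timestamp objects, each carrying the correct tz.

A datetime64[ns] numpy.ndarray, in which the values have been converted to UTC and the timezone discarded.

Timezones can be preserved with dtype=object.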

In [14]: ser = pd.Series(pd.date_range('2000', periods=2, tz='CET'))

In [15]: ser.to_numpy(dtype=object)

Out[15]:

array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),

Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],

dtype=object)

Or thrown away with dtype='datetime64[ns]'.

In [16]: ser.to_numpy(dtype='datetime64[ns]')

Out[16]:

array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],

dtype='datetime64[ns]')
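The two representations can be compared side by side. The sketch below rebuilds a two-element CET series like the one in the session; the variable names `as_objects` and `as_utc` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range('2000-01-01', periods=2, tz='CET'))

as_objects = ser.to_numpy(dtype=object)        # Timestamp objects, tz kept
as_utc = ser.to_numpy(dtype='datetime64[ns]')  # naive values, converted to UTC

print(as_objects[0].tz)  # the timezone survives on each Timestamp
print(as_utc[0])         # midnight CET is 23:00 UTC on the previous day
```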

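Getting the raw data inside a DataFrame is slightly more complex. When all columns of the DataFrame share a single data type, DataFrame.to_numpy() returns the underlying data: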

In [17]: df.to_numpy()

Out[17]:

array([[-0.1732, 0.1192, -1.0442],

[-0.8618, -2.1046, -0.4949],

[ 1.0718, 0.7216, -0.7068],

[-1.0396, 0.2719, -0.425 ],

[ 0.567 , 0.2762, -1.0874],

[-0.6737, 0.1136, -1.4784],

[ 0.525 , 0.4047, 0.577 ],

[-1.715 , -1.0393, -0.3706]])

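If a DataFrame holds homogeneously typed data, the ndarray can be modified in place and the changes are reflected in the data structure. For heterogeneous data, that is, when the DataFrame's columns do not all share one dtype, this is not the case. Unlike the axis labels, the values attribute itself cannot be assigned to.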

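::: tip Note

When working with heterogeneous data, the dtype of the resulting ndarray is chosen to accommodate all of the data involved. If the DataFrame contains strings, the result has object dtype. If it contains only floats and integers, the result has a float dtype.

:::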
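The dtype rule in the note can be checked directly. The two frames below are hypothetical and exist only to show which dtype to_numpy() settles on.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, for illustration only.
numeric = pd.DataFrame({'i': [1, 2], 'f': [0.5, 1.5]})
mixed = pd.DataFrame({'i': [1, 2], 's': ['x', 'y']})

print(numeric.to_numpy().dtype)  # integers and floats share a float dtype
print(mixed.to_numpy().dtype)    # a string column forces object dtype
```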

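In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You will still find this in older code bases and online tutorials. pandas has since improved on it: the recommendation now is to use .array or to_numpy() and to stop using .values. The .values attribute has the following drawbacks:

When a Series contains an extension type, it is unclear whether Series.values returns a NumPy array or the extension array. Series.array always returns an ExtensionArray and never copies data. Series.to_numpy() always returns a NumPy array, potentially at the cost of copying and coercing values.

When a DataFrame contains a mixture of data types, DataFrame.values may copy the data and coerce the values to a common dtype, which is a relatively expensive operation. DataFrame.to_numpy(), being a method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.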
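The difference between the two accessors is easiest to see on an extension type. The categorical Series below is a made-up example, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data, for illustration only.
cat = pd.Series(pd.Categorical(['low', 'high', 'low']))

ext = cat.array       # the Categorical itself
arr = cat.to_numpy()  # a plain NumPy array of the category values

print(type(ext).__name__)
print(type(arr).__name__, arr.dtype)
```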

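Accelerated Operations

With the numexpr and bottleneck libraries, pandas can accelerate certain types of binary numerical and boolean operations.

These libraries are especially useful on large data sets, where the speedups are substantial. numexpr uses smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that are especially fast on arrays containing NaNs.

Consider the following example (DataFrames with 100 columns x 100,000 rows):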

Operation    0.11.0 (ms)    Prior version (ms)    Ratio (0.11.0 / prior)
df1 > df2    13.32          125.35                0.1063
df1 * df2    21.71          36.63                 0.5928
df1 + df2    22.04          36.50                 0.6039

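Installing both libraries is strongly recommended. See Recommended Dependencies for more information.

Both are enabled by default and can be controlled with the following options:

New in version 0.20.0.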

pd.set_option('compute.use_bottleneck', False)

pd.set_option('compute.use_numexpr', False)
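If the setting should only change for part of a program, one possible approach is pd.option_context, which puts the previous value back when the block ends. The variable names below are illustrative.

```python
import pandas as pd

before = pd.get_option('compute.use_numexpr')

# option_context restores the previous value when the block exits.
with pd.option_context('compute.use_numexpr', False):
    inside = pd.get_option('compute.use_numexpr')

after = pd.get_option('compute.use_numexpr')
print(before, inside, after)
```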

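Binary Operations

For binary operations between pandas data structures, there are two key points of interest:

Broadcasting behavior between higher-dimensional (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.

Missing data in computations.

These issues can be handled simultaneously, but they are first shown independently below.

Matching / Broadcasting Behavior

DataFrame has the methods add(), sub(), mul(), div() and the related radd(), rsub(), and so on for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. With these methods, the axis keyword selects whether the Series is matched on the index or on the columns.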

In [18]: df = pd.DataFrame({
   ....:     'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ....:     'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ....:     'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ....:

In [19]: df

Out[19]:

one two three

a 1.394981 1.772517 NaN

b 0.343054 1.912123 -0.050390

c 0.695246 1.478369 1.227435

d NaN 0.279344 -0.613172

In [20]: row = df.iloc[1]

In [21]: column = df['two']

In [22]: df.sub(row, axis='columns')

Out[22]:

one two three

a 1.051928 -0.139606 NaN

b 0.000000 0.000000 0.000000

c 0.352192 -0.433754 1.277825

d NaN -1.632779 -0.562782

In [23]: df.sub(row, axis=1)

Out[23]:

one two three

a 1.051928 -0.139606 NaN

b 0.000000 0.000000 0.000000

c 0.352192 -0.433754 1.277825

d NaN -1.632779 -0.562782

In [24]: df.sub(column, axis='index')

Out[24]:

one two three

a -0.377535 0.0 NaN

b -1.569069 0.0 -1.962513

c -0.783123 0.0 -0.250933

d NaN 0.0 -0.892516

In [25]: df.sub(column, axis=0)

Out[25]:

one two three

a -0.377535 0.0 NaN

b -1.569069 0.0 -1.962513

c -0.783123 0.0 -0.250933

d NaN 0.0 -0.892516
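Because the session above uses random numbers, the effect of axis is easier to verify on fixed values. The sketch below reuses the names df, row, and column, but the numbers are invented for illustration.

```python
import pandas as pd

# Fixed, made-up values so the arithmetic can be checked by hand.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [10.0, 20.0, 30.0]},
                  index=['a', 'b', 'c'])

row = df.iloc[1]    # Series labelled by column name
column = df['two']  # Series labelled by row index

# Match the Series labels against the columns:
# row b is subtracted from every row.
by_columns = df.sub(row, axis='columns')

# Match the Series labels against the index:
# column 'two' is subtracted from every column.
by_index = df.sub(column, axis='index')

print(by_columns)
print(by_index)
```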

Furthermore, a Series can be aligned with one level of a MultiIndexed DataFrame.

In [26]: dfmi = df.copy()

In [27]: dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),
   ....:                                         (1, 'c'), (2, 'a')],
   ....:                                        names=['first', 'second'])
   ....:

In [28]: dfmi.sub(column, axis=0, level='second')

Out[28]:

one two three

first second

1 a -0.377535 0.000000 NaN

b -1.569069 0.000000 -1.962513

c -0.783123 0.000000 -0.250933

2 a NaN -1.493173 -2.385688
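The level alignment can also be checked on fixed numbers. In this sketch the one-column frame and the Series `col` are hypothetical; the Series is matched against the 'second' level only, so both rows labelled 'a' have the same value subtracted.

```python
import pandas as pd

# Hypothetical data, for illustration only.
index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a')],
                                  names=['first', 'second'])
dfmi = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=index)
col = pd.Series({'a': 10.0, 'b': 20.0})

# Align col with the 'second' level, then subtract row-wise.
out = dfmi.sub(col, axis=0, level='second')
print(out)
```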

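Series and Index also support the divmod() builtin. It performs floor division and modulo at the same time and returns a two-tuple whose elements have the same type as the left-hand side. For example: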

In [29]: s = pd.Series(np.arange(10))

In [30]: s

Out[30]:

0 0

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

dtype: int64

In [31]: div, rem = divmod(s, 3)

In [32]: div

Out[32]:

0 0

1 0

2 0

3 1

4 1

5 1

6 2

7 2

8 2

9 3

dtype: int64

In [33]: rem

Out[33]:

0 0

1 1

2 2

3 0

4 1

5 2

6 0

7 1

8 2

9 0

dtype: int64

In [34]: idx = pd.Index(np.arange(10))

In [35]: idx

Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [36]: div, rem = divmod(idx, 3)

In [37]: div

Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')

In [38]: rem

Out[38]: Int64Index([0, 1, 2, 0, 1, 2, 0, 1, 2, 0], dtype='int64')

divmod() also works element-wise:

In [39]: div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [40]: div

Out[40]:

0 0

1 0

2 0

3 1

4 1

5 1

6 1

7 1

8 1

9 1

dtype: int64

In [41]: rem

Out[41]:

0 0

1 1

2 2

3 0

4 0

5 1

6 1

7 2

8 2

9 3

dtype: int64
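One way to sanity-check a divmod() result is the identity quotient * divisor + remainder == original. A minimal sketch with a short, made-up Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5, 6])

quotient, remainder = divmod(s, 3)

# divmod pairs floor division with modulo,
# so this identity holds element-wise.
print((quotient * 3 + remainder).equals(s))
```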

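Missing Data / Operations with Fill Values

The arithmetic functions of Series and DataFrame accept a fill_value option, which substitutes a value for missing values in one of the inputs. For example, when adding two DataFrame objects you may want to treat NaN as 0 unless both DataFrames are missing the value at that position, in which case the result is still NaN. You can later replace the remaining NaN with some other value using fillna. In the session below, judging from its printed output, df2 matches df except that the entry at row a, column three is 1.0 instead of NaN.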

In [42]: df

Out[42]:

one two three

a 1.394981 1.772517 NaN

b 0.343054 1.912123 -0.050390

c 0.695246 1.478369 1.227435

d NaN 0.279344 -0.613172

In [43]: df2

Out[43]:

one two three

a 1.394981 1.772517 1.000000

b 0.343054 1.912123 -0.050390

c 0.695246 1.478369 1.227435

d NaN 0.279344 -0.613172

In [44]: df + df2

Out[44]:

one two three

a 2.789963 3.545034 NaN

b 0.686107 3.824246 -0.100780

c 1.390491 2.956737 2.454870

d NaN 0.558688 -1.226343

In [45]: df.add(df2, fill_value=0)

Out[45]:

one two three

a 2.789963 3.545034 1.000000

b 0.686107 3.824246 -0.100780

c 1.390491 2.956737 2.454870

d NaN 0.558688 -1.226343
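The three cases (both present, one missing, both missing) fit in a single column. The frames `left` and `right` below are hypothetical and only serve to show each case once.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: row 0 both present, row 1 one missing, row 2 both missing.
left = pd.DataFrame({'x': [1.0, np.nan, np.nan]})
right = pd.DataFrame({'x': [10.0, 20.0, np.nan]})

plain = left + right                    # NaN wherever either side is missing
filled = left.add(right, fill_value=0)  # NaN only where both sides are missing

print(plain['x'].tolist())
print(filled['x'].tolist())
```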

Comparison Operations

As with the arithmetic operations in the previous section, Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge:

No.  Method  Meaning
1    eq      equal to
2    ne      not equal to
3    lt      less than
4    gt      greater than
5    le      less than or equal to
6    ge      greater than or equal to

In [46]: df.gt(df2)

Out[46]:

one two three

a False False False

b False False False

c False False False

d False False False

In [47]: df2.ne(df)

Out[47]:

one two three

a False False True

b False False False

c False False False

d True False False

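These operations produce a pandas object of the same type as the left-hand-side input, with dtype bool. Such boolean objects can be used for indexing; see the Boolean Indexing section.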
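As a small illustration with made-up values, the sketch below compares two frames and then uses one boolean column as a row filter. Note how ne() reports True where both sides are NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, for illustration only.
left = pd.DataFrame({'x': [1.0, 5.0, np.nan]})
right = pd.DataFrame({'x': [2.0, 3.0, np.nan]})

greater = left.gt(right)  # element-wise left > right
differs = left.ne(right)  # NaN never equals NaN, so the last row is True

# A boolean Series works as a row filter.
print(left[greater['x']])
```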

Boolean Reductions

The reductions empty, any(), all(), and bool() summarize a boolean result, down to a single boolean value if needed.

In [48]: (df > 0).all()

Out[48]:

one False

two True

three False

dtype: bool

In [49]: (df > 0).any()

Out[49]:

one True

two True

three True

dtype: bool

The result above can be reduced further to a single boolean value.

In [50]: (df > 0).any().any()

Out[50]: True

The empty property tests whether a pandas object is empty.

In [51]: df.empty

Out[51]: False

In [52]: pd.DataFrame(columns=list('ABC')).empty

Out[52]: True

To evaluate a single-element pandas object in a boolean context, use the bool() method.

In [53]: pd.Series([True]).bool()

Out[53]: True

In [54]: pd.Series([False]).bool()

Out[54]: False

In [55]: pd.DataFrame([[True]]).bool()

Out[55]: True

In [56]: pd.DataFrame([[False]]).bool()

Out[56]: False

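::: danger Warning

You might be tempted to write:

>>> if df:
...     pass

or

>>> df and df2

Both try to take the truth value of an object that holds multiple values, so both raise an error:

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

:::

See the Gotchas section for a more detailed discussion.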
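The reductions and the ambiguity error can be reproduced on a tiny frame. The data is invented for illustration; the try/except only demonstrates that asking for the truth value of a whole DataFrame raises ValueError.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
df = pd.DataFrame({'x': [1, -2], 'y': [3, 4]})
positive = df > 0

print(positive.all().tolist())  # per column: is every value positive?
print(positive.any().any())     # is any value positive anywhere?
print(df.empty)

# Asking for the truth value of a whole DataFrame is ambiguous and raises.
try:
    bool(df)
    raised = False
except ValueError:
    raised = True
print(raised)
```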

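Comparing If Objects Are Equivalent

Often the same result can be computed in more than one way. Take df + df and df * 2 as an example. Using the tools from the previous section, you might test whether the two results are equal with (df + df == df * 2).all(). In fact, that expression comes out False for some columns: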

In [57]: df + df == df * 2

Out[57]:

one two three

a True True False

b True True True

c True True True

d False True True

In [58]: (df + df == df * 2).all()

Out[58]:

one False

two True

three False

dtype: bool

Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do not compare as equal:

In [59]: np.nan == np.nan

Out[59]: False

So NDFrames such as Series and DataFrame provide an equals() method for testing equality, which treats NaNs in corresponding locations as equal.

In [60]: (df + df).equals(df * 2)

Out[60]: True

Note that the Series or DataFrame index must be in the same order for equals() to return True:

In [61]: df1 = pd.DataFrame({'col': ['foo', 0, np.nan]})

In [62]: df2 = pd.DataFrame({'col': [np.nan, 0, 'foo']}, index=[2, 1, 0])

In [63]: df1.equals(df2)

Out[63]: False

In [64]: df1.equals(df2.sort_index())

Out[64]: True
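The contrast between == and equals() shows up as soon as one NaN is present. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value.
df = pd.DataFrame({'x': [1.0, np.nan], 'y': [2.0, 3.0]})

elementwise = (df + df == df * 2).all().all()  # False: NaN == NaN is False
same = (df + df).equals(df * 2)                # True: equals() matches NaN with NaN

print(elementwise, same)
```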

Comparing Array-like Objects

Element-wise comparison of a pandas data structure with a scalar value is straightforward:

In [65]: pd.Series(['foo', 'bar', 'baz']) == 'foo'

Out[65]:

0 True

1 False

2 False

dtype: bool

In [66]: pd.Index(['foo', 'bar', 'baz']) == 'foo'

Out[66]: array([ True, False, False])

pandas also handles element-wise comparison between different array-like objects of the same length:

In [67]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])

Out[67]:

0 True

1 True

2 False

dtype: bool

In [68]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])

Out[68]:

0 True

1 True

2 False

dtype: bool

Trying to compare Index or Series objects of different lengths raises a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])

ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])

ValueError: Series lengths must match to compare

Note that this differs from NumPy, where a comparison can be broadcast:

In [69]: np.array([1, 2, 3]) == np.array([2])

Out[69]: array([False, True, False])

When the shapes cannot be broadcast, the NumPy version used in this session returns False:

In [70]: np.array([1, 2, 3]) == np.array([1, 2])

Out[70]: False
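The pandas side of these comparisons can be collected in one short script. The sketch catches the ValueError rather than relying on its exact message, since the wording may differ between pandas versions.

```python
import numpy as np
import pandas as pd

s = pd.Series(['foo', 'bar', 'baz'])

vs_scalar = s == 'foo'
vs_array = s == np.array(['foo', 'bar', 'qux'])

# Series of different lengths are not broadcast; the comparison raises.
try:
    s == pd.Series(['foo', 'bar'])
    raised = False
except ValueError:
    raised = True

print(vs_scalar.tolist(), vs_array.tolist(), raised)
```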

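Combining Overlapping Data Sets

A problem that comes up occasionally is combining two similar data sets where values in one are preferred over the other. An example is two data series representing a particular economic indicator, one considered "higher quality" and the other "lower quality". The lower-quality series might extend further back in history or have more complete data coverage. To combine the two DataFrame objects, missing values in one DataFrame are conditionally filled with like-labeled values from the other. The function implementing this operation is combine_first(), shown below.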

In [71]: df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
   ....:                     'B': [np.nan, 2., 3., np.nan, 6.]})
   ....:

In [72]: df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B': [np.nan, np.nan, 3., 4., 6., 8.]})
   ....:

In [73]: df1

Out[73]:

A B

0 1.0 NaN

1 NaN 2.0

2 3.0 3.0

3 5.0 NaN

4 NaN 6.0

In [74]: df2

Out[74]:

A B

0 5.0 NaN

1 2.0 NaN

2 4.0 3.0

3 NaN 4.0

4 3.0 6.0

5 7.0 8.0

In [75]: df1.combine_first(df2)

Out[75]:

A B

0 1.0 NaN

1 2.0 2.0

2 3.0 3.0

3 5.0 4.0

4 3.0 6.0

5 7.0 8.0

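General DataFrame Combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrame, and then passes the combiner function pairs of Series (that is, columns whose names are the same).

So, for instance, the combiner function below reproduces combine_first():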

In [76]: def combiner(x, y):
   ....:     return np.where(pd.isna(x), y, x)
   ....:
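The text stops after defining combiner. Presumably the next step is to pass it to DataFrame.combine(); the sketch below does that with the df1 and df2 defined earlier in this section. The name `combined` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                    'B': [np.nan, np.nan, 3., 4., 6., 8.]})


def combiner(x, y):
    # Keep x where it has a value, otherwise fall back to y.
    return np.where(pd.isna(x), y, x)


# combine() aligns the two frames, then calls combiner on each column pair.
combined = df1.combine(df2, combiner)
print(combined)
```

The printed frame can be compared with the combine_first() output shown earlier.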
