pandas Usage Notes (Part 1)

1. How do you select rows from a DataFrame based on the values in some column in pandas?
In SQL I would use:

select * from table where column_name = some_value
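One common way to express this in pandas is boolean indexing. The sketch below is only an illustration: the DataFrame, its column names, and some_value are made up, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration.
df = pd.DataFrame({"column_name": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
some_value = "a"

# A boolean mask keeps only the rows where the condition is True.
selected = df[df["column_name"] == some_value]

# .loc with the same mask also lets you pick columns at the same time.
selected_loc = df.loc[df["column_name"] == some_value, ["value"]]

print(selected)
```

The mask form reads closest to the SQL where clause; .loc is handy when only some columns are needed.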

2. I already have a DataFrame in the following format:

YEARMONTH  CLIENTCODE  SIZE
201201     500         1
203103     123         2

Now I want to do the equivalent of this SQL:

Select count(distinct CLIENTCODE) from table group by YEARMONTH

Solution 1:

In [2]: table
Out[2]:
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301       2
201302       3

Solution 2:

len(unique()) also works, and is reportedly 3 to 15 times faster than nunique(), though I have not fully worked out why [1].
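A minimal sketch of both variants on the small table from Solution 1. It makes no claim about speed; it only shows that the two give the same counts here. One difference worth knowing: nunique() ignores missing values by default, while len(unique()) counts NaN as a distinct value.

```python
import pandas as pd

# Same data as the small example table in Solution 1.
table = pd.DataFrame({
    "CLIENTCODE": [1, 1, 2, 1, 2, 2, 3],
    "YEARMONTH": [201301, 201301, 201301, 201302, 201302, 201302, 201302],
})

# Solution 1: built-in distinct count per group.
by_nunique = table.groupby("YEARMONTH")["CLIENTCODE"].nunique()

# Solution 2: take the unique values of each group and count them.
by_len_unique = table.groupby("YEARMONTH")["CLIENTCODE"].apply(
    lambda s: len(s.unique())
)

print(by_nunique)
print(by_len_unique)
```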

3. Given a table rpt with the following structure:

rpt
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID                    47518  non-null values
STK_Name                  47518  non-null values
RPT_Date                  47518  non-null values
sales                     47518  non-null values

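To select all records for a single STK_ID:

rpt[rpt['STK_ID']=='600809']

To select all records whose STK_ID appears in a list such as

stk_list = ['600809','600141','600329']

use:

rpt[rpt['STK_ID'].isin(stk_list)]

or

rpt.query('STK_ID in ("600809", "600141", "600329")')

or, if STK_ID is stored as integers rather than strings, a range query:

rpt.query('600000 < STK_ID < 700000')

For pattern matching with a regular expression: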

rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')]
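The filters above can be tried on a small stand-in for rpt. The frame below is hypothetical (a handful of made-up rows with string STK_ID values), and the query variant uses the @ prefix to refer to a local Python variable.

```python
import pandas as pd

# Hypothetical miniature of rpt; STK_ID is stored as strings.
rpt = pd.DataFrame({
    "STK_ID": ["600809", "600141", "000002", "600329", "603366"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})
stk_list = ["600809", "600141", "600329"]

# Exact match on one code.
one = rpt[rpt["STK_ID"] == "600809"]

# Membership in a list, via isin and via query.
in_list = rpt[rpt["STK_ID"].isin(stk_list)]
via_query = rpt.query("STK_ID in @stk_list")

# Regex match: codes that are exactly "600" followed by three digits.
by_pattern = rpt[rpt["STK_ID"].str.contains(r"^600[0-9]{3}$")]

print(by_pattern)
```

Note that "603366" is not matched by the pattern, because it starts with 603 rather than 600.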

4. To get the number of rows and columns of a DataFrame, use dataframe.shape[0] and dataframe.shape[1].

The row count is also available as len(DataFrame.index).
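A quick check on a made-up three-row frame, just to show that the two ways of counting rows agree:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# len of the index gives the row count as well.
n_rows_from_index = len(df.index)

print(n_rows, n_cols, n_rows_from_index)
```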

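5. Removing duplicates from a DataFrame:

The usual commands are drop and drop_duplicates. Both return a new DataFrame unless you pass inplace=True, in which case the original DataFrame is modified in place.

Suppose df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])

To drop the rows whose value in column b is a duplicate, use df = df.drop_duplicates('b')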
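A small sketch of how drop_duplicates behaves on one column. The data is made up; the keep parameter controls which of the duplicated rows survives (the first one by default).

```python
import pandas as pd

# Hypothetical frame; "x" appears twice in column b.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "z"]})

# Keep the first row for each distinct value of b (the default).
deduped = df.drop_duplicates(subset="b")

# Keep the last row for each distinct value of b instead.
deduped_last = df.drop_duplicates(subset="b", keep="last")

# The original frame is untouched because inplace was not set.
print(len(df), len(deduped), len(deduped_last))
```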

6. Commands for cleaning a column of a DataFrame

Remove newline characters: misc['product_desc'] = misc['product_desc'].str.replace('\n', '')

Strip leading and trailing whitespace from strings: df["Make"] = df["Make"].map(str.strip)
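The two cleaning steps can be sketched on made-up data as follows. This version uses the .str.strip() accessor instead of map(str.strip); that is a substitution on my part, chosen because map(str.strip) raises a TypeError when the column contains a missing value, while the .str accessor passes missing values through.

```python
import pandas as pd

# Hypothetical data, including a missing value in "Make".
df = pd.DataFrame({
    "Make": ["  Ford ", "Toyota\n", None],
    "product_desc": ["line one\nline two", "no break", "a\nb"],
})

# Remove newline characters; regex=False treats "\n" as a literal string.
df["product_desc"] = df["product_desc"].str.replace("\n", "", regex=False)

# Strip leading and trailing whitespace; missing values stay missing.
df["Make"] = df["Make"].str.strip()

print(df)
```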

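7. When importing external data into a DataFrame, convert data types as needed, for example by passing converters to the reader (see the read_table call at the end of these notes).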

  
8. Using apply, applymap and map. In short: apply works on a DataFrame and computes over each row or column; applymap works on a DataFrame element by element; map works on a Series element by element.

In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]:
b    1.133201
d    1.965980
e    2.829781
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [122]: frame['e'].map(format)
Out[122]:
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object
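The three methods can be compared side by side on a small frame with fixed values (made up here so the results are easy to verify by hand). One assumption about versions: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map, so the sketch picks whichever one is available.

```python
import pandas as pd

# Hypothetical frame with fixed values instead of random ones.
frame = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]],
    columns=list("bde"),
    index=["Utah", "Ohio"],
)

span = lambda x: x.max() - x.min()
fmt = lambda x: "%.2f" % x

# apply: the function receives a whole column (default) or a whole row (axis=1).
col_range = frame.apply(span)
row_range = frame.apply(span, axis=1)

# Element-wise on a DataFrame: DataFrame.map in newer pandas, applymap in older.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(fmt)

# Element-wise on a Series.
series_formatted = frame["e"].map(fmt)

print(col_range, row_range, formatted, series_formatted, sep="\n\n")
```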


 
9. Merging two (or more) DataFrames on a column:

import numpy as np
import pandas as pd

# Merge multiple dataframes
# Note: np.array with mixed strings and numbers stores every value as a string.
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])

pd.merge(pd.merge(df1,df2,on='name'),df3,on='name')


Alternatively, as mentioned by cwharland:

df1.merge(df2,on='name').merge(df3,on='name')
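When the number of frames is not fixed, the chained merge can be written as a fold over a list. This is a sketch with functools.reduce and made-up numeric data, not something taken from the original answer.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that share the key column "name".
df1 = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24]})
df2 = pd.DataFrame({"name": ["a", "b", "c"], "attr21": [5, 14, 4]})
df3 = pd.DataFrame({"name": ["a", "b", "c"], "attr31": [15, 4, 14]})

# Merge the frames pairwise from left to right on the shared key.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="name"),
    [df1, df2, df3],
)

print(merged)
```

For three frames this is equivalent to df1.merge(df2, on='name').merge(df3, on='name').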

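10. Adding to a set:

Add a single element:

keep.add(onemorevalue)

Add every element of a list or other sequence:

keep.update(yoursequenceofvalues)

Example for item 7, converting column types on import with converters: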

# strip and make_int are user-defined converter functions that are not shown in these notes
table = pd.read_table("data.csv", sep=r',',
                      names=["Year", "Make", "Model", "Description"],
                      converters = {'Description' : strip,
                                    'Model' : strip,
                                    'Make' : strip,
                                    'Year' : make_int})
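The call above depends on two helpers, strip and make_int, whose definitions are not given. The sketch below fills them in with plausible guesses (both are assumptions), reads from an in-memory string instead of data.csv, and uses read_csv, which handles comma-separated input the same way as read_table with sep=','.

```python
import io

import pandas as pd


def strip(text):
    """Assumed helper: remove surrounding whitespace from a raw field."""
    return text.strip()


def make_int(text):
    """Assumed helper: turn a raw field such as ' 1997 ' into an int."""
    return int(text.strip())


# Made-up sample data standing in for data.csv.
raw = io.StringIO(
    "1997, Ford , E350 , ac and abs \n"
    "2000, Mercury , Cougar , moon roof \n"
)

# Each converter receives the raw string of one cell in its column.
table = pd.read_csv(
    raw,
    sep=",",
    names=["Year", "Make", "Model", "Description"],
    converters={
        "Description": strip,
        "Model": strip,
        "Make": strip,
        "Year": make_int,
    },
)

print(table)
```

Converters run per cell while the file is parsed, so the cleaned values arrive in the DataFrame without a separate clean-up pass.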
