1. How do you select rows from a DataFrame based on the values in some column in pandas?
In SQL I would use:
select * from table where column_name = some_value
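The note above poses the question without giving the pandas side. A minimal sketch of one common answer, using a boolean mask with .loc, follows. The DataFrame contents and the names column_name and some_value are made up for illustration.

```python
import pandas as pd

# Illustrative data; the column names and values are assumptions.
df = pd.DataFrame({
    "column_name": ["a", "b", "a"],
    "value": [1, 2, 3],
})
some_value = "a"

# Rough equivalent of: select * from table where column_name = some_value
selected = df.loc[df["column_name"] == some_value]
print(selected)
```

The comparison produces a boolean Series, and .loc keeps only the rows where it is True.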
2. I already have a DataFrame in the following format:
Year-month
201201
203103
Now I want to perform the equivalent of the following SQL:
Select count(distinct CLIENTCODE) from table group by YEARMONTH
Solution 1:
In [2]: table
Out[2]:
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301    2
201302    3
Solution 2:
len(unique()) also works and is reported to be 3 to 15 times faster than nunique(), although I have not worked out exactly why [1].
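The two approaches can be compared side by side. The sketch below rebuilds the small table shown above and counts distinct CLIENTCODE values per YEARMONTH both ways. It only checks that the results agree and makes no claim about relative speed.

```python
import pandas as pd

# Same data as the example table above.
table = pd.DataFrame({
    "CLIENTCODE": [1, 1, 2, 1, 2, 2, 3],
    "YEARMONTH": [201301, 201301, 201301, 201302, 201302, 201302, 201302],
})

# Solution 1: built-in distinct count per group.
by_nunique = table.groupby("YEARMONTH")["CLIENTCODE"].nunique()

# Solution 2: length of the unique values in each group.
by_len_unique = table.groupby("YEARMONTH")["CLIENTCODE"].apply(
    lambda s: len(s.unique())
)

print(by_nunique)
print(by_len_unique)
```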
3. Suppose there is a table rpt with the following contents:
rpt
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID      47518 non-null values
STK_Name    47518 non-null values
RPT_Date    47518 non-null values
sales       47518 non-null values
To select all records for a single STK_ID, use: rpt[rpt['STK_ID'] == '600809']
To select all records whose STK_ID appears in a list such as
stk_list = ['600809','600141','600329'], use:
rpt[rpt['STK_ID'].isin(stk_list)]
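or (the values are quoted because STK_ID holds strings)
rpt.query('STK_ID in ["600809", "600141", "600329"]')
or, if STK_ID is stored as integers rather than strings, a range query over the six-digit codes:
rpt.query('600000 < STK_ID < 700000')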
For pattern (fuzzy) matching, use:
rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')]
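The filters above can be tried on a small stand-in for rpt. The frame below is made up for illustration, with STK_ID as a plain string column rather than part of a MultiIndex. The query form uses @stk_list to reference the Python list, which is one way to avoid writing the values inline.

```python
import pandas as pd

# Small stand-in for rpt; the sales figures are invented.
rpt = pd.DataFrame({
    "STK_ID": ["600809", "600141", "000002", "600329", "603366"],
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0],
})
stk_list = ["600809", "600141", "600329"]

# Exact match on a single code.
one = rpt[rpt["STK_ID"] == "600809"]

# Membership in a list, two equivalent spellings.
by_isin = rpt[rpt["STK_ID"].isin(stk_list)]
by_query = rpt.query("STK_ID in @stk_list")

# Regular-expression match: codes of the form 600xxx.
by_regex = rpt[rpt["STK_ID"].str.contains(r"^600[0-9]{3}$")]

print(by_isin)
print(by_regex)
```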
4. To get the number of rows and columns of a DataFrame, use dataframe.shape[0] and dataframe.shape[1].
The number of rows can also be obtained with len(dataframe.index).
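A quick check of these three expressions on a throwaway frame (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = df.shape      # shape is a (rows, columns) tuple
n_rows_alt = len(df.index)     # alternative way to count rows

print(n_rows, n_cols, n_rows_alt)
```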
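5. Removing duplicates from a DataFrame:
The usual commands are drop and drop_duplicates (drop removes rows or columns by label; drop_duplicates removes duplicate rows). Both return a new DataFrame unless you pass inplace=True, in which case the original DataFrame is modified in place.
Given df = pd.read_csv(data_path, header=0, names=['a', 'b', 'c', 'd', 'e'])
to remove the rows with repeated values in column b, use df = df.drop_duplicates('b')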
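A minimal sketch of both forms, on made-up data instead of a CSV file. By default drop_duplicates keeps the first occurrence of each repeated value.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["x", "y", "x", "z"],
})

# Returns a new DataFrame; df itself is unchanged.
deduped = df.drop_duplicates("b")

# Modifies the frame in place and returns None.
df_inplace = df.copy()
df_inplace.drop_duplicates("b", inplace=True)

print(deduped)
```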
6. Commands for cleaning a column of a DataFrame:
Remove newline characters: misc['product_desc'] = misc['product_desc'].str.replace('\n', '')
Strip leading and trailing whitespace: df["Make"] = df["Make"].map(str.strip)
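The two cleaning steps can be sketched on made-up data as follows. This version uses .str.strip() instead of map(str.strip), a small substitution that leaves missing values as NaN rather than raising, and passes regex=False so the newline is treated as a literal string.

```python
import pandas as pd

# Illustrative data; the values are invented.
misc = pd.DataFrame({"product_desc": ["red\nshirt", "blue\nhat\n"]})
df = pd.DataFrame({"Make": [" Ford ", "Mercury  "]})

# Remove newline characters (literal replacement, not a regex).
misc["product_desc"] = misc["product_desc"].str.replace("\n", "", regex=False)

# Strip leading and trailing whitespace.
df["Make"] = df["Make"].str.strip()

print(misc)
print(df)
```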
7. When importing external data into a DataFrame, convert the data types as needed:
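One way to do this is the converters argument of the pandas readers, which applies a function to each raw string in a column as it is parsed. The sketch below reads from an in-memory string instead of a file. The helpers strip and make_int are hypothetical, written here only for illustration, and the sample rows are invented.

```python
import io
import pandas as pd

def strip(text):
    """Hypothetical helper: trim surrounding whitespace."""
    return text.strip()

def make_int(text):
    """Hypothetical helper: parse a field as an integer."""
    return int(text.strip())

# Invented sample data standing in for a CSV file.
raw = io.StringIO("1997, Ford ,E350 ,moon roof\n2000,Mercury, Cougar, leather\n")

table = pd.read_csv(
    raw,
    names=["Year", "Make", "Model", "Description"],
    converters={
        "Description": strip,
        "Model": strip,
        "Make": strip,
        "Year": make_int,
    },
)
print(table)
```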
8. Using apply, applymap and map. In summary:
apply is used on a DataFrame to compute over rows or columns;
applymap is used on a DataFrame and operates element-wise;
map is used on a Series and operates element-wise.
In [116]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [117]: frame
Out[117]:
               b         d         e
Utah   -0.029638  1.081563  1.280300
Ohio    0.647747  0.831136 -1.549481
Texas   0.513416 -0.884417  0.195343
Oregon -0.485454 -0.477388 -0.309548

In [118]: f = lambda x: x.max() - x.min()

In [119]: frame.apply(f)
Out[119]:
b    1.133201
d    1.965980
e    2.829781
dtype: float64
Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:
In [120]: format = lambda x: '%.2f' % x

In [121]: frame.applymap(format)
Out[121]:
            b      d      e
Utah    -0.03   1.08   1.28
Ohio     0.65   0.83  -1.55
Texas    0.51  -0.88   0.20
Oregon  -0.49  -0.48  -0.31

The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [122]: frame['e'].map(format)
Out[122]:
Utah       1.28
Ohio      -1.55
Texas      0.20
Oregon    -0.31
Name: e, dtype: object
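The three methods can be compared on a small frame with fixed values, so the results are reproducible. Note that pandas 2.1 deprecated DataFrame.applymap in favour of DataFrame.map, so the sketch picks whichever is available. The data and the name fmt are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.0, 2.5, -1.0], [4.0, 0.5, 3.0]],
    columns=list("bde"),
    index=["Utah", "Ohio"],
)

# apply: one result per column (default) or per row (axis=1).
col_range = frame.apply(lambda x: x.max() - x.min())
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# Element-wise on a DataFrame: DataFrame.map on newer pandas,
# applymap on older versions.
fmt = lambda x: "%.2f" % x
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
formatted = elementwise(fmt)

# Element-wise on a Series.
e_formatted = frame["e"].map(fmt)

print(col_range, row_range, formatted, e_formatted, sep="\n\n")
```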
9. Merging DataFrames on a given column:
# Merge multiple dataframes
df1 = pd.DataFrame(np.array([['a', 5, 9], ['b', 4, 61], ['c', 24, 9]]),
                   columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([['a', 5, 19], ['b', 14, 16], ['c', 4, 9]]),
                   columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([['a', 15, 49], ['b', 4, 36], ['c', 14, 9]]),
                   columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2,on='name').merge(df3,on='name')
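When there are more than a few frames, the chained merge can be folded over a list. The sketch below uses functools.reduce, a variation that is not from the original note, with the same data built from dictionaries so the numeric columns keep their integer dtype.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c"], "attr11": [5, 4, 24], "attr12": [9, 61, 9]})
df2 = pd.DataFrame({"name": ["a", "b", "c"], "attr21": [5, 14, 4], "attr22": [19, 16, 9]})
df3 = pd.DataFrame({"name": ["a", "b", "c"], "attr31": [15, 4, 14], "attr32": [49, 36, 9]})

# Merge each frame into the running result, joining on 'name'.
merged = reduce(lambda left, right: left.merge(right, on="name"), [df1, df2, df3])
print(merged)
```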
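10. Adding values to a set:
To add a single element to a set: keep.add(onemorevalue)
To add all elements of a list (or any other iterable) to a set: keep.update(yoursequenceofvalues)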
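A short illustration of the difference, with made-up values: add inserts one element, while update inserts every element of an iterable. Duplicates are ignored in both cases.

```python
keep = {1, 2}

keep.add(3)              # add a single element
keep.update([3, 4, 5])   # add every element of a list; 3 is already present

print(keep)
```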
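The following call belongs with item 7 (converting types on import). Converters can be passed to the reader; strip and make_int are assumed to be functions defined elsewhere:
table = pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"], converters={'Description': strip, 'Model': strip, 'Make': strip, 'Year': make_int})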