前一篇介绍了Pandas实现简单的SQL操作,本篇中将主要介绍一些相对复杂一点的操作。为了方便后面实操,先给出一份简化版的设备统计数据:
0 android NLL 387546520 20994579110 ios NLL 52877990 9164217551 android 魅族 8995958 1203695971 android 酷派 9915906 2008181721 android 三星 16500493 7189695141 android 小米 23933856 2907875901 android 华为 26706736 6419077611 ios 苹果 52877990 9164217552 android 小米-小米4 2786675 553765812 android 魅族-m2-note 4642112 1309842052 android OPPO-A31 4893428 629769972 ios 苹果-iPhone-6s 5728609 99948716
其中,第一列表示维度组合编号,第二列表示操作系统类型,第三列为维度值(NLL表示缺失,即第一行、第二行表示操作系统的统计,其余表示厂商或机型),第三列、第四列分别表示UV、PV;且字段之间为\t
分隔。读取该文件为DataFrame:
import pandas as pddf = pd.read_csv(path, names=['id', 'os', 'dim', 'uv', 'pv'], sep='\t')
在原dataframe上,增加一行数据;可通过dataframe的append函数来追加:
import numpy as nprow_df = pd.DataFrame(np.array([['2', 'ios', '苹果-iPad 4', 3287509, 32891811]]), columns=['id', 'os', 'dim', 'uv', 'pv'])df = df.append(row_df, ignore_index=True)
增加一列数据,则比较简单:
df['time'] = '2016-07-19'
关于android、ios的PV、UV的dict:
def where(df, column_name, id_value): df = df[df[column_name] == id_value] return dfdef to_dict(df): """ {"pv" or "uv" -> {"os": os_value}} :return: dict """ df = where(df, 'id', 0) df_dict = df.set_index('os')[['uv', 'pv']].to_dict() return df_dict
group某列后的top值,比如,android、ios的UV top 2的厂商:
def group_top(df, group_col, sort_col, top_n): """ get top(`sort_col`) after group by `group_col` :param df: dataframe :param group_col: string, column name :param sort_col: string, column name :param top_n: int :return: dataframe """ return df.assign(rn=df.sort_values([sort_col], ascending=False) .groupby(group_col) .cumcount() + 1) .query('rn < ' + str(top_n + 1)) .sort_values([group_col, 'rn'])
全局top值加上group某列后的top值,并有去重:
def top(df, group_col, sort_col, top_n): """overall top and group top""" all_top_df = df.nlargest(top_n, columns=sort_col) grouped_top_df = group_top(df, group_col, sort_col, top_n) grouped_top_df = grouped_top_df.ix[:, 0:-1] result_df = pd.concat([all_top_df, grouped_top_df]).drop_duplicates() return result_df
对某列排序后并编号,相当于给出排序名次。比如,对UV的排序编号:
df['rank'] = df['uv'].rank(method='first', ascending=False).apply(lambda x: int(x))
Pandas的left join对NULL的列没有指定默认值,下面给出简单的实现:
def left_join(left, right, on, right_col, default_value): df = pd.merge(left, right, how='left', on=on) df[right_col] = df[right_col].map(lambda x: default_value if pd.isnull(x) else x) return df
对某一列做较为复杂的自定义操作,比如,厂商的UV占比:
def percentage(part, whole): return round(100*float(part)/float(whole), 2)os_dict = to_dict(df)all_uv = sum(os_dict['uv'].values())df = where(df, 'id', 1)df['per'] = df.apply(lambda r: percentage(r['uv'], all_uv), axis=1)
某列的重复值的行:
duplicate = df.duplicated(subset=columns, keep=False)
Pandas的to_sql函数支持Dataframe直接写MySQL数据库。在公司开发时,常常会有办公网与研发网是不通的,Python的sshtunnel模块提供ssh通道,便于入库debug。
import MySQLdbfrom sshtunnel import SSHTunnelForwarderwith SSHTunnelForwarder(('porxy host', port), ssh_password='os passwd', ssh_username='os user name', remote_bind_address=('mysql host', 3306)) as server: conn = MySQLdb.connect(host="127.0.0.1", user="mysql user name", passwd="mysql passwd", db="db name", port=server.local_bind_port, charset='utf8') df.to_sql(name='tb name', con=conn, flavor='mysql', if_exists='append', index=False)
联系客服