Getting a feel for text parsing and file reading/writing in Python through a simple crawler example

https://m.toutiao.com/is/BmaSuFv/?= 


Python's requests module makes it very easy to fetch the text of a web page from its URL.

Python's re module provides powerful regular-expression support for processing text.

Python's file reading and writing is likewise simple and powerful.
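Put together, those three pieces are all a minimal crawler needs. As a rough sketch of the whole pipeline (the URL and pattern here are placeholders, not the ones used in the sections below):

import re
import requests

# Placeholder example: fetch a page, pull out its <title>, save it to a file.
page = requests.get('https://example.com').text
title = re.findall(r'<title>(.*?)</title>', page)[0]
with open('title.txt', 'w', encoding='utf-8') as f:
    f.write(title)

The numbered sections below do exactly this, step by step, against a real site.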

1 Fetching network text from a URL with the requests module

1.1 Installing the requests module

In CMD, change into the Scripts directory of your Python installation:

Start menu → Run (Windows+R) → cmd → use the cd command to enter the Scripts folder under the Python installation directory, for example:

cd C:\Users\userName\AppData\Local\Programs\Python\Python36-32\Scripts

Then enter pip install requests.

Alternatively, open the Python installation directory, go into the Scripts folder, hold Shift and right-click, and choose "Open command window here" from the context menu.

Or run the following command (which uses the Douban mirror) directly in the cmd window:

pip install requests -i http://pypi.douban.com/simple --trusted-host=pypi.douban.com
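To check that the installation worked, you can import the module and print its version (any version number printing at all means requests is importable):

python -c "import requests; print(requests.__version__)"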

1.2 Fetching network text from a URL

import requests

href = 'https://www.3zmm.net/files/article/html/98709/98709808/'
html_response = requests.get(href)
# html_response.encoding = 'utf-8'  # uncomment if the Chinese text comes back garbled
html = html_response.text
print(html)

Run result: the raw HTML of the page is printed (screenshot omitted).

2 Building the index file

Create an index file, index.html, listing the page links to extract (built by hand or extracted with code).

(For this demo, only an excerpt is shown):

<a href ='https://www.3zmm.net/files/article/html/98709/98709808/13110286.html'>第1章 出门即是江湖</a>
<a href ='https://www.3zmm.net/files/article/html/98709/98709808/13110285.html'>第2章 麻将出千</a>
<a href ='https://www.3zmm.net/files/article/html/98709/98709808/13110284.html'>第3章 移山卸岭</a>
<a href ='https://www.3zmm.net/files/article/html/98709/98709808/13110283.html'>第4章 初次试探</a>
<a href ='https://www.3zmm.net/files/article/html/98709/98709808/13110282.html'>第5章 炸金花</a>

Of course, you could also fetch the text over the network directly and build the list with a regular-expression search, as sketched below. For this demo a local index.html is used instead. (The index file can be edited at any time; it acts as a locally captured table of contents, which makes the demo more flexible.)
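For reference, a sketch of that alternative: fetch the site's listing page (the URL from section 1.2) and build the list directly. The link pattern here is an assumption about that page's markup, so check the actual source and adjust it:

import re
import requests

toc_html = requests.get('https://www.3zmm.net/files/article/html/98709/98709808/').text
# Assumed pattern; the real page's attribute quoting and spacing may differ.
indexList = re.findall(r'<a href="(.*?)">(.*?)</a>', toc_html)
for link in indexList:
    print(link[0], link[1])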

3 Reading index.html and building a list of links and titles

import re

with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()  # "text" rather than "str", to avoid shadowing the built-in
res = r"<a href ='(.*?)'>(.*?)</a>"  # () makes two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    print('href: ', link[0])
    print('title: ', link[1], '\n')

Run result (the first entry should look like this; screenshot omitted):

href:  https://www.3zmm.net/files/article/html/98709/98709808/13110286.html
title:  第1章 出门即是江湖

4 Fetching the network text for each link in index.html

Fetch the network text for each link.

import re
import requests

with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    print(link[1], '\n\n')
    print(chapter_html, '\n\n')

Run result: (screenshot omitted)
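The loop above assumes every request succeeds. In practice it is worth giving requests.get a timeout and checking the status code; a minimal sketch of that idea (the 10-second timeout is an arbitrary choice):

import requests

for link in indexList:  # indexList as built in section 3
    try:
        chapter_response = requests.get(link[0], timeout=10)  # don't hang forever
        chapter_response.raise_for_status()  # raise on 4xx/5xx responses
    except requests.RequestException as e:
        print('failed:', link[0], e)
        continue
    chapter_html = chapter_response.text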

5 Extracting the text

Extract the main body text from the page source.

import re
import requests

with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # grab the contents of the body <div>; add re.S as a third argument if the div spans multiple lines
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Run result: (screenshot omitted)

6 Cleaning the text

Replace the unwanted text with empty strings.

import re
import requests

with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    # boilerplate the site injects into every chapter ("junk" rather than "str", to avoid shadowing the built-in)
    junk = '<script>chaptererror();</script><br />  请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Run result: (screenshot omitted)
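Note that str.replace only removes fragments you list out explicitly. An alternative technique is regular-expression substitution with re.sub, which can strip a whole class of leftovers in one pass; a small self-contained sketch:

import re

sample = "text<script>chaptererror();</script><br />more text"
# One substitution removes every <script>...</script> block, listed or not.
cleaned = re.sub(r'<script>.*?</script>', '', sample)
print(cleaned)  # text<br />more text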

7 Processing the text (find and replace)

import re
import requests

with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    # rebuild paragraph structure from the site's markup
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    junk = '<script>chaptererror();</script><br />  请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], '\n\n')
    print(chapter_content, '\n\n')

Run result: (screenshot omitted)

8 Writing each chapter to its own file

import re
import requests

# 1 Read the index file and build a list of (link, title) pairs
with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # 3 Extract the body text
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    junk = '<script>chaptererror();</script><br />  请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], '\n\n')
    # 6 Persist the data (write to a file; %s is filled in by link[1], the chapter title)
    fb = open('%s.html' % link[1], 'w', encoding='utf-8')
    fb.write(chapter_content)
    fb.close()
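One note on the file handling: open/write/close works, but a with block (as already used for index.html above) closes the file even if a write fails part-way. The last three lines of the loop could equally be written as:

    with open('%s.html' % link[1], 'w', encoding='utf-8') as fb:
        fb.write(chapter_content)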

9 Writing each chapter to its own file, adding some CSS and JS

import re
import requests

# 1 Read the index file and build a list of (link, title) pairs
with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # 3 Extract the body text
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    junk = '<script>chaptererror();</script><br />  请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], '\n\n')
    # 6 Persist the data (write to a file, adding CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]  # chapter number, used as the file name
    fb = open('%s.html' % sn, 'w', encoding='utf-8')
    fheader = open('header.html', 'r', encoding='UTF-8')
    fb.write(fheader.read())
    fheader.close()
    fb.write('\n<h4>')
    fb.write(sn)
    cha = link[1].replace(sn, '')   # strip the number from the title
    cha = cha.replace('第章 ', '')  # strip the leftover "第章 " prefix
    fb.write(' ')
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)
    ffooter = open('footer.html', 'r', encoding='UTF-8')
    fb.write(ffooter.read())
    ffooter.close()
    fb.close()

You can also write the header and footer text into the file directly:

import re
import requests

# 1 Read the index file and build a list of (link, title) pairs
with open('index.html', 'r', encoding='utf-8') as strf:
    text = strf.read()
res = r"<a href ='(.*?)'>(.*?)</a>"  # two capture groups: link and title
indexList = re.findall(res, text)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = 'utf-8'
    chapter_html = chapter_response.text
    # 3 Extract the body text
    chapter_content = re.findall(r"<div id='content' class='showtxt'>(.*?)</div>", chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    junk = '<script>chaptererror();</script><br />  请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(junk, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (find and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], '\n\n')
    # 6 Persist the data (write to a file, adding CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]
    fb = open('%s.html' % sn, 'w', encoding='utf-8')
    # 6.1 Write the file header (inline, instead of reading header.html)
    # fheader = open('header.html', 'r', encoding='UTF-8')
    # fb.write(fheader.read())
    # fheader.close()
    headertxt = '''<!DOCTYPE html>
<head>
<meta http-equiv='Content-Type' content='text/html; charset=utf-8' />
<title></title>
<link ID='CSS' href='../cssjs/css.css' rel='stylesheet' type='text/css' />
<script charset='utf-8' language='JavaScript' type='text/javascript' src='../cssjs/js.js'></script>
<script>docWrite1();</script>
</head>
<body>
<div id='container'>
'''
    fb.write(headertxt)
    # 6.2 Write the body
    fb.write('\n<h4>')
    fb.write(sn)
    cha = link[1].replace(sn, '')
    cha = cha.replace('第章 ', '')
    fb.write(' ')
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)
    # 6.3 Write the file footer (inline, instead of reading footer.html)
    # ffooter = open('footer.html', 'r', encoding='UTF-8')
    # fb.write(ffooter.read())
    # ffooter.close()
    footertxt = '''<div>
<script type=text/javascript>
    docWrite2();
    bootfunc();
    window.onload = myfun;
</script>
</div>
</body>
</html>
'''
    fb.write(footertxt)
    fb.close()

-End-
