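关关采集器 (GuanGuan Collector) collects content mainly through regular expressions. Some common expressions are listed below.
\d* matches zero or more digits
\s* matches zero or more whitespace characters (spaces and line breaks)
.+? matches one or more characters, as few as possible (cannot be empty)
.* matches zero or more characters (may be empty)
() marks the part we want to extract (a capture group)
((.|\n)*) matches the chapter body, line breaks included.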
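The expressions above can be tried out with a short sketch. It uses Python's re module rather than the collector itself, and the page fragment is invented for illustration, so treat it as a demonstration of the regex tokens only.

```python
import re

# Hypothetical page fragment, invented for illustration only.
html = '<a href="/books/123/">Title</a>\n<div id="c">line one\nline two</div>'

# \d* captures a run of digits (possibly empty) inside the link.
book_id = re.search(r'/books/(\d*)/', html).group(1)

# .+? is non-greedy: the group stops at the first following </a>.
title = re.search(r'<a href=".+?">(.+?)</a>', html).group(1)

# . does not match a line break by default, so (.|\n)* is used to span lines.
content = re.search(r'<div id="c">((.|\n)*?)</div>', html).group(1)

# \s* absorbs optional spaces and line breaks between two tags.
gap = re.search(r'</a>\s*<div', html) is not None

# .* accepts an empty value, while .+? needs at least one character.
empty_ok = re.search(r'id="(.*)"', 'id=""')
empty_rejected = re.search(r'id="(.+?)"', 'id=""')
```

Here empty_ok is a match whose group is the empty string, while empty_rejected is None, which is the practical difference between "may be empty" and "cannot be empty".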
=====杰奇 (Jieqi) equivalents=====
!!!! is equivalent to ([^><]*)
~~~~ is equivalent to ([^><'"]*)
^^^^ is equivalent to ([^><\d]*)
$$$$ is equivalent to ([\d]*)
**** is equivalent to (.*)
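As a rough sketch of how the table above could be used, the following helper turns a Jieqi-style rule into a regular expression. The function name jieqi_to_regex and the sample rule are hypothetical, and the sketch assumes that everything outside the five placeholders should be matched literally, which the original text does not state.

```python
import re

# Mapping copied from the table above: Jieqi placeholder -> regex group.
JIEQI_TO_REGEX = {
    "!!!!": r"([^><]*)",
    "~~~~": r"""([^><'"]*)""",
    "^^^^": r"([^><\d]*)",
    "$$$$": r"([\d]*)",
    "****": r"(.*)",
}

def jieqi_to_regex(rule: str) -> str:
    """Expand Jieqi placeholders and escape the literal text around them."""
    # The capturing group in the split pattern keeps the placeholders
    # in the token list, so they can be looked up in the mapping.
    tokens = re.split(r"(!!!!|~~~~|\^\^\^\^|\$\$\$\$|\*\*\*\*)", rule)
    return "".join(JIEQI_TO_REGEX.get(t, re.escape(t)) for t in tokens)

# Hypothetical rule and page fragment, invented for illustration.
pattern = jieqi_to_regex('<a href="/book/$$$$.html">!!!!</a>')
match = re.search(pattern, '<a href="/book/42.html">Some Title</a>')
```

With this input, match.groups() holds the digits captured by $$$$ and the link text captured by !!!!.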
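=====Other basics=====
. matches any single character. For example, the regular expression r.t matches the strings rat, rut, and r t, but not root.
$ matches the end of a line. For example, the regular expression weasel$ matches the end of the string "He's a weasel", but not the string "They are a bunch of weasels."
^ matches the start of a line. For example, the regular expression ^When in matches the start of the string "When in the course of human events", but not "What and When in the".
* matches zero or more occurrences of the character immediately before it. For example, the regular expression .* matches any number of any characters.
\ is the escape character. It makes the metacharacters listed here match as ordinary characters. For example, the regular expression \$ matches a dollar sign rather than the end of a line, and likewise \. matches a literal dot rather than acting as the any-character wildcard.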
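The examples in this list can be checked with a few lines of Python. The helper name matches is made up for this sketch, and Python's re module stands in for the collector's own regex engine.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# . stands for exactly one character, so "root" does not match r.t.
dot_results = [matches(r"r.t", s) for s in ["rat", "rut", "r t", "root"]]

# $ anchors the match to the end of the line.
end_ok = matches(r"weasel$", "He's a weasel")
end_fail = matches(r"weasel$", "They are a bunch of weasels.")

# ^ anchors the match to the start of the line.
start_ok = matches(r"^When in", "When in the course of human events")
start_fail = matches(r"^When in", "What and When in the")

# * allows zero repetitions, so "ab*" still matches a lone "a".
star_zero = re.search(r"ab*", "a").group()

# \ turns a metacharacter into a literal character.
dollar = matches(r"\$", "costs $5")
literal_dot = matches(r"\.", "no dot here")
wildcard_dot = matches(r".", "no dot here")
```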
Universal image rule: <[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*>
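The rule above relies on a lookbehind of variable length, which Python's re module rejects (it only accepts fixed-width lookbehinds). The sketch below is therefore a simplified stand-in, not the original rule: it captures the image address with an ordinary group, matches case-insensitively, and lists jpeg before jpg so that a .jpeg address is not cut short. The names IMG_RULE and image_urls and the sample HTML are invented for illustration.

```python
import re

# Simplified stand-in for the universal image rule: the URL is taken from a
# plain capture group instead of a variable-length lookbehind.
IMG_RULE = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*["']?([^\s"'>]*\.(?:jpeg|jpg|gif|bmp|png))[^>]*>""",
    re.IGNORECASE,
)

def image_urls(html: str) -> list:
    """Return the src value of every <img> tag that points at an image file."""
    return IMG_RULE.findall(html)

# Hypothetical chapter fragment, invented for illustration.
sample = (
    '<p>text</p><img alt="x" src="/pic/1.jpg" width="10">'
    "<IMG SRC='a/b.PNG'><img src=\"c.svg\"><img src=d/e.jpeg>"
)
urls = image_urls(sample)
```

The tag pointing at c.svg is skipped because its extension is not in the list.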
Attached: the collection rule for 藏海阁文学网 (canghaige.com). It collects text only.
<?xml version="1.0"?>
<RuleConfigInfo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<RuleVersion>
<RegexName />
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</RuleVersion>
<RuleID>
<RegexName>RuleID</RegexName>
<Pattern>1</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</RuleID>
<GetSiteName>
<RegexName>GetSiteName</RegexName>
<Pattern>藏海阁</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</GetSiteName>
<GetSiteCharset>
<RegexName>GetSiteCharset</RegexName>
<Pattern>utf-8</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</GetSiteCharset>
<GetSiteUrl>
<RegexName>GetSiteUrl</RegexName>
<Pattern>http://www.canghaige.com/</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</GetSiteUrl>
<NovelSearchUrl>
<RegexName>NovelSearchUrl</RegexName>
<Pattern>http://www.canghaige.com/Book/Search.aspx</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelSearchUrl>
<NovelSearchData>
<RegexName>NovelSearchData</RegexName>
<Pattern>SearchKey={SearchKey}&SearchClass=1</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelSearchData>
<NovelSearch_GetNovelKey>
<RegexName>NovelSearch_GetNovelKey</RegexName>
<Pattern><div id="CListTitle"><a href="/Book/(\d*)/Index.aspx" target="_blank"><b>{SearchKey}</b></a></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelSearch_GetNovelKey>
<NovelListUrl>
<RegexName>NovelListUrl</RegexName>
<Pattern>http://www.canghaige.com/type/1/</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelListUrl>
<NovelList_GetNovelKey>
<RegexName>NovelList_GetNovelKey</RegexName>
<Pattern><a href="http://www.canghaige.com/books/(\d*)/" id=".+?" title=".+?">(.+?)</a></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelList_GetNovelKey>
<NovelUrl>
<RegexName>NovelUrl</RegexName>
<Pattern>http://www.canghaige.com/books/{NovelKey}/</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelUrl>
<NovelErr>
<RegexName>NovelErr</RegexName>
<Pattern>未找到该编号的书籍信息</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelErr>
<NovelName>
<RegexName>NovelName</RegexName>
<Pattern><h1>(.+?)</h1></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelName>
<NovelAuthor>
<RegexName>NovelAuthor</RegexName>
<Pattern>作者:(.+?)</span></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelAuthor>
<LagerSort>
<RegexName>LagerSort</RegexName>
<Pattern>书籍类别:(.+?)</span></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</LagerSort>
<SmallSort>
<RegexName>SmallSort</RegexName>
<Pattern>书籍类别:(.+?)</span></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</SmallSort>
<NovelIntro>
<RegexName>NovelIntro</RegexName>
<Pattern><div>内容简介:((.|\n)*?)</div>\s*</li></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern><span(.|\n)+?</span>|<p>|<a.+?</a>|</div></FilterPattern>
</NovelIntro>
<NovelKeyword>
<RegexName>NovelKeyword</RegexName>
<Pattern><h1>(.+?)</h1></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelKeyword>
<NovelDegree>
<RegexName>NovelDegree</RegexName>
<Pattern>连载状态:(.+?)</span></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelDegree>
<NovelCover>
<RegexName>NovelCover</RegexName>
<Pattern><a class="pic"><img src="(.+?)"</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelCover>
<NovelDefaultCoverUrl>
<RegexName>NovelDefaultCoverUrl</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelDefaultCoverUrl>
<NovelInfo_GetNovelPubKey>
<RegexName>NovelInfo_GetNovelPubKey</RegexName>
<Pattern>连载状态:(.+?)</span></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</NovelInfo_GetNovelPubKey>
<PubCookies>
<RegexName>PubCookies</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubCookies>
<PubIndexUrl>
<RegexName>PubIndexUrl</RegexName>
<Pattern>http://www.canghaige.com/books/{NovelKey}/</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubIndexUrl>
<PubIndexErr>
<RegexName>PubIndexErr</RegexName>
<Pattern>这里必须填写</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubIndexErr>
<PubVolumeContent>
<RegexName>PubVolumeContent</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubVolumeContent>
<PubVolumeSplit>
<RegexName>PubVolumeSplit</RegexName>
<Pattern><h3></Pattern>
<Method>Spilt</Method>
<Options>None</Options>
<FilterPattern />
</PubVolumeSplit>
<PubVolumeName>
<RegexName>PubVolumeName</RegexName>
<Pattern>Title">(.+?)</div></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern> </FilterPattern>
</PubVolumeName>
<PubChapterName>
<RegexName>PubChapterName</RegexName>
<Pattern><li><a href="
>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubChapterName>
<PubChapter_GetChapterKey>
<RegexName>PubChapter_GetChapterKey</RegexName>
<Pattern><li><a href="(
>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubChapter_GetChapterKey>
<PubContentUrl>
<RegexName>PubContentUrl</RegexName>
<Pattern>{ChapterKey}</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubContentUrl>
<PubContentErr>
<RegexName>PubContentErr</RegexName>
<Pattern>这里必须填写</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubContentErr>
<PubContent_GetTextKey>
<RegexName>PubContent_GetTextKey</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubContent_GetTextKey>
<PubTextUrl>
<RegexName>PubTextUrl</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubTextUrl>
<PubContentText>
<RegexName>PubContentText</RegexName>
<Pattern><div id="zjneirong" style="font-size:14px;width:100%;">((.|\n)+?)<hr</Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern><div.+?>|<div>|</div>|<DIV.+?>|</DIV>|<script(.|\n)+?</script>|<style(.|\n)+?</style>|<a(.|\n)+?</a></FilterPattern>
</PubContentText>
<PubContentReplace>
<RegexName>PubContentReplace</RegexName>
<Pattern />
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubContentReplace>
<PubContentImages>
<RegexName>PubContentImages</RegexName>
<Pattern><[^<]*((?<=<(?:img|IMG)[^>]*(?:(?:src|SRC)(?:\s*=\s*(?:["']?))))(?:[^\s"'>]*)\.(?:jpg|gif|jpeg|bmp|png|GIF|JPG))[^>]*></Pattern>
<Method>Match</Method>
<Options>None</Options>
<FilterPattern />
</PubContentImages>
</RuleConfigInfo>
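How the collector combines Pattern, FilterPattern, and the {NovelKey} placeholder is not spelled out in the rule file itself. The sketch below shows one plausible reading, under these assumptions: the first capture group of Pattern is the extracted value, everything matching FilterPattern is then deleted from it, and placeholders in URL patterns are replaced by plain text substitution. The function names apply_rule and fill_url and the page fragment are hypothetical; the two patterns are copied from the NovelIntro rule.

```python
import re

# Patterns copied from the NovelIntro rule above.
PATTERN = r"<div>内容简介:((.|\n)*?)</div>\s*</li>"
FILTER_PATTERN = r"<span(.|\n)+?</span>|<p>|<a.+?</a>|</div>"

def apply_rule(pattern, filter_pattern, page):
    """Return group 1 of the first match with filter matches removed, or None."""
    m = re.search(pattern, page)
    if m is None:
        return None
    text = m.group(1)
    if filter_pattern:
        # Assumed behaviour: delete everything the filter matches.
        text = re.sub(filter_pattern, "", text)
    return text.strip()

def fill_url(template, **keys):
    """Replace {Name}-style placeholders in a URL pattern with given values."""
    for name, value in keys.items():
        template = template.replace("{" + name + "}", str(value))
    return template

# Hypothetical page fragment, invented for illustration.
page = '<li><div>内容简介:<p>A short <span class="x">ad</span>summary.</div>\n</li>'
intro = apply_rule(PATTERN, FILTER_PATTERN, page)
url = fill_url("http://www.canghaige.com/books/{NovelKey}/", NovelKey=123)
```

In this run the filter strips the leading paragraph tag and the whole span element, leaving only the summary text in intro.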