Nutch 1.4 was officially released on November 26, 2011. It updates some content and configuration relative to 1.3, but the differences from 1.3 are still small; the differences from 1.2 and earlier are much larger. Since Nutch 1.3, index generation is done with Solr, and querying goes through Solr as well, so the web search webapp that shipped with Nutch 1.2 and earlier is no longer needed.
First, download the latest Nutch 1.4 from the official site:
http://www.apache.org/dyn/closer.cgi/nutch/
Either apache-nutch-1.4-bin.zip or apache-nutch-1.4-bin.tar.gz will do.
After downloading, unpack it. This article covers running Nutch on Linux; in the next one I will cover Nutch development in Eclipse.
After unpacking, you will see the directory layout.
Go into the nutch/runtime/local directory; it contains a conf folder with the following files:
For now, you only need to know two of these files:
nutch-default.xml and regex-urlfilter.txt
nutch-default.xml is Nutch's configuration file.
regex-urlfilter.txt is where the crawl filtering rules are written.
Since this is a first crawl, there is no need to tune any other settings for a test; the following is enough:
In nutch-default.xml, find the http.agent.name property and fill in its value:
<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>jdodrc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>
If this property is left empty, Nutch aborts with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
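As a quick sanity check that the value really is set, you can grep the property in the config file. This is only a sketch: `sample.xml` below stands in for conf/nutch-default.xml, and `jdodrc` is just the example agent name used above.

```shell
# Stand-in for conf/nutch-default.xml; in practice grep the real file.
cat > sample.xml <<'EOF'
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>jdodrc</value>
  </property>
</configuration>
EOF
# Print the property name plus the value line that follows it;
# an empty <value></value> here means the crawl will fail as shown above.
grep -A1 '<name>http.agent.name</name>' sample.xml
```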
With the property added, we still need to set the crawl rules. For example, suppose we want to crawl www.163.com but not follow every link it contains - Sohu ad links, say, we do not want to fetch; we only want 163 content. For that we set crawl rules, written as regular expressions (regular expressions themselves are not covered here).
So where do the rules go?
They are written in the regex-urlfilter.txt file:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

This rule filters out URLs by file extension.
Crawling dynamic pages:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
-[~]

If you need to crawl dynamic pages, comment out the -[?*!@=] rule as shown here; otherwise leave it active so query-style URLs are skipped.
Next is the page-link filter rule; the following restricts the crawl to the 163 site:

# accept anything else
#+^http://([a-z0-9]*\.)*(.*\.)*.*/
+^http://([a-z0-9]*\.)*163\.com
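The effect of these rules can be sketched with grep -E. This is only an illustration of how the patterns match, not how Nutch applies them (Nutch's urlfilter-regex plugin tries the rules in order and uses the first match); the suffix pattern below is an abbreviated form of the full rule.

```shell
accept='^http://([a-z0-9]*\.)*163\.com'
suffix='\.(gif|jpg|png|css|js|zip|exe)$'   # abbreviated skip rule

# A 163 URL matches the accept rule, so it would be crawled (prints 1):
echo 'http://news.163.com/photo/' | grep -cE "$accept"

# A non-163 URL does not match, so it would be filtered out (prints 0):
echo 'http://www.sohu.com/' | grep -cE "$accept" || true

# An image URL matches the suffix skip rule, so it would be skipped (prints 1):
echo 'http://www.163.com/images/logo.png' | grep -cE "$suffix"
```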
For a test run, modifying the filter rules is all that is needed.
With http.agent.name configured in nutch-default.xml,
and the regex rules configured in regex-urlfilter.txt,
the next step on Linux is to make all the .sh files under runtime/local/bin executable.
Go into the bin directory and run:
chmod +x *.sh
This marks all the shell scripts as executable.
Now let's do a test:
Under runtime/local, create a urls directory, and inside it create a file named test containing the entry-point URL of the site we want to crawl:
http://www.163.com/
Save it. The local directory now contains a urls directory holding one seed file.
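The seed setup above can be done from the shell (run inside nutch/runtime/local; the file name test is simply the name chosen in this walkthrough):

```shell
mkdir -p urls                           # seed-URL directory
echo 'http://www.163.com/' > urls/test  # one entry-point URL per line
cat urls/test                           # verify the seed file contents
```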
Before running the test, let's go over the crawl command's parameters:
Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
Arguments in [] are optional.
urlDir is the directory containing the seed (entry-point) URL file(s).
-solr <solrUrl> is the Solr address (leave it empty if you have none).
-dir is where the crawl data is stored.
-threads is the number of fetcher threads (more is not always better; use what the job requires; default 10).
-depth is the crawl depth (default 5).
-topN is the crawl breadth, the maximum number of URLs per round (default Long.MAX_VALUE).
The bin directory also contains a nutch shell script, whose crawl argument is what launches the crawl class.
Let's run a test crawl now, from the nutch/runtime/local directory:
bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
If you want to keep the log for later inspection, append >& (output location) to the end of the command.
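For example (a sketch: crawl.log is a hypothetical file name, and a stand-in command is used below in place of bin/nutch so the redirection itself can be demonstrated; >& is csh syntax, while the bash equivalent is > file 2>&1 or &>):

```shell
# Real usage would be:
#   bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100 > crawl.log 2>&1
# Stand-in command: capture both stdout and stderr in one log file.
( echo 'crawl started in: crawl'; echo 'Fetcher warning' 1>&2 ) > crawl.log 2>&1
cat crawl.log   # both streams end up in the log
```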
Solr has to be set up separately; I will explain its deployment in a dedicated Solr article. Here the -solr option only needs Solr's URL.
If you want to learn about Solr deployment, see the Solr deployment article.
If you want to test or develop on Windows, you first need to install Cygwin; I will cover installing Cygwin in the article on setting up nutch1.4 in Eclipse.
Test output:
crawl started in: crawl
rootUrlDir = urls/test.txt
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-02-07 14:21:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls/test.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-07 14:21:25, elapsed: 00:00:04
Generator: starting at 2012-02-07 14:21:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142128
Generator: finished at 2012-02-07 14:21:30, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:30
Fetcher: segment: crawl/segments/20120207142128
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:21:36, elapsed: 00:00:05
ParseSegment: starting at 2012-02-07 14:21:36
ParseSegment: segment: crawl/segments/20120207142128
Parsing: http://www.163.com/
ParseSegment: finished at 2012-02-07 14:21:39, elapsed: 00:00:03
CrawlDb update: starting at 2012-02-07 14:21:39
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142128]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:21:42, elapsed: 00:00:03
Generator: starting at 2012-02-07 14:21:42
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142145
Generator: finished at 2012-02-07 14:21:48, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:48
Fetcher: segment: crawl/segments/20120207142145
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
Using queue mode : byHost
QueueFeeder finished: total 97 records + hit by time limit :0
Using queue mode : byHost
fetching http://bbs.163.com/
Using queue mode : byHost
fetching http://bbs.163.com/rank/
Using queue mode : byHost
fetching http://tech.163.com/cnstock/
Using queue mode : byHost
fetching http://tech.163.com/
Using queue mode : byHost
fetching http://tech.163.com/digi/nb/
Using queue mode : byHost
Using queue mode : byHost
fetching http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
fetching http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Using queue mode : byHost
fetching http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Using queue mode : byHost
fetching http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
fetching http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
fetching http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
fetching http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
fetching http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
fetching http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
fetching http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
fetching http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
fetching http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
fetching http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
fetching http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
fetching http://mobile.163.com/
fetching http://mobile.163.com/app/
fetching http://reg.vip.163.com/enterMail.m?enterVip=true-----------
fetching http://product.tech.163.com/mobile/
fetching http://hea.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=68
fetching http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=67
fetching http://yuehui.163.com/
fetching http://auto.163.com/
fetching http://auto.163.com/buy/
fetching http://gongyi.163.com/
fetching http://reg.163.com/Main.jsp?username=pInfo
fetching http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=61
fetching http://money.163.com/fund/
fetching http://money.163.com/stock/
fetching http://money.163.com/hkstock/
fetching http://money.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=57
fetching http://blog.163.com/passportIn.do?entry=163
fetching http://blog.163.com/?fromNavigation
fetching http://pay.163.com/
fetching http://baby.163.com/
fetching http://discovery.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=52
fetching http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
fetching http://help.163.com?b01abh1
fetching http://www.163.com/rss/
fetching http://home.163.com/
fetching http://product.auto.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=47
fetching http://ecard.163.com/
fetching http://photo.163.com/?username=pInfo
fetching http://photo.163.com/pp/square/
fetching http://email.163.com/
fetching http://m.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=42
fetching http://edu.163.com/
fetching http://edu.163.com/liuxue/
fetching http://xf.house.163.com/gz/
fetching http://game.163.com/
fetching http://travel.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=37
fetching http://baoxian.163.com/?from=index
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
fetching http://zx.caipiao.163.com?from=shouye
fetching http://entry.mail.163.com/coremail/fcg/ntesdoor2?verifycookie=1&lightweight=1
fetching http://biz.163.com/
fetching http://t.163.com/rank?f=163dh
fetching http://t.163.com/chat?f=163dh
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=31
fetching http://t.163.com/?f=wstopmicoblogmsg
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
fetching http://t.163.com/rank/daren?f=163dh
fetching http://t.163.com/?f=wstopmicoblogmsg.enter
fetching http://t.163.com/
fetching http://sports.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=26
fetching http://sports.163.com/nba/
fetching http://sports.163.com/cba/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=24
fetching http://sports.163.com/yc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=23
fetching http://vipmail.163.com/
fetching http://digi.163.com/
fetching http://lady.163.com/beauty/
fetching http://lady.163.com/
fetching http://lady.163.com/sense/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=18
fetching http://house.163.com/
fetching http://news.163.com/review/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=16
fetching http://news.163.com/photo/
fetching http://news.163.com/
fetching http://v.163.com/doc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13
fetching http://v.163.com/zongyi/
fetching http://v.163.com/
fetching http://v.163.com/focus/
fetching http://fushi.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=9
fetching http://yc.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=8
fetching http://mall.163.com/
fetching http://ent.163.com/movie/
fetching http://ent.163.com/
fetching http://ent.163.com/music/
fetching http://ent.163.com/tv/
fetching http://war.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=2
* queue: http://fashion.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728444
  0. http://fashion.163.com/
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728445
  0. http://book.163.com/
fetching http://fashion.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=1
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595729445
  0. http://book.163.com/
fetching http://book.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:22:20, elapsed: 00:00:32
ParseSegment: starting at 2012-02-07 14:22:20
ParseSegment: segment: crawl/segments/20120207142145
Parsing: http://auto.163.com/
Parsing: http://auto.163.com/buy/
Parsing: http://baby.163.com/
Parsing: http://baoxian.163.com/?from=index
Parsing: http://bbs.163.com/
Parsing: http://bbs.163.com/rank/
Parsing: http://biz.163.com/
Parsing: http://blog.163.com/?fromNavigation
Parsing: http://book.163.com/
Parsing: http://digi.163.com/
Parsing: http://discovery.163.com/
Parsing: http://edu.163.com/
Parsing: http://edu.163.com/liuxue/
Parsing: http://email.163.com/
Parsing: http://ent.163.com/
Parsing: http://ent.163.com/movie/
Parsing: http://ent.163.com/music/
Parsing: http://ent.163.com/tv/
Parsing: http://fashion.163.com/
Parsing: http://fushi.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
Error parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Error parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
Error parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
Error parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
Error parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
Parsing: http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
Parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
Error parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Error parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
Error parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
Error parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
Error parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
Error parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
Parsing: http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Parsing: http://game.163.com/
Parsing: http://gongyi.163.com/
Parsing: http://hea.163.com/
Parsing: http://home.163.com/
Parsing: http://house.163.com/
Parsing: http://lady.163.com/
Parsing: http://lady.163.com/beauty/
Parsing: http://lady.163.com/sense/
Parsing: http://mall.163.com/
Parsing: http://mobile.163.com/
Parsing: http://mobile.163.com/app/
Parsing: http://money.163.com/
Parsing: http://money.163.com/fund/
Parsing: http://money.163.com/hkstock/
Parsing: http://money.163.com/stock/
Parsing: http://news.163.com/
Parsing: http://news.163.com/photo/
Parsing: http://news.163.com/review/
Parsing: http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
Parsing: http://pay.163.com/
Parsing: http://photo.163.com/pp/square/
Parsing: http://product.auto.163.com/
Parsing: http://product.tech.163.com/mobile/
Parsing: http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
Parsing: http://reg.163.com/Main.jsp?username=pInfo
Parsing: http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
Parsing: http://reg.vip.163.com/enterMail.m?enterVip=true-----------
Parsing: http://sports.163.com/
Parsing: http://sports.163.com/cba/
Parsing: http://sports.163.com/nba/
Parsing: http://sports.163.com/yc/
Parsing: http://t.163.com/chat?f=163dh
Parsing: http://t.163.com/rank/daren?f=163dh
Parsing: http://t.163.com/rank?f=163dh
Parsing: http://tech.163.com/
Parsing: http://tech.163.com/cnstock/
Parsing: http://tech.163.com/digi/nb/
Parsing: http://travel.163.com/
Parsing: http://v.163.com/
Parsing: http://v.163.com/doc/
Parsing: http://v.163.com/focus/
Parsing: http://vipmail.163.com/
Parsing: http://war.163.com/
Parsing: http://www.163.com/rss/
Parsing: http://xf.house.163.com/gz/
Parsing: http://yc.163.com/
Parsing: http://yuehui.163.com/
ParseSegment: finished at 2012-02-07 14:22:26, elapsed: 00:00:06
CrawlDb update: starting at 2012-02-07 14:22:26
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142145]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:22:30, elapsed: 00:00:04
crawl finished: crawl
Comments

Where is that article on developing nutch1.4 in Eclipse?

I have my own write-up that I never got around to posting; I stopped working on this later, so the blog series was never continued. I'll post it when I have time.
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
How do I fix this? Please reply as soon as you see this, thanks!

Check whether plugin.folders in nutch-default.xml has been changed to ./src/plugin; the default is plugins. After fixing it, startup works normally.
This is usually a plugin-path problem.
OK. I've been swamped with work these past few days (I pulled an all-nighter on Monday), so I had no time to update. Really sorry...

I already have all of this written up.

Where? Could you let me have a look, and share whatever material you have on this for reference?

It's in a Word file on my own machine; I never got around to publishing it.
OK, I'll write it up tonight when I get home, if nothing else comes up, and I'll also post some of the problems I ran into earlier.

Thanks a lot.

No problem. A lot of the guys in the group are waiting for it too... I've just been too busy lately to find the time.
I'll write it tonight... I've been busy with a job change lately...

Thank you very much!!

I got back too late last night, so I didn't write it. I'll see how my time goes and get it out as soon as I can; I started at the new job today.

Take care of your own things first; I'm not in any great hurry!
Have you finished it yet? Also, which operating system is your Eclipse setup on?