saiyaren

Deploying and Using Nutch 1.4


 

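Nutch 1.4 was officially released on November 26, 2011. It updates some content and configuration but differs little from 1.3; the gap from 1.2 and earlier is much larger. Since Nutch 1.3, indexes are built with Solr and queries go through Solr as well, so the web search service that shipped with Nutch 1.2 and earlier is no longer needed.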

First, download the latest release, Nutch 1.4, from the official Nutch site.

The address is:

http://www.apache.org/dyn/closer.cgi/nutch/

 

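Either apache-nutch-1.4-bin.zip or apache-nutch-1.4-bin.tar.gz will do.

After downloading, extract the archive. This article covers usage on Linux first; the next one will cover Nutch development in Eclipse.

Extracting the archive produces the Nutch directory tree.


Next, go into nutch/runtime/local. It contains a conf folder, which holds the configuration files.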



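Only two of these files matter here:

nutch-default.xml and regex-urlfilter.txt

nutch-default.xml is Nutch's configuration file. Nutch also reads nutch-site.xml, whose entries override these defaults, but this walkthrough edits nutch-default.xml directly.

regex-urlfilter.txt holds the rules that decide which URLs Nutch crawls.

Because this is a first test crawl, no other settings need tuning. Only the following is required:

In nutch-default.xml, find the http.agent.name property and fill in its value: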

 

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>jdodrc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

 

 

If this value is left empty, running Nutch fails with the following error:

 

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
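
A quick pre-flight check can catch this before a crawl is started. The sketch below is not part of Nutch: `agent_name` is a hypothetical helper, and it assumes the `<value>` element sits on the line directly after the `<name>` element, as in the property shown above.

```shell
# Hypothetical helper (not part of Nutch): print the value configured for
# http.agent.name in the given config file. Assumes the <value> element is
# on the line right after the <name> element.
agent_name() {
  grep -A1 '<name>http.agent.name</name>' "$1" |
    sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

conf_file=conf/nutch-default.xml   # path relative to runtime/local
if [ -f "$conf_file" ]; then
  name=$(agent_name "$conf_file")
  if [ -z "$name" ]; then
    echo "http.agent.name is empty; the Fetcher will refuse to start"
  else
    echo "http.agent.name = $name"
  fi
fi
```

Run it from runtime/local; when the config file is not found, the check is simply skipped.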
 

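With the property set, the crawl rules still need configuring. Suppose we want to crawl www.163.com but not every link found on it; Sohu ads, for example, are of no interest. To fetch only 163 content we need crawl rules, which are written as regular expressions (regular expressions themselves are not explained here).

So where do the rules go?

They go in regex-urlfilter.txt: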

 

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

This rule filters out URLs by file extension.

Crawling dynamic pages:

 

# skip URLs containing certain characters as probable queries, etc.
# (to crawl dynamic pages, keep the following rule commented out, as it is here)
#-[?*!@=]
-[~]

 

The page-link filter rule comes next; the following one restricts the crawl to the 163 site:

 

 

# accept anything else
#+^http://([a-z0-9]*\.)*(.*\.)*.*/
+^http://([a-z0-9]*\.)*163\.com
 

For a test run, changing this filter rule is all that is needed.
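
To see what the accept rule lets through, it can be tried against a few sample URLs outside Nutch. This is only a sketch: it uses `grep -E` as a stand-in for Nutch's Java regex engine (this particular pattern behaves the same in both), `matches_rule` is a hypothetical helper, and the sample URLs other than www.163.com are made up for illustration. It tests this one rule in isolation; in the real filter file the first matching rule decides, and a URL that matches no rule is rejected.

```shell
# The accept rule from regex-urlfilter.txt, without the leading "+".
rule='^http://([a-z0-9]*\.)*163\.com'

# Hypothetical helper: succeeds (exit status 0) when the URL matches the rule.
matches_rule() {
  printf '%s\n' "$1" | grep -Eq "$rule"
}

for url in \
  http://www.163.com/ \
  http://news.163.com/photo/ \
  http://www.sohu.com/ \
  http://www.edu-163.com/
do
  if matches_rule "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the last two rejected. Any host under 163.com passes, which includes ad-redirect hosts on that domain.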

 

Once http.agent.name in nutch-default.xml is set

and the regex rules in regex-urlfilter.txt are in place,

make the scripts under runtime/local/bin executable on Linux.

Change into the bin directory and run:

chmod +x *.sh

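This makes every .sh script executable. If the launcher in bin is the extensionless nutch script used below, the *.sh glob will not match it, so run chmod +x nutch as well.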

 

Now for a test:

Under runtime/local, create a directory named urls, create a file named test inside it, and put the entry URL of the site to crawl in that file:

 

http://www.163.com/
 

Save the file. The local directory now contains a urls directory holding one seed file.
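
From the command line, the same seed setup might look like the following sketch, run from runtime/local. The directory and file names are the ones used in the text.

```shell
# Create the seed directory and write the entry URL into the seed file.
mkdir -p urls
printf '%s\n' 'http://www.163.com/' > urls/test

# Show what the injector will read.
cat urls/test
```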

Now we can run a test crawl.

Before that, it helps to understand Nutch's crawl parameters:

Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

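Items in [] are optional.

urlDir is the location of the seed file(s).

-solr <solrUrl> is the Solr URL (leave it out if you have none).

-dir is where the crawl data is stored.

-threads is the number of fetcher threads (more is not always better; use only as many as you need; the default is 10).

-depth is the crawl depth, that is, the number of fetch rounds (default 5).

-topN is the crawl breadth, that is, the maximum number of URLs fetched per round (default Long.MAX_VALUE).

The bin directory contains a shell script named nutch; its crawl argument launches the crawl class:

Let's try a crawl. The current directory is nutch/runtime/local.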

 

 

bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
 

To keep the output for later inspection, append >& followed by the output file to the end of the command.
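
As a sketch of that redirect, the command can be assembled from variables and its output sent to a file. The variable names and the log file name crawl.log are assumptions made for illustration; the option values mirror the command above. The portable `> file 2>&1` form is used here in place of `>&`, which not every shell accepts.

```shell
# Hypothetical variable names; the values mirror the crawl command above.
SEED_DIR=urls
SOLR_URL=http://localhost:8080/solr/
CRAWL_DIR=crawl
LOG_FILE=crawl.log   # assumed log file name

crawl_cmd="bin/nutch crawl $SEED_DIR -solr $SOLR_URL -dir $CRAWL_DIR -depth 2 -threads 5 -topN 100"

if [ -x bin/nutch ]; then
  # Send both stdout and stderr to the log file.
  $crawl_cmd > "$LOG_FILE" 2>&1
else
  # Dry run when the launcher is not available in the current directory.
  echo "bin/nutch not found; would run: $crawl_cmd > $LOG_FILE 2>&1"
fi
```

Run it from runtime/local so that the relative paths resolve.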

 

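Solr has to be configured separately; I will explain how to deploy it in a Solr article. For -solr here, just supply the URL of your Solr instance.


To learn about deploying Solr, see the Solr deployment article.

To test or develop on Windows, install Cygwin first; I will cover installing Cygwin in the article on deploying Nutch 1.4 in Eclipse.

Test output (this captured run reports rootUrlDir = urls/test.txt and threads = 10, so it evidently came from a slightly different invocation than the command above):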

 

crawl started in: crawl
rootUrlDir = urls/test.txt
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-02-07 14:21:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls/test.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-07 14:21:25, elapsed: 00:00:04
Generator: starting at 2012-02-07 14:21:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142128
Generator: finished at 2012-02-07 14:21:30, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:30
Fetcher: segment: crawl/segments/20120207142128
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:21:36, elapsed: 00:00:05
ParseSegment: starting at 2012-02-07 14:21:36
ParseSegment: segment: crawl/segments/20120207142128
Parsing: http://www.163.com/
ParseSegment: finished at 2012-02-07 14:21:39, elapsed: 00:00:03
CrawlDb update: starting at 2012-02-07 14:21:39
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142128]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:21:42, elapsed: 00:00:03
Generator: starting at 2012-02-07 14:21:42
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142145
Generator: finished at 2012-02-07 14:21:48, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:48
Fetcher: segment: crawl/segments/20120207142145
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
Using queue mode : byHost
QueueFeeder finished: total 97 records + hit by time limit :0
Using queue mode : byHost
fetching http://bbs.163.com/
Using queue mode : byHost
fetching http://bbs.163.com/rank/
Using queue mode : byHost
fetching http://tech.163.com/cnstock/
Using queue mode : byHost
fetching http://tech.163.com/
Using queue mode : byHost
fetching http://tech.163.com/digi/nb/
Using queue mode : byHost
Using queue mode : byHost
fetching http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
fetching http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Using queue mode : byHost
fetching http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Using queue mode : byHost
fetching http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
fetching http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
fetching http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
fetching http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
fetching http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
fetching http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
fetching http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
fetching http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
fetching http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
fetching http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
fetching http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
fetching http://mobile.163.com/
fetching http://mobile.163.com/app/
fetching http://reg.vip.163.com/enterMail.m?enterVip=true-----------
fetching http://product.tech.163.com/mobile/
fetching http://hea.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=68
fetching http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=67
fetching http://yuehui.163.com/
fetching http://auto.163.com/
fetching http://auto.163.com/buy/
fetching http://gongyi.163.com/
fetching http://reg.163.com/Main.jsp?username=pInfo
fetching http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=61
fetching http://money.163.com/fund/
fetching http://money.163.com/stock/
fetching http://money.163.com/hkstock/
fetching http://money.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=57
fetching http://blog.163.com/passportIn.do?entry=163
fetching http://blog.163.com/?fromNavigation
fetching http://pay.163.com/
fetching http://baby.163.com/
fetching http://discovery.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=52
fetching http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
fetching http://help.163.com?b01abh1
fetching http://www.163.com/rss/
fetching http://home.163.com/
fetching http://product.auto.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=47
fetching http://ecard.163.com/
fetching http://photo.163.com/?username=pInfo
fetching http://photo.163.com/pp/square/
fetching http://email.163.com/
fetching http://m.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=42
fetching http://edu.163.com/
fetching http://edu.163.com/liuxue/
fetching http://xf.house.163.com/gz/
fetching http://game.163.com/
fetching http://travel.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=37
fetching http://baoxian.163.com/?from=index
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
fetching http://zx.caipiao.163.com?from=shouye
fetching http://entry.mail.163.com/coremail/fcg/ntesdoor2?verifycookie=1&lightweight=1
fetching http://biz.163.com/
fetching http://t.163.com/rank?f=163dh
fetching http://t.163.com/chat?f=163dh
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=31
fetching http://t.163.com/?f=wstopmicoblogmsg
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
fetching http://t.163.com/rank/daren?f=163dh
fetching http://t.163.com/?f=wstopmicoblogmsg.enter
fetching http://t.163.com/
fetching http://sports.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=26
fetching http://sports.163.com/nba/
fetching http://sports.163.com/cba/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=24
fetching http://sports.163.com/yc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=23
fetching http://vipmail.163.com/
fetching http://digi.163.com/
fetching http://lady.163.com/beauty/
fetching http://lady.163.com/
fetching http://lady.163.com/sense/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=18
fetching http://house.163.com/
fetching http://news.163.com/review/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=16
fetching http://news.163.com/photo/
fetching http://news.163.com/
fetching http://v.163.com/doc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13
fetching http://v.163.com/zongyi/
fetching http://v.163.com/
fetching http://v.163.com/focus/
fetching http://fushi.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=9
fetching http://yc.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=8
fetching http://mall.163.com/
fetching http://ent.163.com/movie/
fetching http://ent.163.com/
fetching http://ent.163.com/music/
fetching http://ent.163.com/tv/
fetching http://war.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=2
* queue: http://fashion.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728444
  0. http://fashion.163.com/
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728445
  0. http://book.163.com/
fetching http://fashion.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=1
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595729445
  0. http://book.163.com/
fetching http://book.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:22:20, elapsed: 00:00:32
ParseSegment: starting at 2012-02-07 14:22:20
ParseSegment: segment: crawl/segments/20120207142145
Parsing: http://auto.163.com/
Parsing: http://auto.163.com/buy/
Parsing: http://baby.163.com/
Parsing: http://baoxian.163.com/?from=index
Parsing: http://bbs.163.com/
Parsing: http://bbs.163.com/rank/
Parsing: http://biz.163.com/
Parsing: http://blog.163.com/?fromNavigation
Parsing: http://book.163.com/
Parsing: http://digi.163.com/
Parsing: http://discovery.163.com/
Parsing: http://edu.163.com/
Parsing: http://edu.163.com/liuxue/
Parsing: http://email.163.com/
Parsing: http://ent.163.com/
Parsing: http://ent.163.com/movie/
Parsing: http://ent.163.com/music/
Parsing: http://ent.163.com/tv/
Parsing: http://fashion.163.com/
Parsing: http://fushi.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
Error parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Error parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
Error parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
Error parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
Error parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
Parsing: http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
Parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
Error parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Error parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
Error parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
Error parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
Error parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
Error parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
Parsing: http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Parsing: http://game.163.com/
Parsing: http://gongyi.163.com/
Parsing: http://hea.163.com/
Parsing: http://home.163.com/
Parsing: http://house.163.com/
Parsing: http://lady.163.com/
Parsing: http://lady.163.com/beauty/
Parsing: http://lady.163.com/sense/
Parsing: http://mall.163.com/
Parsing: http://mobile.163.com/
Parsing: http://mobile.163.com/app/
Parsing: http://money.163.com/
Parsing: http://money.163.com/fund/
Parsing: http://money.163.com/hkstock/
Parsing: http://money.163.com/stock/
Parsing: http://news.163.com/
Parsing: http://news.163.com/photo/
Parsing: http://news.163.com/review/
Parsing: http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
Parsing: http://pay.163.com/
Parsing: http://photo.163.com/pp/square/
Parsing: http://product.auto.163.com/
Parsing: http://product.tech.163.com/mobile/
Parsing: http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
Parsing: http://reg.163.com/Main.jsp?username=pInfo
Parsing: http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
Parsing: http://reg.vip.163.com/enterMail.m?enterVip=true-----------
Parsing: http://sports.163.com/
Parsing: http://sports.163.com/cba/
Parsing: http://sports.163.com/nba/
Parsing: http://sports.163.com/yc/
Parsing: http://t.163.com/chat?f=163dh
Parsing: http://t.163.com/rank/daren?f=163dh
Parsing: http://t.163.com/rank?f=163dh
Parsing: http://tech.163.com/
Parsing: http://tech.163.com/cnstock/
Parsing: http://tech.163.com/digi/nb/
Parsing: http://travel.163.com/
Parsing: http://v.163.com/
Parsing: http://v.163.com/doc/
Parsing: http://v.163.com/focus/
Parsing: http://vipmail.163.com/
Parsing: http://war.163.com/
Parsing: http://www.163.com/rss/
Parsing: http://xf.house.163.com/gz/
Parsing: http://yc.163.com/
Parsing: http://yuehui.163.com/
ParseSegment: finished at 2012-02-07 14:22:26, elapsed: 00:00:06
CrawlDb update: starting at 2012-02-07 14:22:26
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142145]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:22:30, elapsed: 00:00:04
crawl finished: crawl
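
A saved log like the one above can be summarized with a few `grep` counts. This is a minimal sketch: `summarize_crawl_log` is a hypothetical helper, crawl.log is an assumed file name, and the patterns are taken from the line prefixes visible in the output above.

```shell
# Hypothetical helper: count fetch attempts, fetch failures, and parse errors
# in a saved crawl log, based on the line prefixes seen in the output above.
summarize_crawl_log() {
  printf 'fetched: %s\n' "$(grep -c '^fetching ' "$1")"
  printf 'fetch failures: %s\n' "$(grep -c '^fetch of .* failed with' "$1")"
  printf 'parse errors: %s\n' "$(grep -c '^Error parsing: ' "$1")"
}

log_file=crawl.log   # assumed log file name
if [ -f "$log_file" ]; then
  summarize_crawl_log "$log_file"
fi
```

In the run shown here, the parse errors all come from g.163.com ad-redirect URLs, which the 163.com accept rule lets through.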


 
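Comments
#25 oMoChi_10 2013-04-22
Is importing nutch1.6 into MyEclipse the same? I'm doing this for my graduation project, and my supervisor also wants me to modify the source code, which I have no idea how to do. I'm looking here to see whether any experts work on this.
#24 saiyaren 2013-03-11
shantouyyt wrote:
Is there any more?
Where did you cover Nutch 1.4 development in Eclipse?

I have the document myself but never posted it. I stopped working on this afterwards, so the blog was never continued. I'll post it when I have time.
#23 shantouyyt 2013-03-07
Is there any more?
Where did you cover Nutch 1.4 development in Eclipse?
#22 青花瓷101 2012-08-02
Really well written, I'll keep following.
#21 saiyaren 2012-05-23
youzhibing wrote:
Hi, I configured everything the way you described, so why do I get this output?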
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
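How do I fix this? Please reply quickly if you see it, thanks!

Check whether plugin.folders in nutch-default.xml has been changed to ./src/plugin (the default is plugins).
After changing it, startup works normally.
This is usually a plugin path problem!
#20 youzhibing 2012-05-22
Hi, I configured everything the way you described, so why do I get this output?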
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
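How do I fix this? Please reply quickly if you see it, thanks!
#19 saiyaren 2012-04-12
yaochanghong wrote:
You're great. I've been struggling with this for days without getting the results I wanted. Please upload your follow-up documents soon. We're all waiting.

OK. I've been busy with work these past few days and pulled an all-nighter on Monday, so I haven't had time to update. Sorry about that.
#18 yaochanghong 2012-04-11
You're great. I've been struggling with this for days without getting the results I wanted. Please upload your follow-up documents soon. We're all waiting.
#17 saiyaren 2012-04-11
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
The environment seems to be set up, and Solr too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary query page (just a search box) whose results look like a typical search engine's?

I have all of this written up already.

Where? Could I take a look? Please share your material on this so I can learn from it!

It's in a Word document on my machine that I never published.
#16 youzhibing 2012-04-10
saiyaren wrote:
youzhibing wrote:
The environment seems to be set up, and Solr too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary query page (just a search box) whose results look like a typical search engine's?

I have all of this written up already.

Where? Could I take a look? Please share your material on this so I can learn from it!
#15 saiyaren 2012-04-10
youzhibing wrote:
The environment seems to be set up, and Solr too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary query page (just a search box) whose results look like a typical search engine's?

I have all of this written up already.
#14 youzhibing 2012-04-09
The environment seems to be set up, and Solr too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary query page (just a search box) whose results look like a typical search engine's?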
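#13 saiyaren 2012-04-06
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Could you find some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight if nothing else comes up, and I'll also post some of the problems I ran into earlier.

Thank you very much.

No problem. A lot of people in the group are waiting for it too. I've just been too busy lately to find the time.
#12 youzhibing 2012-04-06
saiyaren wrote:
youzhibing wrote:
Could you find some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight if nothing else comes up, and I'll also post some of the problems I ran into earlier.

Thank you very much.
#11 saiyaren 2012-04-06
youzhibing wrote:
Could you find some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight if nothing else comes up, and I'll also post some of the problems I ran into earlier.
#10 youzhibing 2012-04-05
Could you find some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!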
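#9 youzhibing 2012-03-31
youzhibing wrote:
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!

I got back too late last night, so I didn't write it. I'll check my schedule and write it as soon as I can. I started my new job today.

Take care of your own things first, I'm not in a great hurry!
#8 saiyaren 2012-03-31
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!

I got back too late last night, so I didn't write it. I'll check my schedule and write it as soon as I can. I started my new job today.
#7 youzhibing 2012-03-30
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!
#6 youzhibing 2012-03-30
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Have you written it yet? Also, which operating system is your Eclipse setup on?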
