Scrapy: scraping list pages and detail pages, and downloading images

Author: 马育民 • 2020-05-14 10:45

# Create the project

```
scrapy startproject csdn
```

# Create the spider

From the project root (the directory containing `scrapy.cfg`), run:

```
scrapy genspider article csdn.net
```

This generates an `article.py` file under the `spiders` folder.

# Edit the spider

Edit `article.py` so that it reads:

```
# -*- coding: utf-8 -*-
import uuid

import scrapy


class ArticleSpider(scrapy.Spider):
    name = 'article'
    allowed_domains = ['csdn.net']
    start_urls = ['https://www.csdn.net/nav/python']

    def parse(self, response):
        print("-" * 50)
        # Select the link of every article on the list page.
        # Note: an XPath starting with // always searches from the document
        # root; to search relative to a node, start the expression with ".//"
        elements = response.xpath('//ul[@id="feedlist_id"]/li/descendant::h2/a/@href')
        for element in elements:
            href = element.get()
            print(href)
            # Request each article URL; the downloaded response is
            # handled by parse_detail()
            yield scrapy.Request(href, callback=self.parse_detail)
        print("-" * 50)

    def parse_detail(self, response):
        title = response.xpath('//h1/text()').get()
        content_selector = response.xpath('//div[@id="article_content"]')
        content = content_selector.get()
        # Yield one item per image, carrying a generated file name
        img_src_selectors = content_selector.xpath(".//img/@src")
        for item in img_src_selectors:
            img_src = item.get()
            new_img_name = str(uuid.uuid1()).replace('-', '') + ".jpg"
            yield {"img_src": img_src, "pipeline": "ImgPipeline", "name": new_img_name}
        if title and content:
            title = title.strip()
            yield {"title": title, "content": content, "pipeline": "CsdnPipeline"}
```

**Note:**

1. Normally every enabled pipeline processes every item. To route each kind of item to the pipeline meant for it, a `pipeline` field is added to every item.
2. When images are downloaded by the pipeline, the default file name is the SHA1 hash of the image URL. To choose the file name yourself, pass it from the spider (here via the `name` field).

# Edit pipelines.py

1. `CsdnPipeline`: saves the title and body of each article as an HTML file.
2. `ImgPipeline`: downloads the images.

The file reads:

```
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class CsdnPipeline:
    path = "/Users/mym/Desktop/download_img/csdn/data"

    def process_item(self, item, spider):
        # Only handle items addressed to this pipeline; pass the rest on
        if item["pipeline"] != "CsdnPipeline":
            return item
        title = item["title"]
        content = item["content"]
        file_path = os.path.join(CsdnPipeline.path, title + ".html")
        with open(file_path, "wt", encoding="utf-8") as f:
            f.write(content)
        return item


class ImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Items addressed to other pipelines produce no download requests
        if item["pipeline"] != "ImgPipeline":
            return []
        img_src = item["img_src"]
        print("img_src:", img_src)
        # meta carries the file name chosen in the spider down to file_path()
        yield scrapy.Request(url=img_src, meta={'name': item['name']})

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per request
        for ok, result in results:
            if not ok:
                print("download failed:", result)
        return item

    def file_path(self, request, response=None, info=None):
        # Use the name passed from the spider instead of the SHA1 default
        return request.meta['name']
```

One caveat about using the raw title as a file name is covered at the end of this article.

# Edit settings.py

### Set DEFAULT_REQUEST_HEADERS

```
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
```

### Add the default image download directory

```
import os

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
IMAGES_STORE = os.path.join(BASE_DIR, "data/imgs")
```

### Enable the pipelines

Note that Scrapy's `ImagesPipeline` requires the Pillow library, so install it first (`pip install Pillow`).

```
ITEM_PIPELINES = {
    'csdn.pipelines.CsdnPipeline': 300,
    # the number is the execution order: lower numbers run first
    'csdn.pipelines.ImgPipeline': 305,
}
```

Original source: http://malaoshi.top/show_1EF5WIbUdH2b.html
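
# Verify the XPath in scrapy shell

Before running the full crawl, the XPath expressions above can be checked interactively with `scrapy shell` (standard Scrapy tooling). The selectors match CSDN's markup at the time of writing and may need adjusting if the page structure has changed:

```
scrapy shell "https://www.csdn.net/nav/python"
>>> response.xpath('//ul[@id="feedlist_id"]/li/descendant::h2/a/@href').getall()
```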
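
# Guard against unsafe titles

One pitfall the pipelines above do not handle: `CsdnPipeline` uses the article title directly as a file name, so a title containing a character such as `/` or `:` makes the `open()` call fail. A minimal sketch of a sanitizing helper (the name `safe_filename` is my own, not part of the original code):

```
import re


def safe_filename(title, max_len=100):
    # Replace characters that are illegal or risky in file names,
    # then cap the length so very long titles still work
    name = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return name[:max_len] or "untitled"
```

In `process_item` this would be used as `file_path = os.path.join(CsdnPipeline.path, safe_filename(title) + ".html")`.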
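
# Run the crawler

Scrapy's image store creates its directories on demand, but the plain `open()` call in `CsdnPipeline` fails if its target directory does not exist, so create that first. Then start the crawl from the project root:

```
mkdir -p /Users/mym/Desktop/download_img/csdn/data
scrapy crawl article
```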