Getting Started with Scrapy - Teacher Ma Yumin

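# Introduction

This article covers the following:

1. Creating a project
2. Creating a spider
3. Parsing the page
4. Running the spider
5. Saving the scraped data (custom pipeline, JSON, CSV)
6. Following pagination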

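# Creating a project

Create an empty folder, then run the following command in a command prompt inside it:
```
scrapy startproject <project_name>
```

This generates the project files with the following directory structure: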

[![](https://www.malaoshi.top/upload/0/0/1EF5VBiy1nrX.png)](https://www.malaoshi.top/upload/0/0/1EF5VBiy1nrX.png)

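### Files and directories

```
scrapy.cfg: the project's configuration file.
test_scrapy/: the project's Python module; your code is imported from here.
    items.py: item definitions (the fields of the data to scrape).
    middlewares.py: middlewares, e.g. proxies for getting around anti-scraping measures.
    pipelines.py: item pipelines.
    settings.py: project settings, e.g. whether to obey robots.txt, request headers.
    spiders/: directory that holds the spider code.
```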

### What settings.py is for
To stop obeying robots.txt:
```
# Change the following setting to False
ROBOTSTXT_OBEY = False
```

Add a User-Agent; without a browser-like one, requests are often blocked by the web server:
```
# Uncomment the following block and add a User-Agent entry
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
}
```
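
For context on what `ROBOTSTXT_OBEY` controls: a robots.txt file lists the paths a site asks crawlers to avoid. The sketch below uses Python's standard `urllib.robotparser` rather than Scrapy itself, and the rules and `example.com` URLs are made up for illustration; they are not the contents of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a crawler matching "*" may fetch each URL under these rules
search_allowed = parser.can_fetch("*", "https://example.com/search?q=python")
about_allowed = parser.can_fetch("*", "https://example.com/about")
print(search_allowed, about_allowed)
```

With `ROBOTSTXT_OBEY = True`, Scrapy skips requests that the site's robots.txt disallows; setting it to `False` turns that check off.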

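# Creating a spider

Run the following inside the project folder, i.e. the directory in which you can see `scrapy.cfg` and the other project files:

```
scrapy genspider <spider_name> <domain>
```
**Note:**
1. The spider name must not be the same as the project name.
2. The domain restricts which URLs the spider is allowed to crawl.

Example:
```
scrapy genspider testspider github.com
```

This generates `testspider.py` in the `spiders` folder, as shown below: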

[![](https://www.malaoshi.top/upload/0/0/1EF5VC2yTqpq.png)](https://www.malaoshi.top/upload/0/0/1EF5VC2yTqpq.png)

### What the spider file contains
Opening the file shows:
```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['http://github.com/']

    def parse(self, response):
        pass
```

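1. The class inherits from `scrapy.Spider` and defines three attributes and one method. Only `name` is strictly required; `allowed_domains` and `start_urls` are optional but are generated by default.

2. `name = ""`: the spider's identifier. It must be unique, so different spiders must be given different names.

3. `allowed_domains = []`: the domains the spider is restricted to. Only pages under these domains are crawled; URLs outside them are ignored.

4. `start_urls = []`: a list (or tuple) of URLs to crawl. The spider starts here, so the first pages downloaded come from these URLs; further URLs are derived from the responses to these start URLs.

5. `parse(self, response)`: the parsing method, called after each start URL has been downloaded, with the `Response` object for that URL as its only argument. Its main jobs are:

    1. Parse the returned page (`response.body`) and extract structured data (produce items).
    2. Generate the request for the next page's URL.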
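
To illustrate what restricting a spider to `allowed_domains` means, here is a simplified sketch in plain Python. The helper `is_allowed` is hypothetical and is not Scrapy's actual implementation; it only mirrors the general rule that a listed domain and its subdomains are in scope, while everything else is out of scope.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration; not Scrapy's actual implementation.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["github.com"]
same_domain = is_allowed("https://github.com/search?q=python", allowed)
subdomain = is_allowed("https://gist.github.com/", allowed)
other_site = is_allowed("https://example.com/github.com", allowed)
print(same_domain, subdomain, other_site)
```

The last URL is rejected even though `github.com` appears in its path, because only the host name is compared.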

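# Parsing the page

The class in `testspider.py` inherits from `scrapy.Spider`; add the parsing code to its `parse(self, response)` method.

The method's `response` parameter supports the following ways of extracting data:
1. Regular expressions
2. XPath
    Tutorial: https://www.w3school.com.cn/xpath/index.asp
3. CSS selectors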
### Example:
```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?p=1&q=python&type=Repositories']

    def parse(self, response):
        # Select the text of each repository link in the search results
        titles = response.xpath("//div[@class='f4 text-normal']/a/text()")
        # print(titles)
        for item in titles:
            # Print for testing
            print(item.get())
```

# Running the spider

```
scrapy crawl <spider_name>
```

Example:
```
scrapy crawl testspider
```

Output:

[![](https://www.malaoshi.top/upload/0/0/1EF5VDmG5T0b.png)](https://www.malaoshi.top/upload/0/0/1EF5VDmG5T0b.png)

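# Saving data in a custom format

To save the scraped text to a file of your choice in a custom format, use a pipeline.

### Modify the spider's `parse()` method

Inside the spider's `parse()` method, add code that yields the scraped data (`yield <data>`).

Example: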
```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?p=1&q=python&type=Repositories']

    def parse(self, response):
        datas = []
        titles = response.xpath("//div[@class='f4 text-normal']/a/text()")
        # print(titles)
        for item in titles:
            # Print for testing
            # print(item.get())
            datas.append(item.get())

        # The data must be wrapped in a dict before it is yielded
        ret = {"datas": datas}
        yield ret
```

### Modify pipelines.py
In Scrapy, pipelines are used to save data.
Open `pipelines.py`; its contents are as follows:
```
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

class TestScrapyPipeline:
    def process_item(self, item, spider):
        """
        Called each time the spider's parse() method yields a value.
        """
        return item
```
Add two methods:
```
open_spider(self, spider)   called when the spider starts running

close_spider(self, spider)  called when the spider closes
```
The modified file:
```
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

class TestScrapyPipeline:

    def open_spider(self, spider):
        # Create data.txt with UTF-8 encoding
        self.file = open('data.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        """
        item: the data yielded by parse() in testspider.py
        """
        # Write one title per line; items without a 'datas' key are skipped
        for title in item.get('datas', []):
            self.file.write(title + '\n')
        # Return the item so later pipelines and feed exports still receive it
        return item
```
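
The order in which the three hooks are called can be tried out without running a crawl. The sketch below is a plain-Python simulation, not Scrapy code: `LinesPipeline`, `run_fake_crawl`, the temporary output path, and the sample items are all hypothetical stand-ins.

```python
import os
import tempfile


class LinesPipeline:
    """Stand-in pipeline with the same three hooks; name and path are hypothetical."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Runs once, before any item is processed
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs once per yielded item
        for title in item.get("datas", []):
            self.file.write(title + "\n")
        return item

    def close_spider(self, spider):
        # Runs once, after the last item
        self.file.close()


def run_fake_crawl(pipeline, items):
    """Call the hooks in order: open once, process each item, close once."""
    pipeline.open_spider(spider=None)
    try:
        for item in items:
            pipeline.process_item(item, spider=None)
    finally:
        pipeline.close_spider(spider=None)


out_path = os.path.join(tempfile.mkdtemp(), "data.txt")
run_fake_crawl(LinesPipeline(out_path), [{"datas": ["a/", "b/"]}, {"datas": ["c/"]}])

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file in `open_spider` and closing it in `close_spider` keeps a single file handle for the whole crawl, instead of reopening the file for every item.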

### Modify settings.py
Uncomment the following block to enable the pipeline:
```
ITEM_PIPELINES = {
   'test_scrapy.pipelines.TestScrapyPipeline': 300,
}
```

### Run

```
scrapy crawl testspider
```
You will find `data.txt` in the project directory.

# Adding pagination

In the `parse()` method of the `TestspiderSpider` class, extract the link to the next page and issue a request for it:

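```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?p=1&q=python&type=Repositories']

    def parse(self, response):
        # Same extraction as in the previous section
        titles = response.xpath("//div[@class='f4 text-normal']/a/text()")
        datas = [item.get() for item in titles]

        ret = {"datas": datas}
        print("ret::", ret)
        yield ret

        # Get the URL of the next page
        next_page = response.xpath("//a[@class='next_page']/@href").get()
        if next_page:
            # Build the full URL from the link
            next_page_url = response.urljoin(next_page)
            # Issue the request and handle the response with this same method
            print("next page")
            yield scrapy.Request(next_page_url, callback=self.parse)
```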
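
`response.urljoin()` resolves a link against the URL of the current response, in roughly the same way as the standard library's `urllib.parse.urljoin`. The sketch below shows that resolution step on its own; the `next_href` value is a hypothetical example of what a "next page" link might contain.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL used in this article)
current_url = "https://github.com/search?p=1&q=python&type=Repositories"

# Hypothetical href value of a "next page" link
next_href = "/search?p=2&q=python&type=Repositories"

# A relative link is resolved against the current URL
next_page_url = urljoin(current_url, next_href)
print(next_page_url)

# An already absolute link is returned unchanged
absolute_url = urljoin(current_url, "https://github.com/explore")
print(absolute_url)
```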

Run:

```
scrapy crawl testspider
```
You will find `data.txt` in the project directory.

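# Running and saving the data to a JSON file

### Modify the spider's `parse()` method

Inside the spider's `parse()` method, add code that yields the scraped data (`yield <data>`).

Example: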
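```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?p=1&q=python&type=Repositories']

    def parse(self, response):
        titles = response.xpath("//div[@class='f4 text-normal']/a/text()")
        datas = [item.get() for item in titles]

        # The data must be wrapped in a dict before it is yielded
        ret = {"datas": datas}
        yield ret
```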

### Run the command
```
scrapy crawl <spider_name> -o <file_name>.json
```
Example:
```
scrapy crawl testspider -o 1.json
```

### Result

You will find `1.json` in the project directory, with the following contents:
```
[
{"datas": ["TheAlgorithms/", "geekcomputers/", "injetlee/", "TwoWater/", "Show-Me-the-Code/", "kubernetes-client/", "xxg1413/", "jakevdp/", "DataScienceHandbook", "docker-library/", "joeyajames/"]}
]
```
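
The exported file is a JSON array with one object per yielded item, so it can be read back with the standard `json` module. A minimal sketch, using a shortened copy of the output above as an inline string instead of reading the file from disk:

```python
import json

# Shortened copy of the exported JSON (first three titles only)
raw = '[\n{"datas": ["TheAlgorithms/", "geekcomputers/", "injetlee/"]}\n]'

# The top level is a list with one dict per yielded item
items = json.loads(raw)

# Flatten the "datas" lists of all items into a single list of titles
titles = [title for item in items for title in item["datas"]]
print(len(items), titles)
```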

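# Running and saving the data to a CSV file

### Modify the spider's `parse()` method
Inside the spider's `parse()` method, add code that yields the scraped data (`yield <data>`).

Example: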
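```
# -*- coding: utf-8 -*-
import scrapy

class TestspiderSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/search?p=1&q=python&type=Repositories']

    def parse(self, response):
        titles = response.xpath("//div[@class='f4 text-normal']/a/text()")
        datas = [item.get() for item in titles]

        # The data must be wrapped in a dict before it is yielded
        ret = {"datas": datas}
        yield ret
```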

### Run the command
```
scrapy crawl <spider_name> -o <file_name>.csv
```
Example:
```
scrapy crawl testspider -o 1.csv
```

### Result

You will find `1.csv` in the project directory, with the following contents:
```
datas
"TheAlgorithms/,geekcomputers/,injetlee/,TwoWater/,Show-Me-the-Code/,kubernetes-client/,xxg1413/,jakevdp/,DataScienceHandbook,docker-library/,joeyajames/"
```
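
Because the spider yields a single dict whose `datas` value is a list, the CSV contains one row with all titles joined into a single cell. If one row per title is preferred, one option is to yield one dict per title instead. The sketch below shows the idea with plain Python and the standard `csv` module, without Scrapy; `iter_rows` and the `title` field name are hypothetical.

```python
import csv
import io


def iter_rows(titles):
    """Yield one dict per title, the shape a spider could yield instead of one list."""
    for title in titles:
        yield {"title": title}


titles = ["TheAlgorithms/", "geekcomputers/", "injetlee/"]

# Write the rows to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title"], lineterminator="\n")
writer.writeheader()
writer.writerows(iter_rows(titles))

csv_text = buffer.getvalue()
print(csv_text)
```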

Original source: http://malaoshi.top/show_1EF5VE00LA42.html