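Python Reverse-Engineering Crawlers: urllib

urllib is an HTTP request library built into Python. The requests library is more convenient, but it is built on urllib3, a separate third-party package, rather than on urllib itself. Because urllib is the most basic request library and ships with every Python installation, it is still worth understanding how it works and how to use it.

The urllib package contains the following modules:

urllib.request – opens and reads URLs.
urllib.error – contains the exceptions raised by urllib.request.
urllib.parse – parses URLs.
urllib.robotparser – parses robots.txt files.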
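urllib.robotparser is not covered further in this article, so here is a minimal sketch of it. The robots.txt rules and the crawler name MyCrawler are made up for illustration, and the rules are parsed from an in-memory list instead of being downloaded, so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the fetch
allowed_public = parser.can_fetch("MyCrawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Checking can_fetch before each request is one simple way to keep a crawler within the limits a site publishes.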

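1. urllib.request

urllib.request defines functions and classes for opening URLs, covering authentication, redirects, cookies, and more.

1.1 The urlopen function

urllib.request can simulate the way a browser issues a request.

We can open a URL with the urlopen function of urllib.request. Its signature is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open, given as a string or a Request object.
data: additional data to send to the server, as bytes; defaults to None.
timeout: the timeout for the request, in seconds.
cafile and capath: cafile is the path to a file of CA certificates and capath is the path to a directory of CA certificate files; they are used for HTTPS requests. Both have been deprecated since Python 3.6 in favor of context and were removed in Python 3.13.
cadefault: deprecated and ignored; also removed in Python 3.13.
context: an ssl.SSLContext instance that specifies the SSL/TLS settings.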
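As a rough illustration of the timeout and context parameters, the sketch below wraps urlopen in a small helper. The helper name fetch_text and the 10-second default are assumptions made for this example, not part of urllib. The demo call uses a data: URL, which urlopen also accepts, so the sketch runs without network access; an https:// URL would be passed in the same way.

```python
import ssl
from urllib.request import urlopen


def fetch_text(url, timeout=10.0):
    """Fetch url and return its body decoded to str (hypothetical helper)."""
    # create_default_context() verifies certificates against the system CA store
    context = ssl.create_default_context()
    with urlopen(url, timeout=timeout, context=context) as response:
        # Fall back to UTF-8 when the response does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


print(fetch_text("data:text/plain;charset=utf-8,hello"))
```

Passing an explicit timeout is usually worthwhile in a crawler, since without one a stalled connection can block for a long time.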

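Example:

# -*- coding: utf-8 -*-
from urllib.request import urlopen

# Open the URL and print the raw response body (a bytes object)
data = urlopen(url="https://www.baidu.com/")
print(data.read())

The code above opens a URL with urlopen and then calls read() to get the page's HTML source as bytes.

read() reads the whole page. We can also pass the number of bytes to read:

# -*- coding: utf-8 -*-
from urllib.request import urlopen

data = urlopen(url="https://www.baidu.com/")
# Read only the first 100 bytes of the response body
print(data.read(100))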
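Because read() returns bytes, the result usually has to be decoded before it can be handled as text, and each read(n) call continues from where the previous one stopped. The sketch below shows both points. It uses a data: URL as a stand-in for a real page, an assumption made so that the example runs offline.

```python
from urllib.request import urlopen

# Stand-in for a real page: a data: URL whose body is the text "hello world"
with urlopen("data:text/plain;charset=utf-8,hello%20world") as response:
    first = response.read(5)   # the first 5 bytes
    rest = response.read()     # everything that is left

print(type(first))             # read() returns bytes, not str
print(first.decode("utf-8"))
print(rest.decode("utf-8"))
```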

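When crawling pages we often need to check whether a page can be reached. We can call getcode() to get the HTTP status code; 200 means the page was fetched normally. For an error status such as 404 (page not found), urlopen does not return a response but raises urllib.error.HTTPError, and the status is then available as the exception's code attribute.

# -*- coding: utf-8 -*-
from urllib.request import urlopen

data = urlopen(url="https://blog.abck8s.com/")
print(data.getcode())

Without the data argument, urlopen sends a GET request; once data is supplied, the request becomes a POST.

import urllib.request
import urllib.parse

# Form-encode the fields, then convert the string to bytes as urlopen requires
data1 = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data1)
print(response.read())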
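The switch from GET to POST can be observed without sending anything, because a Request object reports the method it would use through get_method(). A minimal sketch, reusing the httpbin.org host and form field from the POST example above; the /get path is an assumption, and nothing is actually requested here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode() builds the form body as str; Request and urlopen need bytes
payload = urlencode({"word": "hello"}).encode("utf-8")

get_request = Request("http://httpbin.org/get")
post_request = Request("http://httpbin.org/post", data=payload)

print(payload)                    # b'word=hello'
print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
```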

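1.2 The response type

urlopen returns a response object. Its status attribute holds the HTTP status code, getheaders() returns all response headers as a list of (name, value) tuples, and getheader(name) returns the value of a single header.

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # list of (name, value) tuples
print(response.getheader('Server'))  # value of the Server header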
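The response also exposes its headers through the headers attribute, a mapping-like message object with case-insensitive lookup. A small sketch follows. It uses a data: URL as a stand-in for a real server (an assumption, so that it runs offline), which means the status code is not meaningful here and only header access is shown.

```python
from urllib.request import urlopen

with urlopen("data:text/html;charset=utf-8,hello") as response:
    headers = response.headers

    # Header names are matched case-insensitively
    content_type = headers["content-type"]
    # Helpers inherited from email.message.Message split the value into parts
    media_type = headers.get_content_type()
    charset = headers.get_content_charset()

print(content_type)
print(media_type, charset)
```

get_content_charset() is handy when deciding how to decode the bytes returned by read().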

1.3 The Request object

To send a more complex request, urllib provides the Request object.

import urllib.request

# Create a Request object directly, passing the URL as its argument
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Here we create a Request object with the URL as its argument and then pass that object to urlopen.

For a more complex request, add headers.

The following uses a Request object to send a POST request:

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
data = {'word': 'hello'}
# Form-encode the dict before converting it to bytes; str(data) would send the Python repr instead
data = urllib.parse.urlencode(data).encode('utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
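Not every endpoint expects form-encoded data. As a sketch of one variation that is not part of the original example, the block below prepares a JSON body and sets the Content-Type header to match. The shortened User-Agent value is a placeholder. The Request is only constructed and inspected, not sent. Note that Request normalizes header names with str.capitalize(), so the headers are looked up as Content-type and User-agent.

```python
import json
from urllib.request import Request

payload = {"word": "hello"}
body = json.dumps(payload).encode("utf-8")  # JSON text encoded to bytes

req = Request(
    url="http://httpbin.org/post",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Headers can also be added after construction (placeholder value)
req.add_header("User-Agent", "Mozilla/5.0")

print(req.get_method())
print(req.get_header("Content-type"))
print(req.header_items())
```

Passing req to urllib.request.urlopen would then send it in the same way as the form-encoded request.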

1.4 Setting a Cookie

import urllib.request

url = 'https://s.weibo.com/weibo?q=%23%E5%A5%A5%E8%BF%90%E5%81%A5%E5%84%BF%E6%97%B6%E9%9A%94%E4%B8%80%E5%B9%B4%E5%86%8D%E6%88%98%E4%B8%96%E7%95%8C%E8%B5%9B%E5%9C%BA%23'

# Send the request with a cookie attached.
# The request line ('GET ... HTTP/1.1') is not a header, and urllib derives
# Host from the URL, so neither is listed here.
headers = {
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    # Session cookie copied from a logged-in browser; replace it with your own
    'Cookie': '_T_WM=c1913301844388de10cba9d0bb7bbf1e; SUB=_2A253Wy_dDeRhGeNM7FER-CbJzj-IHXVUp7GVrDV6PUJbkdANLXPdkW1NSesPJZ6v1GA5MyW2HEUb9ytQW3NYy19U; SUHB=0bt8SpepeGz439; SCF=Aua-HpSw5-z78-02NmUv8CTwXZCMN4XJ91qYSHkDXH4W9W0fCBpEI6Hy5E6vObeDqTXtfqobcD2D32r0O_5jSRk.; SSOLoginState=1516199821',
}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
# Print the whole response
# print(response.read().decode('gbk'))
# Write the content to a file
with open('weibo.html', 'wb') as fp:
    fp.write(response.read())
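Copying a Cookie header by hand works for a session that was opened in the browser. For cookies that the server sets while the script itself is running, the standard library's http.cookiejar can store them and send them back automatically. The sketch below is an addition to the original example and only builds the opener; the URL in the comment is a placeholder and no request is made.

```python
import http.cookiejar
import urllib.request

# The jar stores cookies that servers send in Set-Cookie headers
jar = http.cookiejar.CookieJar()

# An opener with HTTPCookieProcessor records cookies in the jar and
# attaches them to later requests made through the same opener
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# A request through this opener would look like this (not executed here):
#     response = opener.open("https://s.weibo.com/")
# Afterwards the received cookies can be listed by iterating over the jar.
for cookie in jar:
    print(cookie.name, cookie.value)

print(len(jar))  # the jar stays empty until a request has been made
```

This does not replace logging in: the jar only holds cookies received through the opener, so a login step still has to be performed through it first.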

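2. urllib.error

This module defines three exceptions that can be caught: URLError, HTTPError (a subclass of URLError), and ContentTooShortError.

URLError has a single attribute, reason.

HTTPError has three attributes: code, reason, and headers.

import urllib.request
from urllib import error

# Catch HTTPError first, then URLError: HTTPError is a subclass of URLError,
# so with the order reversed the HTTPError branch would never run
try:
    response = urllib.request.urlopen('http://123.com')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request succeeded!')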
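To make the subclass relationship and the attributes concrete without depending on a live server, the sketch below builds the two exceptions by hand and classifies them. The helper name describe, the example.com URL, and the error messages are assumptions for illustration; in real code urlopen raises these exceptions itself.

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Return a short description of a urllib exception (hypothetical helper)."""
    # Test for HTTPError first: it is also an instance of URLError
    if isinstance(exc, HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        return f"URL error: {exc.reason}"
    return f"other error: {exc!r}"


# Built by hand for this sketch; normally urlopen raises these
not_found = HTTPError("http://example.com/missing", 404, "Not Found", None, None)
unreachable = URLError("connection refused")

print(issubclass(HTTPError, URLError))
print(describe(not_found))
print(describe(unreachable))
```

The same ordering applies to except clauses: the more specific HTTPError has to come before URLError.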

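3. The urlparse function

This function splits the URL passed to it into six components (scheme, netloc, path, params, query, and fragment) and stores each one in a named field of the result.

import urllib.parse

result = urllib.parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)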
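The parsed result is a named tuple, so each component can be read by name. The sketch below reuses the URL from the example above and also passes the query string to parse_qs, a related helper in urllib.parse that the original text does not mention.

```python
from urllib.parse import parse_qs, urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")

# The six components are available as named attributes
print(result.scheme)    # http
print(result.netloc)    # www.baidu.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment

# parse_qs turns the query string into a dict that maps each key to a list
query = parse_qs(result.query)
print(query)            # {'id': ['5']}
```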

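4. The urlunparse function

This is the inverse of urlparse: it assembles a URL from its components.

from urllib.parse import urlunparse

# The six parts, in order: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))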
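A common use of the pair is to parse a URL, change one component, and put it back together. The sketch below shows the round trip and a query replacement; the replacement parameters (id=6, page=2) are made up for illustration.

```python
from urllib.parse import urlencode, urlparse, urlunparse

url = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlparse(url)

# Round trip: reassembling the parsed parts gives back the original URL
same = urlunparse(parts)

# ParseResult is a named tuple, so _replace() returns a modified copy
changed = urlunparse(parts._replace(query=urlencode({"id": 6, "page": 2})))

print(same)
print(changed)
```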

5. The urlencode function

This function converts a dictionary into the query string of a GET request.

from urllib.parse import urlencode

params = {
    'name': 'alvin',
    'age': 18
}
base_url = 'https://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
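urlencode also takes care of escaping, which is where the %23 and %E5... sequences in the Weibo search URL earlier in the article come from. The sketch below shows that behavior along with the related helpers quote, unquote, and parse_qs; the sample values are chosen for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

# Spaces are written as '+', and values are converted to str
print(urlencode({"word": "hello world", "page": 2}))  # word=hello+world&page=2

# doseq=True expands a sequence value into repeated keys
print(urlencode({"tag": ["a", "b"]}, doseq=True))      # tag=a&tag=b

# quote() percent-escapes a single string (non-ASCII text as UTF-8 bytes),
# and unquote() reverses it
escaped = quote("#奥运")
print(escaped)            # %23%E5%A5%A5%E8%BF%90
print(unquote(escaped))   # #奥运

# parse_qs() reverses urlencode(), returning every value as a list of strings
print(parse_qs("name=alvin&age=18"))  # {'name': ['alvin'], 'age': ['18']}
```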

Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/tech/pnotes/281043.html
