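Python's built-in networking library, urllib, covers basic needs, but it is cumbersome for tasks such as cookie handling and proxy configuration. It also does not decode gzip/deflate responses by default.
Most sites today support gzip compression to reduce transfer time. To receive a compressed response with urllib, you must declare Accept-Encoding in the request headers, then check the Content-Encoding header of the response to decide whether the body needs to be decompressed.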
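The manual handling described above can be sketched with the standard library alone. This is a minimal illustration rather than a complete client: `decode_body` and `fetch` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import gzip
import zlib
from urllib.request import Request, urlopen


def decode_body(raw: bytes, content_encoding: str) -> bytes:
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Usual case: a zlib-wrapped deflate stream
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send raw deflate without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the bytes untouched
    return raw


def fetch(url: str) -> bytes:
    """Ask for a compressed response with urllib and decode it by hand."""
    request = Request(url, headers={"Accept-Encoding": "gzip, deflate"})
    with urlopen(request, timeout=10) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding", ""))
```

Calling `fetch("https://www.baidu.com")` would return the decompressed bytes. With requests, none of this is needed, because gzip and deflate bodies are decoded automatically.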
Cookies
```python
import json

import requests as req

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
url = "https://www.baidu.com"
resp = req.get(url=url, headers=headers)

# resp.cookies is a RequestsCookieJar
cookies = resp.cookies
print(type(cookies))
print(cookies)
for name, value in cookies.items():
    print("Cookie key/value:", name, value)

# Convert the CookieJar to a plain dict
cookies_dict = req.utils.dict_from_cookiejar(cookies)
print(cookies_dict)

# Save the cookies to a file
with open("cookies.txt", "w") as f:
    f.write(json.dumps(cookies_dict))

# Load the cookies back from the same file
with open("cookies.txt", "r") as f:
    load_cookies = json.loads(f.read())

# Convert the dict back to a CookieJar
exchange_cookies = req.utils.cookiejar_from_dict(load_cookies)
print(exchange_cookies)

# Send cookies with a request (a CookieJar or a plain dict both work)
n_resp = req.get(url=url, headers=headers, cookies=cookies)
```
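Session persistence
If passing cookies on every request is tedious, try a Session, which handles cookies automatically. It is like logging in on one browser page and making all later requests from that page, so you stay logged in.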
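```python
import requests

# Requests made through the same Session share cookies
session = requests.Session()
login_resp = session.get("login page URL")          # placeholder, use a real URL
user_info_resp = session.get("user info page URL")  # placeholder, use a real URL

# Cookies can also be set on the session by hand
url = 'https://www.httpbin.org/cookies'
session = requests.Session()
cook = {'eare': 'https://www.httpbin.org'}
session.cookies.update(cook)
```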
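To see that a Session really attaches its stored cookies to later requests, the request can be prepared without being sent and its headers inspected. This is a small sketch that needs no network access; the `token` cookie is a made-up stand-in for whatever a real login response would set, and `Session.prepare_request` is used here only to look at the outgoing headers.

```python
import requests

session = requests.Session()

# Hypothetical cookie, standing in for one set by a login response
session.cookies.update({'token': 'abc123'})

# Build the request through the session, but do not send it
request = requests.Request('GET', 'https://www.httpbin.org/cookies')
prepared = session.prepare_request(request)

# The session has added a Cookie header on our behalf
print(prepared.headers.get('Cookie'))
```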
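Authentication
When accessing a site that requires a verified login, you will occasionally be met with an authentication prompt asking for a username and password.
In that case you can log in with the authentication support built into requests.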
```python
import requests
from requests.auth import HTTPBasicAuth
from requests.auth import HTTPDigestAuth
from requests_oauthlib import OAuth1  # third-party: pip install requests-oauthlib

# Basic authentication ('site URL', 'username', 'password' are placeholders)
requests.get('site URL', auth=HTTPBasicAuth('username', 'password'))

# Shorthand: a (username, password) tuple is treated as basic authentication
requests.get('site URL', auth=('username', 'password'))

# Digest authentication
url = 'http://httpbin.org/digest-auth/auth/user/pass'
resp = requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
print(resp.status_code)

# OAuth 1 authentication
url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1('YOUR_APP_KEY', 'YOUR_APP_SECRET',
              'USER_OAUTH_TOKEN', 'USER_OAUTH_TOKEN_SECRET')
requests.get(url, auth=auth)
```
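It can help to see what basic authentication actually puts on the wire. The sketch below prepares two requests without sending them and compares the resulting Authorization header, so it runs offline; the `user`/`pass` credentials are illustrative values borrowed from httpbin's test endpoint.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = 'http://httpbin.org/basic-auth/user/pass'

# Prepare (but do not send) one request per authentication style
explicit = requests.Request('GET', url, auth=HTTPBasicAuth('user', 'pass')).prepare()
shorthand = requests.Request('GET', url, auth=('user', 'pass')).prepare()

# Basic auth is "user:pass", base64-encoded, in the Authorization header
expected = 'Basic ' + base64.b64encode(b'user:pass').decode('ascii')

print(explicit.headers['Authorization'])
print(explicit.headers['Authorization'] == shorthand.headers['Authorization'] == expected)
```

Because the credentials are only encoded, not encrypted, basic authentication should be used over HTTPS.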
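SSL certificate verification
More and more sites use HTTPS. If a site's HTTPS certificate is misconfigured, or the certificate was not issued by a trusted CA, an SSL certificate error is reported. In a browser this shows up as "Your connection is not private"; in requests it raises an error like this: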
requests.exceptions.SSLError: HTTPSConnectionPool(host='xxx.xxx.xxx', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
You can set verify=False to skip certificate verification. After skipping it, you may still see a warning:
/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py:857: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)
If the warning is a nuisance, you can suppress it by disabling warnings:
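```python
import requests
import urllib3

# Silence urllib3 warnings such as InsecureRequestWarning
urllib3.disable_warnings()

test_url = 'https://xxx.xxx.xxx'  # placeholder host, substitute the site you are testing
resp = requests.get(test_url, verify=False)
```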
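You can also specify a local certificate to use as the client certificate:
response = requests.get(url, cert=('/path/server.crt', '/path/server.key'))
The private key of the local certificate must be unencrypted; encrypted keys are not supported.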
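Timeouts
Network conditions are unpredictable. When the local network is poor, or the remote server is slow or unresponsive, the client may wait a long time for a response, or fail with an error without ever receiving one. To guard against a server that does not respond in time, set a timeout so that an exception is raised if no response arrives within that period. This is done with the timeout parameter, which is the time allowed between sending the request and the server responding. It defaults to None, meaning requests waits indefinitely and never times out.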
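```python
import requests

# A single value applies to both the connect and the read timeout
requests.get('https://httpbin.org/get', timeout=30)

# A tuple sets them separately: (connect timeout, read timeout)
requests.get('https://httpbin.org/get', timeout=(5, 25))
```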
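Since a timeout surfaces as an exception, callers usually want to catch it. The following is one possible sketch: `fetch_text` is a hypothetical helper, and its `get` parameter exists only so the function can be exercised with a stand-in instead of a real network call.

```python
import requests


def fetch_text(url, connect_timeout=5, read_timeout=25, get=requests.get):
    """Return the response body as text, or None if the request timed out."""
    try:
        resp = get(url, timeout=(connect_timeout, read_timeout))
    except requests.exceptions.ConnectTimeout:
        # The connection could not be established in time
        print(f'connecting to {url} took longer than {connect_timeout}s')
        return None
    except requests.exceptions.ReadTimeout:
        # Connected, but the server then stayed silent for too long
        print(f'{url} sent no data for {read_timeout}s')
        return None
    return resp.text
```

A call such as `fetch_text('https://httpbin.org/get')` then returns either the body or `None`. Catching `requests.exceptions.Timeout` alone would cover both cases if the distinction does not matter.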
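SOCKS
In addition to basic HTTP proxies, Requests also supports proxies that use the SOCKS protocol. This is an optional feature; to use it, install the extra third-party dependency: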
pip install requests[socks]
Once installed, it is used in much the same way as an HTTP proxy:
```python
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
```
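The mapping above still has to be handed to requests. A minimal sketch of that step follows; `build_socks_proxies` is a hypothetical helper, and the credentials, the `127.0.0.1` address, and port `1080` are assumed values for illustration only.

```python
import requests


def build_socks_proxies(user, password, host, port, remote_dns=False):
    """Build a requests-style proxies mapping for a SOCKS5 proxy."""
    # socks5 resolves DNS on the client; socks5h resolves it on the proxy
    scheme = 'socks5h' if remote_dns else 'socks5'
    proxy_url = f'{scheme}://{user}:{password}@{host}:{port}'
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_socks_proxies('user', 'pass', '127.0.0.1', 1080)

# Apply the proxies to every request made through this session
session = requests.Session()
session.proxies.update(proxies)

# Sending needs a SOCKS proxy actually listening at that address:
# resp = session.get('https://httpbin.org/ip')
# For a single request: requests.get(url, proxies=proxies)
```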
Prepared Requests
Besides sending a request directly with requests.get()/post(), you can also build a Request object yourself and send that. For example:
```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {'name': 'CoderPig'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4103.97 Safari/537.36',
}

session = Session()
# Build the request, let the session prepare it, then send it
req = Request('POST', url, data=data, headers=headers)
prepped = session.prepare_request(req)
resp = session.send(prepped)
print(resp.json())
```
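One reason to prepare a request separately is that it can be inspected or adjusted before it goes out. The sketch below does this without sending anything; the `X-Trace-Id` header and its value are made up for illustration.

```python
from requests import Request, Session

session = Session()
req = Request('POST', 'http://httpbin.org/post', data={'name': 'CoderPig'})
prepped = session.prepare_request(req)

# Inspect what would be sent
print(prepped.method, prepped.url)
print(prepped.headers['Content-Type'])
print(prepped.body)

# Adjust the prepared request before sending (hypothetical header)
prepped.headers['X-Trace-Id'] = 'demo-123'

# Uncomment to actually send it:
# resp = session.send(prepped, timeout=10)
```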