Networking · HTTP

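HTTP is a protocol for fetching network resources such as HTML documents. It is the foundation of data exchange on the Web and is a client-server protocol, meaning that requests are usually initiated by the recipient, typically a web browser. A complete Web document is usually assembled from different sub-documents, such as text, layout descriptions, images, videos, and scripts.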

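Clients and servers communicate by exchanging individual messages (as opposed to a stream of data). Messages sent by a client, usually a web browser, are called requests, and the messages the server sends back in answer are called responses.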
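
To make the two message types concrete, here is a minimal sketch that builds the text of an HTTP/1.1 request and splits a hand-written response into status code, headers, and body. The helper names build_request and parse_response, the host, the path, and the sample response are all made up for illustration; none of this is output from a real server, and real clients handle many cases this sketch ignores.

```python
def build_request(method, path, host, headers=None):
    """Build the text of a simple HTTP/1.1 request with no body."""
    lines = [f'{method} {path} HTTP/1.1', f'Host: {host}']
    for name, value in (headers or {}).items():
        lines.append(f'{name}: {value}')
    # A blank line marks the end of the header section
    return '\r\n'.join(lines) + '\r\n\r\n'


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers dict, body)."""
    head, _, body = raw.partition('\r\n\r\n')
    status_line, *header_lines = head.split('\r\n')
    version, code, reason = status_line.split(' ', 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(':')
        headers[name.strip()] = value.strip()
    return int(code), headers, body


# Hypothetical example messages (assumed values, for illustration only)
request_text = build_request('GET', '/index.html', 'example.com',
                             {'Accept': 'text/html'})
sample_response = (
    'HTTP/1.1 200 OK\r\n'
    'Content-Type: text/html\r\n'
    'Content-Length: 13\r\n'
    '\r\n'
    '<p>Hello</p>\n'
)
status, headers, body = parse_response(sample_response)
print(request_text)
print(status, headers['Content-Type'], repr(body))
```

Both messages share the same shape: a first line, a block of headers, a blank line, and an optional body.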

Interacting with HTTP services as a client

For simple operations, the urllib.request module is usually enough. For example, to send a simple HTTP/HTTPS request that fetches an image from a remote server:

from urllib import request

url = 'https://www.python.org/static/img/python-logo.png'

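# Fetch the image and write the raw bytes to a local file;
# the with statement closes both the response and the file
with request.urlopen(url) as resp, open('python-logo.png', 'wb') as f:
    f.write(resp.read())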

(Image: the downloaded python-logo.png)
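
For larger files, reading the whole body into memory at once may be wasteful. The following is a minimal sketch of a chunked download, assuming a hypothetical helper named download; the timeout value is an arbitrary choice, not something the example above requires.

```python
import shutil
from urllib import request


def download(url, dest, timeout=10):
    """Stream the resource at url into the local file dest and return dest."""
    with request.urlopen(url, timeout=timeout) as resp, open(dest, 'wb') as f:
        # Copy in chunks rather than loading the whole body into memory
        shutil.copyfileobj(resp, f)
    return dest
```

It could be called as download('https://www.python.org/static/img/python-logo.png', 'python-logo.png') to reproduce the example above.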

Passing parameters with the GET method

To pass parameters in a GET request, encode them into a query string and append it to the URL:

from urllib import request, parse

# Base URL
url = 'http://httpbin.org/get'

# Store the request parameters in a dictionary
params = {
    'name1': 'value1',
    'name2': 'value2'
}

# Encode the parameters as a query string
querystring = parse.urlencode(params)

# Build and send the GET request
resp = request.urlopen(url + '?' + querystring)
data = resp.read().decode()
print(data)

Output:

{
  "args": {
    "name1": "value1", 
    "name2": "value2"
  }, 
  "url": "http://httpbin.org/get?name1=value1&name2=value2",
  "headers": {
    "Accept-Encoding": "identity", 
    "Host": "httpbin.org", 
  ... omitted ...
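
The encoding step matters most when values contain characters that are not safe in a URL. As a small offline sketch (the parameter names and values here are made up for illustration), urlencode percent-encodes such characters, and parse_qs reverses the process:

```python
from urllib import parse

# Spaces, '&' and non-ASCII characters are escaped automatically
encoded = parse.urlencode({'q': 'http & urllib', 'city': 'café'})
print(encoded)   # q=http+%26+urllib&city=caf%C3%A9

# With doseq=True a list value becomes a repeated key
repeated = parse.urlencode({'tag': ['a', 'b']}, doseq=True)
print(repeated)  # tag=a&tag=b

# parse_qs decodes a query string into a dict of value lists
decoded = parse.parse_qs('name1=value1&name2=value2')
print(decoded)   # {'name1': ['value1'], 'name2': ['value2']}
```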

Passing parameters with the POST method

To send the parameters in the request body with POST instead, encode them to bytes and pass them to urlopen as its optional data argument:

from urllib import request, parse

url = 'http://httpbin.org/post'

params = {
    'name1': 'value1',
    'name2': 'value2'
}

querystring = parse.urlencode(params)

# Passing the encoded bytes as data makes urlopen send a POST request
resp = request.urlopen(url, querystring.encode())
data = resp.read().decode()
print(data)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name1": "value1", 
    "name2": "value2"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "25", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
  ... omitted ...
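
The switch from GET to POST happens purely because a body is supplied. A short offline sketch, using Request objects that are constructed but never sent, shows how urllib picks the method; the PUT case is an extra illustration that goes beyond the example above.

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
body = parse.urlencode({'name1': 'value1', 'name2': 'value2'}).encode()

# Building a Request object does not send anything over the network
get_req = request.Request(url)
post_req = request.Request(url, data=body)
put_req = request.Request(url, data=body, method='PUT')

print(get_req.get_method())   # GET: no data supplied
print(post_req.get_method())  # POST: data supplied
print(put_req.get_method())   # PUT: an explicit method takes precedence
print(len(body))              # 25, the Content-Length in the output above
```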

Custom HTTP request headers

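To send custom HTTP headers with a request, for example to change the User-Agent field, put the header values in a dictionary, create a Request instance with it, and pass that instance to urlopen: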

from urllib import request, parse

url = 'http://httpbin.org/post'

params = {
    'name1': 'value1',
    'name2': 'value2'
}

querystring = parse.urlencode(params)

# Extra headers to send with the request
headers = {
    'User-agent': 'none/ofyourbusiness',
    'Spam': 'Eggs'
}

# Wrap the URL, body, and headers in a Request, then send it
req = request.Request(url, querystring.encode(), headers=headers)
resp = request.urlopen(req)
data = resp.read().decode()
print(data)

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name1": "value1", 
    "name2": "value2"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "25", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "Spam": "Eggs", 
    "User-Agent": "none/ofyourbusiness", 
  ... omitted ...
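
Headers can also be inspected or added on the Request object before it is sent. The sketch below runs offline and never contacts the server; the Accept header is an extra assumption added for illustration. One detail worth knowing is that Request stores header names in capitalized form (first letter upper case, the rest lower case), so lookups with get_header must use that spelling.

```python
from urllib import request

req = request.Request(
    'http://httpbin.org/post',
    data=b'name1=value1',
    headers={'User-agent': 'none/ofyourbusiness', 'Spam': 'Eggs'},
)

# Headers can also be added after construction
req.add_header('Accept', 'application/json')

print(req.get_header('User-agent'))  # none/ofyourbusiness
print(req.get_header('User-Agent'))  # None: the stored key is 'User-agent'
print(req.has_header('Spam'))        # True
print(sorted(req.header_items()))
```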
