How to use a proxy with Scrapy web scraping: a detailed guide

1. Create a new file, "middlewares.py", in your Scrapy project

 
# base64 is needed ONLY if the proxy you are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up HTTP Basic authentication for the proxy.
        # Note: base64.encodestring() is deprecated (removed in Python 3.9);
        # b64encode() takes bytes and does not append a trailing newline.
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
 
 
# This snippet originally appeared at: http://www.sharejs.com/codes/python/8309
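To sanity-check the middleware without running a full crawl, you can call process_request directly on a lightweight stand-in for Scrapy's Request object. The SimpleNamespace stand-in below is an assumption for illustration only; it mimics just the two attributes (meta and headers) that the middleware touches:

```python
import base64
from types import SimpleNamespace

# Same middleware as above, repeated so this snippet is self-contained.
class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

# Stand-in for a scrapy.Request: only the attributes the middleware uses.
request = SimpleNamespace(meta={}, headers={})
ProxyMiddleware().process_request(request, spider=None)

print(request.meta['proxy'])
print(request.headers['Proxy-Authorization'])
```

Decoding the header value back should recover the original credentials, which confirms the Basic auth encoding round-trips correctly.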

2. Add the following to the project settings file (./project_name/settings.py)

DOWNLOADER_MIDDLEWARES = {
    # In Scrapy >= 1.0 the built-in middleware lives under
    # scrapy.downloadermiddlewares; the old scrapy.contrib path no longer exists.
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
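The numbers are middleware priorities: for process_request, Scrapy calls middlewares in increasing order, so the custom ProxyMiddleware (100) sets request.meta['proxy'] before the built-in HttpProxyMiddleware (110) applies it. A quick check of that ordering, using the same dict as above:

```python
# Priorities from settings.py: lower values run earlier in process_request.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

# The order in which Scrapy will call process_request:
call_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
print(call_order)
```

If you invert the two numbers, the custom middleware would run after the proxy is already applied and the meta key would have no effect on that request.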
 

That's it: just two steps, and requests now go through the proxy. Let's test it ^_^

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    # allowed_domains replaces the long-removed domain_name attribute.
    allowed_domains = ["whatismyip.com"]
    # The following URL is subject to change; you can get the latest one here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Use a plain Spider here: overriding parse() on a CrawlSpider
        # breaks its rule-based link extraction.
        with open('test.html', 'wb') as f:
            f.write(response.body)
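If you have more than one proxy available, a small variation of the middleware can rotate among them per request. This sketch is not part of the original article; the PROXY_LIST pool and the use of random.choice are assumptions for illustration:

```python
import random
from types import SimpleNamespace

# Hypothetical proxy pool; replace with your own endpoints.
PROXY_LIST = [
    "http://PROXY_ONE_IP:PORT",
    "http://PROXY_TWO_IP:PORT",
]

class RandomProxyMiddleware(object):
    # Pick a proxy at random for each outgoing request.
    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_LIST)

# Exercise it with a minimal stand-in for scrapy.Request.
request = SimpleNamespace(meta={})
RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])
```

Register it in DOWNLOADER_MIDDLEWARES the same way as ProxyMiddleware above; for authenticated proxies, combine it with the Proxy-Authorization header logic from step 1.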
 

Original article by ItWorker. If reposting, please credit the source: https://blog.ytso.com/tech/pnotes/8171.html
