Web Crawlers Must Use HTTP Proxies

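How can you access a website that has already banned your IP? To keep their data from being scraped, nearly all websites deploy anti-crawler measures, and these measures are the biggest obstacle a crawler faces. A crawler that cannot get past them collects no data at all. So what should you do when a crawler runs into IP restrictions?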

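Anti-crawler restrictions typically work by monitoring IP addresses: the site checks the request pattern of every IP, and an IP that sends requests too frequently is added to a blacklist. The usual remedy is to use proxy IPs. A proxy lets your machine send requests from different IP addresses, which makes it essential for crawling work.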
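To make the blacklist mechanism concrete, here is a minimal sketch of how a site might count requests per IP. It is not taken from any real site: the function `allowRequest`, the window length, the request limit, and the sample IP addresses are all assumptions chosen for illustration.

```javascript
// Hypothetical per-IP frequency check, as a site might implement it.
// WINDOW_MS and MAX_REQUESTS are made-up values for illustration.
const WINDOW_MS = 10000;  // only the last 10 seconds are counted
const MAX_REQUESTS = 5;   // at most 5 requests per window

const recentHits = new Map(); // ip -> array of request timestamps (ms)
const blacklist = new Set();  // IPs that exceeded the limit

// Returns true if the request is allowed, false if the IP is blocked.
function allowRequest(ip, now = Date.now()) {
    if (blacklist.has(ip)) return false;

    // Keep only timestamps still inside the window, then record this hit.
    const hits = (recentHits.get(ip) || []).filter((t) => now - t < WINDOW_MS);
    hits.push(now);
    recentHits.set(ip, hits);

    if (hits.length > MAX_REQUESTS) {
        blacklist.add(ip);
        return false;
    }
    return true;
}

// One IP sends 6 requests within a few milliseconds:
// the first five are allowed, the sixth trips the limit.
const results = [];
for (let i = 0; i < 6; i++) {
    results.push(allowRequest("203.0.113.7", 1000 + i));
}
console.log(results);
```

Under a scheme like this, a crawler that sends every request from one address is blocked quickly, while the same traffic spread over several addresses stays under the per-IP limit.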

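Getting around anti-crawler measures is a very common requirement. Crawlers generally need a large number of proxy IPs, because many sites apply per-IP rate limits as part of their anti-crawler strategy. To stay under those limits, a crawler has to spread its requests across many proxy IPs.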
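One simple way to spread requests across many proxy IPs is round-robin rotation. The sketch below is only an illustration: `createRotator` is a hypothetical helper, and the hosts in `proxyPool` are placeholders rather than real proxies.

```javascript
// Hypothetical proxy pool; these addresses are placeholders, not real proxies.
const proxyPool = [
    { host: "proxy1.example.com", port: 8080 },
    { host: "proxy2.example.com", port: 8080 },
    { host: "proxy3.example.com", port: 8080 },
];

// Returns a function that hands out proxies in round-robin order.
function createRotator(pool) {
    if (pool.length === 0) throw new Error("proxy pool is empty");
    let index = 0;
    return function nextProxy() {
        const proxy = pool[index];
        index = (index + 1) % pool.length; // wrap around at the end of the pool
        return proxy;
    };
}

const nextProxy = createRotator(proxyPool);

// Four consecutive requests would use proxy1, proxy2, proxy3, then proxy1 again.
const picked = [];
for (let i = 0; i < 4; i++) {
    picked.push(nextProxy().host);
}
console.log(picked.join(" -> "));
```

With N proxies in the pool, each IP carries roughly 1/N of the traffic, so the pool size needed depends on how strict the target site's per-IP limit is.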

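A crawler can collect valuable information, but before you start crawling you need high-quality proxy IPs. For crawling, the recommended option is the enhanced edition of a tunnel-forwarding crawler proxy. It is not free, but free proxy IPs cannot sustain large-scale data collection, so paid proxy IPs are the more effective choice. The Node.js example below sends a request through such a proxy.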

const http = require("http");

// Target page to fetch
const targetUrl = "http://httpbin.org/ip";

// Parse the target URL with the built-in WHATWG URL class
const urlParsed = new URL(targetUrl);

// Proxy server (product website: www.16yun.cn)
const proxyHost = "t.16yun.cn";
const proxyPort = "36600";

// Generate a random proxy tunnel number for this request
const tunnel = Math.floor(Math.random() * 10000) + 1;

// Proxy credentials
const proxyUser = "username";
const proxyPass = "password";

// Base64-encode "user:pass" for HTTP Basic proxy authentication
const base64 = Buffer.from(proxyUser + ":" + proxyPass).toString("base64");

const options = {
    host: proxyHost,
    port: proxyPort,
    path: targetUrl,
    method: "GET",
    headers: {
        "Host": urlParsed.hostname,
        "Proxy-Tunnel": tunnel,
        "Proxy-Authorization" : "Basic " + base64
    }
};

http.request(options, function (res) {
    console.log("got response: " + res.statusCode);
    res.pipe(process.stdout);
}).on("error", function (err) {
    console.log(err);
}).end();
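When a crawler sends many requests, it can help to wrap the option-building step in a function so that each request gets its own tunnel value. The following is a minimal sketch under stated assumptions: `buildProxyOptions` and `randomTunnel` are hypothetical helper names, and the idea that a new `Proxy-Tunnel` value leads to a different exit IP is an assumption about the proxy service, not something the example above confirms.

```javascript
// Builds http.request options for one proxied GET request.
// `proxy` is { host, port, user, pass }; `tunnel` is the Proxy-Tunnel value.
function buildProxyOptions(targetUrl, proxy, tunnel) {
    const target = new URL(targetUrl);
    const credentials = Buffer.from(proxy.user + ":" + proxy.pass).toString("base64");
    return {
        host: proxy.host,
        port: proxy.port,
        path: targetUrl, // a forward proxy receives the absolute URL as the path
        method: "GET",
        headers: {
            "Host": target.host,
            "Proxy-Tunnel": String(tunnel),
            "Proxy-Authorization": "Basic " + credentials,
        },
    };
}

// Assumption: a fresh tunnel number per request asks the service for a different exit IP.
function randomTunnel() {
    return Math.floor(Math.random() * 10000) + 1;
}

// Placeholder credentials, same as in the example above.
const proxy = { host: "t.16yun.cn", port: 36600, user: "username", pass: "password" };
const options = buildProxyOptions("http://httpbin.org/ip", proxy, randomTunnel());
console.log(options.headers["Proxy-Authorization"]);
```

The returned object can be passed directly to `http.request`, exactly as `options` is in the example above. Keeping the function free of network calls also makes it easy to check the generated headers before any request is sent.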

Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/53353.html
