Python Web Scraping Notes 02 (Scraping Web Page Data with selenium)

July 13, 2022, 21:59 • Programming Notes


1.1 Libraries used

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

1.2 Workflow

# 1. Open the browser (this form shows the browser window)
driver = webdriver.Chrome()
# Alternative: this form runs headless, without showing the browser window
# option = webdriver.ChromeOptions()
# option.add_argument("headless")
# driver = webdriver.Chrome(options=option)
# 2. Open the page by URL
driver.get('http://xzqh.mca.gov.cn/map')
# 3. Operate on the opened page
s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
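
The workflow above never closes the browser, so a script that fails halfway can leave a Chrome process running. One possible way to guarantee cleanup is a small context manager that always calls quit(). This is only a sketch: managed and FakeDriver are hypothetical names introduced here, and FakeDriver stands in for a real WebDriver so the snippet runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver and always call driver.quit() afterwards."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in for a real WebDriver (no browser needed)."""

    def __init__(self):
        self.closed = False
        self.visited = []

    def get(self, url):
        # Record the URL instead of loading a page
        self.visited.append(url)

    def quit(self):
        self.closed = True


fake = FakeDriver()
with managed(fake) as driver:
    driver.get('http://xzqh.mca.gov.cn/map')
print(fake.closed)
```

In a real script the same helper could presumably wrap webdriver.Chrome() in place of FakeDriver().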

1.3 Functions used

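1. driver.find_element(by=By.XXX, value='VALUE') / driver.find_elements(by=By.XXX, value='VALUE')
# Purpose: locate elements by the given strategy and value (By.XXX stands for a locator such as By.NAME or By.CLASS_NAME)
# Example: driver.find_element(by=By.NAME, value='shengji')
# driver.find_element(by=By.CLASS_NAME, value="info_table")
# Return type: find_element returns a single WebElement; find_elements returns a list of WebElement objects
2. Select(ELEMENT)
# Purpose: wrap the given select element in a Select object
# Example: s = Select(driver.find_element(by=By.NAME, value='shengji'))
# The options of the select are available as s.options[i]
# Example: province = s.options[i].text.split('（')[0]
# An option can be chosen with s.select_by_index() (or select_by_value)
# Example: s.select_by_index(i)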
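
The split('（')[0] idiom above can be isolated into a small pure-Python helper that builds the name-to-index mapping. The following is a minimal sketch; province_index is a hypothetical name, and the sample option texts are assumed for illustration and may not match what the page actually shows.

```python
def province_index(option_texts):
    """Map each province name to its position in the option list.

    Everything before the first full-width parenthesis is treated
    as the province name.
    """
    return {text.split('（')[0]: i for i, text in enumerate(option_texts)}


# Assumed sample texts; the real page may format its options differently.
sample = ['北京市（京）', '湖北省（鄂）', '湖南省（湘）']
index = province_index(sample)
print(index['湖北省'])
```

With a live page, the input would likely come from the Select object, for example province_index([o.text for o in s.options]).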

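1.3.1 Example: using selenium to fetch administrative division data from the website of the Ministry of Civil Affairs of the People's Republic of China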

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
import time

# Open the browser
driver = webdriver.Chrome()
# To run without opening the graphical interface, open the browser this way instead:
# option = webdriver.ChromeOptions()
# option.add_argument("headless")
# driver = webdriver.Chrome(options=option)

driver.get('http://xzqh.mca.gov.cn/map')
# Get the select element
s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
# Use a dict to store the mapping from province name to option index
provinces = {}
for index, option_element in enumerate(s1.options):
    provinces[option_element.text.split('（')[0]] = index

# Provinces to query: Hubei, Hunan, Sichuan
targets = ['湖北省', '湖南省', '四川省']
for name in targets:
    index = provinces[name]
    # Locate the select element again, because the page has been reloaded
    s1 = Select(driver.find_element(by=By.NAME, value='shengji'))
    # Choose the province we want
    s1.select_by_index(index)
    # Get the submit button element
    button = driver.find_element(by=By.CLASS_NAME, value='select_bn')
    # Click to navigate
    button.click()
    # Wait for the page to load
    time.sleep(2)
    # Get the table element
    table = driver.find_element(by=By.CLASS_NAME, value="info_table")
    # Get the area elements
    areas = table.find_elements(by=By.NAME, value='hidzxs')
    for area in areas:
        print(name + ' ' + area.get_property('value'), area.get_property('alt'))
    # Go back to the previous page
    driver.back()

# Close the browser when finished
driver.quit()
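
The fixed two-second sleep in the example either wastes time when the page loads quickly or fails when it loads slowly. A common alternative is to poll until the element appears. The sketch below is a hand-rolled, pure-Python polling helper (wait_until is a hypothetical name, not part of selenium) so that the idea can be shown without a browser; Selenium ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.2):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

In the scraping loop, the sleep could then be replaced by something like wait_until(lambda: driver.find_elements(by=By.CLASS_NAME, value="info_table")), which returns as soon as the list of matching elements is non-empty.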

1.4 Optimization

1.4.1 Problem description

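With the approach above, the script is slow whether or not the browser's graphical interface is shown. The cause is the page load strategy that Selenium uses.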

Selenium has three page load strategies:

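Strategy	Ready state	Notes
normal	complete	Used by default; waits for all resources to finish downloading
eager	interactive	DOM access is ready, but other resources (such as images) may still be loading
none	Any	Does not block WebDriver at all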
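
The table can be read as a rule for when driver.get() stops blocking, expressed in terms of the page's document.readyState. The following is a small illustrative model only, not selenium code; READY_STATES, REQUIRED_STATE and get_would_return are hypothetical names introduced here.

```python
# Order in which document.readyState advances while a page loads.
READY_STATES = ['loading', 'interactive', 'complete']

# readyState each strategy waits for; None means "do not wait at all".
REQUIRED_STATE = {'normal': 'complete', 'eager': 'interactive', 'none': None}


def get_would_return(strategy, ready_state):
    """Return True if driver.get() would stop blocking at this readyState."""
    required = REQUIRED_STATE[strategy]
    if required is None:
        return True
    return READY_STATES.index(ready_state) >= READY_STATES.index(required)


# Show, for each strategy, at which states the call would have returned.
for strategy in REQUIRED_STATE:
    print(strategy, [get_would_return(strategy, s) for s in READY_STATES])
```

Under this model, eager returns one stage earlier than normal, which is where the time saving comes from on pages with many images or other slow resources.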

Usage:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'eager'  # choose the strategy here
driver = webdriver.Chrome(options=options)
driver.get("http://www.google.com")
driver.quit()

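When no strategy is selected, the default normal strategy is used. It waits until all resources have finished loading before returning, which is why it is slow.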

Original article by kepupublish. If you repost it, please credit the source: https://blog.ytso.com/274152.html
