A while back I wrote a small Python crawler that automatically collected and published news headlines for my aggregation site. It worked well until last month, when the target site apparently stopped being maintained and quit updating. A bit awkward, so I had no choice but to switch to a new target, a bigger site this time. The collection code is below. The full tutorial was published earlier; see: Writing a Simple WordPress Site Scraper in Python (使用Python写一个简单的WordPress网站采集程序).
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup
import sql
import time

def getippage(conn):
    # Fetch the news index page and walk its article list.
    pageurl = 'http://www.cs.com.cn/xwzx/'
    gkr = requests.get(pageurl)
    gkr.encoding = 'UTF-8'
    gksoup = BeautifulSoup(gkr.text, "html.parser")
    article = gksoup.find('ul', attrs={'class': 'ch_type3_list'})
    li = article.find_all('li')
    for item in li:
        singleurl = item.find("a").get("href")
        #singleurl = singleurl.replace("../", "")
        singleurl = singleurl.replace("./", "")
        # Skip URLs that have already been collected.
        num = msql.ishave("SELECT * from cj_5afxw where url='" + singleurl + "'")
        if num == 0:
            print(singleurl)
            getsingle(singleurl)
            sqlstr = ("INSERT INTO cj_5afxw(url,insert_time) VALUES ('" + singleurl + "','"
                      + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + "');")
            msql.insert(sqlstr)
        else:
            print(singleurl + " already exists")
        time.sleep(5)  # throttle requests so we don't hammer the target site
    msql.mclose()

def getsingle(url):
    # Fetch a single article page; this site serves article pages as GBK.
    gkr = requests.get('http://www.cs.com.cn/xwzx/' + url)
    gkr.encoding = 'GBK'
    gksoup = BeautifulSoup(gkr.text, "html.parser")
    title = gksoup.find('h1').text
    content = gksoup.find('section')
    # Strip any inline <style> blocks from the article body.
    for style in content.find_all('style'):
        style.decompose()
    #content.find('div', attrs={"id": "toc"}).decompose()
    # POST the article to the WordPress receiver (endpoint redacted).
    url = 'XXXXXXXX'
    data = {'post_title': title, 'post_content': str(content), 'post_category': 507}
    r = requests.post(url, data=data)
    print(r.text)

msql = sql.msql('127.0.0.1', 'XXXX', 'XXXX', 'XXXXXX')
msql.conn()
getippage(msql)
#getsingle("hg/202108/t20210811_6192885.html")
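The receiving endpoint is redacted above ('XXXXXXXX'); the receiver behind it comes from the earlier tutorial and is not shown here. If you would rather not maintain a custom receiver at all, WordPress 5.6+ ships a built-in REST API that accepts posts directly. Below is a minimal sketch of that alternative, not the setup used above: the site URL, the user name bot, and the application password are all hypothetical placeholders, and only the category ID 507 is carried over from the code above.

import requests

# Sketch: publish through WordPress's built-in REST API instead of a custom
# receiver. Assumes WordPress 5.6+ and an application password for user 'bot';
# the URL and credentials below are hypothetical placeholders.
WP_URL = 'https://example.com/wp-json/wp/v2/posts'
AUTH = ('bot', 'xxxx xxxx xxxx xxxx xxxx xxxx')  # application password

def publish(title, content_html):
    payload = {
        'title': title,
        'content': content_html,
        'status': 'publish',
        'categories': [507],  # same category ID as in the scraper above
    }
    r = requests.post(WP_URL, auth=AUTH, json=payload, timeout=30)
    r.raise_for_status()
    return r.json()['id']  # ID of the newly created post

The response JSON includes the new post's ID, which could be stored alongside the URL in cj_5afxw if you ever need to update collected articles later.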
For the MySQL storage used in this code, see the earlier post Writing a Simple WordPress Site Scraper in Python (使用Python写一个简单的WordPress网站采集程序); I won't paste that code again here. Leave a comment if you run into problems, and feel free to contact me if you need a Python scraper written. Just don't bring me anything illegal: I may be poor, but I can still afford my own meals, and I'd rather not eat prison food!
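For readers without the earlier post handy, here is a rough sketch of what a wrapper like sql.msql has to provide for the calls above (conn, ishave, insert, mclose). This is a hypothetical reconstruction, not the original module; it uses pymysql and, unlike the string concatenation in the scraper, supports parameterized queries, which is the safer habit since scraped article URLs go straight into the SQL.

import pymysql

class msql:
    # Hypothetical reconstruction of the sql.msql helper used above;
    # the original module from the earlier post may differ.
    def __init__(self, host, user, password, database):
        self.args = dict(host=host, user=user, password=password,
                         database=database, charset='utf8mb4')
        self.db = None

    def conn(self):
        self.db = pymysql.connect(**self.args)

    def ishave(self, query, params=None):
        # Returns the number of matching rows; %s placeholders keep
        # untrusted values (like scraped URLs) out of the SQL string.
        with self.db.cursor() as cur:
            return cur.execute(query, params)

    def insert(self, query, params=None):
        with self.db.cursor() as cur:
            cur.execute(query, params)
        self.db.commit()

    def mclose(self):
        self.db.close()

With this interface the dedup check in getippage can become msql.ishave("SELECT * FROM cj_5afxw WHERE url=%s", (singleurl,)), with no quoting to get wrong.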
Original article by 奋斗. If you reproduce it, please credit the source: https://blog.ytso.com/242509.html