Some websites throttle requests coming from a single IP; rotating through proxies gets around that limit.
Go to https://www.docker.com/get-started/ and install Docker (Docker Desktop; installation steps omitted).
Clone https://github.com/jhao104/proxy_pool to your local machine:
```shell
git clone https://github.com/jhao104/proxy_pool
```
Start the containers:
```shell
cd proxy_pool
docker-compose -f docker-compose.yml up -d
```
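Once the containers are up, you can smoke-test the pool's HTTP API (it listens on port 5010 by default, the same address the Python code below uses):

```shell
# Ask the pool for a random proxy; a JSON object with a "proxy" field
# should come back once the spiders have collected some proxies.
curl http://127.0.0.1:5010/get/
```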
Define the following functions at the top of your Python program:
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

def get_proxy():
    # Ask the local proxy pool for a random proxy.
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    # Remove a dead proxy from the pool.
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

def getHtml(l):
    # Keep trying proxies until one returns a usable response.
    while True:
        try:
            proxy = get_proxy().get("proxy")
            html = requests.get(l, headers=headers,
                                proxies={"http": "http://{}".format(proxy)},
                                timeout=2)
            if html.status_code == 200 and html.text.find('cannot find token param') == -1:
                return html
            else:
                delete_proxy(proxy)
        except Exception:
            # Proxy timed out or refused the connection; try another one.
            pass
```
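Note that getHtml loops forever when no working proxy can be found. If you would rather give up after a few attempts, the retry logic can be bounded. The sketch below takes the network call as a parameter (a hypothetical `fetch` standing in for the proxied `requests.get` above), so the logic can be exercised without a live proxy pool:

```python
def fetch_with_retries(url, fetch, max_retries=5):
    """Try fetch(url) up to max_retries times; return None if all attempts fail.

    `fetch` is a stand-in for the proxied requests.get call in getHtml,
    so this sketch runs without a live proxy pool.
    """
    for attempt in range(max_retries):
        try:
            resp = fetch(url)
            if resp is not None:
                return resp
        except Exception:
            pass  # bad proxy: fall through and try the next attempt
    return None

# Demo: a fake fetcher that fails twice, then succeeds on the third try.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("proxy timed out")
    return "OK:" + url

print(fetch_with_retries("http://example.com", flaky_fetch))  # OK:http://example.com
```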
Whenever you want to fetch a URL, just call getHtml. For example:
```python
x = getHtml('http://baidu.com').content.decode()
```
Code for downloading a textbook through the proxy pool:
```python
import re
import os, glob
import requests
from fpdf import FPDF

path = 'C:/Users/Administrator/pyproject/.vscode/image/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

def getHtml(l):
    while True:
        try:
            proxy = get_proxy().get("proxy")
            html = requests.get(l, headers=headers,
                                proxies={"http": "http://{}".format(proxy)},
                                timeout=2)
            if html.status_code == 200 and html.text.find('cannot find token param') == -1:
                return html
            else:
                delete_proxy(proxy)
        except Exception:
            pass

pdf = FPDF()
pdf.set_auto_page_break(0)

i = 0
href = input("href without the trailing page number (starts and ends with '/'): ")
num = int(input("total number of pages: "))

while num > 0:
    i = i + 1
    # Retry until the page hands back a real image URL instead of a placeholder.
    while True:
        x = getHtml('http://www.haoduoyun.cc' + href + str(i) + '.shtml').content.decode()
        re1 = "ebookPage(.*?),"
        reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        a = re.findall(re1, x)
        url = re.findall(reg, a[0])[0].strip("'")
        if not url.endswith('wx_fmt=jpeg'):
            break
    r = getHtml(url)
    print(str(i) + ':' + url)
    with open(os.path.join(path, str(i) + ".jpg"), 'wb') as f:
        f.write(r.content)
    pdf.add_page()
    # Place each scanned page on a full A4 page (210 x 297 mm).
    pdf.image(os.path.join(path, str(i) + ".jpg"), w=210, h=297)
    num = num - 1

pdf.output(os.path.join(path, "output.pdf"), "F")

# Clean up the intermediate .jpg files.
for infile in glob.glob(os.path.join(path, '*.jpg')):
    os.remove(infile)
```
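The two-stage extraction inside the loop (first capture the argument of the `ebookPage(...)` call, then pull a URL out of that fragment) can be checked offline. The HTML snippet below is fabricated for illustration; the real haoduoyun.cc markup may differ:

```python
import re

# Fabricated sample of the JavaScript the script scrapes for; the real
# page markup may differ.
html = "var page = ebookPage('http://img.example.com/page1.jpg',1);"

# Stage 1: capture everything between 'ebookPage' and the first comma.
re1 = "ebookPage(.*?),"
# Stage 2: pull a URL out of the captured fragment.
reg = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

a = re.findall(re1, html)
url = re.findall(reg, a[0])[0].strip("'")
print(url)  # http://img.example.com/page1.jpg
```

The trailing `.strip("'")` matters because the URL regex's `[$-_@.&+]` character class happens to include the single quote, so the match drags the closing quote along with it.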
If you run into any problems, leave a comment below~