[Tutorial] Using a random IP when scraping HTML with Python

Some websites rate-limit requests coming from a single IP; switching between proxies works around that limit.

Go to https://www.docker.com/get-started/ and install Docker (Docker Desktop; installation steps omitted).
Clone https://github.com/jhao104/proxy_pool to your local machine:

git clone https://github.com/jhao104/proxy_pool

Start the containers:

cd proxy_pool
docker-compose -f docker-compose.yml up -d
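
Once the containers are running, it is worth checking that the pool's web API answers on port 5010 (the default used throughout this post). A quick sanity check, assuming that port:

import requests

# Ask the pool for one random proxy. Right after startup the pool may still be
# empty while its fetch/validate jobs fill it, so give it a minute if this fails.
print(requests.get("http://127.0.0.1:5010/get/").json())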

Define the following functions at the top of your Python script:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

def getHtml(l):
    failures = 0
    while True:
        try:
            proxy = get_proxy().get("proxy")
            # request the page through the proxy
            html = requests.get(l, headers=headers,
                                proxies={"http": "http://{}".format(proxy)}, timeout=2)
            if html.status_code == 200 and html.text.find('cannot find token param') == -1:
                return html
            else:
                delete_proxy(proxy)  # bad response: remove the proxy from the pool
        except Exception:
            failures += 1  # proxy timed out or refused: try the next one
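
The getHtml above retries forever, which hangs the script if the pool is empty or every proxy is dead. A possible variant with a retry cap (my own sketch, not part of the original post; getHtml_capped and max_retries are hypothetical names):

def getHtml_capped(l, max_retries=20):
    # Sketch: same logic as getHtml, but give up after max_retries failed proxies.
    for _ in range(max_retries):
        try:
            proxy = get_proxy().get("proxy")
            html = requests.get(l, headers=headers,
                                proxies={"http": "http://{}".format(proxy)}, timeout=2)
            if html.status_code == 200 and 'cannot find token param' not in html.text:
                return html
            delete_proxy(proxy)
        except Exception:
            pass  # this proxy failed, move on to the next one
    raise RuntimeError("no working proxy after {} attempts".format(max_retries))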

Whenever you want to fetch a URL, just call getHtml. For example:

x = getHtml('http://baidu.com').content.decode()
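
Note that getHtml only maps the http scheme, so requests to https:// URLs will bypass the proxy. If you also need HTTPS, map both schemes; this sketch assumes the proxies in the pool accept CONNECT tunnels, which not all free proxies do:

proxy = get_proxy().get("proxy")
proxies = {
    "http": "http://{}".format(proxy),
    "https": "http://{}".format(proxy),  # assumes the proxy can tunnel HTTPS
}
html = requests.get('https://example.com', headers=headers, proxies=proxies, timeout=2)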

Full script for downloading a textbook through the proxy:

from bs4 import BeautifulSoup
import requests
import re, time
import os, glob
from fpdf import FPDF

path = 'C:/Users/Administrator/pyproject/.vscode/image/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

def getHtml(l):
    failures = 0
    while True:
        try:
            proxy = get_proxy().get("proxy")
            # request the page through the proxy
            html = requests.get(l, headers=headers,
                                proxies={"http": "http://{}".format(proxy)}, timeout=2)
            if html.status_code == 200 and html.text.find('cannot find token param') == -1:
                return html
            else:
                delete_proxy(proxy)  # bad response: remove the proxy from the pool
        except Exception:
            failures += 1  # proxy timed out or refused: try the next one

pdf = FPDF()
pdf.set_auto_page_break(0)
i = 0
href = input("href prefix without the page number (starts with '/', ends with '/'): ")
num = int(input("total number of pages: "))
while num > 0:
    i = i + 1
    while True:
        x = getHtml('http://www.haoduoyun.cc' + href + str(i) + '.shtml').content.decode()
        re1 = "ebookPage(.*?),"
        reg = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
        a = re.findall(re1, x)
        url = re.findall(reg, a[0])[0].strip("'")
        if not url.endswith('wx_fmt=jpeg'):
            break
    r = getHtml(url)
    print(str(i) + ':' + url)
    with open(os.path.join(path, str(i) + ".jpg"), 'wb') as f:
        f.write(r.content)
    pdf.add_page()
    # each downloaded scan fills a full page
    pdf.image(os.path.join(path, str(i) + ".jpg"), w=210, h=297)
    num = num - 1

pdf.output(os.path.join(path, "output.pdf"), "F")
# delete the downloaded page images once the PDF is written
for infile in glob.glob(os.path.join(path, '*.jpg')):
    os.remove(infile)
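
The w=210, h=297 passed to pdf.image correspond to a full A4 page, because FPDF defaults to millimetre units and A4 format. If you want to make that assumption explicit (or target a different page size), you can pass it to the constructor:

from fpdf import FPDF

# Spell out the defaults the script relies on: portrait A4, millimetre units.
pdf = FPDF(orientation='P', unit='mm', format='A4')
pdf.set_auto_page_break(0)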

If you run into any problems, leave a comment in the comments section~

