show	version	enable_checker
step	1.0	true

爬取图片进阶

回忆

上次爬了卫星
卫星云图具有相应的规律
可以通过分析 url 把图片一张张 wget 下来
这其实并不是真正的爬取
如何通过模拟浏览器来进行爬取呢？🤔
我们这次来爬取纸飞机

浏览

这个网站对于爬虫还是比较友好的

设置 xpath

import requests
from lxml import etree
response = requests.get("https://www.foldnfly.com/")
et_html = etree.HTML(response.content)
list_pa = et_html.xpath("/html/body/main/div/div[2]/div[2]/div[1]/div/div/a")
url = "https://www.foldnfly.com/" + list_pa[0].get("href")
name = list_pa[0].xpath("b")[0].text
print(url,name)

开始遍历

import requests
from lxml import etree
response = requests.get("https://www.foldnfly.com/")
et_html = etree.HTML(response.content)
list_pa = et_html.xpath("/html/body/main/div/div[2]/div[2]/div[1]/div/div/a")
url = "https://www.foldnfly.com/" + list_pa[0].get("href")
name = list_pa[0].xpath("b")[0].text
print(url,name)

爬取过程

import os
import requests
from lxml import etree
response = requests.get("https://www.foldnfly.com/")
et_html = etree.HTML(response.content)
list_pa = et_html.xpath("/html/body/main/div/div[2]/div[2]/div[1]/div/div/a")
plane_no = 1
for anchor in list_pa:
    name = "%02d"%plane_no + "_" + anchor.xpath("b")[0].text.replace(" ","")
    os.system("mkdir " + name)
    print(name)
    url = "https://www.foldnfly.com/" + anchor.get("href")
    response2 = requests.get(url)
    et_html2 = etree.HTML(response2.content)
    list_pic = et_html2.xpath("body/main/div/div[2]/div/div[1]/div/picture/source/source/img")
    #print(list_pic)
    pic_no = 0
    for pic in list_pic:
        print(pic.get("src"))
        response3 = requests.get(pic.get("src"))
        with open(name + "/" + str(pic_no) + ".jpg","wb") as f:
             f.write(response3.content)
        pic_no += 1
    plane_no += 1

这里需要注意
- 图片地址选取的时候 img 是在 source 里面的
- 但是浏览器和 xpath 好像对于 source 标签支持不好
- 于是我把那两层的 source 换成了//
- 或者手动写清楚
- 就可以爬取到了
一层一层尝试
一层成功了再进入下一层

输出结果

观察到纸飞机图片来自于一个图床
图片本身命名具有较强的规律

总结

这次爬了纸飞机的网站
从首页爬到内容页
从内容页第一个示例图开始
一个个地遍历
经过二重循环完成任务
到最后才发现图像的普遍性规律
不过有的网站中的图片是嵌入到网页中的
这种情况怎么办呢？🤔
下次再说

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

572-252877-爬取图片进阶_纸飞机_wget.sy.md

572-252877-爬取图片进阶_纸飞机_wget.sy.md

爬取图片进阶

回忆

浏览

设置 xpath

开始遍历

爬取过程

输出结果

总结

Files

572-252877-爬取图片进阶_纸飞机_wget.sy.md

Latest commit

History

572-252877-爬取图片进阶_纸飞机_wget.sy.md

File metadata and controls

爬取图片进阶

回忆

浏览

设置 xpath

开始遍历

爬取过程

输出结果

总结