Scraping Wallpapers with My Own Crawlers

learningman — October 02, 2018

Since I'm really not very skilled, I couldn't get all the work done in a single language and had to chain several programs together to finish the job.
I scraped every image from Google Earth and the Bing wallpapers (as of September 8, 2018).

Bing wallpaper scraping results

BingDaily

Google Earth scraping results

Google Earth

Windows Spotlight images found on my own computer

Windows Spotlight
For the extraction method, see "Using Windows Spotlight as Your Desktop Wallpaper".
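The gist of that method can be sketched in Python: Windows Spotlight caches its lock-screen images as extensionless files under the ContentDeliveryManager package's Assets folder, so you only need to copy the larger files out and give them a .jpg extension. The default path and the 100 KB cutoff below are assumptions; adjust them for your machine.

```python
import os
import shutil

# Default Spotlight cache location (an assumption based on the standard
# ContentDeliveryManager package name; verify it exists on your system).
ASSETS = os.path.expandvars(
    r'%LOCALAPPDATA%\Packages'
    r'\Microsoft.Windows.ContentDeliveryManager_cw5n1h2txyewy'
    r'\LocalState\Assets')

def collect_spotlight(assets_dir=ASSETS, dest='spotlight'):
    """Copy cached Spotlight assets out as .jpg files.

    Files under ~100 KB are icons/thumbnails, so they are skipped.
    Returns the list of copied file paths.
    """
    os.makedirs(dest, exist_ok=True)
    copied = []
    for name in os.listdir(assets_dir):
        src = os.path.join(assets_dir, name)
        if not os.path.isfile(src) or os.path.getsize(src) < 100 * 1024:
            continue  # skip directories and small non-wallpaper assets
        dst = os.path.join(dest, name + '.jpg')
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied
```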

Bing wallpaper crawler code
@echo off
title BingPaperDownload
cd %~dp0
set com1=https://bing.ioliu.cn/v1?d=
set com2=^&w=1920^&h=1080
set day=0
:forstart
set "comfull=%com1%%day%%com2%"
curl "%comfull%" --output %day%.jpg
set /a day+=1
if %day%==1200 goto endl
goto forstart
:endl
pause>nul
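For reference, the same loop can be sketched in Python 3. The URL format and the 0–1199 day range come from the batch script above; the HTTPError handling is an addition so that one missing day doesn't abort the run.

```python
from urllib.error import HTTPError
from urllib.request import urlretrieve

def bing_url(day, w=1920, h=1080):
    # Same URL the batch script assembles from com1 + day + com2.
    return 'https://bing.ioliu.cn/v1?d={}&w={}&h={}'.format(day, w, h)

def fetch_all(last_day=1200):
    # Mirrors the :forstart loop: day 0 .. 1199, one jpg per day.
    for day in range(last_day):
        try:
            urlretrieve(bing_url(day), '{}.jpg'.format(day))
        except HTTPError:
            continue  # skip days the mirror does not serve
```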
Google Earth crawler code

I wrote two versions QwQ. The Python one failed inexplicably after scanning about 2000 IDs, so I switched to the bat version to finish the run.

bat version
@echo off
title GoogleEarthDownload
cd %~dp0
set com1=http://www.gstatic.com/prettyearth/assets/full/
set com2=.jpg
set day=1722
:forstart
set "comfull=%com1%%day%%com2%"
echo %comfull%
curl "%comfull%" --output %day%.jpg
set /a day+=1
if %day%==8000 goto endl
goto forstart
:endl
pause>nul

Python version. I wrote two: one for Python 2 and one for Python 3.
Because of the special network environment in mainland China, the code sets proxy environment variables; configure your own HTTP proxy accordingly.
The bs4 library is also required (install with: pip install beautifulsoup4).

Python 2 version

import urllib2
import urllib
import json
import os
from bs4 import BeautifulSoup

os.environ['http_proxy'] = 'http://127.0.0.1:1080'
os.environ['https_proxy'] = 'https://127.0.0.1:1080'

result = []
file = open('output.txt','a')
for x in xrange(1000, 10000):
    x = str(x)
    try:
        print("Fetching " + x + " ...")
        response = urllib2.urlopen('https://earthview.withgoogle.com/' + x)
        html = response.read()
        html = BeautifulSoup(html, 'html.parser')
        Region = str((html.find("div", class_="content__location__region")).text.encode('utf-8'))
        Country = str((html.find("div", class_="content__location__country")).text.encode('utf-8'))
        Everything = html.find("a", id="globe", href=True)
        GMapsURL = Everything['href']
        Image = 'https://www.gstatic.com/prettyearth/assets/full/' + x + '.jpg'
        result.append({'region': Region, 'country': Country, 'map': GMapsURL, 'image': Image})
        # create images/<Region>/<Country>/ before saving into it
        folder = os.path.join('images', Region, Country)
        if not os.path.isdir(folder):
            os.makedirs(folder)
        urllib.urlretrieve(Image, os.path.join(folder, '%s.jpg' % x))

    except urllib2.HTTPError:
        continue  # skip IDs that 404

meow = json.dumps(result) 
file.write(meow)            
file.close()
Python 3 version
from urllib.request import urlopen,urlretrieve
from urllib.error import HTTPError
import time,os

os.environ['http_proxy'] = 'http://127.0.0.1:1080'
os.environ['https_proxy'] = 'https://127.0.0.1:1080'

for x in range(1676, 7030):
    x = str(x)
    try:
        response = urlopen('https://earthview.withgoogle.com/' + x)
        if response.getcode() == 200:
            start = time.time()
            Image = 'https://www.gstatic.com/prettyearth/assets/full/' + x + '.jpg'
            if not os.path.isfile(Image.split('/')[-1]):  # for download resume
                urlretrieve(Image, Image.split('/')[-1])
                print("Fetched -> | " + x + " | in " + "{0:.4f}".format(time.time() - start) + " sec" )
            else :
                print("Already fetched! -> | " + x + " | ")

    except (HTTPError,AttributeError):
        continue #If the page is 404, then just continue

This article is licensed under CC BY-NC-SA 3.0; as long as you follow the license, you are free to share and adapt it.
Permalink: https://learningman.top/archives/956.html
