Web Analytics
First, take a look at the Google Search URL. After typing in any keyword and running a search, you can see that the search URL looks like this:
shell
http://www.google.com.tw/search?q=
"="後面便是搜尋的關鍵字了,再觀察網頁原始碼,搜尋結果就在class="g"的div區塊中。
Since crawlers can roam around like this, there are bound to be sites that don't welcome them; after all, letting swarms of crawlers run wild on your site puts real load on the server. Web crawlers therefore use a number of tricks to disguise themselves so that, from the server's point of view, they look like a human at work: for example, adding a user agent to the request to masquerade as a browser, or inserting random delays between requests, which not only simulates human behaviour but also avoids burdening someone else's server... (a small sketch of the delay trick follows below).
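A minimal sketch of the random-delay idea; the URLs and the 1-3 second range here are assumptions for illustration, not taken from the original code:
python
import random
import time

import requests as rq

# Hypothetical list of pages to fetch; replace with real targets.
urls = ['http://www.google.com.tw/search?q=python',
        'http://www.google.com.tw/search?q=crawler']

for u in urls:
    rq.get(u, headers={'User-Agent': 'Mozilla/5.0'})  # pretend to be a browser
    time.sleep(random.uniform(1, 3))                  # pause 1-3 seconds before the next request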
Code
python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import random

import requests as rq
from bs4 import BeautifulSoup as bs

# Pool of browser User-Agent strings; one is picked at random for each request.
user_agent = ["Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
              "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
              "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
              "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
              "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"]

target = input('search:')
url = 'http://www.google.com.tw/search?q=' + target

try:
    # Send the request with a randomly chosen User-Agent so it looks like a browser.
    res = rq.get(url=url, headers={'User-Agent': random.choice(user_agent)})
    res.raise_for_status()
except rq.exceptions.HTTPError:
    print('[HTTP_Error]')
    raise SystemExit

# Search results live in div blocks with class="g"; the title link sits under class="r".
soup = bs(res.text, 'html.parser')
link = soup.select('.g .r a')
for index in range(2):          # print the first two results
    print(link[index].string)   # title
    print(link[index]['href'])  # link
Output: