在Heroku遠端主機控制瀏覽器

最近在研究Telegram Bot API，雖然讀過文件知道許多語言都已經有不少第三方框架支援，基於某些應用目的不想被框架綁住，所以打算自己寫。

然後我就耗了整整兩天，嘗試如何讓部署在Heroku上的Bot能順利使用headless browser擷取到我要的數據...。說也奇怪，本機執行是沒問題的，但在遠端主機就是一直丟出Timeout Error，原來遠端主機控制Chrome訪問目標網頁的時候總是遭遇某DDoS Protection服務所阻擋，但經過一番嘗試目前仍是應對不了這個情況，耗了好幾天...於是學到了windows.navigator.webdriver這個參數。

*假設Heroku APP已建立。

Add Config Vars & Buildpacks

在APP的Setting頁面，設定以下兩個環境變數：

shell

CHROMEDRIVER_PATH = /app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN = /app/.apt/usr/bin/google-chrome

在APP的Setting頁面，設定以下兩個buildpack：

shell

https://github.com/heroku/heroku-buildpack-google-chrome
https://github.com/heroku/heroku-buildpack-chromedriver

Code

程式的部分，Chrome driver的參數會使用到上述所設定的環境變數。

python

import os
from selenium import webdriver

def get_chrome():
    op = webdriver.ChromeOptions()
    op.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    op.add_argument("--headless")
    op.add_argument("--disable-dev-shm-usage")
    op.add_argument("--no-sandbox")

    '''
    # avoid detection 好孩子先不要 ^.<
    op.add_argument('--disable-infobars')
    op.add_experimental_option('useAutomationExtension', False)
    op.add_experimental_option("excludeSwitches", ["enable-automation"])
    '''

    return webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), options=op)

Deploy

然後就可以進行部署使用了，Heroku APP會需要安裝許多依賴套件而增加100+MB的使用空間。

透過APP bash測試從證交所擷取上市股票代號一覽表⬇︎

Ref.

Note

關於動態網頁資料擷取的小技巧：雖然Selenium WebDriverWait提供explicit waits可等待指定頁面元素載入，但實際上並沒有這麼順利，不過這也是讓我覺得學習web crawler有趣的地方(儘管有時候很頭疼😖)。有時候會發生指定元素確實在頁面上載入完成了，但內容卻是空空如也，又或許是我慧根不夠用不好text_to_be_present_in_element，試了半天還是抓不到我要的數據。哭啊，我就是要擷取內容文字咩！只好換個方式等待：

python

from selenium.webdriver.support.ui import WebDriverWait

driver = get_chrome()
driver.get("{{url}}")
is_text = lambda driver: driver.find_element_by_css_selector('{{css_selector}}').text.strip() != ''
WebDriverWait(driver, 30, 0.5).until(is_text)

Add Config Vars & Buildpacks ​

Code ​

Deploy ​

Ref. ​

Note ​

Add Config Vars & Buildpacks

Code

Deploy

Ref.

Note