Scrape Data From JavaScript-Rendered Sites

I was working on a model that classifies whether a given piece of news is fake, based on several parameters that I think will be useful. The first phase of any machine learning project is collecting a relevant dataset. I found this GitHub repository with a pretty good dataset, but I decided to gather more data myself because I like scraping and indexing the internet. Naturally, to collect “real” news I decided to get links from the Google News and Yahoo News RSS feeds and then scrape the news sites they point to.

Pretty easy task, or so I thought. I wrote a Python function that takes each link from the RSS feed and makes a GET request to it to fetch the page source. I quickly realised that some of these news sites render their content on the client side using JavaScript, so until that JavaScript is executed the page contains none of the actual article content. As a result, my function returned a lot of JavaScript code and no actual news.
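
For illustration, here is a rough sketch of what that first attempt could look like using only the standard library. The names extract_links and fetch_source, the User-Agent header, and the sample feed are my own assumptions for this example, not the original function, and real feeds may need more careful handling.

```python
import urllib.request
import xml.etree.ElementTree as ET


def extract_links(rss_xml):
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def fetch_source(url, timeout=10):
    """Plain GET request: returns the HTML as served, with no JavaScript executed."""
    # The User-Agent value is an assumption; some sites reject the default one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A tiny hand-written feed (made up for illustration) so the parsing step runs offline.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item><title>First</title><link>https://example.com/a</link></item>
    <item><title>Second</title><link>https://example.com/b</link></item>
  </channel>
</rss>"""

links = extract_links(SAMPLE_FEED)
print(links)
```

Calling fetch_source on a client-side rendered page returns exactly the markup the server sent, which is why the result is mostly script tags rather than article text.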

To solve this problem, I used Selenium to render the JavaScript and then grab the page source. I used PhantomJS as the browser for two reasons: 1. I did not want to launch a browser GUI, and 2. headless Firefox was not working (and I'm too lazy to fix that). Here is the function I wrote to do this.

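from selenium import webdriver


def render_page(url):
    """Load url in PhantomJS and return (rendered HTML, final URL)."""
    # webdriver.PhantomJS exists only in Selenium releases before 4.0,
    # and the phantomjs binary must be installed and on PATH
    driver = webdriver.PhantomJS()
    try:
        driver.get(url)

        # To get the redirected URL (if any)
        url = driver.current_url

        body = driver.page_source
    finally:
        # quit() shuts down the PhantomJS process even if loading fails
        driver.quit()
    return body, url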

Since I’m using the Google News RSS feed, which uses a redirect link instead of the actual link to the news website, I had to get the redirected URL. That is why the function reads driver.current_url after the page has loaded and returns it along with the page source.
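
One way the returned URL could be put to use, as a rough sketch: key each scraped page by the URL it finally resolved to, so that two redirect links pointing at the same article are stored once. The helper collect_pages, the stand-in fake_render, and all URLs below are hypothetical and only there so the example runs without a browser. In practice the render argument would be render_page.

```python
def collect_pages(feed_links, render):
    """Render each feed link and key the HTML by the URL it finally resolved to.

    `render` is any callable returning (body, final_url), e.g. render_page.
    """
    pages = {}
    for link in feed_links:
        body, final_url = render(link)
        # Two different redirect links can lead to the same article;
        # keep only the first copy.
        pages.setdefault(final_url, body)
    return pages


# Stand-in for render_page (made-up URLs) so the sketch runs without Selenium.
def fake_render(url):
    redirects = {
        "https://news.example/rss/1": "https://site.example/story",
        "https://news.example/rss/2": "https://site.example/story",
        "https://news.example/rss/3": "https://site.example/other",
    }
    final_url = redirects[url]
    return "<html>" + final_url + "</html>", final_url


pages = collect_pages(
    [
        "https://news.example/rss/1",
        "https://news.example/rss/2",
        "https://news.example/rss/3",
    ],
    fake_render,
)
print(sorted(pages))
```

Passing the renderer in as an argument is a design choice for this sketch: it keeps the collection logic independent of whichever browser driver ends up doing the rendering.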

That is it for this short post.
