I was working on a model that classifies whether a given piece of news is fake, based on several parameters that I think will be useful. The first phase of any machine learning project is collecting a relevant dataset. I found this GitHub repository with a pretty good dataset, but I decided to gather more data anyway because I like scraping and indexing the internet. Naturally, to collect “real” news, I decided to pull links from the Google News and Yahoo News RSS feeds and then scrape the news sites they point to.
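The link-collection step can be sketched roughly as follows. This is a minimal illustration rather than the exact code I used: it assumes the feed is plain RSS 2.0 (each story is an item element with a link child), and the function name extract_links is made up for this example. It only uses Python's standard library.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_links(rss_xml: str) -> List[str]:
    """Return the <link> of every <item> in an RSS 2.0 document.

    Hypothetical helper for illustration; assumes standard RSS 2.0 layout.
    """
    root = ET.fromstring(rss_xml)
    links = []
    # Only look inside <item> elements so the channel's own <link> is skipped.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links
```

The feed text itself would first have to be downloaded (for example with urllib.request), and each extracted link is then handed to the page-rendering function.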
def render_page(url):
    driver = webdriver.PhantomJS()
    driver.get(url)
    # To get the redirected URL (if any)
    url = driver.current_url
    body = driver.page_source
    driver.close()
    return body, url
Since I’m using the Google News RSS feed, which provides a redirect link instead of the actual link to the news website, I had to read the redirected URL from driver.current_url once the page has loaded (the step marked by the comment in the code snippet).
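One thing worth guarding against is a page that fails to load and leaves a browser process behind. Below is a hedged sketch of a more defensive variant, not the version I actually ran: the name render_page_safely and the make_driver parameter are assumptions made for this example. It takes any callable that returns a Selenium-style driver, and it calls quit() in a finally block. In Selenium, close() only closes the current window, while quit() ends the whole browser session.

```python
from typing import Any, Callable, Tuple


def render_page_safely(url: str, make_driver: Callable[[], Any]) -> Tuple[str, str]:
    """Load url and return (page source, final URL after redirects).

    make_driver is a hypothetical factory, e.g. a function that returns
    a Selenium WebDriver instance.
    """
    driver = make_driver()
    try:
        driver.get(url)
        # current_url reflects any redirect the browser followed.
        final_url = driver.current_url
        body = driver.page_source
    finally:
        # Always end the browser session, even if loading raised an error.
        driver.quit()
    return body, final_url
```

With Selenium installed, the factory could be something like lambda: webdriver.PhantomJS(), matching the snippet above, or any other driver class that is available.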
That is it for this short post.