Scrape News and Corporate Announcements in Real-Time -1

If you already have Telegram installed, you can add the news bot by typing @info_scraper_bot in Telegram’s search bar.

The server runs 24/7 and the service is free for 7 days. Since I’m Korean, the bot scrapes Korean news and announcements and supports only the Korean language.

Key features are as follows.

  1. The bot scrapes more than 50 news websites, plus a website where all corporate announcements are published, in real time. Users then receive news and announcements filtered by the keywords they have configured.
  2. Using natural language processing, the bot shows the listed companies related to each news article. Users can also register the names of listed companies to receive related news.
  3. Using natural language processing, the bot provides keyword search. For example, if a user types “/s 북한” (North Korea) to see related listed companies, the bot calculates the correlation between the keyword and every listed company and returns the results.

This project began last December. At first I thought of it as a simple scraper for my portfolio, but it has become much larger than I expected and has gained many users.

I got the idea for this project from a friend. He invests in stocks, and I noticed he was using RSS feeds to receive news articles. Out of curiosity I asked him about it, and he said RSS takes some time to pick up newly published news, and some news publishers don’t offer RSS feeds at all.

So I decided to build one myself. It would kill two birds with one stone, since I could gather news data for my own use and help my friend at the same time.

Preparation

As I’m familiar with Python, I looked up Python libraries for scraping. I found three main tools: Scrapy, BeautifulSoup, and Selenium.

Each of the three has its own pros and cons.

First, Scrapy is a framework, not a library. Like Django, which is a web framework, Scrapy has a workflow of its own, so you have to learn and understand how to use it. The learning curve is steeper than with the other tools, but once you get used to it, it offers better performance and more features than the others.

Scrapy’s performance comes from its asynchronous processing. This is a huge advantage when you have to parse dozens of web pages in a short period, since no task has to wait for an earlier task to finish. However, because of the asynchronous workflow, you can run into a lot of trial and error if you use it without understanding it.

For example, scraped data can arrive in a different order than you requested it, and the code you write may not behave as you expected. So Scrapy’s shortcomings are that the learning process can be cumbersome and that you have to follow Scrapy’s conventions.
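To make the ordering issue concrete, here is a minimal sketch using Python's asyncio. Scrapy itself is built on Twisted rather than asyncio, so this only illustrates the general behaviour of asynchronous work, not Scrapy's internals. The site names and delays are made up.

```python
import asyncio

# Hypothetical per-site delays in seconds, standing in for network latency.
DELAYS = {"site_a": 0.15, "site_b": 0.05, "site_c": 0.10}


async def fetch(site: str, finished: list) -> None:
    """Pretend to download a page, then record that it finished."""
    await asyncio.sleep(DELAYS[site])
    finished.append(site)


async def crawl_all() -> list:
    finished = []
    # All three "requests" start together; none waits for the previous one.
    await asyncio.gather(*(fetch(site, finished) for site in DELAYS))
    return finished


completion_order = asyncio.run(crawl_all())
# Results arrive in order of completion, not in the order they were started.
print(completion_order)
```

If your code assumes the first request is also the first to finish, this is where it goes wrong.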

Secondly, BeautifulSoup is a Python library for parsing HTML. Its biggest advantage is that it’s easy to use, so for a simple task that doesn’t need a big scraper, BeautifulSoup can be the better choice. However, as it doesn’t have many features, you have to build every feature yourself if you want a complicated scraper. BeautifulSoup also doesn’t download or parse pages in parallel on its own, which means a naive BeautifulSoup scraper is much slower than Scrapy. You can add parallelism with standard-library modules such as multiprocessing and threading, but that is something you have to build yourself.
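As a rough idea of what "building it yourself" means, the sketch below parses several pages in a thread pool. To keep it runnable without extra packages, it uses the standard library's html.parser as a stand-in for BeautifulSoup, and the pages are hard-coded instead of downloaded; both are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Hypothetical, already-downloaded pages; a real scraper would fetch these.
PAGES = [
    "<html><head><title>First</title></head></html>",
    "<html><head><title>Second</title></head></html>",
]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order even though the work is spread over threads.
    titles = list(pool.map(parse_title, PAGES))

print(titles)
```

Threads mostly help while waiting on the network; for heavy parsing, a process pool is usually the better fit.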

Lastly, Selenium is a little bit different from those two tools. Originally, Selenium was created for web developers to test their websites. When a website gets bigger, it’s hard for a human to check every URL, so Selenium was used to automate those processes.

Using those features, you can scrape web pages as well. In particular, if the webpage that you want to scrape is a dynamic webpage, Selenium is extremely useful and could be the only way to scrape sometimes.

For example, if the webpage is built with JavaScript, and the screen you actually see in the web browser differs from the page source you downloaded, you can’t scrape what you want by just extracting the original HTML source from that URL. In those cases, you can get what you want by using Selenium to download the HTML after a browser has rendered it.

The problem is that, since Selenium drives an actual web browser as a human does, it takes a long time to render each page and consumes much more memory and CPU than the other tools. Therefore, it’s better to avoid Selenium unless you have to handle dynamic webpages.

I chose Scrapy because, for me, speed and efficiency were the key to scraping hundreds of web pages in a couple of seconds. I used MariaDB to store the scraped data, and for deployment I used Telegram’s Bot API.

As for the Telegram bot, bots are officially supported by Telegram, which provides an API and tutorials for development. Most importantly, it is free of charge.

I built the whole workflow as follows.

All the user information generated by the Telegram bot and the data scraped by Scrapy flow into MariaDB, and the stored data is used throughout the whole process, including responding to each user’s queries and filtering out old data while scraping.

One thing to note is that this service is scalable, since the number of scraping cycles stays the same even as the number of users grows. In other words, whether there is one user or 1,000 users, the bot scrapes only once per cycle. The only part that requires more computation is sending each user the news articles matching their configured keywords, which takes little computation and sends no extra requests to the news servers.
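The scalability argument can be sketched in a few lines: the article list is produced once per cycle, and only the cheap matching step depends on the number of users. The function name, the data shapes, and the simple substring match are assumptions for illustration, not the bot's actual matching logic.

```python
def match_articles(articles, user_keywords):
    """Return {user_id: [titles]} for articles containing any of the user's keywords."""
    deliveries = {}
    for user_id, keywords in user_keywords.items():
        hits = [
            article["title"]
            for article in articles
            if any(keyword.lower() in article["title"].lower() for keyword in keywords)
        ]
        if hits:
            deliveries[user_id] = hits
    return deliveries


# One scraping cycle produces this list once, regardless of how many users exist.
articles = [
    {"title": "Chip maker posts record profit"},
    {"title": "Central bank holds rates"},
]
# Hypothetical users and their configured keywords.
user_keywords = {101: ["chip"], 102: ["rates", "bank"], 103: ["oil"]}

deliveries = match_articles(articles, user_keywords)
print(deliveries)
```

Adding more users only adds entries to user_keywords; the number of requests sent to news sites does not change.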

Development Environment

As this project grew much bigger than my last one, I started to use a gaming laptop as the deployment server instead of a Raspberry Pi.

I also started to use Docker. Docker is a magical piece of software that makes it possible to isolate any process so that it can be copied, moved, and transformed.

Further information can be found below.

There are two main advantages to using Docker.

One is that I can easily move everything, from my files, directories, and packages to environment settings, just by converting the whole thing into a Docker image. That is, you don’t need to say ‘It worked on my computer. Why doesn’t it work on your computer?’ anymore. If you have some development experience, you know how annoying environment setup can be before you even start developing. With Docker, you don’t have to waste your time on those tricky problems. It’s like quick-freezing what you were cooking, bringing it wherever you want, and continuing to cook after thawing it.

The other advantage is that Docker has the strengths of a virtual machine but is far more efficient. A virtual machine such as VMware is typically less efficient than Docker, since it can use only the CPU and memory that the host has allocated to it, and those allocated resources are generally unavailable to other machines even while they are idle. A Docker container, on the other hand, runs as a process on the host, which means it doesn’t tie up the host’s resources: it uses them only when it needs them. Performance inside the container is also almost the same as on the host system.

One thing to be aware of is that, since Docker was originally built for Linux, it runs on WSL2 when you install Docker for Windows.

Things like WSL2 and Docker will torture your brain at first, as they did mine. However, I had no choice at the time, and in retrospect it was the perfect opportunity to learn.

As for WSL2, it lets you run Linux on Windows using a hypervisor. That is, you can not only run Linux but also access Linux’s file system from Windows and vice versa. So with WSL2 you don’t need to set up dual booting for Linux. Its performance was also quite impressive: in my tests I saw only about a 10% performance loss compared to native Linux. However, there were a few problems: WSL2 consumes a lot of memory, and a couple of tasks are still only possible on native Linux.

In the end, I did my development in VS Code’s remote container window, which gave me access to the Docker container on a remote server.

Scraper development

When you first start developing crawlers with Scrapy, it’s easy to get confused. If you are not familiar with the basics, it’s good for your mental health to start by following the tutorial in Scrapy’s official documentation.

First, when you start a new project by typing ‘scrapy startproject projectname’ in a terminal, a folder with the project name and several .py files are generated automatically, like the image on the left. Think of the allnews_bot.py file (the name is configurable) in the spiders folder as the Python code that defines the actual crawler, while items.py, middlewares.py, pipelines.py, and settings.py in the parent directory are the files that help you streamline the whole workflow.

As this may seem quite complicated, you might think, ‘Why not build the crawler with only the bot’s .py file and none of the other .py files?’ That approach is not only very inefficient for handling crawled data but can also turn the whole codebase into spaghetti later. So if you want to build a decent crawler, you have to learn how to use items.py and pipelines.py.

items.py is the file where you define item objects that hold the data scraped in allnews_bot.py before it is sent to a database. Each field of an item can map to a column in a database table, and an item’s structure is similar to a Python dictionary. Specifically, in allnews_bot.py you import the AllnewsItem class defined in items.py and instantiate item objects. You then store crawled data in those item objects and hand them to pipelines.py with the yield statement.
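Here is a minimal sketch of the item idea. The original items.py most likely subclasses scrapy.Item; this version uses a plain dataclass instead, which Scrapy also accepts as an item type in recent versions, so the snippet runs without Scrapy installed. The field names and the parse_article helper are assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class AllnewsItem:
    # Each field would map to one column of the news table (assumed field names).
    title: str
    url: str
    content: str


def parse_article(raw_rows):
    """Stand-in for a spider callback: yields one item per scraped article."""
    for row in raw_rows:
        yield AllnewsItem(
            title=row["title"].strip(),
            url=row["url"],
            content=row["content"],
        )


# Hypothetical scraped data.
rows = [{"title": " Sample headline ", "url": "https://example.com/1", "content": "Body"}]
items = list(parse_article(rows))
print(asdict(items[0]))
```

In a real spider, each yielded item is picked up by Scrapy and passed on to the pipeline.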

Left: items.py, Right: allnews_bot.py

One thing to keep in mind is that Scrapy handles every task asynchronously, which means each piece of crawled data is passed to pipelines.py as soon as it is stored in an item object, even while scraping is still in progress. In other words, it doesn’t wait until all the work is done before sending data; it sends crawled items to pipelines.py one by one.

Then, in pipelines.py, you can store what you received in a database right away, or you can preprocess the data before storing it.

What if you use neither items.py nor pipelines.py?

When I first developed a scraper, I didn’t use them and tried to do everything in allnews_bot.py alone. To do so, I had to use a Pandas DataFrame or a Python dictionary to gather and preprocess the crawled data: filtering duplicates, sorting, handling errors. As a result, the code became so convoluted that it was almost impossible to maintain.

So I decided to just do it the way Scrapy recommends.

As for middlewares.py, it’s similar to extensions for the Chrome web browser. Scrapy comes with various features, including routing through proxy servers and modifying request headers, that help you avoid being banned by a server and streamline the crawling process. Once you add a few lines of code to middlewares.py, you can use those features without wiring everything up yourself.
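As an illustration of the "few lines of code", the sketch below shows a class shaped like a Scrapy downloader middleware that sets a User-Agent header on each request. The class name and the User-Agent strings are hypothetical, and FakeRequest is only a stand-in for scrapy.Request so the snippet runs without Scrapy installed.

```python
import random


class RotateUserAgentMiddleware:
    """Downloader-middleware-shaped class that picks a User-Agent per request."""

    USER_AGENTS = ["ExampleBot/1.0", "ExampleBot/2.0"]  # hypothetical values

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to keep processing this request


class FakeRequest:
    """Minimal stand-in for scrapy.Request, for demonstration only."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
result = RotateUserAgentMiddleware().process_request(request, spider=None)
print(request.headers)
```

In a real project, a class like this only takes effect once it is registered under DOWNLOADER_MIDDLEWARES in settings.py.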

Lastly, settings.py is the file where you manage all of Scrapy’s settings. You can switch any of Scrapy’s features on or off there. You can also tell Scrapy to ignore a server’s robots.txt by changing the ROBOTSTXT_OBEY variable. You can find further information in the comments of the settings.py file.
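For orientation, a settings.py fragment might look like the sketch below. The setting names are real Scrapy settings, but the project name, the values, and the pipeline class path are assumptions, not the author's actual configuration.

```python
# settings.py (fragment); "allnews" is an assumed project name.
BOT_NAME = "allnews"

# The project template generated by startproject sets this to True;
# setting it to False makes Scrapy skip the robots.txt check.
ROBOTSTXT_OBEY = True

# Throttling knobs: how many requests run at once and the pause between them.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

# Items pass through pipelines from lower to higher numbers.
# The class path below is hypothetical.
ITEM_PIPELINES = {
    "allnews.pipelines.AllnewsPipeline": 300,
}
```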

From here on, I’d like to talk about the actual development.

Assuming there’s only one news website to crawl, the workflow of Scrapy is as follows.

The workflow above is applied to every news website, and one cycle ends when all of them have been crawled.

In my case, I built a scraper to crawl more than 50 news websites every 10 seconds.

Here’s the code in detail. The image on the left shows the URLs of the news websites to crawl, and the image on the right shows the code that sends a request to each URL with scrapy.Request to start crawling.

After the requests are sent, news links are extracted by the parse function defined for each news website. One of Scrapy’s advantages is that it supports CSS and XPath selectors, which makes it very easy to extract the text you want from HTML. That is, you don’t have to write Python regular expressions to extract text, because Scrapy does the extraction for you given a CSS selector or an XPath expression, as below.

If you use Google Chrome, you can add SelectorGadget as a Chrome extension and use it to find the CSS selector or XPath for an element on a web page.

After that, the parse() functions hand over all the news links and the selector information to send_links_to_crawl(). send_links_to_crawl() acts as a filter that checks for duplicates by comparing the received news links with those stored in the database. It then sends the filtered links, along with the selector information from parse(), to crawl_articles().
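The duplicate check at the heart of send_links_to_crawl() could look roughly like this. The helper name filter_new_links is hypothetical, and the stored links are passed in as a plain set rather than queried from MariaDB, to keep the sketch self-contained.

```python
def filter_new_links(scraped_links, stored_links):
    """Keep links that are not stored yet, preserving order and dropping repeats."""
    seen = set(stored_links)
    fresh = []
    for link in scraped_links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


# Hypothetical data: one link is already in the database, one appears twice.
stored = {"https://example.com/a"}
scraped = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/b",
]
new_links = filter_new_links(scraped, stored)
print(new_links)
```

Only the links that survive this filter need to be requested and parsed, which keeps each cycle short.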

The reason it passes the selector information separately is that each news website is structured differently, so each one needs its own CSS or XPath selectors. Otherwise, I would have had to write a separate crawling function for every news website, which would make the whole code convoluted and hard to maintain.

So I chose this method to keep the code succinct by sharing the common tasks.

title_selector and content_selector are string objects, and they are ultimately delivered to crawl_articles() via send_links_to_crawl(). Those string objects are then converted into commands by the eval() function in send_links_to_crawl().
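The idea of one shared crawling function driven by per-site selector strings can be sketched as follows. Note that this is a variation, not the code described above: instead of eval(), it passes the selector strings straight to response.css(). FakeResponse and FakeSelectorList are stand-ins for Scrapy's response objects so the snippet runs on its own, and the selectors and site name are made up.

```python
class FakeSelectorList:
    """Stand-in for the object returned by response.css() in Scrapy."""

    def __init__(self, values):
        self._values = values

    def get(self):
        return self._values[0] if self._values else None

    def getall(self):
        return list(self._values)


class FakeResponse:
    """Stand-in for a Scrapy response; maps selector strings to extracted text."""

    def __init__(self, url, data):
        self.url = url
        self._data = data

    def css(self, query):
        return FakeSelectorList(self._data.get(query, []))


def crawl_article(response, title_selector, content_selector):
    """One shared callback: the per-site selectors arrive as plain strings."""
    return {
        "url": response.url,
        "title": response.css(title_selector).get(),
        "content": " ".join(response.css(content_selector).getall()),
    }


# Hypothetical per-site selector table.
SELECTORS = {"example.com": ("h1.headline::text", "div.body p::text")}

response = FakeResponse(
    "https://example.com/1",
    {"h1.headline::text": ["Sample headline"], "div.body p::text": ["First.", "Second."]},
)
article = crawl_article(response, *SELECTORS["example.com"])
print(article)
```

In a real spider, the two selector strings could travel with each request through the cb_kwargs argument of scrapy.Request, which avoids evaluating strings as code.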

The core part of scraping is almost done. With the work in allnews_bot.py finished, it’s now pipelines.py’s turn. All you have to do is decide what to do with the Item objects that hold the scraped data. If you want to preprocess them further, pipelines.py is the place to do it.

In my case, I handled everything here, from dealing with duplicated data to creating the database tables and columns. I also made two identical tables to store scraped news: a temporary table for the articles newly found in each scraping cycle, and a permanent repository for all news articles.
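A pipeline along these lines might look like the sketch below. It uses an in-memory SQLite database as a stand-in for MariaDB so that it runs anywhere, and the table names, columns, and class name are assumptions. The method names open_spider, process_item, and close_spider are the hooks Scrapy calls on a pipeline class.

```python
import sqlite3


class AllnewsPipeline:
    """Pipeline-shaped class with the hooks Scrapy calls on an enabled pipeline."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(":memory:")  # stand-in for MariaDB
        for table in ("news_latest", "news_archive"):  # hypothetical table names
            self.conn.execute(
                f"CREATE TABLE IF NOT EXISTS {table} (url TEXT PRIMARY KEY, title TEXT)"
            )
        # Start each cycle with an empty temporary table.
        self.conn.execute("DELETE FROM news_latest")

    def process_item(self, item, spider):
        row = (item["url"], item["title"])
        # INSERT OR IGNORE skips rows whose url is already in the archive.
        cur = self.conn.execute("INSERT OR IGNORE INTO news_archive VALUES (?, ?)", row)
        if cur.rowcount == 1:  # a genuinely new article
            self.conn.execute("INSERT INTO news_latest VALUES (?, ?)", row)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


pipeline = AllnewsPipeline()
pipeline.open_spider(spider=None)
for item in [
    {"url": "https://example.com/1", "title": "A"},
    {"url": "https://example.com/1", "title": "A"},  # duplicate, ignored
    {"url": "https://example.com/2", "title": "B"},
]:
    pipeline.process_item(item, spider=None)

latest_count = pipeline.conn.execute("SELECT COUNT(*) FROM news_latest").fetchone()[0]
archive_count = pipeline.conn.execute("SELECT COUNT(*) FROM news_archive").fetchone()[0]
print(latest_count, archive_count)
```

The temporary table then holds exactly the articles that are new in this cycle, which is what a notification step would read.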

Scrapy Test

You can start crawling by typing ‘scrapy crawl allnews_bot’ (use your own bot name) in a terminal, in the spiders folder where allnews_bot.py lives.

Once it starts without an error, you can see output like below.

Using Workbench, you can easily check that all the data is stored in the database correctly.

The next post will be about deploying the Scrapy crawler using the Telegram bot.
