Scrape News and Corporate Announcements in Real-Time — 2 (Deployment)

The previous post can be found in the link below.

Scrape News and Corporate Announcements in Real-Time -1 | by Charlie_the_wanderer | Mar, 2021 | Medium

Since I explained how to develop a scraper with Scrapy in the previous post, let's see how to deploy it with a Telegram bot.
Telegram officially provides a bot API for developers. It works as follows.

A Telegram bot sits between a user and a developer, passing the user's input along to the developer. With that input, the developer can execute any application on the back end and return the results to the user via the bot. In short, you can think of a Telegram bot as a go-between connecting a developer to a user.

Further information about the Telegram bot API can be found in the link below.

One of the biggest advantages of using a Telegram bot is that you get all of Telegram's features, including support for both PC and mobile environments and the ability to send images, audio, and even files.
Without a Telegram bot, you would have to build a messenger yourself from scratch. That's why I chose a Telegram bot for deployment.

As the images above show, the core of my project is that each user can register and remove keywords as they like, and the scraped news articles are filtered by the keywords each user has registered. To make that work, I had to let users register and remove keywords through the Telegram bot, which therefore had to be connected to my database. So I wrote functions that not only handle user input but also store and update those keywords in the database.
After finishing those management functions, I built a function that actually filters the scraped news articles by keyword.
The picture below is the outline of what I built.

As explained in the previous post, the number of crawls stays the same even as the number of users grows, and the filtering itself requires little computational resource, so the system scales well.

Since the full code is too long to include here, I'll focus only on the core parts.

Telegram library

First, you have to import a library to use the Telegram API. The most frequently used classes are Updater and CommandHandler. Updater is like a bucket that holds all the data from the bot's chat room: when a user sends a message to the bot, you can fetch data such as the user's ID and input from the Updater instance.
CommandHandler, in turn, is used to create commands for the bot, so with it you can define any command you want.

The images above should make it easy to understand how this works. After defining the function I want to execute, I register it with a CommandHandler so that it runs when a user types '/help' in the bot's chat room.

When you write code using the Telegram library, the last part of your code will look as follows.

When those two lines run after the command functions have been defined, the Python process stays alive and keeps communicating with the Telegram bot. That is, it doesn't terminate after executing every line of code; instead, it waits indefinitely for user input.
And here comes a big problem. The Scrapy crawler has to run in an infinite loop within the same Python code to crawl news articles periodically. However, if I put an infinite loop in the middle of the code, the workflow would get stuck in that loop and never proceed.
To solve this, I wrapped the Scrapy crawling process in a function and ran it on a separate thread using Python's threading module, so that it executes in parallel with the Telegram bot.

An example of wrapping it in a function is as follows.

As Scrapy is a framework, not a library, you normally type the 'scrapy crawl name_of_your_spider' command in a terminal to start crawling. Therefore, to launch it from Python code, I used the os module to issue the same command the terminal would. Then, inside the scrapy() function, I fetched the crawled news data and user info from the database to filter the articles by each user's configured keywords.
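A minimal sketch of such a function might look like the following; the spider name, the sleep interval, and the database step are placeholders for the parts described above, not the author's exact code:

```python
import os
import time

def scrapy():
    # Run the crawl periodically in an infinite loop
    while True:
        # Scrapy is launched the same way as from a terminal;
        # 'news_spider' is a hypothetical spider name
        os.system('scrapy crawl news_spider')
        # ... fetch the crawled articles and each user's keywords from
        # the database here, filter them, and send the matches via the bot ...
        time.sleep(600)  # wait 10 minutes before the next crawl
```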

After completing the scrapy() function, you can add a thread with the Thread class at the end of the code, as above. The daemon parameter on the first line controls whether the scrapy function is terminated along with the parent process. Since I wanted the scrapy loop to stop when I stop the Telegram bot, I set it to True.
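That thread setup can be sketched as follows (the scrapy function here is a stand-in for the crawling loop defined earlier):

```python
from threading import Thread

def scrapy():
    # Stand-in for the crawling loop described above
    pass

# daemon=True ties the crawler thread's lifetime to the main process:
# when the Telegram bot process exits, this thread is killed with it
crawler_thread = Thread(target=scrapy, daemon=True)
crawler_thread.start()
```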

As for the functions that manage each user's keywords, I used the Python pymysql library to work with the MariaDB database. Once you are familiar with basic SQL and pymysql, there is nothing too complicated about them.

Examples of Telegram functions

However, one thing to be aware of is that you must close the connection to the database with the close() method once you finish working with it. Otherwise, errors can surface in the database I/O layer, and tracking them down can waste a lot of time. So don't forget to close the connection when your work is done.

Finally, here comes the code for filtering news articles.

import pandas as pd

# helper function
# df: crawled articles; keyword_list: one user's keywords;
# one_search_context: search titles and bodies (True) or titles only (False)
def filter_by_keyword(df, keyword_list, one_timeset, one_on_off,
                      one_search_context, one_user_id):
    # Replace None with ' ' to avoid errors when masking df with a boolean list
    df = df.fillna(' ')
    if keyword_list is None:
        return None
    elif 'all' in keyword_list:
        return df
    else:
        # Empty DataFrame to collect the matches for every keyword
        filtered_df = pd.DataFrame()
        for word in keyword_list:
            # Split each keyword by & if it has one
            # ex) word = 'apple&tesla!fruit' -> word_split = ['apple', 'tesla!fruit']
            word_split = word.split('&')
            if one_search_context:
                # Search keywords in both titles and article bodies
                boolean_list = [df['title'].str.contains(w)
                                | df['content'].str.contains(w)
                                for w in word_split]
            else:
                # Search keywords only in titles
                boolean_list = [df['title'].str.contains(w) for w in word_split]
            string = ''
            for i in range(len(boolean_list)):
                if '!' not in word_split[i]:
                    # No '!' in this part: just append it with &
                    string += f"&boolean_list[{i}]"
                else:
                    # '!' detected: split off the excluded terms and append with &~
                    # (only one '!' group per keyword is supported, as the
                    # warning message below tells the user)
                    word_split_exc = word_split[i].split('!')
                    if one_search_context:
                        boolean_exc_list = [df['title'].str.contains(w)
                                            | df['content'].str.contains(w)
                                            for w in word_split_exc]
                    else:
                        boolean_exc_list = [df['title'].str.contains(w)
                                            for w in word_split_exc]
                    for j in range(len(word_split_exc)):  # j, not i: don't shadow the outer index
                        if j == 0:
                            string += f"&boolean_exc_list[{j}]"
                        else:
                            string += f"&~boolean_exc_list[{j}]"
            try:
                string = string.strip('&')
                string = string.strip('!')
                filtered_df_one_keyword = df[eval(string)]
                filtered_df = pd.concat([filtered_df, filtered_df_one_keyword],
                                        ignore_index=True)
            except Exception:
                print('Error occurred while processing &, ! in filter_by_keyword function')
                body = 'Please reset your keywords.\n\n'
                body += f'There is a problem with the keyword {word}.\n\n'
                body += 'When writing a condition, do not switch between & and ! more than once.\n\n'
                body += 'ex) A&B&C!D!E or A!B!C&D&E -> OK\n\n'
                body += 'ex) A&B&C!D&E or A!B!C&D!E -> error'
                # bot is the Telegram Bot instance created elsewhere
                bot.send_message(chat_id=one_user_id, text=body)
        filtered_df = filtered_df.drop_duplicates(subset=['link'])
        return filtered_df

It looks quite complicated because of the try/except clauses, but the intuition behind the code is simple.
First, filter_by_keyword receives the crawled news articles as a pandas DataFrame, together with user info and other settings. Then, using the pandas str.contains method, it finds the articles that contain each keyword in the user's keyword list.
For example, if a user's keyword list is ['Samsung', 'LG', 'Tesla'], the 'for word in keyword_list' loop looks for matching articles one keyword at a time: first those containing 'Samsung', then 'LG', and finally 'Tesla'. Afterwards, it appends all the filtered articles together and removes duplicates with the pandas drop_duplicates() method. This yields the news articles filtered by each of the user's keywords.

The ampersand and exclamation mark in the code are an additional feature that lets users fine-tune their keywords; they serve as logical operators when configuring keywords.
For example, when you want to receive only news that includes both ‘Samsung’ and ‘Semiconductor’, you can do it by typing a keyword like ‘Samsung&Semiconductor’. Also, when you want to receive news that includes only the word ‘Samsung’, not ‘Semiconductor’, then you can register a keyword like ‘Samsung!Semiconductor’.
The code above includes the handling of those ampersand and exclamation-mark conditions.
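As a toy illustration of those two operators with pandas (the sample titles below are made up, not real crawled data):

```python
import pandas as pd

df = pd.DataFrame({'title': [
    'Samsung builds new Semiconductor fab',
    'Samsung launches new phone',
    'Tesla opens Semiconductor lab',
]})

# 'Samsung&Semiconductor': keep rows containing both words
both = df[df['title'].str.contains('Samsung')
          & df['title'].str.contains('Semiconductor')]

# 'Samsung!Semiconductor': keep rows containing 'Samsung' but not 'Semiconductor'
only = df[df['title'].str.contains('Samsung')
          & ~df['title'].str.contains('Semiconductor')]
```

Here `both` keeps only the first sample title, while `only` keeps only the second.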

At first, I wasn't planning to develop any further features. However, as ideas came up during development, I ended up adding several, including natural language processing on text data from online stock-investment communities.
So, the next post will be about how to implement that natural language processing.
