Real-time tweet analysis platform
Design a platform that can give real-time insights for any desired Twitter hashtag.
It was quite challenging to build a platform that could combine a real-time data stream with results from past analyses to give insights into the latest trends on Twitter. The very first problem we had to address was designing a data structure for storing results. No client using TweetOLAP wants to wait minutes for results, so our task was to choose a structure that could meet a target response time of 1-2 seconds for any arbitrary hashtag.
We found our solution in a new family of databases that store data as key-value pairs. HBase is a popular columnar database that can use the Hadoop Distributed File System (HDFS) as its underlying storage, and it was the choice we made to address this problem. We decided to represent each timestamp as a row and each hashtag as a column; the intersection of a row and a column holds the result of the particular analysis to which the table belongs.
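The row/column layout described above can be modeled in a few lines. The sketch below simulates it in memory rather than against a live HBase cluster, and the column-family name, bucket format, and values are illustrative assumptions, not the production schema:

```python
# In-memory model of the layout described above: each row key is a timestamp
# bucket, each column qualifier is a hashtag, and the cell holds one analysis
# result for that hashtag in that interval. All names are illustrative.
from collections import defaultdict

class ResultTable:
    def __init__(self):
        # row key (timestamp bucket) -> {column qualifier -> value}
        self.rows = defaultdict(dict)

    def put(self, timestamp_bucket, hashtag, value):
        # "results" stands in for an HBase column family (hypothetical name)
        self.rows[timestamp_bucket]["results:" + hashtag] = value

    def timeline(self, hashtag):
        """Full timeline of results for one hashtag, ordered by row key."""
        col = "results:" + hashtag
        return [(ts, cells[col])
                for ts, cells in sorted(self.rows.items())
                if col in cells]

table = ResultTable()
table.put("2014-06-01T10:00", "#worldcup", 0.7)
table.put("2014-06-01T10:01", "#worldcup", 0.4)
print(table.timeline("#worldcup"))
# -> [('2014-06-01T10:00', 0.7), ('2014-06-01T10:01', 0.4)]
```

Because timestamps sort lexicographically as row keys, a timeline query reduces to a contiguous scan, which is what keeps the response time low.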
This helped us maintain a timeline of results: we could query the full timeline for any type of analysis on any required hashtag. The second problem that needed to be addressed was the growing demand on processing time. Every tweet received through the API client arrives in JSON format; we needed to extract all the hashtags from the tweet, calculate a sentiment score for it, categorize it according to that score, and then update the database accordingly.
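The per-tweet work just listed can be sketched end to end. The word lists and thresholds below are placeholder assumptions standing in for the real sentiment model; the hashtag path follows Twitter's JSON layout, where hashtags sit under `entities.hashtags`:

```python
# Sketch of the per-tweet steps described above: parse the JSON payload,
# pull out every hashtag, score the text, and bucket it by sentiment.
# POSITIVE/NEGATIVE are toy word lists, not the production scorer.
import json

POSITIVE = {"great", "love", "win"}
NEGATIVE = {"bad", "hate", "lose"}

def extract_hashtags(tweet):
    # Twitter's JSON carries hashtags under entities.hashtags
    return [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]

def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def categorize(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

raw = '{"text": "Great win today", "entities": {"hashtags": [{"text": "worldcup"}]}}'
tweet = json.loads(raw)
score = sentiment_score(tweet["text"])
print(extract_hashtags(tweet), score, categorize(score))
# -> ['worldcup'] 2 positive
```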
We found our solution in Apache Storm. We divided the task into smaller chunks and created a Storm topology to complete it step by step; using Storm spared us the overhead of designing a distributed system ourselves. The platform could not have succeeded without addressing our third problem: flexibility. We wanted to design something reusable, a single platform performing various tasks that could be redefined at any time, rather than hard-wiring only the analysis functions into the platform. So the TweetOLAP design was changed; each component became standalone.
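The division of work above has the shape of a Storm topology: one spout emits raw tuples and each bolt handles one step. Real Storm wires these together with `TopologyBuilder` in Java; the in-process sketch below only shows the shape, with illustrative spout and bolt names:

```python
# Storm-style decomposition, modeled in-process: a spout produces tuples
# and a chain of bolts transforms them one step at a time. This is a
# structural sketch, not the Storm API.
class Topology:
    def __init__(self, spout, bolts):
        self.spout, self.bolts = spout, bolts

    def run(self):
        results = []
        for tup in self.spout():        # spout: source of raw tuples
            for bolt in self.bolts:     # each bolt does one small chunk of work
                tup = bolt(tup)
            results.append(tup)
        return results

def tweet_spout():
    # stand-in for the Twitter API client feeding the topology
    yield {"text": "love this match", "tags": ["worldcup"]}

def score_bolt(t):
    t["score"] = 1 if "love" in t["text"] else -1   # placeholder scorer
    return t

def label_bolt(t):
    t["label"] = "positive" if t["score"] > 0 else "negative"
    return t

out = Topology(tweet_spout, [score_bolt, label_bolt]).run()
print(out[0]["label"])
# -> positive
```

In the real system each bolt runs on its own workers, so Storm handles the distribution and fault tolerance that would otherwise have to be built by hand.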
We created the analytics functions as plug-ins that could be registered with the system. When TweetOLAP captures a tweet from the API, it checks for the registered plug-ins and passes the tweet to each one. The implementation of a plug-in is kept outside the core design of the system, which gave the system tremendous flexibility: we can now change its functionality at any time, adding new features to or removing existing features from TweetOLAP.
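The registration mechanism can be sketched as a simple registry: plug-ins register under a name, and every captured tweet is handed to each registered function. The class and method names below are illustrative, not the actual TweetOLAP API:

```python
# Sketch of the plug-in mechanism described above: analytics functions
# register with the core, and each incoming tweet is passed to every
# registered plug-in. Names are hypothetical.
class TweetCore:
    def __init__(self):
        self.plugins = {}

    def register(self, name, func):
        self.plugins[name] = func       # add a new analysis at runtime

    def unregister(self, name):
        self.plugins.pop(name, None)    # remove a feature at runtime

    def on_tweet(self, tweet):
        # run every registered analysis on the captured tweet
        return {name: func(tweet) for name, func in self.plugins.items()}

core = TweetCore()
core.register("length", lambda t: len(t["text"]))
core.register("shout", lambda t: t["text"].isupper())
print(core.on_tweet({"text": "GOAL"}))
# -> {'length': 4, 'shout': True}
```

Because the core only ever iterates over whatever is registered, adding or dropping an analysis never requires touching the core itself.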
All we need to do is add or remove plug-ins from the core system. Whenever there is a speed mismatch between the Twitter API and the client, Twitter breaks the connection, so we needed a mechanism to monitor the connection and re-initiate it when required. We created a custom server with a proprietary protocol; through a pre-defined port and protocol we could control transfers and processing from a dashboard.
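The monitoring idea can be sketched as a wrapper that watches the stream and re-initiates the connection whenever it drops. The retry policy and the connection interface below are assumptions for illustration; they are not the proprietary protocol itself:

```python
# Sketch of the monitor described above: wrap the streaming connection and
# re-initiate it when Twitter breaks it. Backoff values and the connection
# factory interface are illustrative assumptions.
import time

class MonitoredStream:
    def __init__(self, connect, max_retries=3, backoff=0.0):
        self.connect = connect          # factory yielding tweets from a live connection
        self.max_retries = max_retries
        self.backoff = backoff

    def read_forever(self, handle):
        retries = 0
        while retries <= self.max_retries:
            try:
                for tweet in self.connect():
                    handle(tweet)
                return                  # stream ended cleanly
            except ConnectionError:
                retries += 1
                time.sleep(self.backoff * retries)  # back off, then reconnect

# Fake connection that is dropped once, then delivers tweets.
attempts = []
def flaky_connect():
    attempts.append(1)
    if len(attempts) == 1:
        raise ConnectionError("speed mismatch, connection broken")
    yield {"text": "back online"}

seen = []
MonitoredStream(flaky_connect).read_forever(seen.append)
print(len(attempts), seen[0]["text"])
# -> 2 back online
```

The real dashboard talked to this monitor over the pre-defined port to pause, resume, or restart transfers; the sketch covers only the automatic re-initiation.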
The client's identity and the research methodology shared with us have been kept confidential in this document, in accordance with our confidentiality agreement with the client and our policy. If you would like to know more about this research, please write to us at email@example.com.