How to Extract Text and Hashtags From Tweets
To extract hashtags from tweets and filter for your most mentioned tweets on Twitter.com.
Regular expressions are a powerful tool. Being able to narrow down relevant data from the sea of noise can give you a better understanding of the big picture.
Follow along by downloading this Twitter sample data:DatameerTweets.csv
This sample data was created by importing data via Twitter. We set up a connection using the Twitter Rest API connector and configured an import job to look for tweets that contain the word “Datameer”.
You can upload the this .csv file to Datameer as a FileUpload.
We are looking for tweets containing hashtags (#). The first step is to clean up the data and remove any tweets that do not contain hashtags.
Click on the Filter button in the toolbar.
We will create a simple filter that filters data from the text columns that contain "#" characters.
Select the column, use the "contains" expression, and filter by the # sign. Check the box at the top to create this filter in a new sheet.
Name your new filter sheet "TweetsContainHashtags".
Write Some Regex
Now that we have filtered down our data, we can work on extracting the hashtags with our functions. Creating a new sheet by clicking the + tab at the bottom, then copy over the tweet text in column A with by selecting the Fx bar and adding the formula:
Extracting the hashtags with the function REGEXTRACT:
Click in the next column to bring up the function wizard. Select the function REGEXTRACT.
The string we want to use is the new text column we just copied and the regular expression will be \#\w*
After clicking OK it will look like this in the formula bar at the top.
An explanation of this regular expression:
The \ character escapes the REGEX expression. The # is where the REGEX will begin extracting the hashtag. The \w represents “word character”, which is anything A-Z, a-z. or 0-9. The * at the end of the expression will look for 0 or more words after the hashtag.
Rename this worksheet "HashtagExtract". The column name for the extracted tags should be called Tokens.
The worksheet for the hashtag extraction should look like this:
Find the Top Hashtags
Now we will determine which hashtags are being used most in our sample data.
Create a new worksheet and perform the following:
Use the GROUPBY function on the Tokens column from your HashtagsExtract sheet.
Click in the next column and use the GROUPCOUNT function to count the occurrences of the hashtags in the data set:
The last step is to find the top 10 most frequent hashtags.
Sort the data by clicking the sort button in the toolbar.
Sort by the count based on the GROUPCOUNT column:
This last step will show you the most frequent hashtags used in your Twitter tweet texts.
Run workbook to get full results!
“Got a question? Have an answer? Join the Datameer Community!”