Creating categories from common words or group by common words in a text field


Hello I have 3 columns of data that involve information for help desk tickets, such as description, action and resolution.  I want to try and find what words are the most repeated across these 3 columns and then have that word returned in another column to try and help set up categories.  For example if Description says:  The database is down and not responding.  The action says: User reported that the database is not responding.  The resolution says: the database server was rebooted, the database is now back up.  Then this would return the word database.  I know there is some risk that little words like is and the could throw this off so ideal I would combine the text, and have it ignore those types of words.  Right now I am doing this but it is through using the following column calculations:  CONCAT (all 3 fields), LOWER (the concat result), EXTRACT_WORDS (lower result), REMOVE_STOP_WORDS (extract_words result), STEM_WORDS (remove_stop_words result), T (remove_stop_words result), COUNTMATCHES (t result and then a list of categories I had to come up with on my own first), MAX (results of COUNTMATCHES), IF(EQUALS (against the results of the Countmatches and the Max), and then COALESCE.  

This meant putting in hours of work trying to manually figure out what categories I might have feed the formula for COUNTMATCHES.  I want to know if there is a way to do this heavy lifting on the datameer side by having it return the most likely categories instead.  In the example above if I have 6 categories then I have to have six columns of COUNTMATCHES where I have to tell it what keywords there are for each category.  

Is there a better way?


Mary Ayala

1 comment

  • Avatar
    Alan Mark

    Hi Mary,

    First, do your sanitation steps, removing stop words, converting to lower, etc.

    Then follow these steps:
    1. On a new sheet in column one simply add your sanitized column.
    2. On the same sheet in the second column TOKENIZE(#Sheet1!A) to break out all the words for the record into single row entries.
    3. Create a second new sheet.
    4. In the first column on the second sheet, so we know which record our tokenized counts will represent, GROUPBY(#Sheet1!A)
    5. In column two on the second sheet, GROUPBY(#Sheet1!Tokenized_A)
    6. In column three on the second sheet, GROUPCOUNT()
    7. Create a third new sheet.
    8. In the first column on the second sheet, GROUPBY(#Sheet2!A)
    9. In the second column, GROUPTOPN(#Sheet2!Count;1)
    10. Lasty, in the third column, #Sheet2!Tokenized_A

    This will give you the most prevalent word as well as a count of the number of instances for that word.

Please sign in to leave a comment.