How to Parse a FileUpload and Analyze Using Regex

Goal

Do you have data you would like to bring into Datameer, but maybe the connector does not come standard? Are you interested in using advanced regex to build an analysis?

In this example, you will upload a file using the Regex Parsable Text option and build an analysis using advanced regex.

 

Learn

Sample data

Download demo data: cloudera-server.log

 

Load the example file into Datameer

  1. Start a FileUpload into your Datameer instance so you can work with the demo data. The demo data you will use is a log file from Cloudera.
     
  2. Navigate to the file and select Regex Parsable Text as your data type. Choosing the Regex Parsable Text file type allows you to parse the file using your own regex. Click next.



  3. Now you need to parse the file. To do this, use the following Regex expression: (The regex parse pattern needs to be precise. Ensure you use the exact parse pattern below including the spaces)

    (\d{4}-\d{2}-\d{2} \d+:\d+:\d+,\d+)  (\S+) (\[.*\]) (.*)



    Parse pattern break down

    Now that you have the pattern, let’s break it down. Anything between () represents a column.

    First you have your timestamp: 2014-03-25 18:19:36,311

    (\d{4}-\d{2}-\d{2} \d+:\d+:\d+,\d+)

     

    The \d represents Digit, so the first part (\d{4}- tells Datameer to look for 4 digits until there is a -. Therefore, you use a combination of 4 digits-2 digits-2 digits to get the date.
    Next, you break down the time with the second half. The \d+: looks for any digit with 1 or more characters before the : and so on.

    Now you have your next column, the log level: INFO

    (\S+)

     

    This simply looks for any string \S with 1 or more characters +.

    Next, you have your thread column: [Parcel Update Service:components.LocalParcelManagerImpl@105]

    (\[.*\])

     

    This part of the regex will look for contents in [], with 0 or more characters *.

    Last, you have your message column: Updating parcel state based on local filesystem state and remote repos.

    (.*)

     

    This is just a body of text, so the regex looks for any characters, any length.

    Learn more about regular expressions. 



    Click Next to continue uploading the file. 

  4. Now you will see a preview of the parsed data. It should look something like this:



    You will need to make sure you change all column names as shown above. Additionally, make sure you change the Date column to a DATE type. Use this date pattern:

    yyyy-MM-dd HH:mm:ss,SSS
  5. Once you have made your changes, click through until you can save your file. Name your upload ClouderaServerLogs.

 

Cleaning the logs in a workbook

  1. Add the data to a workbook after the file upload is complete.



  2. Now you can clean up the data using some more regex. Right click on your source sheet tab at the bottom of the workbook. Select to create a duplicate worksheet and duplicate all columns. 



  3. In the duplicated sheet, click in the next empty column to bring up the formula builder. 

    Select the REGEXTRACT function. The string will be your Thread column and the REGEX will be ^(\S+):

    Regex explained

    ^(\S+): 

    "^" means “begins with”.
    "(\S+):" is anything that is a readable ascii string.




    Click OK. 
    The workbook should look like this:



    The output gives us the name of the tokens with some extra characters.

  4. Now you will remove the extra characters from the Tokens column. 
    Click in the next empty column and find REPLACEALL in the formula builder. 
    Choose the Tokens column as your string and use \[|: for your Regex.

    Regex explained

    \[|:

     "\" escapes special character - want to treat [ as an actual open bracket instead of reading the [ as Regex.
     "|" is the OR operator.
     ":" is anything before the colon.




    Click OK. 
    The workbook should look like this:



    As you can see, the special characters have been removed from the new Tokens1 column.

  5. Now you can build this out further as there is more information depending on the Thread column. 

    Click in the next empty column and close the formula builder that pops up. You will be creating this analysis in the formula bar at the top.
    Enter the following into the formula bar: Fx=

    REPLACEALL(REGEXTRACT(#Thread;"(^\\[\\S+):");"\\[\\d*\\@|\\[\\d*|\\[|\\d*:|-\\d*:|-\\d-\\d:|:") 

    This combines both the REGEXTRACT function you already performed, the REPLACEALL function, along with some additional regex to parse all the data that could be in the Thread column. Below is a description of the new regex.

    Regex explained

    \[\d*\@|\[\d*|\[|\d*:|-\d*:|-\d-\d:|: 
     
    "\[\d*\@" Anything starting with [ and followed by a number and before @
    OR
    "\[\d*" Anything starting with [ and followed by a number
    OR
    "\[" Bracket
    OR
    "\d*:" A digit followed by :
    OR
    "-\d*:" A dash and digit followed by :
    OR
    "-\d-\d:" Special case - 2 dashes
    OR
    ":" Colon


    The workbook should look like this:



    Name your final column ProcessName, and the sheet CleanData.

 

Analyse your findings

Now that you have clean data, you can analyze the process, as well as being able to tell what kind of threads are running, how many processes, what had warnings, etc. on a daily basis.

Process analysis

  1. Create a new sheet by clicking the + at the bottom of the page.
     
  2. In the first column, perform a GROUPBY on the ProcessName column from the CleanData worksheet. Name this column ProcessName in the new worksheet.





  3. In the next column, perform a GROUPBYBIN on the Date column from CleanData. Your bin size should be 1d.





  4. In the third column, do a GROUPBY of the Level column from your CleanData worksheet. 






  5. Finally, click in the next empty column and preform a GROUPCOUNT.



 

Monitoring the last four hours of data 

  1. Create a new sheet by clicking the + at the bottom of the page.
     
  2. In the first column, GROUPBY Level from your CleanData worksheet.





  3. In the next column, GROUPBY the ProcessName from your CleanData worksheet.





  4. In the third column, GROUPBYBIN on the Date column from CleanData. Your bin size is 4h.





  5. In the last column, perform a GROUPCOUNT.



 

As you can see, this log data can provide many insights that you were able to glean by importing your file using Regex

 

“Got a question? Have an answer? Join the Datameer Community!