How do I convert a list of tokens back into a single string (e.g., with a space as a delimiter)?

I have converted a set of documents into a set of lists of strings so I could apply some of the text and text mining functions to normalize the documents (e.g., correct common spelling errors). Now I want to derive character-based n-grams, for which I need string (not list) input. How do I convert the lists back to strings? I couldn't find a function for that.

Steve Bernstein

4 comments

  • Pablo Redondo

    You can convert a list to JSON, which is also a string. The function is toJSON(). Then you can apply string functions (replace, tokenize, etc.).

  • Steve Bernstein

    Got it, thanks Pablo. Then I can use REPLACEALL to remove the JSON array decoration and reconstitute the normalized string for n-gramming.

  • Joel Stewart

    The T function converts most types to the String type. With a List object, though, the resulting string includes extra punctuation.

    Here is an example: 

    T([apple, orange]) returns a string with value: ["apple", "orange"]

    In your later analysis the following characters may be undesirable: [],"

    You can remove these by using REPLACE, REPLACEALL or SUBSTR functions.

  • Steve Bernstein

    Here's how I cleaned up the JSON array:

    TRIM(REPLACEALL(#toJSON_string;"[\",\\[\\]]+";" "))

    This replaces each run of delimiting decoration (quotes, commas, and brackets) with a single space character, and TRIM then removes the leading and trailing spaces.

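For readers who want to try the same idea outside the product, here is a rough Python sketch of the approach worked out in the comments above: serialize the list to a JSON array, replace runs of quotes, commas, and brackets with a single space, then trim. The function names (tokens_to_string_via_json, char_ngrams) and the sample tokens are made up for illustration, and Python's json.dumps and re.sub stand in for toJSON() and REPLACEALL; this is an analogy under those assumptions, not the product's own implementation.

```python
import json
import re


def tokens_to_string_via_json(tokens):
    """Serialize a token list to a JSON array, then strip the array decoration."""
    # Compact separators give '["apple","orange"]' with no spaces after commas.
    as_json = json.dumps(tokens, separators=(",", ":"))
    # Same character class as in the thread: double quotes, commas, square brackets.
    # Each run of these characters collapses to one space.
    cleaned = re.sub(r'[",\[\]]+', " ", as_json)
    # Equivalent of TRIM: drop the leading and trailing space left by the brackets.
    return cleaned.strip()


def char_ngrams(text, n):
    """Return all character n-grams of length n, in order (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = ["apple", "orange"]  # sample data, not from the original documents
joined = tokens_to_string_via_json(tokens)
trigrams = char_ngrams(joined, 3)
print(joined)
print(trigrams[:3])
```

One caveat with the replace-based cleanup: any token that itself contains a comma, bracket, or double quote will be altered as well, since the pattern cannot tell decoration from content. Where a plain join is available (in Python, " ".join(tokens)), it avoids that problem.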