Expand a list to columns, one list element per column on the same row

3 followers
0
Avatar

I'm working on clustering a corpus of free text documents.  On each row I have a document ID in the first column and an ordered element list of integer values in the second.  In order to take advantage of K Means Clustering in Advanced Analytics, I need to spread the list across columns.

How can I do that?

Steve Bernstein

8 comments

  • Avatar
    Grant Eaton

    Hi Steve,

    There is a function called EXPAND that extrapolates a list/array into rows, but this isn't quite what you've asked for. I have an idea - do you know what the maximum number of array/list values you might have? I.e., is it a known maximum number of columns you'll need to create?

    0
  • Avatar
    Steve Bernstein

    Hi Grant.  You're correct, EXPAND doesn't apply in this case.  Yes the max is known--I set that.  There are exactly 1000 list elements in each document's list.  I picked that length though in the steps to create the list from the docs.  I could shorten it, but it needs to be pretty big, say >500.

     

    Super curious to hear your idea...

    0
  • Avatar
    Grant Eaton

    Hi Steve,

    It sounds like there would be exactly the same number of populated columns per row, which makes this a bit easier (or more computationally efficient) as compared to an indeterminate number of columns per row, which would require some lightweight Boolean logic for each column.

    However, my idea probably isn't feasible for such a huge number of columns as you stated. My idea is that if it was a small number of columns you were deciphering, say, less than 30-60, I would manually create each column as a formula to pull-out the array's positional values using LISTELEMENT.

    Another option, although I haven't quite thought through it in detail, would be to convert the list/array into a JSON object and then export that to JSON format. You might then be able to re-ingest the JSON object and have the array values materialized into columns. There are a number of JSON functions that could potentially help with this... Perhaps someone else will respond here with a more concrete way of how to achieve this using JSON functions and/or export/import via the JSON format as I don't have the precise recipe figured out in my head.

    Does that last idea give you something you can experiment with?

    0
  • Avatar
    Steve Bernstein

    Ah, right, it occurred to me too that manually inserting values via LISTELEMENT functions would work--wondered if the API could be used to automate/scale this somehow (I'm completely unfamiliar with the API).

    Thought the SELECT function might be relevant, but I think the task is equally manual.  There's a GROUPSELECT too, same problem.

    There's a Pivot Table widget documented--that might work--need to explore how to deploy it.

    0
  • Avatar
    Grant Eaton

    I'm not familiar enough with the REST API to say for sure, but I'm 95% certain this could be automated via code - I'm not a coder, however.

    0
  • Avatar
    Pablo Redondo

    Hi Steve,

    You can use the REST API to achieve this task. I put together a quick python script that should achieve what you are looking for. It might need some minor tweaks but should work. Not sure if you are familiar with python, but feel free to shoot me an email if you need assistance.

    FYI: The current Java API wont allow you to create a function that wouldautomatically populate multiple columns at once.

    Script:

    #creates muliple columns froma list field
    import requests
    import json

    #input variables
    url = 'http://localhost:8081/rest/workbook/'
    user = 'admin'
    pwd = 'admin'
    workbook_id = '1038'
    mysheet = 'Expand_List' #Worbook's sheet name were you have the list element
    mycolumn = 'List' #Column name with the list
    listsize = 10 #List max size
    newWbk = "/Users/admin/Research/Miscelanious/Workbook_NEW.wbk" #Name and path of new workbook.

    #other variables
    headers = {'content-type': 'application/json', 'Accept-Charset': 'UTF-8'}

    #get workbook definition
    data = requests.get(url+workbook_id, auth=(user,pwd))
    if data.status_code == 200:
    print "Succesful request..."
    json_data = data.json()

    #find sheet with information
    sheets = json_data['sheets']
    for i in sheets:
    if i['name']==mysheet:
    nextColumn = i['nextColumnId']
    #create a column for each list element
    for x in range(0,listsize):
    i['formulas'].append({"columnName": "C","columnId": str(nextColumn+x),"columnIndex": nextColumn+x,"formulaString": "=LISTELEMENT(#List;"+str(x)+")"})
    i['columnStyles'].append({"columnId": str(nextColumn+x),"name": "Col_"+str(nextColumn+x),"width": 100,"thousandSeparator": True})
    #update workbook json definition
    json_data['sheets']=sheets

    #Change name
    json_data['file']['path']=newWbk
    print json_data
    #Create new workbook with new parameters
    r=requests.post('http://localhost:8081/rest/workbook', auth=(user,pwd),headers=headers,json=json_data)
    print r.text
    else:
    print "Error while requesting workbook definition."
    print data

    0
  • Avatar
    Steve Bernstein

    Thanks Pablo, awesome.  I know enough Python to be able to understand and edit this as needed.  Just need to sort out how to integrate into the REST API--that'll be new.

    0
  • Avatar
    Steve Bernstein

    You know what would be much better?--update the K Means function to support a LIST as input.  It has to be the case that most applications will end up with a list--the data almost never start out as a set of columns all ready for clustering.  The whole point is to transform to create the dimensional data.

    0
Please sign in to leave a comment.