How to do group B - group A?


Use case: For column A, I want to know how many new strings (ones never seen before) were added in the past week, and I want to be able to schedule this job to run weekly.

Creating group A (every unique string before last week) is easy; GROUPBY can do it. The same goes for group B (every unique string from last week). But how do I compute the set complement, i.e. Group B - Group A?

 

Simon Gao

Official comment

  • Joel Stewart

    To get the set complement, you can use the Join functionality along with a filter:

    1. Perform a Right Outer Join of Group A to Group B (this preserves the entire set of Group B).
    2. Filter the joined sheet to remove the rows where a matching key from Group A was found for the Group B entry, keeping only the rows where the Group A key is empty.

    This will leave you with the set complement Group B - Group A.
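
For readers who want to see the logic outside the workbook UI, here is a minimal Python sketch of the same join-then-filter idea. It is only an illustration, not Datameer's implementation: the function names `right_outer_join` and `b_minus_a` are hypothetical, and it assumes each group is already a list of unique strings (for example, the output of a GROUPBY).

```python
def right_outer_join(group_a, group_b):
    """Right outer join of A to B on the string itself.

    Every row of B is kept. The first element of each pair is the
    matching key from A, or None when A has no such key.
    """
    a_keys = set(group_a)
    return [(b if b in a_keys else None, b) for b in group_b]


def b_minus_a(group_a, group_b):
    """Keep only the joined rows where no key from A was found."""
    joined = right_outer_join(group_a, group_b)
    return [b for a_key, b in joined if a_key is None]


# Hypothetical sample data: strings seen before last week vs. last week.
seen_before = ["alpha", "beta"]
seen_last_week = ["beta", "gamma", "delta"]
new_strings = b_minus_a(seen_before, seen_last_week)
print(new_strings)       # the strings that are new this week
print(len(new_strings))  # how many new strings were added
```

The count of the filtered rows is the "how many new strings" number asked for in the question.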

    Hope this helps!


6 comments

  • Simon Gao

    Thanks Joel.

     

    I followed the steps and I do see a smart sample of the result (pretty excited so far!), but when I run the job I get the following exception:

    ERROR [2016-03-25 22:06:15.952] [JobScheduler worker1-thread-5106] (DasJobCallable.java:135) - Job failed! Execution plan: null
    java.lang.NullPointerException: No sheet with ID 'ca30b021-a050-4ee3-8ea4-562ad3e4fb08' found.
    	at datameer.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
    	at datameer.dap.common.entity.WorkbookConfigurationImpl.getSheet(WorkbookConfigurationImpl.java:372)
    	at datameer.dap.common.job.WorkbookJob.exchangeWithSnapshotSheet(WorkbookJob.java:192)
    	at datameer.dap.common.job.WorkbookJob.compileWorkbook(WorkbookJob.java:162)
    	at datameer.dap.common.job.WorkbookJob.registerJobOperations(WorkbookJob.java:264)
    	at datameer.dap.common.job.DatameerJob.createExecutionPlan(DatameerJob.java:99)
    	at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:95)
    	at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:50)
    	at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:124)
    	at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:106)
    	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:100)
    	at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:106)
    	at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:40)
    	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    	at java.lang.Thread.run(Thread.java:662)

  • Joel Stewart

    The ID ca30b021-a050-4ee3-8ea4-562ad3e4fb08 is the ID of a sheet that the job references but that does not exist in its full data set. To check the sheet ID directly, you can download the Job Trace for the particular job and review the job-definition.json file inside the downloaded zip file. This will show which sheet name corresponds to this ID.

    Does this workbook use sheets from other workbooks as its source? If so, do the parent sheets still exist in full? This error commonly indicates that a sheet did not exist or was not saved in full for reference.
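
If reading job-definition.json by hand is tedious, a small script can search it for the ID. The sketch below is an assumption-heavy illustration: the layout of job-definition.json is not described in this thread, so the code does not rely on any particular schema and simply returns every JSON object that has the ID as one of its values. The function names and the zip path in the usage comment are hypothetical.

```python
import json
import zipfile


def find_objects_with_value(node, target):
    """Yield every JSON object (dict) that has `target` as a direct value."""
    if isinstance(node, dict):
        if target in node.values():
            yield node
        for value in node.values():
            yield from find_objects_with_value(value, target)
    elif isinstance(node, list):
        for item in node:
            yield from find_objects_with_value(item, target)


def lookup_sheet(trace_zip, sheet_id):
    """Open a downloaded Job Trace zip and return the objects mentioning sheet_id."""
    with zipfile.ZipFile(trace_zip) as trace:
        # Match by file name only; where the file sits inside the zip is not assumed.
        name = next(n for n in trace.namelist() if n.endswith("job-definition.json"))
        with trace.open(name) as fh:
            definition = json.load(fh)
    return list(find_objects_with_value(definition, sheet_id))


# Usage (hypothetical path to the downloaded trace):
# for obj in lookup_sheet("job-trace.zip", "ca30b021-a050-4ee3-8ea4-562ad3e4fb08"):
#     print(obj)
```

Whatever objects it prints should include the sheet name next to the ID, if the file stores the two together.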

     

  • Simon Gao

    Thanks, Joel. I was able to get around that problem by moving all the logic into one workbook. Just for future reference, may I ask what you mean by "exist in full"?

  • Joel Stewart

    I wanted to rule out the following situation:

    1. ParentWorkbook has sheets: ParentSheet1 and ParentSheet2 
    2. ParentSheet1 and ParentSheet2 are "Saved" sheets in ParentWorkbook
    3. ChildWorkbook uses ParentSheet1 as a source sheet
    4. ParentWorkbook's configuration changes and ParentSheet1 is no longer a "Saved" sheet. 

    In this situation, ChildWorkbook was set up to use ParentSheet1 while it was a saved sheet, but the configuration has since changed and ParentSheet1 is no longer saved.

    Does that clarify what I meant by "still exist in full"?

     
