What Is Datameer and How Does It Integrate Within the Hadoop Framework?


Datameer acts as a job compiler or code generator, much like Hive. Every function, filter, or join that the user designs in the spreadsheet is translated into native MapReduce (MR) code, a model well suited to splitting workloads into smaller pieces. Datameer compiles the job and sends it to a Hadoop cluster for execution. Jobs may also run on execution frameworks other than basic MR, such as Tez or Spark. Once the job has been compiled and sent to the cluster, Datameer has little control over it and only receives the telemetry metrics provided by the cluster's services. The job runs under the cluster's scheduling settings and uses the resources granted by the scheduler.
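
To make the compilation idea more concrete, the following is a minimal sketch in plain Python of how a spreadsheet filter followed by a grouped sum could be expressed as map and reduce phases. It is not the code Datameer generates (that is Java running on the cluster), and it does not use any Hadoop API. The function names map_phase, shuffle, reduce_phase, and run_job, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows, standing in for a worksheet with two columns.
ROWS = [
    {"country": "DE", "amount": 10},
    {"country": "US", "amount": 5},
    {"country": "DE", "amount": -3},
    {"country": "US", "amount": 7},
]


def map_phase(rows):
    """Apply the filter (amount > 0) and emit (key, value) pairs."""
    for row in rows:
        if row["amount"] > 0:
            yield row["country"], row["amount"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each group; here the aggregation is a sum."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(rows):
    """Chain the three phases, as a single-process stand-in for a cluster job."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(run_job(ROWS))
```

On a real cluster the map and reduce phases run in parallel in separate containers, which is why the design-time view in the spreadsheet and the runtime behavior on the cluster can differ in resource usage.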

Even if (business) users are not aware of it from the Excel-like user interface (UI), they are generating a Java program for distributed computing on the cluster backend. This high level of abstraction is what makes Datameer stand out. However, it does not remove the problems every programmer has to deal with, e.g. data types, memory, and disk usage.

This means there are two main stages of work: first, the design/edit time, and second, the execution/runtime of a data link, import job, or workbook. The two stages take place on different parts of your distributed computing system (cluster).



The first stage is served by the Datameer application server, which runs the Datameer service Java Virtual Machine (JVM), started and executed under the Datameer service account user. If (Secure) Impersonation is not configured, calls are made from <datameerServiceAccountUser>@<datameerHost>; if it is configured, calls are made from <loggedinUser>@<datameerHost>.


The second stage is served on arbitrary DataNodes (DN) in the cluster. Each DN involved runs the container JVM, started by the ApplicationMaster (AM) and executed under the YARN service account user. If (Secure) Impersonation is not configured, calls are made from <yarnServiceAccountUser>@<dataNode>; if it is configured, calls are made from <impersonatedUser>@<dataNode>.
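
The calling identities of the two stages can be summarized in a small lookup. The sketch below only restates the rule described above; it is not Datameer or Hadoop code. The function effective_caller, its parameters, and the example user and host names are hypothetical.

```python
def effective_caller(stage, impersonation, service_user, end_user, host):
    """Return 'user@host' for a call made in the given stage.

    stage:         "design" (Datameer application server) or
                   "execution" (container on a cluster node).
    impersonation: True if (Secure) Impersonation is configured.
    service_user:  Datameer service account (design stage) or
                   YARN service account (execution stage).
    end_user:      logged-in user (design stage) or
                   impersonated user (execution stage).
    host:          Datameer host (design stage) or DataNode (execution stage).
    """
    if stage not in ("design", "execution"):
        raise ValueError(f"unknown stage: {stage!r}")
    # With impersonation the end user's identity is used; otherwise the
    # service account that started the JVM makes the call.
    user = end_user if impersonation else service_user
    return f"{user}@{host}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(effective_caller("design", False, "datameer", "alice", "dm-host"))
    print(effective_caller("execution", True, "yarn", "alice", "datanode-07"))
```

Knowing which identity applies in which stage helps when reading permission errors: a failure at design time points at the Datameer host and its user, while a failure at runtime points at a cluster node and the YARN or impersonated user.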


Since this description stays at a very high level, further reading on the technology and framework is recommended, for example Hadoop: The Definitive Guide, 4th Ed., by Tom White, and Hadoop Security, 1st Ed., by Ben Spivey and Joey Echeverria.