Using Hadoop in DataCleaner desktop

Within DataCleaner desktop you can use CSV datastores on HDFS. The commercial editions of DataCleaner also allow you to run jobs on a Hadoop Cluster from DataCleaner desktop.

Configure Hadoop clusters

To execute jobs from DataCleaner desktop on a Hadoop cluster, you have a number of configuration options, which are managed in the Hadoop clusters tab of the Options dialog.

  1. Default

    By default, DataCleaner uses the HADOOP_CONF_DIR and YARN_CONF_DIR environment variables to determine the location of Hadoop/YARN configuration files such as core-site.xml and yarn-site.xml.

  2. Using configuration directory

    By clicking the Add Hadoop cluster button and then selecting the Using configuration directory option, you can register additional Hadoop clusters by adding locations which contain Hadoop/YARN configuration files.

  3. Using direct namenode connection

    By clicking the Add Hadoop cluster button and then selecting the Using direct namenode connection option, you can register additional Hadoop clusters using their file system URI (e.g. hdfs://bigdatavm:9000/). Both connection styles are illustrated in the sketch at the end of this section.

If you have added additional Hadoop clusters, a dialog opens first when selecting a file on HDFS, in which you can choose the Hadoop cluster from which you want to select the file.
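
For readers who want to see what these registration options correspond to at the Hadoop API level, the following is a minimal sketch using the standard Hadoop client library; it is not DataCleaner's internal code, and the class name is hypothetical. The configuration directory is taken from HADOOP_CONF_DIR and the namenode URI from the example above; adjust both to your environment.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HadoopClusterConnectionSketch {

        public static void main(String[] args) throws Exception {
            // Configuration directory style: resolve the cluster from the
            // directory pointed to by HADOOP_CONF_DIR (assumed to be set).
            String confDir = System.getenv("HADOOP_CONF_DIR");
            Configuration conf = new Configuration();
            conf.addResource(new Path(confDir, "core-site.xml"));
            conf.addResource(new Path(confDir, "yarn-site.xml"));
            FileSystem fromConfDir = FileSystem.get(conf);

            // Direct namenode style: connect via the file system URI.
            FileSystem direct = FileSystem.get(
                    URI.create("hdfs://bigdatavm:9000/"), new Configuration());

            System.out.println("Default FS from conf dir: " + fromConfDir.getUri());
            System.out.println("Direct namenode FS:       " + direct.getUri());
        }
    }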

CSV datastores on HDFS

When registering a CSV datastore you have the option to select "hdfs" as the scheme for the source of the CSV. In the path field you can fill in either an absolute path, including the scheme, e.g. hdfs://bigdatavm:9000/datacleaner/customers.csv, or the relative path to a file on HDFS, e.g. /datacleaner/customers.csv. Note that a relative path only works when you have set the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables (see Setting up Spark and DataCleaner environment).
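
The same two path forms can also be used when registering datastores programmatically. The sketch below assumes the org.datacleaner.connection.CsvDatastore class and its (name, filename) constructor from the DataCleaner Java API; verify both against the javadocs of your DataCleaner version before relying on them.

    import org.datacleaner.connection.CsvDatastore;

    public class HdfsCsvDatastoreSketch {

        public static void main(String[] args) {
            // Absolute path, including the hdfs scheme and namenode address.
            CsvDatastore absolute = new CsvDatastore("customers_absolute",
                    "hdfs://bigdatavm:9000/datacleaner/customers.csv");

            // Relative path, resolved against the cluster configured via
            // HADOOP_CONF_DIR / YARN_CONF_DIR (see the note above).
            CsvDatastore relative = new CsvDatastore("customers_relative",
                    "/datacleaner/customers.csv");

            System.out.println(absolute.getName() + " / " + relative.getName());
        }
    }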

Running jobs on a Hadoop Cluster

To execute jobs from DataCleaner desktop on a Hadoop cluster, you have to set the HADOOP_CONF_DIR or YARN_CONF_DIR environment variables (see Setting up Spark and DataCleaner environment), and you also have to set the SPARK_HOME environment variable, which points to your Apache Spark installation. DataCleaner works with the Apache Spark 1.6 releases, which can be downloaded from the Apache Spark website. Now you can run DataCleaner jobs on your Hadoop cluster by using the Run on Hadoop cluster option.
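
As a quick sanity check before using the Run on Hadoop cluster option, you can verify that the relevant environment variables are visible to the process. The helper below is purely illustrative and hypothetical, not part of DataCleaner itself.

    public class HadoopEnvPreflightCheck {

        public static void main(String[] args) {
            // At least one of these must point to the directory holding
            // core-site.xml and yarn-site.xml.
            String hadoopConf = System.getenv("HADOOP_CONF_DIR");
            String yarnConf = System.getenv("YARN_CONF_DIR");
            // SPARK_HOME must point to an Apache Spark 1.6 installation.
            String sparkHome = System.getenv("SPARK_HOME");

            if (hadoopConf == null && yarnConf == null) {
                System.err.println("Set HADOOP_CONF_DIR or YARN_CONF_DIR.");
            }
            if (sparkHome == null) {
                System.err.println("Set SPARK_HOME to your Apache Spark location.");
            }
            if ((hadoopConf != null || yarnConf != null) && sparkHome != null) {
                System.out.println("Environment looks ready for Run on Hadoop cluster.");
            }
        }
    }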