Using Hadoop in DataCleaner monitor

Within DataCleaner monitor you can use CSV datastores located on HDFS, and you can run jobs on a Hadoop cluster directly from DataCleaner monitor.

CSV datastores on HDFS

When registering a CSV datastore, you can indicate that it is located on a server / Hadoop cluster and then select a path on the Hadoop cluster. In the path field you can fill in either an absolute path, including the scheme, e.g. hdfs://bigdatavm:9000/datacleaner/customers.csv, or the relative path to a file on HDFS, e.g. /datacleaner/customers.csv. Note that a relative path only works when you have set the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable (see Setting up Spark and DataCleaner environment).
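To illustrate the difference between the two path styles, here is a minimal Java sketch of how such a path could be resolved. This is not DataCleaner's actual implementation; the resolveHdfsUri helper and the bigdatavm:9000 namenode address are illustrative assumptions. In a real setup, the default file system would come from the configuration found via HADOOP_CONF_DIR / YARN_CONF_DIR (fs.defaultFS in core-site.xml).

// Illustrative sketch only: distinguishes an absolute "hdfs://" path
// from a relative HDFS path. The helper name and the default namenode
// address are assumptions, not DataCleaner internals.
import java.net.URI;

public class HdfsPathExample {

    // Absolute paths (with a scheme) are used as-is; relative paths are
    // resolved against the cluster's default file system.
    static URI resolveHdfsUri(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // e.g. hdfs://bigdatavm:9000/datacleaner/customers.csv
        }
        return defaultFs.resolve(path); // e.g. /datacleaner/customers.csv
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://bigdatavm:9000/"); // assumed namenode
        System.out.println(resolveHdfsUri("hdfs://bigdatavm:9000/datacleaner/customers.csv", defaultFs));
        System.out.println(resolveHdfsUri("/datacleaner/customers.csv", defaultFs));
    }
}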

Running jobs on a Hadoop cluster

To execute jobs from DataCleaner monitor on a Hadoop cluster, you have to set the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable (see Setting up Spark and DataCleaner environment), and you also have to set the SPARK_HOME environment variable, which points to your Apache Spark installation. DataCleaner works with the Apache Spark 1.6 releases, which can be downloaded from the Apache Spark website. You can then run DataCleaner jobs on your Hadoop cluster by opening the schedule dialog and checking the Run on Hadoop cluster option.
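As a quick sanity check before scheduling a job, you can verify that the required environment variables are visible to the process running DataCleaner monitor. The following is a minimal, generic Java sketch that only inspects the variables named above; it is not part of DataCleaner itself.

// Minimal sanity check (not part of DataCleaner): verifies that the
// environment variables required for running jobs on Hadoop are set.
public class HadoopEnvCheck {
    public static void main(String[] args) {
        String hadoopConf = System.getenv("HADOOP_CONF_DIR");
        String yarnConf = System.getenv("YARN_CONF_DIR");
        String sparkHome = System.getenv("SPARK_HOME");

        if (hadoopConf == null && yarnConf == null) {
            System.err.println("Set HADOOP_CONF_DIR or YARN_CONF_DIR to your Hadoop configuration directory.");
        } else {
            System.out.println("Hadoop configuration: " + (hadoopConf != null ? hadoopConf : yarnConf));
        }

        if (sparkHome == null) {
            System.err.println("Set SPARK_HOME to your Apache Spark installation directory.");
        } else {
            System.out.println("Spark home: " + sparkHome);
        }
    }
}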