Launching DataCleaner jobs using Spark

Go to the Spark installation path to run the job. Use the following command line template (the bracketed final argument, a .properties file, is optional):

			bin/spark-submit --class org.datacleaner.spark.Main --master yarn-cluster \
			  /path/to/DataCleaner-spark.jar /path/to/conf.xml /path/to/job_file.analysis.xml [/path/to/job.properties]

A convenient way to organize the invocation is in a shell script like the one below, where each individual argument can be edited on its own line:

			#!/bin/sh
			SPARK_MASTER=yarn-cluster
			DC_PRIMARY_JAR=/path/to/DataCleaner-spark.jar
			DC_EXTENSION_JARS=
			DC_CONF_FILE=/path/to/conf.xml
			DC_JOB_FILE=/path/to/job_file.analysis.xml
			DC_PROPS=/path/to/job.properties
			DC_COMMAND="bin/spark-submit"
			DC_COMMAND="$DC_COMMAND --class org.datacleaner.spark.Main"
			DC_COMMAND="$DC_COMMAND --master $SPARK_MASTER"
			echo "Using DataCleaner executable: $DC_PRIMARY_JAR"
			if [ "$DC_EXTENSION_JARS" != "" ]; then
			  echo "Adding extensions: $DC_EXTENSION_JARS"
			  DC_COMMAND="$DC_COMMAND --jars $DC_EXTENSION_JARS"
			fi
			DC_COMMAND="$DC_COMMAND $DC_PRIMARY_JAR $DC_CONF_FILE $DC_JOB_FILE $DC_PROPS"
			echo "Submitting DataCleaner job $DC_JOB_FILE to Spark $SPARK_MASTER"
			$DC_COMMAND

The example shows that a few more parameters are involved in invoking the job. Let's go through them:

  1. SPARK_MASTER specifies where the driver program runs; see the Hadoop deployment overview section.
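As an illustration, these are common Spark master values (the examples are assumed, not taken from this guide; verify them against your Spark version's documentation):

```shell
# Common --master values (illustrative):
SPARK_MASTER="yarn-cluster"    # driver runs inside the YARN cluster
# SPARK_MASTER="local[4]"      # run locally with 4 threads, handy for testing
echo "Will submit with --master $SPARK_MASTER"
```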

  2. DC_EXTENSION_JARS allows you to add extra JAR files containing DataCleaner extensions.
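For instance (the JAR paths here are hypothetical), multiple extension JARs are given as a comma-separated list, which the script passes along via spark-submit's --jars option:

```shell
# Hypothetical extension JARs; spark-submit's --jars option expects a
# comma-separated list:
DC_EXTENSION_JARS="/path/to/extension1.jar,/path/to/extension2.jar"
DC_COMMAND="bin/spark-submit --jars $DC_EXTENSION_JARS"
echo "$DC_COMMAND"
```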

  3. DC_PROPS is perhaps the most important one. It lets you supply a .properties file, which can be used for a number of things:

    1. The special property datacleaner.result.hdfs.path, which lets you specify the filename (on HDFS) where the analysis result (.analysis.result.dat) file is stored. It defaults to /datacleaner/results/[job name]-[timestamp].analysis.result.dat.

    2. The special property datacleaner.result.hdfs.enabled, which can be either 'true' (the default) or 'false'. Setting it to 'false' disables result gathering entirely, which significantly improves performance, but no analyzer results are gathered or written. This is therefore only relevant for ETL-style jobs whose purpose is to create, insert, update, or delete records in other datastores or files.

    3. Properties to override configuration defaults.

    4. Properties to set job variables/parameters.
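As a sketch, a DC_PROPS file combining these uses might look like the following. The datacleaner.result.hdfs.path and datacleaner.result.hdfs.enabled keys are the special properties described above; the configuration-override and job-variable entries are hypothetical examples whose exact key names depend on your configuration and job files:

```
# Store the analysis result at a fixed HDFS location instead of the
# timestamped default:
datacleaner.result.hdfs.path=/datacleaner/results/my_job.analysis.result.dat

# Keep result gathering enabled (the default); set to 'false' for
# ETL-style jobs where no analyzer results are needed:
datacleaner.result.hdfs.enabled=true

# Hypothetical examples of a configuration override and a job variable;
# the actual keys depend on your conf.xml and job file:
# datacleaner.some.configuration.default=false
# my.job.variable=some_value
```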