Cluster configuration (distributed execution)

DataCleaner monitor allows jobs to be executed across a cluster of machines - essentially increasing fault tolerance and performance by adding more machines instead of having to upgrade the hardware of a single machine.

When executing a distributed job, DataCleaner will initially estimate how many records need to be processed. Based on this estimate, the records are divided into a number of "chunks", which are assigned for execution on different slave execution nodes. For example, a job covering roughly one million records on the two-node cluster shown below might be split into two chunks of about 500,000 records each, processed in parallel. After execution, the master node collects the results from the slave nodes and combines them into a single result report.

The configuration of DataCleaner's cluster is handled through the file WEB-INF/classes/context/cluster-context.xml within the deployed web archive folder. By default it defines this <bean> element:

			<bean id="clusterManagerFactory" class="org.datacleaner.monitor.cluster.HttpClusterManagerFactory">
			  <property name="username" value="admin" />
			  <property name="password" value="admin" />
			  <property name="slaveServerUrls">
			    <list>
			      <value>http://localhost:8080/DataCleaner-monitor</value>
			      <value>http://localhost:9090/DataCleaner-monitor</value>
			    </list>
			  </property>
			</bean> 

The above definition states that the cluster has two slave execution nodes. In this example they are referenced through 'localhost' URLs, but you can use any hostnames that the master node can reach.
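As a sketch of a larger setup, the list of slave server URLs can simply be extended. The hostnames and ports below are placeholders for illustration; substitute the actual addresses of your DataCleaner-monitor slave installations:

			<bean id="clusterManagerFactory" class="org.datacleaner.monitor.cluster.HttpClusterManagerFactory">
			  <property name="username" value="admin" />
			  <property name="password" value="admin" />
			  <property name="slaveServerUrls">
			    <list>
			      <value>http://slave1.example.com:8080/DataCleaner-monitor</value>
			      <value>http://slave2.example.com:8080/DataCleaner-monitor</value>
			      <value>http://slave3.example.com:8080/DataCleaner-monitor</value>
			      <value>http://slave4.example.com:8080/DataCleaner-monitor</value>
			    </list>
			  </property>
			</bean> 

The username and password are presumably the credentials the master uses when contacting the slave servers, so they should correspond to a user account that exists on each slave node.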

To enable clustered execution of a job, open its .schedule.xml file in the 'jobs' folder of the repository. In this XML file you will find a <distributed-execution> element, which determines whether the job is executed locally or distributed across the cluster. For example, the file 'Customer completeness.schedule.xml' starts like this:

			<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
			<schedule xmlns="http://eobjects.org/datacleaner/schedule/1.0"
			  xmlns:ns2="http://eobjects.org/datacleaner/shared/1.0"
			  xmlns:ns3="http://eobjects.org/datacleaner/timeline/1.0"
			  xmlns:ns4="http://eobjects.org/datacleaner/execution-log/1.0">
			  <cron-expression>@daily</cron-expression>
			  <distributed-execution>false</distributed-execution>
			  <alerts>
			    ...
			  </alerts>
			</schedule> 

Changing this value to 'true' makes DataCleaner monitor use the cluster configuration when executing the job.
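After the change, the element would simply read:

			<distributed-execution>true</distributed-execution> 

The next run of the job will then be distributed across the slave nodes defined in cluster-context.xml.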

Tip

The enterprise edition of DataCleaner also includes other mechanisms of communication between cluster nodes. One shortcoming of the above approach is that it is not tolerant of network issues or crashing nodes. Consider DataCleaner enterprise edition for such deployments, since it supports elastic clusters without requiring the master to be aware of each individual node.