Simple Example
Our examples are built and tested on Cloudera cdh5.0.0. Spark and Hadoop are installed and set up on our cluster using Cloudera Manager. We recommend using the Cloudera distribution of Spark and Hadoop to simplify cluster management, but any compatible versions of Spark and Hadoop should work.
To build Spark for other Hadoop versions, see the Spark documentation.
If you use a different version of Spark or Hadoop, you will have to modify the build.gradle script accordingly. Depending on your version of Spark, you may also need to include a dependency on hadoop-client.
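For illustration, a minimal sketch of the relevant build.gradle dependency block. The coordinates and versions below are placeholders, not the project's actual settings, and must be matched to whatever Spark and Hadoop your cluster runs:

```
dependencies {
    // Match the Spark version (and Scala suffix) to your cluster's installation
    compile 'org.apache.spark:spark-core_2.10:0.9.0'
    // Some Spark builds do not bundle the Hadoop client; add it explicitly if yours doesn't
    compile 'org.apache.hadoop:hadoop-client:2.3.0'
}
```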
- Build the project with Gradle
gradle clean dist
- Run the training phase to pre-process the input vectors and cache the generated projections and centroids
./training.sh example/training.properties
- Run the bulk mode to correlate every vector against every other vector in the system.
./run_bulk.sh example/run.properties
- Results are stored in the 'output' folder
- You can also run the interactive example
./run_interactive example/run.properties
- To remove any cached centroids / projections, clean the local directory
./clean.sh
Running on Your Own Cluster
- Ensure the build.gradle file is set up to use the version of Spark running on your cluster (see above)
- Build the project
gradle clean dist
- Make a local directory for your cluster configuration
cp -r example mycluster
- Move your data to a location on HDFS. If your data is small you can still run on local files, but this example assumes you want to use a distributed file system.
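For example, using the standard HDFS shell (the paths are placeholders for your own):

```
# Create a directory in HDFS and copy the local input vectors into it
hdfs dfs -mkdir -p /user/myuser/correlation/input
hdfs dfs -put /local/path/to/vectors /user/myuser/correlation/input/
```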
- Edit mycluster/training.properties:
  - Set the master URI for your cluster: master_uri=spark://mymasternode:7077
  - Ensure SPARK_HOME is set correctly for your cluster (the default is set up for Cloudera cdh5.0.0-beta-2).
  - Set inputPath to your data's location in HDFS (example: inputPath=hdfs:/// ).
  - Set the output locations to point to HDFS (see the full sketch after this list):
    centroid_dir=hdfs://<namenode>/<path>/generated_centroids
    projection_dir=hdfs://<namenode>/<path>/generated_projections
    training_matrix_path=hdfs://<namenode>/<path>/training_matrix_mapping_v2
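Taken together, a training.properties for a hypothetical cluster might look like the sketch below. Host names and paths are placeholders, and your copy of the file may carry additional settings that should be left in place:

```
master_uri=spark://mymasternode:7077
# Placeholder; point SPARK_HOME at your cluster's Spark installation
SPARK_HOME=/usr/lib/spark
inputPath=hdfs://mynamenode/user/myuser/correlation/input
centroid_dir=hdfs://mynamenode/user/myuser/correlation/generated_centroids
projection_dir=hdfs://mynamenode/user/myuser/correlation/generated_projections
training_matrix_path=hdfs://mynamenode/user/myuser/correlation/training_matrix_mapping_v2
```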
- Edit mycluster/run.properties:
  - Set the master URI for your cluster: master_uri=spark://mymasternode:7077
  - Ensure SPARK_HOME is set correctly for your cluster (the default is set up for Cloudera cdh5.0.0-beta-2).
  - Set original_data_path to the location of your data in HDFS (example: original_data_path=hdfs:/// ).
  - Set the output path to a location in HDFS.
  - Set centroid_dir, projection_dir, and training_matrix_path to the same values as in your training.properties file (see the sketch after this list).
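And a matching run.properties sketch under the same placeholder cluster. The output path is left as a comment because its exact key name comes from the shipped example file:

```
master_uri=spark://mymasternode:7077
SPARK_HOME=/usr/lib/spark
original_data_path=hdfs://mynamenode/user/myuser/correlation/input
# Set the output path key from the example file to an HDFS location here
# These three must match training.properties exactly:
centroid_dir=hdfs://mynamenode/user/myuser/correlation/generated_centroids
projection_dir=hdfs://mynamenode/user/myuser/correlation/generated_projections
training_matrix_path=hdfs://mynamenode/user/myuser/correlation/training_matrix_mapping_v2
```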
- Run the training phase with your cluster configuration
./training.sh mycluster/training.properties
- Run the bulk mode to correlate every vector against every other vector in the system.
./run_bulk.sh mycluster/run.properties
- Results are stored in the output location you configured (the 'output' folder by default)
- You can also run the interactive example. Note: you'll have to enter your HDFS locations instead of the default local locations
./run_interactive mycluster/run.properties