- 
                Notifications
    You must be signed in to change notification settings 
- Fork 28
test
        Ondřej Moravčík edited this page Mar 14, 2015 
        ·
        6 revisions
      
    More informations
- Wiki
- ruby-doc
Add this line to your application's Gemfile:
gem 'ruby-spark'And then execute:
$ bundle
Or install it yourself as:
$ gem install ruby-spark
To install latest supported Spark. Project is build by SBT.
$ ruby-spark build
You can use Ruby Spark via interactive shell
$ ruby-spark pry
Or on existing project
require 'ruby-spark'
Spark.start
Spark.sc # => contextIf you want configure Spark first. See configurations for more details.
require 'ruby-spark'
Spark.load_lib(spark_home)
Spark.config do
   set_app_name "RubySpark"
   set 'spark.ruby.batch_size', 100
   set 'spark.ruby.serializer', 'oj'
end
Spark.start
Spark.sc # => contextSingle file
$sc.text_file(FILE, workers_num, custom_options)All files on directory
$sc.whole_text_files(DIRECTORY, workers_num, custom_options)Direct
$sc.parallelize([1,2,3,4,5], workers_num, custom_options)
$sc.parallelize(1..5, workers_num, custom_options)- workers_num
- 
    Min count of works computing this task.
 (This value can be overwriten by spark)
- custom_options
- 
    serializer: name of serializator used for this RDD
 batch_size: see configuration
 
 (Available only for parallelize)
 use: direct (upload direct to java), file (upload throught a file)
Sum of numbers
$sc.parallelize(0..10).sum
# => 55Words count using methods
rdd = $sc.text_file(PATH)
rdd = rdd.flat_map(lambda{|line| line.split})
         .map(lambda{|word| [word, 1]})
         .reduce_by_key(lambda{|a, b| a+b})
rdd.collect_as_hashEstimating pi with a custom serializer
slices = 3
n = 100000 * slices
def map(_)
  x = rand * 2 - 1
  y = rand * 2 - 1
  if x**2 + y**2 < 1
    return 1
  else
    return 0
  end
end
rdd = Spark.context.parallelize(1..n, slices, serializer: 'oj')
rdd = rdd.map(method(:map))
puts 'Pi is roughly %f' % (4.0 * rdd.sum / n)