abk edited this page Jul 27, 2020 · 1 revision

EMR (Elastic MapReduce).

  • MapReduce is the obsolete part of EMR.
  • Managed Hadoop framework on EC2 instances
  • Includes Spark, HBase, Presto, Flink, Hive and more.
  • EMR notebooks (interactively query data)
  • Several integration points with AWS. (It's a lot like Cloudera.)
  • EMR cluster
    • Set of EC2 instances, each called a node.
    • Master node: manages the cluster (minimum of 1, a single EC2 instance).
    • Core node: hosts HDFS data and runs tasks.
      • Can be scaled up and down, but with some risk: data is stored on core nodes, so removing one may lose data.
    • Task node: runs tasks, does not host data.
      • No risk of data loss when removing.
      • Good fit for Spot Instances.
    • EMR usage
      • Transient cluster vs. long-running cluster
        • Can spin up task nodes using Spot Instances for temporary capacity.
        • Can use Reserved Instances on a long-running cluster to save money.
      • Connect directly to the master node to run jobs.
      • Submit ordered steps via the console.
    • EMR / AWS integration
      • EC2 for the instances that comprise the nodes in the cluster.
      • VPC to configure the virtual network in which you launch instances.
      • S3 to store input and output data.
      • CloudWatch to monitor cluster performance and configure alarms.
      • IAM to configure permissions.
      • CloudTrail to audit requests.
      • Data Pipeline to schedule and start your cluster.
    • EMR storage
      • HDFS - Hadoop Distributed File System. A scalable file system spread across the cluster's EC2 instances.
        • Files in HDFS are split into blocks of 128 MB, so a big file is stored as a series of 128 MB blocks.
        • HDFS is ephemeral: its data is lost when the cluster terminates. It's useful for caching or for workloads using random I/O.
      • EMRFS - Creates a file system like HDFS, but on top of S3.
        • S3 is used to store input and output data.
        • Offers a consistent view: an optional feature for S3 consistency (EMR 3.2.1).
        • EMR uses DynamoDB to track consistency. It does have some capacity limits.
      • Local File System.
      • EBS for HDFS (a 10 GB SSD volume is attached automatically for enhanced performance).
        • Can attach more EBS volumes.
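The block splitting described in the HDFS notes above is simple arithmetic. The following is only an illustrative sketch: the function name `hdfs_block_sizes` is hypothetical, and the 128 MB block size is taken from the notes rather than read from any cluster configuration.

```python
BLOCK_SIZE_MB = 128  # block size quoted in the notes above (an assumption here)


def hdfs_block_sizes(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb <= 0:
        return []
    # Number of full blocks, plus whatever is left over for the last block.
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 300 MB file becomes two full blocks and one partial block.
print(hdfs_block_sizes(300))  # [128, 128, 44]
```

The last block only takes as much space as the data left over, so a file slightly larger than 128 MB does not occupy two full blocks.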
  • EMR promises
    • EMR charges for cluster uptime at hourly rates (it is not serverless), whether or not the cluster is doing work.
    • Provisions new nodes if a core node fails.
    • Can add and remove task nodes on the fly.
    • Can resize a running cluster's core nodes.
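To make the node roles concrete, here is a rough sketch of a cluster request with one master node, on-demand core nodes, and Spot task nodes. It only builds a dictionary shaped like the parameters of boto3's `run_job_flow` and does not call AWS. The helper name `build_cluster_request`, the cluster name, the release label `emr-5.30.0`, the instance type `m5.xlarge`, and the node counts are all assumptions chosen for illustration.

```python
def build_cluster_request(name, core_count, task_count, long_running):
    """Build a request dict shaped like boto3's EMR run_job_flow parameters."""

    def group(role, market, count):
        # One instance group per node role; the instance type is a placeholder.
        return {
            "Name": f"{role.lower()} nodes",
            "InstanceRole": role,
            "Market": market,
            "InstanceType": "m5.xlarge",
            "InstanceCount": count,
        }

    groups = [
        group("MASTER", "ON_DEMAND", 1),         # manages the cluster
        group("CORE", "ON_DEMAND", core_count),  # hosts HDFS data, runs tasks
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so losing a Spot instance loses no data.
        groups.append(group("TASK", "SPOT", task_count))

    return {
        "Name": name,
        "ReleaseLabel": "emr-5.30.0",  # assumed release label
        "Instances": {
            "InstanceGroups": groups,
            # False gives a transient cluster that terminates after its steps.
            "KeepJobFlowAliveWhenNoSteps": long_running,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("notes-demo", core_count=2, task_count=4, long_running=False)
print([g["InstanceRole"] for g in request["Instances"]["InstanceGroups"]])
```

With credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`; real use would normally also need steps, networking, and logging settings.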
  • Hadoop (what is it?)
    • Hadoop common
      • MapReduce - Software framework for easily writing apps that process data in parallel. Map functions and reduce functions together produce the file output, and the work can be parallelized across the cluster.
      • YARN - Yet Another Resource Negotiator. Manages cluster resources: what gets run where.
      • HDFS - Distributed, scalable file system for Hadoop.
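The map and reduce functions mentioned in the MapReduce note above can be illustrated with the classic word count. This is a single-process sketch in plain Python, not Hadoop's API: the names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse all counts for one word into a total."""
    return word, sum(counts)


def word_count(lines):
    # Each line could be mapped independently, which is what allows parallelism.
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["spark hive spark", "Hive presto"]))
# {'spark': 2, 'hive': 2, 'presto': 1}
```

Because each map call looks at only one line and each reduce call at only one key, both phases can be spread across the cluster without coordination beyond the shuffle.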