EMR
abk edited this page Jul 27, 2020 · 1 revision
EMR (Elastic MapReduce).
- MapReduce itself is now a largely obsolete part of EMR.
- Managed Hadoop framework running on EC2 instances.
- Includes Spark, HBase, Presto, Flink, Hive and more.
- EMR Notebooks (query data interactively).
- Several integration points with AWS (it's a lot like Cloudera).
- EMR cluster
- A set of EC2 instances, each called a node.
- Master node (at least 1): manages the cluster (1 EC2 instance).
- Core node: hosts HDFS data and runs tasks. Can be scaled up and down, but with some risk: data is stored on core nodes, so removing one may lose data.
- Task node: runs tasks, does not host data.
- No risk of data loss when removing.
- Good use of Spot Instances.
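The three node roles above map onto instance groups when a cluster is defined. A minimal sketch, assuming the dictionary shape that boto3's `run_job_flow` accepts under `Instances["InstanceGroups"]`; the helper name, instance type, and counts are hypothetical placeholders, and nothing here calls AWS:

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Describe one master group, one core group and an optional task group."""
    groups = [
        # Exactly one master node manages the cluster.
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data, so they stay on On-Demand capacity here.
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes hold no data, so losing a Spot Instance costs only compute.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count}
        )
    return groups


for group in build_instance_groups(core_count=2, task_count=4):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

Putting only the task group on Spot is one possible choice that follows from the notes above: core nodes carry HDFS data, task nodes do not.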
- EMR usage
- Transient cluster vs. long-running cluster
- Can spin up task nodes on Spot Instances for temporary capacity.
- Can use Reserved Instances on a long-running cluster to save money.
- Connect directly to the master node to run jobs.
- Submit ordered steps via the console.
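The difference between a transient and a long-running cluster, and the idea of ordered steps, can be sketched as a request fragment. This assumes the `Steps` and `KeepJobFlowAliveWhenNoSteps` fields of boto3's `run_job_flow`; the helper name, cluster name, and S3 path are hypothetical, and a real request needs further fields (release label, IAM roles, instance groups):

```python
def build_cluster_request(name, steps, transient):
    """Build a partial cluster request; steps run in the order given."""
    return {
        "Name": name,
        "Instances": {
            # A transient cluster shuts down once its last step finishes.
            "KeepJobFlowAliveWhenNoSteps": not transient,
        },
        "Steps": [
            {
                "Name": step_name,
                # A failed step ends a transient cluster; a long-running
                # cluster cancels the remaining steps and keeps waiting.
                "ActionOnFailure": "TERMINATE_CLUSTER" if transient else "CANCEL_AND_WAIT",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(args)},
            }
            for step_name, args in steps
        ],
    }


request = build_cluster_request(
    "nightly-report",  # hypothetical cluster name
    [("run-spark-job", ["spark-submit", "s3://example-bucket/job.py"])],
    transient=True,
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```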
- EMR / AWS integration
- EC2 for the instances that make up the nodes in the cluster.
- VPC to configure the virtual network in which you launch instances.
- S3 to store input and output data.
- CloudWatch to monitor cluster performance and configure alarms.
- IAM to configure permissions
- CloudTrail to audit requests
- Data Pipeline to schedule and start your cluster.
- EMR storage
- HDFS - Hadoop Distributed File System. A scalable file system spread across the cluster's EC2 instances.
- Files in HDFS are split into blocks of 128 MB (the default size), so a big file is stored as many 128 MB blocks. HDFS is ephemeral: the data is lost when the cluster terminates. It's useful for caching or for workloads using random I/O.
- EMRFS - Creates a file system like HDFS, but on top of S3.
- S3 is used to store input and output data.
- Offers consistent view, an optional feature for S3 consistency (EMR 3.2.1 and later). EMR uses DynamoDB to track consistency, and that does have some capacity limits.
- Local file system.
- EBS for HDFS (automatically a 10 GB SSD volume, for enhanced performance).
- Can attach more EBS volumes.
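The block arithmetic for HDFS can be worked through in a few lines. A minimal sketch, assuming the 128 MB block size from the notes above (read as 128 MiB); the function name is hypothetical:

```python
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # 128 MiB block size


def hdfs_block_count(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Return how many blocks a file of the given size occupies."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division: the last block may be only partly filled.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


print(hdfs_block_count(1024 * 1024 * 1024))  # 1 GiB file → 8
print(hdfs_block_count(300 * 1024 * 1024))   # 300 MiB file → 3
```

The 300 MiB file ends up as two full blocks plus one 44 MiB block.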
- EMR promises
- EMR is not serverless: you pay for the time the cluster is running, whether or not it is in use.
- Provisions new nodes if core node fails.
- Can add and remove task nodes on the fly.
- Can resize a running cluster's core nodes.
- Hadoop (what is it?)
- Hadoop common
- MapReduce - Software framework for easily writing apps that process data in parallel. Map functions and reduce functions together produce the output files, and the work can be parallelized across the cluster.
- YARN - Yet Another Resource Negotiator. Manages cluster resources: what gets run where.
- HDFS - Distributed scalable FS for Hadoop.
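The map and reduce functions described above can be illustrated with a word count. This is a single-process sketch in plain Python, not the Hadoop API; all function names are hypothetical, and on a real cluster the map and reduce calls would be spread across nodes:

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce phase: collapse all the values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["spark hive spark", "hive presto"]))
# → {'spark': 2, 'hive': 2, 'presto': 1}
```

Each `map_words` call depends only on its own line and each `reduce_counts` call only on its own key, which is what lets the work be parallelized.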