

AWS Glue

  • Serverless discovery and definition of table definitions and schemas.
    • S3 "data lakes"
    • RDS
    • RedShift
    • Most other SQL databases
  • Custom ETL jobs
    • Trigger-driven or on a schedule

  • Components of AWS Glue
    • Glue Crawler / Data Catalog
      • Can run periodically and scan S3.
      • It will extract partitions based on how your S3 data is organized.
      • Think up front about how you will query your data lake in S3.
      • Example: devices send sensor data every hour.
        • Do you query primarily by time ranges?
          • If so, organize your buckets by yyyy/mm/dd/device-id
        • Do you query primarily by device?
          • If so, organize your buckets by device-id/yyyy/mm/dd

    • The Glue Data Catalog only stores metadata (table definitions, schema, etc.); the underlying data stays in S3. A crawler-creation sketch follows below.
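
A minimal boto3 sketch of creating and starting such a crawler. The role ARN, database name, and bucket are hypothetical placeholders for the sensor example above:

```python
import boto3

glue = boto3.client("glue")

# All names and ARNs below are placeholders.
glue.create_crawler(
    Name="sensor-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sensor_db",
    # Keys laid out as s3://my-sensor-bucket/yyyy/mm/dd/device-id/...,
    # so the crawler infers time-based partitions.
    Targets={"S3Targets": [{"Path": "s3://my-sensor-bucket/"}]},
    Schedule="cron(0 1 * * ? *)",  # optional: rescan daily at 01:00 UTC
)
glue.start_crawler(Name="sensor-data-crawler")
```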

  • Glue + Hive
    • Hive is a service that runs on EMR and lets you run SQL-like queries (HiveQL).
    • The Glue Data Catalog can be read by Hive, serving as its metastore; one way to wire this up is sketched below.
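
A boto3 sketch of launching an EMR cluster whose Hive metastore points at the Glue Data Catalog. The cluster name, instance settings, and roles are placeholders; the `hive-site` classification shown is the documented way to enable this:

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="hive-on-glue-catalog",  # placeholder cluster name
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            # Point Hive's metastore at the Glue Data Catalog.
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory",
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```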

  • Glue ETL
    • Automatic code generation
    • Scala or Python
    • Encryption
      • Server side (at rest)
      • SSL in transit
    • Can be event driven
    • Can provision additional DPUs (data processing units) to increase the performance of the underlying Spark jobs.
    • Errors are reported via CloudWatch.
    • Transform, clean, and enrich data (before analysis); the sketch after this list shows several of these transforms.
      • Bundled transformations
        • DropFields, DropNullFields
        • Filter - specify a function to filter records (e.g., remove outliers).
        • Join - to enrich data.
        • Map - add fields, delete fields, perform external lookups.
      • Machine Learning Transformations
        • FindMatches ML: identify duplicate or matching records.
      • Format conversions: CSV, JSON, Avro, Parquet, ORC, XML.
      • Apache Spark transformations (K-means).
      • Develop ETL scripts using a notebook.
        • Then create an ETL job that runs your script (using Spark and Glue).
        • The endpoint is in a VPC controlled by a security group; connect via:
          • Apache Zeppelin on your local machine
          • Zeppelin notebook server on EC2.
          • SageMaker notebook
          • Terminal Window
          • PyCharm Professional Edition.
          • Use an Elastic IP to access a private endpoint.
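
A short Glue ETL sketch using the awsglue library's bundled transforms. The database, table, and column names are hypothetical, continuing the sensor example above:

```python
from awsglue.context import GlueContext
from awsglue.transforms import DropFields, Filter, Join, Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Tables assumed to have been created by the crawler sketched earlier.
readings = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings")
devices = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="devices")

# DropFields: remove columns not needed downstream.
readings = DropFields.apply(frame=readings, paths=["debug_blob"])

# Filter: drop outlier records with a predicate function.
readings = Filter.apply(frame=readings,
                        f=lambda row: 0 <= row["temperature"] <= 150)

# Join: enrich readings with device metadata.
enriched = Join.apply(readings, devices, "device_id", "device_id")

# Map: add a derived field to every record.
def add_fahrenheit(row):
    row["temp_f"] = row["temperature"] * 9 / 5 + 32
    return row

enriched = Map.apply(frame=enriched, f=add_fahrenheit)

# Format conversion on write: store the result as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=enriched, connection_type="s3",
    connection_options={"path": "s3://my-sensor-bucket/enriched/"},
    format="parquet")
```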
    • Other salient features.
      • Generate ETL code in Python or Scala. You can modify the code.
      • Can provide your own Spark or PySpark scripts
      • Output/target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog.
      • Fully managed and cost-effective; pay only for the resources consumed.
      • Jobs run on a serverless Spark platform.
      • Glue scheduler to schedule jobs.
      • Glue triggers to automate job runs; a job-plus-trigger sketch follows this list.
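
A boto3 sketch of creating a job with extra DPU capacity and scheduling it with a cron-style trigger. The job name, script location, and role are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="enrich-sensor-data",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-scripts/enrich_sensor_data.py",
             "PythonVersion": "3"},
    # Provision extra DPUs to speed up the underlying Spark job.
    MaxCapacity=10.0,
    # SecurityConfiguration="my-sec-config",  # hypothetical: at-rest encryption
)

# Schedule the job with a Glue trigger (here, daily at 02:00 UTC).
glue.create_trigger(
    Name="nightly-enrich",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "enrich-sensor-data"}],
    StartOnCreation=True,
)
```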
    • Running Glue jobs
      • Time-based schedules (cron-style)
      • Job bookmarks (see the sketch after this list).
        • Persists state from the job run
        • Prevents reprocessing old data.
        • Allows you to process new data when re-running on schedule.
        • Works with S3
        • Works with relational databases via JDBC.
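
A minimal sketch of what a bookmark-aware Glue script looks like. The job must also be started with `--job-bookmark-option job-bookmark-enable`, and each bookmarked source needs a `transformation_ctx`; names here are hypothetical:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Job.init/commit plus transformation_ctx are what let the bookmark
# persist state between runs.
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

readings = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings",
    transformation_ctx="readings")  # bookmarked source

# ... transforms and writes go here ...

job.commit()  # records the bookmark so old data is not reprocessed
```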
      • CloudWatch events
        • Fire off a Lambda function or an SNS notification when an ETL job succeeds or fails.
        • Invoke an EC2 run, send an event to Kinesis, or activate a Step Function (see the sketch below).
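
A boto3 sketch of a CloudWatch Events rule that matches Glue job state changes and forwards them to an SNS topic. The rule name and topic ARN are placeholders; a Lambda ARN would work the same way:

```python
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="glue-etl-finished",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED", "FAILED"]},
    }),
)

# Send matching events to an SNS topic.
events.put_targets(
    Rule="glue-etl-finished",
    Targets=[{"Id": "notify",
              "Arn": "arn:aws:sns:us-east-1:123456789012:etl-status"}],
)
```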
  • Glue cost model
    • Billed by the minute for crawlers and ETL jobs.
    • First 1M objects stored and 1M accesses are free for the Glue Data Catalog.
    • Development endpoints for developing ETL code are also charged by the minute.
  • Glue Anti patterns
    • Not recommended for streaming data (Glue is batch-oriented; minimum 5-minute intervals).
    • Using multiple ETL engines (Glue ETL is Spark-based; for other engines such as Hive or Pig, consider EMR or Data Pipeline instead).
    • No support for NoSQL databases.