

AWS Glue

  • Serverless discovery and definition of table definitions and schemas.
    • S3 "data lakes"
    • RDS
    • RedShift
    • Most other SQL databases
  • Custom ETL jobs
    • Trigger-driven or on a schedule

  • Components of AWS Glue
    • Glue Crawler / Data Catalog
      • Can run periodically and scan S3.
      • It will extract partitions based on how your S3 data is organized.
      • Think up front about how you will query your data lake in S3.
      • Example: devices send sensor data every hour.
        • Do you query primarily by time ranges?
          • If so, organize your buckets by yyyy/mm/dd/device-id
        • Do you query primarily by device?
          • If so, organize your buckets by device-id/yyyy/mm/dd

    • The Glue Data Catalog only stores metadata (table definitions, schema, etc.); the underlying data stays in S3. A crawler-creation sketch follows below.
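
A minimal boto3 sketch of creating and starting such a crawler. The role ARN, database name, and bucket are hypothetical placeholders for the sensor example above:

```python
import boto3

glue = boto3.client("glue")

# All names and ARNs below are placeholders.
glue.create_crawler(
    Name="sensor-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sensor_db",
    # Keys laid out as s3://my-sensor-bucket/yyyy/mm/dd/device-id/...,
    # so the crawler infers time-based partitions.
    Targets={"S3Targets": [{"Path": "s3://my-sensor-bucket/"}]},
    Schedule="cron(0 1 * * ? *)",  # optional: rescan daily at 01:00 UTC
)
glue.start_crawler(Name="sensor-data-crawler")
```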

  • Glue + Hive
    • Hive is a service that runs on EMR and lets you run SQL-like queries (HiveQL).
    • The Glue Data Catalog can be read by Hive, serving as its metastore; one way to wire this up is sketched below.
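
A boto3 sketch of launching an EMR cluster whose Hive metastore points at the Glue Data Catalog. The cluster name, instance settings, and roles are placeholders; the `hive-site` classification shown is the documented way to enable this:

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="hive-on-glue-catalog",  # placeholder cluster name
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            # Point Hive's metastore at the Glue Data Catalog.
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory",
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```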

  • Glue ETL
    • Automatic code generation
    • Scala or Python
    • Encryption
      • Server side (at rest)
      • SSL in transit
    • Can be event driven
    • Can provision additional DPUs (data processing units) to increase the performance of the underlying Spark jobs.
    • Errors are reported via CloudWatch.
    • Transform, clean, and enrich data (before analysis); the sketch after this list shows several of these transforms.
      • Bundled transformations
        • DropFields, DropNullFields
        • Filter - specify a function to filter records (e.g., remove outliers).
        • Join - to enrich data.
        • Map - add fields, delete fields, perform external lookups.
      • Machine Learning Transformations
        • FindMatches ML: identify duplicate or matching records.
      • Format conversions: CSV, JSON, Avro, Parquet, ORC, XML.
      • Apache Spark transformations (K-means).
      • Develop ETL scripts using a notebook.
        • Then create an ETL job that runs your script (using Spark and Glue).
        • The endpoint is in a VPC controlled by a security group; connect via:
          • Apache Zeppelin on your local machine
          • Zeppelin notebook server on EC2.
          • SageMaker notebook
          • Terminal Window
          • PyCharm Professional Edition.
          • Use an Elastic IP to access a private endpoint.
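
A short Glue ETL sketch using the awsglue library's bundled transforms. The database, table, and column names are hypothetical, continuing the sensor example above:

```python
from awsglue.context import GlueContext
from awsglue.transforms import DropFields, Filter, Join, Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Tables assumed to have been created by the crawler sketched earlier.
readings = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings")
devices = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="devices")

# DropFields: remove columns not needed downstream.
readings = DropFields.apply(frame=readings, paths=["debug_blob"])

# Filter: drop outlier records with a predicate function.
readings = Filter.apply(frame=readings,
                        f=lambda row: 0 <= row["temperature"] <= 150)

# Join: enrich readings with device metadata.
enriched = Join.apply(readings, devices, "device_id", "device_id")

# Map: add a derived field to every record.
def add_fahrenheit(row):
    row["temp_f"] = row["temperature"] * 9 / 5 + 32
    return row

enriched = Map.apply(frame=enriched, f=add_fahrenheit)

# Format conversion on write: store the result as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=enriched, connection_type="s3",
    connection_options={"path": "s3://my-sensor-bucket/enriched/"},
    format="parquet")
```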
    • Other salient features.
      • Generate ETL code in Python or Scala. You can modify the code.
      • Can provide your own Spark or PySpark scripts
      • Output/target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog.
      • Fully managed and cost-effective; pay only for the resources consumed.
      • Jobs run on a serverless Spark platform.
      • Glue scheduler to schedule jobs.
      • Glue triggers to automate job runs; a job-plus-trigger sketch follows this list.
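
A boto3 sketch of creating a job with extra DPU capacity and scheduling it with a cron-style trigger. The job name, script location, and role are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="enrich-sensor-data",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-scripts/enrich_sensor_data.py",
             "PythonVersion": "3"},
    # Provision extra DPUs to speed up the underlying Spark job.
    MaxCapacity=10.0,
    # SecurityConfiguration="my-sec-config",  # hypothetical: at-rest encryption
)

# Schedule the job with a Glue trigger (here, daily at 02:00 UTC).
glue.create_trigger(
    Name="nightly-enrich",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "enrich-sensor-data"}],
    StartOnCreation=True,
)
```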
    • Running Glue jobs
      • Time-based schedules (cron-style)
      • Job bookmarks (see the sketch after this list).
        • Persists state from the job run
        • Prevents reprocessing old data.
        • Allows you to process new data when re-running on schedule.
        • Works with S3
        • Works with relational databases via JDBC.
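
A minimal sketch of what a bookmark-aware Glue script looks like. The job must also be started with `--job-bookmark-option job-bookmark-enable`, and each bookmarked source needs a `transformation_ctx`; names here are hypothetical:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())

# Job.init/commit plus transformation_ctx are what let the bookmark
# persist state between runs.
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

readings = glueContext.create_dynamic_frame.from_catalog(
    database="sensor_db", table_name="readings",
    transformation_ctx="readings")  # bookmarked source

# ... transforms and writes go here ...

job.commit()  # records the bookmark so old data is not reprocessed
```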
      • CloudWatch events
        • Fire off a Lambda function or an SNS notification when an ETL job succeeds or fails.
        • Invoke an EC2 run, send an event to Kinesis, or activate a Step Function (see the sketch below).
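
A boto3 sketch of a CloudWatch Events rule that matches Glue job state changes and forwards them to an SNS topic. The rule name and topic ARN are placeholders; a Lambda ARN would work the same way:

```python
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="glue-etl-finished",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED", "FAILED"]},
    }),
)

# Send matching events to an SNS topic.
events.put_targets(
    Rule="glue-etl-finished",
    Targets=[{"Id": "notify",
              "Arn": "arn:aws:sns:us-east-1:123456789012:etl-status"}],
)
```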
  • Glue cost model
    • Billed by the minute for crawlers and ETL jobs.
    • First 1M objects stored and 1M accesses are free for the Glue Data Catalog.
    • Development endpoints for developing ETL code are also charged by the minute.
  • Glue Anti patterns
    • Not recommended for streaming data (Glue is batch-oriented; minimum 5-minute intervals).
    • Using multiple ETL engines (Glue ETL is Spark-based; for other engines such as Hive or Pig, consider EMR or Data Pipeline instead).
    • No support for NoSQL databases.