AWS Glue
- Serverless discovery and definition of table definitions and schemas, for:
- S3 "data lakes"
- RDS
- Redshift
- Most other SQL databases
- Custom ETL jobs
- Trigger-driven, on a schedule, or on demand
- Components of AWS Glue
- Glue Crawler / Data Catalog
- Can run periodically and scan S3.
- It will extract partitions based on how your S3 data is organized.
- Think up front how you will be querying your data lake in S3.
- Example: devices send sensor data every hour.
- Do you query primarily by time ranges?
- If so, organize as yyyy/mm/dd/device-id
- Do you query primarily by device?
- If so, organize your buckets as device-id/yyyy/mm/dd
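A rough sketch of the two layouts above; the bucket name, device id, and file name are made-up placeholders:

```python
# A minimal sketch of the two partitioning layouts above.
# Bucket name, device id, and file name are hypothetical placeholders.
from datetime import datetime, timezone

reading_time = datetime(2020, 7, 27, tzinfo=timezone.utc)
device_id = "device-42"

# Query primarily by time range: date first, device last.
time_first = f"s3://sensor-lake/{reading_time:%Y/%m/%d}/{device_id}/reading.json"

# Query primarily by device: device first, date last.
device_first = f"s3://sensor-lake/{device_id}/{reading_time:%Y/%m/%d}/reading.json"

print(time_first)    # s3://sensor-lake/2020/07/27/device-42/reading.json
print(device_first)  # s3://sensor-lake/device-42/2020/07/27/reading.json
```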
- The Glue Data Catalog stores only metadata (table definitions, schemas, partitions); the underlying data stays in its source (e.g., S3).
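A minimal sketch of defining such a crawler with boto3; the crawler name, IAM role, catalog database, and S3 path are all hypothetical:

```python
# Create a crawler that scans an S3 path hourly and populates the Data Catalog.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sensor-data-crawler",                    # hypothetical name
    Role="GlueCrawlerRole",                        # IAM role Glue assumes
    DatabaseName="sensors",                        # Data Catalog database to populate
    Targets={"S3Targets": [{"Path": "s3://sensor-lake/"}]},
    Schedule="cron(0 * * * ? *)",                  # run at the top of every hour
)
glue.start_crawler(Name="sensor-data-crawler")     # or wait for the schedule
```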
- Glue + Hive
- Hive is a service that runs on EMR and lets you run SQL-like queries.
- The Glue Data Catalog can be read by Hive (serving as its metastore).
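A minimal sketch of launching an EMR cluster whose Hive reads table metadata from the Glue Data Catalog; the cluster name, sizes, and roles are placeholders, while the hive-site factory class is the one AWS documents for this integration:

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="hive-on-glue-catalog",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            # Point Hive's metastore client at the Glue Data Catalog.
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory",
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "InstanceCount": 1,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```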
- Glue ETL
- Automatic code generation
- Scala or Python
- Encryption
- Server side (at rest)
- SSL in transit
- Can be event driven
- Can provision additional DPUs (data processing units) to increase the performance of the underlying Spark jobs.
- Errors are reported via CloudWatch.
- Transform, clean, and enrich data (before analysis).
- Bundled transformations
- DropFields, DropNullFields
- Filter - specify a function to filter records (e.g., remove outliers).
- Join - to enrich data.
- Map - add fields, delete fields, perform external lookups.
- Machine Learning Transformations
- FindMatches ML: identify duplicate or matching records.
- Format conversions: CSV, JSON, Avro, Parquet, ORC, XML.
- Apache Spark transformations (e.g., K-means).
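A minimal sketch of a Glue ETL script using two of the bundled transformations above (Filter and DropNullFields); the catalog database, table, field name, and output path are hypothetical:

```python
from awsglue.context import GlueContext
from awsglue.transforms import DropNullFields, Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a DynamicFrame from the Glue Data Catalog.
readings = glue_context.create_dynamic_frame.from_catalog(
    database="sensors", table_name="readings")

# Filter: keep only records within a plausible range (outlier removal).
in_range = Filter.apply(
    frame=readings,
    f=lambda rec: 0 <= rec["temperature"] <= 150)

# DropNullFields: drop fields that are null in every record.
cleaned = DropNullFields.apply(frame=in_range)

# Write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://sensor-lake/clean/"},
    format="parquet")
```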
- Develop ETL scripts using a notebook.
- Then create an ETL job that runs your script (using Spark and Glue).
- The development endpoint is in a VPC controlled by a security group; connect via:
- Apache Zeppelin on your local machine
- Zeppelin notebook server on EC2
- SageMaker notebook
- Terminal window
- PyCharm Professional Edition
- Use an Elastic IP to access a private endpoint.
- Other salient features.
- Generate ETL code in Python or Scala. You can modify the code.
- Can provide your own Spark or PySpark scripts
- Output/target can be S3, JDBC (RDS, Redshift), or the Glue Data Catalog.
- Fully managed and cost-effective; pay only for the resources consumed.
- Jobs run on a serverless Spark platform.
- Glue scheduler to schedule jobs.
- Glue triggers to automate job runs based on events.
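A minimal sketch of a scheduled trigger created with boto3; the trigger and job names are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="nightly-etl-trigger",               # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",             # every day at 02:00 UTC
    Actions=[{"JobName": "my-etl-job"}],      # hypothetical job to start
    StartOnCreation=True,
)
```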
- Running Glue jobs
- Time-based schedules (cron style)
- Job bookmarks.
- Persists state from the job run.
- Prevents reprocessing of old data.
- Allows you to process only new data when re-running on a schedule.
- Works with S3.
- Works with relational databases via JDBC.
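A minimal sketch of enabling job bookmarks when defining a job with boto3; the job name, role, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="my-etl-job",
    Role="GlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-scripts/etl.py"},
    DefaultArguments={
        # Persist state between runs so only new data is processed.
        "--job-bookmark-option": "job-bookmark-enable",
    },
)
```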
- CloudWatch events
- Fire off a Lambda function or SNS notification when an ETL job succeeds or fails.
- Invoke an EC2 Run Command, send an event to Kinesis, or activate a Step Functions state machine.
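A minimal sketch of wiring this up as a CloudWatch Events (EventBridge) rule with boto3; the rule name and SNS topic ARN are placeholders:

```python
import json

import boto3

events = boto3.client("events")

# Match Glue job runs that end in success or failure.
events.put_rule(
    Name="glue-job-state-change",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED", "FAILED"]},
    }),
)

# Route matching events to an SNS topic (a Lambda, Kinesis stream, or
# Step Functions state machine would work the same way).
events.put_targets(
    Rule="glue-job-state-change",
    Targets=[{"Id": "notify",
              "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],
)
```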
- Glue cost model
- Billed by the minute for crawler and ETL jobs.
- The first million objects stored and the first million accesses are free for the Glue Data Catalog.
- Development endpoints for developing ETL code are charged by the minute.
- Glue anti-patterns
- Not recommended for streaming data (Glue is batch oriented; minimum 5-minute intervals).
- Not a fit if you need multiple ETL engines (Glue ETL is based on Spark).
- No support for NoSQL databases.