As Glue limits comments to 255 characters, we may need to truncate them. #38

Open · mikklepp wants to merge 1 commit into master

Conversation

mikklepp

No description provided.

shahbazaamir pushed a commit to shahbazaamir/aws-glue-samples that referenced this pull request Jan 14, 2025
manabery (Contributor) commented May 21, 2025

Investigation

The maximum length of a column comment in the Hive Metastore is 256 characters.
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/sql/mysql/hive-schema-3.0.0.mysql.sql#L54

The Glue Data Catalog allows at most 255 characters for a column comment.
https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html#Glue-Type-Column-Comment

We need to truncate the comment, as @mikklepp pointed out.
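In short, a comment stored at Hive's full length is exactly one character too long for Glue. A minimal sketch of the cap this calls for (constant and function names here are illustrative, not from the tool):

HIVE_COMMENT_MAX = 256   # COLUMNS_V2.COMMENT is varchar(256) in the Hive schema
GLUE_COMMENT_MAX = 255   # length constraint on Glue's Column.Comment

def truncate_comment(comment):
    # Cap the comment at Glue's limit; None (no comment) passes through.
    return comment[:GLUE_COMMENT_MAX] if comment is not None else None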

Current behavior

Create a table on Hive Metastore v3.1.3. Hive automatically truncates longer comments to 256 characters.

hive> CREATE TABLE
    > comment_test (
    >     long_comment int COMMENT "Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metadata from the Hive metastore and write it to AWS Glue Data Catalog.",
    >     short_comment int COMMENT "aws-glue-samples",
    >     none_comment int
    > );
OK
Time taken: 1.269 seconds
hive> describe comment_test;
OK
long_comment        	int                 	Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metad
short_comment       	int                 	aws-glue-samples
none_comment        	int
Time taken: 0.288 seconds, Fetched: 3 row(s)

The migration job from the Hive Metastore to the Glue Data Catalog then fails, because the stored comment is 256 characters, one more than Glue allows.

py4j.protocol.Py4JJavaError: An error occurred while calling o1954.pyWriteDynamicFrame.  
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 122.0 failed 4 times, most recent failure: Lost task 5.3 in stage 122.0 (TID 1430) (172.31.49.211 executor 8): software.amazon.awssdk.services.glue.model.ValidationException: 1 validation error detected: Value 'Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metad' at 'table.storageDescriptor.columns.1.member.comment' failed to satisfy constraint: Member must have length less than or equal to 255 (Service: Glue, Status Code: 400, Request ID: 
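For reference, the comment Hive kept (quoted in the error above) is exactly 256 characters, one over Glue's limit; a quick Python check:

stored = (
    "Set up an AWS Glue ETL job which extracts metadata from your Hive "
    "metastore (MySQL) and loads it into your AWS Glue Data Catalog. This "
    "method requires an AWS Glue connection to the Hive metastore as a JDBC "
    "source. An ETL script is provided to extract metad"
)
assert len(stored) == 256  # one more than Glue's 255-character maximum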

Test for from-jdbc mode

Steps

  • Run the patched tool against the above Hive metastore (the patched transform_ms_columns() is shown below). The Hive table contains three columns:
    • a column with a 256-character comment
    • a column with a 16-character comment
    • a column with no comment
    def transform_ms_columns(self, ms_columns):
        def extract_row(row):
            def truncate(x):
                # Glue Data Catalog caps column comments at 255 characters.
                # None (a column without a comment) has no __getitem__ and
                # passes through unchanged.
                return x[:255] if hasattr(x, "__getitem__") else x
            return (
                row['COLUMN_NAME'],
                row['TYPE_NAME'],
                truncate(row['COMMENT'])
            )
        return self.transform_df_with_idx(
            df=ms_columns,
            id_col="CD_ID",
            idx="INTEGER_IDX",
            payloads_column_name="columns",
            payload_type=StructType(
                [
                    StructField(name="name", dataType=StringType()),
                    StructField(name="type", dataType=StringType()),
                    StructField(name="comment", dataType=StringType()),
                ]
            ),
            payload_func=extract_row,
        )
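The truncate helper slices anything that supports __getitem__ and passes everything else through, so a missing comment (None) survives unchanged. A standalone check of that behavior:

def truncate(x):
    return x[:255] if hasattr(x, "__getitem__") else x

assert len(truncate("a" * 256)) == 255                      # long comment capped
assert truncate("aws-glue-samples") == "aws-glue-samples"   # short comment untouched
assert truncate(None) is None                               # no comment stays absent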

Result

  • The job succeeded, and the long comment was truncated to 255 characters.
  • The other columns were also imported successfully.
            "Columns": [
                {
                    "Name": "long_comment",
                    "Type": "int",
                    "Comment": "Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract meta"
                },
                {
                    "Name": "short_comment",
                    "Type": "int",
                    "Comment": "aws-glue-samples"
                },
                {
                    "Name": "none_comment",
                    "Type": "int"
                }
            ],
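The column lengths can also be double-checked from the Data Catalog side (a sketch using boto3; the database name "default" is an assumption):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="default", Name="comment_test")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    # Columns migrated without a comment carry no "Comment" key at all.
    print(column["Name"], len(column.get("Comment", "")))
# Expected: long_comment 255, short_comment 16, none_comment 0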

Test for from-metastore mode

Steps

  1. Create the same comment_test table on an EMR 7.9.0 cluster.
  2. Run hive_metastore_migration.py on the cluster.
$ spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --mode from-metastore \
  --jdbc-url jdbc:mysql://**:3306 \
  --jdbc-user hive \
  --jdbc-password ** \
  --output-path s3://path/

Result

The output JSON file contains the truncated comment.

$ aws s3 cp s3://path/tables/part-00000-**.json -
...
"columns":[{"name":"long_comment","type":"int","comment":"Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract meta"},
{"name":"short_comment","type":"int","comment":"aws-glue-samples"},
{"name":"none_comment","type":"int"}],
...
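To confirm the lengths in the downloaded file (a sketch; the local file name is a placeholder, and it assumes each line of the part file is one JSON table record with a "columns" list as shown above):

import json

with open("part-00000.json") as f:  # placeholder for the downloaded part file
    for line in f:
        for col in json.loads(line).get("columns", []):
            if "comment" in col:
                print(col["name"], len(col["comment"]))
# Expected: long_comment 255, short_comment 16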

Note

Hive accepts comments of up to 4,000 characters on partition keys, and this implementation doesn't truncate comments on partition columns. We need to add the same change in transform_ms_partition_keys() as well; a sketch follows the schema link below.
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/sql/mysql/hive-schema-3.0.0.mysql.sql#L263
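A hypothetical sketch of that analogous fix (PKEY_NAME / PKEY_TYPE / PKEY_COMMENT are the Hive metastore's PARTITION_KEYS columns; the surrounding function shape is an assumption, not the tool's actual code):

def extract_partition_key_row(row):
    def truncate(x):
        # Same 255-character cap as in transform_ms_columns() above.
        return x[:255] if hasattr(x, "__getitem__") else x
    return (
        row['PKEY_NAME'],
        row['PKEY_TYPE'],
        truncate(row['PKEY_COMMENT'])
    )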

manabery added a commit to manabery/aws-glue-samples that referenced this pull request May 23, 2025