As Glue limits comments to 255 characters, we may need to truncate them. #38

Open · mikklepp wants to merge 1 commit into master

Conversation

mikklepp

No description provided.

shahbazaamir pushed a commit to shahbazaamir/aws-glue-samples that referenced this pull request Jan 14, 2025
manabery (Contributor) commented May 21, 2025

Investigation

The maximum length of a column comment in the Hive Metastore is 256 characters.
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/sql/mysql/hive-schema-3.0.0.mysql.sql#L54

The Glue Data Catalog allows at most 255 characters for a column comment.
https://docs.aws.amazon.com/glue/latest/webapi/API_Column.html#Glue-Type-Column-Comment

We need to truncate the comment, as @mikklepp pointed out.
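In short, a comment stored at Hive's full length is exactly one character too long for Glue. A minimal sketch of the cap this calls for (constant and function names here are illustrative, not from the tool):

HIVE_COMMENT_MAX = 256   # COLUMNS_V2.COMMENT is varchar(256) in the Hive schema
GLUE_COMMENT_MAX = 255   # length constraint on Glue's Column.Comment

def truncate_comment(comment):
    # Cap the comment at Glue's limit; None (no comment) passes through.
    return comment[:GLUE_COMMENT_MAX] if comment is not None else None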

Current behavior

Create a table on Hive Metastore v3.1.3. Hive automatically truncates longer comments to 256 characters.

hive> CREATE TABLE
    > comment_test (
    >     long_comment int COMMENT "Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metadata from the Hive metastore and write it to AWS Glue Data Catalog.",
    >     short_comment int COMMENT "aws-glue-samples",
    >     none_comment int
    > );
OK
Time taken: 1.269 seconds
hive> describe comment_test;
OK
long_comment        	int                 	Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metad
short_comment       	int                 	aws-glue-samples
none_comment        	int
Time taken: 0.288 seconds, Fetched: 3 row(s)

The migration job from the Hive Metastore to the Glue Data Catalog then fails, because the stored comment is 256 characters, one more than Glue allows.

py4j.protocol.Py4JJavaError: An error occurred while calling o1954.pyWriteDynamicFrame.  
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 122.0 failed 4 times, most recent failure: Lost task 5.3 in stage 122.0 (TID 1430) (172.31.49.211 executor 8): software.amazon.awssdk.services.glue.model.ValidationException: 1 validation error detected: Value 'Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metad' at 'table.storageDescriptor.columns.1.member.comment' failed to satisfy constraint: Member must have length less than or equal to 255 (Service: Glue, Status Code: 400, Request ID: 
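For reference, the comment Hive kept (quoted in the error above) is exactly 256 characters, one over Glue's limit; a quick Python check:

stored = (
    "Set up an AWS Glue ETL job which extracts metadata from your Hive "
    "metastore (MySQL) and loads it into your AWS Glue Data Catalog. This "
    "method requires an AWS Glue connection to the Hive metastore as a JDBC "
    "source. An ETL script is provided to extract metad"
)
assert len(stored) == 256  # one more than Glue's 255-character maximum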

Test for from-jdbc mode

Steps

  • Run the patched tool against the above Hive metastore (the patched transform_ms_columns() is shown below). The Hive table contains three columns:
    • a column with a 256-character comment
    • a column with a 16-character comment
    • a column with no comment
    def transform_ms_columns(self, ms_columns):
        def extract_row(row):
            def truncate(x):
                # Glue Data Catalog caps column comments at 255 characters.
                # None (a column without a comment) has no __getitem__ and
                # passes through unchanged.
                return x[:255] if hasattr(x, "__getitem__") else x
            return (
                row['COLUMN_NAME'],
                row['TYPE_NAME'],
                truncate(row['COMMENT'])
            )
        return self.transform_df_with_idx(
            df=ms_columns,
            id_col="CD_ID",
            idx="INTEGER_IDX",
            payloads_column_name="columns",
            payload_type=StructType(
                [
                    StructField(name="name", dataType=StringType()),
                    StructField(name="type", dataType=StringType()),
                    StructField(name="comment", dataType=StringType()),
                ]
            ),
            payload_func=extract_row,
        )
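The truncate helper slices anything that supports __getitem__ and passes everything else through, so a missing comment (None) survives unchanged. A standalone check of that behavior:

def truncate(x):
    return x[:255] if hasattr(x, "__getitem__") else x

assert len(truncate("a" * 256)) == 255                      # long comment capped
assert truncate("aws-glue-samples") == "aws-glue-samples"   # short comment untouched
assert truncate(None) is None                               # no comment stays absent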

Result

  • The job succeeded, and the long comment was truncated to 255 characters.
  • The other columns were also imported successfully.
            "Columns": [
                {
                    "Name": "long_comment",
                    "Type": "int",
                    "Comment": "Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract meta"
                },
                {
                    "Name": "short_comment",
                    "Type": "int",
                    "Comment": "aws-glue-samples"
                },
                {
                    "Name": "none_comment",
                    "Type": "int"
                }
            ],
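The column lengths can also be double-checked from the Data Catalog side (a sketch using boto3; the database name "default" is an assumption):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="default", Name="comment_test")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    # Columns migrated without a comment carry no "Comment" key at all.
    print(column["Name"], len(column.get("Comment", "")))
# Expected: long_comment 255, short_comment 16, none_comment 0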

Test for from-metastore mode

Steps

  1. Create the same comment_test table on an EMR 7.9.0 cluster.
  2. Run hive_metastore_migration.py on the cluster.
$ spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --mode from-metastore \
  --jdbc-url jdbc:mysql://**:3306 \
  --jdbc-user hive \
  --jdbc-password ** \
  --output-path s3://path/

Result

The output JSON file contains the truncated comment.

$ aws s3 cp s3://path/tables/part-00000-**.json -
...
"columns":[{"name":"long_comment","type":"int","comment":"Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract meta"},
{"name":"short_comment","type":"int","comment":"aws-glue-samples"},
{"name":"none_comment","type":"int"}],
...
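To confirm the lengths in the downloaded file (a sketch; the local file name is a placeholder, and it assumes each line of the part file is one JSON table record with a "columns" list as shown above):

import json

with open("part-00000.json") as f:  # placeholder for the downloaded part file
    for line in f:
        for col in json.loads(line).get("columns", []):
            if "comment" in col:
                print(col["name"], len(col["comment"]))
# Expected: long_comment 255, short_comment 16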

Note

Hive accepts comments of up to 4,000 characters on partition keys, and this implementation doesn't truncate comments on partition columns. We need to add the same change in transform_ms_partition_keys() as well; a sketch follows the schema link below.
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/sql/mysql/hive-schema-3.0.0.mysql.sql#L263
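A hypothetical sketch of that analogous fix (PKEY_NAME / PKEY_TYPE / PKEY_COMMENT are the Hive metastore's PARTITION_KEYS columns; the surrounding function shape is an assumption, not the tool's actual code):

def extract_partition_key_row(row):
    def truncate(x):
        # Same 255-character cap as in transform_ms_columns() above.
        return x[:255] if hasattr(x, "__getitem__") else x
    return (
        row['PKEY_NAME'],
        row['PKEY_TYPE'],
        truncate(row['PKEY_COMMENT'])
    )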

manabery added a commit to manabery/aws-glue-samples that referenced this pull request May 23, 2025