Add an option to provide parameters by JSON file for hive metastore migration utility #175

Open: manabery wants to merge 6 commits into master

Contributor

@manabery manabery commented May 28, 2025

This is a PR to merge #13

Changes made in #13

  • Parameters for the from_metastore and to_metastore options are passed via a YAML config file. This keeps the JDBC password off the spark-submit command line.

Changes added in this PR

  • Made the config file optional, so users can keep passing parameters on the command line if they prefer.
  • Changed the config file format from YAML to JSON. This avoids having to install PyYAML on Glue 5.0 jobs. (PyYAML is bundled only up to Glue 4.0; see doc.)
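
The two changes above can be illustrated with a minimal sketch. It is not the code in this PR: `build_parser` and `load_config` are hypothetical names, and the parser is trimmed to a few options for illustration. Only the `--config_file`, `--mode`, and `--jdbc-url` flags come from the test commands in this description.

```python
import argparse
import json


def build_parser():
    # Hypothetical, trimmed-down parser; the real script defines more options.
    parser = argparse.ArgumentParser(prog="hive_metastore_migration.py")
    parser.add_argument("--mode")
    parser.add_argument("--jdbc-url")
    # The config file is optional: when omitted, only CLI options are used.
    parser.add_argument("--config_file", default=None,
                        help="optional path to a JSON file holding the same options")
    return parser


def load_config(path):
    # json ships with the Python standard library, so nothing extra
    # needs to be installed on the job (unlike PyYAML).
    with open(path, "r") as stream:
        config = json.load(stream)
    if not isinstance(config, dict):
        raise ValueError("config file must contain a JSON object")
    return config
```

With no `--config_file` argument, `build_parser().parse_args([]).config_file` is `None`, so a caller can skip `load_config` entirely and behave as before.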

Tests

Testing with a config file

Case 1: run with a complete config file

  1. Launch an EMR 7.9 cluster
  2. Create a config file
{
    "mode": "from-metastore",
    "jdbc_url": "jdbc:mysql://**:3306",
    "jdbc_username": "hive",
    "jdbc_password": "password",
    "database_prefix": "dbpre_",
    "table_prefix": "tablepre_",
    "output_path": "s3://path"
}
  3. Run hive_metastore_migration.py
$ spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --config_file config.json

Result: succeeded; the metastore was exported to S3.

Case 2: run with a config file that lacks a mandatory parameter

  1. Launch an EMR 7.9 cluster
  2. Create a config file (without jdbc_password)
{
    "mode": "from-metastore",
    "jdbc_url": "jdbc:mysql://**:3306",
    "jdbc_username": "hive",
    "output_path": "s3://path"
}
  3. Run hive_metastore_migration.py
$ spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --config_file config.json

Result: the command failed with the following message, as expected.

2025-05-28 07:53:52,041 - root - INFO - config_file provided. Parsing arguments from config.json
Traceback (most recent call last):
  File "/home/hadoop/hive_metastore_migration.py", line 1775, in <module>
    main()
  File "/home/hadoop/hive_metastore_migration.py", line 1757, in main
    options = parse_arguments(sys.argv)
  File "/home/hadoop/hive_metastore_migration.py", line 1613, in parse_arguments
    validate_options_in_mode(
  File "/home/hadoop/hive_metastore_migration.py", line 1691, in validate_options_in_mode
    raise AssertionError("Option %s is required for mode %s" % (option, mode))
AssertionError: Option jdbc_password is required for mode from-metastore
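
The check behind this error can be sketched roughly as follows. This is an illustration, not the script's actual `validate_options_in_mode`: the function name `validate_required` and its signature are hypothetical, and the list of required options is an assumption. Only the error message format is taken from the traceback above.

```python
def validate_required(options, mode, required):
    # Hypothetical helper: fail fast on the first missing mandatory option.
    for option in required:
        if options.get(option) is None:
            # Same message format as the traceback in Case 2.
            raise AssertionError("Option %s is required for mode %s" % (option, mode))


# Options as they would look after reading the Case 2 config file.
case2_options = {
    "mode": "from-metastore",
    "jdbc_url": "jdbc:mysql://**:3306",
    "jdbc_username": "hive",
    "output_path": "s3://path",
}

try:
    # The required list here is assumed for illustration.
    validate_required(case2_options, case2_options["mode"],
                      ["jdbc_url", "jdbc_username", "jdbc_password"])
    error_message = None
except AssertionError as err:
    error_message = str(err)
```

Because the validation runs after the config file has been merged into the options, a parameter missing from both the command line and the file is reported the same way as before this change.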

Case 3: run with a config file and a command-line parameter

  1. Launch an EMR 7.9 cluster
  2. Create a config file
{
    "mode": "from-metastore",
    "jdbc_url": "jdbc:mysql://**:3306",
    "jdbc_username": "hive",
    "jdbc_password": "password",
    "database_prefix": "dbpre_",
    "table_prefix": "tablepre_",
    "output_path": "s3://path"
}
  3. Run hive_metastore_migration.py with --database-prefix
$ spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --database-prefix dbpre_updated_ \
  --config_file config.json

Result: the metastore was exported successfully, and the database_prefix value in the config file was overridden by the --database-prefix value given on the command line.

Testing without a config file (regression)

Case 4: run with command-line parameters

  1. Launch an EMR 7.9 cluster
  2. Run hive_metastore_migration.py
spark-submit \
  --jars $MYSQL_JAR_PATH \
  /home/hadoop/hive_metastore_migration.py \
  --mode from-metastore \
  --jdbc-url jdbc:mysql://**:3306 \
  --jdbc-user hive \
  --jdbc-password ** \
  --output-path s3://**/

Result: the job succeeded.

Case 5: run with from-s3 mode

  1. Run import_into_datacatalog.py with from-s3 mode on Glue 5.0

Result: Succeeded (importing updated hive_metastore_migration.py didn't affect the behavior)

Case 6: run with to-s3 mode

  1. Run export_from_datacatalog.py with to-s3 mode on Glue 3.0

Result: Succeeded (importing updated hive_metastore_migration.py didn't affect the behavior)

Comment on lines -363 to +377
-- Set `--direction` to `to_metastore`.
+- Set `--mode` to `to_metastore`.
Contributor

Just to confirm: the --direction argument never existed, and we are updating the README to match the implementation, correct?

Contributor Author

@manabery manabery Jun 2, 2025

Yes, the --direction option never existed in the code, so this change corrects the README.

Comment on lines 1600 to +1614
     options = get_options(parser, args)

-    if options["mode"] == FROM_METASTORE:
+    if options.get("config_file") is not None:
+        # parse json config file if provided
+        config_file_path = options["config_file"]
+        logger.info(f"config_file provided. Parsing arguments from {config_file_path}")
+        with open(config_file_path, 'r') as json_file_stream:
+            config_options = json.load(json_file_stream)
+
+        # merge options. command line options are prioritized.
+        for key in config_options:
+            if not options.get(key):
+                options[key] = config_options[key]
+            elif options[key] is None:
+                options[key] = config_options[key]
Contributor

With this approach, if both config_file and mode are provided on the command line, the mode is replaced by the one specified in the config file. Is my understanding correct, and is that the intended behavior? To me, it is more natural for an explicitly set argument to take priority over the config file.

Either way, we need to explain the precedence in the README.

Contributor Author

No. If the same argument is provided both on the command line and via config_file, the value in the config file is ignored. The options variable is created at line 1600, where the command-line arguments are read. Arguments from the config file are added to options only when the corresponding arguments were not provided on the command line. This is the scenario covered by test case 3.

I will update the README to explain this behavior.
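
For reference, the precedence described here can be sketched in isolation. This is a simplified illustration rather than the PR's code: `merge_options` is a hypothetical name, and it tests `is None` instead of the truthiness check used in the diff, on the assumption that an option absent from the command line parses to `None`.

```python
def merge_options(cli_options, config_options):
    # Start from the command-line values; they always win when set.
    merged = dict(cli_options)
    for key, value in config_options.items():
        # Fill in from the config file only where the CLI gave nothing.
        # Assumption: options not passed on the command line are None.
        if merged.get(key) is None:
            merged[key] = value
    return merged


# Test case 3: --database-prefix is given on the command line,
# everything else comes from config.json.
cli = {"mode": None, "database_prefix": "dbpre_updated_", "config_file": "config.json"}
config = {"mode": "from-metastore", "database_prefix": "dbpre_", "table_prefix": "tablepre_"}
merged = merge_options(cli, config)
```

Here `merged["database_prefix"]` stays `"dbpre_updated_"`, while `mode` and `table_prefix` are taken from the config file. One difference worth noting: a truthiness check would also let the config file replace an explicitly passed empty string, whereas the `is None` check in this sketch would not.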
