Add an option to provide parameters by JSON file for hive metastore migration utility #175

Open

wants to merge 6 commits into base: master
27 changes: 26 additions & 1 deletion utilities/Hive_metastore_migration/README.md
@@ -206,6 +206,20 @@ as a Glue ETL job, if AWS Glue can directly connect to your Hive metastore.
- `--database-prefix` and `--table-prefix` (optional) to set a string prefix that is applied to the
database and table names. They are empty by default.

- Optionally, you can set `--config_file` to `<path_to_your_config_json_file>`, a JSON file that contains the configuration parameters. If the same parameter is specified both in the configuration JSON file and on the command line, the value given on the command line takes precedence.
- Provide the following configuration parameters in the configuration JSON file:
```json
{
"mode": "from-metastore",
"jdbc_url": "JDBC URL",
"jdbc_username": "JDBC username",
"jdbc_password": "JDBC password",
"database_prefix": "Database prefix",
"table_prefix": "Table prefix",
"output_path": "Output local or s3 path"
}
```
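The precedence rule above can be sketched in a few lines of standalone Python. Note that `parse_with_config` is a hypothetical helper for illustration only, with the argument list shortened; it is not part of the utility:

```python
import argparse


def parse_with_config(argv, config):
    """Merge parsed CLI arguments with a config dict; CLI values win."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode")
    parser.add_argument("--jdbc-url", dest="jdbc_url")
    parser.add_argument("--output-path", dest="output_path")
    options = vars(parser.parse_args(argv))
    # Config values fill only the gaps the command line left as None.
    for key, value in config.items():
        if options.get(key) is None:
            options[key] = value
    return options


config = {"mode": "from-metastore", "jdbc_url": "jdbc:mysql://example:3306"}
opts = parse_with_config(["--mode", "to-metastore"], config)
# "mode" was given on the command line, so the config file's value is ignored;
# "jdbc_url" was not, so it is taken from the config file.
```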

- Example spark-submit command to migrate Hive metastore to S3, tested on EMR-4.7.1:
```bash
MYSQL_JAR_PATH=/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
@@ -360,7 +374,7 @@ as a Glue ETL job, if AWS Glue can directly connect to your Hive metastore.

3. Submit the `hive_metastore_migration.py` Spark script to your Spark cluster.

- Set `--direction` to `to_metastore`.
- Set `--mode` to `to-metastore`.
Comment on lines -363 to +377
Contributor

Just to confirm: the `--direction` argument did not exist, and we are updating the README to match the implementation, correct?

Contributor Author

@manabery manabery Jun 2, 2025

Yes, the `--direction` option never existed in the code; this change corrects the README to match.

- Provide the JDBC connection information through the arguments:
`--jdbc-url`, `--jdbc-username`, and `--jdbc-password`.
- The argument `--input-path` is required. This can be a local directory or
@@ -382,6 +396,17 @@ as a Glue ETL job, if AWS Glue can directly connect to your Hive metastore.

s3://gluemigrationbucket/export_output/<year-month-day-hour-minute-seconds>/

- Optionally, you can set `--config_file` to `<path_to_your_config_json_file>`, a JSON file that contains the configuration parameters. If the same parameter is specified both in the configuration JSON file and on the command line, the value given on the command line takes precedence.
- Provide the following configuration parameters in the configuration JSON file:
```json
{
"mode": "to-metastore",
"jdbc_url": "JDBC URL",
"jdbc_username": "JDBC username",
"jdbc_password": "JDBC password",
"input_path": "Input local or S3 path"
}
```
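Since the configuration file stores a JDBC password in plain text, it is worth generating it programmatically and restricting its permissions. A minimal sketch, where every value is a placeholder you would replace with real connection details:

```python
import json
import os
import tempfile

# Placeholder values; substitute your real connection details.
config = {
    "mode": "to-metastore",
    "jdbc_url": "jdbc:mysql://metastore.example.com:3306",
    "jdbc_username": "hive",
    "jdbc_password": "example-password",
    "input_path": "s3://example-bucket/export_output/",
}

path = os.path.join(tempfile.mkdtemp(), "migration_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)
# The file contains a password: owner read/write only.
os.chmod(path, 0o600)
```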

#### AWS Glue Data Catalog to another AWS Glue Data Catalog

Expand Down
43 changes: 35 additions & 8 deletions utilities/Hive_metastore_migration/src/hive_metastore_migration.py
@@ -1580,25 +1580,52 @@ def get_options(parser, args):


def parse_arguments(args):
"""
parse arguments for the metastore migration.
If arguments are provided by both a json config file and command line, command line arguments will override any parameters specified on the json file.
----------
Return:
Dictionary of config options
"""
parser = argparse.ArgumentParser(prog=args[0])
parser.add_argument("-m", "--mode", required=True, choices=[FROM_METASTORE, TO_METASTORE], help="Choose to migrate metastore either from JDBC or from S3")
parser.add_argument("-U", "--jdbc-url", required=True, help="Hive metastore JDBC url, example: jdbc:mysql://metastore.abcd.us-east-1.rds.amazonaws.com:3306")
parser.add_argument("-u", "--jdbc-username", required=True, help="Hive metastore JDBC user name")
parser.add_argument("-p", "--jdbc-password", required=True, help="Hive metastore JDBC password")
parser.add_argument("-m", "--mode", required=False, choices=[FROM_METASTORE, TO_METASTORE], help="Choose to migrate metastore either from JDBC or from S3")
parser.add_argument("-U", "--jdbc-url", required=False, help="Hive metastore JDBC url, example: jdbc:mysql://metastore.abcd.us-east-1.rds.amazonaws.com:3306")
parser.add_argument("-u", "--jdbc-username", required=False, help="Hive metastore JDBC user name")
parser.add_argument("-p", "--jdbc-password", required=False, help="Hive metastore JDBC password")
parser.add_argument("-d", "--database-prefix", required=False, help="Optional prefix for database names in Glue DataCatalog")
parser.add_argument("-t", "--table-prefix", required=False, help="Optional prefix for table name in Glue DataCatalog")
parser.add_argument("-o", "--output-path", required=False, help="Output path, either local directory or S3 path")
parser.add_argument("-i", "--input_path", required=False, help="Input path, either local directory or S3 path")

parser.add_argument("-f", "--config_file", required=False, help="json configuration file path to read migration arguments from.")
options = get_options(parser, args)

if options["mode"] == FROM_METASTORE:
if options.get("config_file") is not None:
# parse json config file if provided
config_file_path = options["config_file"]
logger.info(f"config_file provided. Parsing arguments from {config_file_path}")
with open(config_file_path, 'r') as json_file_stream:
config_options = json.load(json_file_stream)

# merge options. command line options are prioritized.
for key in config_options:
if not options.get(key):
    options[key] = config_options[key]
Comment on lines 1600 to +1614
Contributor

If we do it like this then, for example, when both config_file and mode are provided, the mode is replaced by the mode specified in the config file. Is my understanding correct? Is this intended behavior? To me, if we explicitly set an argument, it's natural to prioritize the explicit argument over the config file.

Whichever behavior we choose, we will need to explain the prioritization in the README.

Contributor Author

No, if the same argument is provided both on the command line and via config_file, the value in the config file is ignored. The options variable is created at line 1600, where the command line arguments are read. Arguments from the config file are added into the options variable only when the corresponding arguments are not provided on the command line. This is the test case 3 scenario.

I will update the README to explain this behavior.


if options.get("mode") is None:
raise AssertionError("--mode options is required: either from_metastore or to_metastore")
elif options["mode"] == FROM_METASTORE:
validate_options_in_mode(
options=options, mode=FROM_METASTORE, required_options=["output_path"], not_allowed_options=["input_path"]
options=options, mode=FROM_METASTORE,
required_options=["jdbc_url", "jdbc_username", "jdbc_password", "output_path"],
not_allowed_options=["input_path"]
)
elif options["mode"] == TO_METASTORE:
validate_options_in_mode(
options=options, mode=TO_METASTORE, required_options=["input_path"], not_allowed_options=["output_path"]
options=options, mode=TO_METASTORE,
required_options=["jdbc_url", "jdbc_username", "jdbc_password", "input_path"],
not_allowed_options=["output_path"]
)
else:
raise AssertionError("unknown mode " + options["mode"])