
improvement: brand new way of installing and using Spark #54


Open · wants to merge 107 commits into master from improve/python3_from_cg

Conversation

@scality-gdoumergue (Contributor) commented May 23, 2025

This PR brings the following changes:

  • Spark version is now 3.5.2.
  • Its installation no longer uses an Ansible playbook. The installation documentation now lives [here](https://scality.atlassian.net/wiki/spaces/TS/pages/2710667467/DRAFT+Spark+revamped+for+EL8+compatible+with+hybrid+EL7+EL8+EL9+environments).
  • Its startup is handled by a single shell script (spark_run.sh).
  • This script handles RHEL/CentOS 7 and RHEL/Rocky Linux 8 environments (and even mixed ones).
  • The submit.py script is now the single source of SparkContext configuration (see the sketch after this list).
  • As a consequence, the Python scripts have been slimmed down, removing redundant configuration everywhere.
  • The S3A committer is now the "magic" committer, which improves performance when writing to S3.
  • Spark worker "local" and "S3 buffer" directories are now configurable in both spark_run.sh and config.yml.
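
For reference, a minimal sketch of what a centralized SparkSession setup in submit.py could look like. The option names are the standard Spark/S3A magic-committer settings, and the directory paths are placeholders standing in for whatever config.yml provides; none of this is necessarily the exact code this PR ships.

from pyspark import SparkConf
from pyspark.sql import SparkSession

def build_spark_session(app_name: str) -> SparkSession:
    """Single place where the SparkContext/SparkSession gets configured."""
    conf = (
        SparkConf()
        .setAppName(app_name)
        # Standard settings enabling the S3A "magic" committer
        .set("spark.hadoop.fs.s3a.committer.name", "magic")
        .set("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        .set("spark.sql.sources.commitProtocolClass",
             "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .set("spark.sql.parquet.output.committer.class",
             "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        # Placeholder paths for the worker "local" and "S3 buffer" directories
        .set("spark.local.dir", "/var/tmp/spark")
        .set("spark.hadoop.fs.s3a.buffer.dir", "/var/tmp/spark/s3a")
    )
    return SparkSession.builder.config(conf=conf).getOrCreate()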

Thanks a lot @mobidyc and @scality-fno for your hard work bringing Spark to a new level! It is now time to CONVERT THE TRY!

mobidyc and others added 30 commits on September 2, 2024
@scality-gdoumergue force-pushed the improve/python3_from_cg branch from 0a25208 to 5693cc4 on June 5, 2025
@scality-gdoumergue force-pushed the improve/python3_from_cg branch from 0860d55 to 41c5822 on June 12, 2025
@TrevorBenson (Member) left a comment


Nothing major, just questions and some minor nitpick thoughts.

The build.sh script:

  • Emits many warnings about unmet peer dependencies, and some steps get skipped non-deterministically.
  • Fails at step 25/31 of the Docker build:
    --> 0edb15cea50b
    [2/2] STEP 23/31: COPY conf/spark-defaults.conf ${SPARK_HOME}/conf
    --> d9039f8c56fe
    [2/2] STEP 24/31: COPY conf/spark-env.sh ${SPARK_HOME}/conf
    --> 54bcea75de6f
    [2/2] STEP 25/31: COPY aws-java-sdk-bundle-1.12.770.ja[r] /spark/jars/
    Error: building at STEP "COPY aws-java-sdk-bundle-1.12.770.ja[r] /spark/jars/": checking on sources under "/home/trevorbenson/Projects/spark": Rel: can't make  relative to /home/trevorbenson/Projects/spark; copier: stat: ["/aws-java-sdk-bundle-1.12.770.ja[r]"]: no such file or directory
    Upload /tmp/spark-image-3.5.2-12.tgz and /tmp/scality-spark-scripts-3.5.2-12.tgz to the supervisor.
    

Once it builds, I'll approve. If you don't observe the same failure during the build, let me know and I'll check whether it is somehow unique to my environment.
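
The COPY failure suggests the glob aws-java-sdk-bundle-1.12.770.ja[r] matched nothing because the jar was missing from the build context. If that is the root cause, fetching the bundle from Maven Central before the image build would sidestep it. A hypothetical pre-build step, sketched here in Python (build.sh presumably does something equivalent with curl or wget):

# Hypothetical pre-build fetch of the AWS SDK bundle into the build context.
# Version and artifact name are taken from the error message above.
import urllib.request

VERSION = "1.12.770"
JAR = f"aws-java-sdk-bundle-{VERSION}.jar"
URL = (
    "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/"
    f"{VERSION}/{JAR}"
)
urllib.request.urlretrieve(URL, JAR)
print(f"fetched {JAR}")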

Comment on lines +98 to +103
COPY --from=ghcr.io/astral-sh/uv:0.4.8 /uv /bin/uv

RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip compile /tmp/requirements.txt > /tmp/requirements-compiled.txt \
    && uv pip sync --system /tmp/requirements-compiled.txt \
    && uv pip install --system /tmp/scality-0.1-py3-none-any.whl

question:

Is the legacy uv version pinned as a workaround to get pip3.8 to install the requirements file without tracebacks?

With Python 3.8 having reached EOL in October 2024, is there any reason not to take this opportunity to bump to Python 3.9, or to Python 3.11, the maximum version Spark 3.5.2 supports?

result = calculate_sum_of_squares()

# Afficher le résultat
print(f"La somme des carrés est : {result}")

nitpick:

Use one consistent language across all the tools: all French or all English, etc.
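
For example, the quoted snippet rendered entirely in English (same logic; calculate_sum_of_squares is whatever the script defines above):

result = calculate_sum_of_squares()

# Display the result
print(f"The sum of squares is: {result}")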

Comment on lines +201 to +202
        # Add the server's short hostname for master
    echo "${master} $(hostname -s) # Added by spark_run.sh" >> /etc/hosts

nitpick: keep the indentation depth consistent.

Suggested change
        # Add the server's short hostname for master
    echo "${master} $(hostname -s) # Added by spark_run.sh" >> /etc/hosts
    # Add the server's short hostname for master
    echo "${master} $(hostname -s) # Added by spark_run.sh" >> /etc/hosts
