improvement: brand new way of installing and using Spark #54
Open: scality-gdoumergue wants to merge 107 commits into master from improve/python3_from_cg
Changes from all commits (107 commits)
- cea4e07 (mobidyc) Remove useless files/scripts
- 5ceacfc (mobidyc) build: Add script to build docker images for spark and s3utils contai…
- 0ff6ff9 (mobidyc) chore: Add Dockerfile for Spark and PySpark containers
- be2193d (mobidyc) feat: Add entrypoint script for Spark containers to run master, worke…
- 48362ae (mobidyc) Add py3 version of scality module
- a7323c8 (mobidyc) remove legacy spark version
- 2e459d5 (mobidyc) remove useless scripts
- 6f561dc (mobidyc) chore: Add .gitignore file to exclude *.jar, *.tar, and *.tgz files
- 32aaecf (mobidyc) chore: Update Dockerfile to optimize package installation and add py3…
- 1adcebd (mobidyc) feat: Add pyspark 3.5.2 to requirements.txt
- 7ca754f (mobidyc) chore: Update spark_run.sh script to support starting, stopping, and …
- 95bdb2b (mobidyc) chore: Add spark-defaults.conf file with default Spark configuration
- 023c502 (mobidyc) chore: Update listkey.py script to use Python 3.9+ compatible element…
- dd32a64 (mobidyc) Add new scality library
- 63d840e (mobidyc) chore: Refactor submit.py script
- 5d59c35 (mobidyc) chore: Add Spark script to calculate sum of squares
- a1fe7e6 (mobidyc) chore: Update Spark configuration and scripts
- 304fa9f (mobidyc) Add full example of conf file
- 6b5c853 (mobidyc) Refactor Dockerfile to create /spark/jars/ directory
- aad86e6 (mobidyc) Manage /etc/hosts changes
- 8ee1ed4 (scality-fno) add a few tshoot packages to Dockerfile, fix some comments in spark_r…
- 00962b3 (scality-fno) fix dependency: add aws sdk bundle
- ffb9983 (scality-fno) RING-46168 improve jq pipeline in export_s3_keys.sh for performance
- c0b008f (scality-fno) add EL8 CTR support to export_s3_keys.sh
- 615073b (scality-fno) change spark webui ports from 7077 and 8088 to 17077 and 18088
- a299c6d (scality-fno) make spark logdir configurable. And fix minor things.
- 8767668 (scality-fno) improve config.yml template and fix arcdata reference in P2 revlookup
- 80ce2a5 (scality-fno) make export_s3_keys.sh more robust
- cc22515 (scality-fno) fix p0 script with unhexlify
- ee65923 (scality-fno) improve listkey.py
- 883b3cd (scality-fno) update S3 FSCK scripts to use FullLoader for YAML and upgrade Hadoop …
- c1bc7fe (mobidyc) fix indentation in submit.py and ensure driver memory configuration i…
- 130cb7b (mobidyc) add type ignore comment to disable urllib3 warnings in listkey.py
- e00cdc3 (mobidyc) add type check for Node instance in listkeys function
- 8f22e5d (mobidyc) refactor listkeys function to clarify parameter usage and prevent rem…
- 8b65273 (mobidyc) refactor listkeys function to initialize count variable correctly and…
- ab3d273 (mobidyc) rename prepare_path function to initialize_csv_directory for clarity …
- 19405eb (mobidyc) update listkey.py to rename PATH to CSV_PATH for clarity and adjust d…
- b324521 (mobidyc) refactor SparkSession initialization in listkey.py for clarity and ma…
- bc7e9e9 (mobidyc) refactor listkeys function to improve error handling and enhance logg…
- 047bac5 (mobidyc) update check_key.py to use FullLoader for YAML loading to improve sec…
- 68c644e (mobidyc) refactor check_key.py to improve code formatting and enhance readability
- c62d12f (mobidyc) refactor check_key.py to rename RING variable to RING_NAME for clarit…
- ebb06a4 (mobidyc) refactor listkey.py to enhance error handling for configuration loadi…
- 5adf876 (mobidyc) refactor listkey.py to improve code readability and enhance logging f…
- 7e9b422 (mobidyc) update .gitignore to include additional file patterns for tar.gz, pyc…
- 91fd81b (mobidyc) refactor listkey.py to rename RING variable to RING_NAME for clarity …
- 80c0a99 (mobidyc) 🔧 refactor(check_key): rename PATH and PROT variables for clarity
- 5945c04 (mobidyc) 🔧 chore(requirements): add urllib3 to dependencies
- 2ae3f3f (mobidyc) ✨ chore(ruff): add configuration file for Ruff linter and formatter
- 0ef14b3 (mobidyc) ✨ chore(vscode): add configuration files for Ruff and Python settings
- d53ad0d (mobidyc) 🔧 refactor(check_key): improve readability by formatting DataFrame sh…
- bf4c668 (mobidyc) 🔧 refactor(dig): fix indentation and update print function syntax
- f162393 (mobidyc) 🔧 refactor(scripts): update YAML loading and improve Spark session co…
- 42d1466 (scality-gdoumergue) Spark now uses the magic committer, and submit.py allows to launch a …
- 67861f5 (scality-gdoumergue) Add a script to test the connectivity to the S3 cluster
- a0a99fd (scality-gdoumergue) Make use of the s3a magic committer to speed up writes to S3, and use…
- d6a0752 (scality-gdoumergue) Remove redundant SparkSession options, that are already set in submit.py
- a88c832 (scality-gdoumergue) Cleanup listkey.py: remove prints, remove redundant SkarkContext configs
- 88b5f3f (scality-gdoumergue) Remove the config.yml, so that it doesn't overwrite the one previousl…
- 02efaa5 (scality-gdoumergue) Improve spark_run.sh: Check Spark image, variabilize and create 2 wor…
- 398b205 (scality-gdoumergue) no need for old way scripts/offline-archive-setup.sh
- d1242d7 (scality-gdoumergue) Dockerfile: need curl-dev pkgs for pycurl
- ebb1c81 (scality-gdoumergue) Add comment in config template
- 4671cc6 (scality-gdoumergue) spark_run.sh checks the images version
- 02f6826 (scality-gdoumergue) fixup
- b84ab16 (scality-gdoumergue) Add nodejs into the image and a script that lists the name of all the…
- 09bf15f (scality-gdoumergue) Hide more things from git
- a79e5fb (scality-gdoumergue) Add shyaml to parse config from shell scripts
- 8b40664 (scality-gdoumergue) Ignore config file
- c4bdfc8 (scality-gdoumergue) Able to script a build.
- 3634d1c (scality-gdoumergue) add comments to config template
- e2dd01c (scality-gdoumergue) harmless bug in spark_run.sh
- dbbf036 (scality-gdoumergue) more comments in config templates
- 4fd4ade (scality-gdoumergue) extract_metadata_keys_to_s3.sh: the extraction of the sproxyd keys fr…
- b13c178 (scality-gdoumergue) Spark image entrypoint now allows to run a shell or a script
- 623c098 (scality-gdoumergue) Working Spark image
- 7852607 (scality-gdoumergue) No more need for common.sh
- 04817d6 (scality-gdoumergue) S3 MD extraction now works with RAFT_SESSIONS variable
- ce41022 (scality-gdoumergue) Generate and upload S3 MD journal backups
- 7709900 (scality-gdoumergue) extracted keys objects follow the old naming convention
- 43f3655 (scality-gdoumergue) Some more comments
- 8e61695 (scality-gdoumergue) Need gawk - not awk - for better csv processing
- c2f59dd (scality-gdoumergue) count_ring_keys.sh is now fully automated (and safer)
- 9b33092 (scality-gdoumergue) The Four Horsemen: 4 scripts to check that p0 and P1 were successful.
- 2f26f9f (scality-gdoumergue) local dir (or work dir, or tmp dir) now works
- 593338d (scality-gdoumergue) allow private ssh key distribution for S3_FSCK/s3_fsck_p2_reverselook…
- c058b4b (scality-gdoumergue) Comment s3_fsck_p* scripts - source: https://github.com/scality/spark…
- 968da10 (scality-gdoumergue) count scripts are more precise
- effa292 (scality-gdoumergue) check un-committed changes before creating tarballs
- b4fd2c2 (scality-gdoumergue) bugfix: spark_run.sh was badly handling mounts
- 010cda9 (scality-gdoumergue) Spark can now automatically import TLS certs
- 5693cc4 (scality-gdoumergue) "spark_run.sh exec" now works, add docker commands for "spark_run.sh …
- 2bb1970 (scality-gdoumergue) config template must comply with spark_run.sh and doc
- 4c91fe0 (scality-gdoumergue) Run Spark on RHEL/CentOS 7
- b18cdb8 (scality-gdoumergue) submit.py now takes the ring's name from the config before defaulting…
- af8f7e4 (scality-gdoumergue) make scripts/S3_FSCK/s3mdjournalbackuphashes.sh RHEL/CentOS 7 compatible
- bc782a2 (scality-gdoumergue) Bugfix: scripts/S3_FSCK/s3_fsck_p3.py didn't correctly handle keys wi…
- ab7f6a9 (scality-gdoumergue) The driver command now shows a dedicated prompt
- e383775 (scality-gdoumergue) s3a path-style access restored
- fccc182 (scality-gdoumergue) S3 access bugfix: the workers must have access to the apps dir to fin…
- 6807e52 (scality-gdoumergue) harmless typo in scripts/S3_FSCK/extract_metadata_keys_to_s3.sh
- fbcacbc (scality-gdoumergue) Spark 3.5.2-12: improvement for decoupled architectures
- 41c5822 (scality-gdoumergue) Permit the addition of extra ip:hostname pairs within containers
- 08b3abf (scality-gdoumergue) scripts/S3_FSCK/s3mdjournalbackuphashes.sh can now run on stateless-o…
- 5ecca82 (scality-gdoumergue) patch verifyBucketSproxydKeys.js during the build, to workaround RD-404
- 7b2bf2f (scality-gdoumergue) spark_run.sh must add the short hostname to /etc/hosts for master to …
.gitignore (new file)
@@ -0,0 +1,12 @@
*.jar
*.tar
*.tar.gz
*.tgz
*.pyc
scripts/scality
scripts/py4j
*.dist-info
*.log
nodejs
s3utils
scripts/config/config.yml
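The ignore rules above can be sanity-checked with `git check-ignore`. The sketch below exercises a subset of the new patterns in a throwaway repository; the file names are illustrative, not part of the PR.

```shell
#!/usr/bin/env bash
# Verify .gitignore patterns in a scratch repo (paths need not exist on disk;
# check-ignore matches pathnames against the patterns).
set -euo pipefail

tmp=$(mktemp -d)
cd "$tmp"
git init -q .
printf '%s\n' '*.jar' '*.tar.gz' '*.pyc' 'scripts/config/config.yml' > .gitignore

# -v shows which pattern matched; exit status 0 means "ignored"
git check-ignore -v spark-hadoop-cloud_2.13-3.5.2.jar
git check-ignore -v scripts/config/config.yml

# A source file should NOT be ignored (check-ignore exits 1)
git check-ignore -q scripts/listkey.py || echo "scripts/listkey.py is not ignored"
```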
.vscode/extensions.json (new file)
@@ -0,0 +1,5 @@
{
  "recommendations": [
    "charliermarsh.ruff"
  ]
}
.vscode/settings.json (new file)
@@ -0,0 +1,61 @@
{
  "python.analysis.typeCheckingMode": "standard",
  "editor.quickSuggestions": {
    "strings": true
  },
  "[python]": {
    "editor.defaultFormatter": "charliermarsh.ruff",
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.organizeImports": "explicit"
    }
  },
  "git.openRepositoryInParentFolders": "never",
  "git.autofetch": true,
  "git.enableSmartCommit": true,
  "git.replaceTagsWhenPull": true,
  "github.copilot.editor.enableAutoCompletions": true,
  "python.createEnvironment.trigger": "off",
  "cSpell.enabled": false,
  "python.testing.pytestArgs": ["tests"],
  "python.testing.pytestEnabled": true,
  "python.testing.unittestEnabled": false,
  "git.inputValidation": true,
  "git.inputValidationLength": 72,
  "git.inputValidationSubjectLength": 72,
  "github.copilot.chat.commitMessageGeneration.instructions": [
    {
      "text": "Use the Conventional Commits format for all commit messages."
    },
    {
      "text": "The commit subject must follow this pattern: <type>(<scope>): <description>."
    },
    {
      "text": "Replace <type> with one of the following: feat, fix, chore, docs, style, refactor, perf, test, build, ci, revert."
    },
    {
      "text": "The <scope> should be the affected module, feature, or component (e.g., 'auth', 'api', 'ui')."
    },
    {
      "text": "The <description> should be a concise summary of the change, written in imperative mood."
    },
    {
      "text": "If a commit introduces breaking changes, append 'BREAKING CHANGE:' followed by a detailed explanation in the body."
    },
    {
      "text": "If referencing an issue, add 'Closes #123' or 'Fixes #456' in the commit body."
    },
    {
      "text": "Limit the subject line to 72 characters."
    },
    {
      "text": "Separate the subject from the body with a blank line."
    },
    {
      "text": "The commit body should explain what changed and why, wrapped at 72 characters per line."
    },
    {
      "text": "Include Gitmojis where relevant, placed before the <type> in the subject line. Example: '✨ feat(auth): add login via Google'."
    }
  ],
}
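As a purely illustrative example (not part of the diff), a commit message following the instructions above might read:

```
✨ feat(spark): enable the s3a magic committer

Switch S3 writes to the magic committer to avoid slow rename-based
commit steps. Body lines are wrapped at 72 characters and explain
what changed and why.

Closes #123
```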
Dockerfile (new file)
@@ -0,0 +1,133 @@
ARG NODE_IMAGE=16.20.2-bullseye-slim
ARG NODE_VERSION=16.20.2

##############################
# builder: nodejs dependencies
##############################

# The builder technique: best way
# to have a lighter image in the end.
FROM node:${NODE_IMAGE} as builder

ENV NVM_DIR=/root/.nvm

RUN --mount=type=cache,sharing=locked,target=/var/cache/apt apt update \
    && apt-get install -y --no-install-recommends \
        curl \
        git \
        build-essential \
        python3 \
        jq \
        ssh \
        ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

COPY nodejs ./nodejs

WORKDIR nodejs

# The node_version.txt file brings node's version to the next steps
# because I don't know why the NODE_VERSION variable is not passed
# to the runner part
RUN yarn install --production --network-concurrency 1 && \
    echo "${NODE_VERSION}" > node_version.txt

##########################################
#
# RUNNER
#
##########################################

FROM python:3.8-slim-bullseye

RUN --mount=type=cache,sharing=locked,target=/var/cache/apt apt update \
    && apt-get install -y --no-install-recommends \
        ca-certificates \
        sudo \
        curl \
        libcurl4-openssl-dev libssl-dev \
        awscli \
        inetutils-ping \
        netcat-traditional \
        wget \
        vim \
        unzip \
        rsync \
        openjdk-11-jdk \
        build-essential \
        software-properties-common \
        ssh \
        jq \
        gawk \
        net-tools \
        less \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

ENV NVM_DIR=/opt/nvm
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV SPARK_MASTER_HOST="spark-master"
ENV SPARK_MASTER_PORT="17077"
ENV SPARK_MASTER="spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
ENV PYSPARK_PYTHON=python3
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

RUN mkdir -p ${HADOOP_HOME} ${SPARK_HOME}/scality-tools /spark/jars/
WORKDIR ${SPARK_HOME}

# Install what's been yarned by the builder part
COPY --from=builder nodejs/ ./scality-tools/

## Install nodejs without yarn
RUN NVM_NODE_VERSION=$(cat ./scality-tools/node_version.txt) && \
    mkdir -p "${NVM_DIR}" && \
    curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | /bin/bash && \
    . "${NVM_DIR}/nvm.sh" && nvm install ${NVM_NODE_VERSION} && \
    nvm use v${NVM_NODE_VERSION} && \
    nvm alias default v${NVM_NODE_VERSION}

ENV PATH="${NVM_DIR}/versions/node/v${NVM_NODE_VERSION}/bin/:${PATH}"

# Time to work on Spark & Python stuff

COPY requirements.txt /tmp/requirements.txt
COPY scality-0.1-py3-none-any.whl /tmp/
COPY --from=ghcr.io/astral-sh/uv:0.4.8 /uv /bin/uv

RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip compile /tmp/requirements.txt > /tmp/requirements-compiled.txt \
    && uv pip sync --system /tmp/requirements-compiled.txt \
    && uv pip install --system /tmp/scality-0.1-py3-none-any.whl

# globbing to not fail if not found
COPY spark-3.5.2-bin-hadoop3.tg[z] /tmp/
# -N enable timestamping to condition download if already present or not
RUN cd /tmp \
    && wget -N https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz \
    && tar xvzf spark-3.5.2-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
    && rm -f spark-3.5.2-bin-hadoop3.tgz

COPY conf/spark-defaults.conf ${SPARK_HOME}/conf
COPY conf/spark-env.sh ${SPARK_HOME}/conf

# https://github.com/sayedabdallah/Read-Write-AWS-S3
# https://spot.io/blog/improve-apache-spark-performance-with-the-s3-magic-committer/
COPY aws-java-sdk-bundle-1.12.770.ja[r] /spark/jars/
COPY hadoop-aws-3.3.4.ja[r] /spark/jars/
COPY spark-hadoop-cloud_2.13-3.5.2.ja[r] /spark/jars/
RUN cd /spark/jars/ \
    && wget -N https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.770/aws-java-sdk-bundle-1.12.770.jar \
    && wget -N https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
    && wget -N https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.13/3.5.2/spark-hadoop-cloud_2.13-3.5.2.jar

# Misc
RUN chmod u+x /opt/spark/sbin/* /opt/spark/bin/* && \
    aws configure set default.s3.multipart_threshold 64MB && \
    aws configure set default.s3.multipart_chunksize 32MB

COPY entrypoint.sh .
ENTRYPOINT ["/opt/spark/entrypoint.sh"]
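The jar steps above pair a `COPY file.ja[r]` glob (which, per the Dockerfile's own comment, avoids failing the build when the file is absent) with `wget -N`, so pre-staged artifacts are reused instead of re-downloaded. The same download-if-absent idea can be sketched as a standalone helper; `fetch_if_missing` is a hypothetical function, not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: fetch an artifact only when it is not already present locally,
# mirroring the COPY-glob + wget pattern in the Dockerfile above.
# fetch_if_missing is an illustrative helper, not part of the PR.
fetch_if_missing() {
    local url=$1 dest_dir=$2
    local name=${url##*/}          # basename of the URL
    if [ -f "${dest_dir}/${name}" ]; then
        echo "${name} already present, skipping download"
    else
        wget -q -P "${dest_dir}" "${url}"
    fi
}

# Usage (same jar the Dockerfile pins):
# fetch_if_missing https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /spark/jars
```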
Two files were deleted (names not shown).

ansible/roles/create-sample-config/templates/config-template.yml.j2: 2 changes (1 addition & 1 deletion; diff not shown)

Binary file not shown.
question:
Is pinning this legacy uv version a workaround to get pip under Python 3.8 to install the requirements file without tracebacks?
With Python 3.8 reaching EOL in October 2024, is there any reason not to take this opportunity to bump to Python 3.9, or even Python 3.11, the highest version supported by Spark 3.5.2?