
improvement: brand new way of installing and using Spark #54

Open · wants to merge 107 commits into base: master

Commits (107)
cea4e07
Remove useless files/scripts
mobidyc Sep 2, 2024
5ceacfc
build: Add script to build docker images for spark and s3utils contai…
mobidyc Sep 2, 2024
0ff6ff9
chore: Add Dockerfile for Spark and PySpark containers
mobidyc Sep 2, 2024
be2193d
feat: Add entrypoint script for Spark containers to run master, worke…
mobidyc Sep 2, 2024
48362ae
Add py3 version of scality module
mobidyc Sep 2, 2024
a7323c8
remove legacy spark version
mobidyc Sep 10, 2024
2e459d5
remove useless scripts
mobidyc Sep 10, 2024
6f561dc
chore: Add .gitignore file to exclude *.jar, *.tar, and *.tgz files
mobidyc Sep 10, 2024
32aaecf
chore: Update Dockerfile to optimize package installation and add py3…
mobidyc Sep 10, 2024
1adcebd
feat: Add pyspark 3.5.2 to requirements.txt
mobidyc Sep 10, 2024
7ca754f
chore: Update spark_run.sh script to support starting, stopping, and …
mobidyc Sep 10, 2024
95bdb2b
chore: Add spark-defaults.conf file with default Spark configuration
mobidyc Sep 10, 2024
023c502
chore: Update listkey.py script to use Python 3.9+ compatible element…
mobidyc Sep 19, 2024
dd32a64
Add new scality library
mobidyc Sep 19, 2024
63d840e
chore: Refactor submit.py script
mobidyc Sep 19, 2024
5d59c35
chore: Add Spark script to calculate sum of squares
mobidyc Sep 19, 2024
a1fe7e6
chore: Update Spark configuration and scripts
mobidyc Sep 19, 2024
304fa9f
Add full example of conf file
mobidyc Sep 19, 2024
6b5c853
Refactor Dockerfile to create /spark/jars/ directory
mobidyc Oct 9, 2024
aad86e6
Manage /etc/hosts changes
mobidyc Oct 11, 2024
8ee1ed4
add a few tshoot packages to Dockerfile, fix some comments in spark_r…
scality-fno Oct 11, 2024
00962b3
fix dependency: add aws sdk bundle
scality-fno Oct 15, 2024
ffb9983
RING-46168 improve jq pipeline in export_s3_keys.sh for performance
scality-fno Mar 31, 2024
c0b008f
add EL8 CTR support to export_s3_keys.sh
scality-fno Oct 15, 2024
615073b
change spark webui ports from 7077 and 8088 to 17077 and 18088
scality-fno Nov 26, 2024
a299c6d
make spark logdir configurable. And fix minor things.
scality-fno Mar 10, 2025
8767668
improve config.yml template and fix arcdata reference in P2 revlookup
scality-fno Mar 11, 2025
80ce2a5
make export_s3_keys.sh more robust
scality-fno Mar 31, 2025
cc22515
fix p0 script with unhexlify
scality-fno Apr 2, 2025
ee65923
improve listkey.py
scality-fno Apr 2, 2025
883b3cd
update S3 FSCK scripts to use FullLoader for YAML and upgrade Hadoop …
scality-fno Apr 2, 2025
c1bc7fe
fix indentation in submit.py and ensure driver memory configuration i…
mobidyc Apr 2, 2025
130cb7b
add type ignore comment to disable urllib3 warnings in listkey.py
mobidyc Apr 2, 2025
e00cdc3
add type check for Node instance in listkeys function
mobidyc Apr 2, 2025
8f22e5d
refactor listkeys function to clarify parameter usage and prevent rem…
mobidyc Apr 2, 2025
8b65273
refactor listkeys function to initialize count variable correctly and…
mobidyc Apr 2, 2025
ab3d273
rename prepare_path function to initialize_csv_directory for clarity …
mobidyc Apr 2, 2025
19405eb
update listkey.py to rename PATH to CSV_PATH for clarity and adjust d…
mobidyc Apr 2, 2025
b324521
refactor SparkSession initialization in listkey.py for clarity and ma…
mobidyc Apr 2, 2025
bc7e9e9
refactor listkeys function to improve error handling and enhance logg…
mobidyc Apr 2, 2025
047bac5
update check_key.py to use FullLoader for YAML loading to improve sec…
mobidyc Apr 2, 2025
68c644e
refactor check_key.py to improve code formatting and enhance readability
mobidyc Apr 2, 2025
c62d12f
refactor check_key.py to rename RING variable to RING_NAME for clarit…
mobidyc Apr 2, 2025
ebb06a4
refactor listkey.py to enhance error handling for configuration loadi…
mobidyc Apr 2, 2025
5adf876
refactor listkey.py to improve code readability and enhance logging f…
mobidyc Apr 2, 2025
7e9b422
update .gitignore to include additional file patterns for tar.gz, pyc…
mobidyc Apr 2, 2025
91fd81b
refactor listkey.py to rename RING variable to RING_NAME for clarity …
mobidyc Apr 2, 2025
80c0a99
🔧 refactor(check_key): rename PATH and PROT variables for clarity
mobidyc Apr 2, 2025
5945c04
🔧 chore(requirements): add urllib3 to dependencies
mobidyc Apr 2, 2025
2ae3f3f
✨ chore(ruff): add configuration file for Ruff linter and formatter
mobidyc Apr 2, 2025
0ef14b3
✨ chore(vscode): add configuration files for Ruff and Python settings
mobidyc Apr 2, 2025
d53ad0d
🔧 refactor(check_key): improve readability by formatting DataFrame sh…
mobidyc Apr 2, 2025
bf4c668
🔧 refactor(dig): fix indentation and update print function syntax
mobidyc Apr 2, 2025
f162393
🔧 refactor(scripts): update YAML loading and improve Spark session co…
mobidyc Apr 2, 2025
42d1466
Spark now uses the magic committer, and submit.py allows to launch a …
scality-gdoumergue May 20, 2025
67861f5
Add a script to test the connectivity to the S3 cluster
scality-gdoumergue May 21, 2025
a0a99fd
Make use of the s3a magic committer to speed up writes to S3, and use…
scality-gdoumergue May 22, 2025
d6a0752
Remove redundant SparkSession options, that are already set in submit.py
scality-gdoumergue May 22, 2025
a88c832
Cleanup listkey.py: remove prints, remove redundant SkarkContext configs
scality-gdoumergue May 23, 2025
88b5f3f
Remove the config.yml, so that it doesn't overwrite the one previousl…
scality-gdoumergue May 23, 2025
02efaa5
Improve spark_run.sh: Check Spark image, variabilize and create 2 wor…
scality-gdoumergue May 26, 2025
398b205
no need for old way scripts/offline-archive-setup.sh
scality-gdoumergue May 26, 2025
d1242d7
Dockerfile: need curl-dev pkgs for pycurl
scality-gdoumergue May 26, 2025
ebb1c81
Add comment in config template
scality-gdoumergue May 26, 2025
4671cc6
spark_run.sh checks the images version
scality-gdoumergue May 27, 2025
02f6826
fixup
scality-gdoumergue May 27, 2025
b84ab16
Add nodejs into the image and a script that lists the name of all the…
scality-gdoumergue May 27, 2025
09bf15f
Hide more things from git
scality-gdoumergue May 27, 2025
a79e5fb
Add shyaml to parse config from shell scripts
scality-gdoumergue May 27, 2025
8b40664
Ignore config file
scality-gdoumergue May 28, 2025
c4bdfc8
Able to script a build.
scality-gdoumergue May 28, 2025
3634d1c
add comments to config template
scality-gdoumergue May 28, 2025
e2dd01c
harmless bug in spark_run.sh
scality-gdoumergue May 28, 2025
dbbf036
more comments in config templates
scality-gdoumergue May 28, 2025
4fd4ade
extract_metadata_keys_to_s3.sh: the extraction of the sproxyd keys fr…
scality-gdoumergue May 28, 2025
b13c178
Spark image entrypoint now allows to run a shell or a script
scality-gdoumergue May 28, 2025
623c098
Working Spark image
scality-gdoumergue May 28, 2025
7852607
No more need for common.sh
scality-gdoumergue May 28, 2025
04817d6
S3 MD extraction now works with RAFT_SESSIONS variable
scality-gdoumergue May 28, 2025
ce41022
Generate and upload S3 MD journal backups
scality-gdoumergue May 29, 2025
7709900
extracted keys objects follow the old naming convention
scality-gdoumergue May 30, 2025
43f3655
Some more comments
scality-gdoumergue May 30, 2025
8e61695
Need gawk - not awk - for better csv processing
scality-gdoumergue May 30, 2025
c2f59dd
count_ring_keys.sh is now fully automated (and safer)
scality-gdoumergue May 30, 2025
9b33092
The Four Horsemen: 4 scripts to check that p0 and P1 were successful.
scality-gdoumergue May 30, 2025
2f26f9f
local dir (or work dir, or tmp dir) now works
scality-gdoumergue Jun 2, 2025
593338d
allow private ssh key distribution for S3_FSCK/s3_fsck_p2_reverselook…
scality-gdoumergue Jun 2, 2025
c058b4b
Comment s3_fsck_p* scripts - source: https://github.com/scality/spark…
scality-gdoumergue Jun 3, 2025
968da10
count scripts are more precise
scality-gdoumergue Jun 4, 2025
effa292
check un-committed changes before creating tarballs
scality-gdoumergue Jun 4, 2025
b4fd2c2
bugfix: spark_run.sh was badly handling mounts
scality-gdoumergue Jun 4, 2025
010cda9
Spark can now automatically import TLS certs
scality-gdoumergue Jun 5, 2025
5693cc4
"spark_run.sh exec" now works, add docker commands for "spark_run.sh …
scality-gdoumergue Jun 5, 2025
2bb1970
config template must comply with spark_run.sh and doc
scality-gdoumergue Jun 6, 2025
4c91fe0
Run Spark on RHEL/CentOS 7
scality-gdoumergue Jun 6, 2025
b18cdb8
submit.py now takes the ring's name from the config before defaulting…
scality-gdoumergue Jun 10, 2025
af8f7e4
make scripts/S3_FSCK/s3mdjournalbackuphashes.sh RHEL/CentOS 7 compatible
scality-gdoumergue Jun 10, 2025
bc782a2
Bugfix: scripts/S3_FSCK/s3_fsck_p3.py didn't correctly handle keys wi…
scality-gdoumergue Jun 10, 2025
ab7f6a9
The driver command now shows a dedicated prompt
scality-gdoumergue Jun 10, 2025
e383775
s3a path-style access restored
scality-gdoumergue Jun 11, 2025
fccc182
S3 access bugfix: the workers must have access to the apps dir to fin…
scality-gdoumergue Jun 11, 2025
6807e52
harmless typo in scripts/S3_FSCK/extract_metadata_keys_to_s3.sh
scality-gdoumergue Jun 12, 2025
fbcacbc
Spark 3.5.2-12: improvement for decoupled architectures
scality-gdoumergue Jun 12, 2025
41c5822
Permit the addition of extra ip:hostname pairs within containers
scality-gdoumergue Jun 12, 2025
08b3abf
scripts/S3_FSCK/s3mdjournalbackuphashes.sh can now run on stateless-o…
scality-gdoumergue Jun 23, 2025
5ecca82
patch verifyBucketSproxydKeys.js during the build, to workaround RD-404
scality-gdoumergue Jun 26, 2025
7b2bf2f
spark_run.sh must add the short hostname to /etc/hosts for master to …
scality-gdoumergue Aug 4, 2025
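Several of the commits above (42d1466, a0a99fd, e383775) switch writes to S3 over to the s3a magic committer with path-style access. As a minimal sketch, the kind of options involved can be collected like this — the property names are standard Hadoop/Spark settings, but the idea that submit.py assembles them as a dict, and the endpoint value, are assumptions for illustration:

```python
def magic_committer_conf(endpoint: str) -> dict:
    """Sketch of the s3a magic-committer options the commits refer to.

    The keys are standard Hadoop/Spark property names; how submit.py
    actually passes them is not shown in this PR excerpt.
    """
    return {
        # Route s3a:// traffic to the RING S3 endpoint, path-style.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Select the magic committer instead of rename-based commits.
        "spark.hadoop.fs.s3a.committer.name": "magic",
        "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
        # Bind Spark's commit protocol to the Hadoop cloud committers
        # (shipped in spark-hadoop-cloud_2.13-3.5.2.jar, see Dockerfile).
        "spark.sql.sources.commitProtocolClass":
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
        "spark.sql.parquet.output.committer.class":
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    }

conf = magic_committer_conf("https://s3.example.local")
```

The magic committer avoids the copy-then-delete rename pattern that is slow and non-atomic on object stores, which is why the commits pair it with the spark-hadoop-cloud jar.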
12 changes: 12 additions & 0 deletions .gitignore
@@ -0,0 +1,12 @@
*.jar
*.tar
*.tar.gz
*.tgz
*.pyc
scripts/scality
scripts/py4j
*.dist-info
*.log
nodejs
s3utils
scripts/config/config.yml
5 changes: 5 additions & 0 deletions .vscode/extensions.json
@@ -0,0 +1,5 @@
{
"recommendations": [
"charliermarsh.ruff"
]
}
61 changes: 61 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,61 @@
{
"python.analysis.typeCheckingMode": "standard",
"editor.quickSuggestions": {
"strings": true
},
"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.organizeImports": "explicit"
}
},
"git.openRepositoryInParentFolders": "never",
"git.autofetch": true,
"git.enableSmartCommit": true,
"git.replaceTagsWhenPull": true,
"github.copilot.editor.enableAutoCompletions": true,
"python.createEnvironment.trigger": "off",
"cSpell.enabled": false,
"python.testing.pytestArgs": ["tests"],
"python.testing.pytestEnabled": true,
"python.testing.unittestEnabled": false,
"git.inputValidation": true,
"git.inputValidationLength": 72,
"git.inputValidationSubjectLength": 72,
"github.copilot.chat.commitMessageGeneration.instructions": [
{
"text": "Use the Conventional Commits format for all commit messages."
},
{
"text": "The commit subject must follow this pattern: <type>(<scope>): <description>."
},
{
"text": "Replace <type> with one of the following: feat, fix, chore, docs, style, refactor, perf, test, build, ci, revert."
},
{
"text": "The <scope> should be the affected module, feature, or component (e.g., 'auth', 'api', 'ui')."
},
{
"text": "The <description> should be a concise summary of the change, written in imperative mood."
},
{
"text": "If a commit introduces breaking changes, append 'BREAKING CHANGE:' followed by a detailed explanation in the body."
},
{
"text": "If referencing an issue, add 'Closes #123' or 'Fixes #456' in the commit body."
},
{
"text": "Limit the subject line to 72 characters."
},
{
"text": "Separate the subject from the body with a blank line."
},
{
"text": "The commit body should explain what changed and why, wrapped at 72 characters per line."
},
{
"text": "Include Gitmojis where relevant, placed before the <type> in the subject line. Example: '✨ feat(auth): add login via Google'."
}
],
}
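The Copilot instructions above enforce a Conventional Commits subject — optional gitmoji, `<type>(<scope>): <description>`, at most 72 characters. A small sketch of the implied validation (the helper and regex are illustrative, not part of the repo):

```python
import re

# Pattern implied by the settings above: optional gitmoji prefix,
# then <type>(<scope>): <description>.
SUBJECT_RE = re.compile(
    r"^(?:\S+ )?"  # optional gitmoji, e.g. "✨ "
    r"(feat|fix|chore|docs|style|refactor|perf|test|build|ci|revert)"
    r"\([\w.-]+\): \S.*$"
)

def valid_subject(subject: str) -> bool:
    """Check a commit subject against the convention configured above."""
    return len(subject) <= 72 and SUBJECT_RE.match(subject) is not None
```

For example, `valid_subject("✨ feat(auth): add login via Google")` accepts the sample subject from the instructions, while a bare `"update stuff"` is rejected.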
133 changes: 133 additions & 0 deletions Dockerfile
@@ -0,0 +1,133 @@
ARG NODE_IMAGE=16.20.2-bullseye-slim
ARG NODE_VERSION=16.20.2

##############################
# builder: nodejs dependencies
##############################

# Multi-stage build: the builder stage keeps node/yarn build
# tooling out of the final image, so the runner stays lighter.
FROM node:${NODE_IMAGE} as builder

ENV NVM_DIR=/root/.nvm

RUN --mount=type=cache,sharing=locked,target=/var/cache/apt apt update \
&& apt-get install -y --no-install-recommends \
curl \
git \
build-essential \
python3 \
jq \
ssh \
ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

COPY nodejs ./nodejs

WORKDIR nodejs

# The node_version.txt file carries node's version to the runner stage:
# ARG values are scoped per build stage and are not inherited unless
# re-declared, so NODE_VERSION is not visible after the next FROM.
RUN yarn install --production --network-concurrency 1 && \
echo "${NODE_VERSION}" > node_version.txt

##########################################
#
# RUNNER
#
##########################################

FROM python:3.8-slim-bullseye

RUN --mount=type=cache,sharing=locked,target=/var/cache/apt apt update \
&& apt-get install -y --no-install-recommends \
ca-certificates \
sudo \
curl \
libcurl4-openssl-dev libssl-dev \
awscli \
inetutils-ping \
netcat-traditional \
wget \
vim \
unzip \
rsync \
openjdk-11-jdk \
build-essential \
software-properties-common \
ssh \
jq \
gawk \
net-tools \
less \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

ENV NVM_DIR=/opt/nvm
ENV PATH="/opt/spark/sbin:/opt/spark/bin:${PATH}"
ENV HADOOP_HOME=${HADOOP_HOME:-"/opt/hadoop"}
ENV SPARK_HOME=${SPARK_HOME:-"/opt/spark"}
ENV SPARK_MASTER_HOST="spark-master"
ENV SPARK_MASTER_PORT="17077"
ENV SPARK_MASTER="spark://${SPARK_MASTER_HOST}:${SPARK_MASTER_PORT}"
ENV PYSPARK_PYTHON=python3
ENV PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

RUN mkdir -p ${HADOOP_HOME} ${SPARK_HOME}/scality-tools /spark/jars/
WORKDIR ${SPARK_HOME}

# Install what's been yarned by the builder part
COPY --from=builder nodejs/ ./scality-tools/

## Install nodejs without yarn
RUN NVM_NODE_VERSION=$(cat ./scality-tools/node_version.txt) && \
mkdir -p "${NVM_DIR}" && \
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | /bin/bash && \
. "${NVM_DIR}/nvm.sh" && nvm install ${NVM_NODE_VERSION} && \
nvm use v${NVM_NODE_VERSION} && \
nvm alias default v${NVM_NODE_VERSION}

ENV PATH="${NVM_DIR}/versions/node/v${NVM_NODE_VERSION}/bin/:${PATH}"

# Time to work on Spark & Python stuff

COPY requirements.txt /tmp/requirements.txt
COPY scality-0.1-py3-none-any.whl /tmp/
COPY --from=ghcr.io/astral-sh/uv:0.4.8 /uv /bin/uv

RUN --mount=type=cache,target=/root/.cache/uv \
uv pip compile /tmp/requirements.txt > /tmp/requirements-compiled.txt \
&& uv pip sync --system /tmp/requirements-compiled.txt \
&& uv pip install --system /tmp/scality-0.1-py3-none-any.whl
Review comment on lines +98 to +103 (Member):

question:

Is the legacy uv version a workaround to get pip 3.8 to install the requirements file without tracebacks?

With Python 3.8 EOL in October 2024, is there any reason not to take this opportunity to bump to Python 3.9, or to Python 3.11, the maximum version Spark 3.5.2 supports?



# glob pattern so the COPY does not fail when the file is absent locally
COPY spark-3.5.2-bin-hadoop3.tg[z] /tmp/
# -N (timestamping) skips the download if an up-to-date copy is already present
RUN cd /tmp \
&& wget -N https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz \
&& tar xvzf spark-3.5.2-bin-hadoop3.tgz --directory /opt/spark --strip-components 1 \
&& rm -f spark-3.5.2-bin-hadoop3.tgz

COPY conf/spark-defaults.conf ${SPARK_HOME}/conf
COPY conf/spark-env.sh ${SPARK_HOME}/conf

# https://github.com/sayedabdallah/Read-Write-AWS-S3
# https://spot.io/blog/improve-apache-spark-performance-with-the-s3-magic-committer/
COPY aws-java-sdk-bundle-1.12.770.ja[r] /spark/jars/
COPY hadoop-aws-3.3.4.ja[r] /spark/jars/
COPY spark-hadoop-cloud_2.13-3.5.2.ja[r] /spark/jars/
RUN cd /spark/jars/ \
&& wget -N https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.770/aws-java-sdk-bundle-1.12.770.jar \
&& wget -N https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
&& wget -N https://repo1.maven.org/maven2/org/apache/spark/spark-hadoop-cloud_2.13/3.5.2/spark-hadoop-cloud_2.13-3.5.2.jar

# Misc
RUN chmod u+x /opt/spark/sbin/* /opt/spark/bin/* && \
aws configure set default.s3.multipart_threshold 64MB && \
aws configure set default.s3.multipart_chunksize 32MB

COPY entrypoint.sh .
ENTRYPOINT ["/opt/spark/entrypoint.sh"]
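The ENV lines above compose `SPARK_MASTER` from `SPARK_MASTER_HOST` and `SPARK_MASTER_PORT` (moved to 17077 in commit 615073b). As a sketch, the same composition — useful for any helper script that must agree with the image defaults — looks like this; the function itself is illustrative and not part of the image:

```python
import os

def spark_master_url(env=None) -> str:
    """Recompose the master URL the way the Dockerfile's ENV lines do.

    Defaults mirror SPARK_MASTER_HOST / SPARK_MASTER_PORT above.
    """
    if env is None:
        env = os.environ
    host = env.get("SPARK_MASTER_HOST", "spark-master")
    port = env.get("SPARK_MASTER_PORT", "17077")
    return f"spark://{host}:{port}"
```

With no overrides this yields `spark://spark-master:17077`, matching the `SPARK_MASTER` value baked into the image.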
17 changes: 0 additions & 17 deletions Dockerfile-master

This file was deleted.

34 changes: 0 additions & 34 deletions Dockerfile-worker

This file was deleted.

@@ -1,4 +1,4 @@
- master: "spark://{{ hostvars[groups['sparkmaster'][0]]['ansible_host'] }}:7077"
+ master: "spark://{{ hostvars[groups['sparkmaster'][0]]['ansible_host'] }}:17077"
ring: "DATA"
path: "{{ bucket_name }}"
protocol: s3a # Protocol can be either file or s3a.
(diff truncated)
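The template above pairs a `path` with a `protocol` that, per its comment, is either `file` or `s3a`. A minimal sketch of how a script might turn those two keys into a working URI — the helper name and validation are hypothetical, not taken from the repo:

```python
def data_path(protocol: str, path: str) -> str:
    """Build a URI from the template's 'protocol' and 'path' keys.

    Per the template comment, protocol is either 'file' or 's3a'.
    """
    if protocol not in ("file", "s3a"):
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"{protocol}://{path}"
```

So `data_path("s3a", "mybucket")` gives `s3a://mybucket`, and `data_path("file", "/tmp/out")` gives `file:///tmp/out`.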
Binary file removed: aws-java-sdk-1.7.4.jar