[WIP] Support signed URLs for non-AWS object storage providers #688
base: master
Conversation
Noticed these: should they be updated to use the new storage abstraction or left as-is?
Also, I tested locally using MinIO. I don't have access to AWS S3 + CloudFront, but I plan to test it in my work environment (which uses VK Object Storage).
You are right, those two functions are not used anymore. I will remove them, so there is no need to update anything related to them.
For documents/nodes, the Papermerge REST API returns a URL pointing to where the client should get the file from. In simple cases (i.e. without S3), that URL points back to the Papermerge REST API server itself, which serves the file.
You can see this setup in action here: https://demo.papermerge.com (username/password: demo/demo). Why am I telling you this? In the scenario where I use S3, Papermerge does not serve files: it just gives the client the correct URL, and for that the REST API server does not need any S3 credentials. @bl1nkker, in your setup, who is serving files? Is it Papermerge? Or the S3 server?
@@ -37,6 +37,13 @@ class Settings(BaseSettings):
     papermerge__ocr__automatic: bool = False
     papermerge__search__url: str | None = None

+    papermerge__s3__provider: str = "aws"
+    aws_access_key_id: str | None = None
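The hunk above shows only the first two added fields. Judging from the settings listed in the PR description, the full added block presumably looks roughly like this (field names come from the description; the defaults and the pydantic import are assumptions):

```python
# Sketch only: field names from the PR description, defaults are assumptions.
# Assumes pydantic-settings (pydantic v2); Papermerge may use pydantic v1's BaseSettings instead.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    papermerge__s3__provider: str = "aws"
    aws_access_key_id: str | None = None
    aws_secret_access_key: str | None = None
    aws_region_name: str | None = None
    aws_endpoint_url: str | None = None
    papermerge__s3__bucket_name: str | None = None
```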
As I mentioned in the comments, so far, when using S3 storage, Papermerge does not serve files. Thus, there is no need for aws_access_key_id etc. Just keep this in mind, because I assume that in your S3 setup you want Papermerge to serve files as well? Which means you would need to add code for downloading from S3?
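For reference, a minimal sketch of what such download code could look like with boto3 (bucket name, object key, and target path below are placeholders, not part of this PR):

```python
import boto3

# Credentials come from the usual AWS env vars/config;
# for non-AWS providers an endpoint_url would also be needed.
s3 = boto3.client("s3")

# Stream an object from the bucket into a local file
with open("/tmp/document.pdf", "wb") as fh:
    s3.download_fileobj("my-bucket", "docs/document.pdf", fh)
```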
At the moment, in production, files are served by Papermerge and stored on local storage. However, I'm facing a new requirement to offload files to external object storage, since more than 2 million pages will be uploaded soon. That's why I'm working on integrating Papermerge with VK Cloud.
Regarding the variables: I don't have much experience with boto3, so I assumed all of them were required.
I'm sorry, I was wrong: some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials. However, I think that for non-AWS providers like MinIO or VK Cloud, specifying at least the endpoint URL is necessary. I'll update the PR once I finish testing.
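As a minimal sketch of what that means in boto3 terms (endpoint, credentials, bucket, and key below are placeholders), pointing the client at a non-AWS endpoint and signing a download URL looks roughly like:

```python
import boto3
from botocore.config import Config

# Placeholder values; for MinIO/VK Cloud the endpoint_url is the essential part.
client = boto3.client(
    "s3",
    endpoint_url="https://minio.example.com:9000",
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
    region_name="us-east-1",
    config=Config(signature_version="s3v4"),
)

# Pre-signed GET URL, valid for one hour
url = client.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "docs/page_1.jpg"},
    ExpiresIn=3600,
)
print(url)
```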
Tested Papermerge with another S3 provider. After a few small adjustments (I'll add them to the pull request a little later), everything works as expected.

Note: my setup does not use CloudFront or any CDN.
Note 2: please don't review the code just yet. While I was working on testing I noticed that there's already a

This is my compose file:

services:
  webapp:
    image: blinkker/papermerge:0.0.9-dev
    environment:
      PAPERMERGE__SECURITY__SECRET_KEY: 12345
      PAPERMERGE__AUTH__USERNAME: admin
      PAPERMERGE__AUTH__PASSWORD: admin
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__OCR__LANG_CODES: "deu,eng,kaz,rus"
      PAPERMERGE__OCR__DEFAULT_LANG_CODE: "deu"
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      PAPERMERGE__MAIN__FILE_SERVER: s3
      # options are: vk, minio, aws
      PAPERMERGE__S3__PROVIDER: vk
    volumes:
      - media_root:/var/media/pmg
    ports:
      - "12000:80"
  s3worker:
    image: blinkker/papermerge-s3-worker:0.0.2
    command: worker
    environment:
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      S3_WORKER_ARGS: "-Q s3 -c 2"
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
    volumes:
      - media_root:/var/media/pmg

volumes:
  media_root:
The environment variables I added are actually needed for signing S3 URLs directly. In my case, I want Papermerge to work only with the object storage, letting S3 serve all files directly. Also, I don't want Papermerge to store files locally at all, but according to the documentation it seems like local storage is still used even when S3 is configured. Is there a way to fully switch Papermerge to use only S3 for storing and serving documents, without saving anything locally?
I don't really understand what you mean by "only S3". Also, what exactly do you mean by "locally"? Locally for whom: for the REST API server or for the S3 worker? Also, the docker compose is just an example. On a real production server there can be any number of REST API servers (1, 2, 3, 4, ... N) and each of them has its own "local" storage, which they don't share. Same for S3 workers. So what exactly do you mean by "locally"? Maybe you could explain your production setup here in detail (e.g. do you plan to deploy in k8s? do you plan to have only one REST API server? etc.) so that I can help you further.

PS: "production" for me means a k8s cluster with N (N >= 3) REST API server instances, each with ephemeral storage (i.e. storage which can be replaced at any time) and with N s3-workers (each worker has access to the same storage as its peer REST API server, so that it can upload files to S3).
Right now my setup is very simple: I have a single Ubuntu 22.04 server with 8GB RAM and a 500GB disk. I'm running only one container with the Papermerge REST API (via Docker), and currently it stores files in a volume mounted on the host (no s3worker in my setup). I also don't plan to migrate to k8s. Currently my deployment is based on docker compose. Once I complete the integration with VK Cloud, I plan to move towards Docker Swarm, as I intend to provision a dedicated server for the database.
What I want to achieve is the following: I'd like to run an additional s3worker container and make Papermerge store all documents exclusively in S3 object storage (in my case VK Cloud), without consuming any local disk space on the server. So instead of saving documents locally first and then offloading them to S3 later, I want Papermerge to write directly to S3 and read from there as well, skipping local volumes altogether. And yes, this means that S3 itself provides access to the files. With my current setup, Papermerge retrieves the files directly from my storage (I'm running MinIO and the Papermerge API locally); the frontend accesses a document through a pre-signed URL generated by Papermerge.
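As a rough illustration only (placeholder host, bucket, key, and signature values, not an actual URL from this setup), such a SigV4 pre-signed URL has a shape like:

```
https://minio.example.com:9000/my-bucket/docs/page_1.jpg
  ?X-Amz-Algorithm=AWS4-HMAC-SHA256
  &X-Amz-Credential=EXAMPLE_KEY%2F20240101%2Fus-east-1%2Fs3%2Faws4_request
  &X-Amz-Date=20240101T000000Z
  &X-Amz-Expires=3600
  &X-Amz-SignedHeaders=host
  &X-Amz-Signature=<hex signature>
```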
Right. Currently uploaded files are saved on the disk first. But in general you can write another application which will take over
@ciur apologies for not working on this pull request for a while. I've had a busy period at work, and since my current solution worked well enough for our use case, I had to temporarily pause work on the PR (same for the s3-worker pull request). However, I'd be happy to continue contributing.
Also, I'd like to reiterate the approach I took for supporting alternative object storages: Papermerge generates a pre-signed URL for downloading the file from the object storage and sends it to the client (the files themselves are stored on the main Papermerge node and in object storage, as was originally the case).
@bl1nkker don't forget to rebase. I changed master recently (all sync went async). Also, you may want to have a look at this: https://docs.papermerge.io/3.5/developer-manual/architecture/
This PR adds support for generating signed URLs with AWS S3-compatible storage providers.

Added settings:
- papermerge__s3__provider
- aws_access_key_id, aws_secret_access_key, aws_region_name, aws_endpoint_url
- papermerge__s3__bucket_name

Added classes:
- AWSS3Storage – generates CloudFront signed URLs
- GenericS3Storage – uses boto3 for S3-compatible storage

Added factory method get_storage().

Updated resource_sign_url to use the storage abstraction.

refs #14
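A rough outline of how the pieces listed above could fit together; the class and function bodies here are simplified sketches based on that description, not the exact code in this PR:

```python
# Simplified sketch of the storage abstraction described above.
import boto3

class GenericS3Storage:
    """Signs URLs via boto3 for any S3-compatible provider (MinIO, VK Cloud, ...)."""

    def __init__(self, bucket: str, endpoint_url: str | None = None):
        self.bucket = bucket
        self.client = boto3.client("s3", endpoint_url=endpoint_url)

    def sign_url(self, key: str, expires_in: int = 3600) -> str:
        return self.client.generate_presigned_url(
            "get_object",
            Params={"Bucket": self.bucket, "Key": key},
            ExpiresIn=expires_in,
        )

class AWSS3Storage(GenericS3Storage):
    """For the "aws" provider the PR signs CloudFront URLs instead; omitted in this sketch."""

    def sign_url(self, key: str, expires_in: int = 3600) -> str:
        raise NotImplementedError("CloudFront URL signing is not sketched here")

def get_storage(provider: str, bucket: str, endpoint_url: str | None = None):
    # provider comes from papermerge__s3__provider, e.g. "aws", "minio" or "vk"
    if provider == "aws":
        return AWSS3Storage(bucket)
    return GenericS3Storage(bucket, endpoint_url=endpoint_url)

# resource_sign_url would then delegate to get_storage(...).sign_url(key)
```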