[WIP] Support signed URLs for non-AWS object storage providers #688

Open · wants to merge 1 commit into master

Conversation

bl1nkker
Contributor

This PR adds support for generating signed URLs with AWS S3-compatible storage providers.

Added settings:

  • papermerge__s3__provider
  • aws_access_key_id, aws_secret_access_key, aws_region_name, aws_endpoint_url
  • papermerge__s3__bucket_name

Added classes:

  • AWSS3Storage – generates CloudFront signed URLs
  • GenericS3Storage – uses boto3 for S3-compatible storage

Added factory method get_storage()

Updated resource_sign_url to use storage abstraction

refs #14

@bl1nkker
Contributor Author

@ciur

Noticed that _s3_page_svg_url and _s3_docver_download_url also perform URL signing, but are currently unused.

Should these be updated to use the new storage abstraction, or left as-is?

@bl1nkker
Contributor Author

Also: I tested locally using MinIO. I don't have access to AWS S3 + CloudFront, but I plan to test it in my work environment (which uses VK Object Storage).

@ciur
Member

ciur commented Jul 22, 2025

@ciur

noticed that _s3_page_svg_url and _s3_docver_download_url also perform URL signing, but are currently unused

should these be updated to use the new storage abstraction or left as-is?

You are right, those two functions are not used anymore. I will remove them; no need to do any updates related to them.

@ciur
Member

ciur commented Jul 22, 2025

For documents/nodes, the Papermerge REST API returns a URL pointing to where the client should get the file from. In simple cases (i.e. without S3), that URL is something like /api/document-versions/<document-version-id>/download: the client (e.g. a browser) then makes one more request to Papermerge to get the document, i.e. Papermerge serves the files.
In more complex scenarios (with S3), Papermerge returns a URL like https://<s3-server>/url-of-the-file-in-s3-server, e.g.:

https://ddmnua7cm301s.cloudfront.net/docvers/be/0e/be0ec1db-97dc-43ca-9e24-2665d4673181/The%20Project%20Gutenberg%20eBook%20of%20Also%20sprach%20Zarathustra%2C%20by%20Friedrich%20Wilhelm%20Nietzsche.pdf?Expires=1753157508&Signature=PQuB-J52QHhqHtc-x00VtRBI-5WBwT8d74LBxBhAXt9VnPfKia~pRe-XsBLgmzK~8M6S26hSMfkZ3rG83xdGYP9R9s6ksLq-D2vk3mS4KihV7r~KGjM9b5vgl0FlAebEV19stjoPs9lFeG9sUtoAZCnZpqatDdTVZuyHd9-WMDC16Gg84n6QsqKeLoTPFuQZVA6kro~Yd-OniMgfWjU3f6lrP2grbNoPxywGIZtq6591etbsyw27TAxczyVI0uP8WICbOqiZz0W7hcAGKUGHCTR9uVf6UeUtbVPwcyGf8FeVTCuM0WtQ96jRgIoC-3r4voFuaa0jDXSVk2uz-4tBoQ__&Key-Pair-Id=K19GCMLERJU26R

You can see this setup in action at https://demo.papermerge.com (username/password: demo/demo).

The reason I am telling you this is that in the scenario where I use S3, Papermerge does not serve files: it just gives the client the correct URL, and for that the REST API server does not need aws_access_key_id, aws_secret_access_key, aws_region_name, or aws_endpoint_url. Of course, this scenario is different in the sense that S3 here acts as a CDN as well.

@bl1nkker, in your setup, who is serving the files? Papermerge, or the S3 server?

```diff
@@ -37,6 +37,13 @@ class Settings(BaseSettings):
     papermerge__ocr__automatic: bool = False
     papermerge__search__url: str | None = None

+    papermerge__s3__provider: str = "aws"
+    aws_access_key_id: str | None = None
```
Member

As I mentioned in the comments, so far, when using S3 storage, Papermerge does not serve files. Thus there is no need for aws_access_key_id etc. Just keep this in mind, because I assume that in your S3 setup you want Papermerge to serve files as well, which means you will need to add code for downloading from S3?

@bl1nkker
Contributor Author

@bl1nkker, in your setup, who is serving files ? Is it Papermerge ? Or S3 server?

At the moment, in production, files are served by Papermerge and stored on local storage.

However, I'm facing a new requirement to offload files to external object storage, since more than 2 million pages will be uploaded soon. That's why I'm working on integrating Papermerge with VK Cloud.

@bl1nkker
Contributor Author

bl1nkker commented Jul 22, 2025

Why I am telling this, is that in the scenario I am using S3, Papermerge does not serve files: it just give back to the client correct URL and for that REST API server does not need aws_access_key_id, aws_secret_access_key, aws_region_name, aws_endpoint_url. Of course this scenario is different in sense that S3 here acts as CDN as well.

Regarding the variables: I don't have much experience with boto3, but according to their documentation, signing a URL requires creating a client, which in turn requires those credentials.

@bl1nkker
Contributor Author

I'm sorry, I was wrong.

Some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials.

However, I think that for non-AWS providers like MinIO or VK Cloud, specifying at least endpoint_url is likely necessary for proper functionality.

I'll update the PR once I finish testing.

@bl1nkker
Contributor Author

bl1nkker commented Jul 22, 2025

Tested Papermerge with another S3 provider. After a few small adjustments (I'll add them to the pull request a little later), everything works as expected.

Note: my setup does not use CloudFront or any CDN.

Note 2: please don't review the code just yet. While testing I noticed that there's already a storage.py file in the project, but I created a new storage/ module, which causes an import conflict and breaks the app. It works in my environment only because I patched it in my custom Docker image.

This is my docker-compose setup (note: it uses my custom images to support non-AWS signing):

```yaml
services:
  webapp:
    image: blinkker/papermerge:0.0.9-dev
    environment:
      PAPERMERGE__SECURITY__SECRET_KEY: 12345
      PAPERMERGE__AUTH__USERNAME: admin
      PAPERMERGE__AUTH__PASSWORD: admin
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__OCR__LANG_CODES: "deu,eng,kaz,rus"
      PAPERMERGE__OCR__DEFAULT_LANG_CODE: "deu"

      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>

      PAPERMERGE__MAIN__FILE_SERVER: s3
      # options are: vk, minio, aws
      PAPERMERGE__S3__PROVIDER: vk
    volumes:
      - media_root:/var/media/pmg
    ports:
      - "12000:80"

  s3worker:
    image: blinkker/papermerge-s3-worker:0.0.2
    command: worker
    environment:
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      S3_WORKER_ARGS: "-Q s3 -c 2"
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
    volumes:
      - media_root:/var/media/pmg

volumes:
  media_root:
```

@bl1nkker
Contributor Author

some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials

The environment variables I added actually are needed for signing S3 URLs directly. In my case, I want Papermerge to work only with the object storage, letting S3 serve all files directly.

Also, I don't want Papermerge to store files locally at all. But according to the documentation, it seems like local storage is still used even when S3 is configured.

Is there a way to fully switch Papermerge to use only S3 for storing and serving documents, without saving anything locally?

@ciur
Member

ciur commented Jul 22, 2025

is there a way to fully switch papermerge to use only S3 for storing and serving documents without saving anything locally?

I don't really understand what you mean by "only S3". Also, what exactly do you mean by "locally"? Locally for whom: the REST API server, or the S3 worker? Also, the docker compose is just an example. On a real production server there can be any number of REST API servers (1, 2, 3, 4, ..., N), and each of them has its own "local" storage, which they don't share. The same goes for S3 workers. So what exactly do you mean by "locally"?

Maybe explain your production setup here in detail (e.g. do you plan to deploy in k8s? do you plan to have only one REST API server? etc.), so that I can help you further.

PS: "production" for me means a k8s cluster with N (N >= 3) REST API server instances, each with ephemeral storage (i.e. storage which can be replaced at any time), and with N s3-workers (each worker has access to the same storage as its peer REST API server, so that it can upload files to S3).

@bl1nkker
Contributor Author

bl1nkker commented Jul 26, 2025

Maybe you explain here in detail your production setup (e.g. do you plan to deploy in k8s, do you plan to have only one REST API server? etc) so that I can further help you.

Right now my setup is very simple: I have a single Ubuntu 22.04 server with 8 GB RAM and a 500 GB disk. I'm running only one container with the Papermerge REST API (via Docker), and currently it stores files in a volume mounted on the host (no s3-worker in my setup).

I also don't plan to migrate to k8s. Currently my deployment is based on docker compose. Once I complete the integration with VK Cloud, I plan to move towards Docker Swarm, as I intend to provision a dedicated server for the database.

I don't really understand what you mean by "only S3".

What I want to achieve is the following: I'd like to run an additional s3-worker container and make Papermerge store all documents exclusively in S3 object storage (in my case, VK Cloud), without consuming any local disk space on the server. So instead of saving documents locally first and then offloading them to S3 later, I want Papermerge to write directly to S3 and read from there as well, skipping local volumes altogether. And yes, this means that S3 itself provides access to the files. With my current setup, Papermerge retrieves files directly from my storage (I'm running MinIO and the Papermerge API locally). For example, here's a generated pre-signed URL that the frontend uses to access a document:

http://host.docker.internal:9000/papermerge-test/1.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20250726%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250726T115842Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=a27e1cb6674d19c0160a651a536ff1ea175d2c204505bee12eb699e369cf9629

@bl1nkker
Contributor Author

is there a way to fully switch papermerge to use only S3 for storing and serving documents without saving anything locally?

Based on this diagram, I understand that with the current architecture there is no way. This is not very critical for me, but I think in the future I will look into how to solve this problem (at least in my fork).

[architecture diagram]

@ciur
Member

ciur commented Jul 26, 2025

Right. Currently, uploaded files are saved to disk first. But in general you could write another application which takes over the /api/{document_id}/upload endpoint and does what you want (e.g. streams files directly to S3 storage). But as I mentioned (did I mention that?), this is out of Papermerge's scope.

@bl1nkker
Contributor Author

bl1nkker commented Aug 7, 2025

@ciur, apologies for not working on this pull request for a while. I've had a busy period at work, and since my current solution worked well enough for our use case, I had to temporarily pause work on the PR (same for the s3-worker pull request).

However, I'd be happy to continue contributing.

@bl1nkker
Contributor Author

bl1nkker commented Aug 7, 2025

Also, I'd like to reiterate the approach I took for supporting alternative object storages: Papermerge generates a pre-signed URL for downloading the file from the object storage and sends it to the client (the files themselves are stored both on the main Papermerge node and in object storage, as was originally the case).

@ciur
Member

ciur commented Aug 7, 2025

@bl1nkker, don't forget to rebase. I changed master recently (all sync code went async). Also, you may want to have a look at this: https://docs.papermerge.io/3.5/developer-manual/architecture/

