[WIP] Support signed URLs for non-AWS object storage providers #688

Open · wants to merge 1 commit into master

Conversation

bl1nkker
Contributor

This PR adds support for generating signed URLs with AWS S3-compatible storage providers.

Added settings:

  • papermerge__s3__provider
  • aws_access_key_id, aws_secret_access_key, aws_region_name, aws_endpoint_url
  • papermerge__s3__bucket_name

Added classes:

  • AWSS3Storage – generates CloudFront signed URLs
  • GenericS3Storage – uses boto3 for S3-compatible storage

Added factory method get_storage()

Updated resource_sign_url to use storage abstraction

refs #14

@bl1nkker
Contributor Author

@ciur

Noticed that _s3_page_svg_url and _s3_docver_download_url also perform URL signing, but are currently unused.

Should these be updated to use the new storage abstraction, or left as-is?

@bl1nkker
Contributor Author

Also: I tested locally using MinIO. I don't have access to AWS S3 + CloudFront, but I plan to test it in my work environment (which uses VK Object Storage).

@ciur
Member

ciur commented Jul 22, 2025

@ciur

noticed that _s3_page_svg_url and _s3_docver_download_url also perform URL signing, but are currently unused

should these be updated to use the new storage abstraction or left as-is?

You are right, those two functions are not used anymore. I will remove them; no need to do any updates related to them.

@ciur
Member

ciur commented Jul 22, 2025

For documents/nodes, the Papermerge REST API returns a URL pointing to where the client should get the file from. In simple cases (i.e. without S3), that URL is something like /api/document-versions/<document-version-id>/download: the client (e.g. a browser) then makes one more request to Papermerge to get the document, i.e. Papermerge serves the files.
In more complex scenarios (with S3), Papermerge returns a URL like https://<s3-server>/url-of-the-file-in-s3-server, e.g.:

https://ddmnua7cm301s.cloudfront.net/docvers/be/0e/be0ec1db-97dc-43ca-9e24-2665d4673181/The%20Project%20Gutenberg%20eBook%20of%20Also%20sprach%20Zarathustra%2C%20by%20Friedrich%20Wilhelm%20Nietzsche.pdf?Expires=1753157508&Signature=PQuB-J52QHhqHtc-x00VtRBI-5WBwT8d74LBxBhAXt9VnPfKia~pRe-XsBLgmzK~8M6S26hSMfkZ3rG83xdGYP9R9s6ksLq-D2vk3mS4KihV7r~KGjM9b5vgl0FlAebEV19stjoPs9lFeG9sUtoAZCnZpqatDdTVZuyHd9-WMDC16Gg84n6QsqKeLoTPFuQZVA6kro~Yd-OniMgfWjU3f6lrP2grbNoPxywGIZtq6591etbsyw27TAxczyVI0uP8WICbOqiZz0W7hcAGKUGHCTR9uVf6UeUtbVPwcyGf8FeVTCuM0WtQ96jRgIoC-3r4voFuaa0jDXSVk2uz-4tBoQ__&Key-Pair-Id=K19GCMLERJU26R

You can see this setup in action at https://demo.papermerge.com (username/password: demo/demo).

The reason I am telling you this is that in the scenario where I use S3, Papermerge does not serve files: it just gives the client the correct URL, and for that the REST API server does not need aws_access_key_id, aws_secret_access_key, aws_region_name, or aws_endpoint_url. Of course, this scenario is different in the sense that S3 here acts as a CDN as well.

@bl1nkker, in your setup, who is serving the files? Papermerge, or the S3 server?

```diff
@@ -37,6 +37,13 @@ class Settings(BaseSettings):
     papermerge__ocr__automatic: bool = False
     papermerge__search__url: str | None = None

+    papermerge__s3__provider: str = "aws"
+    aws_access_key_id: str | None = None
```
Member

As I mentioned in the comments, so far, when using S3 storage, Papermerge does not serve files. Thus there is no need for aws_access_key_id etc. Just keep this in mind, because I assume that in your S3 setup you want Papermerge to serve files as well, which means you will need to add code for downloading from S3?

@bl1nkker
Contributor Author

@bl1nkker, in your setup, who is serving files ? Is it Papermerge ? Or S3 server?

At the moment, in production, files are served by Papermerge and stored on local storage.

However, I'm facing a new requirement to offload files to external object storage, since more than 2 million pages will be uploaded soon. That's why I'm working on integrating Papermerge with VK Cloud.

@bl1nkker
Contributor Author

bl1nkker commented Jul 22, 2025

Why I am telling this, is that in the scenario I am using S3, Papermerge does not serve files: it just give back to the client correct URL and for that REST API server does not need aws_access_key_id, aws_secret_access_key, aws_region_name, aws_endpoint_url. Of course this scenario is different in sense that S3 here acts as CDN as well.

Regarding the variables: I don't have much experience with boto3, but according to their documentation, signing a URL requires creating a client, which in turn requires those credentials.

@bl1nkker
Contributor Author

I'm sorry, I was wrong.

Some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials.

However, I think that for non-AWS providers like MinIO or VK Cloud, specifying at least endpoint_url is likely necessary for proper functionality.

I'll update the PR once I finish testing.

@bl1nkker
Contributor Author

bl1nkker commented Jul 22, 2025

Tested Papermerge with another S3 provider. After a few small adjustments (I'll add them to the pull request a little later), everything works as expected.

Note: my setup does not use CloudFront or any CDN.

Note 2: please don't review the code just yet. While testing I noticed that there's already a storage.py file in the project, but I created a new storage/ module, which causes an import conflict and breaks the app. It works in my environment only because I patched it in my custom Docker image.

This is my docker-compose setup (note: it uses my custom images to support non-AWS signing):

```yaml
services:
  webapp:
    image: blinkker/papermerge:0.0.9-dev
    environment:
      PAPERMERGE__SECURITY__SECRET_KEY: 12345
      PAPERMERGE__AUTH__USERNAME: admin
      PAPERMERGE__AUTH__PASSWORD: admin
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__OCR__LANG_CODES: "deu,eng,kaz,rus"
      PAPERMERGE__OCR__DEFAULT_LANG_CODE: "deu"

      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>

      PAPERMERGE__MAIN__FILE_SERVER: s3
      # options are: vk, minio, aws
      PAPERMERGE__S3__PROVIDER: vk
    volumes:
      - media_root:/var/media/pmg
    ports:
      - "12000:80"

  s3worker:
    image: blinkker/papermerge-s3-worker:0.0.2
    command: worker
    environment:
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      S3_WORKER_ARGS: "-Q s3 -c 2"
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
    volumes:
      - media_root:/var/media/pmg

volumes:
  media_root:
```

@bl1nkker
Contributor Author

some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials

The environment variables I added actually are needed for signing S3 URLs directly. In my case, I want Papermerge to work only with the object storage, letting S3 serve all files directly.

Also, I don't want Papermerge to store files locally at all. But according to the documentation, it seems like local storage is still used even when S3 is configured.

Is there a way to fully switch Papermerge to use only S3 for storing and serving documents, without saving anything locally?

@ciur
Member

ciur commented Jul 22, 2025

is there a way to fully switch papermerge to use only S3 for storing and serving documents without saving anything locally?

I don't really understand what you mean by "only S3". Also, what exactly do you mean by "locally"? Locally for whom: the REST API server, or the S3 worker? Also, the docker compose is just an example. On a real production server there can be any number of REST API servers (1, 2, 3, 4, ..., N), and each of them has its own "local" storage, which they don't share. The same goes for S3 workers. So what exactly do you mean by "locally"?

Maybe explain your production setup here in detail (e.g. do you plan to deploy in k8s? do you plan to have only one REST API server? etc.), so that I can help you further.

PS: "production" for me means a k8s cluster with N (N >= 3) REST API server instances, each with ephemeral storage (i.e. storage which can be replaced at any time), and with N s3-workers (each worker has access to the same storage as its peer REST API server, so that it can upload files to S3).

@bl1nkker
Contributor Author

bl1nkker commented Jul 26, 2025

Maybe you explain here in detail your production setup (e.g. do you plan to deploy in k8s, do you plan to have only one REST API server? etc) so that I can further help you.

Right now my setup is very simple: I have a single Ubuntu 22.04 server with 8 GB RAM and a 500 GB disk. I'm running only one container with the Papermerge REST API (via Docker), and currently it stores files in a volume mounted on the host (no s3-worker in my setup).

I also don't plan to migrate to k8s. Currently my deployment is based on docker compose. Once I complete the integration with VK Cloud, I plan to move towards Docker Swarm, as I intend to provision a dedicated server for the database.

I don't really understand what you mean by "only S3".

What I want to achieve is the following: I'd like to run an additional s3-worker container and make Papermerge store all documents exclusively in S3 object storage (in my case, VK Cloud), without consuming any local disk space on the server. So instead of saving documents locally first and then offloading them to S3 later, I want Papermerge to write directly to S3 and read from there as well, skipping local volumes altogether. And yes, this means that S3 itself provides access to the files. With my current setup, Papermerge retrieves files directly from my storage (I'm running MinIO and the Papermerge API locally). For example, here's a generated pre-signed URL that the frontend uses to access a document:

http://host.docker.internal:9000/papermerge-test/1.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20250726%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250726T115842Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=a27e1cb6674d19c0160a651a536ff1ea175d2c204505bee12eb699e369cf9629

@bl1nkker
Contributor Author

is there a way to fully switch papermerge to use only S3 for storing and serving documents without saving anything locally?

Based on this diagram, I understand that with the current architecture there is no way. This is not very critical for me, but I think in the future I will look into how to solve this problem (at least in my fork).

[architecture diagram]

@ciur
Member

ciur commented Jul 26, 2025

Right. Currently, uploaded files are saved to disk first. But in general you could write another application which takes over the /api/{document_id}/upload endpoint and does what you want (e.g. streams files directly to S3 storage). But as I mentioned (did I mention that?), this is out of Papermerge's scope.

@bl1nkker
Contributor Author

bl1nkker commented Aug 7, 2025

@ciur, apologies for not working on this pull request for a while. I've had a busy period at work, and since my current solution worked well enough for our use case, I had to temporarily pause work on the PR (same for the s3-worker pull request).

However, I'd be happy to continue contributing.

@bl1nkker
Contributor Author

bl1nkker commented Aug 7, 2025

Also, I'd like to reiterate the approach I took for supporting alternative object storages: Papermerge generates a pre-signed URL for downloading the file from the object storage and sends it to the client (the files themselves are stored both on the main Papermerge node and in object storage, as was originally the case).

@ciur
Member

ciur commented Aug 7, 2025

@bl1nkker, don't forget to rebase. I changed master recently (all sync code went async). Also, you may want to have a look at this: https://docs.papermerge.io/3.5/developer-manual/architecture/

