http: Compressed response example in Python #35

Merged
merged 35 commits into from
Nov 27, 2024
5725565
http: Compressed response example in Python
felipecrv Sep 6, 2024
24973e3
complete the chunked response loop
felipecrv Sep 6, 2024
62277d7
more strict list of available compressors
felipecrv Sep 6, 2024
3409b1f
simplify config
felipecrv Sep 7, 2024
cc7bdae
better names
felipecrv Sep 7, 2024
3131eed
turns out I can use for..in in this loop as well
felipecrv Sep 7, 2024
ef90f49
fix indent
felipecrv Sep 7, 2024
4cb867b
don't pick gzip as default when it's not in AVAILABLE_CODINGS
felipecrv Sep 7, 2024
b2c6c88
suggest default filename
felipecrv Sep 7, 2024
c53805c
fix brotli file extension
felipecrv Sep 10, 2024
d98c7f7
expand README with note about simpler Accept-Encoding headers
felipecrv Sep 10, 2024
31aecef
Add client.py
felipecrv Sep 10, 2024
07c4dd5
reduce buffering and reduce latency
felipecrv Sep 10, 2024
727d3e5
expedite the yielding of the first buffer
felipecrv Sep 10, 2024
83d241d
expand README
felipecrv Sep 11, 2024
36924f7
remove test code
felipecrv Sep 11, 2024
4319ede
add an option to use dictionary-encoded string column
felipecrv Sep 12, 2024
2d992ad
readme: add note about IPC compression codec negotiation
felipecrv Sep 12, 2024
6725886
remove BUFFER_ENTIRE_RESPONSE option
felipecrv Sep 12, 2024
152157b
write a parser based on a tokenizer
felipecrv Sep 12, 2024
427a8b7
make parser generic to Accept and Accept-Encoding
felipecrv Sep 12, 2024
0df44ad
support IPC buffer compression based on Accept header
felipecrv Sep 12, 2024
42195c0
return codec in header
felipecrv Sep 12, 2024
ad2d3f2
extend client.py cases
felipecrv Sep 12, 2024
73897d4
Update paragraph about double-compression
felipecrv Sep 20, 2024
bff94ae
Fix typo in README
felipecrv Sep 20, 2024
3d60a54
Add note about meaning and interpretation of Content-Type
felipecrv Sep 20, 2024
ffd07e1
fix typo
felipecrv Sep 20, 2024
4f49776
Apply suggestions from code review
felipecrv Nov 22, 2024
4be06cd
README.md: Break long lines
felipecrv Nov 22, 2024
ac6b45e
Move make_requests.sh to curl/client.sh
felipecrv Nov 22, 2024
c44f49e
Add README files to sub directories
felipecrv Nov 22, 2024
fb4d5dd
Improve python/server/README.md
ianmcook Nov 27, 2024
382984b
Improve python/client/README.md
ianmcook Nov 27, 2024
0f20539
Improve python/client/README.md
ianmcook Nov 27, 2024
166 changes: 165 additions & 1 deletion http/get_compressed/README.md
@@ -19,4 +19,168 @@

# HTTP GET Arrow Data: Compression Examples

This directory contains examples of HTTP servers/clients that transmit/receive
data in the Arrow IPC streaming format and use compression (in various ways) to
reduce the size of the transmitted data.

Since we re-use the [Arrow IPC format][ipc] for transferring Arrow data over
HTTP and both Arrow IPC and HTTP standards support compression on their own,
there are at least two approaches to this problem:

1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed
array buffers.
2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed
array buffers.

Applying both IPC buffer and HTTP compression to the same data is not
recommended. The extra CPU overhead of decompressing the data twice is
not worth any possible gains that double compression might bring. If
compression ratios are unambiguously more important than reducing CPU
overhead, then a different compression algorithm that optimizes for that can
be chosen.

This table shows the support for different compression algorithms in HTTP and
Arrow IPC:

| Codec | Identifier | HTTP Support | IPC Support |
|----------- | ----------- | ------------- | ------------ |
| GZip | `gzip` | X | |
| DEFLATE | `deflate` | X | |
| Brotli | `br` | X[^2] | |
| Zstandard | `zstd` | X[^2] | X[^3] |
| LZ4 | `lz4` | | X[^3] |

Since not all Arrow IPC implementations support compression, HTTP compression
based on accepted formats negotiated with the client is a great way to increase
the chances of efficient data transfer.

Servers may check the `Accept-Encoding` header of the client and choose the
compression format in this order of preference: `zstd`, `br`, `gzip`,
`identity` (no compression). If the client does not specify a preference, the
only constraint on the server is the availability of the compression algorithm
in the server environment.
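
As a rough illustration of the preference order above, a server-side check for simple header values could look like the sketch below. It deliberately ignores q-values, the contents of `AVAILABLE_CODINGS` are an assumption about one particular server environment, and `pick_coding_simple` is a hypothetical helper name, not part of the example servers.

```python
# Server-side order of preference described in the text above.
PREFERENCE = ["zstd", "br", "gzip", "identity"]
# Assumption: the codings this particular server environment can produce.
AVAILABLE_CODINGS = {"gzip", "identity"}


def pick_coding_simple(accept_encoding):
    """Pick a content-coding from a simple Accept-Encoding value (q-values ignored)."""
    if accept_encoding is None:
        # No stated preference: only server-side availability constrains the choice.
        accepted = set(PREFERENCE)
    else:
        accepted = {
            item.split(";", 1)[0].strip().lower()
            for item in accept_encoding.split(",")
            if item.strip()
        }
    for coding in PREFERENCE:
        if coding in AVAILABLE_CODINGS and coding in accepted:
            return coding
    return "identity"
```

Because q-values are ignored, this sketch is only suitable for headers such as `Accept-Encoding: gzip, deflate, zstd, br`; a header that refuses a coding with `q=0` needs the full rules described later in this document.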

## Arrow IPC Compression

When IPC buffer compression is preferred and servers can't assume all clients
support it[^4], clients may be asked to explicitly list the supported compression
algorithms in the request headers. The `Accept` header can be used for this,
since `Accept-Encoding` (and `Content-Encoding`) control compression of the
entire HTTP response stream and instruct HTTP clients (like browsers) to
decompress the response before giving data to the application or saving the
data.

Accept: application/vnd.apache.arrow.stream; codecs="zstd, lz4"

This is similar to clients requesting video streams by specifying the
container format and the codecs they support
(e.g. `Accept: video/webm; codecs="vp8, vorbis"`).

The server is allowed to choose any of the listed codecs, or not compress the
IPC buffers at all. Uncompressed IPC buffers should always be acceptable to
clients.
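
A minimal sketch of how a server might read the `codecs` parameter is shown below. It assumes the `Accept` header carries a single media range; a header listing several media ranges needs a real tokenizer. The helper name `requested_ipc_codecs` is hypothetical.

```python
import re

ARROW_STREAM = "application/vnd.apache.arrow.stream"
# Matches codecs=zstd as well as codecs="zstd, lz4" (the quotes are optional).
_CODECS_RE = re.compile(r';\s*codecs\s*=\s*(?:"([^"]*)"|([^;,\s]+))', re.IGNORECASE)


def requested_ipc_codecs(accept):
    """Return the IPC codecs listed for the Arrow stream media type, in order."""
    if not accept or ARROW_STREAM not in accept.lower():
        return []
    match = _CODECS_RE.search(accept)
    if match is None:
        return []  # no codecs parameter: the client expects uncompressed buffers
    raw = match.group(1) if match.group(1) is not None else match.group(2)
    return [codec.strip().lower() for codec in raw.split(",") if codec.strip()]
```

The server can then pick the first codec in the returned list that it supports, or send uncompressed IPC buffers when the list is empty.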

If a server adopts this approach and a client does not specify any codecs in
the `Accept` header, the server can fall back to checking the `Accept-Encoding`
header to pick a compression algorithm for the entire HTTP response stream.

To make debugging easier, servers may include the chosen compression codec(s)
in the `Content-Type` header of the response (quotes are optional):

Content-Type: application/vnd.apache.arrow.stream; codecs=zstd

This is not necessary for correct decompression because the payload already
contains information that tells the IPC reader how to decompress the buffers,
but it can help developers understand what is going on.

When programmatically checking if the `Content-Type` header contains a specific
format, it is important to use a parser that can handle parameters or look
only at the media type part of the header. This is not specific to the
Arrow IPC format, but a general rule for all media types. For example,
`application/json; charset=utf-8` should match `application/json`.
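
One way to look only at the media type part is sketched below. The helper names `media_type` and `is_arrow_stream` are hypothetical and the sketch does not validate the header syntax.

```python
def media_type(content_type):
    """Return the lowercased media type, ignoring parameters such as codecs or charset."""
    return content_type.split(";", 1)[0].strip().lower()


def is_arrow_stream(content_type):
    """Check for the Arrow IPC stream media type regardless of parameters."""
    return media_type(content_type) == "application/vnd.apache.arrow.stream"
```

With this check, `application/vnd.apache.arrow.stream; codecs=zstd` and the bare `application/vnd.apache.arrow.stream` are treated the same way.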

When considering use of IPC buffer compression, check the [IPC format section of
the Arrow Implementation Status page][^5] to see whether the Arrow
implementations you are targeting support it.

## HTTP/1.1 Response Compression

HTTP/1.1 offers an elaborate way for clients to specify their preferred
content encoding (read: compression algorithm) using the `Accept-Encoding`
header.[^1]

At least the Python server (in [`python/`](./python)) implements a fully
compliant parser for the `Accept-Encoding` header. Application servers may
choose to implement a simpler check of the `Accept-Encoding` header or assume
that the client accepts the chosen compression scheme when talking to that
server.

Here is an example of a header that a client may send and what it means:

Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0

This header says that the client prefers that the server compress the
response with `zstd`, but if that is not possible, then `br` (Brotli) and `gzip`
are acceptable (in that order because 0.8 is greater than 0.5). The client
does not want the response to be uncompressed. This is communicated by
`identity` being listed with `q=0`.

To tell the server the client only accepts `zstd` responses and nothing
else, not even uncompressed responses, the client would send:

Accept-Encoding: zstd, *;q=0

RFC 2616[^1] specifies the rules for how a server should interpret the
`Accept-Encoding` header:

A server tests whether a content-coding is acceptable, according to
an Accept-Encoding field, using these rules:

1. If the content-coding is one of the content-codings listed in
the Accept-Encoding field, then it is acceptable, unless it is
accompanied by a qvalue of 0. (As defined in section 3.9, a
qvalue of 0 means "not acceptable.")

2. The special "*" symbol in an Accept-Encoding field matches any
available content-coding not explicitly listed in the header
field.

3. If multiple content-codings are acceptable, then the acceptable
content-coding with the highest non-zero qvalue is preferred.

4. The "identity" content-coding is always acceptable, unless
specifically refused because the Accept-Encoding field includes
"identity;q=0", or because the field includes "*;q=0" and does
not explicitly include the "identity" content-coding. If the
Accept-Encoding field-value is empty, then only the "identity"
encoding is acceptable.
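
The four rules quoted above can be sketched in Python as follows. This is not the parser used by the Python server in this directory: the function names are hypothetical, the parsing is a simple comma split rather than a tokenizer, and two behaviors are assumptions of this sketch (a malformed qvalue counts as 0, and an unlisted `identity` is acceptable but ranked below every listed coding).

```python
def parse_accept_encoding(value):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict."""
    qvalues = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # default qvalue when none is given
        for param in params.split(";"):
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val.strip())
                except ValueError:
                    q = 0.0  # assumption: malformed qvalue means "not acceptable"
        qvalues[coding.strip().lower()] = q
    return qvalues


def choose_coding(accept_encoding, available):
    """Return the preferred acceptable coding, or None (e.g. to answer 406).

    `available` lists the codings the server can produce, in the server's
    own order of preference, which is used to break qvalue ties.
    """
    if accept_encoding is None:
        # Header absent: the server may assume any coding is acceptable.
        return available[0] if available else "identity"
    qvalues = parse_accept_encoding(accept_encoding)
    star = qvalues.get("*")

    def qvalue(coding):
        if coding in qvalues:  # rule 1: explicitly listed
            return qvalues[coding]
        if star is not None:  # rule 2: "*" covers codings not listed
            return star
        # Rule 4: identity stays acceptable; assumption: rank it last.
        return 0.0001 if coding == "identity" else 0.0

    candidates = list(available)
    if "identity" not in candidates:
        candidates.append("identity")
    best, best_q = None, 0.0
    for coding in candidates:
        q = qvalue(coding)
        if q > best_q:  # rule 3: highest non-zero qvalue wins
            best, best_q = coding, q
    return best
```

For the example header shown earlier, `choose_coding("zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0", ["br", "gzip"])` selects `br`, and an empty `available` list yields `None` because `identity` was refused.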

If you're targeting web browsers, check the compatibility table of [compression
algorithms on MDN Web Docs][^2].

Another important rule is that if the server compresses the response, it
must include a `Content-Encoding` header in the response.

If the content-coding of an entity is not "identity", then the
response MUST include a Content-Encoding entity-header (section
14.11) that lists the non-identity content-coding(s) used.
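
A minimal sketch of this rule, using only `gzip` from the Python standard library, is shown below. The function names are hypothetical, and `chunks` stands for any iterable of byte strings, for example pieces of an Arrow IPC stream as they are produced.

```python
import zlib


def gzip_stream(chunks):
    """Compress an iterable of byte chunks into one gzip stream, incrementally."""
    # wbits=31 selects the gzip container (16 + a 15-bit window).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and return nothing yet
            yield data
    yield compressor.flush()


def compressed_response(chunks, coding):
    """Return (extra_headers, body_iterator) for the chosen content-coding."""
    if coding == "gzip":
        # A non-identity coding MUST be announced in Content-Encoding.
        return {"Content-Encoding": "gzip"}, gzip_stream(chunks)
    # identity: the body is sent as-is and no Content-Encoding header is needed.
    return {}, iter(chunks)
```

Other codings such as `zstd` and `br` follow the same pattern but need third-party packages, so they are left out of this sketch.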

Review comment (contributor author):

I need to expand this text to talk about how Arrow streams can be piped into a compressed response. Perhaps having it before this summarized explanation of encoding negotiation.

Since not all servers implement the full `Accept-Encoding` header parsing logic,
clients tend to stick to simple header values like `Accept-Encoding: identity`
when no compression is desired, and `Accept-Encoding: gzip, deflate, zstd, br`
when the client supports different compression formats and is indifferent to
which one the server chooses. Clients should expect uncompressed responses as
well in these cases. The only way to force a "406 Not Acceptable" response when
no compression is available is to send `identity;q=0` or `*;q=0` at
the end of the `Accept-Encoding` header. But that relies on the server
implementing the full `Accept-Encoding` handling logic.
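
On the client side, this means the response body should be decoded according to the `Content-Encoding` header the server actually sent, not according to what was requested. A minimal sketch using only the Python standard library follows; `decode_body` is a hypothetical helper, and `zstd` and `br` are omitted because they need third-party packages.

```python
import zlib


def decode_body(body, content_encoding=None):
    """Undo the HTTP content-coding named by the Content-Encoding response header."""
    coding = (content_encoding or "identity").strip().lower()
    if coding == "identity":
        return body  # the server chose not to compress
    if coding == "gzip":
        return zlib.decompress(body, wbits=31)  # 31 = gzip container
    if coding == "deflate":
        return zlib.decompress(body)  # zlib-wrapped DEFLATE
    raise ValueError(f"unsupported content-coding: {coding}")
```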


[^1]: [Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.](https://www.rfc-editor.org/rfc/rfc2616#section-14.3)
[^2]: [MDN Web Docs: Content-Encoding (browser compatibility)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#browser_compatibility)
[^3]: [Arrow Columnar Format: Compression](https://arrow.apache.org/docs/format/Columnar.html#compression)
[^4]: Web applications using the JavaScript Arrow implementation don't have
access to the compression APIs to decompress `zstd` and `lz4` IPC buffers.
[^5]: [Arrow Implementation Status: IPC Format](https://arrow.apache.org/docs/status.html#ipc-format)

[ipc]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
80 changes: 80 additions & 0 deletions http/get_compressed/curl/client/README.md
@@ -0,0 +1,80 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# HTTP GET Arrow Data: Compressed Arrow Data Examples

This directory contains a simple `curl` script that issues multiple HTTP GET
requests to the server implemented in the parent directory, negotiating
different compression algorithms for the Arrow IPC stream data and piping the
output to different files with extensions that indicate the compression
algorithm used.

To run this example, first start one of the server examples in the parent
directory, then run the `client.sh` script.

You can check all the sizes with a simple command:

```bash
$ du -sh out* | sort -gr
816M out.arrows
804M out_from_chunked.arrows
418M out_from_chunked.arrows+lz4
405M out.arrows+lz4
257M out.arrows.gz
256M out_from_chunked.arrows.gz
229M out_from_chunked.arrows+zstd
229M out.arrows+zstd
220M out.arrows.zstd
219M out_from_chunked.arrows.zstd
39M out_from_chunked.arrows.br
38M out.arrows.br
```

> [!WARNING]
> Better compression is not the only relevant metric as it might come with a
> trade-off in terms of CPU usage. The best compression algorithm for your use
> case will depend on your specific requirements.

## Meaning of the file extensions

Files produced by HTTP/1.0 requests are not chunked: they get buffered in memory
at the server before being sent to the client. If compressed, they end up
slightly smaller than the results of chunked responses, but the extra delay
before the first byte is not worth it in most cases.

- `out.arrows` (Uncompressed)
- `out.arrows.gz` (Gzip HTTP compression)
- `out.arrows.zstd` (Zstandard HTTP compression)
- `out.arrows.br` (Brotli HTTP compression)

- `out.arrows+zstd` (Zstandard IPC compression)
- `out.arrows+lz4` (LZ4 IPC compression)

Responses to HTTP/1.1 requests are sent by the server with
`Transfer-Encoding: chunked`, so the data goes out in smaller chunks that are
written to the socket as soon as they are ready. This is useful for large
responses that take a long time to generate, at the cost of a small overhead
caused by the independent compression of each chunk.

- `out_from_chunked.arrows` (Uncompressed)
- `out_from_chunked.arrows.gz` (Gzip HTTP compression)
- `out_from_chunked.arrows.zstd` (Zstandard HTTP compression)
- `out_from_chunked.arrows.br` (Brotli HTTP compression)

- `out_from_chunked.arrows+lz4` (LZ4 IPC compression)
- `out_from_chunked.arrows+zstd` (Zstandard IPC compression)
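
For reference, the `Transfer-Encoding: chunked` framing behind the `out_from_chunked.*` files can be sketched as below. This only illustrates the wire format defined by HTTP/1.1 and is not taken from the example servers; `encode_chunked` is a hypothetical helper name.

```python
def encode_chunked(chunks):
    """Frame byte chunks using the HTTP/1.1 chunked transfer coding."""
    for chunk in chunks:
        if not chunk:
            continue  # a zero-length chunk would terminate the body early
        # Each chunk is prefixed by its size in hexadecimal and followed by CRLF.
        yield b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    yield b"0\r\n\r\n"  # last-chunk marker followed by an empty trailer section
```

Each chunk can be written to the socket as soon as it is produced, which is what keeps the time to first byte low.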
46 changes: 46 additions & 0 deletions http/get_compressed/curl/client/client.sh
@@ -0,0 +1,46 @@
#!/bin/sh

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

CURL="curl --verbose"
URI="http://localhost:8008"
OUT_HTTP1=out.arrows
OUT_CHUNKED=out_from_chunked.arrows

# HTTP/1.0 means that response is not chunked and not compressed...
$CURL --http1.0 -o $OUT_HTTP1 $URI
# ...but it may be compressed with an explicitly set Accept-Encoding
# header
$CURL --http1.0 -H "Accept-Encoding: gzip, *;q=0" -o $OUT_HTTP1.gz $URI
$CURL --http1.0 -H "Accept-Encoding: zstd, *;q=0" -o $OUT_HTTP1.zstd $URI
$CURL --http1.0 -H "Accept-Encoding: br, *;q=0" -o $OUT_HTTP1.br $URI
# ...or with IPC buffer compression if the Accept header specifies codecs.
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_HTTP1+zstd $URI
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_HTTP1+lz4 $URI

# HTTP/1.1 means compression is on by default...
# ...but it can be refused with the Accept-Encoding: identity header.
$CURL -H "Accept-Encoding: identity" -o $OUT_CHUNKED $URI
# ...gzip is used if no Accept-Encoding header is set.
$CURL -o $OUT_CHUNKED.gz $URI
# ...or with the compression algorithm specified in the Accept-Encoding.
$CURL -H "Accept-Encoding: zstd, *;q=0" -o $OUT_CHUNKED.zstd $URI
$CURL -H "Accept-Encoding: br, *;q=0" -o $OUT_CHUNKED.br $URI
# ...or with IPC buffer compression if the Accept header specifies codecs.
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_CHUNKED+zstd $URI
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_CHUNKED+lz4 $URI
32 changes: 32 additions & 0 deletions http/get_compressed/python/client/README.md
@@ -0,0 +1,32 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# HTTP GET Arrow Data: Compressed Arrow Data Examples

This directory contains an HTTP client implemented in Python that issues multiple
requests to one of the server examples implemented in the parent directory,
negotiating different compression algorithms for the Arrow IPC stream data.

To run this example, first start one of the compressed server examples in the
parent directory, then:

```sh
pip install pyarrow
python client.py
```