http: Compressed response example in Python #35
Merged
35 commits
5725565
http: Compressed response example in Python
felipecrv 24973e3
complete the chunked response loop
felipecrv 62277d7
more strict list of available compressors
felipecrv 3409b1f
simplify config
felipecrv cc7bdae
better names
felipecrv 3131eed
turns out I can use for..in in this loop as well
felipecrv ef90f49
fix indent
felipecrv 4cb867b
don't pick gzip as default when it's not in AVAILABLE_CODINGS
felipecrv b2c6c88
suggest default filename
felipecrv c53805c
fix brotli file extension
felipecrv d98c7f7
expand README with note about simpler Accept-Encoding headers
felipecrv 31aecef
Add client.py
felipecrv 07c4dd5
reduce buffering and reduce latency
felipecrv 727d3e5
expedite the yielding of the first buffer
felipecrv 83d241d
expand README
felipecrv 36924f7
remove test code
felipecrv 4319ede
add an option to use dictionary-encoded string column
felipecrv 2d992ad
readme: add note about IPC compression codec negotiation
felipecrv 6725886
remove BUFFER_ENTIRE_RESPONSE option
felipecrv 152157b
write a parser based on a tokenizer
felipecrv 427a8b7
make parser generic to Accept and Accept-Encoding
felipecrv 0df44ad
support IPC buffer compression based on Accept header
felipecrv 42195c0
return codec in header
felipecrv ad2d3f2
extend client.py cases
felipecrv 73897d4
Update paragraph about double-compression
felipecrv bff94ae
Fix typo in README
felipecrv 3d60a54
Add note about meaning and interpretation of Content-Type
felipecrv ffd07e1
fix typo
felipecrv 4f49776
Apply suggestions from code review
felipecrv 4be06cd
README.md: Break long lines
felipecrv ac6b45e
Move make_requests.sh to curl/client.sh
felipecrv c44f49e
Add README files to sub directories
felipecrv fb4d5dd
Improve python/server/README.md
ianmcook 382984b
Improve python/client/README.md
ianmcook 0f20539
Improve python/client/README.md
ianmcook
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,4 +19,168 @@ | |
|
||
# HTTP GET Arrow Data: Compression Examples | ||
|
||
This directory contains examples of HTTP servers/clients that transmit/receive data in the Arrow IPC streaming format and use compression (in various ways) to reduce the size of the transmitted data. | ||
This directory contains examples of HTTP servers/clients that transmit/receive | ||
data in the Arrow IPC streaming format and use compression (in various ways) to | ||
reduce the size of the transmitted data. | ||
|
||
Since we re-use the [Arrow IPC format][ipc] for transferring Arrow data over | ||
HTTP and both Arrow IPC and HTTP standards support compression on their own, | ||
there are at least two approaches to this problem: | ||
|
||
1. Compressed HTTP responses carrying Arrow IPC streams with uncompressed | ||
array buffers. | ||
2. Uncompressed HTTP responses carrying Arrow IPC streams with compressed | ||
array buffers. | ||
|
||
Applying both IPC buffer and HTTP compression to the same data is not | ||
recommended. The extra CPU overhead of decompressing the data twice is | ||
not worth any possible gains that double compression might bring. If | ||
compression ratios are unambiguously more important than reducing CPU | ||
overhead, then a different compression algorithm that optimizes for that can | ||
be chosen. | ||
|
||
This table shows the support for different compression algorithms in HTTP and | ||
Arrow IPC: | ||
|
||
| Codec | Identifier | HTTP Support | IPC Support | | ||
|----------- | ----------- | ------------- | ------------ | | ||
| GZip | `gzip` | X | | | ||
| DEFLATE | `deflate` | X | | | ||
| Brotli | `br` | X[^2] | | | ||
| Zstandard | `zstd` | X[^2] | X[^3] | | ||
| LZ4 | `lz4` | | X[^3] | | ||
|
||
Since not all Arrow IPC implementations support compression, HTTP compression | ||
based on accepted formats negotiated with the client is a great way to increase | ||
the chances of efficient data transfer. | ||
|
||
Servers may check the `Accept-Encoding` header of the client and choose the | ||
compression format in this order of preference: `zstd`, `br`, `gzip`, | ||
`identity` (no compression). If the client does not specify a preference, the | ||
only constraint on the server is the availability of the compression algorithm | ||
in the server environment. | ||
|
||
## Arrow IPC Compression | ||
|
||
When IPC buffer compression is preferred and servers can't assume all clients | ||
support it[^4], clients may be asked to explicitly list the supported compression | ||
algorithms in the request headers. The `Accept` header can be used for this | ||
since `Accept-Encoding` (and `Content-Encoding`) is used to control compression | ||
of the entire HTTP response stream and instruct HTTP clients (like browsers) to | ||
decompress the response before giving data to the application or saving the | ||
data. | ||
|
||
Accept: application/vnd.apache.arrow.stream; codecs="zstd, lz4" | ||
|
||
This is similar to clients requesting video streams by specifying the | ||
container format and the codecs they support | ||
(e.g. `Accept: video/webm; codecs="vp8, vorbis"`). | ||
|
||
The server is allowed to choose any of the listed codecs, or not compress the | ||
IPC buffers at all. Uncompressed IPC buffers should always be acceptable to | ||
clients. | ||
|
||
If a server adopts this approach and a client does not specify any codecs in | ||
the `Accept` header, the server can fall back to checking `Accept-Encoding` | ||
header to pick a compression algorithm for the entire HTTP response stream. | ||
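The negotiation described above could be sketched as follows. This is a simplified illustration, not the parser from this PR: it handles a single media range, ignores q-values, and the names `requested_ipc_codecs`, `pick_ipc_codec`, and `AVAILABLE_IPC_CODECS` are hypothetical.

```python
import re

ARROW_STREAM = "application/vnd.apache.arrow.stream"
# Assumption: the IPC buffer codecs this hypothetical server can apply.
AVAILABLE_IPC_CODECS = ("zstd", "lz4")


def requested_ipc_codecs(accept_header):
    """Return the codecs listed for the Arrow stream media type, in order.

    Simplified: handles a single media range and ignores q-values.
    """
    media_type, _, params = accept_header.partition(";")
    if media_type.strip().lower() != ARROW_STREAM:
        return []
    # Accept both the quoted form codecs="zstd, lz4" and the bare form codecs=zstd.
    match = re.search(r'codecs\s*=\s*(?:"([^"]*)"|([^;\s]+))', params, re.IGNORECASE)
    if match is None:
        return []
    value = match.group(1) if match.group(1) is not None else match.group(2)
    return [c.strip().lower() for c in value.split(",") if c.strip()]


def pick_ipc_codec(accept_header):
    """Pick the first requested codec the server supports, or None."""
    for codec in requested_ipc_codecs(accept_header):
        if codec in AVAILABLE_IPC_CODECS:
            return codec
    return None
```

A `None` result is the point where a server taking this approach would fall back to `Accept-Encoding`, or send uncompressed IPC buffers.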
|
||
To make debugging easier, servers may include the chosen compression codec(s) | ||
in the `Content-Type` header of the response (quotes are optional): | ||
|
||
Content-Type: application/vnd.apache.arrow.stream; codecs=zstd | ||
|
||
This is not necessary for correct decompression because the payload already | ||
contains information that tells the IPC reader how to decompress the buffers, | ||
but it can help developers understand what is going on. | ||
|
||
When programmatically checking if the `Content-Type` header contains a specific | ||
format, it is important to use a parser that can handle parameters or look | ||
only at the media type part of the header. This is not specific to the | ||
Arrow IPC format, but a general rule for all media types. For example, | ||
`application/json; charset=utf-8` should match `application/json`. | ||
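A minimal sketch of such a check, looking only at the media type part of the header. The helper names `media_type_of` and `is_arrow_stream` are made up for illustration.

```python
def media_type_of(content_type):
    """Return the lowercased media type, dropping any parameters."""
    return content_type.split(";", 1)[0].strip().lower()


def is_arrow_stream(content_type):
    """True if the header names the Arrow IPC stream media type."""
    return media_type_of(content_type) == "application/vnd.apache.arrow.stream"
```

With this, `application/vnd.apache.arrow.stream; codecs=zstd` is still recognized as an Arrow IPC stream, whereas a plain string equality check against the full header value would reject it.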
|
||
|
||
When considering use of IPC buffer compression, check the [IPC format section of | ||
the Arrow Implementation Status page][^5] to see whether the Arrow | ||
implementations you are targeting support it. | ||
|
||
## HTTP/1.1 Response Compression | ||
|
||
HTTP/1.1 offers an elaborate way for clients to specify their preferred | ||
content encoding (read compression algorithm) using the `Accept-Encoding` | ||
header.[^1] | ||
|
||
At least the Python server (in [`python/`](./python)) implements a fully | ||
compliant parser for the `Accept-Encoding` header. Application servers may | ||
choose to implement a simpler check of the `Accept-Encoding` header or assume | ||
that the client accepts the chosen compression scheme when talking to that | ||
server. | ||
|
||
Here is an example of a header that a client may send and what it means: | ||
|
||
Accept-Encoding: zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0 | ||
|
||
This header says that the client prefers that the server compress the | ||
response with `zstd`, but if that is not possible, then `brotli` and `gzip` | ||
are acceptable (in that order because 0.8 is greater than 0.5). The client | ||
does not want the response to be uncompressed. This is communicated by | ||
`"identity"` being listed with `q=0`. | ||
|
||
To tell the server the client only accepts `zstd` responses and nothing | ||
else, not even uncompressed responses, the client would send: | ||
|
||
Accept-Encoding: zstd, *;q=0 | ||
|
||
RFC 2616[^1] specifies the rules for how a server should interpret the | ||
`Accept-Encoding` header: | ||
|
||
A server tests whether a content-coding is acceptable, according to | ||
an Accept-Encoding field, using these rules: | ||
|
||
1. If the content-coding is one of the content-codings listed in | ||
the Accept-Encoding field, then it is acceptable, unless it is | ||
accompanied by a qvalue of 0. (As defined in section 3.9, a | ||
qvalue of 0 means "not acceptable.") | ||
|
||
2. The special "*" symbol in an Accept-Encoding field matches any | ||
available content-coding not explicitly listed in the header | ||
field. | ||
|
||
3. If multiple content-codings are acceptable, then the acceptable | ||
content-coding with the highest non-zero qvalue is preferred. | ||
|
||
4. The "identity" content-coding is always acceptable, unless | ||
specifically refused because the Accept-Encoding field includes | ||
"identity;q=0", or because the field includes "*;q=0" and does | ||
not explicitly include the "identity" content-coding. If the | ||
Accept-Encoding field-value is empty, then only the "identity" | ||
encoding is acceptable. | ||
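The four rules quoted above could be applied roughly as in the sketch below. It is not the parser from this PR: `parse_accept_encoding` and `choose_coding` are hypothetical names, the default `available` tuple simply follows the server preference order suggested earlier in this README, and treating an unlisted `identity` as a last resort is a design choice of this sketch rather than something the RFC mandates.

```python
def parse_accept_encoding(header):
    """Map each listed content-coding to its qvalue (default 1.0)."""
    qvalues = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0
        name, _, value = params.partition("=")
        if name.strip().lower() == "q":
            try:
                q = float(value)
            except ValueError:
                q = 0.0  # assumption: treat a malformed qvalue as a refusal
        qvalues[coding.strip().lower()] = q
    return qvalues


def choose_coding(header, available=("zstd", "br", "gzip")):
    """Return the coding to use, or None if nothing is acceptable (406).

    `available` is in server preference order and breaks qvalue ties.
    """
    qvalues = parse_accept_encoding(header)
    wildcard = qvalues.get("*")

    best, best_q = None, 0.0
    for coding in available:
        # Rule 1: use the listed qvalue. Rule 2: "*" covers unlisted codings.
        q = qvalues.get(coding, wildcard if wildcard is not None else 0.0)
        if q > best_q:  # Rule 3: highest non-zero qvalue wins
            best, best_q = coding, q

    # Rule 4: identity is acceptable unless refused explicitly or via "*;q=0".
    if "identity" in qvalues:
        identity_q = qvalues["identity"]
    elif wildcard is not None:
        identity_q = wildcard
    else:
        identity_q = None  # not mentioned at all: acceptable as a last resort

    if identity_q is not None and identity_q > best_q:
        return "identity"
    if best is not None:
        return best
    return "identity" if identity_q is None else None
```

For the example header shown earlier, `choose_coding` picks `zstd` when it is available and falls back to `br` when the server only has `br` and `gzip`.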
|
||
|
||
If you're targeting web browsers, check the compatibility table of [compression | ||
algorithms on MDN Web Docs][^2]. | ||
|
||
Another important rule is that if the server compresses the response, it | ||
must include a `Content-Encoding` header in the response. | ||
|
||
If the content-coding of an entity is not "identity", then the | ||
response MUST include a Content-Encoding entity-header (section | ||
14.11) that lists the non-identity content-coding(s) used. | ||
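As an illustration of this rule, the sketch below builds response headers for an already negotiated content-coding. It only covers `gzip` and `identity` using the standard library, and `encode_body` is a hypothetical helper, not code from this PR.

```python
import gzip


def encode_body(body, coding):
    """Return (payload, headers) for an already chosen content-coding."""
    headers = {"Content-Type": "application/vnd.apache.arrow.stream"}
    if coding == "gzip":
        payload = gzip.compress(body)
        # Required whenever the coding is not "identity".
        headers["Content-Encoding"] = "gzip"
    elif coding == "identity":
        payload = body  # sent as-is, so no Content-Encoding header
    else:
        raise ValueError(f"unsupported content-coding: {coding}")
    headers["Content-Length"] = str(len(payload))
    return payload, headers
```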
|
||
I need to expand this text to talk about how Arrow streams can be piped into a compressed response. Perhaps having it before this summarized explanation of encoding negotiation.
||
Since not all servers implement the full `Accept-Encoding` header parsing logic, | ||
clients tend to stick to simple header values like `Accept-Encoding: identity` | ||
when no compression is desired, and `Accept-Encoding: gzip, deflate, zstd, br` | ||
when the client supports different compression formats and is indifferent to | ||
which one the server chooses. Clients should expect uncompressed responses as | ||
well in these cases. The only way to force a "406 Not Acceptable" response when | ||
no compression is available is to send `identity;q=0` or `*;q=0` at the end of | ||
the `Accept-Encoding` header. But that relies on the server implementing the | ||
full `Accept-Encoding` handling logic. | ||
|
||
|
||
[^1]: [Fielding, R. et al. (1999). HTTP/1.1. RFC 2616, Section 14.3 Accept-Encoding.](https://www.rfc-editor.org/rfc/rfc2616#section-14.3) | ||
[^2]: [MDN Web Docs: Content-Encoding (browser compatibility)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding#browser_compatibility) | ||
[^3]: [Arrow Columnar Format: Compression](https://arrow.apache.org/docs/format/Columnar.html#compression) | ||
[^4]: Web applications using the JavaScript Arrow implementation don't have | ||
access to the compression APIs to decompress `zstd` and `lz4` IPC buffers. | ||
[^5]: [Arrow Implementation Status: IPC Format](https://arrow.apache.org/docs/status.html#ipc-format) | ||
|
||
[ipc]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
<!--- | ||
Licensed to the Apache Software Foundation (ASF) under one | ||
or more contributor license agreements. See the NOTICE file | ||
distributed with this work for additional information | ||
regarding copyright ownership. The ASF licenses this file | ||
to you under the Apache License, Version 2.0 (the | ||
"License"); you may not use this file except in compliance | ||
with the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, | ||
software distributed under the License is distributed on an | ||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations | ||
under the License. | ||
--> | ||
|
||
# HTTP GET Arrow Data: Compressed Arrow Data Examples | ||
|
||
This directory contains a simple `curl` script that issues multiple HTTP GET | ||
requests to the server implemented in the parent directory, negotiating | ||
different compression algorithms for the Arrow IPC stream data and piping the | ||
output to different files with extensions that indicate the compression | ||
algorithm used. | ||
|
||
To run this example, first start one of the server examples in the parent | ||
directory, then run the `client.sh` script. | ||
|
||
You can check all the sizes with a simple command: | ||
|
||
```bash | ||
$ du -sh out* | sort -gr | ||
816M out.arrows | ||
804M out_from_chunked.arrows | ||
418M out_from_chunked.arrows+lz4 | ||
405M out.arrows+lz4 | ||
257M out.arrows.gz | ||
256M out_from_chunked.arrows.gz | ||
229M out_from_chunked.arrows+zstd | ||
229M out.arrows+zstd | ||
220M out.arrows.zstd | ||
219M out_from_chunked.arrows.zstd | ||
39M out_from_chunked.arrows.br | ||
38M out.arrows.br | ||
``` | ||
|
||
> [!WARNING] | ||
> Better compression is not the only relevant metric as it might come with a | ||
> trade-off in terms of CPU usage. The best compression algorithm for your use | ||
> case will depend on your specific requirements. | ||
|
||
## Meaning of the file extensions | ||
|
||
Files produced by HTTP/1.0 requests are not chunked: they get buffered in memory | ||
at the server before being sent to the client. If compressed, they end up | ||
slightly smaller than the results of chunked responses, but the extra delay | ||
before the first byte is not worth it in most cases. | ||
|
||
- `out.arrows` (Uncompressed) | ||
- `out.arrows.gz` (Gzip HTTP compression) | ||
- `out.arrows.zstd` (Zstandard HTTP compression) | ||
- `out.arrows.br` (Brotli HTTP compression) | ||
|
||
- `out.arrows+zstd` (Zstandard IPC compression) | ||
- `out.arrows+lz4` (LZ4 IPC compression) | ||
|
||
Responses to HTTP/1.1 requests are sent by the server with | ||
`Transfer-Encoding: chunked`, so the data goes out in smaller chunks that are | ||
written to the socket as soon as they are ready. This is useful for large | ||
responses that take a long time to generate, at the cost of a small overhead | ||
caused by the independent compression of each chunk. | ||
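One way to produce compressed output incrementally is sketched below, using a single gzip stream that is flushed after every chunk so each piece can be decoded on arrival. This is an assumption about how such streaming could be done, not necessarily what the server in this PR does, and `gzip_chunks` is a hypothetical name. The per-chunk flush is what costs a little compression ratio compared with compressing the fully buffered body.

```python
import zlib


def gzip_chunks(chunks):
    """Yield gzip-compressed bytes for each input chunk as soon as possible."""
    # wbits=31 selects the gzip container (16 + a 15-bit window).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        # Z_SYNC_FLUSH emits everything compressed so far, so the receiver
        # can decode this chunk without waiting for the rest of the stream.
        data = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
        if data:
            yield data
    yield compressor.flush()  # final flush writes the gzip trailer
```

Each yielded piece would then be written to the socket as one HTTP chunk.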
|
||
- `out_from_chunked.arrows` (Uncompressed) | ||
- `out_from_chunked.arrows.gz` (Gzip HTTP compression) | ||
- `out_from_chunked.arrows.zstd` (Zstandard HTTP compression) | ||
- `out_from_chunked.arrows.br` (Brotli HTTP compression) | ||
|
||
- `out_from_chunked.arrows+lz4` (LZ4 IPC compression) | ||
- `out_from_chunked.arrows+zstd` (Zstandard IPC compression) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
#!/bin/sh | ||
|
||
# Licensed to the Apache Software Foundation (ASF) under one | ||
# or more contributor license agreements. See the NOTICE file | ||
# distributed with this work for additional information | ||
# regarding copyright ownership. The ASF licenses this file | ||
# to you under the Apache License, Version 2.0 (the | ||
# "License"); you may not use this file except in compliance | ||
# with the License. You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, | ||
# software distributed under the License is distributed on an | ||
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
|
||
CURL="curl --verbose" | ||
URI="http://localhost:8008" | ||
OUT_HTTP1=out.arrows | ||
OUT_CHUNKED=out_from_chunked.arrows | ||
|
||
# HTTP/1.0 means that response is not chunked and not compressed... | ||
$CURL --http1.0 -o $OUT_HTTP1 $URI | ||
# ...but it may be compressed with an explicitly set Accept-Encoding | ||
# header | ||
$CURL --http1.0 -H "Accept-Encoding: gzip, *;q=0" -o $OUT_HTTP1.gz $URI | ||
$CURL --http1.0 -H "Accept-Encoding: zstd, *;q=0" -o $OUT_HTTP1.zstd $URI | ||
$CURL --http1.0 -H "Accept-Encoding: br, *;q=0" -o $OUT_HTTP1.br $URI | ||
# ...or with IPC buffer compression if the Accept header specifies codecs. | ||
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_HTTP1+zstd $URI | ||
$CURL --http1.0 -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_HTTP1+lz4 $URI | ||
|
||
# HTTP/1.1 means compression is on by default... | ||
# ...but it can be refused with the Accept-Encoding: identity header. | ||
$CURL -H "Accept-Encoding: identity" -o $OUT_CHUNKED $URI | ||
# ...with gzip if no Accept-Encoding header is set. | ||
$CURL -o $OUT_CHUNKED.gz $URI | ||
# ...or with the compression algorithm specified in the Accept-Encoding. | ||
$CURL -H "Accept-Encoding: zstd, *;q=0" -o $OUT_CHUNKED.zstd $URI | ||
$CURL -H "Accept-Encoding: br, *;q=0" -o $OUT_CHUNKED.br $URI | ||
# ...or with IPC buffer compression if the Accept header specifies codecs. | ||
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=\"zstd, lz4\"" -o $OUT_CHUNKED+zstd $URI | ||
$CURL -H "Accept: application/vnd.apache.arrow.stream; codecs=lz4" -o $OUT_CHUNKED+lz4 $URI |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
<!--- | ||
Licensed to the Apache Software Foundation (ASF) under one | ||
or more contributor license agreements. See the NOTICE file | ||
distributed with this work for additional information | ||
regarding copyright ownership. The ASF licenses this file | ||
to you under the Apache License, Version 2.0 (the | ||
"License"); you may not use this file except in compliance | ||
with the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, | ||
software distributed under the License is distributed on an | ||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations | ||
under the License. | ||
--> | ||
|
||
# HTTP GET Arrow Data: Compressed Arrow Data Examples | ||
|
||
This directory contains an HTTP client implemented in Python that issues multiple | ||
requests to one of the server examples implemented in the parent directory, | ||
negotiating different compression algorithms for the Arrow IPC stream data. | ||
|
||
To run this example, first start one of the compressed server examples in the | ||
parent directory, then: | ||
|
||
```sh | ||
pip install pyarrow | ||
python client.py | ||
``` |