http: Compressed response example in Python #35
I still need to write the client that decompresses and parses the stream. I want to measure things like time-to-first-batch and how long it takes to download and consume streams of different compression algorithms. For now, the
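A client along those lines might look like the sketch below. It assumes the `requests` and `zstandard` packages and a made-up endpoint URL; it is not the example's actual client.

```python
import time

import pyarrow as pa
import requests
import zstandard

# Hypothetical endpoint; the real example serves batches from server.py.
response = requests.get(
    "http://localhost:8008",
    headers={"Accept-Encoding": "zstd"},
    stream=True,
)
start = time.monotonic()
# response.raw exposes the still-compressed body; decompress it incrementally.
stream = zstandard.ZstdDecompressor().stream_reader(response.raw)
reader = pa.ipc.open_stream(stream)
first = reader.read_next_batch()
print(f"time to first batch: {time.monotonic() - start:.3f}s")
total = first.num_rows + sum(batch.num_rows for batch in reader)
print(f"records: {total}, total time: {time.monotonic() - start:.3f}s")
```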
> If the content-coding of an entity is not "identity", then the
> response MUST include a Content-Encoding entity-header (section
> 14.11) that lists the non-identity content-coding(s) used.
I need to expand this text to talk about how Arrow streams can be piped into a compressed response. Perhaps that should come before this summarized explanation of encoding negotiation.
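For illustration, a negotiated exchange might look like this (the header values are made up for the example; `application/vnd.apache.arrow.stream` is the registered media type for Arrow IPC streams):

```http
GET /data HTTP/1.1
Host: example.com
Accept-Encoding: zstd;q=1.0, gzip;q=0.5, identity;q=0

HTTP/1.1 200 OK
Content-Type: application/vnd.apache.arrow.stream
Content-Encoding: zstd
Transfer-Encoding: chunked
```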
```python
def check_parser(s, expected):
    try:
        parsed = parse_accept_encoding(s)
        # print("parsed:", parsed, "\nexpected:", expected)
        assert parsed == expected
    except ValueError as e:
        print(e)


check_parser("", [])
expected = [("gzip", None), ("zstd", 1.0), ("*", None)]
check_parser("gzip, zstd;q=1.0, *", expected)
check_parser("gzip , zstd; q= 1.0 , *", expected)
expected = [("gzip", None), ("zstd", 1.0), ("*", 0.0)]
check_parser("gzip , zstd; q= 1.0 \t \r\n ,*;q =0", expected)
expected = [("zstd", 1.0), ("gzip", 0.5), ("br", 0.8), ("identity", 0.0)]
check_parser("zstd;q=1.0, gzip;q=0.5, br;q=0.8, identity;q=0", expected)
```
These "unit tests" will be removed before merge.
Thanks! At a quick glance this looks great so far. I'll take a closer look soon.
Stats when running the server.py/client.py pair on the same M1 Pro MacBook:

```console
$ python client.py
```
The uncompressed response size is almost 1 GB. I think Brotli is getting a really high compression ratio here because the batches of data are random slices of the same base array.
From one laptop to another on my home Wi-Fi and 1/10 of the records:

```console
$ python client.py
```
We can make Brotli less impressive by feeding it more random data:
(generated batches with random values instead of simply slicing from a big array)

```diff
--- a/http/get_compressed/python/server/server.py
+++ b/http/get_compressed/python/server/server.py
@@ -72,13 +72,14 @@ def example_batches(tickers):
     total_records = 42_000_000
     batch_len = 6 * 1024
     # all the batches sent are random slices of the larger base batch
-    base_batch = example_batch(tickers, length=8 * batch_len)
+    # base_batch = example_batch(tickers, length=8 * batch_len)
     batches = []
     records = 0
     while records < total_records:
         length = min(batch_len, total_records - records)
-        offset = randint(0, base_batch.num_rows - length - 1)
-        batch = base_batch.slice(offset, length)
+        # offset = randint(0, base_batch.num_rows - length - 1)
+        # batch = base_batch.slice(offset, length)
+        batch = example_batch(tickers, length)
         batches.append(batch)
         records += length
     return batches
```

What is the CPU overhead, though? All requests are over

```console
$ python client.py
```
Now with the server running on a different laptop, loading the response over Wi-Fi with 1/10 of the data. Zstd is still the winner.

```console
$ python client.py
```
Arrow IPC buffer compression is probably preferable to HTTP compression.

Why?

Because it only supports modern compressors with extremely high decompression speed (lz4, zstd).
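In pyarrow, for example, buffer compression is enabled through the IPC writer options; the schema and data below are made up for illustration:

```python
import pyarrow as pa

schema = pa.schema([("ticker", pa.utf8()), ("price", pa.float64())])
batch = pa.record_batch(
    [pa.array(["AAPL", "GOOG"]), pa.array([187.4, 141.8])], schema=schema
)

# Only "lz4" and "zstd" are accepted as IPC buffer compression codecs.
options = pa.ipc.IpcWriteOptions(compression="zstd")
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema, options=options) as writer:
    writer.write_batch(batch)
```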
Browsers support
I'm experimenting with dynamic buffer sizing and frame flushes that align with the HTTP chunk boundaries to improve latency, so I can ensure metadata reaches the client as soon as possible. When the compressed stream comes from the network, the cost of decompressing a single block of the Zstd stream to get the metadata is small relative to the network latency. Buffer-level compression can also increase latency, because the whole buffer must be compressed before its length is known and can be put on the stream.
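A sketch of what that alignment might look like on the server side, assuming the `zstandard` package (the function name and chunking policy are illustrative, not the example's actual code):

```python
import io

import pyarrow as pa
import zstandard


def compressed_chunks(schema, batches):
    """Yield one HTTP chunk per batch, flushing the Zstd frame each time.

    FLUSH_BLOCK makes everything emitted so far decodable by the client,
    so the first chunk already carries the stream's schema metadata.
    """
    compressor = zstandard.ZstdCompressor(level=3).compressobj()
    sink = io.BytesIO()
    with pa.ipc.new_stream(sink, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)
            ipc_bytes = sink.getvalue()
            sink.seek(0)
            sink.truncate()
            chunk = compressor.compress(ipc_bytes)
            chunk += compressor.flush(zstandard.COMPRESSOBJ_FLUSH_BLOCK)
            if chunk:
                yield chunk
    # The writer's end-of-stream marker is still buffered in sink.
    yield compressor.compress(sink.getvalue()) + compressor.flush(
        zstandard.COMPRESSOBJ_FLUSH_FINISH
    )
```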
I added the option to dictionary-encode a column in the compression example. Results are interesting. From not dictionary-encoded to sharing the same dictionary of 60 strings in the
Interestingly, Brotli compresses better when the data is not dictionary-encoded.
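For reference, dictionary-encoding a column in pyarrow is a one-liner (the values below are made up, not the example's data):

```python
import pyarrow as pa

tickers = pa.array(["AAPL", "GOOG", "AAPL", "MSFT", "AAPL"])
encoded = tickers.dictionary_encode()
print(encoded.type)  # dictionary<values=string, indices=int32, ordered=0>
```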
Now including dictionary encoding at data generation time (in all these cases) and IPC buffer compression.
Timings
@felipecrv would you say that the following are valid interpretations of the timings and compression ratios above?
There might be some Python overhead in these HTTP compression examples, because the buffer compression happens completely inside the C++ layer while the HTTP examples connect different pyarrow classes. This is still a merit of IPC buffer compression, since Python might be present on both client and server. Buffer compression is really beneficial to the IPC stream parser. The numbers above look very good.
I would recommend
I would emphasize the
Indeed.
@felipecrv thanks — that all sounds good.
Agreed. I think we only need to warn developers away from it when they don't control the application on the other end of the connection. (For example, if you're a SaaS service adding an Arrow-over-HTTP protocol.) Another issue to consider is that if you are caching results in Arrow format and serving them to HTTP clients as opaque binary data, and you need to make this work with the lowest-common-denominator Arrow client library, then IPC buffer compression is a poor option for you.
Not so clear-cut, because you could use IPC buffer compression and decompress right before returning, to save on cache storage space.
Perhaps... but do any Arrow implementations provide efficient, easy-to-use methods to convert an IPC stream with compressed buffers into an IPC stream with uncompressed buffers? This could be done simply enough with a small procedure in most of the Arrow implementations, but I'm not sure how efficient it would be.
It would require an intermediate step. But if the cache exists to reduce the need for some expensive computation that produces the RecordBatch stream, even expensive decompression and re-serialization could pay off.
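In pyarrow, that intermediate step could be a small procedure along these lines (an unvetted sketch of the idea, not a benchmarked implementation):

```python
import pyarrow as pa


def decompress_ipc_stream(compressed: bytes) -> bytes:
    """Rewrite an IPC stream with compressed buffers as an uncompressed one."""
    reader = pa.ipc.open_stream(compressed)  # buffers are decompressed on read
    sink = pa.BufferOutputStream()
    options = pa.ipc.IpcWriteOptions(compression=None)
    with pa.ipc.new_stream(sink, reader.schema, options=options) as writer:
        for batch in reader:
            writer.write_batch(batch)
    return sink.getvalue().to_pybytes()
```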
This is really fantastic @felipecrv.
I left a couple of comments and a code suggestion, but I also had some high-level thoughts:
- To follow the convention established in other examples in this repo, the `server` and `client` folders should get concise README.md files with instructions on how to prepare an environment to run the example server and client.
- The current README has turned into a really useful guidance document which I think should live on the main website. If we did that, this README could be made into a short document like the others in this repo and could link to that guidance document.
- The benchmarking you've done in PR comments may be one of the most useful exports of this work. I think that should get published somewhere.
Thanks for doing this @felipecrv! A couple of requests:
@amoeba wrote:
Could the convention be broken here, considering that readers will be better off reading the recommendations and picking a single compression algorithm to use in their own application?
The goal of this repo is to serve as a staging area for content that will eventually move to the official docs.
Benchmarking compression is very tricky because it depends heavily on the distribution of the values, and using randomly generated data like I did here pretty much determines the compression ratio one gets. It's better for people to get a feel for how compression behaves on their own workloads, CPU budgets, and network bandwidth budgets.
Co-authored-by: Ian Cook <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
@ianmcook wrote:
Makes sense! Doing it now.
Closes apache/arrow#40601