Skip to content

Made mmap-related code compatible with Cygwin #10

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mvanleeu
Copy link
Contributor

The whole logic of first blocking the memory + guard page, only to overwrite it afterwards in a fixed location does not work in cygwin (startp_ gets a value of 0xffffffffffff). So created a Cygwin path and simplified the mmap to a maximum.
Didn't bother to #ifdef all references to guardp_ though, which may have been cleaner.

Testing has been limited to the benchmark files, fitting in RAM.

The whole logic of first blocking the memory + guard page, only to overwrite it then in the fixed location does not work in cygwin (startp_ gets a value of 0xffffffffffff). So created a Cygwin path and simplified the mmap to a maximum.
Didn't bother to #ifdef all references to guardp_ though, which may have been cleaner.

Testing has been limited to the benchmark files, fitting in RAM.
@dw
Copy link
Owner

dw commented Mar 26, 2022

I haven't touched this code in a long time, but IIRC it is always necessary to be able to run 15 bytes over the end of the buffer, to allow pcmpistri to begin on the last byte (which might be necessary depending on the length of the file, and/or the prior data bytes appearing earlier in the buffer)

Won't merge this one without a bit more clarity, it looks like it might be breaking that requirement

Verify if the use of SIMD instructions do not cause to read after the end of the mmap buffer.
Test files = length-{_n_}.csv with 'n' == 0 to 15 and representing the result of  file size(length-_n_.csv) modulo 16, as the SIMD instruction tested processes 16 bytes at a time.
Separate unittest file as the existing one somehow crashed.
@mvanleeu
Copy link
Contributor Author

Indeed, the description of csvmonkey mentions the approach of parsing 16 bytes at a time via SIMD.
To see if I was just lucky running the test in an ideal scenario, I have doctored some small test data file that have slightly different lengths (i.e. "length of file" modulo 16 == "length-{rest_value}.csv", with the last field of the file filled with the character 'x' equal to 'rest_value + 1' . I created test case to read and parse the last field of the file and verify if the number of 'x' characters matches.
Somehow, none of the length made the test program crashed or returned wrong values, which I would have expected if the mmap would cause the SMID instruction to process data that was after the end of the mmap.
Please review if this test case is actually testing the right thing.

@mvanleeu
Copy link
Contributor Author

$ python3 tests/csvmonkey_test_mmap.py
.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants