From dd311b559df5cd9213cbbb7fe39633eac2b5e44d Mon Sep 17 00:00:00 2001 From: alexanderjohnston Date: Sun, 14 Jul 2024 13:09:28 -0500 Subject: [PATCH 1/2] from_chars is about 2/3 faster --- include/csvmonkey.hpp | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/include/csvmonkey.hpp b/include/csvmonkey.hpp index 9e13ad4..e7b5766 100644 --- a/include/csvmonkey.hpp +++ b/include/csvmonkey.hpp @@ -340,7 +340,13 @@ struct CsvCell qi::parse(ptr, ptr+size, double_, n); return n; #else - return strtod(ptr, NULL); + double number; + std::from_chars( + ptr, + ptr + size, + number + ); + return number; #endif } }; From f6295cb604c8d59dcd51478f7cdaca32415395f2 Mon Sep 17 00:00:00 2001 From: aljohnston112 Date: Mon, 19 May 2025 11:44:16 -0500 Subject: [PATCH 2/2] Update README.md --- README.md | 214 +----------------------------------------------------- 1 file changed, 1 insertion(+), 213 deletions(-) diff --git a/README.md b/README.md index 511cd1f..1d7dac0 100644 --- a/README.md +++ b/README.md @@ -1,213 +1 @@ -# csvmonkey - -This is a header-only vectorized, lazy-decoding, zero-copy CSV file parser. -Given appropriate input data and hardware, the C++ version can tokenize ~1.9 -GiB/sec of input in one thread. For a basic summing task, the Python version is -~5x faster than `csv.reader` and ~10x faster than `csv.DictReader`, while -maintaining a similarly usable interface. - -**This still requires a ton of work. For now it's mostly toy code.** - -Requires a CPU supporting Intel SSE4.2 and a C++11 compiler that bundles -`smmintrin.h`. For non-SSE4.2 machines, a reasonable fallback implementation is -also provided. - -As of writing, csvmonkey comfortably leads Ewan Higg's csv-game -microbenchmark of 24 CSV parsers. - - -## How It Works - -* **Vectorized**: scanning for values that change parser state is done using - Intel SSE 4.2 PCMPISTRI instruction. PCMPISTRI can locate the first occurence - of up to four values within a 16 byte vector, allowing searching 16 input - bytes for end of line, escape, quote, or field separators in one instruction. - -* **Zero Copy**: the user supplies the parser's input buffer. The output is an - array of column offsets within a row, each flagged to indicate whether an - escape character was detected. The highest throughput is achieved in - combination with memory-mapped files, where none of the OS, application or - parser make any bulk copies. - -* **Lazy Decoding**: input is not copied or unquoted until requested. Since a - flag indicates the presence of escapes, a fast path can avoid any bytewise - decode in the usual case where no escape is present. Due to lazy decoding, - csvmonkey is extremely effective at tasks that scan only a subset of data, - for example one or two columns from a large many-columned CSV. This use case - is the original motivation for the design. - -* **Header Only**: the parser has no third-party dependencies, just some - templates defined in ``csvmonkey.hpp``. - - -## Python Usage - -You can install the library through pip: - -``` -pip install csvmonkey -``` - -If this fails on ubuntu, first install `clang`: - -``` -sudo apt-get install clang -``` - -Then run: - -``` -CC=clang pip install csvmonkey -``` - -You can also install the library locally by cloning this repo and running: - -``` -pip install -e . -``` - -Then you can use it likewise: - -1. `import csvmonkey` -2. `csvmonkey.from_path()` for a memory-mapped file, `csvmonkey.from_file()` - for any file-like object with a `read()` method, or `csvmonkey.from_iter()` - for any iterable object that yields lines or file chunks, e.g. - `from_iter(open("ram.csv"))`. - -By default a magical `Row` object is yielded during iteration. This object is -only a window into the currently parsed data, and becomes invalid upon the next -iteration. Row data is accessed either by index or by key (if `header=True`) -using: - -``` -for row in csvmonkey.from_path("ram.csv", header=True): - row[20] # by index - row["UnBlendedCost"] # by header value -``` - -If your CSV contains a header, specify `header=True` to read it during -construction. If the CSV lacks a header but dict-like behaviour is desired, -pass a header explicitly as `header=("a", "b", "c", "d")`. - -Element access causes the relevant chunk of the row to be copied to the heap -and returned as a Python string. - -Rows may be converted to dicts via `row.asdict()`, tuples via -`row.astuple()` or lists via `row.aslist()`. If you want rows to be produced -directly in concrete form, pass `yields="list"`, `yields="tuple"`, -`yields="dict"` keyword arguments. - - -### Unicode - -Unicode is supported for character sets where delimiters and line endings are -represented by one byte. To configure Unicode, pass an ``encoding`` parameter, -and optionally an ``errors`` parameter. - -* "bytes": Return bytes (default on Python 2) -* "utf-8": Decode as UTF-8 (default on Python 3) -* "ascii": Decode as ASCII -* "latin1": Decode as LATIN1 -* "locale": Decode according to the active C locale -* "...": Decode according some codec "..." known to Python - -Where possible, prefer exact spelling and case matching one of the above -encodings, to ensure an associated fast path is used. - - -## Python Benchmark - -ram.csv is 614MiB with 1,540,093 records of 22 columns and approximately 418 -bytes per record. An anonymized version is checked into LFS as -``tests/data/anon-ram.csv.zstd``. - -Python 2.7 Sum: convert to float and sum single column: - -| Mode | Rate | Ratio | i7-6700HQ | Xeon E5530 | Core i5-2435M | -|--------------------------|------------|-------|-----------|------------|---------------| -| csvmonkey lazy decode | 1098 MiB/s | - | 0.559s | 0.9s | 1.29s | -| csvmonkey yields="tuple" | 642 MiB/s | 1.7x | 0.956s | 1.87s | 2.17s | -| csvmonkey yields="dict" | 281 MiB/s | 3.8x | 2.18s | 4.57s | 5.04s | -| csv.reader | 223 MiB/s | 4.9x | 2.75s | 5.88s | 11.1s | -| csv.DictReader | 85 MiB/s | 12.7x | 7.15s | 16.3s | 25.0s | - -Python 2.7 No-op: Iterate complete file, no other processing: - -| Mode | Rate | Ratio | i7-6700HQ | Xeon E5530 | -|--------------------------|------------|-------|-----------|------------| -| csvmonkey lazy decode | 1906 MiB/s | - | 0.322s | 0.444s | -| csvmonkey yields="tuple" | 831 MiB/s | 2.3x | 0.738s | 1.4s | -| csvmonkey yields="dict" | 318 MiB/s | 6.0x | 1.93s | 4.26s | -| csv.reader | 248 MiB/s | 7.6x | 2.47s | 5.31s | -| csv.DictReader | 92 MiB/s | 20.5x | 6.62s | 15.2s | - -Python 3.6 No-op: Iterate complete file, includes charset decoding - -| Mode | Rate | Ratio | i7-6700HQ | -|---------------------------------------------|------------|------------|------------| -| csvmonkey lazy decode | 1906 MiB/s | - | 0.322s | -| csvmonkey yields="tuple", encoding="bytes" | 833 MiB/s | 2.3x | 0.737s | -| csvmonkey yields="tuple", encoding="latin1" | 579 MiB/s | 3.3x | 1.06s | -| csvmonkey yields="tuple" | 495 MiB/s | 3.8x | 1.24s | -| csvmonkey yields="dict" | 235 MiB/s | 8.1x | 2.61s | -| csv.reader | 121 MiB/s | 15.7x | 5.07s | -| csv.DictReader | 55 MiB/s | 34.4x | 11.1s | - - -### Command lines - -Sum: - -``` -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'sum(float(row["UnBlendedCost"]) for row in csvmonkey.from_path("ram.csv", header=True))' -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'sum(float(row[20]) for row in csvmonkey.from_path("ram.csv", header=True, yields="tuple"))' -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'sum(float(row["UnBlendedCost"]) for row in csvmonkey.from_path("ram.csv", header=True, yields="dict"))' -python -mtimeit -n1 -r3 -s 'import csv' 'r = csv.reader(open("ram.csv")); next(r); sum(float(row[20]) for row in r)' -python -mtimeit -n1 -r3 -s 'import csv' 'sum(float(row["UnBlendedCost"]) for row in csv.DictReader(open("ram.csv")))' -``` - -No-op: - -``` -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'all(csvmonkey.from_path("ram.csv", header=True))' -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'all(csvmonkey.from_path("ram.csv", header=True, yields="tuple"))' -python -mtimeit -n1 -r3 -s 'import csvmonkey' 'all(csvmonkey.from_path("ram.csv", header=True, yields="dict"))' -python -mtimeit -n1 -r3 -s 'import csv' 'all(csv.reader(open("ram.csv")))' -python -mtimeit -n1 -r3 -s 'import csv' 'all(csv.DictReader(open("ram.csv")))' -``` - - -## C++ Usage - -1. Copy `csvmonkey.hpp` to your project and include it. -1. `CFLAGS=-msse4.2 -O3` -1. See `Makefile` for an example of producing a profile-guided build (worth an - extra few %). -1. Instantiate `MappedFileCursor` (zero copy) or `FdStreamCursor` (buffered), attach it to a `CsvReader`. -1. Invoke `read_row()` and use `row().by_value()` to pick out `CsvCell` pointers for your desired columns. -1. Pump `read_row()` in a loop and use cell's `ptr()`, `size()`, `as_str()`, `equals()` and `as_double()` methods while `read_row()` returns true. - - -# TODO - -* COW pointer interface to `as_str()`. -* ~~Finish Python 3 support~~ -* ~~Ensure Python ReaderObject is always 16-byte aligned~~ -* Fix handling of last row when it: - * lacks newline, or - * is truncated after final quote, or - * is truncated within a quote, or - * is truncated within an escape -* Restartable: fix quadratic behaviour when `StreamCursor` yields lines and CSV - rows span lines -* ~~Python `from_file()` that uses `read()` in preference to `__iter__()`.~~ -* ~~Fix CRLF / LFCR handling.~~ -* ~~`StreamCursor` error / exception propagation.~~ -* ~~Remove hard 256 column limit & fix crash if it's exceeded.~~ -* ~~Ensure non-SSE fallback return codes match SSE when not found.~~ -* ~~Map single zero page after file pages in MappedFileCursor~~ -* ~~Add trailing 16 NUL bytes to BufferedStreamCursor~~ -* ~~Remove hard-coded page size~~ -* ~~(Single byte separator) Unicode support.~~ -* (Multi byte separator) Unicode support. +A fork of csvmonkey that is about 2/3 faster at the time of the last commit.