PDS - Portable Data Store


PDS (Portable Data Store) is a Python class for efficiently storing and retrieving large amounts of key-value data, where keys are hierarchical strings and values are arbitrary JSON-serializable Python objects. It leverages Zstandard (zstd) compression, including optional dictionary-based compression, to minimize storage space, particularly for datasets with repetitive structures or content (like collections of JSON objects, log entries, etc.).

It's designed for scenarios where you need to store structured data persistently in a single file but want faster random access and potentially better compression than simply storing individual JSON files or using less specialized formats.


Features

  • Hierarchical Keys: Use lists of strings (e.g., ["logs", "2025-05-02", "errors"]) to organize data.
  • JSON-Serializable Values: Store any Python object that can be serialized to JSON (dictionaries, lists, strings, numbers, booleans, etc.).
  • Metadata Storage: Attach a JSON-serializable dictionary as metadata to the entire store.
  • Efficient Compression: Utilizes Zstandard (zstd) for fast and effective compression.
    • No Compression: Option to store data uncompressed.
    • Standard Zstd: Compress values using zstd without a dictionary.
    • Dictionary Compression: Automatically train a zstd dictionary based on data samples during save for potentially significant size reduction on repetitive datasets.
  • Random Access Reads: Retrieve individual values efficiently using their key path via an internal index, without needing to scan the entire file.
  • Data Modification: Add, update (by adding with the same key), and remove keys. Changes are initially stored in temp files and consolidated during save.
  • Context Manager Support: Use with PDS(...) as store: for automatic resource cleanup (dispose).
  • Temporary File Management: Handles temporary storage for added/modified data transparently before saving.

When to Use PDS

PDS is particularly well-suited for scenarios where you need to:

  • Store numerous structured records: Manage collections of data like JSON objects, dictionaries, or logs efficiently.
  • Avoid managing many small files: Consolidate potentially thousands or millions of records into a single, portable file, improving I/O performance and simplifying file handling.
  • Achieve high compression ratios: Especially useful when records share common structures or repeating string content (like JSON keys or log message formats), leveraging Zstandard dictionary compression (zstd_dict).
  • Retrieve data primarily by key: Access specific records quickly using their known hierarchical string key path, without needing to scan the entire dataset.
  • Keep things simple: Use a straightforward key-value storage approach without the setup, schema requirements, or query language complexity of full SQL or NoSQL databases.
  • Prioritize JSON-like data: Store nested lists, dictionaries, strings, numbers, etc., naturally.

It's a good fit if a full database like SQLite seems like overkill, but storing individual files is too inefficient or cumbersome.

Installation

You can install python-pds directly from PyPI using pip:

pip install python-pds

This will automatically install the required zstandard dependency as well.

Note: The package is installed as python-pds, but you import it in your Python code as pds:

from pds import PDS

# Now you can use the PDS class
store = PDS()

Basic Usage

Creating and Saving

# Make sure to install first: pip install python-pds
from pds import PDS # Import the class from the 'pds' package
import os

filename = "my_data_store.pds"

# Remove old file if it exists
if os.path.exists(filename):
    os.remove(filename)

# Use zstd_dict mode (will train a dictionary on save if data warrants it)
# Use context manager for automatic cleanup
try:
    with PDS(compression_mode='zstd_dict') as pds:
        # Set some metadata
        pds.set_meta_data({
            "project": "Web Scraper",
            "version": "1.0",
            "timestamp": "2025-05-02T11:00:00Z"
        })

        # Add data using hierarchical keys
        pds.add_key(
            ["articles", "example.com", "article1"],
            {"title": "Example Article", "content": "Lots of text...", "tags": ["news", "example"]}
        )
        pds.add_key(
            ["articles", "example.com", "article2"],
            {"title": "Another Example", "content": "More text data...", "tags": ["tech"]}
        )
        pds.add_key(
            ["logs", "scraper1", "errors"],
            [{"error": "Timeout", "url": "..."}, {"error": "404", "url": "..."}]
        )
        pds.add_key(
            ["config", "settings"],
            {"retry_count": 3, "timeout": 30}
        )

        # Data is held in temp files until save

        print(f"Saving data to {filename}...")
        pds.save(filename)
        print("Save complete.")

except Exception as e:
    print(f"An error occurred: {e}")

Opening and Reading

from pds import PDS
import json

filename = "my_data_store.pds"

try:
    # Open an existing PDS file (no mode needed for opening)
    with PDS() as pds:
        print(f"Opening {filename}...")
        pds.open(filename)
        print("File opened.")

        # Read metadata
        print("\nMetadata:")
        print(json.dumps(pds.meta_data, indent=2))

        # Get the structure of keys
        print("\nKey Structure:")
        print(json.dumps(pds.get_keys(), indent=2))
        # Note: The actual values in the key structure are internal value IDs,
        # not the data itself. This just shows the hierarchy.

        # Read specific values
        print("\nReading specific keys:")
        article1_data = pds.read_key(["articles", "example.com", "article1"])
        print(f"Article 1 Title: {article1_data.get('title')}")

        error_logs = pds.read_key(["logs", "scraper1", "errors"])
        print(f"Number of errors logged: {len(error_logs)}")

        # Attempt to read a non-existent key
        try:
            non_existent = pds.read_key(["non", "existent", "key"])
        except KeyError as ke:
            print(f"\nCorrectly caught error for non-existent key: {ke}")

except FileNotFoundError:
    print(f"Error: File not found - {filename}")
except Exception as e:
    print(f"An error occurred: {e}")

Modifying Data

You can open an existing store, add/remove keys, and then save (usually to a new file, but you can also overwrite the existing one).

from pds import PDS
import os

filename_v1 = "my_data_store.pds"
filename_v2 = "my_data_store_v2.pds"

# Ensure v1 exists from previous example
if not os.path.exists(filename_v1):
    print(f"Error: {filename_v1} not found. Run the creation example first.")
else:
    try:
        # Open the existing store
        with PDS() as pds:
            pds.open(filename_v1)
            print(f"Opened {filename_v1} for modification.")

            # Add a new key
            pds.add_key(["status", "progress"], {"completed": 0.75, "state": "running"})
            print("Added new key ['status', 'progress']")

            # Remove an existing key
            try:
                pds.remove_key(["logs", "scraper1", "errors"])
                print("Removed key ['logs', 'scraper1', 'errors']")
            except KeyError:
                print("Key ['logs', 'scraper1', 'errors'] not found for removal.")

            # Update an existing key by adding it again
            pds.add_key(
                ["config", "settings"],
                {"retry_count": 5, "timeout": 60, "user_agent": "PDS Bot"} # Overwrites previous value
            )
            print("Updated key ['config', 'settings']")


            # Save changes to a new file
            print(f"\nSaving modified data to {filename_v2}...")
            pds.save(filename_v2)
            print("Save complete.")

            # You can now read from filename_v2

    except Exception as e:
        print(f"An error occurred during modification: {e}")

Compression Options

The compression_mode parameter of the PDS() constructor determines the compression strategy that will be applied the next time save() is called.

none

  • Description: No compression is applied to the values.
  • Pros: Fastest save/load times if disk I/O is not the bottleneck. Simple.
  • Cons: Results in the largest file sizes.
  • Use When: File size is not a concern, absolute maximum speed is needed, and data does not compress well anyway.

zstd_no_dict

  • Description: Compresses values using standard Zstandard (level 5) without a dictionary.
  • Pros: Good balance of compression speed and ratio. Generally fast. Good default choice.
  • Cons: May not achieve optimal compression for datasets with high repetition across many small records.
  • Use When: You need good compression without the overhead of dictionary training, or when your data consists of larger chunks with good internal repetition.

zstd_dict

  • Description: Compresses values using Zstandard (level 5) with a dictionary automatically trained on a sample of the data during the save() operation.
  • Pros: Can achieve significantly better compression ratios than zstd_no_dict for datasets containing many small records with shared patterns (e.g., JSON keys, log formats, common strings).
  • Cons: save() operation incurs overhead for sampling data and training the dictionary, making it slower than other modes, sometimes significantly so for very large datasets. Dictionary training requires memory to hold samples (configurable via dict_sample_size, default 2GB).
  • Use When: Minimizing file size is crucial, and the dataset contains significant repetitive elements across different keys/values. Suitable for collections of structured records like logs, JSON objects, configuration snippets, etc.
  • Configuration:
    • dict_sample_size: Max bytes of (decompressed) value data to sample for training (default: 2GB).
    • dict_target_size: Target size for the trained dictionary (default: ~110KB).

Choosing a Mode

  • Start with zstd_no_dict for a good general baseline.
  • If your data consists of many small items with clear repetition (lots of similar JSON keys, repeated text snippets), try zstd_dict and compare the file size and save time.
  • Use none only if compression provides negligible benefit or speed is paramount above all else.
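
The sketch below saves the same illustrative records under each mode so you can compare the resulting file sizes directly. It assumes the constructor parameters listed in the API Reference section; the keys and record contents are made up for the example.

import os
from pds import PDS

# 1,000 small, structurally similar records -- the kind of repetition
# that dictionary compression exploits well.
records = {f"record{i}": {"id": i, "status": "ok", "source": "sensor-a"}
           for i in range(1000)}

for mode in ("none", "zstd_no_dict", "zstd_dict"):
    filename = f"benchmark_{mode}.pds"
    with PDS(compression_mode=mode) as pds:
        for name, value in records.items():
            pds.add_key(["records", name], value)
        pds.save(filename)
    print(f"{mode:>12}: {os.path.getsize(filename)} bytes")

# Dictionary training can also be tuned; the values here are illustrative:
# PDS(compression_mode="zstd_dict",
#     dict_sample_size=256 * 1024 * 1024,  # sample at most 256 MB
#     dict_target_size=110 * 1024)         # aim for a ~110 KB dictionary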

API Reference

  • PDS(compression_mode: str = "zstd_dict", dict_sample_size: int = ..., dict_target_size: int = ...):
    • Constructor. Sets the intended compression mode for the next save(). Does not affect open().
  • add_key(keys_list: List[str], value: Any):
    • Adds or updates a value at the specified hierarchical key path. value must be JSON-serializable.
  • read_key(keys_list: List[str]) -> Any:
    • Retrieves the value associated with the key path. Raises KeyError if not found.
  • remove_key(keys_list: List[str]):
    • Removes the key and its associated value reference. Raises KeyError if not found.
  • save(filename: str):
    • Writes the entire data store (metadata, values, index) to the specified file using the intended compression mode set during __init__. Consolidates temporary changes. Trains dictionary if mode is zstd_dict.
  • open(filename: str):
    • Loads an existing PDS file into memory (metadata and index). Values are read on demand via read_key. Detects the compression mode used when the file was saved.
  • set_meta_data(meta_data: Dict[str, Any]):
    • Sets the metadata for the store. Must be a JSON-serializable dictionary.
  • get_keys() -> Union[List, Dict]:
    • Returns a deep copy of the keys index structure (containing internal value IDs, not the actual data). Useful for exploring the hierarchy.
  • dispose():
    • Closes the file handle (if open) and cleans up temporary directories. Automatically called when exiting a PDS context (with statement).

Comparison to Alternatives

PDS occupies a specific niche. Here's how it compares to other common data storage approaches:

vs. File System / Archives (e.g., storing individual JSON files, .tar.gz, .zip)

  • PDS Advantages: Significantly better I/O performance (single file handle vs. many), much better compression potential across records (especially with zstd_dict), efficient random access by key (impossible in simple archives without full decompression), easier management of a single artifact.
  • File System/Archive Advantages: Simplicity for basic cases, uses universally standard tools.

vs. SQL Databases (e.g., SQLite, DuckDB)

  • PDS Advantages: Simpler API for key-value operations (no SQL required), potentially better compression specifically for repetitive JSON structures via zstd_dict, lighter dependency footprint (Python + zstandard wheels vs. C/C++ based DBs).
  • SQL DB Advantages: PDS has no querying capabilities and handling relational data is cumbersome.

vs. Document Databases (e.g., TinyDB)

  • PDS Advantages: Significantly better space efficiency and likely performance due to binary storage and Zstandard compression (vs. TinyDB's plain JSON text storage), potentially scales better for very large datasets.
  • TinyDB Advantages: Pure Python, simple querying capabilities beyond exact key match, very easy to get started with.

vs. Python Standard Library (e.g., Pickle, Shelve)

  • PDS Advantages: More portable data format (JSON/zstd vs. Python-specific Pickle), avoids Pickle's security risks with untrusted data, generally more robust than shelve, integrated advanced compression. The data is also readable from other programming languages, as long as they can decompress Zstandard data.
  • Pickle/Shelve Advantages: Can store arbitrary Python objects (not just JSON-serializable ones), part of the standard library.

vs. Hierarchical/Array Formats (e.g., HDF5, Zarr)

  • PDS Advantages: Simpler API focused specifically on hierarchical keys and JSON-like document values, potentially less storage overhead for this data type.
  • HDF5/Zarr Advantages: Optimized for large, N-dimensional numerical arrays, support complex chunking/sharding strategies, rich ecosystem for scientific data. More feature-rich but potentially overkill for simple document storage.

vs. Embedded Key-Value Stores (e.g., RocksDB/LMDB bindings)

  • PDS Advantages: Higher-level API (handles JSON serialization, hierarchical keys, compression automatically), easier to use for the target data model.
  • KV Store Advantages: Significantly higher raw put/get performance, designed for low-level speed, may offer transactional guarantees. Usually require manual data serialization and have C/C++ dependencies.

Choose PDS when you need an efficient, single-file store for many JSON-like documents addressable by hierarchical keys, where compression is important, and complex querying is not a primary requirement.

File Format (.pds)

The PDS file format is structured sequentially as follows:

  1. Metadata Block:

    • [UINT4: Length M] - 4 bytes, unsigned integer, little-endian: Length of the metadata JSON bytes.
    • [Bytes: Metadata] - M bytes: UTF-8 encoded JSON string containing the store's metadata.
  2. Dictionary Information Block:

    • [INT4: Dictionary Length D] - 4 bytes, signed integer, little-endian:
      • D > 0: Length of the Zstandard dictionary data that follows. Compression mode was zstd_dict.
      • D == -1 (_ZSTD_NO_DICT): No dictionary data follows. Compression mode was zstd_no_dict.
      • D == -2 (_NO_COMPRESSION): No dictionary data follows. Compression mode was none.
    • [Bytes: Zstd Dictionary] - D bytes: The Zstandard dictionary data. Only present if D > 0.
  3. Value Blocks: (Variable number of blocks, stored sequentially)

    • Repeated for each value stored:
      • [UINT8: Value Length V] - 8 bytes, unsigned integer, little-endian: Length of the value data that follows.
      • [Bytes: Value Data] - V bytes: The actual value data. This data is compressed according to the mode indicated by the Dictionary Information Block (none, zstd_no_dict, or zstd_dict using the stored dictionary).
  4. Keys Index Block: (Stored at the end)

    • [Bytes: Compressed Index Data K] - K bytes: The Zstandard compressed (level 5) representation of the keys index. The index is a JSON structure mirroring the key hierarchy, but leaf nodes contain strings "|offset:length" pointing to the start (offset) and size (length) of the corresponding Value Data block within the file. The offset points after the [UINT8: Value Length V] prefix for that value.
    • [UINT4: Index Length K] - 4 bytes, unsigned integer, little-endian: Length of the compressed index data (K bytes) that immediately precedes it.

Note: The Keys Index itself is always compressed using zstd (level 5) during save, regardless of the compression_mode chosen for the main values.
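
To make the layout concrete, here is a sketch that reads the Metadata, Dictionary Information, and Keys Index blocks by hand, following the byte layout above. It is meant to illustrate the format, not to serve as a supported API, and it assumes the compressed index frame records its decompressed size (which zstandard's one-shot compression does by default).

import json
import os
import struct
import zstandard as zstd

with open("my_data_store.pds", "rb") as f:
    # 1. Metadata block: UINT4 length (little-endian), then UTF-8 JSON.
    (meta_len,) = struct.unpack("<I", f.read(4))
    meta = json.loads(f.read(meta_len).decode("utf-8"))

    # 2. Dictionary information block: INT4 (little-endian) discriminates
    # the compression mode, per the description above.
    (dict_len,) = struct.unpack("<i", f.read(4))
    if dict_len > 0:
        mode = "zstd_dict"
        zstd_dictionary = f.read(dict_len)  # raw trained dictionary bytes
    elif dict_len == -1:
        mode = "zstd_no_dict"
    else:  # -2
        mode = "none"

    # 3. Value blocks sit here; we skip them by seeking from the end.
    # 4. Keys index block: the final UINT4 holds the length K of the
    # zstd-compressed index JSON that immediately precedes it.
    f.seek(-4, os.SEEK_END)
    (index_len,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + index_len), os.SEEK_END)
    keys_index = json.loads(zstd.ZstdDecompressor().decompress(f.read(index_len)))

print("metadata:", meta)
print("compression mode:", mode)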

Considerations and Limitations

  • Memory Usage:
    • Opening a file loads the metadata and the (decompressed) keys index into memory. Index size depends on the number of keys and the depth of the hierarchy.
    • read_key loads the entire requested (decompressed) value into memory.
    • save with zstd_dict requires memory to hold data samples (up to dict_sample_size) during dictionary training.
    • Saving large datasets involves reading temporary files or existing value blocks, potentially leading to high memory usage if many large values are processed simultaneously.
  • Performance:
    • add_key performs a simple zstd compression and writes to a temporary file - relatively fast but involves I/O.
    • read_key performs a file seek, read, potentially zstd decompression, and JSON parsing. Generally fast for random access.
    • save is the most intensive operation. It reads all data (from temp files or the original file), potentially decompresses, re-compresses according to the target mode (including dictionary training if applicable), and writes everything sequentially. Can be slow for large datasets.
  • Concurrency: The PDS class is not thread-safe for concurrent write operations (add_key, remove_key, save on the same instance). Reading from the same instance in multiple threads might work depending on the underlying file handle's behavior but is not explicitly designed or tested for. Use separate PDS instances (potentially opening the same file read-only) for concurrent reads.
  • Atomicity: The save operation is not atomic. If the process is interrupted during save, the output file may be incomplete or corrupted. For critical applications, consider saving to a temporary file and then atomically renaming it upon successful completion; this logic is not currently implemented within the PDS class, but a minimal sketch of the pattern follows this list.
  • Error Handling: Uses standard Python exceptions. File corruption, resource exhaustion (memory/disk), or invalid data can lead to errors (IOError, MemoryError, zstd.ZstdError, json.JSONDecodeError, ValueError, etc.).
  • Large Individual Values: While the format supports large values (UINT8 for length), extremely large individual values (approaching or exceeding available RAM) could cause MemoryError during reading, saving, or dictionary sampling.
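
A minimal sketch of the atomic-save pattern mentioned above, using a hypothetical save_atomically helper (not part of the PDS API):

import os
from pds import PDS

def save_atomically(pds, filename):
    # Save to a temporary file next to the target, then atomically
    # replace the target. os.replace is atomic on both POSIX and
    # Windows when source and destination are on the same filesystem.
    tmp = filename + ".tmp"
    pds.save(tmp)
    os.replace(tmp, filename)

with PDS(compression_mode="zstd_no_dict") as pds:
    pds.add_key(["config", "settings"], {"retry_count": 3, "timeout": 30})
    save_atomically(pds, "my_data_store.pds")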

Contributing

Contributions are welcome! Please see the Contributing Guidelines for details on how to set up your development environment, report bugs, suggest features, and submit pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.
