# pystore

**Repository Path**: mirrors_ranaroussi/pystore

## Basic Information

- **Project Name**: pystore
- **Description**: Fast data store for Pandas time-series data
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2022-01-07
- **Last Updated**: 2025-12-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

PyStore - Fast data store for Pandas timeseries data
====================================================

.. image:: https://img.shields.io/badge/python-3.8+-blue.svg?style=flat
    :target: https://pypi.python.org/pypi/pystore
    :alt: Python version

.. image:: https://img.shields.io/pypi/v/pystore.svg?maxAge=60
    :target: https://pypi.python.org/pypi/pystore
    :alt: PyPi version

.. image:: https://img.shields.io/pypi/status/pystore.svg?maxAge=60
    :target: https://pypi.python.org/pypi/pystore
    :alt: PyPi status

.. image:: https://www.codefactor.io/repository/github/ranaroussi/pystore/badge
    :target: https://www.codefactor.io/repository/github/ranaroussi/pystore
    :alt: CodeFactor

.. image:: https://img.shields.io/github/stars/ranaroussi/pystore.svg?style=social&label=Star&maxAge=60
    :target: https://github.com/ranaroussi/pystore
    :alt: Star this repo

.. image:: https://img.shields.io/twitter/follow/aroussi.svg?style=social&label=Follow&maxAge=60
    :target: https://x.com/aroussi
    :alt: Follow me on X/Twitter

\

`PyStore <https://github.com/ranaroussi/pystore>`_ is a simple (yet powerful) datastore for Pandas dataframes, and while it can store any Pandas object, **it was designed with storing timeseries data in mind**.

It's built on top of `Pandas <https://pandas.pydata.org>`_, `Numpy <https://numpy.org>`_, `Dask <https://dask.org>`_, and `Parquet <https://parquet.apache.org>`_ (via `pyarrow <https://arrow.apache.org>`_) to provide an easy-to-use datastore for Python developers that can easily query millions of rows per second per client.

**New in 2025 Release (PR #77):**

* **MultiIndex Support** - Store and retrieve DataFrames with Pandas MultiIndex (see the sketch after this section)
* **Complex Data Types** - Full support for Timedelta, Period, Interval, Categorical dtypes
* **Timezone-Aware Operations** - Proper handling of timezone data with UTC storage
* **Async/Await Support** - Non-blocking I/O operations for better performance
* **Data Validation Framework** - Extensible validation rules for data integrity
* **Schema Evolution** - Handle schema changes over time with flexible strategies
* **Transaction Support** - Atomic operations with rollback capabilities
* **Performance Optimizations** - Streaming operations and memory management

**Performance Enhancements (Phase 3 Release):**

* **Streaming Operations** - Memory-efficient append for datasets larger than RAM
* **Batch Processing** - 5-10x faster parallel read/write operations
* **Intelligent Partitioning** - Automatic time-based and size-based partitioning
* **Memory Management** - 70-90% memory reduction with monitoring and optimization
* **Metadata Caching** - 100x faster metadata access with TTL cache
* **Query Optimization** - Column selection and predicate pushdown at storage level

Performance improvements include:

* Append 1M rows: 3.75x faster, 90% less memory
* Batch operations: 6x faster for multiple items
* Column selection: 4x faster when reading a subset of columns
* Filtered reads: 8x faster with predicate pushdown

==> Check out the introductory blog post for the reasoning and philosophy behind PyStore, as well as a detailed tutorial with code examples.

==> Follow the PyStore tutorial in Jupyter notebook format.
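The MultiIndex round-trip goes through the same ``write``/``item`` API shown in the Quickstart below. A minimal sketch (the store, collection, and item names here are illustrative, not part of the API):

.. code:: python

    import pandas as pd
    import pystore

    store = pystore.store('mydatastore')
    collection = store.collection('EXAMPLES')

    # Frame indexed by (symbol, timestamp)
    idx = pd.MultiIndex.from_product(
        [['AAPL', 'MSFT'], pd.date_range('2024-01-01', periods=3)],
        names=['symbol', 'timestamp'])
    df = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

    collection.write('multiindex_demo', df)

    # The MultiIndex is restored on read
    restored = collection.item('multiindex_demo').to_pandas()
    assert isinstance(restored.index, pd.MultiIndex)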
Quickstart
==========

Install PyStore
---------------

Install using ``pip``:

.. code:: bash

    $ pip install pystore --upgrade --no-cache-dir

Install using ``conda``:

.. code:: bash

    $ conda install -c ranaroussi pystore

**INSTALLATION NOTE:** If you don't have Snappy installed (compression/decompression library), you'll need to install it first (see the Dependencies section below).

Using PyStore
-------------

.. code:: python

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import pystore
    import yfinance as yf

    # Set storage path (optional)
    # Defaults to `~/pystore` or the `PYSTORE_PATH` environment variable (if set)
    pystore.set_path("~/pystore")

    # List stores
    pystore.list_stores()

    # Connect to the datastore (creating it if it doesn't exist)
    store = pystore.store('mydatastore')

    # List existing collections
    store.list_collections()

    # Access a collection (creating it if it doesn't exist)
    collection = store.collection('NASDAQ')

    # List items in the collection
    collection.list_items()

    # Load some data from yfinance
    aapl = yf.download("AAPL", multi_level_index=False)

    # Store the first 100 rows of the data in the collection under "AAPL"
    collection.write('AAPL', aapl[:100], metadata={'source': 'yfinance'})

    # Read the item's data
    item = collection.item('AAPL')
    data = item.data  # <-- Dask dataframe (see dask.pydata.org)
    metadata = item.metadata
    df = item.to_pandas()

    # Append the rest of the rows to the "AAPL" item
    collection.append('AAPL', aapl[100:])

    # Read the item's data again
    item = collection.item('AAPL')
    data = item.data
    metadata = item.metadata
    df = item.to_pandas()

    # --- Query functionality ---

    # Query available symbols based on metadata
    collection.list_items(some_key='some_value', other_key='other_value')

    # --- Snapshot functionality ---

    # Snapshot a collection
    # (point-in-time named reference for all current symbols in a collection)
    collection.create_snapshot('snapshot_name')

    # List available snapshots
    collection.list_snapshots()

    # Get a version of a symbol given a snapshot name
    collection.item('AAPL', snapshot='snapshot_name')

    # Delete a collection snapshot
    collection.delete_snapshot('snapshot_name')

    # ...

    # Delete the item from the current version
    collection.delete_item('AAPL')

    # Delete the collection
    store.delete_collection('NASDAQ')

Advanced Features
-----------------

**Async Operations:**

.. code:: python

    import asyncio
    from pystore import async_pystore

    async def async_example():
        async with async_pystore.store('mydatastore') as store:
            async with store.collection('NASDAQ') as collection:
                # Async write
                await collection.write('AAPL', df)

                # Async read
                df = await collection.item('AAPL').to_pandas()

    asyncio.run(async_example())

**Data Validation:**

.. code:: python

    from pystore import create_validator, ColumnExistsRule, RangeRule

    # Create a validator
    validator = create_validator([
        ColumnExistsRule(['Open', 'High', 'Low', 'Close']),
        RangeRule('Close', min_value=0)
    ])

    # Apply the validator to the collection
    collection.set_validator(validator)

**Schema Evolution:**

.. code:: python

    from pystore import SchemaEvolution, EvolutionStrategy

    # Enable schema evolution
    evolution = collection.enable_schema_evolution(
        'AAPL', strategy=EvolutionStrategy.FLEXIBLE
    )

    # Schema changes are handled automatically during append
    collection.append('AAPL', new_data_with_extra_columns)
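**Timezone-Aware Data:**

Timezone-aware frames (listed among the 2025 features) go through the same write/read path and are stored as UTC. A minimal sketch, assuming the ``collection`` object from the Quickstart above and an illustrative item name:

.. code:: python

    import pandas as pd

    # Timezone-aware index; per the release notes, values are stored as UTC
    idx = pd.date_range('2024-01-01 09:30', periods=3, freq='h',
                        tz='America/New_York')
    df = pd.DataFrame({'price': [100.0, 101.5, 99.8]}, index=idx)

    collection.write('tz_demo', df)
    restored = collection.item('tz_demo').to_pandas()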
**Complex Data Types:**

.. code:: python

    import pandas as pd

    # DataFrames with Period, Interval, Categorical, and nested object types
    # (column lengths must match, so each column has three rows here)
    df = pd.DataFrame({
        'period': pd.period_range('2024-01', periods=3, freq='M'),
        'interval': pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (2, 3)]),
        'category': pd.Categorical(['A', 'B', 'A']),
        'nested': [{'key': 'value'}, [1, 2, 3], None]
    })

    collection.write('complex_data', df)

**Performance Features:**

.. code:: python

    import pandas as pd

    # Streaming append for large datasets
    def data_generator():
        for chunk in pd.read_csv('huge_file.csv', chunksize=100000):
            yield chunk

    collection.append_stream('large_data', data_generator())

    # Batch operations
    items_to_write = {
        'item1': df1,
        'item2': df2,
        'item3': df3
    }
    collection.write_batch(items_to_write, parallel=True)

    # Read multiple items efficiently
    results = collection.read_batch(['item1', 'item2', 'item3'])

    # Memory-optimized reading
    from pystore.memory import optimize_dataframe_memory, read_in_chunks

    # Optimize DataFrame memory usage
    df = collection.item('large_item').to_pandas()
    df_optimized = optimize_dataframe_memory(df)  # Up to 70% memory reduction

    # Read in chunks for processing
    for chunk in read_in_chunks(collection, 'large_item', chunk_size=50000):
        # Process the chunk - it is garbage-collected automatically afterwards
        process(chunk)

**Query Optimization:**

.. code:: python

    # Column selection - read only what you need
    item = collection.item('data')
    df = item.to_pandas(columns=['price', 'volume'])  # 4x faster for a subset

    # Filter at the storage level
    df = item.to_pandas(filters=[('price', '>', 100)])  # 8x faster

Using Dask schedulers
---------------------

PyStore supports using Dask distributed. To use a local Dask scheduler, add this to your code:

.. code:: python

    from dask.distributed import LocalCluster
    pystore.set_client(LocalCluster())

To use a distributed Dask scheduler, add this to your code:

.. code:: python

    pystore.set_client("tcp://xxx.xxx.xxx.xxx:xxxx")
    pystore.set_path("/path/to/shared/volume/all/workers/can/access")

Concepts
========

PyStore provides namespaced *collections* of data. These collections allow bucketing data by *source*, *user*, or some other metric (for example, frequency: End-Of-Day, Minute Bars, etc.).

Each collection (or namespace) maps to a directory containing partitioned **parquet files** for each item (e.g., symbol).

A good practice is to create collections that may look something like this:

* collection.EOD
* collection.ONEMINUTE

Requirements
============

* Python >= 3.8
* Pandas >= 2.0
* Numpy >= 1.20
* Dask >= 2023.1
* PyArrow >= 10.0 (Parquet engine)
* `Snappy <https://github.com/google/snappy>`_ (Google's compression/decompression library)
* multitasking
* pytest-asyncio (for async testing)

PyStore was tested to work on \*nix-like systems, including macOS.

Dependencies:
-------------

PyStore utilizes `Snappy <https://github.com/google/snappy>`_, a fast and efficient compression/decompression library developed by Google. You'll need to install Snappy on your system before installing PyStore.

\* See the ``python-snappy`` GitHub repo for more information.

**\*nix Systems:**

- APT: ``sudo apt-get install libsnappy-dev``
- RPM: ``sudo yum install libsnappy-devel``

**macOS:**

First, install Snappy's C library using `Homebrew <https://brew.sh>`_:

.. code:: bash

    $ brew install snappy

Then, install ``python-snappy`` using conda:

.. code:: bash

    $ conda install python-snappy -c conda-forge

...or using ``pip``:

.. code:: bash

    $ CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

**Windows:**

Windows users should check out the Snappy for Windows project and related Stack Overflow answers for help on installing Snappy and ``python-snappy``.
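Once everything is installed, a quick round-trip through ``python-snappy`` confirms the compression stack works. An illustrative smoke test, not part of PyStore itself:

.. code:: python

    import snappy

    payload = b"pystore" * 1000
    compressed = snappy.compress(payload)

    # The round-trip should restore the original bytes exactly
    assert snappy.decompress(compressed) == payload
    print(f"compressed {len(payload)} bytes down to {len(compressed)}")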
Current Status
==============

**Core Features:**

* Local filesystem support with Parquet storage
* Full Pandas DataFrame compatibility, including MultiIndex
* Snapshots for point-in-time data versioning
* Metadata support for data organization

**Advanced Features (July 2025 Release):**

* Complex data type serialization (Period, Interval, Categorical, nested objects)
* Timezone-aware datetime handling with UTC storage
* Async/await operations for non-blocking I/O
* Data validation framework with extensible rules
* Schema evolution for handling data structure changes
* Transaction support with rollback capabilities

**Performance Features:**

* Streaming operations for datasets larger than RAM
* Batch read/write with parallel processing
* Intelligent partitioning (time-based and size-based)
* Memory optimization with automatic type downcasting
* Metadata caching for faster access
* Query optimization with column selection and predicate pushdown

**Known Limitations:**

* MultiIndex append operations have limited support due to Dask limitations. A workaround converts the MultiIndex to regular columns, but it may not fully preserve the MultiIndex structure after an append (the corresponding test remains marked as an expected failure).
* Parquet has some limitations around preserving exact index metadata.

**Future Plans:**

* Amazon S3 support (via `s3fs <https://github.com/fsspec/s3fs>`_)
* Google Cloud Storage support (via `gcsfs <https://github.com/fsspec/gcsfs>`_)
* Hadoop Distributed File System support (via `hdfs3 <https://github.com/dask/hdfs3>`_)

Acknowledgements
================

PyStore is hugely inspired by Man AHL's `Arctic <https://github.com/man-group/arctic>`_, which uses MongoDB for storage and allows for versioning and other features. I highly recommend you check it out.

License
=======

PyStore is licensed under the **Apache License, Version 2.0**. A copy of the license is included in LICENSE.txt.

-----

I'm very interested in your experience with PyStore. Please drop me a note with any feedback you have. Contributions welcome!

\- **Ran Aroussi**