Apache Parquet

Apache Parquet is a columnar storage file format designed for efficient data storage and retrieval in big data processing frameworks. It is optimized for analytics workloads: storing data column by column rather than row by row enables better compression, richer encoding, and faster queries.

Compact binary data serialization.

Apache Parquet is a columnar storage file format designed for efficient data storage and processing in big data analytics environments. It began in 2013 as a collaboration between Twitter and Cloudera and is now a project of the Apache Software Foundation. Unlike traditional row-oriented formats (such as CSV or Avro) that store records sequentially, Parquet organizes data by column, grouping all values from the same column together on disk. This columnar layout is a major advantage for analytical workloads, where queries typically touch only a subset of columns from wide tables: instead of reading entire rows and discarding unneeded fields, a system can read only the columns a query requires, sharply reducing I/O and improving query performance. The layout also enables highly effective compression, because values in the same column tend to share characteristics and patterns, so codecs such as Snappy, Gzip, LZO, and Zstandard achieve much better ratios than they would on mixed-type row data. Parquet files are self-describing: each file embeds its schema and metadata, so any processing system can interpret the data without an external schema definition.
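The column pruning and per-column compression described above can be seen with PyArrow, one of the Parquet implementations listed under Properties below. A minimal sketch; the file name, column names, and values are illustrative only:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; in practice this would come from a
# DataFrame, a query engine, or an ingest pipeline.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["US", "DE", "US", "FR"],
    "revenue": [10.5, 3.2, 7.8, 1.1],
})

# Write the table column by column, compressing each column chunk with Snappy.
pq.write_table(table, "events.parquet", compression="snappy")

# Read back only the columns a query needs; the user_id column is never
# decompressed or deserialized from disk.
subset = pq.read_table("events.parquet", columns=["country", "revenue"])
print(subset)
```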

Parquet has become the de facto standard for analytical data storage in modern data lakes and big data ecosystems, with native support across virtually all major processing frameworks, including Apache Spark, Apache Hive, Apache Impala, Presto, Trino, and Apache Drill, as well as cloud query engines and warehouses such as Amazon Athena, Google BigQuery, Azure Synapse, and Snowflake. The format supports rich data types, including nested and repeated structures (arrays, maps, and complex records), making it well suited to storing semi-structured data from JSON or Avro sources while preserving query efficiency. Internally, Parquet uses dictionary encoding for low-cardinality columns, bit-packing for small integers, run-length encoding for repeated values, and delta encoding for sorted data, all of which contribute to both storage efficiency and query speed. Column statistics (min/max values, null counts) stored in the file metadata enable predicate pushdown: query engines can skip entire row groups or files that cannot contain data matching a filter condition. This combination of columnar organization, advanced encodings, efficient compression, predicate pushdown, and schema evolution support makes Parquet a strong default for data warehouse tables, analytical datasets, machine learning feature stores, time-series data, and any scenario requiring fast analytical queries over large datasets, often yielding order-of-magnitude gains in query performance and storage footprint over row-oriented formats, depending on the workload.
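Predicate pushdown and the self-describing footer can likewise be exercised from PyArrow. A hedged sketch reusing the hypothetical events.parquet file from the previous example; the exact statistics reported depend on the writer and library version:

```python
import pyarrow.parquet as pq

# Row-group filtering: the reader consults the min/max statistics in the
# footer and skips row groups that cannot satisfy the predicate.
us_rows = pq.read_table("events.parquet", filters=[("country", "=", "US")])
print(us_rows)

# The footer also carries the embedded schema and per-column statistics
# that make the file self-describing and enable the skipping above.
pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)                                  # logical schema
stats = pf.metadata.row_group(0).column(0).statistics   # first column chunk
print(stats.min, stats.max, stats.null_count)
```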

License: Apache 2.0

Tags: Data, Serialization, Binary

Properties: Columnar Storage Format, Column-Oriented, Apache Project, Open Source, Twitter-Cloudera Collaboration, Big Data Format, Analytics Optimized, Self-Describing, Schema Embedded, Metadata Rich, Binary Format, Efficient Storage, High Compression, Compression Support, Snappy Compression, Gzip Compression, LZO Compression, Brotli Compression, Zstandard Compression, Uncompressed Option, Column-Level Compression, Encoding Schemes, Dictionary Encoding, Run-Length Encoding, Bit-Packing, Delta Encoding, Delta Binary Packing, Plain Encoding, Byte Stream Split, Hybrid Encoding, Efficient Reads, Selective Column Access, Column Pruning, Projection Pushdown, Predicate Pushdown, Filter Pushdown, Statistics-Based Filtering, Row Group Skipping, Page-Level Statistics, Column Statistics, Min/Max Values, Null Counts, Distinct Counts, Bloom Filters, File-Level Metadata, Row Group Metadata, Column Chunk Metadata, Page Metadata, Schema Evolution, Schema Compatibility, Add Columns, Remove Columns, Rename Columns, Type Evolution, Nested Data Support, Complex Types, Struct Types, Array Types, Map Types, List Types, Repeated Fields, Optional Fields, Required Fields, Hierarchical Data, Semi-Structured Data, JSON Compatible, Avro Compatible, Thrift Compatible, Protocol Buffers Compatible, Rich Data Types, Primitive Types, Boolean Type, Integer Types, INT32, INT64, INT96, Float Type, Double Type, Binary Type, Fixed-Length Binary, String Type, UTF-8 Strings, Decimal Type, Date Type, Time Type, Timestamp Type, UUID Type, Enum Type, Logical Types, Converted Types, Annotation Support, Row Groups, Columnar Chunks, Data Pages, Dictionary Pages, Index Pages, Header Pages, Footer Structure, Magic Number, Version Number, File Format Version, Parquet Format 2.0, Apache Arrow Integration, Arrow Flight, In-Memory Format, Zero-Copy Reads, Memory Mapping, Lazy Loading, Streaming Reads, Batch Reads, Vectorized Processing, SIMD Optimization, CPU Efficiency, I/O Efficiency, Network Efficiency, Query Performance, Fast Scans, Aggregate Performance, Join Performance, Analytical Workloads, OLAP Queries, Data Warehouse Format, Data Lake Format, Cloud Storage Optimized, S3 Optimized, Azure Blob Compatible, Google Cloud Storage Compatible, HDFS Compatible, Object Storage, Distributed Storage, Splittable Files, Parallel Processing, Multi-Threaded Reads, Concurrent Access, Apache Spark Integration, Spark SQL, DataFrame Support, Dataset Support, PySpark Support, Apache Hive Integration, Hive Tables, HiveQL Support, Impala Support, Presto Support, Trino Support, Apache Drill Support, Dremio Support, ClickHouse Support, DuckDB Support, Snowflake Compatible, BigQuery Compatible, Redshift Spectrum, Athena Compatible, Azure Synapse, Databricks Support, EMR Support, Dataproc Compatible, AWS Glue, Data Catalog Integration, Table Format Support, Delta Lake, Apache Iceberg, Apache Hudi, Time Travel, ACID Transactions, Schema Registry, Metastore Integration, Partition Support, Partitioned Tables, Partition Pruning, Bucketing Support, Sorted Data, Clustered Data, Data Organization, File Organization, Directory Structure, Hive Partitioning, Key-Based Partitioning, Date Partitioning, ETL Integration, Data Pipelines, Batch Processing, Stream Processing, Real-Time Analytics, Apache Kafka Integration, Apache Flink Support, Streaming Writes, Micro-Batching, Change Data Capture, Incremental Updates, Upsert Support, Delete Support, Merge Support, Compaction, Small File Problem, File Consolidation, Optimization, Vacuum Operations, Machine Learning, Feature 
Store, Training Datasets, Model Input, Data Science, Pandas Integration, NumPy Compatible, Scikit-learn, TensorFlow Datasets, PyTorch DataLoader, Jupyter Notebooks, R Support, Julia Support, Command Line Tools, parquet-tools, parquet-cli, File Inspection, Schema Extraction, Row Count, File Size, Compression Ratio, Storage Metrics, Performance Metrics, Benchmark Results, Query Optimization, Cost-Based Optimization, Statistics Collection, Cardinality Estimation, Data Profiling, Data Quality, Data Validation, Type Safety, Schema Validation, Constraint Checking, Business Rules, Programming Language Support, Java Support, Scala Support, Python Support, C++ Support, Go Support, Rust Support, JavaScript Support, .NET Support, Arrow Parquet, PyArrow, FastParquet, parquet-cpp, parquet-mr, parquet-format, Specification, Open Standard, Vendor Neutral, Cross-Platform, Portable Format, Interoperability, Data Exchange, Data Sharing, Data Publishing, Open Data, Public Datasets, Reproducible Research, Version Control Friendly, Git LFS Compatible, Data Versioning, Data Lineage, Provenance Tracking, Audit Trails, Compliance Support, GDPR Compatible, Data Governance, Access Control, Encryption Support, Encryption at Rest, Column Encryption, Transparent Encryption, Security, Authentication, Authorization, Row-Level Security, Column Masking, Data Redaction, PII Protection, Sensitive Data, Anonymization, Pseudonymization, Production Ready, Enterprise Grade, Mission Critical, High Performance, Scalable, Petabyte Scale, Exabyte Scale, Large Datasets, Wide Tables, Deep Nesting, High Cardinality, Low Cardinality, Sparse Data, Dense Data, Time Series Data, Event Data, Log Data, Metrics Data, Sensor Data, IoT Data, Clickstream Data, User Behavior, Transaction Data, Financial Data, Scientific Data, Genomics Data, Weather Data, Geospatial Data, GIS Integration, Location Data, Coordinates, Spatial Queries, Temporal Queries, Historical Data, Archive Format, Cold Storage, Data Retention, Backup Format, Disaster Recovery, Long-Term Storage, Cost Optimization, Storage Savings, Cloud Cost Reduction, Bandwidth Savings, Compute Efficiency, Resource Optimization, Green Computing, Energy Efficient, Carbon Footprint, Sustainability, Industry Standard, Widely Adopted, Battle Tested, Mature Technology, Active Development, Community Support, Documentation, Examples, Tutorials, Best Practices, Design Patterns, Anti-Patterns, Troubleshooting, Debugging, Profiling Tools, Performance Tuning, Optimization Guides, Migration Tools, Conversion Tools, CSV to Parquet, JSON to Parquet, Avro to Parquet, ORC Alternative, Comparison Benchmarks, Format Selection, Use Case Specific, Analytics First, Write Once Read Many, WORM, Append-Only, Immutable Files, Idempotent Writes, Exactly-Once Semantics, Consistency, Durability, Reliability, Fault Tolerance, Error Handling, Data Integrity, Checksum Validation, CRC Checks, Corruption Detection, Self-Healing, Backward Compatible, Forward Compatible, Version Migration, Legacy Support, Modern Format, Future Proof, Ecosystem Integration, Tool Support, BI Tools, Tableau Support, Power BI, Looker, Qlik, Metabase, Superset, Grafana, Monitoring, Observability, Telemetry, Usage Tracking, Access Patterns, Query Patterns, Workload Analysis

Website: https://parquet.apache.org/
