Apache ORC (Optimized Row Columnar) is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is designed to provide efficient storage and fast access for large datasets in distributed computing frameworks.
Key Features
- Columnar Storage: Data is stored by columns rather than rows, which improves compression and speeds up queries that read only a subset of columns (see the sketch after this list).
- High Compression: ORC combines lightweight, type-aware encodings (such as run-length and dictionary encoding) with general-purpose codecs like ZLIB, Snappy, or ZSTD to reduce storage space significantly.
- Predicate Pushdown: ORC stores min/max statistics (and optional Bloom filters) at the file, stripe, and row-group level, so the execution engine can skip data that cannot match a filter instead of reading it.
- Splittable Files: ORC files are organized into independent stripes, so large files can be split into manageable chunks and processed in parallel.
- Schema Evolution: Supports adding new columns (and certain compatible type changes) without rewriting existing data files.
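To make these features concrete, here is a minimal PySpark sketch; the local session, file paths, sample data, and codec choice are assumptions for illustration, not part of the ORC specification. It writes a small DataFrame as ORC with an explicit compression codec, then reads it back with a column projection and a filter so Spark can apply column pruning and predicate pushdown against the ORC file.

```python
# Minimal sketch, assuming a local PySpark installation and a writable
# ./data directory; paths and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-features-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 0.9), (2, "bob", 0.1), (3, "carol", 0.7)],
    ["user_id", "name", "score"],
)

# Columnar write; the "compression" option picks the codec
# (zlib is the ORC default, snappy and zstd are common alternatives).
df.write.mode("overwrite").option("compression", "zstd").orc("data/users_orc")

# Selecting one column with a filter lets the reader skip the other
# columns entirely and, using ORC's min/max statistics, skip stripes
# and row groups that cannot satisfy the predicate.
result = (
    spark.read.orc("data/users_orc")
    .where("score > 0.5")
    .select("user_id")
)
result.show()
```

In Spark, ORC filter pushdown is governed by the spark.sql.orc.filterPushdown setting, which is enabled by default in recent releases.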
Use Cases
Apache ORC is widely used in big data processing environments, especially with Apache Hive and Apache Spark, for analytical queries on massive datasets. It is well-suited for:
- Data warehousing and analytics
- ETL processing (a typical pipeline is sketched after this list)
- Machine learning workflows requiring efficient data access
- Storage optimization in Hadoop Distributed File System (HDFS)
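As an example of the ETL and warehousing use cases above, the following hedged sketch reads raw CSV, derives a date column, and writes date-partitioned ORC to HDFS; the input paths, column names, and HDFS URIs are hypothetical.

```python
# Hedged ETL sketch: all paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-orc-etl").getOrCreate()

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///raw/events/*.csv")  # hypothetical input location
)

cleaned = (
    raw.dropna(subset=["event_time"])
       .withColumn("event_date", F.to_date("event_time"))
)

# Partitioning by date keeps individual files a manageable size and lets
# query engines prune whole directories before the ORC readers even start.
(
    cleaned.write
    .mode("append")
    .partitionBy("event_date")
    .orc("hdfs:///warehouse/events_orc")  # hypothetical output location
)
```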
Comparison with Other Formats
While ORC is one of the most widely used columnar formats, alternatives such as Apache Parquet and Apache Avro serve overlapping purposes:
- ORC vs Parquet: Both offer columnar storage and compression; ORC originated in the Hive project and is often preferred in Hive environments, whereas Parquet is frequently used with Spark and Impala (the snippet after this list shows the corresponding writer calls).
- ORC vs Avro: Avro is a row-based format suited for streaming and logging data, while ORC is optimized for analytical batch processing.
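From the perspective of an engine such as Spark, switching among these formats is largely a one-line change, so the decision usually comes down to the ecosystem and workload considerations above. The sketch below is illustrative only: the output paths are placeholders, and writing Avro assumes the external spark-avro package is on the classpath.

```python
# Illustrative only; output paths are placeholders and the Avro writer
# assumes the external spark-avro package is available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-comparison").getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "user_id")

df.write.mode("overwrite").orc("data/demo_orc")          # columnar, common with Hive
df.write.mode("overwrite").parquet("data/demo_parquet")  # columnar, common with Spark/Impala
df.write.mode("overwrite").format("avro").save("data/demo_avro")  # row-based
```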