Data Storage Formats Overview: Avro, Parquet, ORC, and GFS

| Feature | Avro | Parquet | ORC | GFS |
|---|---|---|---|---|
| Data Storage Type | Row-oriented | Columnar | Columnar | Distributed file system (not a file format) |
| Designed For | Streaming, serialization, row-based processing | Analytical queries, big data processing | Analytical queries, big data processing (Hive-centric) | Storing large files reliably across many nodes |
| Compression | Yes (Deflate, Snappy, and other codecs) | Yes (Snappy, Gzip, LZO, etc.) | Yes (Zlib, Snappy, LZO, etc.) | No format-level compression; relies on client or application-level compression |
| Schema Evolution | Yes, flexible (reader and writer schemas are resolved at read time) | Yes, supports adding/removing columns | Yes, supports schema evolution | Not applicable; no schema enforcement, handles files and blocks |
| Splittable for Parallel Processing | Yes (object container files can be split at block sync markers) | Yes (by row group) | Yes (by stripe) | Yes; file blocks are distributed across nodes and processed in parallel |
| Typical Use Cases | Data serialization, message passing, streaming | Data warehousing, analytics, ETL, ML workflows | Hive data warehousing, analytics, ETL | Distributed storage for Google's large-scale, data-intensive applications (HDFS is the open-source counterpart used by Hadoop) |
| Integration | Kafka, Hadoop, various streaming systems | Spark, Hive, Presto, Impala | Hive, Spark, Hadoop ecosystem | Google MapReduce, Bigtable (pre-Colossus), data-intensive apps |
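
The row-oriented versus columnar distinction in the table can be illustrated with a toy sketch. This is not how Avro, Parquet, or ORC lay out bytes on disk; plain Python lists stand in for the two layouts, and the record fields (`user_id`, `country`, `amount`) and helper names are hypothetical, chosen only for illustration. It shows why a query that aggregates a single column touches fewer values in a columnar layout.

```python
# Toy illustration only: Python lists stand in for on-disk layouts.
# Field names and helper functions are hypothetical.
records = [
    {"user_id": 1, "country": "DE", "amount": 12.5},
    {"user_id": 2, "country": "US", "amount": 7.0},
    {"user_id": 3, "country": "DE", "amount": 3.25},
]


def to_row_layout(rows):
    """Row-oriented: all fields of one record are stored together."""
    return [tuple(r.values()) for r in rows]


def to_columnar_layout(rows):
    """Columnar: all values of one field are stored together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_layout = to_row_layout(records)
col_layout = to_columnar_layout(records)

# Analytical query: total of the "amount" field.
# A row-oriented scan reads whole records to reach one field each.
row_total = sum(row[2] for row in row_layout)
values_touched_row = sum(len(row) for row in row_layout)

# A columnar scan reads only the values of the requested column.
col_total = sum(col_layout["amount"])
values_touched_col = len(col_layout["amount"])

print(row_total, values_touched_row)
print(col_total, values_touched_col)
```

Both layouts give the same total, but the columnar scan reads one value per record instead of every field. The reverse holds when whole records are appended or read one at a time, which is the access pattern a row-oriented format such as Avro is designed for.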