Avro vs Parquet

Avro

Apache Avro is a remote procedure call and data serialization framework developed within the Apache Hadoop project. It uses JSON to define data types and protocols, and serializes data in a compact binary format.
Avro stores the schema and the data together in a single message or file.
A key feature of Avro is robust support for data schemas that change over time, known as schema evolution.
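As a minimal sketch of how this looks in practice, assuming the fastavro library and a hypothetical "User" record: the schema is plain JSON, and the writer embeds it in the file alongside the binary-encoded rows.

```python
import fastavro

# An Avro schema is plain JSON: a record with named, typed fields.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# The writer stores the schema in the file header, followed by the
# rows in Avro's compact binary encoding.
with open("users.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Because the schema travels with the data, reading needs no
# out-of-band schema definition.
with open("users.avro", "rb") as inp:
    for record in fastavro.reader(inp):
        print(record)
```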

Avro

  • Row-based format
  • Schema is stored as JSON within the file
  • Also a serialization and RPC framework
  • Files support block compression and are splittable
  • Can be written from streaming data (e.g., Apache Kafka)
  • Excellent support for schema evolution (see the sketch after this list)
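A short sketch of schema evolution, again assuming fastavro and the hypothetical "User" record from above: a file written with an older schema is read back under a newer reader schema that adds a field with a default value.

```python
import fastavro

# Writer schema: the version the file was originally written with.
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# Reader schema: a newer version that adds an optional "email" field.
# The default makes files written with the old schema still readable.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

with open("users.avro", "wb") as out:
    fastavro.writer(out, writer_schema, [{"id": 1, "name": "Ada"}])

with open("users.avro", "rb") as inp:
    # fastavro resolves the writer and reader schemas per Avro's
    # schema-resolution rules, filling in the default for "email".
    for record in fastavro.reader(inp, reader_schema=reader_schema):
        print(record)  # {'id': 1, 'name': 'Ada', 'email': None}
```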

Parquet

Parquet is a columnar file format that provides optimizations to speed up queries; it is a far more efficient file format than CSV or JSON.
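As a rough illustration using the pyarrow library (the file name and columns are made up): write a small table to Parquet, then read back only the columns a query actually needs, which is where the columnar layout pays off.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["Ada", "Grace", "Edsger"],
    "score": [9.5, 9.8, 9.1],
})

# Write a Parquet file; the schema is stored in the file footer.
pq.write_table(table, "scores.parquet")

# Columnar layout: only the requested columns are read from disk,
# skipping the rest of the file entirely.
subset = pq.read_table("scores.parquet", columns=["id", "score"])
print(subset)
```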

Parquet

  • Column-based format
  • Schema is stored in the footer of the file
  • Schema evolution is expensive, since it requires merging schemas across multiple files
  • Excellent when consumers read and process only selected columns
  • Can’t be written directly from streaming data, because Parquet must wait for blocks to be finished before a file can be closed. Micro-batching works around this (e.g., Apache Spark; see the sketch after this list).
  • Works very well with Spark, which ships a vectorized Parquet reader
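To illustrate the micro-batch workaround mentioned above, here is a hedged sketch using PySpark Structured Streaming; the Kafka broker address, topic name, and output paths are all placeholders, and the Kafka source requires the spark-sql-kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Placeholder Kafka source: broker and topic are assumptions.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Each processing-time trigger closes out a micro-batch, and Spark
# finalizes a set of Parquet files for it, sidestepping Parquet's
# need for completed blocks before a file can be closed.
query = (
    events.selectExpr("CAST(value AS STRING) AS value")
    .writeStream.format("parquet")
    .option("path", "/tmp/events_parquet")
    .option("checkpointLocation", "/tmp/events_checkpoint")
    .trigger(processingTime="30 seconds")
    .start()
)

query.awaitTermination()
```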