Introduction to Apache Avro


I know that I already mentioned Apache Avro, but I believe that some basic explanation is needed.

Apache Avro is a data serialization system that allows you to define complex data structures and store them in a compact binary format. It was originally developed as part of the Hadoop ecosystem, but can also be used as a standalone library for any application that needs to work with complex data structures.

One of the main benefits of Avro is its schema-based approach. Avro data is always written and read against a schema, which is itself defined in JSON. The schema describes the structure of the data, including the fields and their types. Because both sides know the schema, the binary encoding does not need to repeat field names or type tags for every record, and data can be checked against the schema when it is written or read.
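To make the idea concrete, here is a minimal Python sketch that uses only the standard library. The `User` schema and the `conforms` helper are hypothetical and exist only for this illustration. A real project would rely on an Avro library rather than a hand-rolled check, and this check only understands flat records with a few primitive types.

```python
import json

# A hypothetical schema, written in JSON the way Avro schemas are declared.
SCHEMA_JSON = """
{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "active", "type": "boolean"}
    ]
}
"""

# Illustrative mapping from a few Avro primitive types to Python types.
PYTHON_TYPES = {"string": str, "boolean": bool, "double": float}


def conforms(datum, schema):
    """Return True if `datum` has exactly the fields the record schema
    declares, each holding a value of the expected Python type."""
    fields = schema["fields"]
    if set(datum) != {f["name"] for f in fields}:
        return False
    return all(isinstance(datum[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields)


schema = json.loads(SCHEMA_JSON)
print(conforms({"name": "Ada", "active": True}, schema))   # True
print(conforms({"name": "Ada", "active": "yes"}, schema))  # False
```

The second call fails because the schema declares `active` as a boolean, which is the kind of mismatch a schema lets you catch before the data is stored.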

Avro Types

Avro has a number of built-in types that can be used to define fields in a schema. Here are some of the most commonly used types:

Primitive Types

  • null: Represents a null value.

  • boolean: Represents a boolean value.

  • int: Represents a 32-bit signed integer value.

  • long: Represents a 64-bit signed integer value.

  • float: Represents a 32-bit floating-point value.

  • double: Represents a 64-bit floating-point value.

  • bytes: Represents a sequence of bytes.

  • string: Represents a Unicode string.
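The difference between int and long matters as soon as values grow. The sketch below is plain Python with no Avro library involved, and `smallest_integer_type` is a hypothetical helper: it picks the narrowest Avro integer type whose range can hold a given Python integer.

```python
# Value ranges of Avro's two integer types (both are signed).
INT_MIN, INT_MAX = -2**31, 2**31 - 1
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1


def smallest_integer_type(value):
    """Return "int" or "long", whichever is the narrowest type that fits."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("expected a Python int")
    if INT_MIN <= value <= INT_MAX:
        return "int"
    if LONG_MIN <= value <= LONG_MAX:
        return "long"
    raise OverflowError("value does not fit in an Avro long")


print(smallest_integer_type(42))     # int
print(smallest_integer_type(2**31))  # long
```

A timestamp in milliseconds, for example, already exceeds the int range, so such fields are usually declared as long.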

Complex Types

  • record: Represents a complex data structure consisting of multiple named fields. Fields can be of any Avro type, including other complex types.

  • enum: Represents a fixed set of named values.

  • array: Represents a variable-length array of items, all of the same type.

  • map: Represents a map of key-value pairs, where the keys are strings and the values can be of any Avro type.

  • fixed: Represents a fixed-length sequence of bytes.

  • union: Represents a value that can be one of several different types. Unions can be used to specify optional fields by including the null type as one of the possible types.
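The complex types above can be combined in a single schema. The following sketch embeds a hypothetical `Order` schema in Python and walks its fields using only the standard `json` module. The schema, its field names, and the `describe` helper are assumptions made for illustration.

```python
import json

# Hypothetical record schema that uses each of the other complex types.
SCHEMA_JSON = """
{
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "status",
         "type": {"type": "enum", "name": "Status", "symbols": ["NEW", "SHIPPED"]}},
        {"name": "items", "type": {"type": "array", "items": "string"}},
        {"name": "attributes", "type": {"type": "map", "values": "long"}},
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
        {"name": "note", "type": ["null", "string"], "default": null}
    ]
}
"""


def describe(field_type):
    """Return a short description of a field's type."""
    if isinstance(field_type, list):   # a JSON array denotes a union
        return "union of " + ", ".join(describe(t) for t in field_type)
    if isinstance(field_type, dict):   # a JSON object denotes a complex type
        return field_type["type"]
    return field_type                  # a plain string names a type


schema = json.loads(SCHEMA_JSON)
for field in schema["fields"]:
    print(field["name"], "->", describe(field["type"]))
```

Note how a map only declares the type of its values, since its keys are always strings, and how a union is written as a plain JSON array of types.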

Why use null in the field definition?

As mentioned earlier, the null type is one of the most commonly used types in Avro. One reason for this is that it allows you to define optional fields in a schema.

When you define a field in a schema, you specify its type and, optionally, a default value. To make a field optional, declare its type as a union that includes null and give it a default value of null. The union allows the field to hold either a real value or null. The default is what a reader fills in when the data it is reading was written with a schema that did not contain the field at all. If a field has no default, a reader whose schema contains that field cannot read data that lacks it.

The benefit of defining optional fields this way is that your schema can evolve. If you add such a field later, readers using the new schema can still read old data, because the missing field is filled in with its default. If you later stop writing the field, readers that still expect it keep working for the same reason. Without a default, either change would break any reader whose schema contains a field that the written data lacks.

Here's an example of how to define an optional field using null in Avro:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
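        {"name": "age", "type": ["null", "int"], "default": null}
    ]
}

In this example, the age field is optional because its type is a union that includes null and its default value is null. Note that null is listed first in the union: the Avro specification has long required the default value of a union field to match the first type in the union, so optional fields are conventionally written as ["null", ...].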
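The role of the default during schema evolution can be sketched in plain Python. This is a simplified, hand-written illustration of what happens to record fields when data written with one schema is read with another. It is not the API of any Avro library: `resolve` is a hypothetical helper, and the two schemas are an assumed older and newer version of Person, with null listed first in the union.

```python
# Older writer schema: Person had no "age" field yet.
WRITER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [{"name": "name", "type": "string"}],
}

# Newer reader schema: "age" was added as an optional field.
# Python's None stands in for JSON null here.
READER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
}


def resolve(datum, writer_schema, reader_schema):
    """Shape a record written with `writer_schema` to fit `reader_schema`."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = datum[name]        # present in the data: keep it
        elif "default" in field:
            result[name] = field["default"]   # missing: fall back to the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                             # writer-only fields are dropped


print(resolve({"name": "Ada"}, WRITER_SCHEMA, READER_SCHEMA))
# {'name': 'Ada', 'age': None}
```

If the `age` field had no default, the same call would raise an error, which mirrors why a default is what makes adding or dropping a field safe.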