Simple Java Avro Serializer Deserializer

“Apache Avro is the leading serialization format for record data, and first choice for streaming data pipelines. It offers excellent schema evolution, and has implementations for the JVM …” - source: avro.apache.org

This post presents a simple Avro serializer and deserializer (Avro serde) implementation. The Apache Avro library follows the single responsibility principle: it defines different classes for different types of work.

The class SpecificDatumWriter is responsible for writing Avro objects in their serialized form (here, through a DataFileWriter, into an OutputStream). The class SpecificDatumReader is responsible for determining the in-memory data representation while deserializing.

Here is the abbreviated source code of the class AvroToBytesSerializerDeserializer<T extends GenericRecord>:

package com.techeule.examples.avro;

// ... shortened for readability ...

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

public class AvroToBytesSerializerDeserializer<T extends GenericRecord> {
    private final SpecificDatumWriter<T> datumWriter;
    private final SpecificDatumReader<T> datumReader;

    public AvroToBytesSerializerDeserializer(final Class<T> classType) {
        Objects.requireNonNull(classType, "classType must not be null.");
        datumWriter = new SpecificDatumWriter<>(classType);
        datumReader = new SpecificDatumReader<>(classType);
    }

    public byte[] serialize(final List<T> records) {
      // ... omitted for readability ...
    }

    public List<T> deserialize(final byte[] data) {
      // ... omitted for readability ...
    }
}

Serialization

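The method AvroToBytesSerializerDeserializer#serialize(List<T>):byte[] serializes a list of Avro objects into a byte[]. The DataFileWriter stores the schema of the records together with the data, so the resulting bytes are self-describing.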

    public byte[] serialize(final List<T> records) {
        if ((records == null) || records.isEmpty()) {
            throw new IllegalArgumentException("records must not be null or empty");
        }

        final var byteArrayOutputStream = new ByteArrayOutputStream();
        try (final var dataFileWriter = new DataFileWriter<>(datumWriter)) {
            tryToAppendRecords(byteArrayOutputStream, dataFileWriter, records);
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
        return byteArrayOutputStream.toByteArray();
    }

    private void tryToAppendRecords(final ByteArrayOutputStream byteArrayOutputStream,
                                    final DataFileWriter<T> dataFileWriter,
                                    final List<T> records) throws IOException {
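        // The schema of the first record is written into the header of the output,
        // so the first element must be non-null and all records must share its schema.
        dataFileWriter.create(records.get(0).getSchema(), byteArrayOutputStream);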

        for (final T avroRecord : records) {
            if (avroRecord != null) {
                dataFileWriter.append(avroRecord);
            }
        }
    }
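
One detail of serialize is easy to miss: toByteArray() is called only after the try-with-resources block has closed the DataFileWriter. The sketch below illustrates why that ordering matters, under the assumption that the writer buffers its output until it is flushed or closed. It uses only the JDK, with a BufferedOutputStream standing in for the Avro writer; the class and method names are hypothetical and this is an analogy, not Avro's actual implementation.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CloseBeforeReadSketch {

    // Returns {bytes visible in the sink before close, bytes visible after close}.
    static int[] sizesBeforeAndAfterClose(final byte[] payload) throws IOException {
        final var sink = new ByteArrayOutputStream();
        final int[] sizes = new int[2];

        // BufferedOutputStream is a stand-in (assumption) for any writer that buffers internally.
        try (var buffered = new BufferedOutputStream(sink, 8192)) {
            buffered.write(payload);
            // A payload smaller than the buffer is still held by the wrapper here.
            sizes[0] = sink.size();
        }
        // Closing the wrapper flushed the buffered bytes into the sink.
        sizes[1] = sink.size();
        return sizes;
    }

    public static void main(final String[] args) throws IOException {
        final int[] sizes = sizesBeforeAndAfterClose("hello".getBytes(StandardCharsets.UTF_8));
        // prints "before close: 0 bytes, after close: 5 bytes"
        System.out.println("before close: " + sizes[0] + " bytes, after close: " + sizes[1] + " bytes");
    }
}
```

Reading the ByteArrayOutputStream inside the try block could therefore return incomplete data, which is why serialize returns the bytes after the block.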

Deserialization

The method AvroToBytesSerializerDeserializer#deserialize(byte[]):List<T> deserializes a byte[] into a list of Avro objects. The class DataFileReader is responsible for reading the data from a SeekableInput (e.g. a file or a byte[]) and loading the records into memory.

    public List<T> deserialize(final byte[] data) {
        try (final DataFileReader<T> dataFileReader = new DataFileReader<>(new SeekableByteArrayInput(data), datumReader)) {
            return tryToReadRecords(dataFileReader);
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }

    private List<T> tryToReadRecords(final DataFileReader<T> dataFileReader) {
        final List<T> records = new LinkedList<>();

        while (dataFileReader.hasNext()) {
            records.add(dataFileReader.next());
        }
        return records;
    }
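
To try the write and read path without generated classes, the following minimal sketch performs the same round trip with GenericDatumWriter and GenericDatumReader in place of the specific variants used above. The class name, the helper methods, and the inline schema (a record "User" with a single "name" field) are hypothetical and exist only for illustration; the sketch assumes the Apache Avro library is on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class GenericRoundTripSketch {

    // Hypothetical schema, defined inline for illustration only.
    static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"example\","
                    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");

    // Builds a generic record that conforms to SCHEMA.
    static GenericRecord user(final String name) {
        final GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("name", name);
        return record;
    }

    // Writes the records, preceded by the schema, into a byte[].
    static byte[] write(final List<GenericRecord> records) throws IOException {
        final var out = new ByteArrayOutputStream();
        try (var writer = new DataFileWriter<GenericRecord>(new GenericDatumWriter<>(SCHEMA))) {
            writer.create(SCHEMA, out);
            for (final GenericRecord record : records) {
                writer.append(record);
            }
        }
        // The writer is closed at this point, so all bytes have reached the stream.
        return out.toByteArray();
    }

    // Reads all records back; the schema is taken from the data itself.
    static List<GenericRecord> read(final byte[] data) throws IOException {
        final List<GenericRecord> records = new ArrayList<>();
        try (var reader = new DataFileReader<GenericRecord>(
                new SeekableByteArrayInput(data), new GenericDatumReader<>())) {
            while (reader.hasNext()) {
                records.add(reader.next());
            }
        }
        return records;
    }

    public static void main(final String[] args) throws IOException {
        final byte[] bytes = write(List.of(user("Ada"), user("Grace")));
        for (final GenericRecord record : read(bytes)) {
            // String fields come back as Avro's Utf8 type, hence toString().
            System.out.println(record.get("name").toString());
        }
    }
}
```

The generic variant trades the type safety of generated classes for not needing a code generation step, which makes it convenient for quick experiments.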

Conclusion

A simple Avro serializer and deserializer can be implemented in a single class of only 68 lines of code, including import statements and empty lines. The entire source code of AvroToBytesSerializerDeserializer can be found, as usual, in a Maven project on GitHub (techeule / te-java-avro-serialization-deserialization), together with some unit tests. The unit tests also serve as showcases for AvroToBytesSerializerDeserializer.

Resources: