“Apache Avro is the leading serialization format for record data,
and first choice for streaming data pipelines.
It offers excellent schema evolution, and has implementations
for the JVM …” - source: avro.apache.org
This post presents a simple Avro serializer and deserializer (Avro serde) implementation.
The Apache Avro library follows the single responsibility principle and defines separate classes for separate
kinds of work.
The class SpecificDatumWriter is responsible for writing an Avro object to an OutputStream (through an Encoder or a DataFileWriter).
The class SpecificDatumReader is responsible for determining the in-memory representation of the data during
deserialization.
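To make this division of work concrete, here is a minimal sketch of a datum writer and a datum reader handling a single record without any container file. It swaps in GenericDatumWriter and GenericDatumReader with a schema parsed from JSON, so that no generated classes are needed; the post itself uses the Specific variants. The class name DatumWriterReaderSketch and the User schema are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class DatumWriterReaderSketch {

    // Hypothetical schema, used only for this illustration.
    static final Schema USER_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    static GenericRecord newUser(final String name, final int age) {
        final GenericRecord user = new GenericData.Record(USER_SCHEMA);
        user.put("name", name);
        user.put("age", age);
        return user;
    }

    // The datum writer knows how to walk the record; the encoder knows the wire format.
    static byte[] encode(final GenericRecord user) {
        final var out = new ByteArrayOutputStream();
        final BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        try {
            new GenericDatumWriter<GenericRecord>(USER_SCHEMA).write(user, encoder);
            encoder.flush(); // the binary encoder buffers, so flush before reading the bytes
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // The datum reader decides which in-memory type the decoded data becomes.
    static GenericRecord decode(final byte[] bytes) {
        final BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        try {
            return new GenericDatumReader<GenericRecord>(USER_SCHEMA).read(null, decoder);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        final GenericRecord decoded = decode(encode(newUser("Ada", 36)));
        System.out.println(decoded.get("name") + " is " + decoded.get("age"));
    }
}
```

Bytes produced this way carry no schema, so the reader has to be given the schema separately. The container file format used in the rest of this post avoids that by storing the schema next to the data.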
Here is the abbreviated source code of the class AvroToBytesSerializerDeserializer<T extends GenericRecord>:
package com.techeule.examples.avro;

// ... shortened for readability ...
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

public class AvroToBytesSerializerDeserializer<T extends GenericRecord> {
    private final SpecificDatumWriter<T> datumWriter;
    private final SpecificDatumReader<T> datumReader;

    public AvroToBytesSerializerDeserializer(final Class<T> classType) {
        Objects.requireNonNull(classType, "classType must not be null.");
        datumWriter = new SpecificDatumWriter<>(classType);
        datumReader = new SpecificDatumReader<>(classType);
    }

    public byte[] serialize(final List<T> records) {
        // ... omitted for readability ...
    }

    public List<T> deserialize(final byte[] data) {
        // ... omitted for readability ...
    }
}
The method AvroToBytesSerializerDeserializer#serialize(List<T>):byte[] serializes a list of Avro objects into a byte[].
    public byte[] serialize(final List<T> records) {
        if ((records == null) || records.isEmpty()) {
            throw new IllegalArgumentException("records must not be null or empty");
        }
        final var byteArrayOutputStream = new ByteArrayOutputStream();
        // Closing the DataFileWriter flushes all pending data into the stream.
        try (final var dataFileWriter = new DataFileWriter<>(datumWriter)) {
            tryToAppendRecords(byteArrayOutputStream, dataFileWriter, records);
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
        return byteArrayOutputStream.toByteArray();
    }

    private void tryToAppendRecords(final ByteArrayOutputStream byteArrayOutputStream,
                                    final DataFileWriter<T> dataFileWriter,
                                    final List<T> records) throws IOException {
        // The schema is taken from the first record, which is assumed to be non-null.
        dataFileWriter.create(records.get(0).getSchema(), byteArrayOutputStream);
        for (final T avroRecord : records) {
            // Null entries are skipped silently.
            if (avroRecord != null) {
                dataFileWriter.append(avroRecord);
            }
        }
    }
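The call to create(schema, outputStream) before the first append matters because DataFileWriter produces an Avro object container file, which begins with a header holding the schema. The following sketch, under the assumption that generic records are an acceptable stand-in for generated classes, writes one record and then inspects the resulting bytes. ContainerHeaderSketch and the Event schema are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerHeaderSketch {

    // Hypothetical schema, used only for this illustration.
    static final Schema EVENT_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

    static GenericRecord event(final long id) {
        final GenericRecord event = new GenericData.Record(EVENT_SCHEMA);
        event.put("id", id);
        return event;
    }

    static byte[] write(final List<GenericRecord> records) {
        final var out = new ByteArrayOutputStream();
        try (var writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<>(EVENT_SCHEMA))) {
            writer.create(EVENT_SCHEMA, out); // writes the header, including the schema
            for (final GenericRecord record : records) {
                writer.append(record);
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Recovers the schema from the bytes alone, without telling the reader what to expect.
    static Schema embeddedSchema(final byte[] bytes) {
        try (var stream = new DataFileStream<GenericRecord>(
                new ByteArrayInputStream(bytes), new GenericDatumReader<>())) {
            return stream.getSchema();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        final byte[] bytes = write(List.of(event(1L)));
        // A container file starts with the magic bytes 'O', 'b', 'j', 1.
        System.out.println((char) bytes[0] + "" + (char) bytes[1] + (char) bytes[2]);
        System.out.println(embeddedSchema(bytes));
    }
}
```

Because the schema travels with the data, the resulting byte[] is self-describing. The price is a header that is written once per call, which makes this format a better fit for batches of records than for single small messages.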
The method AvroToBytesSerializerDeserializer#deserialize(byte[]):List<T> deserializes a byte[] into a list of Avro
objects. The class DataFileReader is responsible for reading the data from a SeekableInput
(e.g. a file or a byte[]) and loading it into memory.
    public List<T> deserialize(final byte[] data) {
        try (final DataFileReader<T> dataFileReader =
                     new DataFileReader<>(new SeekableByteArrayInput(data), datumReader)) {
            return tryToReadRecords(dataFileReader);
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }

    private List<T> tryToReadRecords(final DataFileReader<T> dataFileReader) {
        final List<T> records = new LinkedList<>();
        while (dataFileReader.hasNext()) {
            records.add(dataFileReader.next());
        }
        return records;
    }
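As a rough illustration of the round trip, the sketch below mirrors the serialize and deserialize logic shown above, again with generic records instead of generated classes. It is not the class from this post: GenericRoundTripSketch, the City schema and the readNames helper are hypothetical. It also shows one consequence of the null check in the write loop: a null entry in the input list does not come back.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericRoundTripSketch {

    // Hypothetical schema, used only for this illustration.
    static final Schema CITY_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"City\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

    static GenericRecord city(final String name) {
        final GenericRecord city = new GenericData.Record(CITY_SCHEMA);
        city.put("name", name);
        return city;
    }

    // Same shape as the serialize method above: create, append non-null records, close.
    static byte[] serialize(final List<GenericRecord> records) {
        final var out = new ByteArrayOutputStream();
        try (var writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<>(CITY_SCHEMA))) {
            writer.create(CITY_SCHEMA, out);
            for (final GenericRecord record : records) {
                if (record != null) {
                    writer.append(record);
                }
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads every record back from the byte[] and collects the "name" field.
    static List<String> readNames(final byte[] data) {
        final List<String> names = new ArrayList<>();
        try (var reader = new DataFileReader<GenericRecord>(
                new SeekableByteArrayInput(data), new GenericDatumReader<>())) {
            for (final GenericRecord record : reader) {
                names.add(record.get("name").toString()); // strings arrive as Utf8
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(final String[] args) {
        // Arrays.asList is used because List.of does not accept null elements.
        final byte[] bytes = serialize(Arrays.asList(city("Oslo"), null, city("Lima")));
        System.out.println(readNames(bytes)); // the null entry is gone after the round trip
    }
}
```

Records come back in the order in which they were appended. Callers that rely on list positions should therefore make sure the input contains no null entries.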
A simple Avro serializer and deserializer can be implemented with just one class of only 68 lines of code,
including the import statements and empty lines.
The entire source code of AvroToBytesSerializerDeserializer can be found, as usual, in a Maven project on
GitHub (techeule / te-java-avro-serialization-deserialization), together with some unit tests.
The unit tests can be seen as showcases of AvroToBytesSerializerDeserializer.