Data Ingest in PNDA 5.0

PNDA 5.0 brings some interesting new capabilities for data ingest with Gobblin.

Previously it was necessary to encapsulate data in an AVRO encoded envelope before sending it to PNDA. All data sent to PNDA had to conform to the same AVRO schema. We can think of this as external AVRO encoding.

Now it is possible to teach Gobblin to understand raw data formats directly and to perform the AVRO encoding as part of the ingest process. We can think of this as internal AVRO encoding.

Here is a quick reminder of the AVRO schema used by PNDA:

{"namespace": "pnda.entity",
 "type": "record",
 "name": "event",
 "fields": [
     {"name": "timestamp",   "type": "long"},
     {"name": "source",      "type": "string"},
     {"name": "rawdata",     "type": "bytes"}
 ]
}

The purpose of the AVRO envelope is to provide metadata, information about the data, so that PNDA can manage the datasets consistently. These are the metadata fields in the AVRO schema:

  • timestamp – when the data was sent
  • source – the source of the data, e.g. an XR Telemetry host

Gobblin uses these metadata fields to assign the data to a named dataset and to store it in the correct time bucket in HDFS.
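
For context, here is a minimal sketch of what the external encoding step looked like on the producer side, using the Apache Avro Java API to wrap a raw payload in the envelope above. The source name and payload are placeholders, and publishing to Kafka is left as a comment:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class PndaEnvelopeSketch {

    // The PNDA envelope schema shown above.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"namespace\": \"pnda.entity\", \"type\": \"record\", \"name\": \"event\", \"fields\": ["
        + "{\"name\": \"timestamp\", \"type\": \"long\"},"
        + "{\"name\": \"source\", \"type\": \"string\"},"
        + "{\"name\": \"rawdata\", \"type\": \"bytes\"}]}");

    /** Wrap a raw payload in the PNDA AVRO envelope and return the encoded bytes. */
    public static byte[] encode(String source, byte[] rawdata) throws Exception {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("timestamp", System.currentTimeMillis());
        event.put("source", source);
        event.put("rawdata", ByteBuffer.wrap(rawdata));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = "{\"interface\": \"Gi0/0/0\"}".getBytes(StandardCharsets.UTF_8);
        byte[] message = encode("xr-telemetry-host", payload);
        // The encoded bytes would then be published to a Kafka topic,
        // e.g. with org.apache.kafka.clients.producer.KafkaProducer.
        System.out.println("Encoded envelope is " + message.length + " bytes");
    }
}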

In order to receive data that is not AVRO encoded, it is necessary for Gobblin to identify the metadata by some other method. There are various ways this can be achieved:

  • The Kafka topic can be used as a proxy metadata field
  • Fields can be extracted from the message if it conforms to a known format
  • Receiving timestamp can be used instead of sending timestamp, with some inevitable loss of timestamp fidelity

Protobuf Encoded Messages

Let’s look at the configuration that tells Gobblin how to receive Cisco XR Telemetry, which is protobuf encoded:

kafka.topic.specific.state=[ \
  { \
    "dataset": "protobuf.telemetry.*", \
    "pnda.converter.delegate.class": "gobblin.pnda.PNDAProtoBufConverter", \
    "pnda.family.id": "protobuf.telemetry", \
    "pnda.protobuf.source.tag": "1", \
    "pnda.protobuf.timestamp.tag": "10" \
  } \
]

This tells Gobblin to apply specific handling to any messages received on topics matching the protobuf.telemetry.* pattern. A custom class converts each protobuf encoded message into an AVRO encoded message that is ready for Gobblin to store in HDFS. The properties needed for the AVRO envelope are extracted from the decoded protobuf message, using the field numbers of the known Cisco XR Telemetry protobuf schema.

This is the protobuf schema for Cisco XR Telemetry:

message Telemetry {
  oneof node_id {
    string node_id_str = 1;
  }
  oneof subscription {
    string subscription_id_str = 3;
  }
  string encoding_path = 6;
  uint64 collection_id = 8;
  uint64 collection_start_time = 9;
  uint64 msg_timestamp = 10;
  repeated TelemetryField data_gpbkv = 11;
  TelemetryGPBTable data_gpb = 12;
  uint64 collection_end_time = 13;
}

As you can see, field 1 is the node_id and field 10 is the msg_timestamp.

Field  1 → source
Field 10 → timestamp
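
To make the mapping concrete, here is a minimal sketch of the kind of conversion the delegate class performs, assuming the Telemetry definition above has been compiled with protoc into a Java class (the package name here is hypothetical). This illustrates the field mapping; it is not the actual PNDAProtoBufConverter implementation:

import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Hypothetical package for the classes generated by protoc from the Telemetry schema.
import cisco.telemetry.TelemetryOuterClass.Telemetry;

public class TelemetryToPndaSketch {

    /**
     * Build the PNDA AVRO envelope for one protobuf encoded telemetry message.
     * Field 1 (node_id_str) becomes "source" and field 10 (msg_timestamp)
     * becomes "timestamp".
     */
    public static GenericRecord convert(byte[] kafkaPayload, Schema pndaSchema) throws Exception {
        Telemetry telemetry = Telemetry.parseFrom(kafkaPayload);

        GenericRecord event = new GenericData.Record(pndaSchema);
        event.put("source", telemetry.getNodeIdStr());       // protobuf field 1
        event.put("timestamp", telemetry.getMsgTimestamp()); // protobuf field 10
        event.put("rawdata", ByteBuffer.wrap(kafkaPayload)); // payload kept in the envelope (assumption for this sketch)
        return event;
    }
}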

JSON Messages

It is also possible to ingest messages with a JSON payload directly. When a message arrives on a topic for which Gobblin has no specific configuration, Gobblin uses the PNDAFallbackConverter.

The behaviour of the PNDAFallbackConverter is to use the Kafka topic as the source name and System.currentTimeMillis() to provide the timestamp. The content of the Kafka message is put in the Avro rawdata field unchanged.

topic → source
payload → rawdata
time now → timestamp
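
In code, this fallback mapping amounts to something like the following sketch (an illustration of the documented behaviour, not the PNDAFallbackConverter source itself):

import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class FallbackConversionSketch {

    /** Wrap an arbitrary Kafka message in the PNDA envelope using only the topic and receive time. */
    public static GenericRecord convert(String kafkaTopic, byte[] kafkaPayload, Schema pndaSchema) {
        GenericRecord event = new GenericData.Record(pndaSchema);
        event.put("source", kafkaTopic);                     // topic → source
        event.put("timestamp", System.currentTimeMillis());  // time now → timestamp
        event.put("rawdata", ByteBuffer.wrap(kafkaPayload)); // payload → rawdata, unchanged
        return event;
    }
}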

A VES event received on Kafka topic “raw.json” would result in an Avro encoded message like this:

{"timestamp":1543849236770,
"source":"raw.json",
"rawdata":"{ \"commonEventHeader\" : { ... } }"

Conclusion

As you can see, PNDA has powerful Gobblin based ingest capabilities out of the box. Its behaviour can be customised with additional Gobblin configuration and extended with new converter classes. You can explore the converter classes by looking at the project on GitHub.

https://github.com/pndaproject/platform-gobblin-modules/tree/develop/src/main/java/gobblin/pnda
