In networking, we’ve long been used to network management and monitoring products and solutions that use an underlying RDBMS. Such systems implement a “schema on write” model: you define your RDBMS schema and, when you write data to that RDBMS, you “normalise” it to the schema. When you read the data back, it’s in the format defined by that schema.
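To make the model concrete, here is a minimal sketch of schema on write, using an in-memory SQLite table as a stand-in RDBMS; the table, field names, and the sample event are all hypothetical:

```python
import sqlite3

# Schema on write: the schema is fixed up front, and every record is
# normalised to it at insert time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (device TEXT, severity INTEGER, message TEXT)")

raw_event = {
    "device": "router-1",
    "severity": 3,
    "message": "interface Gi0/1 down",
    "vendor_extra": {"slot": 4},  # present in the raw data, but not in the schema
}

# Only the columns defined in the schema survive the write; "vendor_extra"
# is silently dropped during normalisation.
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    (raw_event["device"], raw_event["severity"], raw_event["message"]),
)

row = conn.execute("SELECT * FROM events").fetchone()
print(row)  # ('router-1', 3, 'interface Gi0/1 down')
```

Note that the `vendor_extra` field is gone once the write completes, which is exactly the information-loss problem discussed below.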
Although we’ve long been used to this model, it can suffer from a number of problems:
- Trying to create a one-size-fits-all schema is a challenge that requires significant data modelling up front. This is even more challenging when you are trying to aggregate disparate data sets.
- No matter how good the data modelling, the uses of the data may change over time, which may require changes to active schemas. Such changes can be difficult; imagine changing an engine on a moving car.
- You may discover that the “normalised” data is missing useful information that was available in the raw data. Even if you enhance your schema to include that information, you may not be able to recover what has already been lost from the data normalised under the old schema.
To overcome these issues, the big data analytics world tends towards a “schema on read” model. With schema on read, the data is stored as is, and then any schema is applied by the application that reads the data.
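A minimal sketch of schema on read, assuming hypothetical raw records stored as JSON lines; each reader function applies its own schema, and neither touches the stored data:

```python
import json

# Schema on read: raw records are stored untouched, and each reading
# application applies whatever schema it needs.
raw_store = [
    '{"device": "router-1", "severity": 3, "message": "interface Gi0/1 down", "slot": 4}',
    '{"device": "router-2", "severity": 5, "message": "config saved"}',
]

def read_alerts(store):
    """One reader's schema: device and severity, for severe events only."""
    for line in store:
        record = json.loads(line)
        if record.get("severity", 7) <= 3:
            yield {"device": record["device"], "severity": record["severity"]}

def read_inventory(store):
    """A second reader applies a different schema to the same raw data."""
    for line in store:
        record = json.loads(line)
        if "slot" in record:
            yield {"device": record["device"], "slot": record["slot"]}

print(list(read_alerts(raw_store)))     # [{'device': 'router-1', 'severity': 3}]
print(list(read_inventory(raw_store)))  # [{'device': 'router-1', 'slot': 4}]
```

Because the raw store is never rewritten, a new reader with a new schema (or a fix to an old one) can be added at any time, which is the flexibility the next list describes.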
The schema on read model provides more flexibility and extensibility in that:
- Different applications can apply different schemas to the data, in parallel
- It’s generally an easier job to change the application schema, and that does not impact the raw data or other applications
- Whatever schema is applied, no information is lost if the raw data is maintained
- The resulting cost of experimentation is low
Schema on read is an obvious fit where we bring together multiple network datasets (event data, metric data, flow data, …), and need to enable different users and applications to process different (often overlapping) subsets of the data in parallel.
The potential disadvantage of the schema on read model is read performance, because of the effort required to apply the schema on every read. To overcome this, a common deployment architecture is to save the data processed by an application to a derived data store.
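A sketch of that pattern, under the same hypothetical JSON-lines raw store: the schema is applied once to materialise a derived view, after which reads are cheap key lookups rather than repeated parsing:

```python
import json

raw_store = [
    '{"device": "router-1", "severity": 3}',
    '{"device": "router-2", "severity": 5}',
    '{"device": "router-1", "severity": 1}',
]

# One pass applies the schema and builds a derived store: worst (lowest,
# following syslog convention) severity seen per device.
worst_severity = {}
for line in raw_store:
    record = json.loads(line)
    device, severity = record["device"], record["severity"]
    worst_severity[device] = min(worst_severity.get(device, 7), severity)

# Subsequent reads hit the derived store, not the raw data.
print(worst_severity["router-1"])  # 1
```

The raw store is still the source of truth; if the derived view turns out to be wrong or insufficient, it can simply be rebuilt from the raw data.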
This is a model that we commonly use in PNDA, where HDFS is used as the master data store for the raw data, and then HBase or OpenTSDB is used to store derived data, for key-value or metric data respectively. This combination provides the benefits of both extensibility and performance. Of course, it’s always possible to write the derived data back to the master data store, further enriching the master data set.
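The write-back step can be sketched as follows; the in-memory stores are stand-ins for the master store (HDFS in PNDA) and a key-value derived store (HBase), and the counter and metric names are hypothetical:

```python
import json

master_store = [
    '{"device": "router-1", "rx_bytes": 1000, "interval_s": 10}',
]
derived_store = {}  # key-value store of derived metrics per device

# Derive a metric (receive rate) and keep it in the fast derived store;
# iterate over a copy so appending below is safe.
for line in list(master_store):
    record = json.loads(line)
    rate = record["rx_bytes"] / record["interval_s"]
    derived_store[record["device"]] = rate
    # Write the enriched record back to the master store, so later readers
    # see the derived value alongside the raw counters.
    record["rx_rate_bps"] = rate
    master_store.append(json.dumps(record))

print(derived_store)  # {'router-1': 100.0}
```

The enriched record is appended rather than overwritten, preserving the raw data as-is.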