Schema on write vs. schema on read

In networking, we’ve long been used to network management and monitoring products and solutions that are built on an underlying RDBMS. Such systems implement a “schema on write” model: you define your RDBMS schema, and when you write data to the RDBMS you “normalise” it to that schema. When you read the data back, it’s in the format defined by the schema.


Although we’ve long been used to this model, it can suffer from a number of problems:

  • Trying to create a one-size-fits-all schema is a challenge which requires significant data modelling up front. This is even more challenging when you are trying to aggregate disparate data sets.
  • No matter how good the data modelling, the uses of the data may change over time. That may require changes to active schemas, which can be difficult; imagine changing the engine on a moving car.
  • You may discover that the “normalised” data is missing useful information which was available in the raw data. Even if you enhance your schema to include that information, you may not be able to recover what has already been lost from the data that was normalised earlier.
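
The last point can be illustrated with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the RDBMS; the event fields, the table layout and the `write_normalised` helper are all hypothetical, chosen only for illustration.

```python
import sqlite3

# Hypothetical raw events as they might arrive from network devices.
RAW_EVENTS = [
    {"ts": "2017-04-26T12:07:27Z", "host": "edge-1", "severity": "warn",
     "msg": "link flap", "interface": "ge-0/0/1"},
    {"ts": "2017-04-26T12:08:02Z", "host": "edge-2", "severity": "info",
     "msg": "bgp up", "peer": "192.0.2.1"},
]


def write_normalised(conn, event):
    """Schema on write: keep only the columns the schema defines."""
    conn.execute(
        "INSERT INTO events (ts, host, severity, msg) VALUES (?, ?, ?, ?)",
        (event["ts"], event["host"], event["severity"], event["msg"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, host TEXT, severity TEXT, msg TEXT)")
for ev in RAW_EVENTS:
    write_normalised(conn, ev)  # "interface" and "peer" are dropped here

# Later the schema is extended, but rows that were already written
# have nothing to fill the new column with.
conn.execute("ALTER TABLE events ADD COLUMN interface TEXT")
rows = conn.execute("SELECT host, interface FROM events ORDER BY ts").fetchall()
print(rows)
```

The new `interface` column is empty (`None`) for every existing row, even though the first raw event carried that value when it arrived.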

To overcome these issues, the big data analytics world tends towards a “schema on read” model.  With schema on read, the data is stored as is, and then any schema is applied by the application that reads the data.
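
As a rough sketch of the idea, assume the raw records are kept as JSON lines (a plain Python list stands in for the data store here). The record fields and the `read_with_schema` helper are hypothetical names used only for this example.

```python
import json

# Raw records stored exactly as received; nothing is dropped on write.
RAW_STORE = [
    '{"ts": "2017-04-26T12:07:27Z", "host": "edge-1", "msg": "link flap", "interface": "ge-0/0/1"}',
    '{"ts": "2017-04-26T12:08:02Z", "host": "edge-2", "msg": "bgp up", "peer": "192.0.2.1"}',
]


def read_with_schema(raw_lines, fields):
    """Schema on read: project each raw record onto the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields absent from a record come back as None rather than failing.
        yield {name: record.get(name) for name in fields}


# The reading application decides which fields it cares about.
link_view = list(read_with_schema(RAW_STORE, ["host", "interface"]))
print(link_view)
```

The schema here is just a list of field names chosen by the reader; a real application would typically also convert types and validate values at this point.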


The schema on read model provides more flexibility and extensibility in that:

  • Different applications can apply different schemas to the data, in parallel
  • It’s generally an easier job to change the application schema, and that does not impact the raw data or other applications
  • Whatever schema is applied, no information is lost if the raw data is maintained
  • The resulting cost of experimentation is low
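
To make the first and third points concrete, here is a small sketch in which two hypothetical applications (`capacity_app` and `inventory_app`) read the same raw metric samples with different schemas. The CSV layout and field names are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw metric samples, kept as untyped CSV text.
RAW = """ts,host,ifname,util_pct
1493208447,edge-1,ge-0/0/1,40
1493208507,edge-1,ge-0/0/1,60
1493208447,edge-2,ge-0/0/2,10
"""


def capacity_app(raw):
    """Schema A: treat util_pct as a float and average it per host."""
    samples = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw)):
        samples[row["host"]].append(float(row["util_pct"]))
    return {host: sum(vals) / len(vals) for host, vals in samples.items()}


def inventory_app(raw):
    """Schema B: ignore the metric values and collect interfaces per host."""
    seen = defaultdict(set)
    for row in csv.DictReader(io.StringIO(raw)):
        seen[row["host"]].add(row["ifname"])
    return dict(seen)


mean_util = capacity_app(RAW)
interfaces = inventory_app(RAW)
print(mean_util)
print(interfaces)
```

Neither application modifies `RAW`, so changing or replacing one of them has no effect on the other or on the stored data.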

Schema on read is an obvious fit where we bring together multiple network datasets (event data, metric data, flow data, …), and need to enable different users and applications to process different (often overlapping) subsets of the data in parallel.

The potential disadvantage of the schema on read model is read performance, because the schema has to be applied every time the data is read. To overcome this, a common deployment architecture is to save the data processed by the application to a derived data store.
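
A minimal sketch of that architecture follows. A Python dict stands in for the derived data store, and the record fields plus the `build_derived` and `query` helpers are hypothetical; the point is only that the schema is applied once, and later reads become simple lookups.

```python
import json

# Master data store: raw records kept as received (JSON lines).
RAW_STORE = [
    '{"ts": 1493208447, "host": "edge-1", "util_pct": "40"}',
    '{"ts": 1493208507, "host": "edge-1", "util_pct": "60"}',
    '{"ts": 1493208447, "host": "edge-2", "util_pct": "10"}',
]

# Derived data store: a dict standing in for a key-value store.
derived_store = {}


def build_derived(raw_lines, store):
    """Apply the schema once and persist a per-host summary."""
    for line in raw_lines:
        rec = json.loads(line)
        entry = store.setdefault(rec["host"], {"count": 0, "max_util": 0.0})
        entry["count"] += 1
        entry["max_util"] = max(entry["max_util"], float(rec["util_pct"]))


def query(store, host):
    """Reads hit the derived store and never re-parse the raw data."""
    return store.get(host)


build_derived(RAW_STORE, derived_store)
print(query(derived_store, "edge-1"))
```

The raw data stays in the master store, so the derived store can be rebuilt with a different schema whenever the application's needs change.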


This is a model that we commonly use on PNDA, where HDFS is used as the master data store for the raw data, and HBase or OpenTSDB is used to store derived data, for key-value or metric data respectively. It provides the benefits of both extensibility and performance. Of course, it’s always possible to write the derived data back to the master data store, further enriching the master data set.


