There are several different methods of storing and querying data in PNDA and they trade off ease of use, query time and flexibility of computation. In this post I give my thoughts on some methods I’ve used in the past, and tips for when to use each approach.
Store data in HDFS and produce static reports
see results in hours, fixed queries
The simplest option is to store data in large flat HDFS files and crunch them periodically using Spark or map reduce applications to produce data for summary reports and charts. This is slow and fixed, but if you need to work out what happened in the last few hours / days it could be the best option for you. The result could be a fixed file that is exported for distribution by an operator. This also requires a bespoke data processing application to produce the result file from the data.
Store data in HBase and query it using OpenTSDB or Grafana
see results in seconds, timeseries and simple aggregations only
If you have a time series of data and only require graphs of those time series, or simple aggregations of metrics within the series then OpenTSDB is a great option. It stores data in HBase but abstracts the complexity of organising the data and indexes for you.
Grafana is a dashboard creation tool which can load data from OpenTSDB. It is very quick and easy to produce nice-looking, live dashboards of series based data, again with simple aggregations. The degree of styling is quite limited but if you are happy to accept what their charts look like and have some simple metrics to chart, then it is incredibly quick and simple to create them.
Store data in HBase and query it using SQL from a BI tool or bespoke UI
see results in seconds to minutes, careful query and schema design required
If you need a more powerful querying capability as provided by SQL, then Impala can be used to query data from HBase using full SQL syntax. Impala runs a daemon on each datanode in the cluster and computes SQL queries in parallel over data in HBase (or HDFS). For this reason it is also important to ensure the data is balanced around the available HBase region servers. The query time depends hugely on the structure in HBase which must be arranged to allow efficient queries of the type required, so this is no magic bullet.
HBase only provides a single index – the row key. Any queries using this row key will be very quick, and queries trying to process other fields will be slow. So the trick with Impala is to reduce the quantity of data that must be considered using the row key index and then perform processing using SQL on this subset. If that subset will fit in memory (100GB per Impala node is typical) then sets of queries over that same subset can be performed efficiently. Otherwise the only queries that will return quickly are ones using the HBase row key.
To design a row key for HBase, you must start with the query you want to make. Each clause in the query must be represented in the row key and used with an Impala ‘between’ statement. For example if you want to query all data for a given IP address in the last hour the row key could be defined as <IP Address>_<time>.
This would allow queries for:
- all data for a given subnet: WHERE key BETWEEN ‘054.067’ and ‘054.068’
- all data for a given IP address: WHERE key BETWEEN ‘054.067.043.001’ and ‘054.067.043.002’
- the last N hours of data for a given IP address: WHERE key BETWEEN ‘054.067.043.001_1487238283’ and ‘054.067.043.002_1487234283’
It would not allow the last N hours for all IP addresses because IP address appears first in the row key. If more than one form of query is required, it is normal in HBase to create duplicate tables each with the row key needed to support the query in question.
To see whether Impala can perform your query efficiently you can use the explain statement to show the query plan. If the plan shows a range scan with a start and end row keys then it won’t be scanning the whole table.
Store a subset into RDBMS for query by BI tool or bespoke UI
sub second flexible querying with standard SQL
If the query times are still too long using Impala, another option is to use a batch job to extract a subset of data small enough to fit into a RDBMS. This subset can then be queried very quickly and flexibly with normal SQL.