Simple application for ingestion and data access – real time (SAIDAR)

SAIDAR is a simple application that addresses a common business use case: storing continuously streaming data and making it accessible in real time. The data considered here has a fairly simple format, such as events generated by a web server, a storage appliance, or application code. This data is primarily subjected to time series analysis.

The solution offered here is nothing new in terms of architecture or accomplishment. A variety of third-party tools, both open source and proprietary, already exist. A good example is .

The intent of this solution is to let developers deploy something quickly, without asking their IT groups for specialized data storage or management software. It is custom code that can be extended.

In Part 1 of this series, I will explain the architecture. Part 2 will follow sometime in the future with code-level details. The first version of the tool will work only in standalone mode. The next revision will enable multi-node deployment, so that the benefits of cluster-based operation can be fully exploited.

Saidar Architecture Depiction

[Figure: Saidar architecture diagram]

The architecture has two types of components: Control Center and Services.

Control Center

  • Scheduler – Manages scheduled tasks such as polling for streamed data and updating the in-memory index
  • Configurator – The interface through which the settings of the various features are controlled
  • Monitoring – Tracks metrics and health
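
To make the Scheduler's role concrete, here is a minimal sketch of periodic task dispatch. It is not Saidar's actual code: the names `PeriodicTask`, `Scheduler`, `register`, and `tick`, as well as the one-second and five-second intervals, are assumptions made for illustration. Time is passed in explicitly so the behaviour is easy to follow and to test.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PeriodicTask:
    # Hypothetical task record; none of these names come from Saidar itself.
    name: str
    interval: float              # seconds between runs
    action: Callable[[], None]   # work to perform, e.g. poll a stream
    next_run: float = 0.0        # time at which the task is next due


@dataclass
class Scheduler:
    tasks: List[PeriodicTask] = field(default_factory=list)

    def register(self, name: str, interval: float,
                 action: Callable[[], None], now: float = 0.0) -> None:
        """Add a task whose first run is one interval after `now`."""
        self.tasks.append(PeriodicTask(name, interval, action, now + interval))

    def tick(self, now: float) -> List[str]:
        """Run every task that is due at time `now`; return the names that ran."""
        ran = []
        for task in self.tasks:
            if now >= task.next_run:
                task.action()
                task.next_run = now + task.interval
                ran.append(task.name)
        return ran


if __name__ == "__main__":
    log: List[str] = []
    scheduler = Scheduler()
    # Assumed intervals: poll every second, refresh the in-memory index every five.
    scheduler.register("poll_stream", 1.0, lambda: log.append("poll"))
    scheduler.register("refresh_index", 5.0, lambda: log.append("refresh"))
    for second in range(1, 6):
        scheduler.tick(float(second))
    print(log)
```

A real deployment would drive `tick` from a timer loop or a thread. The point of the sketch is only that polling and index refresh can share one small dispatch mechanism.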

Services

  • Ingestion Service – Handles entry of data into the tool via HTTP, file, or DB channels. It also has user extension points to help convert incoming data into a Saidar-compliant format
  • Indexer Service – Data received through the ingestion service is first persisted as an index, and the raw data is sent to a different service. An in-memory copy of the index is updated at regular intervals. This service supports an index scan feature for client searches. It also ranks the nodes of the in-memory tree and promotes or demotes them based on access patterns
  • Data Storage Service – Data that needs to be persisted to disk is packaged so that retrieval minimizes the number of disk seeks
  • Query Service – Interprets the client query, searches both the index and data storage, and aggregates the response according to the query. It supports user-defined code, since aggregation is very specific to the implementation
  • Access Service – The client entry point to the tool, exposed through a RESTful web service interface and also via TCP/IP
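
The way the services fit together can be sketched in a few lines. This is a toy model and not Saidar's implementation: the `MiniStore` class, the `(timestamp, source, value)` event layout, and the 60-second bucket size are all assumptions made for illustration. It keeps everything in memory and leaves out node ranking and on-disk packaging. What it does show is the flow from ingestion, through a time-bucketed index, to a query that applies a user-supplied aggregation.

```python
import bisect
from collections import defaultdict
from typing import Callable, Dict, List, Tuple, TypeVar

# Hypothetical event layout: (epoch seconds, source name, numeric value).
Event = Tuple[int, str, float]
T = TypeVar("T")


class MiniStore:
    """Toy stand-in for the ingestion, indexer, storage and query services."""

    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        self.records: List[Event] = []                        # raw data, append-only
        self.index: Dict[int, List[int]] = defaultdict(list)  # bucket start -> record positions
        self.buckets: List[int] = []                          # sorted bucket starts

    def ingest(self, event: Event) -> None:
        """Index the event by time bucket, then store the raw record."""
        timestamp = event[0]
        bucket = timestamp - timestamp % self.bucket_seconds
        if bucket not in self.index:
            bisect.insort(self.buckets, bucket)
        self.index[bucket].append(len(self.records))
        self.records.append(event)

    def query(self, start: int, end: int,
              aggregate: Callable[[List[Event]], T]) -> T:
        """Scan the index for start <= timestamp < end, then aggregate the matches."""
        first_bucket = start - start % self.bucket_seconds
        lo = bisect.bisect_left(self.buckets, first_bucket)
        hi = bisect.bisect_left(self.buckets, end)
        matches = [
            self.records[pos]
            for bucket in self.buckets[lo:hi]
            for pos in self.index[bucket]
            if start <= self.records[pos][0] < end
        ]
        return aggregate(matches)


if __name__ == "__main__":
    store = MiniStore(bucket_seconds=60)
    for event in [(0, "web", 1.0), (30, "web", 2.0), (65, "app", 3.0), (130, "web", 4.0)]:
        store.ingest(event)
    # User-defined aggregation: total value over the first two minutes.
    print(store.query(0, 120, lambda events: sum(e[2] for e in events)))
```

Passing the aggregation in as a function mirrors the idea that the Query Service accepts user-defined code. The sorted bucket list is what lets a time range query skip records outside the requested window.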

In subsequent posts, I will explain more about the technical components and their performance characteristics. The goal is to keep the design simple and easy to deploy and use. The challenge will be to determine how well this tool can handle large datasets.
