
Big Data Duplication - don't do it

Vinay Samuel : May 8, 2017 9:39:00 PM


Why do data analytics and big data projects create massive, unnecessary cost overheads in the form of data duplication, needless transformation and data lake proliferation?

Data is streaming into organisations at unparalleled speed. Streaming data (such as clickstream, machine logs, sensor data and device data) can help the business make better real-time decisions ONLY when it is analysed in context with historic information. Fast, complete analysis of real-time and historic data together makes the organisation more competitive and dynamic in the market.

The trouble is that most organisations still believe that input data must be copied from source (or operational) systems into a central analytical processing store before it can be analysed. The common belief is that a physical data lake or warehouse (with data duplicated from other systems) is what’s needed for big data, deep analytics and machine learning applications. In truth, this is a limitation of technology capability, not a best practice. Massive costs are incurred in moving data, integrating it into a common view and format, and then analysing it. The approach is “old-school” even when Hadoop or Spark is the core technology framework, and it is the reason organisations take months or years to deliver analytical projects while data integration costs run out of control.

Many organisations, limited by the wrong technologies and tied to this old "centralised platform" thinking, are forced to ignore their true big data opportunity: in-context, real-time analytics and interaction. Traditional data integration methods and physical data lakes slow the extraction of value from analytics and restrain the ability to create truly inspiring digital experiences for customers. When you consider the update cycles, structure and location of data across organisations and the Internet, traditional data lake and data warehousing technologies are out of sync with new streaming real-time data sources. New digital models that require real-time mass customer interaction depend on combining historic data with data in the stream, and a centralised physical data platform is problematic for that.

Zetaris is changing the data landscape in organisations by challenging the myth that data from across the organisation, or the network, needs to be centralised in order to analyse and act upon it. Zetaris' Lightning - Virtual Data Lake - is stopping this costly duplication of data, human effort, and processing.

Zetaris Lightning - the Virtual Data Lake - enables the query to be moved to the data where it lives. We give data scientists the ability to query data across the organisation, or network, without needing it to physically be in the same place. Zetaris technologies enable in-context real-time analytics for digital disruption. Through sophisticated analytical query optimisers, Zetaris Lightning ensures efficient processing across the network and handles the core challenges faced when distributing analytical workloads across your data centre.
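The idea of moving the query to the data can be illustrated with a small, generic sketch. This is not Zetaris Lightning's API or implementation; it is a toy example, assuming two hypothetical sources (a historic `purchases` store and a recent `clicks` store) modelled as separate in-memory SQLite databases. Each source runs its own aggregation, and only the small summarised results travel to the coordinator, where they are combined.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an independent in-memory database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical source 1: historic purchases (illustrative data only).
history = make_source(
    "CREATE TABLE purchases (customer TEXT, amount REAL)",
    "INSERT INTO purchases VALUES (?, ?)",
    [("alice", 120.0), ("alice", 80.0), ("bob", 40.0)],
)

# Hypothetical source 2: recent clickstream events (illustrative data only).
stream = make_source(
    "CREATE TABLE clicks (customer TEXT, page TEXT)",
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "/pricing"), ("bob", "/home"), ("bob", "/pricing"), ("carol", "/home")],
)


def pushdown(conn, sql):
    """Run the aggregation where the data lives; return only the summary rows."""
    return dict(conn.execute(sql).fetchall())


def federated_view():
    """Combine per-source summaries without copying the raw tables anywhere."""
    spend = pushdown(
        history, "SELECT customer, SUM(amount) FROM purchases GROUP BY customer"
    )
    clicks = pushdown(
        stream, "SELECT customer, COUNT(*) FROM clicks GROUP BY customer"
    )
    customers = sorted(set(spend) | set(clicks))
    return [(c, spend.get(c, 0.0), clicks.get(c, 0)) for c in customers]


if __name__ == "__main__":
    for row in federated_view():
        print(row)
```

In this sketch the raw rows never leave their source; only one summary row per customer is exchanged. A production query optimiser would also have to decide which filters, joins and aggregations to push down to each source, which this example does not attempt.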

With Zetaris Lightning, organisations don’t have to duplicate or move the data into yet another data store for analytics, reducing project timelines by a factor of five and massively reducing total project cost.