Data Warehouse Optimization with Apache Hadoop

Apache Hadoop Open Source Data Warehouse Architecture

For the past few years, we have heard a lot about the benefits of augmenting the Enterprise Data Warehouse with Hadoop. Data Warehouse vendors and Hadoop vendors alike showcase how Hadoop can handle unstructured data while the EDW continues to serve as the central data source in the enterprise.

The Enterprise Data Warehouse (EDW) is a standard component of a corporate data architecture because it provides valuable business insights and powerful decision analytics for front-line workers, executives, business analysts, data scientists, and software developers. The EDW built on Teradata, Oracle, DB2 or another DBMS is undergoing a revolutionary change. As data sources become richer and more diverse, storing everything in a traditional EDW is no longer the optimal solution. Big data technologies such as Apache Hadoop excel at managing large volumes of unstructured data and are coming into mainstream use; by integrating them with existing legacy data warehouse platforms, enterprises get the best of both worlds.

The figure shows the structure of a typical enterprise data warehouse. Data from various IT applications such as CRM, ERP and Supply Chain Management is fed into the data warehouse through ETL processes that transform it to the desired schema. After aggregation, the results are accessible to end users via Business Intelligence and Reporting tools.
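
To make that last step concrete, here is a minimal sketch of the kind of query a reporting tool might issue against the aggregated warehouse tables over JDBC; the connection URL, credentials and the sales_fact table are hypothetical and used only for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReportQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; a real BI tool would use the vendor's JDBC driver.
        try (Connection db = DriverManager.getConnection(
                     "jdbc:somevendor://warehouse-host/sales", "report_user", "secret");
             Statement stmt = db.createStatement();
             // Aggregate query over a (hypothetical) fact table built by the ETL process.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(revenue) AS total_revenue "
                   + "FROM sales_fact GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getDouble("total_revenue"));
            }
        }
    }
}
```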

Challenges with Proprietary Data Warehouse Platforms:

1 - Inefficient resource utilization, as ETL, application and analytics workloads compete with each other for the same capacity.

2 - Inadequate data for business demands. Because of capacity and performance constraints, some warehouses contain only summary data, not the granular and detailed information that the business needs.

3 - Wasted storage, with only ~20% of the data actively used in real time while the rest occupies expensive warehouse capacity.

4 - Vendor lock-in with proprietary solutions.

5 - Limited support for multi-structured data and little schema flexibility.

Apache™ Hadoop as an Open Source Data Warehouse Platform:

Apache Hadoop is an open source project that provides a parallel storage and processing framework, enabling customized analytical functions on commodity hardware. It scales out to clusters spanning tens to thousands of server nodes, making it possible to process very large amounts of data at a fraction of the cost of enterprise data warehouses. The key is the use of commodity servers: Hadoop makes this possible by replicating and distributing data across multiple nodes, racks and even data centers.
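
As a minimal sketch of how an application interacts with this storage layer, the example below uses the standard Hadoop FileSystem Java API to write a small file into HDFS with a replication factor of 3; the target path is an assumption, and the NameNode places the replicas on different nodes (and racks, where rack awareness is configured).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, so the
        // client talks to whatever cluster is configured there.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical target path; adjust to your cluster layout.
        Path target = new Path("/warehouse/offload/sample.txt");

        // Request 3 replicas of each block; HDFS distributes them across
        // nodes and racks to survive disk, node and rack failures.
        short replication = 3;
        try (FSDataOutputStream out = fs.create(target, replication)) {
            out.writeBytes("hello from a commodity node\n");
        }

        System.out.println("Replication factor reported by HDFS: "
                + fs.getFileStatus(target).getReplication());
    }
}
```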

Key Reasons to Augment the Existing Enterprise Data Warehouse with Apache™ Hadoop:

1. Reduce Cost - The Teradata Active Data Warehouse starts at $57,000 per terabyte. In comparison, the cost of a supported Hadoop distribution, including all hardware and data center costs, is around $1,450/TB, a fraction of the Teradata price. With such a compelling price advantage, it is a no-brainer to use Hadoop to augment, and in some cases even replace, the enterprise data warehouse.
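
To put those figures in perspective: assuming the per-terabyte prices scale roughly linearly, a 100 TB warehouse would cost on the order of $5.7 million on Teradata versus roughly $145,000 on Hadoop, close to a 40x difference.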

2. Off-load Data Processing - Hadoop provides an ideal, massively parallel platform for extraction and transformation jobs, making these among the first workloads to be moved to Hadoop. Where applicable, the Hadoop connectors that come with certain ETL tools can be used as well (although this option won't reduce the licensing cost of those tools).
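
As an illustrative sketch rather than a complete ETL pipeline, the MapReduce job below shows the kind of transformation work that can be pushed down to Hadoop: the mapper parses and validates raw comma-separated order extracts and the reducer aggregates revenue per customer, producing warehouse-ready summary rows. The record layout, field positions and paths are assumptions for the example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderAggregationJob {

    /** Parses raw "order_id,customer_id,amount" lines and drops malformed records. */
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 3) {
                return; // basic validation: skip incomplete rows
            }
            try {
                context.write(new Text(fields[1].trim()),
                        new DoubleWritable(Double.parseDouble(fields[2].trim())));
            } catch (NumberFormatException e) {
                // skip rows with a non-numeric amount
            }
        }
    }

    /** Sums order amounts per customer, producing warehouse-ready summary rows. */
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            for (DoubleWritable amount : amounts) {
                total += amount.get();
            }
            context.write(customer, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "order-aggregation");
        job.setJarByClass(OrderAggregationJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw extracts in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // summary rows for the warehouse load
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```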

3. Data Type Flexibility - Hadoop can access, ingest and process all data types and formats, including legacy mainframe, relational, social, machine, and other sources, while ensuring data quality.
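
Because HDFS stores files as opaque bytes and structure is imposed only at processing time ("schema on read"), ingestion can be as simple as copying raw files into the cluster. The sketch below, with hypothetical paths, lands a fixed-width mainframe extract and a JSON clickstream dump in HDFS unchanged.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // HDFS does not care about the format of the bytes it stores:
        // fixed-width mainframe extracts, JSON, CSV, XML, logs and so on.
        // Structure is applied later by the processing jobs that read them.
        fs.copyFromLocalFile(new Path("/data/exports/mainframe_extract.dat"),
                new Path("/raw/mainframe/"));
        fs.copyFromLocalFile(new Path("/data/exports/clickstream.json"),
                new Path("/raw/clickstream/"));

        System.out.println("Raw files landed in HDFS without any upfront schema.");
    }
}
```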

4. Off-load Data Storage - In many enterprises, 80% or more of the data in a data warehouse is not actively used. As new data comes in, the volume of data in the warehouse grows, leading to increased cost as more appliances or nodes and disks are added. Data is categorized as hot, warm or cold based on the frequency of access, and warm and cold data are prime targets for migration to a Hadoop cluster. Because Hadoop storage is inexpensive, dramatically lowering the amount of data stored in the warehouse yields considerable savings.
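
One common pattern for migrating warm and cold data is a periodic export job. The simplified sketch below, with an assumed JDBC URL, table name and cutoff date, reads rows older than the cutoff from the warehouse over JDBC and writes them to HDFS as delimited text; in practice, tools such as Apache Sqoop perform this kind of transfer in parallel and at scale.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ColdDataOffload {
    public static void main(String[] args) throws Exception {
        // Hypothetical warehouse connection; in practice use the vendor's JDBC driver and URL.
        Connection db = DriverManager.getConnection(
                "jdbc:somevendor://warehouse-host/sales", "etl_user", "secret");

        FileSystem fs = FileSystem.get(new Configuration());
        Path archive = new Path("/archive/orders/orders_pre_2012.csv");

        String query = "SELECT order_id, customer_id, amount, order_date "
                     + "FROM orders WHERE order_date < ?";
        try (PreparedStatement stmt = db.prepareStatement(query);
             FSDataOutputStream out = fs.create(archive)) {
            stmt.setString(1, "2012-01-01"); // cold-data cutoff (assumption)
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Write each cold row to the Hadoop archive as a CSV line.
                    out.writeBytes(rs.getString("order_id") + ","
                            + rs.getString("customer_id") + ","
                            + rs.getString("amount") + ","
                            + rs.getString("order_date") + "\n");
                }
            }
        }
        db.close();
        // Once the archive is verified, the same rows can be purged from the warehouse.
    }
}
```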

5. Backup and Recovery - By using commodity hardware and cheap, replicated disks, Hadoop has proven to be a safe and fast backup solution. It is easy to set up, and the attractive cost and recovery time make it an ideal choice for this important function. To ensure high availability and recovery from disasters, data is backed up to a geographically distributed data center (DR site), and Hadoop can be used at the DR site to keep a backup of the data warehouse.
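
A common way to keep the DR-site Hadoop cluster in sync with the primary cluster is the standard hadoop distcp tool, which copies data between HDFS clusters using a parallel MapReduce job. The sketch below simply launches it from Java; the NameNode URIs and backup paths are assumptions for the example.

```java
public class DrSync {
    public static void main(String[] args) throws Exception {
        // distcp runs a MapReduce job that copies files in parallel between
        // two HDFS clusters, here from the primary site to the DR site.
        ProcessBuilder pb = new ProcessBuilder(
                "hadoop", "distcp",
                "-update",                                   // only copy new or changed files
                "hdfs://primary-nn:8020/warehouse/backup",   // source (assumed URI)
                "hdfs://dr-nn:8020/warehouse/backup");       // destination (assumed URI)
        pb.inheritIO(); // stream distcp progress to this process's console
        int exitCode = pb.start().waitFor();
        System.out.println("distcp finished with exit code " + exitCode);
    }
}
```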

Interested in using the power of Hadoop EDW Integration? Find out more about CIGNEX Datamatics' proficiencies in this domain.