
Basics of Data Warehousing


Data warehousing:

Data warehousing is the practice of combining data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Common ways of accessing a data warehouse include queries, analysis, and reporting. Because data warehousing ultimately produces a single database, the number of sources can be as large as needed, provided that the system can handle the volume. The final result is homogeneous data, which can be more easily manipulated.

Data warehousing comprises two primary components: databases and hardware. In a data warehouse, multiple databases and data tables are used to store information. These tables are related to each other through common information, or keys. The size of a data warehouse is limited by the storage capacity of the hardware.
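
As a rough sketch of how warehouse tables relate through common keys, the following Python/SQLite example builds a tiny fact table and dimension table; all table and column names are invented for illustration.

import sqlite3

# A small star-style schema: a fact table related to a dimension table
# through a shared key. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Dimension table: descriptive attributes, one row per product.
conn.execute("""CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT)""")

# Fact table: measurements, tied to the dimension by product_key.
conn.execute("""CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_date   TEXT,
    amount      REAL)""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, '2024-01-15', 19.99)")

# The common key lets queries combine measurements with descriptions.
for row in conn.execute("""SELECT p.product_name, f.sale_date, f.amount
                           FROM fact_sales f
                           JOIN dim_product p USING (product_key)"""):
    print(row)   # ('Widget', '2024-01-15', 19.99)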

The hardware required for a data warehouse includes a server, hard drives, and processors. In most organizations, the data warehouse is accessible via the shared network or intranet. A data architect is usually responsible for setting up the database structure and managing the process of updating data from the original sources.


Data warehouse:

A data warehouse is a repository of an organization's electronically stored data. It is designed to facilitate reporting and analysis.
  • The purpose of a data warehouse is to store data consistently across the organization and to make organizational information accessible.
  • It is an adaptive and resilient source of information. When new data is added to the data warehouse, the existing data and technologies are not disrupted. The design of the separate data marts that make up the data warehouse must be distributed and incremental; anything else is a compromise.
  • The data warehouse not only controls access to the data but also gives its owners great visibility into the uses and abuses of that data, even after it has left the data warehouse.
  • The data warehouse is the foundation for decision-making.


Difference Between Data Warehousing And Business Intelligence:

Data warehousing and business intelligence are two terms that are a common source of confusion, both inside and outside of the information technology (IT) industry. 
Data warehousing refers to the technology used to actually create a repository of data. Business intelligence refers to the tools and applications used in the analysis and interpretation of data. 
 
Data warehousing and business intelligence have grown substantially and are forecast to experience continued growth into the future.     



Different Types of Data Warehouse Design: 

There are two main types of data warehouse design: top-down and bottom-up. The two designs have their own advantages and disadvantages.
Bottom-up is easier and cheaper to implement, but it is less complete, and data correlations are more sporadic.
In a top-down design, connections between data are obvious and well-established, but the data may be out of date, and the system is costly to implement.

Data marts are the central figure in data warehouse design. A data mart is a collection of data based around a single concept. Each data mart is a unique and complete subset of data. Each of these collections is completely correlated internally and often has connections to external data marts.

The way data marts are handled is the main difference between the two styles of data warehouse design. In the top-down design, data marts occur naturally as data is put into the system. In the bottom-up design, data marts are made directly and connected together to form the warehouse. While this may seem like a minor difference, it makes for a very different design.

The top-down method was the original data warehouse design. Using this method, all of the information the organization holds is put into the system. Each broad subject has its own general area within the databases. As the data is used, connections appear between correlated data points, and data marts emerge. In addition, any data in the system stays there permanently; even if the data is superseded or trivialized by later information, it remains in the system as a record of past events.

The bottom-up method of data warehouse design works from the opposite direction. A company puts in information as a standalone data mart. As time goes on, other data sets are added to the system, either as their own data mart or as part of one that already exists. When two data marts are considered connected enough, they merge together into a single unit.

The two data warehouse designs each have their own strong and weak points. The top-down method is a huge project for even smaller data sets. Since big projects are also more costly, it is the most expensive in terms of money and manpower. If the data warehouse is finished and maintained, it is a vast collection, containing everything that the company knows.

The bottom-up process is much faster and cheaper, but since the data is entered as needed, the database will never actually be complete. In addition, correlations between data marts are only as strong as their usage makes them. If a strong correlation exists, but no users see it, it goes unconnected.       


Data Warehouse Architecture:

[Architecture diagram: data flows from the source systems into a staging area, from the staging area into the data warehouse, and from the warehouse to OLAP tools for reporting and analysis.]

Source Systems/Data Sources
Typically, in any organization, data is stored in various databases, usually divided up by system. There may be data for marketing, sales, payroll, engineering, and so on. These systems might be legacy/mainframe systems, flat files, or relational database systems.
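
As a minimal sketch of how such varied sources can feed one pipeline, the Python extractors below read a hypothetical flat file and a hypothetical relational table and emit records in a single common shape (the file layout, columns, and table names are all assumptions for the example).

import csv
import sqlite3

# Each extractor hides its source's format and yields the same record
# shape, so downstream staging code can treat all sources uniformly.

def read_flat_file(path):
    # e.g. a nightly CSV export from a legacy system (hypothetical layout)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"customer_id": row["cust_id"].strip(),
                   "amount": float(row["amt"])}

def read_relational(conn):
    # e.g. an orders table in an operational database (hypothetical schema)
    for cust_id, amount in conn.execute("SELECT customer_id, amount FROM orders"):
        yield {"customer_id": str(cust_id), "amount": float(amount)}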


Staging Area
The data coming from the various source systems is first kept in a staging area. The staging area is used to clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. Data arriving from a source system is initially kept as-is in this area. The staging area need not be based on relational technology; sometimes the managers of the data are more comfortable with a normalized set of data, and in those cases a normalized structure for the data staging storage is certainly acceptable. Also, the staging area does not provide querying or presentation services.
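
A minimal Python sketch of two of those staging steps, standardizing fields and de-duplicating records, under an assumed record layout:

# A sketch of two typical staging steps: standardize fields, then drop
# duplicate records. The record layout is hypothetical.

def clean(record):
    return {
        "customer_id": record["customer_id"].strip(),
        "email": record["email"].strip().lower(),  # standardize for matching
        "amount": round(float(record["amount"]), 2),
    }

def deduplicate(records):
    seen = set()
    for rec in records:
        key = (rec["customer_id"], rec["email"])   # de-duplication key
        if key not in seen:
            seen.add(key)
            yield rec

raw = [
    {"customer_id": " 42 ", "email": "A@X.COM ", "amount": "10.5"},
    {"customer_id": "42",   "email": "a@x.com",  "amount": "10.50"},
]
print(list(deduplicate(clean(r) for r in raw)))   # one record survives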


Warehouse
Once the data is in the staging area, it is cleansed and transformed and then sent to the data warehouse. You may or may not have an operational data store (ODS) before transferring data to the data warehouse.
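
A minimal sketch of that hand-off in Python (schema and names invented for illustration): loading a cleansed staging row means looking up, or creating, the dimension's surrogate key, then inserting the fact row.

import sqlite3

# A sketch of the staging-to-warehouse hand-off: look up (or create) the
# dimension row for a cleansed staging record, then insert the fact with
# the dimension's surrogate key. Schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT UNIQUE)""")
conn.execute("CREATE TABLE fact_orders (customer_key INTEGER, amount REAL)")

def load_fact(staged):
    found = conn.execute("SELECT customer_key FROM dim_customer WHERE customer_id = ?",
                         (staged["customer_id"],)).fetchone()
    if found:
        key = found[0]
    else:   # first time this customer is seen: create the dimension row
        key = conn.execute("INSERT INTO dim_customer (customer_id) VALUES (?)",
                           (staged["customer_id"],)).lastrowid
    conn.execute("INSERT INTO fact_orders VALUES (?, ?)", (key, staged["amount"]))

load_fact({"customer_id": "42", "amount": 10.5})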
 


OLAP
The data in the data warehouse has to be easy to manipulate in order to answer business questions from management and other users. This is accomplished by connecting the data to fast, easy-to-use tools known as Online Analytical Processing (OLAP) tools. OLAP tools can be thought of as super high-speed forklifts with knowledge of the warehouse and its operators built in, allowing ordinary people off the street to jump in and quickly find products by asking English-like questions. Within the OLAP server, data is reorganized to meet the reporting and analysis requirements of the business, including:
    * Exception reporting
    * Ad-hoc analysis
    * Actual vs. budget reporting
    * Data mining (looking for trends or anomalies in the data)
In order to process business queries at high speed, some OLAP servers preprocess the answers to common questions, resulting in exceptional query response times at the cost of an OLAP database that may be several times larger than the data warehouse itself.
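
As a toy illustration of that pre-aggregation idea (the data is invented), the Python sketch below computes roll-up totals once so that a common question becomes a dictionary lookup rather than a scan of the facts.

from collections import defaultdict

# A sketch of OLAP-style pre-aggregation: summarize the detail rows once,
# then answer common questions with a lookup instead of rescanning the
# warehouse. The data is hypothetical.
facts = [
    {"region": "East", "month": "2024-01", "amount": 120.0},
    {"region": "East", "month": "2024-02", "amount": 80.0},
    {"region": "West", "month": "2024-01", "amount": 200.0},
]

# Pre-compute totals for every (region, month) combination and roll-ups.
totals = defaultdict(float)
for f in facts:
    totals[(f["region"], f["month"])] += f["amount"]
    totals[(f["region"], "ALL")] += f["amount"]      # roll-up across months
    totals[("ALL", f["month"])] += f["amount"]       # roll-up across regions

print(totals[("East", "ALL")])   # 200.0, answered without rescanning facts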

                                                                                          
