Showing posts from May, 2013

Types of facts

Measures in the Fact Tables

The values that quantify facts are usually numeric and are often referred to as measures. Measures are typically additive along all dimensions, such as Quantity in a sales fact table: a sum of Quantity by customer, product, time, or any combination of these dimensions results in a meaningful value.

Additive - Measures that can be summed across all of the dimensions in the fact table. A sales amount is a good example of an additive fact.

Semi-Additive - Measures that can be summed across some of the dimensions in the fact table, but not others. E.g., quantity-on-hand can be summed along the Warehouse dimension to get the total quantity-on-hand, but not along the Time dimension.

Non-Additive - Measures that cannot be summed along any dimension are truly non-additive. Non-additive measures can often be combined with additive measures to create new additive measures. E.g., a non-additive unit Price combines with Quantity to give an additive Sale Amount (Sale Amount = Quantity * Price). Facts that hold percentages or ratios are also non-additive. Calculated…
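As an illustration, here is a minimal SQL sketch against a hypothetical sales_fact table (the table and column names are assumptions, not from the original post). Summing the additive measures is safe along any dimension, while the non-additive unit price is re-derived from additive parts rather than summed:

    -- Additive: quantity and sale_amount can be summed across any dimension.
    SELECT product_id
          ,SUM(quantity)    AS total_qty
          ,SUM(sale_amount) AS total_sales
    FROM sales_fact
    GROUP BY 1;

    -- Non-additive: unit_price must not be summed;
    -- re-derive an average price from the additive measures instead.
    SELECT product_id
          ,SUM(sale_amount) / SUM(quantity) AS avg_unit_price
    FROM sales_fact
    GROUP BY 1;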

Dimension types

Rapidly Changing Dimensions - A dimension is considered rapidly changing if one or more of its attributes changes frequently in many rows. For a rapidly changing dimension, the dimension table can grow very large from the application of numerous type 2 changes.

Slowly Changing Dimensions - Attributes of a dimension may undergo changes over time. Whether the history of changes to a particular attribute should be preserved in the data warehouse depends on the business requirement. Such an attribute is called a slowly changing attribute, and a dimension containing such an attribute is called a Slowly Changing Dimension.

Conforming Dimensions - A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or contributes data to data marts. For example, a dimension such as customer, time, or product that is used in multiple schemas is called a conforming dimension if all copies of the dimension are the same. Use of conforming dimensions in multiple facts:…
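Since type 2 changes figure in both of the first two definitions above, here is a minimal type 2 SCD sketch in Teradata SQL (all table and column names are hypothetical): every attribute change closes the current row and inserts a new version, which is exactly what makes a rapidly changing attribute so costly.

    -- Hypothetical type 2 customer dimension: each change adds a new row,
    -- versioned by effective/end dates, preserving history.
    CREATE TABLE customer_dim (
        customer_sk  INTEGER NOT NULL,  -- surrogate key, one per version
        customer_id  INTEGER NOT NULL,  -- natural/business key
        city         VARCHAR(30),       -- slowly changing attribute
        eff_date     DATE NOT NULL,     -- version becomes effective
        end_date     DATE,              -- NULL for the current version
        current_flag CHAR(1)
    ) PRIMARY INDEX (customer_id);

    -- Close the old version and insert the new one when the city changes.
    UPDATE customer_dim
    SET end_date = DATE - 1, current_flag = 'N'
    WHERE customer_id = 1001 AND current_flag = 'Y';

    INSERT INTO customer_dim VALUES (2001, 1001, 'Chennai', DATE, NULL, 'Y');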

When is snowflaking of dimensions required?

A few key factors need to be considered when deciding on a snowflake schema as the data model in a DW initiative:

* Frequency of dimension attribute change
* Number of dimension attributes (large dimensions)
* Hierarchies within the dimension attributes
* History tracking of dimension attributes

Consider a type 2 SCD implementation used for history tracking. If a dimension table has 10 attributes and only one of them changes frequently in many rows, the table becomes huge, causing performance issues and unnecessary space consumption by the rarely changed attributes. Move the frequently changed attribute to a separate table and maintain referential integrity with the main dimension table, thus snowflaking the dimension to avoid the performance issues and unnecessary space consumption (see the sketch below).

If there are hierarchies within the dimension attributes, then keying between aggregate facts and the dimension table (with hierarchies) is not appropriate. A record in a sales f…
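A minimal sketch of that split, with hypothetical table and column names: the volatile attribute moves to its own table keyed back to the main dimension, so type 2 versioning only multiplies the rows of the small table.

    -- Main dimension: stable attributes only.
    CREATE TABLE product_dim (
        product_sk   INTEGER NOT NULL,
        product_id   INTEGER NOT NULL,
        product_name VARCHAR(50),
        category     VARCHAR(30)
    ) UNIQUE PRIMARY INDEX (product_sk);

    -- Snowflaked outrigger: the frequently changing attribute, versioned on its own.
    CREATE TABLE product_price_dim (
        price_sk   INTEGER NOT NULL,
        product_sk INTEGER NOT NULL,   -- RI back to product_dim
        list_price DECIMAL(10,2),
        eff_date   DATE NOT NULL,
        end_date   DATE
    ) UNIQUE PRIMARY INDEX (price_sk);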

Business Requirement Specifications/Findings for a Data Warehouse Initiative

The scope of a data warehouse initiative must be driven by business requirements. Requirements determine what data must be available in the data warehouse, how it is organized, and how often it is updated. High-level requirements analysis with business management should cover the following areas:

* Understand their key strategic business initiatives.
* Identify their key performance indicators or success metrics for each of the strategic business initiatives.
* Determine the core business processes they monitor and want to impact.
* Determine the potential impact on their performance metrics with improved access to improved business process information.

You will also need to conduct preliminary data discovery sessions to identify any glaring data feasibility issues.

Approach to Requirements Definition - Talk to the business users and get an overall understanding of the organization: What are the objectives of your organization/department? What are you trying to accomplish…

How does Teradata ensure that no duplicate rows are inserted into a SET table?

Teradata directs a newly inserted row, as per its Primary Index, to the target AMP (Access Module Processor) on the basis of its row hash value. If it finds the same row hash value already on that AMP, it compares the whole row to determine whether the new row is a duplicate. When the target table has a UPI (Unique Primary Index), the duplicate row is rejected with an error. In the case of a NUPI (Non-Unique Primary Index), it is rejected without any error being raised.
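A small sketch of the behavior described above (table and column names are hypothetical):

    -- SET table: full-row duplicates are not stored.
    CREATE SET TABLE emp_set (
        emp_id   INTEGER,
        emp_name VARCHAR(30)
    ) PRIMARY INDEX (emp_id);          -- NUPI

    INSERT INTO emp_set VALUES (1, 'John');

    -- Attempting to duplicate every row: for a SET table the duplicates
    -- from an INSERT ... SELECT are silently discarded, so the table is unchanged.
    INSERT INTO emp_set SELECT * FROM emp_set;

    -- MULTISET table: duplicate rows are allowed and all copies are kept.
    CREATE MULTISET TABLE emp_multi (
        emp_id   INTEGER,
        emp_name VARCHAR(30)
    ) PRIMARY INDEX (emp_id);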

Difference between RANK() and ROW_NUMBER() functions in Teradata

The RANK function in Teradata closely resembles rank as we use it in real life: for example, students with an equal number of marks are assigned the same rank. The ROW_NUMBER function in Teradata is used to generate a sequence number, so tied rows still receive distinct (arbitrarily ordered) numbers.

Using the RANK function:

    SELECT Student_name
          ,SUM(English + Maths) AS MARKS_Total
          ,RANK() OVER (ORDER BY MARKS_Total DESC) AS STUDENT_RANK
    FROM STUDENT
    GROUP BY 1;

Output of the above SQL:

    Student_name    MARKS_Total    STUDENT_RANK
    John            198            1
    David           165            2
    Alice           165            2

Using the ROW_NUMBER function:

    SELECT Student_name
          ,SUM(English + Maths) AS MARKS_Total
          ,ROW_NUMBER() OVER (ORDER BY MARKS_Total DESC) AS STUDENT_ROWNUM
    FROM STUDENT
    GROUP BY 1;

Output of the above SQL:

    Student_name    MARKS_Total    STUDENT_ROWNUM
    John            198            1
    David           165            2
    Alice           165            3
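A related Teradata idiom worth knowing here is QUALIFY, which filters on an ordered analytic function directly, without a derived table. A small sketch against the same STUDENT table, returning the top-ranked student(s), ties included:

    SELECT Student_name
          ,SUM(English + Maths) AS MARKS_Total
    FROM STUDENT
    GROUP BY 1
    QUALIFY RANK() OVER (ORDER BY MARKS_Total DESC) = 1;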

SQL for INSERT else UPDATE

    SELECT
        CASE WHEN E.Empid IS NOT NULL THEN 'UPDATE' ELSE 'INSERT' END AS Flag
       ,E1.Empid
       ,E1.Location
    FROM Empsource E1
    LEFT OUTER JOIN Emptarget E
        ON E.Empid = E1.Empid;
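The flag from the query above drives separate INSERT and UPDATE steps. Teradata also supports doing the upsert in one statement with MERGE; a minimal sketch against the same tables (Teradata requires the ON clause to cover the target's primary index, assumed here to be Empid):

    MERGE INTO Emptarget E
    USING Empsource E1
       ON E.Empid = E1.Empid
    WHEN MATCHED THEN
       UPDATE SET Location = E1.Location
    WHEN NOT MATCHED THEN
       INSERT (Empid, Location) VALUES (E1.Empid, E1.Location);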

SQL to find the space occupied by a table in a specific database

    SELECT
        Databasename
       ,Tablename
       ,SUM(Currentperm)/(1024*1024*1024) (DECIMAL(15,6)) AS Table_Size_GB
    FROM DBC.Tablesize
    WHERE Databasename = 'xyz_db'
      AND Tablename = 'xyz_Table_Name'
    GROUP BY 1,2;
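Because DBC.Tablesize carries one row per AMP per table, the same view can also show how evenly a table is spread across AMPs; a small sketch (the database name is a placeholder). A large gap between the maximum and average per-AMP figures indicates skew:

    SELECT
        Databasename
       ,Tablename
       ,SUM(Currentperm) AS Total_Perm
       ,MAX(Currentperm) AS Max_Perm_Per_AMP
       ,AVG(Currentperm) AS Avg_Perm_Per_AMP
    FROM DBC.Tablesize
    WHERE Databasename = 'xyz_db'
    GROUP BY 1,2
    ORDER BY Total_Perm DESC;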

Space utilization of datatypes in Teradata

BYTEINT – 1 byte
SMALLINT – 2 bytes
INTEGER – 4 bytes
BIGINT – 8 bytes
DECIMAL – 2 to 8 bytes
FLOAT – 8 bytes
DATE – 4 bytes
TIME – 6 bytes
TIME WITH TIME ZONE – 8 bytes
TIMESTAMP – 10 bytes
TIMESTAMP WITH TIME ZONE – 12 bytes

The range of values for the integer datatypes can be calculated with the formula 2^(no. of bytes * 8 - 1). For example, for SMALLINT: 2^((2*8) - 1) = 2^15 = 32768, so the range is -32768 to 32767.

Teradata Utilities

TPUMP (Teradata Parallel Data Pump)

* TPUMP allows near-real-time updates from transactional systems into the data warehouse.
* It can perform INSERT, UPDATE and DELETE operations, or a combination of them, from the same source.
* It can be used as an alternative to MLOAD for low-volume batch maintenance of large databases.
* TPUMP allows target tables to have secondary indexes, join indexes, hash indexes, referential integrity, populated or empty tables, MULTISET or SET tables, or triggers defined on the tables.
* TPUMP can have many sessions, as it doesn't have a session limit.
* TPUMP uses row-hash locks, thus allowing concurrent updates on the same table.

Limitations of the Teradata TPUMP utility:

* Use of the SELECT statement is not allowed.
* Concatenation of data files is not supported.
* Exponential and aggregate operators are not allowed.
* Arithmetic functions are not supported.

(A script sketch follows at the end of this post.)

MultiLoad - MultiLoad is a batch-mode utility in Teradata which is used to inse…
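For flavor, a minimal TPump script sketch under stated assumptions: the log table, input file, layout, database and column names are all hypothetical, and the option list is trimmed to the basics; consult the TPump reference for the full syntax. A sketch, not a production script:

    .LOGTABLE util_db.emp_tpump_log;        /* restart log table (hypothetical) */
    .LOGON tdpid/username,password;
    .BEGIN LOAD SESSIONS 4 PACK 20;         /* PACK = statements packed per request */
    .LAYOUT emp_layout;
    .FIELD emp_id   * VARCHAR(10);
    .FIELD emp_name * VARCHAR(30);
    .DML LABEL ins_emp;
    INSERT INTO hr_db.employee (emp_id, emp_name)
        VALUES (:emp_id, :emp_name);
    .IMPORT INFILE emps.txt LAYOUT emp_layout APPLY ins_emp;
    .END LOAD;
    .LOGOFF;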

PE - Parsing Engine

DEFINITION - PE, acronym for "Parsing Engine," is the type of vproc (Virtual Processor) responsible for session control, task dispatching and SQL parsing in the multi-tasking and possibly parallel-processing environment of the Teradata Database. A vproc is a software process running in an SMP (Symmetric Multiprocessing) environment or a node.

COMPONENTS - The components of a PE vproc can be classified as follows:
   1. Parser: decomposes SQL statements into RDBMS processing steps;
   2. Query Optimizer: decides the optimal way to access data;
   3. Step Generator: produces processing steps and encapsulates them into packages;
   4. Dispatcher: 1) transmits the encapsulated steps from the parser to the pertinent AMPs, and 2) per…

Teradata SQL Assistant (TSA)

Definition: Teradata SQL Assistant (TSA), part of Teradata Tools and Utilities (TTU), is an ODBC-based client utility used to access and manipulate data on ODBC-compliant database servers. It has two editions:

1) Teradata SQL Assistant for Microsoft Windows
2) Teradata SQL Assistant/Web Edition

Teradata SQL Assistant is an information discovery tool designed for Windows XP and Windows 2000. It retrieves data from any ODBC-compliant database server; the data can then be manipulated and stored on the desktop PC. Teradata SQL Assistant/Web Edition is a web-based query tool that allows you to compose queries, submit them to the Teradata Database, and then view the results in a web browser.

Overview: Teradata SQL Assistant for Microsoft Windows, originally called "Queryman" (before V. 6.0) or "QueryMan" (V. 6.0 and up), is also known as "SQLA" among programmers. It supports import/export tasks, but not the serious ones…

ODBC and JDBC

JDBC: Java Database Connectivity (JDBC) provides Java developers with a standard API (Application Programming Interface) that is used to access databases regardless of the driver and database product. To use JDBC you need at least JDK 1.1, a database, and a JDBC driver.

ODBC: ODBC is the acronym for Open DataBase Connectivity, a Microsoft Universal Data Access standard that started life as the Windows implementation of the X/Open SQL Call Level Interface specification. ODBC Driver Manager: an application binds to this generic library, which is responsible for loading the requested ODBC driver. ODBC Driver: dynamically loaded by the ODBC Driver Manager to make the connection to the target database.

Differences between them:
1. ODBC is a Microsoft standard, while JDBC is for Java applications.
2. ODBC can't be used directly with Java because it uses a C interface.
3. ODBC makes use of pointers, which have been removed totally from Java.

AMP

DEFINITION - AMP, acronym for "Access Module Processor," is the type of vproc (Virtual Processor) used to manage the database, handle file tasks and manipulate the disk subsystem in the multi-tasking and possibly parallel-processing environment of the Teradata Database.

OVERVIEW - In reality, each AMP is an instance of the database management software responsible for accessing and manipulating data. As such, every AMP is allotted a part of the database to manage, and a part of the physical disk space to keep its set of database tables. Usually, the AMP obtains its portion of the disk space by being associated with a virtual disk (vdisk). It handles its disk reading/writing by using its file system software, which converts AMP steps (i.e., the steps from PEs) into physical data block requests. The AMPs are responsible for accessing and manipulating the data so as to complete the request processing. There may be multiple AMPs on one node, and the communication among the AMPs…

Vprocs

DEFINITION - Vproc, acronym for "Virtual PROCessor," is a self-contained instance of the processor software on a Teradata platform (an SMP platform or a node). Vprocs are the basic units of parallelism that make up the Teradata database server.

OVERVIEW - To put it simply, a virtual processor is a simulated processor in a processing software system, or a software version of a dedicated physical processor. Each vproc uses a portion of the physical processor resource and runs independently of other vprocs. The vprocs co-existing within a node share a single memory pool, a collection of free memory in the node; the portion of memory allocated from the pool to one vproc is returned to the pool after use, for use by other vprocs. A single SMP node is made up of multiple vprocs, and a single Teradata MPP system is made up of multiple SMP nodes. Vprocs run as multi-threaded processes to enable…

Teradata Architecture

Symmetric multiprocessing (SMP) - a single node that contains multiple CPUs sharing a memory pool.

Massively parallel processing (MPP) - multiple SMP nodes working together comprise a larger configuration. The nodes are connected using the BYNET, which allows multiple virtual processors on multiple system nodes to communicate with each other.

Shared-nothing architecture (MPP) - each vproc (the Access Module Processors and Parsing Engines are virtual processors) is responsible for its own portion of the database and does not share common components. Each AMP manages its own dedicated memory space and the data on its own vdisk; these are not shared with other AMPs. Each AMP uses system resources independently of the other AMPs, so they can all work in parallel for high overall system performance.

A node is made up of various hardware and software components. A clique is a set of Teradata nodes that share a common set of disk arrays. Cabling a subset of nodes to the same disk arrays creates…

BYNET

DEFINITION - BYNET, acronym for "BanYan NETwork," is a folded banyan switching network built upon the capacity of the YNET. It acts as a distributed multi-fabric interconnect linking PEs, AMPs and nodes on a Massively Parallel Processing (MPP) system.

OVERVIEW - Interconnect technology is important for parallel computing. The BYNET is Teradata's "system interconnect for high-speed, fault tolerant warehouse-optimized messaging between nodes." [11] As an indispensable part of the Teradata MPP system, it can be understood better with its predecessor, the YNET, in the background. In 1982, the YNET interconnect technology used on the DBC 1012 was patented for its parallelism. As a broadcast-based hardware solution, it linked all the IFPs, COPs, and AMPs together with circuit boards and cables in a dual-bus architecture. Two custom-built buses operated concurrently within the interconnect framework: YNET A to connect the IFPs and COPs on one side…

Journals

Permanent Journal - The purpose of a permanent journal is to maintain a sequential history of all changes made to the rows of one or more tables. Permanent journals help protect user data when users commit, uncommit or abort transactions. A permanent journal can capture a snapshot of rows before a change, after a change, or both; you use permanent journaling to protect data. Unlike the automatic journal, the contents of a permanent journal remain until you drop them. When you create a new journal table, you can use several options to control the type of information to be captured. You create permanent journals when you create a user or database; to create permanent journals within an existing user or database, use the MODIFY statement. Users activate permanent journaling by including the JOURNAL option in the CREATE or MODIFY statements for users or databases (a minimal sketch follows). You must allocate sufficient permanent space to a database or user that will contain permanent journals. If a database or u…
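A minimal sketch of the CREATE/MODIFY options described above (the database name, PERM figure and journal table name are assumptions):

    CREATE DATABASE sales_db FROM dbc AS
        PERM = 10000000000                       -- must also cover the journal
       ,BEFORE JOURNAL                           -- capture rows before a change
       ,AFTER JOURNAL                            -- capture rows after a change
       ,DEFAULT JOURNAL TABLE = sales_db.sales_jrnl;

    -- Activating journaling for an existing database instead:
    MODIFY DATABASE sales_db AS
        AFTER JOURNAL
       ,DEFAULT JOURNAL TABLE = sales_db.sales_jrnl;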

Basics of Data warehousing

Data warehousing: Data warehousing is combining data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume. The final result, however, is homogeneous data, which can be more easily manipulated. Data warehousing relies on two primary components: databases and hardware. In a data warehouse, there are multiple databases and data tables used to store information; these tables are related to each other through the use of common information, or keys. The size of a data warehouse is limited by the storage capacity of the hardware. The hardware required for a data warehouse includes a server, hard drives, and processors. In most organizations, the data warehouse is accessibl…

Data virtualization

Data virtualization is a process of offering a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

* Analogous to the concept of views in databases.
* Data virtualization tools come with capabilities for data integration, data federation, and data modeling.
* Requires more memory caching.
* Can integrate several data marts or data warehouses through a single data virtualization layer.

This concept and software is a subset of data integration and is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management. Composite, Denodo, and Informatica are among the largest players in the area of data virtualization.

Reference for the definition: http://www.b-eye-network.com/view/14815