
Informatica PowerCenter Partitioning

Informatica PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing.
This option provides a thread-based architecture and automatic data
partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.

Introduction:
With the Partitioning Option, you can execute optimal parallel sessions by dividing data processing into subsets that are run in parallel and spread among the available CPUs in a multiprocessor system. When different processors share the computational load, large data volumes can be processed faster.
When sourcing and targeting relational databases, the Partitioning Option enables PowerCenter to automatically align its partitions with database table partitions to improve performance. Unlike approaches that require manual data partitioning, the PowerCenter parallel engine guarantees data integrity automatically by dynamically realigning data partitions for set-oriented transformations (e.g., Aggregators or Sorters).

Detailed description:
Every mapping contains one or more source pipelines. A source pipeline consists of a source qualifier and all the transformations and targets that receive data from that source qualifier.
What is a partition?
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread.
Partitioning for a pipeline is specified by setting the following attributes:

1. Location of partition points (where to partition?)
Partition points mark the thread boundaries in a pipeline and divide the pipeline into stages.
A stage is a section of a pipeline between any two partition points. When you set a partition point at a transformation, the new pipeline stage includes that transformation.
The Integration Service sets partition points at several transformations in a pipeline by default.
When you add partition points, you increase the number of transformation threads, which can improve session performance.
The Integration Service can also redistribute rows of data at partition points, which can further improve session performance.

2. Number of partitions (how many?)
With PowerCenter, the number of partitions can be set at any partition point.
By default, the Integration Service defines a single partition in the source pipeline.
When you add partitions, you increase the number of processing threads, which can improve session performance.
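
To make the stage and thread arithmetic concrete, here is a minimal Python sketch, assuming an invented four-transformation pipeline with three partition points (PowerCenter does all of this internally; nothing here is configured in code):

```python
# Illustrative only: transformation names are invented. A pipeline with
# partition points at the Source Qualifier, the Aggregator, and the target
# is divided into three stages; each stage runs one thread per partition.
pipeline = ["SQ_ITEMS", "EXP_CALC", "AGG_COSTS", "TGT_ITEMS"]
partition_points = {"SQ_ITEMS", "AGG_COSTS", "TGT_ITEMS"}

# A new stage starts at every partition point, and the stage includes that
# transformation, as described above.
stages, current = [], []
for transformation in pipeline:
    if transformation in partition_points and current:
        stages.append(current)
        current = []
    current.append(transformation)
stages.append(current)

num_partitions = 3        # the number of partitions set for the session
print("stages :", stages)
print("threads:", len(stages) * num_partitions)
```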

3. Partition types (which type to choose?)
The Integration Service sets a default partition type at each partition point.
The partition type controls how the Integration Service redistributes data among partitions at partition points.
When you configure the partitioning information for a pipeline, you must specify a partition type at each partition point in the pipeline.

The Workflow Manager allows you to specify the following partition types:

Round-robin partitioning: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.

Hash partitioning: The Integration Service applies a hash function to a partition key to group data among partitions. If you select hash auto-keys, the Integration Service uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify the ports that form the partition key. Use hash partitioning where you want to ensure that the Integration Service processes groups of rows with the same partition key in the same partition.

Key range partitioning: You specify one or more ports to form a compound partition key. The Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.

Pass-through partitioning: The Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

You can specify different partition types at different points in the pipeline.
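
The following Python sketch mimics how rows could be routed under each of these partition types; the rows, the description key, and the key ranges are all invented for illustration and do not reflect PowerCenter internals:

```python
import itertools

# Illustrative only: invented rows; PowerCenter configures partitioning in the
# session properties rather than in code.
rows = [{"item_id": i, "desc": d} for i, d in
        enumerate(["ballpoint pen", "stapler", "ballpoint pen", "notebook", "stapler", "ruler"])]
num_partitions = 3

# Round-robin: rows are spread evenly, one partition after another.
rr = itertools.cycle(range(num_partitions))
round_robin = [(next(rr), row) for row in rows]

# Hash auto-keys style: rows with the same key always land in the same
# partition, so grouped or sorted transformations see complete groups.
hashed = [(hash(row["desc"]) % num_partitions, row) for row in rows]

# Key range: each partition owns a range of values on the key port(s).
ranges = [(0, 2), (2, 4), (4, 6)]                      # invented ranges on item_id
key_range = [(next(p for p, (lo, hi) in enumerate(ranges) if lo <= row["item_id"] < hi), row)
             for row in rows]

# Pass-through: rows keep whatever partition they already had; no redistribution.
pass_through = list(hashed)

for name, routed in [("round-robin", round_robin), ("hash", hashed),
                     ("key range", key_range), ("pass-through", pass_through)]:
    print(name, [(p, r["item_id"]) for p, r in routed])
```

Running the sketch shows that round-robin balances the row counts, while the hash routing keeps both "ballpoint pen" rows in one partition, which is exactly why hash auto-keys is used ahead of Sorters and Aggregators.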

Example:
Consider a mapping that reads data about items and calculates average wholesale costs and prices. The mapping must read item information from three flat files of various sizes and then filter out discontinued items. It sorts the active items by description, calculates the average prices and wholesale costs, and writes the results to a relational database in which the target tables are partitioned by key range.
When you use this mapping in a session, you can increase session performance by specifying different partition types at the following partition points in the pipeline:
• Source qualifier.
To read data from the three flat files concurrently, you must specify three partitions at the source qualifier. Accept the default partition type, pass-through.
• Filter transformation.
Since the source files vary in size, each partition processes a different amount of data. Set a partition point at the Filter transformation, and choose round-robin partitioning to balance the load going into the Filter transformation.
• Sorter transformation.
To eliminate overlapping groups in the Sorter and Aggregator transformations, use hash auto-keys partitioning at the Sorter transformation. This causes the Integration Service to group all items with the same description into the same partition before the Sorter and Aggregator transformations process the rows. You can delete the default partition point at the Aggregator transformation.
• Target.
Since the target tables are partitioned by key range, specify key range partitioning at the target to optimize writing data to the target.
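
The partitioning choices above can be summarized in a small configuration sketch; the transformation names and the ITEM_ID key ranges are assumptions made for illustration, and in practice these settings live in the Workflow Manager session properties:

```python
# Hypothetical summary of the partitioning choices for the example session;
# transformation names and the ITEM_ID key ranges are invented.
session_partitioning = {
    "number_of_partitions": 3,                     # one per source flat file
    "partition_points": {
        "SQ_ITEMS":   "pass-through",              # read the three files concurrently
        "FIL_ACTIVE": "round-robin",               # balance the uneven file sizes
        "SRT_DESC":   "hash auto-keys",            # keep identical descriptions together
        "TGT_ITEMS":  ("key range", "ITEM_ID",
                       [(None, 3000), (3000, 6000), (6000, None)]),
    },
}

for point, choice in session_partitioning["partition_points"].items():
    print(f"{point}: {choice}")
```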

Difference between partitioning at the database level and at the Informatica level
Informatica partitioning is about "how to load the data efficiently".
When you configure the partitioning information for a pipeline, you must define a partition type at each partition point in the pipeline. The partition type determines how the Integration Service redistributes data across partition points.

Database partitioning is about "how to store the data efficiently and how to retrieve it".
Informatica can also make use of database partitioning, as follows:

Database partitioning. The Integration Service queries the IBM DB2 or Oracle database system for table partition information. It reads partitioned data from the corresponding nodes in the database. You can use database partitioning with Oracle or IBM DB2 source instances on a multi-node tablespace. You can use database partitioning with DB2 targets.

1. When you use source database partitioning, the Integration Service queries the database system catalog for partition information. It distributes the data from the database partitions among the session partitions.
2. If the session has more partitions than the database, the Integration Service generates SQL for each database partition and redistributes the data to the session partitions at the next partition point.
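
As a rough sketch of the idea, assuming an invented Oracle table named ITEMS with three table partitions, the per-partition queries could look roughly like this (the SQL that the Integration Service actually generates depends on the database and the session configuration):

```python
# Hypothetical sketch: the session has 4 partitions, the Oracle table ITEMS
# has 3 table partitions (all names are invented for illustration).
db_partitions = ["ITEMS_P1", "ITEMS_P2", "ITEMS_P3"]     # read from the system catalog
session_partitions = 4

# One query per database table partition.
queries = [f"SELECT * FROM ITEMS PARTITION ({p})" for p in db_partitions]

# The queries are spread over the session partitions; a partition left without
# a query only receives rows after redistribution at the next partition point.
assignment = {s: [] for s in range(session_partitions)}
for i, query in enumerate(queries):
    assignment[i % session_partitions].append(query)

for s, qs in assignment.items():
    print(f"session partition {s}: {qs or 'no query; filled by redistribution'}")
```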

Partitioning will not always increase performance
Session performance with database partitioning depends on the data distribution across the database partitions. The Integration Service generates SQL queries against the database partitions, and these queries can perform union or join operations, which can result in large query statements that have a performance impact.






