Course Overview

This course applies to software version 10.2.2. Learn to accelerate Big Data Integration through mass ingestion, incremental loads, transformations, processing of complex files, and integration of data science using Python. Optimize Big Data system performance through monitoring, troubleshooting, and best practices, while gaining an understanding of how to reuse application logic for big data use cases.

Agenda

Module 1: Big Data Integration Course Introduction

  • Course Agenda
  • Accessing the lab environment
  • Related Courses

Module 2: Big Data Basics

  • What is Big Data?
  • Hadoop concepts
  • Hadoop Architecture Components
  • The Hadoop Distributed File System (HDFS)
  • Purposes of the NameNode and Secondary NameNode
  • MapReduce
  • “Yet Another Resource Negotiator” (YARN) (MapReduce Version 2)

Module 3: Data Warehouse Offloading

  • Challenges with traditional Data Warehousing
  • The requirements of an optimal Data Warehouse
  • The Data Warehouse Offloading Process

Module 4: Ingestion and Offload

  • PowerCenter Reuse Reports
  • Importing PowerCenter Mappings to Developer
  • Sqoop
  • SQL to Mapping capability
  • Partitioning and parallelism

Module 5: Big Data Management Architecture

  • The Big Data world
  • Build once, deploy anywhere
  • The Informatica abstraction layer
  • Polyglot computing
  • The Smart Executor
  • Open source and innovation
  • Connection architecture
  • Connections to third-party applications

Module 6: Informatica Polyglot Computing in Hadoop

  • Hive MR/Tez
  • Blaze
  • Spark
  • Native
  • The Smart Executor

Module 7: Mappings, Monitoring, and Troubleshooting

  • Configuring and running a mapping in Native and Hadoop environments
  • Execution Plans
  • Monitoring mappings
  • Troubleshooting mappings
  • Viewing mapping results

Module 8: Hadoop Data Integration Challenges and Performance Tuning

  • Challenges with executing mappings in Hadoop
  • Big Data Management Performance Tuning
  • Hive Environment Optimization
  • Tips

Module 9: Data Quality on Hadoop

  • The Data Quality process
  • Discover insights into your data
  • Collaborate and Create Data Improvement Assets
  • Modify, Manage, and Monitor Data Quality
  • Self-Service Data Quality
  • Executing Data Quality mappings on Hadoop

Module 10: Complex File Parsing

  • The Complex File Reader
  • The Data Processor transformation
  • The Complex File Writer
  • Performance Considerations: Partitioning
  • Parsing and processing Avro, Parquet, JSON, and XML files
  • Data Processor Transformation Considerations

Module 11: Accessing NoSQL Databases

  • CAP Theorem
  • HBase
  • MongoDB
  • Cassandra