Spark and Spark Streaming with Python

Description:

Apache Spark is a fast, general-purpose engine for large-scale data processing. It can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark lets you write applications quickly in Java, Scala or Python, and you can use it interactively from the Scala and Python shells. You can run Spark in its standalone cluster mode, on EC2, or on Hadoop YARN, and it can access data in HDFS, Cassandra, HBase, Hive, and any Hadoop data source.

For the Python course for analysts and BI professionals, click here

For the Spark with Java course, click here

This course teaches you to build Spark applications in Python. It provides a clear comparison between Spark and Hadoop and covers techniques for increasing application performance and enabling high-speed processing.

The Spark Streaming module explains how to build scalable, fault-tolerant streaming applications. It lets you work with large-scale streaming data using familiar batch-processing abstractions.

In addition, we will cover MLlib, Apache Spark’s scalable machine learning library. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Spark.

For more information about Apache Spark with Java & Spring click here

This course is designed for developers, BI experts, and analysts who have Python programming experience and have worked with datasets, including data analytics.

  • Working experience in Python programming
  • Basic knowledge of machine learning
  • Basic knowledge of SQL is helpful
  • Prior knowledge of Hadoop is not required

Introduction to Hadoop and the Hadoop Ecosystem

Installation and setup in the developer’s environment

Spark Architecture:

·         Infrastructure layer

·         Persistence layer

·         Integration layer

·         Analytics layer

·         Engagement layer

 Working with Datasets:

·         RDD, DataFrames – Explanation

·         Reading and writing data in Hive

·         Persisting data in CSV

·         Persisting data in JSON

·         Filtering the header

·         Missing values

·         Anomalous data

·         Split and parse a record

·         Lambda Functions

·         Filter records matching a condition: apply a boolean function

·         Compute aggregates using the reduce operation

·         Computing Averages with reduceByKey

·         Transform each record to another record

·         merge Function

·         Merging Pair RDDs

·         Types of Joins

·         Exploring data using Spark SQL

·         Understanding the Spark SQL query optimizer

·         Loading and processing CSV files with Spark SQL
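
Several of the topics above (filtering the header, handling missing values, splitting and parsing a record, lambda functions, reduce, and averages by key) can be previewed in a small sketch. This is plain Python rather than Spark itself, so it runs without a cluster; the sample lines, the parse and merge helpers, and reduce_by_key are hypothetical names made up for illustration. The comments note which RDD methods (filter, map, reduce, reduceByKey) play the corresponding role in PySpark.

```python
from functools import reduce

# Hypothetical sample data: a header line followed by "city,temp" records.
lines = [
    "city,temp",
    "Haifa,20.0",
    "Eilat,30.0",
    "Haifa,24.0",
    "Eilat,",  # record with a missing value
]

header = lines[0]

# Filter the header (in PySpark: rdd.filter(lambda ln: ln != header)).
records = [ln for ln in lines if ln != header]

# Split and parse a record into a (key, value) pair; None marks a missing value.
def parse(line):
    city, temp = line.split(",")
    return (city, float(temp)) if temp else None

# Transform each record (in PySpark: rdd.map(parse)), then drop missing values.
pairs = [p for p in map(parse, records) if p is not None]

# Merge two (sum, count) pairs; this is the function one would hand to reduceByKey.
def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

# Illustrative stand-in for rdd.reduceByKey: combine values that share a key.
def reduce_by_key(func, kv_pairs):
    acc = {}
    for key, value in kv_pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# Average per key: map each value to (value, 1), merge per key, then divide.
totals = reduce_by_key(merge, [(k, (v, 1)) for k, v in pairs])
averages = {k: s / n for k, (s, n) in totals.items()}

# Aggregate over all records with reduce and a lambda (in PySpark: rdd.reduce).
overall_max = reduce(lambda a, b: a if a > b else b, [v for _, v in pairs])

print(averages, overall_max)
```

Carrying a (sum, count) pair per key, instead of averaging partial averages, is what keeps the result correct when values for the same key are merged in any order.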

Introduction to Scala

Spark MLlib

Introduction to machine learning

Introduction to MLlib

·         linear regression problems

·         classification problems and decision tree algorithm

·         clustering problems and k-means algorithm

Spark Streaming

·         Spark Streaming internals

·         Kafka and Spark

·         Processing live data

·         Developing producers

·         Developing consumers

·         Developing a Spark Streaming consumer for Kafka

Monitoring and Performance Tuning

Yevgeni is a domain manager and technology lead for Big Data Development at Naya Technologies, a Java expert, and a senior lecturer at Naya Academy.
  • Start date: on demand
  • Days and hours: 09:00-16:30
  • Academic hours: 40
  • Course level: Advanced
  • Language of instruction: Hebrew/English