Spark and Spark Streaming with Python
Apache Spark is a fast, general-purpose engine for large-scale data processing. It can run workloads up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster on disk. Apache Spark is designed for writing applications quickly in Java, Scala, or Python, and you can use it interactively from the Scala and Python shells. You can run Spark in its standalone cluster mode, on EC2, or on Hadoop YARN, and access data in HDFS, Cassandra, HBase, Hive, and any other Hadoop data source.
For the Python for Analysts and BI course, click here
For the Spark with Java course, click here
This course will teach you to create applications in Spark using Python. It provides a clear comparison between Spark and Hadoop and covers techniques for increasing application performance and enabling high-speed processing.
The Spark Streaming module will show how easy it is to build scalable, fault-tolerant streaming applications. It lets you work with large-scale streaming data using familiar batch-processing abstractions.
In addition, we will cover MLlib, Apache Spark’s scalable machine learning library. We present an integrated view of data processing by highlighting the various components of these pipelines, including exploratory data analysis, feature extraction, supervised learning, and model evaluation. You will gain hands-on experience applying these principles using Spark.
For more information about Apache Spark with Java & Spring click here
This course is designed for developers, BI experts, and analysts with Python programming experience and hands-on experience working with datasets, including data analytics.
- Working experience in Python programming
- Basic knowledge of machine learning
- Basic knowledge of SQL is helpful
- Prior knowledge of Hadoop is not required
Introduction to Hadoop and the Hadoop Ecosystem
Installation and implementation in developer’s environment
Spark Architecture:
· Infrastructure layer
· Persistence layer
· Integration layer
· Analytics layer
· Engagement layer
Working with Datasets:
· RDD, DataFrames – Explanation
· Read/Write Data from HIVE
· Persisting data in CSV
· Persisting data in JSON
· Filtering the header
· Missing values
· Anomalous data
· Split and parse a record
· Lambda Functions
· Filter records matching a condition by applying a Boolean function
· Compute aggregates using the reduce operation
· Computing Averages with reduceByKey
· Transform each record to another record
· The merge function
· Merging Pair RDDs
· Types of Joins
· Exploring data using Spark SQL
· Understanding the Spark SQL query optimizer
· Loading and processing CSV files with Spark SQL
Introduction to Scala
Spark MLlib
Introduction to machine learning
Introduction to MLlib
· linear regression problems
· classification problems and decision tree algorithm
· clustering problems and k-means algorithm
Spark Streaming
· Spark Streaming inner workings
· Kafka and Spark
· Processing live data
· Developing producers
· Developing consumers
· Developing a Spark Streaming consumer for Kafka
Monitoring and Performance Tuning

- Start date: on demand
- Days and hours: 09:00-16:30
- Academic hours: 40
- Course level: Advanced
- Language of instruction: Hebrew/English