by Abbas Taher
Today, Big Data is becoming an important component of IT in large organizations. This talk presents three methods for aggregating big data: using Python dictionaries, and using PySpark, the Python API of Apache Spark. The talk aims to take complex methods and explain them in a simple, approachable way.
Apache Spark is one of the most popular Big Data frameworks, and PySpark is its Python API. PySpark is a great choice when you need to scale up your jobs to work with big data files. In this short overview, we shall present three methods to aggregate big data. First, we shall use Python dictionaries; then we shall present the two PySpark methods, “GroupBy” and “ReduceBy”, that perform the same aggregation work. All three approaches will be presented and explained using a Jupyter Notebook.
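As a rough illustration of the ideas above, the first approach can be sketched in plain Python; the PySpark equivalents are shown as comments, since they require a live SparkContext. The sample data (`sales`) is a hypothetical example, not taken from the talk.

```python
# First approach: aggregate (key, value) pairs with a plain Python dictionary.
# Hypothetical sample data for illustration.
sales = [("apples", 3), ("oranges", 2), ("apples", 5), ("oranges", 1)]

totals = {}
for fruit, qty in sales:
    # Accumulate a running sum per key, starting from 0 for unseen keys.
    totals[fruit] = totals.get(fruit, 0) + qty

print(totals)  # {'apples': 8, 'oranges': 3}

# The same aggregation in PySpark (assumes a SparkContext named sc):
#   rdd = sc.parallelize(sales)
#   rdd.reduceByKey(lambda a, b: a + b).collect()   # combines values per partition before shuffling
#   rdd.groupByKey().mapValues(sum).collect()       # shuffles all values per key, then sums them
```

The `reduceByKey` variant is generally preferred for sums because it combines values locally before the shuffle, while `groupByKey` moves every value across the network first.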
About the Author