Spark for Big Data Analysis

I recently completed Big Data Analysis with Scala and Spark on Coursera. While I had previously used PySpark at work, this course provided a really nice overview of the pros and cons of the different data abstractions (RDD vs. DataFrame vs. Dataset) and explained why I sometimes did not get the performance I expected. (Understand the difference between transformations and actions, and use persist() wisely!)
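To make the transformation vs. action point concrete, here is a minimal sketch of my own (not taken from the course material). It assumes a local Spark installation on the classpath; the object name `LazyEvalSketch`, the accumulator name `calls`, and the toy data are all made up for illustration. The accumulator counts how often the function passed to `map` actually runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical example object; the names and data are illustrative only.
object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch")
      .master("local[*]") // assumption: run locally on all cores
      .getOrCreate()
    val sc = spark.sparkContext

    // Counts invocations of the map function below.
    val calls = sc.longAccumulator("calls")

    val numbers = sc.parallelize(1 to 1000)

    // Transformation: only records the lineage, nothing is computed yet.
    val squares = numbers.map { n =>
      calls.add(1)
      n.toLong * n
    }
    println(s"map calls before any action: ${calls.value}")

    // Ask Spark to keep the computed partitions in memory after first use.
    squares.persist(StorageLevel.MEMORY_ONLY)

    // Actions: each one triggers a job.
    val total = squares.reduce(_ + _)
    println(s"total = $total, map calls after first action: ${calls.value}")

    val count = squares.count()
    println(s"count = $count, map calls after second action: ${calls.value}")

    squares.unpersist()
    spark.stop()
  }
}
```

No action has run when the first `println` executes, so the map function has not been called at that point. With `persist` in place and enough memory to hold the partitions, the second action should reuse the cached results instead of calling the map function again; removing the `persist` line should make the call counter grow on every action. Accumulators updated inside transformations can over-count when tasks are retried, so treat the counter as a teaching aid rather than an exact metric.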

The course is part of the Scala functional programming sequence, so all the homework assignments are written in Scala. Since I had no previous exposure to the language, it was a nice challenge to familiarize myself with the syntax (all the .!!) and recast many problems into a MapReduce framework.
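As an illustration of what recasting a problem into MapReduce looks like, here is the classic word count written against the RDD API. This is a generic sketch, not one of the course assignments; the object name `WordCountSketch` and the input sentences are invented, and it again assumes Spark is available locally.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the input data is made up.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]") // assumption: run locally on all cores
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq(
      "spark makes big data simple",
      "big data with scala and spark"
    ))

    val counts = lines
      .flatMap(_.split("\\s+"))   // map phase: split each line into words
      .map(word => (word, 1))     // emit a (key, value) pair per word
      .reduceByKey(_ + _)         // reduce phase: sum the values per key

    // collect() is the action that triggers the computation.
    counts.collect().sortBy(_._1).foreach { case (word, n) =>
      println(s"$word: $n")
    }

    spark.stop()
  }
}
```

The chained calls are all transformations except for `collect()`, so the whole pipeline runs only once, at the end.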

Written on May 22, 2020