
Developing Apache Spark applications: Scala vs. Python

Xavier Morera

Imagine the first day of a new Apache Spark project. The project manager looks at the team and says: Is this a problem that we should solve using Scala or Python?

You may wonder if this is a trick question. What does the enterprise rule book say? Is this like asking iOS or Android? Is there a right or wrong answer? Is this a case of overthinking?

The answer to the last question is, most likely, yes. But selecting a language is still an important decision. If you’re about to embark on a Spark project of your own and have already made your choice, then these courses on developing Spark applications using Scala and developing Spark applications using Python should be helpful on your respective path.

If you’re still deciding between Scala and Python, the following comparison can be a useful resource. 


First, let’s talk about Spark 

Spark is one of the most active open source Big Data projects with many contributors. This is surprising given that it’s a younger project. Compared with Hadoop, which was the original and de facto parallel processing framework, Spark is evolving at a faster pace. But popularity by itself is not a feature (or the main measure) of a project’s success and usefulness.

That’s why we also look at Spark’s adoption. Spark is one of the fastest Big Data platforms currently available. In some benchmarks, it has proved itself 10 to 100 times faster than MapReduce and, as it matures, performance is improving. Because of this, Spark has been adopted by many companies, from startups to large enterprises.

Another large driver of adoption is ease of use. Applications that take many lines to write with other frameworks can now be succinctly created using one of the available Spark APIs, which exist for Scala, Python, Java and R. Yes, Java and R, but let’s focus.


Scala and Python side-by-side

Squint at the code below. Can you tell which is Scala and which is Python? We’re looking at the hello world of Big Data—the word count example—and they look pretty much the same. But there are a few important differences, so let’s explore them. 

demos = sc.textFile("/user/cloudera/spark-course/")

(demos.
 flatMap(lambda line: line.split()).
 map(lambda word: (word, 1)).
 reduceByKey(lambda a, b: a + b))

val demos = sc.textFile("/user/cloudera/sparkcourse/")

demos.
flatMap(line => line.split(" ")).
map(word => (word, 1)).
reduceByKey(_ + _)
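
If the three steps in the word count are new to you, they can be mimicked on an ordinary Python list without Spark. This is only a rough local sketch: the sample lines are made up for illustration, and real Spark spreads each step across a cluster instead of running it in a single loop.

```python
from collections import defaultdict

# Made-up sample input standing in for the lines of a text file.
lines = ["to be or not to be", "that is the question"]

# flatMap: split every line into words and flatten the result into one list.
words = [word for line in lines for word in line.split()]

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of all pairs that share the same word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = counts[word] + n

word_counts = dict(counts)
print(word_counts)
```

The result is a dictionary with one entry per distinct word, which is the same shape of answer the Spark versions produce as (word, count) pairs.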

Compiled vs. interpreted

One of the first differences: Python is an interpreted language while Scala is a compiled language. Well, yes and no; it’s not quite that black and white. Strictly speaking, being interpreted or compiled is not a property of the language. It’s a property of the implementation you’re using.

In CPython, the reference implementation, you execute .py files. These contain source code that gets compiled into bytecode and then executed. So, there is a compilation step, but you can always open the source with vi or any other text editor, make a change and execute again with that change. That’s why Python is seen as an interpreted language, and why it can be fairly convenient when coding.
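
You can watch that hidden compilation step happen from Python itself. The sketch below uses the built-in compile and exec functions and the standard dis module; the source string and the "<example>" file name are made up for illustration.

```python
import dis

# A tiny piece of Python source code, held in a string.
source = "answer = 6 * 7"

# Step 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")

# Step 2: execute the compiled bytecode in a fresh namespace.
namespace = {}
exec(code, namespace)
print(namespace["answer"])

# Optional: print a human-readable listing of the bytecode instructions.
dis.dis(code)
```

The exact bytecode listing varies between Python versions, which is one more reminder that this is a detail of the implementation and not of the language.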

On the other hand, with Scala you need to compile your code, which creates a file containing bytecode that is executed in the Java Virtual Machine (JVM). So, when you need to make a small change, you can’t just open the source code with a text editor, make a change and re-execute. You need to compile using scalac (for example) and then execute. This shouldn’t get in your way, but it’s worth mentioning that there is more you’ll need to learn with Scala. The upside of running on top of the JVM is that you can leverage existing Java libraries, which greatly increases the available functionality.

One similarity: both Python and Scala have a read-evaluate-print loop (REPL), an interactive top-level shell that lets you work by issuing commands or statements one at a time and getting immediate feedback. Best of all, you can use both with the Spark API. When using Python it’s PySpark, and with Scala it’s Spark Shell.



Pros and cons

Performance
Spark has two APIs: the low-level one, which uses resilient distributed datasets (RDDs), and the high-level one, where you will find DataFrames and Datasets. Strictly speaking, there are only Datasets, with DataFrames being a special case (in Scala, a DataFrame is a Dataset of Row objects), although there are a few differences between them when it comes to performance.

With RDDs, performance is better with Scala. Why? Because Spark itself runs in the JVM, so with Python there is the additional overhead of passing data between the Python process and the JVM. You shouldn’t have performance problems in Python, but there is a difference.

When using the higher-level API, the performance difference is less noticeable. Spark works very efficiently with both Python and Scala, especially with the large performance improvements included in Spark 2.3. (You can read about this in more detail on the release page under PySpark Performance Improvements.)

Don’t forget that we’re talking about a massively parallel distributed framework, one that is already pretty efficient. So, it’s highly likely you’ll be impressed when you process your data.

Type-safety
Now this is black and white. Scala is statically-typed and Python is not. What does this mean? Let’s run a quick and simple test. Below, a variable that holds a string is then assigned an integer. This is done in the REPL to make it easier to test.

Python

>>> answer = "Forty two"
>>> answer
'Forty two'
>>> type(answer)
<type 'str'>

>>> answer = 42
>>> answer
42
>>> type(answer)
<type 'int'>

Scala

scala> var answer = "Forty two"
answer: String = Forty two

scala> answer = 42
<console>:12: error: type mismatch;
 found   : Int(42)
 required: String
       answer = 42
                       ^

In Scala, you can’t change the type of a variable—that’s what being statically-typed means. 

You may be wondering whether you should select a statically-typed language like Scala or a dynamically-typed language like Python. You’re not alone. There’s been an ongoing debate about which approach is better for developer productivity. And both sides have valid points.

With a dynamically-typed language, you don’t specify the type of the variable you are declaring. This makes your code less verbose. In Python, you can simply write your variable name and assign a value; the type comes from the value. Need to assign a value of a different type? Just do it and your variable is now of a different type. A variable is simply a name bound to a value. It follows the principle of duck typing, where type checking is deferred to runtime. The duck test basically means that if it looks like a duck and quacks like a duck, then it probably is a duck.
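
Here is a minimal sketch of duck typing in Python. The class and function names (Duck, RobotDuck, make_it_quack) are hypothetical and exist only for this illustration. The function never checks what type it receives; it just calls quack() and finds out at runtime whether that works.

```python
class Duck:
    def quack(self):
        return "Quack!"


class RobotDuck:
    # Not related to Duck in any way, but it has the same method.
    def quack(self):
        return "Beep-quack!"


def make_it_quack(thing):
    # No type check here: anything with a quack() method is accepted.
    return thing.quack()


sounds = [make_it_quack(Duck()), make_it_quack(RobotDuck())]
print(sounds)

# An integer has no quack() method, and we only find out when the call runs.
try:
    make_it_quack(42)
    failure = None
except AttributeError as err:
    failure = str(err)

print(failure)
```

Both objects are accepted because they behave like ducks. The integer is rejected too, but only at the moment the code executes.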

On the other hand we have Scala, which has static types. It’s worth noting that Scala can also infer types. For example, look at the expressions below:

scala> val answer = 42
answer: Int = 42

scala> val answer: Int = 42
answer: Int = 42

In one case, you didn’t specify the type, but the compiler easily determined it. In the other, you explicitly stated the type. In both cases, there is no doubt that answer is an Int. This can help avoid bugs in complex applications, mainly because they’re caught at an earlier phase during the development process.  

Suppose you have a function that returns a value used as the key for a Map. Somewhere along the way, for just one edge case, it returns a list of values instead of a single value. In a dynamically-typed language, you wouldn’t know until a server blows up somewhere in a data center. With Scala, the compiler will tell you.
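
The scenario above can be sketched in plain Python. Everything here is hypothetical: the region_key function, the shape of the records and the values in them are invented to illustrate the point. The bug sits quietly in the code until a record hits the edge case at runtime.

```python
def region_key(record):
    """Return the key used to group a record (hypothetical helper)."""
    regions = record["regions"]
    if len(regions) == 1:
        return regions[0]   # normal case: a single string
    return regions          # edge case bug: returns the whole list


# Invented sample data; only the last record triggers the edge case.
records = [
    {"regions": ["emea"], "amount": 10},
    {"regions": ["apac"], "amount": 5},
    {"regions": ["emea", "apac"], "amount": 7},
]

totals = {}
errors = []
for record in records:
    key = region_key(record)
    try:
        totals[key] = totals.get(key, 0) + record["amount"]
    except TypeError as err:
        # A list cannot be a dictionary key, so this fails only at runtime.
        errors.append(str(err))

print(totals)
print(errors)
```

The first two records are processed without complaint, and nothing warns you about the third until it actually arrives. In Scala, a function declared to return a String could not return a List in one branch, so the mistake would be reported when compiling.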


Learning curve

If you’re just getting started with either one of the languages, it’s generally easier to learn Python. Python’s learning curve is gradual, but once you are up to speed, there are very advanced things you can do using the same friendly syntax you started with. It’s recommended as one of the first languages to learn for several reasons:

  • It’s very readable and extremely popular
  • You can get started by experimenting and learning about types without having to memorize a lot of information
  • There are many libraries that you can import and use right away
  • It’s interpreted, so making a change with a text editor and executing takes just a few seconds
  • In many cases, your code kind of looks like everyday English which makes it even easier to understand

Scala’s learning curve is steeper. Scala wasn’t designed to be easy; it was designed to be scalable (that’s how Scala got its name: SCAlable LAnguage), and it may take more time to become proficient. That said, Scala has some advantages:

  • Scala’s code is concise and functional, so you can perform the same tasks as in Java but with fewer lines of code
  • Fewer lines of code allow for faster development, testing and deployment
  • Given that it is built on top of the JVM, it provides interoperability with Java, which greatly broadens the libraries at your disposal
  • It provides the best of both worlds: object-oriented programming, but with functional programming at your disposal

Scala and Python have different advantages for different projects. Yes, Python is easy to use. But Scala is fast. Making the right decision requires evaluating the requirements and unique aspects of the project. So, next time you’re faced with making a choice between the two, remember there’s no wrong answer. And move confidently in the direction of where your skills and preferences best match the project needs. 


Xavier Morera

Xavier is an entrepreneur, project manager, architect, trainer and developer who applies his experience, passion and desire for results...