Do We Need a New Programming Language for Big Data?

I'm on the boards of two companies (Pentaho, Revolution Analytics) that are starting to see a lot of customer traction around Big Data. More and more companies in media, pharma, retail and finance are doing advanced analysis, reporting, graphing, etc. with massive data sets. It made me wonder what other areas of the technology stack might evolve with the trend towards Big Data. Obviously, there are new middleware layers like Hadoop and MapReduce, and we're also seeing the emergence of NoSQL data management layers with Cassandra, MongoDB, Membase and others. But what about programming languages?

OpenGamma CEO and resident genius Kirk Wylie wrote a post recently about why he wants a new programming language.

So why don't I have this language yet? Well, partially because programming language craftsmanship is hard. I'm pretty sure I'm not good enough to do it, which is usually my default criteria for saying something is Really Hard.

But I think as well the k3wl languages coming out are coming out of language requirements of the Top 10% crowd. They're the ones good enough to actually write the languages, and they're going to write a language that makes them happy. But then you end up with Scala, and then you end up with this monstrosity, and then you make me cry. A language in which that thing is even possible will never be a candidate as a Journeyman Programming Language.

You know who's going to do it? Someone like Gosling, who set about with the needs of the journeyman programmer in Java. But the state of the art has moved on, and Java just isn't suitable anymore.

Who I would really like to do it is Anders Hejlsberg. I am a very big fan of C#-the-Language. It's just that .Net-the-Ecosystem is so Microsoft-specific and horrific it'll never catch on in the wider world, no matter what Miguel de Icaza thinks.

This got me thinking about the challenge of the current complexity in Big Data systems. Today, you have to be near genius level to build systems on top of Cassandra, Hadoop and the like. These are powerful tools, but very low-level, equivalent to programming client-server applications in assembly language. When it works it's great, but the effort is significant and it's probably beyond the scope of mainstream IT organizations. (That's one reason that Revolution's R product has appeal, but R is a specialized statistical analysis tool, not a general purpose language.)
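To make the "low-level" point concrete, here is a minimal sketch in plain Python of a MapReduce-style word count. It is not actual Hadoop code: the function names (map_phase, shuffle, reduce_phase) are made up for illustration, and everything runs in a single process. It only shows how many steps the programmer has to spell out by hand, next to the one-liner that a higher-level abstraction offers for the same result.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a real framework does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_mapreduce(lines):
    """Word count written as explicit map, shuffle and reduce steps."""
    return reduce_phase(shuffle(map_phase(lines)))


def word_count_high_level(lines):
    """The same result using a higher-level building block."""
    return dict(Counter(word for line in lines for word in line.lower().split()))


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count_mapreduce(sample))
    print(word_count_high_level(sample))
```

A real cluster job adds partitioning, serialization, fault tolerance and job configuration on top of this, which is where most of the effort goes.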

Could the Big Data complexity be factored out somehow with a new general purpose programming language? No doubt. Having worked with Anders on the creation of Delphi many years back, I know this is right up his alley. Or maybe we already have a good starting point with Erlang, Scala and Google's Go. Go is particularly interesting, having been designed by Rob Pike and Ken Thompson of Bell Labs / Unix fame.

What's been your experience in programming Big Data systems?  What do you think's needed?  Let me know in the comments below.

Zack Urlocker is an investor, advisor and board member to several startup software companies in SaaS and Open Source. He was previously the EVP of Products at MySQL responsible for Engineering and Marketing. He built the MySQL Enterprise subscription strategy and product line. MySQL was sold to Sun for $1 billion and is now part of Oracle Corporation. He is also a marathon runner, blues guitarist and fan of Interactive Fiction.

Comments

Maybe it isn't exactly what you had in mind, but you might want to check out Apache Pig: http://hadoop.apache.org/pig/

Interesting point about R being a domain-specific language, and whether that stands in opposition to a general purpose Big Data language or is already a solution. I think R is great, but it is very specific to statistical analysis, not general purpose computing. So what I'm asking is whether we need a general purpose language (as opposed to libraries) that makes it easy to do other sorts of Big Data processing.

I think there are pros and cons to this issue, and I'm happy to get some level of discussion around it.

Could this new language be used also for line-of-business applications?
COBOL is really old, and I think Java, VB.NET and C# are too generic. I am looking for a new language specialized for business, database-oriented applications.
What about creating a community to collect ideas and requested features?
(It could be also an open source response to Microsoft LightSwitch).

This article and the comments very much highlight to me the differences between a computer science treatment of data and a mathematical/statistical perspective. The problems caused by these differences are one reason that there has been a call for a new approach: data-craft/data-science.

From a statistical viewpoint, the R-project treats data properly, allowing one to understand the underlying distribution [PDF/pdf CDF/cdf], to have objects and functions to programmatically create inferences & predictions from that data, and visualization tools to tell the story gleaned from this understanding & these analyses.

The R-project also provides numerous interfaces/libraries to various data sources and other languages, as well as the thousands of community-contributed statistical packages on CRAN, BioConductor, RForge and Omegahat.

Do we need more? Something like Sun's [now Oracle's] FORTRESS, perhaps?

http://labs.oracle.com/projects/plrg/Fortress/overview.html

Or perhaps we need to realize that "Big Data" must be treated as a very large, complex system to handle volumetric flows & data streams in real-time [SQLstream or Esper, anyone?] and that no single language will do it.

The comments to this entry are closed.