Do We Need a New Programming Language for Big Data?
September 13, 2010
I'm on the boards of two companies (Pentaho, Revolution Analytics) that are starting to see a lot of customer traction around Big Data. More and more companies in media, pharma, retail and finance are doing advanced analysis, reporting, graphing, etc. with massive data sets. It made me wonder what other areas of the technology stack might evolve with the trend towards Big Data. Obviously, there are new middleware layers like Hadoop and Map Reduce, and we're also seeing the emergence of NoSQL data management layers with Cassandra, MongoDB, MemBase and others. But what about programming languages?
OpenGamma CEO and resident genius Kirk Wylie wrote a post recently about why he wants a new programming language.
So why don't I have this language yet? Well, partially because programming language craftsmanship is hard. I'm pretty sure I'm not good enough to do it, which is usually my default criteria for saying something is Really Hard.
But I think as well the k3wl languages coming out are coming out of language requirements of the Top 10% crowd. They're the ones good enough to actually write the languages, and they're going to write a language that makes them happy. But then you end up with Scala, and then you end up with this monstrosity, and then you make me cry. A language in which that thing is even possible will never be a candidate as a Journeyman Programming Language.
You know who's going to do it? Someone like Gosling, who set about with the needs of the journeyman programmer in Java. But the state of the art has moved on, and Java just isn't suitable anymore.
Who I would really like to do it is Anders Hejlsberg. I am a very big fan of C#-the-Language. It's just that .Net-the-Ecosystem is so Microsoft-specific and horrific it'll never catch on in the wider world, no matter what Miguel de Icaza thinks.
This got me thinking about the challenge of the current complexity in Big Data systems. Today, you have to be near genius level to build systems on top of Cassandra, Hadoop and the like. These are powerful tools, but very low-level, equivalent to programming client-server applications in assembly language. When it works it's great, but the effort is significant and it's probably beyond the scope of mainstream IT organizations. (That's one reason that Revolution's R product has appeal, but R is a specialized statistical analysis tool, not a general purpose language.)
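To give a feel for what "low-level" means here, below is a minimal single-process sketch of the map, shuffle and reduce phases in plain Python. It is not Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. Even a simple word count has to be phrased as three explicit phases.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the three phases together (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


docs = ["big data is big", "data about data"]
counts = word_count(docs)
print(counts)
```

In a real cluster each phase would also have to deal with partitioning, node failures and data movement, which is where most of the effort goes.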
Could the Big Data complexity be factored out somehow with a new general purpose programming language? No doubt. Having worked with Anders on the creation of Delphi many years back, I know this is right up his alley. Or maybe we already have a good starting point with Erlang, Scala and Google's Go. Go is particularly interesting, having been designed by Rob Pike and Ken Thompson of Bell Labs / Unix fame.
What's been your experience in programming Big Data systems? What do you think's needed? Let me know in the comments below.
Zack Urlocker is an investor, advisor and board member to several startup software companies in SaaS and Open Source. He was previously the EVP of Products at MySQL responsible for Engineering and Marketing. He built the MySQL Enterprise subscription strategy and product line. MySQL was sold to Sun for $1 billion and is now part of Oracle Corporation. He is also a marathon runner, blues guitarist and fan of Interactive Fiction.
- Kirk Wylie: Blog, OpenGamma
- Wikipedia: Erlang, Scala, Go, Hadoop, Cassandra, MongoDB
- Web sites: Erlang, Scala, Go, Hadoop, Cassandra, MongoDB
- Companies: Cloudera, CouchOne, 10Gen, NorthScale, Riptano
I agree with you on the fact that we definitely need a new language for bigdata systems.
I am not an expert on the matter, but it looks obvious to me that this bigdata language would need to have monads ( http://en.wikipedia.org/wiki/Monad_%28functional_programming%29 ) as first-class citizens.
These monads would then be responsible for:
- distribution
- delayed execution
- partitioning
- storage
- filtering
- priorities
- nil handling
- exception handling
- caching
- eventual consistency
- ....
In short, you would be able to modify the execution behaviour without modifying the actual algorithm implementation.
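A rough sketch of that idea in Python, assuming two made-up wrappers: `Maybe` for nil handling and `Lazy` for delayed execution. The `then` method is closer to a plain `map` than to a full monadic bind, so treat this as monad-inspired rather than a faithful monad implementation. The point is only that `pipeline` is written once and behaves differently depending on the wrapper it is given.

```python
calls = []  # records which steps actually ran


def parse(text):
    calls.append("parse")
    return int(text) if text.strip().isdigit() else None


def double(n):
    calls.append("double")
    return n * 2


class Maybe:
    """Nil handling: once the value is None, later steps are skipped."""

    def __init__(self, value):
        self.value = value

    def then(self, step):
        return self if self.value is None else Maybe(step(self.value))

    def result(self):
        return self.value


class Lazy:
    """Delayed execution: steps are recorded and run only in result()."""

    def __init__(self, value, steps=()):
        self.value = value
        self.steps = tuple(steps)

    def then(self, step):
        return Lazy(self.value, self.steps + (step,))

    def result(self):
        out = self.value
        for step in self.steps:
            out = step(out)
        return out


def pipeline(wrapped):
    """The algorithm itself, unaware of which wrapper it runs under."""
    return wrapped.then(parse).then(double)


print(pipeline(Maybe("21")).result())    # steps run immediately
print(pipeline(Maybe("oops")).result())  # parse gives None, double is skipped
deferred = pipeline(Lazy("21"))          # nothing has run for this one yet
print(deferred.result())                 # steps run here
```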
Currently those bigdata/distributed systems usually work best through some kind of message bus, where the messages are handled by handlers, and the handlers are decorated by handler decorators which alter the behaviour of the handler (these would be the monads).
I think you would have a pretty good language if you would be able to have such a system, but where the message bus is abstracted away from the developer ...
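To illustrate the handler/decorator arrangement, here is a toy in-process version in Python. The `Bus` class and the `skip_none` and `cached` decorators are invented for this sketch and do not correspond to any particular messaging product; a real bus would be distributed and asynchronous.

```python
import functools


def skip_none(handler):
    """Decorator: ignore messages whose payload is None."""
    @functools.wraps(handler)
    def wrapper(payload):
        if payload is None:
            return None
        return handler(payload)
    return wrapper


def cached(handler):
    """Decorator: remember results so repeated payloads are not recomputed."""
    results = {}

    @functools.wraps(handler)
    def wrapper(payload):
        if payload not in results:
            results[payload] = handler(payload)
        return results[payload]
    return wrapper


class Bus:
    """A toy message bus: topic name -> list of handlers."""

    def __init__(self):
        self.handlers = {}

    def subscribe(self, topic, handler):
        self.handlers.setdefault(topic, []).append(handler)

    def publish(self, topic, payload):
        return [handler(payload) for handler in self.handlers.get(topic, [])]


invocations = []  # records how often the undecorated handler body ran


@skip_none
@cached
def square(n):
    invocations.append(n)
    return n * n


bus = Bus()
bus.subscribe("numbers", square)
results = [bus.publish("numbers", p) for p in (3, None, 3)]
print(results)
print(invocations)
```

The body of `square` never mentions caching or nil handling; both behaviours are bolted on from outside, which is the property described above.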
Just my .02€
Tom
http://www.corebvba.be/blog
Posted by: ToJans | September 13, 2010 at 02:32 PM
The lesson of Java's (in my opinion) failure is that perhaps one size doesn't fit all.
In the past, standardized solutions such as graphics abstractions, data source access, data representation, and interface programming evolved after living with and coming to understand the flaws and limitations of earlier approaches.
Too often today I see a rush to standardize on new tools and methodologies before trying to solve problems with what's currently available.
Everyone's looking for the perfect pre-solution. They then act shocked when it doesn't work out as planned.
Posted by: Frank | September 14, 2010 at 07:21 AM
I believe a large part of the problem is that we fail to make a distinction between an "architectural" language - i.e. one that describes how you do things and an "intentional" language - i.e. one that describes what you want. For the architectural language, you need to be able to describe caching, queuing, latency, locality, that sort of thing (Cassandra is a pretty good start). For the intentional language we already have some great candidates, almost inevitably they have to be declarative (or applicative). There's nothing wrong with relational algebra (just get rid of the SQL mess). Functional Query Languages are well understood and have been for decades.
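As a small illustration of the split, here is a sketch in Python where the "intentional" part is a `Query` value that only says which rows and columns are wanted, and the "architectural" part is whichever executor runs it. All names here (`Query`, `run_simple`, `run_partitioned`) are hypothetical, and the partitioning is simulated in one process.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Query:
    """Intent only: which rows and which columns, nothing about execution."""
    where: Callable[[dict], bool]
    columns: tuple


def run_simple(query, rows):
    """One possible 'how': a single pass over all rows."""
    return [{c: r[c] for c in query.columns} for r in rows if query.where(r)]


def run_partitioned(query, rows, partitions=2):
    """Another 'how': split rows into contiguous chunks and process each."""
    rows = list(rows)
    size = max(1, -(-len(rows) // partitions))  # ceiling division
    out = []
    for start in range(0, len(rows), size):
        out.extend(run_simple(query, rows[start:start + size]))
    return out


rows = [
    {"name": "ann", "dept": "ops", "salary": 50},
    {"name": "bob", "dept": "dev", "salary": 70},
    {"name": "cy", "dept": "dev", "salary": 65},
]
devs = Query(where=lambda r: r["dept"] == "dev", columns=("name",))

print(run_simple(devs, rows))
print(run_partitioned(devs, rows))
```

The same `devs` query gives the same answer under both executors, so the execution strategy can change without touching the statement of what is wanted.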
Posted by: Patrick | September 14, 2010 at 08:04 AM
I don't think you need a new language, I think you need a framework on an existing language.
Existing higher level languages already deal with RAM acceptable amounts of data very well, and that's about as far as in-memory needs to reasonably go.
Past that you merely need the ability to acquire subsets from stores (rdbms, nosql, distributed cache systems, etc.), ideally in a general way.
So, build a general purpose big data query/analysis system as described and you should be good to go.
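A minimal sketch of what such a framework's core might look like in Python, assuming a made-up `Store` interface with a single `scan(low, high)` method. `ListStore` and `SqliteStore` are illustrative stand-ins for a cache and an RDBMS; the analysis function only ever sees the subset it asked for, one record at a time.

```python
import sqlite3
from typing import Iterator, Protocol, Tuple


class Store(Protocol):
    """Hypothetical common interface for any backing store."""

    def scan(self, low: int, high: int) -> Iterator[Tuple[int, float]]:
        """Yield (id, value) pairs with low <= id < high."""
        ...


class ListStore:
    """Stand-in for an in-memory or cache-style store."""

    def __init__(self, records):
        self.records = list(records)

    def scan(self, low, high):
        return ((i, v) for i, v in self.records if low <= i < high)


class SqliteStore:
    """Stand-in for an RDBMS, using an in-memory SQLite database."""

    def __init__(self, records):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
        self.db.executemany("INSERT INTO readings VALUES (?, ?)", records)

    def scan(self, low, high):
        return iter(self.db.execute(
            "SELECT id, value FROM readings "
            "WHERE id >= ? AND id < ? ORDER BY id", (low, high)))


def mean_value(store: Store, low: int, high: int) -> float:
    """Stream over the requested subset without loading the whole store."""
    total, count = 0.0, 0
    for _, value in store.scan(low, high):
        total += value
        count += 1
    return total / count if count else 0.0


records = [(i, float(i)) for i in range(10)]
print(mean_value(ListStore(records), 2, 6))
print(mean_value(SqliteStore(records), 2, 6))
```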
Posted by: TerryM | September 14, 2010 at 08:05 AM
I think you are working on the wrong end of it. You don't need a new language yet; what you need is a new paradigm: a new way of thinking about managing and using Big Data. The language, as an expression of the paradigm, can come later. Must come later. Trying to find or fashion the perfect language without first trying to refashion your conception is like putting the cart before the horse. SQL did not just spring into being. What came first was the realization "what if we treated a pile of data as a mathematical set, and then we can use set algebra to operate on it! Now how do we express the algebra of sets without requiring a degree in mathematics to understand the symbols?".
Once you have your new paradigm, you can prototype the means of expressing that paradigm in LISP, and graduate to creating something more formal as things solidify ;)
So what is it that makes Big Data so much more different than the plain old data? Is it really just about size? What new or existing concepts demand language expression that are cumbersome or impossible today? Why do they need expression in the language instead of just figuring out how to improve the technical architecture? Can't we just fix the technical architecture without messing with the language? Not every paradigm demands expression as a programming language after all. Maybe (as Patrick above said) we need a different language for expressing the architecture of the system that is unrelated to the language used to access the data. Of course, by Codd's rules, such a system may no longer be a Relational Database Management System, but that might not necessarily be a bad thing.
Posted by: Trevor | September 14, 2010 at 12:54 PM
but that was just my off-the-cuff take on it...
"
SciDB goes beyond the relational world Stonebraker helped pioneer by swapping rows and columns for mathematical arrays that put fewer restrictions on the data and can work in any number of dimensions. Stonebraker claimed arrays are 100 times or so faster than a RDBMS on this class of problem.
"
- http://www.theregister.co.uk/2010/09/13/michael_stonebraker_interview/
There's your new paradigm.
Posted by: Trevor | September 14, 2010 at 01:35 PM
I don't know all the programming languages out there. I think there may already be a few.
Rather than a new language, updating or adding commands to an old language would be quicker for us to learn and quicker for the company to create and test, than starting from scratch.
A new version would need to use and store larger numbers.
So the program and computers may need to be upgraded from 8bit to 16bit, 32bit, 64bit, 128bit, 256bit ...
But you must remember that not all your users that want to run your software may have a 256bit processor.
Posted by: Mark A. | September 14, 2010 at 08:32 PM
Trevor,
thanks for the comments. Of course, a new programming language has to be designed with the right conceptual framework. Building a new language before we know what the model is is unlikely to produce anything new. But it may also be the case that it helps us formulate the concepts.
--Zack
Posted by: ZUrlocker | September 15, 2010 at 12:28 AM
Does anybody remember COBOL? I scored 100% on my COBOL exam, walked out of the exam, and never used it ever again. =)
Posted by: Michael Fever | September 15, 2010 at 04:21 AM
If the data store and object persistence layer already employs a distributed architecture, and a scalable addressing scheme, then all the current languages should be capable of utilizing distributed, big data and processing it.
The apps around big data need to access a single logical view, node elasticity or auto-provisioning (an ability to add or remove compute nodes, which should be provided by the cloud platform host or some middleware solution), and the dbms should be such that very little or no administration is required to spin up needed resources. With those conditions met (although I'm still looking for the middle tier auto-provisioning layer that allows the app and dbms to figure out how many nodes are needed), "big data" could be accessed and utilized as easily as any other size.
Posted by: Thomas | September 15, 2010 at 06:16 PM
Sure, but the distinction is between being able to utilize distributed data and making it easy to do so. That's where further abstraction may help.
Posted by: ZUrlocker | September 15, 2010 at 09:25 PM
Just don't see the issue here. Python, and I'm sure many other similar systems, are quite capable in this domain. Python is already quite "embeddable" and there are plenty of constructs for dealing with all the algorithms mentioned, such as interfaces to R and C as well as Numpy, Scipy and plenty more. Someone will have to be way clearer than this as to what issues are being "solved".
Posted by: Dartdog | September 15, 2010 at 11:48 PM
We do not need a new language for big data. What we need is a new standard API on top of which current languages can work. Something like Red Hat's Deltacloud. Languages are improved to boost programmers' productivity, but what we have here is a new kind of problem whose complexity can probably be hidden in a (good) library without paying the price of creating a whole new language.
Posted by: Sergio Montoro Ten | September 16, 2010 at 04:05 PM
Some things in the article conflict with each other. Big Data is a specific domain. Yet one of the article's complaints about R is that it is not a *general-purpose* language. How can BOTH the requirement for a DSL and the demand for a general-purpose language appear in the same article?
Posted by: middleware | September 16, 2010 at 09:01 PM