You know what will be popular? Whatever runs reasonably fast and helps you import and clean data quickly from a variety of sources.
Because the analysis is often the quickest part of being a data scientist. Coursera, as I recall, apparently cleans its data for you, and also lets you easily import it.
In real life, data is messy and messed up. Looking at birthdays from some website? Expect a spike at whatever the default is... but that doesn't mean you can eliminate that data completely, because some people presumably really were born on Jan 1st.
Looking at birth years? I recall dealing with them in SAS... if they're four digits, remember to check for births in both the current and the previous century.
And hey... do you have two or more elements of data per individual? 2% to 5% of records will probably be missing some element, and some will have wrong data: a zip code off by one, an address not in the city you are trying to geocode for, whatever. If you are lucky, it will be obvious stuff like that.
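To make that concrete, here's a minimal Python sketch of those sanity checks - the records, field names, and thresholds are all invented for illustration:

```python
from collections import Counter

# Hypothetical records scraped from a signup form; all values are made up.
records = [
    {"name": "Ann",  "birthday": "1970-01-01", "zip": "02139"},
    {"name": "Bob",  "birthday": "1970-01-01", "zip": None},
    {"name": "Cara", "birthday": "1984-06-12", "zip": "02138"},
    {"name": "Dee",  "birthday": "1970-01-01", "zip": "0213"},   # zip off by one digit
]

# 1. Spot a suspicious spike: a value shared by an implausible share of records
#    is probably a form default, but some occurrences may still be genuine.
counts = Counter(r["birthday"] for r in records)
default_suspects = {v for v, n in counts.items() if n / len(records) > 0.5}

# 2. Count records missing at least one element.
incomplete = [r for r in records if any(v is None for v in r.values())]

# 3. Cheap sanity check on zips (US-style, 5 digits).
bad_zip = [r for r in records if r["zip"] is not None and len(r["zip"]) != 5]

print(default_suspects)   # {'1970-01-01'}
print(len(incomplete))    # 1
print(len(bad_zip))       # 1
```

Note that the spike check only flags the value for review - as above, you can't just delete every Jan 1st record.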
The life of the data scientist is mostly cleaning, formatting, and transferring data, with the occasional sweet analysis. Of course your analysis will probably give you nothing useful, because despite several thousand usable records, it's not clear that any element has a significant effect on the dependent variable you are looking at. If you are smart, maybe you can finagle an analysis based on a non-parametric test or logistic regression.
Oh, and often the speed of your analysis is inversely correlated with how easy it is to code and enter your data. There is a reason people use SAS, and it's not because of its amazing IDE.
I wouldn't be surprised if Octave (the open-source version of Matlab) becomes very popular, because a lot of Coursera classes use it for homework assignments.
I thought Octave was an ugly little language at first; now I really like it - a great tool for doing linear algebra, data visualization, machine learning, neural networks, etc.
Many use Matlab because of its amazing IDE, its great collection of toolboxes, and its remarkable speed, not so much because of its language features. Last I checked, Octave was still missing all of that. If you can afford it, Matlab is usually well worth paying for. If not, the other alternatives (Python, R) are much better in my opinion.
When I used Matlab daily, I never used the IDE. The command line and an editor were good enough.
Matlab is awesome above all else because the design is coherent. Both the syntax and the standard libraries.
It is extremely quick to whip up anything, then turn that into a script, and then into software with functions (since functions can return multiple variables and have essentially zero overhead - no includes or requires, you just call them). Type conversions are practically never a problem, since they are sane and automatic: none of this 1+1.5 giving a syntax error. Real booleans. Data input and output libraries simply work the way you would expect: A=imread('/home/gravityloss/abc.png') creates a width x height x 3 matrix with all the RGB values. No requires, includes, plugins, or hunting down and compiling libraries.
You don't need libraries to do a huge amount of stuff, but if you need them for something experimental, they work extremely easily.
You also rarely need things like loops, since mass operations on data are native. If you as a newbie create a custom function for a scalar, there's a good chance it will work for vectors or n-dimensional matrices automatically. This immensely reduces the amount of error-prone housekeeping code for indices and lengths. It's also much, much faster than looping in another scripting language.
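For readers who haven't seen this style: the same idea exists in NumPy, where elementwise operators broadcast, so a "scalar" function often works on arrays unchanged. A minimal sketch (the function here is invented for illustration):

```python
import numpy as np

# A function written with ordinary scalar arithmetic...
def rescale(x, lo=0.0, hi=10.0):
    return (x - lo) / (hi - lo)

# ...works unchanged on scalars, vectors, and n-d arrays, because the
# arithmetic operators apply elementwise - no index housekeeping needed.
print(rescale(5.0))                           # 0.5
print(rescale(np.array([0.0, 5.0, 10.0])))    # [0.  0.5 1. ]
print(rescale(np.arange(6.0).reshape(2, 3)))  # a 2x3 array, each entry rescaled
```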
As a result, the code is often very readable as well.
There's a help system that actually returns something sensible when you type help. You can type help help, or help command, or search for this or that, and the help texts are actually very thoughtful and helpful too - not at all like Linux man pages... I could go on for hours about features that don't really exist anywhere else, even though everything has been in plain sight in Matlab for decades.
Julia's an awesome thing though, I hope it gets more traction...
Virtually none of that is in fact unique to MATLAB or even a strength in the first place.
> Matlab is awesome above all else because the design is
> coherent. Both the syntax and the standard libraries.
It certainly is coherent, and also consistent, but only in the weakest and least interesting sense. Namely, everything's about equally messy. Namespaces are non-existent in the standard library and clumsily realised otherwise. OOP remains rudimentary and feels as tacked-on as it is. The one-function-per-file system ruins everyone's day. No default arguments (and checking nargin loses its appeal rather fast), and shitty inlined pseudo-lambdas.
> It is extremely quick to whip up anything and then turn
> that into a script and then into a software with
> functions
No different from R or Python, and most of the time a genuine weakness; it's a key reason for scientific/engineering code being as ad hoc and convoluted as it is.
> If you as a newbie create a custom function for a
> scalar, there's good chance it will work for vectors or
> n-matrices automatically.
Vectorising functions in Python (that is, NumPy) is about as straightforward.
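To show what "about as straightforward" means: even a scalar-only function with branching can be lifted to arrays with np.vectorize. A small sketch (the function is made up; np.sign would be the real fast path here):

```python
import numpy as np

# A scalar-only function whose branch would fail on whole arrays...
def clip_sign(x):
    if x > 0:
        return 1
    elif x < 0:
        return -1
    return 0

# ...lifted to arrays with np.vectorize. Convenient, though note it is a
# Python-level loop under the hood, not a compiled one.
vclip = np.vectorize(clip_sign)
print(vclip(np.array([-2.0, 0.0, 3.5])))   # [-1  0  1]
```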
> It's also much much faster than some looping in another
> scripting language.
Nah, MATLAB is now fairly good at unrolling and optimising such loops. Don't worry too much about vectorising every single bit of your algorithm.
> There's help which actually returns something sensible
> when you type help
That's also true of both Python and R. Long story short: most of your perceived advantages aren't unique, and that comes on top of MATLAB's exorbitant pricing schemes and extremely dubious language design. Trust me, if you think MATLAB is a particularly well-designed language for anything other than linear algebra, you owe it to yourself to check out the alternatives.
True, Matlab has its limitations, but those are partly unavoidable. If you want to build a large object oriented program, you often use something more heavyweight anyway.
But that heavyweight language (or framework) is usually not so quick to build something in anymore, because your heavyweight structures are just in the way in the earlier phase.
I tried Python and NumPy, and the vectors, matrices, and all that felt tacked on, and the syntax was much more complex compared to Matlab. Maybe it's changed since. Also, in Scilab the type conversions and function overhead are a nuisance: every time you edit a script or function, you have to explicitly reload it before running it. That makes rapid prototyping about three times as time-consuming. Would it be hard to make the software notice that I actually edited something?
Many people actually want to solve problems, and they just end up creating a program as a side product. They do not set out to study libraries and do not want to actually write any code that is not directly related to the problem they are solving.
That's why Matlab is able to charge the price it does. It sometimes saves time. Some of its users are not primarily software developers, but they are quite educated and intelligent, and their salaries are not small.
So you're implying anyone who is a software developer is not educated or intelligent, nor has a large salary. I think a lot of people would beg to differ.
In reality, Matlab only exists because of inertia. It's the same reason Microsoft Windows is still around. There's no substance behind it.
He wasn't implying that. You've inverted his statement. Many economists and engineers I know care less about how they code and care more about getting a solution to the model at hand--this seems to be the poster's point. The implication is exactly what he exposited, whereas the logical inverse is what you've mistakenly deduced as the implication.
Perhaps it comes off that way to you but I'd be willing to bet it doesn't to the great majority of readers, because it simply isn't saying what you say it's saying.
OK, but he worded it wrong; it gives an impression of snarkiness. Maybe he's saying that Matlab users can't program well but are still intelligent/well paid (but that doesn't really make sense, since NumPy is equally easy and an intelligent/educated person wouldn't find programming hard). Anyway, maybe I misread it and you're right.
edit: great, I was getting points before, and then you come along with your italics.
It depends on what you're trying to do. There's no question that Python is a cleaner language with generally better support for general-purpose programming, but Matlab is really good for numerics, prototyping and exploratory data analysis.
It's also worth saying that Matlab graphics are really convenient and powerful. The last time I used matplotlib, I was very disappointed by the missing plot types and general finickiness.
One other example is the parfor (Matlab's support for multithreaded computation in embarrassingly parallel computations like map/reduce and cross-validation). Classic Matlab -- simple, not totally orthogonal, gets you 80% of the way there without fuss. (I don't know whether to hate this tendency or love it.)
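For comparison, the same embarrassingly-parallel map pattern sketched with Python's stdlib - ThreadPoolExecutor here for simplicity; ProcessPoolExecutor is the closer parfor analogue for CPU-bound work. The per-fold function is a made-up stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-fold work: evaluate one cross-validation fold.
def evaluate_fold(fold):
    return fold * fold   # stand-in for a real fit/score step

folds = range(8)

# The parallel map: each fold is independent, so execution order doesn't
# matter, and results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate_fold, folds))

print(scores)   # [0, 1, 4, 9, 16, 25, 36, 49]
```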
There's a reason it's so widely used, and it's not because people have not shopped around.
IDE: QtOctave
Toolboxes: Exceed those of Matlab (http://octave.sourceforge.net/packages.php)
Speed: On par with or better than Matlab generally, unless you compile your Matlab with MEX (but then, Octave interfaces with C and Fortran for even more speedup).
Python is generally better than Matlab/Octave, and R I view as roughly on par.
*Source: 7 year veteran, octave and matlab user, python lover
The claim that Octave is as fast or faster than Matlab is simply absurd and untrue. Octave is consistently the slowest language on all of our benchmarks, often by orders of magnitude [http://julialang.org/], even on heavily vectorized code.
In the real world, I mainly see people using R or SAS, especially in the web or e-commerce domains. I hardly ever see people using Matlab. Moreover, a lot of research papers publish R code.
If Octave's only claim to fame is being the poorer cousin of Matlab, I wonder why universities still use Octave to teach anything. I would much rather they use R, which pretty much ensures that students have an open-source path to keep using it in the future.
The thing that most impressed me about R when I was dabbling with it was the quality of the graphing tools. It was very easy to create all kinds of very polished looking and expressive graphs. It was a lot more work to get comparable results in Octave and NumPy.
Octave and Matlab are pretty easy for people with no programming whatsoever to pick up. Engineering and economics thrive in the Matlab world, especially in academia. It's a fairly good setup for them.
>Engineering and economics thrive in the Matlab world, especially in academia. It's a fairly good setup for them.
This is precisely what I'm concerned about. R is open source and used in production in various high profile places around the world. Matlab has the same (or lesser?) expressive power and is very expensive.
Why is Matlab (or its derivative Octave - often sold as "if you don't have Matlab...") used in academia at all?
I'm doing Coursera's Stat One, and R is pretty easy. It reminds me of PHP. Syntax-wise, I don't know why - it's just a feeling. This article made it a bit clearer: OOP was an afterthought...
I'm getting more and more into R now. Hopefully one day Python.
So I guess what I'm saying is that it's R's rise that wouldn't surprise me, and that I have to respectfully disagree with your Octave statement.
Mplus is quite popular in social sciences. From what I understand its main functionality is fitting Latent Variable Models and structural equation modelling. I've never used it myself, but it can in fact do things for which it is hard to find R packages at this point.
I think the article's tagline would be better as "Domain-Specific Languages for Data Analysis". Fortunately, the article does mention Python, which is critical, because after reading this, new people might not recognize just how prevalent Python is for solving data analysis problems. The great work of the SciPy community has enabled Python to be used for all of the things that Matlab, R, and Julia can do. In addition, Python can integrate easily with these languages, so if you are a data analyst you need to learn Python.
> The great work of the SciPy community has enabled Python to be used for all of the things that Matlab, R, and Julia can do.
As much as I hate R and love Python, this is not entirely true (unless you count rpy2 as part of "Python"). R has many more statistical models and better plotting capability compared with Python. It also has a lot of domain-specific packages (for example, Bioconductor) that are not available in Python.
Though Python doesn't have the library support that R has, it far exceeds what's available in Julia (and, depending on what you are looking for, in Matlab as well).
And it depends on what you mean by "library". For statistics packages, yes, python is behind. But for general purpose computing, connecting to databases, and sheer number of packages available, python is way ahead (currently 4097 in CRAN vs. 24775 in PyPI).
Yeah, agreed. I'm a big, big fan of data analysis in Python, but there isn't a full-featured time series analysis library (statsmodels is almost there). R has at least two that I know of (and needed to hook into from Python using rpy2).
Personally I'm in love with R's data.frame. It allows very concise, robust and elegant manipulation and subsetting of a data set.
I wish every language had such a built-in object type; I definitely feel its absence when I manipulate data in other languages such as JavaScript or Mathematica.
> Personally I'm in love with R's data.frame. It allows very concise, robust and elegant manipulation and subsetting of a data set.
The performance is terrible though. For data of more than ~10,000 observations SQL is much better performance wise, is more robust, and is as good at subsetting. Although it's maybe not as elegant for everyone's definition of elegant.
What dataframe operations do you find to be slow? Usually I'm able to get huge performance wins by rewriting my slow R code in a loop free way (*apply and friends).
I wonder if there is room for some smaller languages optimized specifically for data analysis. In particular, I wonder how a carefully designed non-Turing-complete language would fare.
That would be a really cool project to work on: design a minimal language for expressing most types of data analysis at a higher level. If the language is sufficiently small and simple, I could see some very powerful tooling being possible for it.
Perhaps it might make sense to go even more specific: have a small language designed not just for data analysis but for analysis in a very specific vertical (say finance or bioinformatics). It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.
It seems like a good idea, but I wonder how actually useful highly specialized programming languages would be. Why?
1) Most data analysis tasks boil down to roughly the same things: accessing the data source --> data cleaning --> simple transformations --> (optional) stats/fitting/ML/specialized procedures --> pretty pictures and reporting.
2) Not everyone wants programming to be the main component of their job.
People who can take advantage of the flexibility that programming offers can usually take advantage of existing technologies. People who don't enjoy coding will always look for off-the-shelf solutions with pretty GUIs and magic buttons that solve all their problems. I just don't think there is a huge market in between to be filled... in the domains I've been exposed to, anyway.
I disagree with your supposition. I think highly specialized languages exist and are highly useful to non-programming communities. I think there is plenty of proof of their usefulness and room for growth.
For instance, consider Illustrator-type products, or d3? Both are specialized ("deep") tools for creating pictures that I've used extensively in the "pretty pictures and reporting" stage you outlined.
Also of serious note are BUGS[1], JAGS[2] and (recently) Stan[3] as small semi-declarative languages for MCMC model building, fitting, and checking.
SQL is an obvious example of a component of the "simple transformations" step.
I think you are going in the wrong direction. It's too easy to paint yourself in a corner that way.
> It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.
Yes! I think you want to build this functionality on top of a powerful language to easily handle the dirty ETL work too. This is the reason lots of financial companies use python with scipy, numpy, pandas, etc, on top of it.
I've been working on a guaranteed-terminating language in the vein of APL/J. Primitive recursion is possible, but with the looping all implicit due to array shape, infinite looping is impossible. The issue I'm not sure how to handle is that some algorithms take the form, "repeat the following until convergence: ..." with no obvious way for a machine to prove that convergence will eventually happen.
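For what it's worth, the standard engineering workaround is to bound the loop with an iteration budget, which makes it trivially total while still usually reaching convergence. A minimal Python sketch, using Newton's method for square roots as the example:

```python
# Replace "repeat until convergence" with "repeat until convergence OR a
# fixed iteration budget runs out" - the cap guarantees termination.
def fixed_point_sqrt(a, tol=1e-12, max_iter=100):
    x = a
    for _ in range(max_iter):
        nxt = 0.5 * (x + a / x)     # Newton step for f(x) = x^2 - a
        if abs(nxt - x) < tol:      # converged early
            return nxt
        x = nxt
    return x                        # best effort if the budget ran out

print(fixed_point_sqrt(2.0))   # ~1.4142135623730951
```

A totality checker can accept this because the loop is primitive recursion over the budget, even though it can't prove anything about the convergence test itself.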
Stata is close to your ideal - it is popular with businesses and economists. However, no one can ever resist Turing completeness: without it, you'd have to shell out to another language for that one small thing every project needs (a different thing for every project).
I'm working on this direction in healthcare. I want to make a little specialized, non-Turing complete language for handling evented data streams from patients.
I think there's a lot of power in certain kinds of non-Turing completeness. Email me if you want to talk about it.
Such a thing was already invented a long time ago and is widely celebrated: SQL. It eliminated the need for loops, was designed for ease of use, had a lot of cool features like ACID added, took over the world, and ran every kind of business and website. Sadly, because it was invented so long ago (the 1970s), many people think of it as uncool and fail to realize its absolute awesomeness.
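A minimal sketch of that loop-free style, using Python's built-in sqlite3 just to host the SQL (the table and data are invented for illustration):

```python
import sqlite3

# Declarative, loop-free "analysis": group, aggregate, filter, sort.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("ann", 25.0), ("bob", 5.0), ("bob", 7.5), ("cara", 100.0)],
)

rows = con.execute(
    """SELECT customer, SUM(amount) AS total
       FROM orders
       GROUP BY customer
       HAVING total > 20
       ORDER BY total DESC"""
).fetchall()

print(rows)   # [('cara', 100.0), ('ann', 35.0)]
```

No explicit iteration anywhere - the engine decides how to walk the data.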
When introducing python, the author writes "Despite the obvious advantages of MATLAB, R, and Julia, it’s also always worth considering what a general-purpose language can bring to the table."
Even with thousands of hours of experience in Matlab, R, and Python... I'm not sure what "obvious advantage" Matlab and R share over Python.
Mainly it's the immediacy of Matlab and R, and the libraries. I've used all 3 and consider Python my main and favorite programming language.
But with R you can just type "R", do read.table(), and very quickly slice and dice your data. In Python, just evaluating which package to use, then getting the packages, dealing with versioning issues, etc., kind of breaks the flow. Then you need to figure out which plotting library to use, and so on. Having things built in as a common base that all your coworkers share is important. I know there are common distributions like SciPy, but they are not as common as R is.
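To be fair, the stdlib-only Python version of that read.table() workflow isn't terrible, just wordier - a sketch with an invented inline dataset standing in for a file:

```python
import csv, io

# Inline CSV stands in for a file; in real use you'd pass open("data.csv").
raw = io.StringIO("name,age,city\nann,34,boston\nbob,29,nyc\ncara,41,boston\n")
rows = list(csv.DictReader(raw))

# "Slice and dice": subset rows and project columns, R-style but more verbose.
bostonians = [r["name"] for r in rows if r["city"] == "boston"]
mean_age = sum(int(r["age"]) for r in rows) / len(rows)

print(bostonians)   # ['ann', 'cara']
print(mean_age)     # 34.666...
```

The gap is less the code and more that R gives you data frames, plotting, and stats in the same prompt with nothing to install.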
Probably the bigger issue, as mentioned above, is that R has higher-level stuff like time series libraries that Python doesn't.
The main thing that's needed is a shell to glue all these languages together, to ease integration pain. Everybody wants the "one true language", but that's a pipe dream. Python's close but not quite. Julia is kind of falling prey to this fallacy too. The programming world is becoming more heterogeneous, and the solution is to have tools to make multiple languages work nicely together. Not to pretend that heterogeneity doesn't exist.
You can work really hard to get homogeneity on your one little project. Maybe that's why language wars are so heated. But the second you have to borrow code from another lab, or you acquire a company, or get acquired, you have a heterogeneous mix. Matlab, R, Python, or Julia will never suffice for all tasks. Non-trivial problems will always require a mix of them. You have to pick the solution according to the problem, and Matlab and R definitely are superior to Python for certain problems.
The straightforward answer, and the one that seems to be implied by the contrast with Python as a "general-purpose language", is that Matlab, R, and Julia are specifically designed for data analysis, mathematics, calculation, statistics, etc., while Python is not. But that's not a very concrete answer, of course.
python fanboy here:
"[python is] not as tuned to numerics as MATLAB": if you build numpy with ATLAS there is, in my experience, hardly ever any noticeable speed difference between numpy and MATLAB
"Python [is] a compelling alternative: not as tuned to numerics as MATLAB, or to stats as R, or as fast or elegant as Julia"
The part about python not being as fast as Julia jumped at me. Wes McKinney's benchmarks show that python is faster than Julia for numerics: http://wesmckinney.com/blog/?p=475
EDIT: should not have said "python faster than Julia". They are comparable because the slow bits get done in BLAS anyway.
Cython is actually what is faster than Julia in Wes' comparison, not Python. Cython looks kinda, sorta like Python, but it is actually a static language with C-like types (but quite different syntax for those types), no polymorphism, and, afaict, ill-defined semantics. The best answer I seem to get about Cython's semantics is that Cython's semantics are whatever it does. I'm not alone in this complaint – Travis Oliphant expressed a similar concern at this year's SciPy (in this panel [http://www.youtube.com/watch?v=7i2vhoQY-K4], if I recall correctly), which is part of his motivation to work on Numba [https://github.com/numba/numba].
If you look at the comments on Wes' post, when I used the dot(x,y) function, which ships with Julia and uses a BLAS to compute the inner product just like the fastest "Python" version does, Julia is equally fast. That stands to reason – they're both just calling a BLAS.
Finally, that blog post is months old – since then Julia passed the milestone of being no slower than 2x C++ on its microbenchmarks suite [http://julialang.org/]. That's not a guarantee that all code is that fast, but most things we see can be pretty easily tweaked to get there (counterintuitively for those coming from Matlab, Python or R, usually by devectorizing the code rather than vectorizing it). And of course, there's a lot of room for improving Julia's performance, the compiler is still quite young and there are many optimizations that we haven't implemented. Basically, there's nothing but work standing in the way of reaching C or Fortran's speed across the board.
I just ran Wes's benchmarks (not the BLAS call versions) on my machine with a Julia I built on 10/13 (17c3c13), and the timings have indeed improved. For the details, see this gist: https://gist.github.com/3901139 (including the comment I posted on it).
The highlights are:

  numpy:  (x * y).sum()  =>  41.1 ms
  julia:  inner(x,y)     =>  37.4 ms
  julia:  x*y            =>  19.5 ms
  cython: inner(x,y)     =>  13.8 ms
The numpy and Julia versions are much easier to write and run.
Disclaimers: I've never written or built cython code before just now, and I think Julia is the coolest.
EDIT: whoops, missed the most important one (inner() written in pure Julia). Added it. Any thoughts on why inner() in Julia isn't faster?
Nice. Thanks for running those. The reason inner isn't faster is probably that we do bounds checks on every array access. This is surprisingly inexpensive on modern hardware but it still takes some time. We're working on a couple of things to address this: generating code so that llvm can more easily hoist bounds checks out of loops, and allowing turning bounds checking off entirely for blocks of Julia code.
Do y'all have a performance roadmap (or issue tracker tag?) that lists some optimizations that you foresee as julia matures? If not, I hope you and/or Jeff will do a blog post on this at some point! One key point of interest would be the split (obviously hand-wavy) between:
- llvm hotness that you don't use yet
- well-known techniques from HotSpot, Smalltalk, V8, etc.
- researchy optimizations that julia is particularly suited for?
This would be a nice resource for newcomers with a language bent, and also as a building block for your Google Summer of Code applications ;)
That's a good idea. Jeff is the real compilers genius so I'll have to see if I can convince him to write a blog post along those lines. The bounds check stuff I mentioned above is one of the important planned optimizations. Another important move is making composite types immutable by default, which is surprisingly unstiffling, yet allows a large number of clever optimizations (stack allocation, memory layout optimizations). More playing around with llvm optimization passes could help and is pretty easy to do, but we haven't spent much time on that. Also lots of gc improvements (approximate ref counting, escape analysis).
From a user's point of view, it doesn't matter whether it is Python or Cython. Slow things get implemented carefully and packed away in 3rd-party libraries.
If the performance comes down to BLAS, then language speed benchmarks are moot. Saying one language is faster than another becomes disingenuous.
> If the performance comes down to BLAS, then language speed benchmarks are moot. Saying one language is faster than another becomes disingenuous.
Fully agreed. That's why comparing NumPy calling BLAS to Julia calling BLAS is a silly exercise. Comparing NumPy calling BLAS to a summation loop written in pure Julia goes beyond silly to just unreasonable, but that's what the blog post does.
Testing in my lab at CWRU by Gary Doran has indicated that correctly written Numpy code often outperforms the Matlab equivalent. I don't know about R and Julia but there isn't usually a speed bottleneck which can be fixed by moving to Matlab from Python.
No argument there as it was "free" for me in college. However, I've never worked in academia - so all the private sector companies I've known to use it had to pay big bucks. Plus, as I'm no longer a student, I suspect it would be a good chunk of money to buy a personal license (I'm guessing).
I have to create figures with Matlab and that's a pain in the ass. Changing XTickLabels kills another part of the figure, and in general it's very hard to do anything a little beyond the basics with figures.
But the basic data analysis is fine. The IDE has awful code completion, and the editor could use more refinement.
One of my bioinformatics courses "required" MATLAB because the class project was based on a simulation framework called the COBRA Toolbox which was developed in MATLAB[1]. I didn't know who to ask about obtaining a MATLAB license, so instead I just got it to work in Octave and used that. I was pleasantly surprised at how little I had to tweak before the framework just worked in Octave, given that as far as I know everyone in the lab that develops the framework just uses MATLAB.
PDL is the Perl Data Language, a perl extension that [...] includes fully vectorized, multidimensional array handling, plus several paths for device-independent graphics output.
PDL is fast, comparable and often outperforming IDL and MATLAB in real world applications. PDL allows large N-dimensional data sets such as large images, spectra, etc to be stored efficiently and manipulated quickly.
I've used Weka, and RapidMiner once. As I recall, RapidMiner seemed to be general-purpose, but lots of posts were about using it to mine stock data to build models.
I think it would be interesting to see breakdowns of different software and where each is used. Oftentimes it seems to me that people just use the tools their peers and co-workers use, and people tend to learn to like whatever they use most.
I'm also learning Clojure and have been playing with Incanter. It seems like quite a decent statistical library/environment. I had a few issues with lazy evaluation in the dynamic charting functions, but I think that has more to do with my inexperience with Clojure than with a problem in Incanter. Also, I'm not too sure how active the project is.