I wish I had Drill a few years ago, but even more nowadays.
Drill and Dremel approach to data querying is really interesting in the sense that it's the tool that changes the 'lenses' on how it sees the data, rather than you have to re-shape the data for how would you like to see it. ++glasses && --photoshop;
How does it compare with spark SQL? Does the difference lie in where the query runs? In spark it loads all data into memory first and then uses SQL to query it (it parses the query and converts it to a set of spark transformations, kind of like hive). Does Drill actually translate SQL into, for example, mongo db native queries and run it on the source data store? If so this is a whooping amazing feat...
Actually Spark SQL doesn't load everything into memory. Its data source API supports pushing predicates down, and if the data sources implement it, it even supports running aggregations and joins in the data sources!
This will help save thousands of hours of developer time querying different sources and generating reports for biz analytics teams of startups who know only SQL. Kudos to the team and the project! The lonely junior developer firing queries late at night thanks you. 😊
I prefer the term "schema-on-read" to "schema-free" or "schema-less". One needs to correctly understand the implications of the fact that all data has SOME schema and that if it changes you need to be able to read the old and new.
I agree. My employer Treasure Data (http://www.treasuredata.com/product) has been offering exactly this (SQL on schema-on-read data) for the last three years as a cloud service. But in reality, people "get" schema-less much better than "schema-on-read" thanks to all the marketing that MongoDB folks have done for "schemaless".
Heard a brilliant quote on this topic at a meetup last year: "If you think your data's schemaless, what that really means is that you don't know what the schema is, and then you really have problems".
Heterogeneous can mean different things. Drill attacks problems where there is no well defined schema. Other heterogeneous data source tools are more about accessing different data sources, all relational.
To do what Drill does requires that SQL be changed from a statically typed language to a dynamically typed language that generates code on the fly. Drill also does this in parallel.
How does Apache Drill compare to Power Pivot/ Power Query? As far as I understand, both aim to solve the problem if heterogeneous data sources. What I don't understand is how they differ.
I'm not sure power pivot deals well with heterogeneous data sources. It rests on OLAP services which rely on a fixed star schemas in a data warehouse. Power pivot is just a front end for MDX(multi-dimensional expression) queries.
Wow this sounds pretty interesting. Is there any python library to use it in ipython/jupyter, etc. yet? I love pandas and the python data 'stack' but sometimes its syntax for joining and querying data is just extremely obtuse and odd. Many times I finding myself wishing I could query stuff like a SQL table.
Disclosure, I am a committer on the Apache Drill project.
Traditionally Hive has been focused on batch processing, it started as a SQL query engine built on top of map-reduce. Unfortunately many operations in the SQL model are not well suited for map-reduce. This is a problem that is being addressed by the Hive team with their vectorization/tez work, which is a completely different execution engine from map reduce. Around the time that this work began a number of new projects were trying to solve similar problems, making queries more interactive, providing more flexibility and greater coverage of the SQL language (Hive has supported a healthy subset of the SQL language, but not full support). It is true that this new Hive engine is solving some similar problems as Drill, as is the case for Impala, Facebook's Presto engine and a few other projects.
Drill is focused on providing standard SQL support, instead of the SQL-like HiveQL language. This allows it to be used to hook up to the analytical tools and ecosystem that has been developing around the SQL language for the past few decades.
Drill also has no requirement to define schema ahead of time when querying. This is best illustrated by the JSON read support, you can write a query against a bare file, and Drill will discover the schema as it reads. This is the case for simple files stored on a local or distributed file system or even for document stores like MongoDB.
While there will certainly be cases where schema will need to be known ahead of time, many users of Hadoop and semi-structured datastores are having a hard time with the overhead of trying to fit a wide variety of datasets into a traditional DB modnel, which is what the Hive metastore provides. In cases where the number of schemas in your data is manageable, this can give consumers of the data useful meta-data and structutre. For this reason, Drill has support for reading from Hive tables for these types of use cases.
However while this model has its place, our discussions with current Hadoop users has validated the design of Drill. Analysts want to be able to gain insights from data in Hadoop, and there are cases where the overhead of managing this data like a traditional DB is too resource intensive. Drill enables these types of users to explore raw semi-structured data as it enters the Hadoop cluster, instead of waiting for IT driven ETL.
I'd be interested to know this too. I know they are using the Calcite[1] query planner which uses statistics to modify a query plan, it's pretty sophisticated. But I am curious about the actual execution time on this stuff.
There is an inspiration lineage from Dremel, but there are also very substantial differences. Dremel handles nested data, but doesn't like variable schema and tends to generate code early (AFAIK). Drill handles data that changes shape and type and generates code very late.
The idea that SQL can be applied to data that is pretty different from traditional relational data is definitely common between the two.
I wish I had Drill a few years ago, but even more nowadays.
Drill and Dremel approach to data querying is really interesting in the sense that it's the tool that changes the 'lenses' on how it sees the data, rather than you have to re-shape the data for how would you like to see it. ++glasses && --photoshop;
Thanks Drill team!