Apache Drill 1.0 – Schema-Free SQL Query Engine for Hadoop and NoSQL

deorus · on May 19, 2015

.

I wish I had Drill a few years ago, but even more nowadays.

Drill and Dremel approach to data querying is really interesting in the sense that it's the tool that changes the 'lenses' on how it sees the data, rather than you have to re-shape the data for how would you like to see it. ++glasses && --photoshop;

Thanks Drill team!

crb · on May 19, 2015

It's been available as a Google cloud service since 2010: https://cloud.google.com/bigquery/what-is-bigquery

eranation · on May 20, 2015

How does it compare with spark SQL? Does the difference lie in where the query runs? In spark it loads all data into memory first and then uses SQL to query it (it parses the query and converts it to a set of spark transformations, kind of like hive). Does Drill actually translate SQL into, for example, mongo db native queries and run it on the source data store? If so this is a whooping amazing feat...

rxin · on May 20, 2015

Actually Spark SQL doesn't load everything into memory. Its data source API supports pushing predicates down, and if the data sources implement it, it even supports running aggregations and joins in the data sources!

MasterScrat · on May 20, 2015

I would really be interested in a comparison of BI tools performances using the JDBC/ODBC drivers from Spark vs Drill.

hurin · on May 20, 2015

How hard is it to translate syntaxes?

p.s. Will translate syntaxes, $250 per.

troysk · on May 20, 2015

This will help save thousands of hours of developer time querying different sources and generating reports for biz analytics teams of startups who know only SQL. Kudos to the team and the project! The lonely junior developer firing queries late at night thanks you. 😊

TallGuyShort · on May 19, 2015

I prefer the term "schema-on-read" to "schema-free" or "schema-less". One needs to correctly understand the implications of the fact that all data has SOME schema and that if it changes you need to be able to read the old and new.

kiyoto · on May 19, 2015

I agree. My employer Treasure Data (http://www.treasuredata.com/product) has been offering exactly this (SQL on schema-on-read data) for the last three years as a cloud service. But in reality, people "get" schema-less much better than "schema-on-read" thanks to all the marketing that MongoDB folks have done for "schemaless".

TallGuyShort · on May 19, 2015

Heard a brilliant quote on this topic at a meetup last year: "If you think your data's schemaless, what that really means is that you don't know what the schema is, and then you really have problems".

ted_dunning · on May 19, 2015

@polskibus,

Heterogeneous can mean different things. Drill attacks problems where there is no well defined schema. Other heterogeneous data source tools are more about accessing different data sources, all relational.

To do what Drill does requires that SQL be changed from a statically typed language to a dynamically typed language that generates code on the fly. Drill also does this in parallel.

polskibus · on May 19, 2015

How does Apache Drill compare to Power Pivot/ Power Query? As far as I understand, both aim to solve the problem if heterogeneous data sources. What I don't understand is how they differ.

IndianAstronaut · on May 20, 2015

I'm not sure power pivot deals well with heterogeneous data sources. It rests on OLAP services which rely on a fixed star schemas in a data warehouse. Power pivot is just a front end for MDX(multi-dimensional expression) queries.

tdicola · on May 19, 2015

Wow this sounds pretty interesting. Is there any python library to use it in ipython/jupyter, etc. yet? I love pandas and the python data 'stack' but sometimes its syntax for joining and querying data is just extremely obtuse and odd. Many times I finding myself wishing I could query stuff like a SQL table.

vicenteg · on May 20, 2015

Check this out: https://github.com/cjmatta/drill_ipython_notebook

Uses the Drill ODBC driver via pyodbc and includes some visualization through Pandas and matplotlib.

tdicola · on May 20, 2015

Very cool, thanks!

djhworld · on May 19, 2015

How does this differ from say, Apache Hive?

jaltekruse · on May 19, 2015

Disclosure, I am a committer on the Apache Drill project.

Traditionally Hive has been focused on batch processing, it started as a SQL query engine built on top of map-reduce. Unfortunately many operations in the SQL model are not well suited for map-reduce. This is a problem that is being addressed by the Hive team with their vectorization/tez work, which is a completely different execution engine from map reduce. Around the time that this work began a number of new projects were trying to solve similar problems, making queries more interactive, providing more flexibility and greater coverage of the SQL language (Hive has supported a healthy subset of the SQL language, but not full support). It is true that this new Hive engine is solving some similar problems as Drill, as is the case for Impala, Facebook's Presto engine and a few other projects.

Drill is focused on providing standard SQL support, instead of the SQL-like HiveQL language. This allows it to be used to hook up to the analytical tools and ecosystem that has been developing around the SQL language for the past few decades.

Drill also has no requirement to define schema ahead of time when querying. This is best illustrated by the JSON read support, you can write a query against a bare file, and Drill will discover the schema as it reads. This is the case for simple files stored on a local or distributed file system or even for document stores like MongoDB.

While there will certainly be cases where schema will need to be known ahead of time, many users of Hadoop and semi-structured datastores are having a hard time with the overhead of trying to fit a wide variety of datasets into a traditional DB modnel, which is what the Hive metastore provides. In cases where the number of schemas in your data is manageable, this can give consumers of the data useful meta-data and structutre. For this reason, Drill has support for reading from Hive tables for these types of use cases.

However while this model has its place, our discussions with current Hadoop users has validated the design of Drill. Analysts want to be able to gain insights from data in Hadoop, and there are cases where the overhead of managing this data like a traditional DB is too resource intensive. Drill enables these types of users to explore raw semi-structured data as it enters the Hadoop cluster, instead of waiting for IT driven ETL.

uberneo · on May 19, 2015

How about the performance , does anybody have tried this yet .. any metrics ?

ddispaltro · on May 19, 2015

I'd be interested to know this too. I know they are using the Calcite[1] query planner which uses statistics to modify a query plan, it's pretty sophisticated. But I am curious about the actual execution time on this stuff.

lerchmo · on May 20, 2015

Any info on the google cloud data store integration?

m_mueller · on May 20, 2015

Sounds very interesting. Is there anything like this for CouchDB? I like couch because it's easily deployed on mobile as well.

vicenteg · on May 20, 2015

There is a couchbase plugin for Drill: https://github.com/jacques-n/drill-couchbase-plugin

m_mueller · on May 20, 2015

Thanks, very interesting!

mark_l_watson · on May 19, 2015

This is sort of similar to the Dremel tool used internally at Google, right? They might mention this on their web site.

ted_dunning · on May 19, 2015

There is an inspiration lineage from Dremel, but there are also very substantial differences. Dremel handles nested data, but doesn't like variable schema and tends to generate code early (AFAIK). Drill handles data that changes shape and type and generates code very late.

The idea that SQL can be applied to data that is pretty different from traditional relational data is definitely common between the two.

ckluis · on May 19, 2015

Wow. That looks pretty amazing.

aggress · on May 19, 2015

Congratulations to the Apache Drill team, Drill's awesome.

turingbook · on May 20, 2015

How does it compare with Impala？

tshiran · on May 19, 2015