I want a notebook situation where the platform understands sampling, so that while I'm doing my EDA, initial development, and generally the kinds of work that are appropriate to do in a notebook, I'm never working with 100GB data frames.
I suspect that a big part of my annoyance about the current state of the data space is that parts of the ecosystem were designed with the needs of data scientists in mind, and other parts of the ecosystem were designed with the needs of data engineers in mind, and it's all been jammed together in a way that makes sure nobody can ever be happy.
You can already sample data if you want (or sequentially load partial data, which is what I usually do if I just want to test basic transformations), but if you need to worry about rare occurrences (and don't know their rate), sampling can be dangerous. For example, when validating data there are edge cases that are very rare (i.e., sometimes I catch issues occurring in fewer than one record per billion), and it can be hard to catch them without looking at all of the data.
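To put a number on why sampling is dangerous here: if a defect appears at some rate p, the chance a simple random sample of n rows contains at least one such record is 1 - (1 - p)^n. A minimal sketch, using the "one record per billion" rate from above and illustrative sample sizes (not figures from the thread):

```python
import math

def p_at_least_one(rate: float, sample_size: int) -> float:
    """Probability a uniform random sample contains >= 1 rare record.

    P(none in sample) = (1 - rate)^n, computed via log1p/exp for
    numerical stability at tiny rates; the complement is the hit chance.
    """
    return 1.0 - math.exp(sample_size * math.log1p(-rate))

rate = 1e-9  # "fewer than one record per billion"

# Even a million-row sample almost certainly misses the defect;
# only at tens-of-billions of rows does detection become likely.
for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"n={n:>14,}: P(catch >= 1 rare record) = {p_at_least_one(rate, n):.4f}")
```

With a million-row sample the defect shows up about 0.1% of the time, which is why validation work that hunts for rare records effectively requires a full scan rather than a sample.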