@Duy Tran - A reference to the TableAPI you've built brought me here. Absolutely wonderful job with the article, and an even better job with building a scalable data system that enforces standardization.
Although you've underscored the importance of standardized metadata by defining interfaces and APIs rather than writing custom code, I'll still say one can't stress that point enough. I'm curious about the testing and quality frameworks: did you use existing ones, or have you built your own?
@Amir - Both indexes and partitions help with joining tables. If you have a lot of data on a non-distributed system, such as a single MySQL or PostgreSQL instance, indexes might get too big, and maintaining them can become a pain. So you can think about partitioning the tables: when a heavy query comes in, it will first prune the partitions and then use the index to reach the data that you actually need.
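Here's a rough sketch of that prune-then-index idea in plain Python. It is only a toy model, not how MySQL or PostgreSQL implement it internally; the `PartitionedTable` class, the month-based partition key, and the `day` index are all made up for illustration.

```python
import bisect
from collections import defaultdict


class PartitionedTable:
    """Toy table partitioned by (year, month), with a sorted index on day."""

    def __init__(self):
        # Partition key (year, month) -> sorted list of (day, row_id) pairs.
        # Assumes row_id is a non-negative integer.
        self.partitions = defaultdict(list)

    def insert(self, year, month, day, row_id):
        # Keep each partition's index sorted as rows arrive.
        bisect.insort(self.partitions[(year, month)], (day, row_id))

    def lookup(self, year, month, day):
        # Step 1: partition pruning. Only the matching partition is touched.
        index = self.partitions.get((year, month), [])
        # Step 2: binary search on the (much smaller) per-partition index.
        lo = bisect.bisect_left(index, (day, -1))
        hi = bisect.bisect_right(index, (day, float("inf")))
        return [row_id for _, row_id in index[lo:hi]]


table = PartitionedTable()
table.insert(2024, 1, 5, 10)
table.insert(2024, 1, 5, 11)
table.insert(2024, 1, 6, 12)
table.insert(2024, 2, 5, 13)

january_fifth = table.lookup(2024, 1, 5)  # searches only the January partition
no_partition = table.lookup(2024, 3, 1)   # no March partition exists at all
```

The point is that each index stays small because it only covers one partition, so both the lookup and the index maintenance get cheaper.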
In distributed processing systems like Spark, partitions have different connotations. The internals work differently. Having said that, the end goal of partitioning there too is pruning the ones that you don't need. What you mentioned about loading tables in parallel applies here.
@Robby - Awesome introduction to DuckDB! I have been reading about it and have been doing some basic PoCs using DuckDB lately. I'm not sure how many databases have actually implemented the EXCLUDE keyword. I think it's a novelty in DuckDB. With ClickHouse, Druid, and now DuckDB (and maybe some more), I think there are plenty of really good open-source OLAP engines to choose from.
The fact that the engineering team quickly figured out what was wrong and fixed it within four hours (which is impressive, given the scale of the problem), and then thought deeply about what happened in the post-incident analysis, is a testament to some great work under intense pressure.
It's great when recommendation engines work and you end up reading something meaningful. I stumbled upon this blog post as it showed up in my feed. It's full of very practical advice regarding compensation, growth, and the ethics of changing jobs in a really competitive market.
I agree with most of the personal beliefs that you have used as the base of your strategy. I might disagree with you on others, but the important thing you've done (and I haven't done yet) is to write it down. Writing it down really helps structure your thoughts.
I didn't know about this tool. I would have written some kind of recursive SQL query or some Python code to generate test data, but it's great to know that Databricks already provides this out of the box.
This can be great for running small-scale tests for loads specific to your business, but when comparing other technologies with Databricks, using TPC-DS or TPC-H might be the way to go.
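For comparison, the hand-rolled approach I had in mind looks roughly like the sketch below: a recursive SQL query that generates rows, run here through Python's built-in sqlite3 module rather than on Databricks. The `generate_test_rows` function, the column names, and the value formulas are all invented for illustration.

```python
import sqlite3


def generate_test_rows(n):
    """Generate n (n >= 1) synthetic order rows using a recursive CTE."""
    conn = sqlite3.connect(":memory:")
    rows = conn.execute(
        """
        WITH RECURSIVE seq(id) AS (
            SELECT 1
            UNION ALL
            SELECT id + 1 FROM seq WHERE id < ?
        )
        SELECT id,
               'customer_' || (id % 10) AS customer,
               (id * 37) % 1000 AS amount
        FROM seq
        """,
        (n,),
    ).fetchall()
    conn.close()
    return rows


sample = generate_test_rows(5)
for row in sample:
    print(row)
```

It works for small tests, but you end up maintaining the value formulas yourself, which is exactly the part a built-in generator takes off your hands.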