Kovid Rathee

@Duy Tran - A reference to the TableAPI you've built brought me here. Absolutely wonderful job with the article, and an even better job of building a scalable data system that enforces standardization.

Although you've underscored the importance of standardized metadata by way of defining interfaces and APIs rather than custom code, I'll still say one can't stress that point enough. I'm curious about the testing and quality frameworks you've used. Or have you built your own?

--

@Amir - Both indexes and partitions help with joining tables. If you're not on a distributed system (say, a single MySQL or PostgreSQL instance) but still have a lot of data, indexes might get too big, and maintaining them can become a pain. In that case, you can think about partitioning the tables: when a heavy query comes in, the database first prunes the partitions it doesn't need and then uses the index within the remaining ones to reach the data that you actually need.
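To make the prune-then-index idea concrete, here is a minimal toy sketch in Python. It is not how MySQL or PostgreSQL implement partitioning internally; the `PartitionedTable` class, the choice of `year` as the partition key, and `user_id` as the indexed column are all hypothetical, picked only for illustration.

```python
from bisect import bisect_left, insort


class PartitionedTable:
    """Toy table partitioned by year, with a sorted 'index' on user_id per partition.

    Hypothetical illustration only; real databases use B-trees and a query planner.
    """

    def __init__(self):
        # year -> list of (user_id, amount) tuples kept sorted by user_id
        self.partitions = {}

    def insert(self, year, user_id, amount):
        # insort keeps each partition sorted, which stands in for an index
        insort(self.partitions.setdefault(year, []), (user_id, amount))

    def lookup(self, years, user_id):
        """Return matching rows and the number of partitions actually searched."""
        hits, searched = [], 0
        for year in years:
            part = self.partitions.get(year)
            if part is None:
                continue  # no such partition; nothing to search
            searched += 1
            # Binary search inside the surviving partition (the "index" step).
            # (user_id,) sorts before any (user_id, amount) tuple.
            i = bisect_left(part, (user_id,))
            while i < len(part) and part[i][0] == user_id:
                hits.append((year, part[i][0], part[i][1]))
                i += 1
        return hits, searched


table = PartitionedTable()
table.insert(2021, 7, 10.0)
table.insert(2022, 7, 20.0)
table.insert(2022, 9, 5.0)
table.insert(2023, 7, 30.0)

# A query filtered on year = 2022 only touches that one partition.
hits, searched = table.lookup([2022], 7)
print(hits, searched)
```

The point of the sketch is that each per-partition index stays small, and a filter on the partition key means the 2021 and 2023 partitions are never searched at all.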

In distributed processing systems like Spark, partitions mean something a little different, and the internals work differently. Having said that, the end goal of partitioning there, too, is to prune the partitions that you don't need. What you mentioned about loading tables in parallel applies here.

--

@Robby - Awesome introduction to DuckDB! I have been reading about it and have been doing some basic PoCs using DuckDB lately. I'm not sure how many databases have actually implemented the EXCLUDE keyword. I think it's a novelty in DuckDB. With ClickHouse, Druid, and now DuckDB (and maybe some more), I think there are plenty of really good open-source OLAP engines to choose from.

--

Kovid Rathee

I write about tech, Indian classical music, literature, and the workplace among other things. 1x engineer on weekdays. https://kovidrathee.medium.com/membership