vutrinh @_vutrinh

My mom read my articles to support her son. Now, she can design a data architecture and write ETL scripts. Join 7,290 readers at 👉 vutr.substack.com

Joined March 2023

Tweets 47 · Followers 123 · Following 216 · Likes 238

vutrinh @_vutrinh

8 months ago

Parquet is not a columnar format. In fact, it’s a hybrid format combining the best of row and column formats. Parquet groups data into subsets of rows (horizontal partitioning). In each subset, data for each column is stored close together (vertical partitioning). A Parquet file is…

0 0 2 102 1
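The hybrid layout described in the tweet above can be sketched in a few lines of plain Python. This is only an illustration of the idea (split rows into groups, then store each group column by column), not how a Parquet writer is actually implemented; the function name `to_row_groups`, the `row_group_size` parameter, and the sample rows are all hypothetical.

```python
from typing import Any


def to_row_groups(rows: list[dict[str, Any]], row_group_size: int) -> list[dict[str, list[Any]]]:
    """Split rows into groups, then lay out each group column by column."""
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    groups = []
    for start in range(0, len(rows), row_group_size):
        # Horizontal partition: take a subset of consecutive rows.
        subset = rows[start:start + row_group_size]
        # Vertical partition: within the subset, keep each column's values together.
        columns = {name: [row[name] for row in subset] for name in subset[0]}
        groups.append(columns)
    return groups


# Hypothetical sample data, used only for illustration.
rows = [
    {"id": 1, "city": "Hanoi"},
    {"id": 2, "city": "Hue"},
    {"id": 3, "city": "Da Nang"},
]
layout = to_row_groups(rows, row_group_size=2)
```

With a group size of 2, the three rows end up in two groups, and inside each group the `id` values sit next to each other, followed by the `city` values.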


vutrinh @_vutrinh

9 months ago

🚀🚀 DuckDB is great. It allows us to execute analytical SQL on a local laptop with only minutes of setup. Here are some bullet points about its storage from my self-learning process with DuckDB’s materials and source code. ◉ Two modes: persistent and in-memory; the latter will…

0 0 1 114 0

Pekka Enberg @penberg

10 months ago

Paper I would love to read but instead have to write? 🤔

11 18 469 27K 215


Shivang Agarwal @shivang_in

10 months ago

Have you ever wondered how a Parquet dataset is written to disk? Parquet is a self-describing file format that contains all the information needed by the application that consumes the file. Parquet organizes data in a hybrid format behind the scenes.

1 1 5 691 3


vutrinh @_vutrinh

12 months ago

🚀🚀 How does Apache Spark execute the applications for us? A few weeks ago, I wrote an article that gave an overview of Apache Spark. Let’s revisit how Spark handles processing—from user-defined logic to execution by the executors: ◉ Defining the Application: The user defines…

0 1 5 123 1


vutrinh @_vutrinh

12 months ago

🤔 My humble observation Large-scale cloud OLAP has increasingly converged toward the lakehouse paradigm. Below are some insights from my research—feel free to discuss or share corrections if you find anything off! 📌 In this context: ➝ Internal tables refer to data loaded…

0 0 2 158 0

vutrinh @_vutrinh

12 months ago

🚀🚀 How does @ApacheSpark plan the execution for us? (With the help of the Catalyst Optimizer) When we define DataFrame transformation logic, it must first go through an optimization process before execution. This involves four key phases: ◉ Analysis: Spark SQL starts by…

0 0 1 99 0


vutrinh @_vutrinh

12 months ago

🚀🚀 What does the @ApacheIceberg reading process look like? ◉ The reader first visits the catalog to retrieve the table's current metadata file location. ◉ After fetching the metadata file, it collects the table’s schema and checks partition schemes to understand the data…

0 1 0 113 0
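The first two steps in the tweet above can be sketched with in-memory stand-ins. This is a toy model under stated assumptions, not the Iceberg API: the `catalog` and `metadata_files` dictionaries, the table name, the file path, the simplified field names, and the `plan_read` helper are all hypothetical, and the sketch covers only the steps visible before the tweet is cut off.

```python
# Hypothetical stand-in for a catalog: table name -> current metadata file location.
catalog = {"db.events": "s3://bucket/events/metadata/v2.metadata.json"}

# Hypothetical stand-in for object storage: metadata file location -> parsed contents.
# Real metadata files carry far more detail; the fields here are simplified.
metadata_files = {
    "s3://bucket/events/metadata/v2.metadata.json": {
        "schema": ["event_id", "event_time", "payload"],
        "partition_spec": ["day(event_time)"],
    }
}


def plan_read(table_name: str) -> tuple[str, list[str], list[str]]:
    """Resolve a table to its metadata file, then read the schema and partitioning."""
    # Step 1: ask the catalog where the table's current metadata file lives.
    location = catalog[table_name]
    # Step 2: fetch that metadata file and collect the schema and partition scheme.
    metadata = metadata_files[location]
    return location, metadata["schema"], metadata["partition_spec"]


location, schema, partition_spec = plan_read("db.events")
```

The point of the indirection is that the catalog holds only a pointer, so a reader always starts from whichever metadata file is current for the table.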


AutoMQ: Cost-Effective Auto-Scaling Kafka @AutoMQ_Lab

a year ago

🎉 Wow. This is truly an epic masterpiece. This article from Vu Trinh (@_vutrinh), with its vivid illustrations, breaks down and explains the technical architecture of AutoMQ in a very clear and understandable way. If you're interested in the cloud-native technical architecture of…