My mom read my articles to support her son. Now, she can design a data architecture and write ETL scripts.
Join 7,290 readers at 👉 vutr.substack.com
Joined March 2023
Parquet is not a columnar format.
In fact, it’s a hybrid format that combines the best of row and column formats.
Parquet groups data into subsets of rows (horizontal partitioning).
Within each subset, the data for each column is stored close together (vertical partitioning).
A Parquet file is…
🚀🚀 DuckDB is great.
It lets us run analytical SQL on a local laptop with only minutes of setup.
Here are some notes on its storage from my self-learning through DuckDB’s materials and source code.
◉ Two modes: persistent and in-memory; the latter will…
Have you ever wondered how a Parquet dataset is written to disk?
Parquet is a self-describing file format: the file contains all the information an application needs to consume it.
Parquet organizes data in a hybrid format behind the scenes.
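To make the hybrid layout concrete, here is a minimal sketch in plain Python. It is not how Parquet is implemented and it does not use any Parquet library. The function name `to_hybrid_layout` and the parameter `rows_per_group` are made up for illustration, and the sketch assumes every row has the same keys. It only shows the idea: split the rows into groups first, then store each group's values column by column.

```python
from typing import Any


def to_hybrid_layout(
    rows: list[dict[str, Any]], rows_per_group: int
) -> list[dict[str, list[Any]]]:
    """Split rows into groups, then lay out each group column by column.

    Illustrative only: assumes all rows share the same keys.
    """
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")

    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        # Horizontal partition: take a contiguous subset of rows.
        group_rows = rows[start:start + rows_per_group]
        # Vertical partition: within the subset, keep each column's
        # values next to each other.
        columns = {
            name: [row[name] for row in group_rows]
            for name in group_rows[0]
        }
        row_groups.append(columns)
    return row_groups


rows = [
    {"id": 1, "city": "Hanoi"},
    {"id": 2, "city": "Hue"},
    {"id": 3, "city": "Da Nang"},
]
layout = to_hybrid_layout(rows, rows_per_group=2)
print(layout)
```

With three rows and two rows per group, the result has two groups, and inside each group the `id` values sit together and the `city` values sit together.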
🚀🚀 How does Apache Spark execute the applications for us?
A few weeks ago, I wrote an article that gave an overview of Apache Spark. Let’s revisit how Spark handles processing—from user-defined logic to execution by the executors:
◉ Defining the Application: The user defines…
🤔 My humble observation
Large-scale cloud OLAP has increasingly converged toward the lakehouse paradigm. Below are some insights from my research—feel free to discuss or share corrections if you find anything off!
📌 In this context:
➝ Internal tables refer to data loaded…
🚀🚀 How does @ApacheSpark plan the execution for us?
(With the help of Catalyst Optimizer)
When we define DataFrame transformation logic, it must first go through an optimization process before execution. This involves four key phases:
◉ Analysis: Spark SQL starts by…
🚀🚀 What does the @ApacheIceberg reading process look like?
◉ The reader first visits the catalog to retrieve the table's current metadata file location.
◉ After fetching the metadata file, it collects the table’s schema and checks partition schemes to understand the data…
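The first two steps above can be sketched with a toy in-memory model. This does not use the Iceberg API. The dictionaries `catalog` and `metadata_files`, the function `start_read`, the table name, the path, and the simplified keys `schema` and `partition_spec` are all hypothetical stand-ins, not real Iceberg field names. The sketch only shows the order of lookups: catalog first, then the metadata file.

```python
# Toy stand-ins for a catalog and object storage. Every name, path,
# and key below is hypothetical and simplified for illustration.
catalog = {
    "db.events": "s3://example-bucket/events/metadata/v2.metadata.json",
}

metadata_files = {
    "s3://example-bucket/events/metadata/v2.metadata.json": {
        "schema": [("event_id", "long"), ("event_date", "date")],
        "partition_spec": ["event_date"],
    },
}


def start_read(table_name: str) -> tuple[str, list, list]:
    """Follow the first two steps: catalog lookup, then metadata file."""
    # Step 1: ask the catalog where the table's current metadata file is.
    metadata_location = catalog[table_name]
    # Step 2: fetch that metadata file, then collect the schema and
    # the partition scheme from it.
    metadata = metadata_files[metadata_location]
    return metadata_location, metadata["schema"], metadata["partition_spec"]


location, schema, partition_spec = start_read("db.events")
print(location)
print(schema)
print(partition_spec)
```

The point of the indirection is that the catalog only stores a pointer, so the reader always starts from whichever metadata file is current.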
🎉 Wow. This is truly an epic masterpiece. Article from Vu Trinh(@_vutrinh), with its vivid illustrations, breaks down and explains the technical architecture of AutoMQ in a very clear and understandable way. If you're interested in the cloud-native technical architecture of…