ETL Data Pipeline for Milo Sales Performance

Data Extraction, Transformation, Storage, and Basic Analysis on Lazada Indonesia


01—Introduction

By Q4 2025, sales data generated across Indonesia's major e-commerce platforms had increased significantly in volume and operational complexity. For brands operating on Lazada Indonesia, transaction records continued to grow but remained fragmented and difficult to use. Data came from different sources with inconsistent structures, forcing teams to spend more time cleaning files and manually merging spreadsheets than analyzing performance.

Without a basic ETL data pipeline and a centralized database, generating simple insights or checking basic correlations became slow, repetitive, and error-prone. This friction quietly limited day-to-day decision making in a high-throughput marketplace environment.

02—Data Overview

The dataset used in this analysis was collected by scraping publicly available Lazada Indonesia product and sales listing pages. It captures item-level fields such as product name, price, rating, units sold, review count, and seller attributes, which are extracted and processed through a structured ETL pipeline.

Raw HTML content is transformed into a tabular format and stored in a centralized database, enabling basic performance exploration and simple correlation analysis across key sales-related variables.

03—Methodology

Data Extraction

Data is collected through web scraping from publicly available Lazada Indonesia product listing pages using automated extraction tools.
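As a rough illustration of the extraction step, the sketch below parses a small HTML snippet with BeautifulSoup. The markup, the CSS class names, the `FIELDS` tuple, and the `extract_products` helper are all hypothetical placeholders, since the real Lazada page structure is not described here; the sample values are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page structure and class names are not
# documented in this case study, so these selectors are placeholders.
SAMPLE_HTML = """
<div class="product-card">
  <a class="title">Sample Product A</a>
  <span class="price">Rp25.000</span>
  <span class="sold">1,2rb sold</span>
  <span class="rating">4.9</span>
  <span class="reviews">(350)</span>
  <span class="location">Depok</span>
</div>
<div class="product-card">
  <a class="title">Sample Product B</a>
  <span class="price">Rp98.500</span>
  <span class="location">Bekasi</span>
</div>
"""

FIELDS = ("title", "price", "sold", "rating", "reviews", "location")


def extract_products(html):
    """Return one dict of raw text fields per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.product-card"):
        row = {}
        for field in FIELDS:
            node = card.select_one(f".{field}")
            # Missing elements become None so the transform step can handle them.
            row[field] = node.get_text(strip=True) if node else None
        rows.append(row)
    return rows


records = extract_products(SAMPLE_HTML)
```

Keeping the raw text untouched at this stage (for example `"Rp25.000"` rather than a number) leaves all parsing decisions to the transformation step.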

Data Transformation

Extracted data is cleaned and standardized to resolve inconsistent formats, missing values, and duplicate records. Fields are normalized into structured columns to ensure consistency and analytical readiness.
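A minimal sketch of what such cleaning might look like in Pandas is shown below. The input formats (`"Rp25.000"`, `"1,2rb sold"`, `"(350)"`), the column names, and the helpers `parse_price`, `parse_count`, and `clean` are assumptions made for illustration, not the project's actual rules; the rows are made up.

```python
import re

import pandas as pd

# Made-up raw rows in an assumed scraped format.
raw = pd.DataFrame([
    {"title": " Sample A ", "price": "Rp25.000", "sold": "1,2rb sold", "rating": "4.9", "reviews": "(350)"},
    {"title": "Sample A", "price": "Rp25.000", "sold": "1,2rb sold", "rating": "4.9", "reviews": "(350)"},
    {"title": "Sample B", "price": "Rp98.500", "sold": None, "rating": None, "reviews": None},
])


def parse_price(text):
    """'Rp25.000' -> 25000.0, assuming dots are thousands separators."""
    if pd.isna(text):
        return float("nan")
    digits = re.sub(r"\D", "", str(text))
    return float(digits) if digits else float("nan")


def parse_count(text):
    """'1,2rb sold' -> 1200.0 and '(350)' -> 350.0, assuming 'rb'/'k' mean thousands."""
    if pd.isna(text):
        return float("nan")
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*(rb|k)?", str(text).lower())
    if not match:
        return float("nan")
    number = float(match.group(1).replace(",", "."))
    return float(round(number * 1000)) if match.group(2) else number


def clean(frame):
    """Standardize text and numeric fields, then drop duplicate listings."""
    df = frame.copy()
    df["title"] = df["title"].str.strip()
    df["price"] = df["price"].map(parse_price)
    df["sold"] = df["sold"].map(parse_count)
    df["reviews"] = df["reviews"].map(parse_count)
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    # Rows without a title or price cannot be analyzed; other gaps stay as NaN.
    df = df.dropna(subset=["title", "price"])
    df = df.drop_duplicates(subset=["title", "price"])
    return df.reset_index(drop=True)


clean_df = clean(raw)
```

In this sketch, missing sales, rating, and review values are left as NaN rather than filled with zeros, so later statistics are not skewed by invented values.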

Data Loading

The transformed dataset is stored in a centralized PostgreSQL database, creating a single source of truth for analysis.
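The loading step can be sketched with any Python DB-API connection. The example below is an assumption-heavy illustration: the table name `milo_products`, its columns, and the `load_rows` helper are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the snippet runs without a server. With a PostgreSQL driver such as psycopg2, the same function would typically be called with the default `%s` placeholder.

```python
import sqlite3

# Hypothetical table layout; the project's real schema is not documented here.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS milo_products (
    title   TEXT PRIMARY KEY,
    price   NUMERIC,
    sold    NUMERIC,
    rating  NUMERIC,
    reviews NUMERIC
)
"""


def load_rows(conn, rows, placeholder="%s"):
    """Create the table if needed and insert the rows in one transaction."""
    marks = ", ".join([placeholder] * 5)
    insert_sql = (
        "INSERT INTO milo_products (title, price, sold, rating, reviews) "
        f"VALUES ({marks})"
    )
    cur = conn.cursor()
    cur.execute(CREATE_SQL)
    # Parameterized inserts avoid building SQL strings from scraped text.
    cur.executemany(insert_sql, rows)
    conn.commit()
    cur.close()
    return len(rows)


# Made-up rows; None is stored as SQL NULL.
rows = [
    ("Sample A", 25000, 1200, 4.9, 350),
    ("Sample B", 98500, None, None, None),
]

# SQLite stand-in for the demo; it uses "?" instead of "%s" as placeholder.
conn = sqlite3.connect(":memory:")
loaded = load_rows(conn, rows, placeholder="?")
count = conn.execute("SELECT COUNT(*) FROM milo_products").fetchone()[0]
```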

Basic Data Analysis

Exploratory analysis is conducted to generate descriptive insights and identify basic correlations between key variables such as price, total sold, ratings, and reviews.
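One simple way to run this kind of check in Pandas is a pairwise correlation matrix. The sketch below uses made-up numbers and assumed column names; `DataFrame.corr()` computes the Pearson coefficient by default, which may or may not be the measure used in the project.

```python
import pandas as pd

# Made-up values for illustration only; not the project's data.
df = pd.DataFrame({
    "price":   [25000, 98500, 12000, 54000, 31000],
    "sold":    [1200, 400, 800, 200, 600],
    "rating":  [4.9, 4.7, 4.8, 4.5, 4.6],
    "reviews": [300, 100, 200, 50, 150],
})

# Descriptive statistics (count, mean, std, quartiles) per column.
summary = df.describe()

# Pairwise correlation matrix across the four variables.
corr = df.corr()

# A single pair can be read from the matrix or computed directly.
sold_vs_reviews = df["sold"].corr(df["reviews"])
price_vs_sold = corr.loc["price", "sold"]
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.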

04—Data Preview

ETL Pipeline Dashboard Preview

05—Market Insights

Product Popularity & Sales Performance

1. Most Popular Product

The most popular Milo product based on the number of reviews is Nestlé Milo Activ-Go Chocolate Milk 22g (Pack of 10), indicating the highest level of consumer exposure and engagement.

2. Best-Selling Product

The best-selling Milo product based on units sold is also Nestlé Milo Activ-Go Chocolate Milk 22g (Pack of 10), showing strong alignment between popularity and sales performance.

3. Top Sub-Category

The Activ-Go sub-category dominates both popularity and sales, confirming its role as Milo's flagship product line.
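Rankings like the three above could be derived with a few Pandas operations. The sketch below assumes hypothetical `title`, `sub_category`, `sold`, and `reviews` columns and uses made-up values, so it shows the approach rather than reproducing the reported results.

```python
import pandas as pd

# Made-up listings; the column names are assumptions.
df = pd.DataFrame({
    "title": ["Sample A", "Sample B", "Sample C", "Sample D"],
    "sub_category": ["Activ-Go", "Activ-Go", "Less Sugar", "RTD"],
    "sold": [500, 300, 200, 50],
    "reviews": [120, 80, 40, 10],
})

# Most popular product: the listing with the most reviews.
most_reviewed = df.loc[df["reviews"].idxmax(), "title"]

# Best-selling product: the listing with the most units sold.
best_selling = df.loc[df["sold"].idxmax(), "title"]

# Top sub-category: total units sold and reviews per sub-category.
by_sub_category = (
    df.groupby("sub_category")[["sold", "reviews"]]
    .sum()
    .sort_values("sold", ascending=False)
)
top_sub_category = by_sub_category.index[0]
```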

Correlation Insights

1. Popularity vs Sales

A correlation coefficient of 0.86604 indicates a strong positive relationship between units sold and the number of reviews.

2. Reviews Signal

Products with higher sales volumes tend to accumulate more reviews, suggesting that reviews reflect product popularity and consumer trust.

3. Price vs Units Sold

The correlation between price and units sold is 0.078871, indicating almost no relationship between the two variables.

Market & Distribution Patterns

1. Sub-Category Performance

Activ-Go contributes the largest share of sales, while less-sugar variants indicate growing demand from health-conscious consumers. RTD (can) variants underperform in e-commerce channels.

2. Sales by Location

Depok records the highest number of sellers and the highest purchase frequency, suggesting that it serves as a key distribution or fulfillment hub.

3. Geographic Concentration

Cities with fewer sellers, such as Bekasi and North Jakarta, still generate significant sales volume, while seller distribution remains concentrated in major cities on Java Island.

06—Conclusion

This project demonstrates how a structured ETL process converts raw e-commerce data into a centralized and analysis-ready database. Using basic exploratory analysis and simple correlation checks, it provides a clear view of Milo's sales performance on Lazada Indonesia without relying on advanced modeling.

The resulting dataset and insights form a reliable foundation for data-driven decision making and support future extensions into deeper consumer analysis or broader business insights.

07—Libraries & Tools

BeautifulSoup
pgAdmin4
SQL
Python
Pandas
NumPy
Matplotlib
statsmodels
GitHub

08—GitHub Repository

Access the complete code and documentation for the ETL Data Pipeline on GitHub: Visit GitHub Repository