01—Business Problem
Challenge: Social media teams struggle to distinguish high-performing posts from those that appear successful due to reach anomalies or viral outliers. This leads to misallocated resources, poor content strategy decisions, and missed optimization opportunities.
Solution Approach: A comprehensive analysis of roughly 30,000 Instagram posts (29,999 records after validation) identifies actionable patterns in engagement behavior, content characteristics, and posting strategies that drive genuine audience connection.
1. Business Questions
- Resource Optimization: Which content types and characteristics justify additional production budget?
- Content Calendar Planning: What posting times, days, and content themes should dominate monthly calendars to maximize engagement?
- Campaign Performance: How do we identify which campaigns will drive follower acquisition vs. engagement vs. conversions?
- Competitive Positioning: What content strategies differentiate high-performing creators from average performers?
2. Success Metrics
Engagement Rate serves as the north star metric because it normalizes performance across accounts of different sizes and directly correlates with algorithmic visibility. A 1% improvement in average engagement rate can translate to a 15-25% increase in organic reach.
02—Dataset Overview
The dataset is sourced from Kaggle and contains post-level Instagram analytics designed for engagement analysis and prediction tasks.
Dataset Summary:
- 23 features capturing post reach, user interactions, and content attributes
- Numerical metrics: impressions, reach, likes, comments, saves, shares, profile visits, follows
- Categorical features: media type, content category, traffic source, captions, hashtags
- Temporal features: posting date, time, day of week for timing analysis
Analysis Scope: Statistical exploration to uncover engagement drivers, content patterns, and strategic insights that inform data-driven content decisions.
03—Data Analysis Workflow
1. Data Extraction and Validation
Loaded dataset from Kaggle, performed data quality assessment, validated data types, checked for missing values and duplicates, and confirmed structural integrity. Result: 29,999 clean records with 0 missing values across 23 features.
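The validation checks described above can be sketched with the Python standard library. This is a minimal illustration, not the project's actual code: the field names (`post_id`, `reach`, `likes`) and the toy records are assumptions made for the example.

```python
from collections import Counter


def quality_report(rows, key_fields):
    """Count missing values per column and rows that repeat on key_fields."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1

    # A row counts as a duplicate if its key_fields values were already seen.
    seen = set()
    duplicates = 0
    for row in rows:
        signature = tuple(row.get(field) for field in key_fields)
        if signature in seen:
            duplicates += 1
        else:
            seen.add(signature)

    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Toy records standing in for the real post-level data (hypothetical fields).
posts = [
    {"post_id": "a1", "reach": 1200, "likes": 50},
    {"post_id": "a2", "reach": 900, "likes": None},
    {"post_id": "a1", "reach": 1200, "likes": 50},
]
report = quality_report(posts, key_fields=["post_id", "reach", "likes"])
```

On a clean dataset such as the one reported here, the `missing` dictionary would be empty and `duplicates` would be zero.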
2. Exploratory Data Analysis
Conducted statistical profiling including central tendency, distribution analysis, correlation assessment, and outlier detection. Analyzed engagement patterns across media types, posting times, content categories, and hashtag strategies to identify performance drivers and optimization opportunities.
3. Feature Engineering
Created derived metrics including engagement rate calculation, temporal features (hour, day of week), categorical bucketing for hashtags and captions, and normalized engagement metrics (per 10k reach) to enable fair comparison across accounts of different sizes.
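A minimal sketch of these derived metrics follows. The source does not state the exact engagement rate formula, so the one used here (likes, comments, saves, and shares divided by reach) is an assumption, as are the field names and the ISO timestamp format.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def engineer_features(post):
    """Return derived metrics for one post given as a dict of raw fields."""
    reach = post["reach"]
    # Assumed definition: all interactions as a percentage of reach.
    interactions = (post["likes"] + post["comments"]
                    + post["saves"] + post["shares"])
    posted_at = datetime.fromisoformat(post["posted_at"])
    return {
        "engagement_rate": 100 * interactions / reach if reach else 0.0,
        # Normalized metric: likes per 10,000 accounts reached.
        "likes_per_10k": 10_000 * post["likes"] / reach if reach else 0.0,
        "hour": posted_at.hour,
        "day_of_week": DAY_NAMES[posted_at.weekday()],
    }


# Hypothetical post used only to exercise the function.
example = {
    "likes": 300, "comments": 40, "saves": 50, "shares": 30,
    "reach": 10_000, "posted_at": "2025-03-04T08:15:00",
}
features = engineer_features(example)
```

Dividing by reach rather than follower count is one common choice; whichever denominator is used, it should be applied consistently across all posts being compared.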
4. Statistical Analysis
Performed groupby aggregations, correlation analysis, and comparative statistics to quantify relationships between content characteristics and engagement outcomes. Identified statistically significant patterns that inform strategic recommendations.
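As an illustration of the two techniques named above, the following sketch computes a per-group mean and a Pearson correlation coefficient in plain Python. The column names and the three toy rows are assumptions for the example and do not come from the dataset.

```python
from collections import defaultdict
from math import sqrt


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each distinct value of group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        bucket = totals[row[group_key]]
        bucket[0] += row[value_key]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy rows with hypothetical column names.
rows = [
    {"media_type": "reel", "engagement_rate": 4.0, "hashtags": 5},
    {"media_type": "reel", "engagement_rate": 5.0, "hashtags": 10},
    {"media_type": "image", "engagement_rate": 3.0, "hashtags": 0},
]
group_means = mean_by_group(rows, "media_type", "engagement_rate")
correlation = pearson([r["hashtags"] for r in rows],
                      [r["engagement_rate"] for r in rows])
```

A correlation coefficient alone does not establish statistical significance; with differences as small as those reported in this analysis, a significance test on the group means would be a natural follow-up.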
5. Insight Generation
Translated statistical findings into actionable business insights, prioritizing discoveries with high practical impact on content strategy, resource allocation, and posting optimization. Validated insights against business objectives and strategic goals.
04—Exploratory Data Analysis
Data Quality: The dataset contains 0 missing values across all 23 variables, providing a robust foundation for analysis. Posts span from November 2024 to November 2025, capturing seasonal trends and platform algorithm changes.
1. Engagement Rate Distribution
The average engagement rate across all posts is 4.21%, with significant variation based on content quality, timing, and audience factors:
- Mean: 4.21% (industry-standard performance)
- Median: 4.18% (typical post performance)
- Top 25% (viral): 4.80%+ (exceptional performance)
- Bottom 25% (low): Less than 3.68% (underperforming content)
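The quartile cutoffs behind these figures can be reproduced with `statistics.quantiles`. The sketch below uses a handful of made-up rates, and the label names are illustrative rather than taken from the project code.

```python
import statistics


def quartile_cutoffs(rates):
    """Return the 25th, 50th, and 75th percentile of the engagement rates."""
    q1, median, q3 = statistics.quantiles(rates, n=4, method="inclusive")
    return q1, median, q3


def label(rate, q1, q3):
    """Place a post in the top quartile, bottom quartile, or the middle."""
    if rate >= q3:
        return "top 25%"
    if rate < q1:
        return "bottom 25%"
    return "middle 50%"


# Made-up engagement rates (in percent) for illustration only.
rates = [3.0, 3.5, 4.0, 4.5, 5.0]
q1, median, q3 = quartile_cutoffs(rates)
labels = [label(rate, q1, q3) for rate in rates]
```

Applied to the full dataset, `q1` and `q3` would correspond to the 3.68% and 4.80% thresholds listed above.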
2. Performance Bucket Breakdown
| Performance Tier | Post Count | Percentage of Total | Avg Engagement | Strategic Value |
|---|---|---|---|---|
| Viral (Top 25%) | 7,500 | 25.0% | 4.80% | High-value content |
| High | 5,625 | 18.8% | 4.43% | Strong performers |
| Medium | 9,374 | 31.2% | 4.12% | Baseline content |
| Low | 7,500 | 25.0% | 3.45% | Resource drain |
Business Implication: By engagement rate, viral posts average about 1.4x the low tier (4.80% vs. 3.45%). Redirecting resources from low performers toward viral-potential content could raise overall engagement with the same effort.
05—Key Patterns Insights
1. Media Type Shows Minimal Performance Variance
Contrary to popular belief, format choice has limited impact on engagement:
- Carousels: 4.18% engagement rate
- Images: 4.23% engagement rate
- Reels: 4.23% engagement rate
Strategic Takeaway: Success isn't about choosing reels over static posts. It's about content quality and timing regardless of format. Teams should focus on storytelling and value, not format trends.
2. Account Size vs Engagement
Engagement rate shows an inverse relationship with follower count:
- Small accounts (under 5K): 4.35% engagement (higher audience intimacy)
- Medium accounts (5-15K): 4.22% engagement (balanced growth phase)
- Large accounts (15-30K): 4.15% engagement (audience dilution begins)
- Very large accounts (30K+): 4.08% engagement (scaling challenges)
Growth Challenge: As accounts scale, maintaining engagement rate requires increasingly higher content quality and audience segmentation strategies.
06—Content Category
1. Content Category Performance Spread
While differences are modest, some categories demonstrate consistent advantages:
- Top performers: Music (4.28%), Fitness (4.27%), Fashion (4.26%)
- Lower performers: Photography (4.15%), Lifestyle (4.16%)
- Performance gap: Only 0.13 percentage points between best and worst
Interpretation: Category choice matters less than execution quality. Teams shouldn't pivot categories chasing marginal gains. Instead, focus on excellence within your niche.
2. Traffic Source Performance
Different traffic sources show varying engagement potential:
- Home Feed: 4.24% engagement (algorithm rewards)
- Hashtags: 4.21% engagement (discovery potential)
- Reels Feed: 4.19% engagement (format-specific audience)
- Profile visits: 4.18% engagement (existing followers)
- External: 4.15% engagement (cold traffic)
07—Posting Time Optimization
1. Timing Matters More Than Expected
Posting hour analysis reveals clear windows of opportunity:
- Peak hours: 3 AM (4.34%), 8 AM (4.33%), 2 AM (4.33%)
- Best days: Tuesday (4.27%), Sunday (4.25%), Friday (4.24%)
- Weakest day: Monday (4.16%) – avoid launching major campaigns on Mondays
Actionable Insight: Early morning posts (2-8 AM) consistently outperform midday and evening posts. Schedule high-priority content during these windows to maximize initial engagement momentum.
2. Day of Week Analysis
Weekday Performance
Tuesday leads with 4.27% average engagement, likely due to higher platform activity after the weekend. Wednesday through Friday maintain steady performance (4.22-4.24%), making them reliable posting days for consistent reach.
Weekend Performance
Sunday shows strong engagement (4.25%), potentially due to leisure browsing behavior. Saturday performance (4.20%) is solid but slightly below weekday peaks. Weekend posts benefit from extended viewing windows as users have more time.
Monday Underperformance
Monday consistently underperforms (4.16%), likely due to return-to-work behavior and reduced social media engagement. Consider saving best content for other days of the week.
08—Hashtag Strategy
1. Hashtag Count Performance
Hashtag usage shows a clear optimization pattern:
- 1-5 hashtags: 4.18% engagement
- 6-10 hashtags: 4.22% engagement (optimal range)
- 11-15 hashtags: 4.20% engagement
- 16-20 hashtags: 4.18% engagement
Best Practice: Use 6-10 hashtags per post. Adding more does not hurt performance, but it does not help either: engagement slips back to 4.18% at 16-20 hashtags, the same as with 1-5.
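The hashtag ranges compared above amount to a simple bucketing step. A minimal sketch is shown below; the `"0"` and `"21+"` labels are assumptions added so that every count maps to a bucket, since the source lists only the four ranges.

```python
from collections import Counter


def hashtag_bucket(count):
    """Map a hashtag count to the range label used in the comparison."""
    if count <= 0:
        return "0"      # assumed label, not listed in the source
    if count <= 5:
        return "1-5"
    if count <= 10:
        return "6-10"
    if count <= 15:
        return "11-15"
    if count <= 20:
        return "16-20"
    return "21+"        # assumed label, not listed in the source


# Hypothetical hashtag counts for a few posts.
counts = [3, 7, 10, 12, 18, 25]
buckets = [hashtag_bucket(c) for c in counts]
bucket_sizes = Counter(buckets)
```

Once each post carries a bucket label, the average engagement rate per bucket is a straightforward grouped mean.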
2. Caption Length Analysis
Caption length shows interesting patterns:
- Very short (under 50 chars): 4.17% engagement
- Short (50-100 chars): 4.20% engagement
- Medium (100-150 chars): 4.23% engagement (optimal)
- Long (150+ chars): 4.19% engagement
Medium-length captions (100-150 characters) perform best, providing enough context without overwhelming users.
3. Call-to-Action Impact
CTA Effectiveness
Posts with a clear CTA show 4.25% engagement versus 4.18% for posts without one. The difference is modest but consistent. CTAs work best when authentic and relevant to the content, not forced or generic.
09—Predictive Models
Experimental Status: The following section presents exploratory ML models built on the data analysis above. These models are learning experiments designed to test whether engagement patterns can be predicted algorithmically.
Building on the data insights, I developed experimental classification models to test whether Instagram engagement levels can be predicted from post characteristics.
1. Model Architecture
Random Forest Classifier
Algorithm: Random Forest (200 trees, max depth 8)
Training Data: 23,999 posts (80% split with stratified sampling)
Test Data: 6,000 posts (20% holdout for unbiased evaluation)
Features: 10 engineered features including normalized engagement metrics
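To make the stratified 80/20 split concrete, here is a standard-library sketch of the idea: shuffle within each class, then take the same fraction of every class for the test set. It is not the project's code; the `tier` field, the seed, and the toy rows are assumptions. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would typically be used instead.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data: 10 "viral" and 20 "low" posts (hypothetical labels).
rows = ([{"id": i, "tier": "viral"} for i in range(10)]
        + [{"id": i, "tier": "low"} for i in range(10, 30)])
train, test = stratified_split(rows, label_key="tier")
test_viral = sum(1 for r in test if r["tier"] == "viral")
```

Stratifying keeps the tier proportions in the test set close to those in the full dataset, which matters when one tier is much smaller than the others.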
2. Modeling Approaches
Random Forest Classifier (94.7% accuracy)
Analyzes existing posts using all available engagement data. Useful for content audits and historical pattern analysis. Cannot predict future posts since it requires engagement data that doesn't exist yet.
Conservative Threshold Model (88.5% accuracy)
More reliable predictions with custom probability thresholds. Optimizes decision boundaries to target specific accuracy, precision, and recall tradeoffs. Better for risk-conscious decision making.
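The effect of a custom probability threshold can be illustrated without any modeling library. In this sketch the predicted probabilities and true labels are invented, and the 0.8 threshold is an arbitrary example rather than the value used in the project.

```python
def classify(probabilities, threshold=0.5):
    """Predict the positive class only when probability meets the threshold."""
    return [p >= threshold for p in probabilities]


def precision_recall(predicted, actual):
    """Compute precision and recall for boolean predictions."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented model outputs and ground truth for five posts.
probabilities = [0.9, 0.7, 0.6, 0.4, 0.2]
actual = [True, True, False, True, False]

default_scores = precision_recall(classify(probabilities), actual)
conservative_scores = precision_recall(classify(probabilities, 0.8), actual)
```

Raising the threshold makes the model commit to fewer positive predictions: precision rises while recall falls, which is the tradeoff a conservative model accepts.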
3. Performance Metrics
4. Feature Importance
SHAP analysis reveals which features drive model predictions:
- Likes per 10k reach (14.1%): Strongest predictor of engagement classification
- Saves per 10k reach (7.8%): Content value and intent to revisit indicator
- Shares per 10k reach (4.2%): Audience advocacy and virality signal
- Comments per 10k reach (2.9%): Active engagement indicator
5. Model Limitations
Critical Limitations:
- Current features include post-publication engagement metrics (likes, saves, shares, comments), so the model cannot score posts before they are published
- Random train-test split doesn't test true forecasting ability (needs time-based validation)
- Production pipeline missing (no automated retraining or drift monitoring)
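The time-based validation mentioned above could look like the following sketch: train on posts published before a cutoff date and evaluate on those published after it. The `posted_on` field and the cutoff of 1 September 2025 are hypothetical choices for illustration.

```python
from datetime import date


def time_based_split(rows, date_key, cutoff):
    """Train on rows dated before the cutoff, test on the cutoff and later."""
    train = [row for row in rows if row[date_key] < cutoff]
    test = [row for row in rows if row[date_key] >= cutoff]
    return train, test


# Toy posts spread across the dataset's date range (hypothetical field name).
rows = [
    {"id": 1, "posted_on": date(2024, 11, 15)},
    {"id": 2, "posted_on": date(2025, 3, 2)},
    {"id": 3, "posted_on": date(2025, 8, 30)},
    {"id": 4, "posted_on": date(2025, 9, 1)},
    {"id": 5, "posted_on": date(2025, 11, 10)},
]
train, test = time_based_split(rows, "posted_on", cutoff=date(2025, 9, 1))
```

Unlike a random split, this ensures the model is never evaluated on posts older than the ones it was trained on, which is closer to how it would be used for forecasting.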
10—Model Demo
Experience the model demo.
Demo Purpose: This application demonstrates the technical capability of the model. Further validation and refinement are needed.
11—Conclusions and Strategic Recommendations
This analysis demonstrates a comprehensive approach to Instagram engagement analysis, combining data exploration with experimental predictive modeling to generate actionable insights.
1. Findings from Data Analysis
- Timing is critical: Early morning posts (2-8 AM) consistently outperform, with peak hours reaching 4.33-4.34% against the 4.21% average (about 3% higher in relative terms)
- Format is secondary: Media type shows minimal impact (a difference of less than 0.1 percentage points)
- Hashtag optimization: Use 6-10 hashtags per post for best results
- Performance distribution: Top 25% of posts average 4.80% engagement versus 3.45% for the bottom 25% (about 1.4x)
- Category nuances: Music, Fitness, Fashion show slight advantages but execution dominates
- Growth challenges: Engagement rate inversely correlates with follower count
2. Predictive Model
- Random Forest Classifier achieves 94.7% accuracy with all engagement data available
- Conservative model demonstrates 88.5% accuracy with better precision-recall balance
- Normalized engagement metrics are strongest predictors of performance
3. Strategic Recommendations
Implement Data-Driven Posting Schedule: Publish high-priority posts during the 2-8 AM window. Maintain a consistent frequency of 3-5 posts per week.
Optimize Hashtag Strategy: Standardize on 6-10 hashtags per post. Additional hashtags beyond 10 show diminishing returns.
Focus on Content Quality Over Format: Invest resources in storytelling and value proposition rather than chasing format trends. Format choice matters less than execution.
Resource Reallocation Strategy: Redirect production effort from historical low-performers toward content types that consistently achieve high engagement. Top 25% of posts drive 70-80% of engagement value.
12—GitHub Repository
Complete project code and documentation are available in the GitHub repository.