Description:
Objective
Analyzed shifts in user behavior, sentiment, and engagement in response to a significant tech event: the iPhone 16 announcement on September 9, 2024. The goal was to apply advanced data science techniques to extract insights from large volumes of unstructured data.
Skills Demonstrated
- Data Collection & Preprocessing: Managed and cleaned data from diverse sources using robust data engineering techniques.
- Machine Learning: Developed and fine-tuned classifiers for spam detection, sentiment analysis, and anomaly detection using custom transformers and semi-supervised learning.
- Natural Language Processing (NLP): Leveraged state-of-the-art transformer models to perform emotion and sentiment analysis.
- Network Analysis: Applied graph theory concepts to study engagement patterns and community structures.
- Data Visualization & Reporting: Created clear, insightful visualizations to communicate complex data-driven findings.
Project Workflow
- Data Collection:
- Collected over 300,000 data samples from YouTube comments spanning before, during, and after the event using the YouTube API.
- Focused on both static data (500 videos before and after) and dynamic data (1 live stream during the event).
- Data Preprocessing:
- Handled data cleaning, tokenization, and removal of noise.
- Structured data for analysis, ensuring quality and consistency.
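The cleaning and tokenization steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library. The project's actual cleaning rules are not documented here, so the regular expressions and the function names `clean_comment` and `tokenize` are assumptions.

```python
import html
import re

# Hypothetical cleaning rules, chosen for illustration only.
TAG_RE = re.compile(r"<[^>]+>")               # HTML tags such as <br> or <b>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare links
TOKEN_RE = re.compile(r"[a-z0-9']+")           # simple word tokens


def clean_comment(text: str) -> str:
    """Strip tags, decode HTML entities, drop URLs, lowercase, collapse spaces."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)   # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Split a raw comment into lowercase word tokens after cleaning."""
    return TOKEN_RE.findall(clean_comment(text))


raw = "Love the new <b>camera</b>!<br>See https://example.com &amp; more"
print(clean_comment(raw))
print(tokenize(raw))
```

Tags are stripped before entities are decoded so that an escaped character such as `&lt;` in a comment is not mistaken for markup.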
Models & Techniques Implemented
- Spam Detection:
- Semi-Supervised Labeling: Used OpenAI’s GPT-3.5 to label a sample of comments, followed by manual verification.
- Modeling Approach: Developed custom transformers for spam detection, implemented nested stratified cross-validation, and performed model selection and fine-tuning.
- Results: Achieved high accuracy and precision in identifying spam, tracking shifts in spam density around the event timeline.
- Emotion & Sentiment Analysis:
- Pre-Trained Models: Used Hugging Face’s j-hartmann/emotion-english-distilroberta-base for emotion classification.
- Emotion to Sentiment Conversion: Mapped emotions to sentiment labels and adjusted ambiguous cases (e.g., treating “surprise” as neutral).
- Sentiment Trends: Analyzed sentiment distribution across different iPhone 16 features, showing how user emotions evolved.
- Anomaly Detection:
- Isolation Forest: Deployed for identifying accounts with anomalous behavior, incorporating feature extraction and hyperparameter tuning.
- Insights: Identified patterns and flagged suspicious accounts contributing to engagement or spam.
- Network Analysis:
- Conducted a comprehensive network analysis to explore user interaction patterns, employing measures such as degree distribution, centrality, and transitivity.
- Community Detection: Analyzed how communities formed and whether spam influence was prevalent in high-engagement clusters.
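As an illustration of the emotion-to-sentiment conversion described under Emotion & Sentiment Analysis, the sketch below maps the seven labels produced by j-hartmann/emotion-english-distilroberta-base to three sentiment classes. Only the "surprise" to neutral rule comes from the project description. The other assignments, and the helper names `to_sentiment` and `sentiment_shares`, are assumptions. The model call itself is omitted, and the input is taken to be a list of already-predicted labels.

```python
from collections import Counter

# Emotion labels emitted by the pre-trained model.
# Only "surprise" -> "neutral" is stated in the project description;
# the remaining assignments are assumed for illustration.
EMOTION_TO_SENTIMENT = {
    "joy": "positive",
    "anger": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sadness": "negative",
    "neutral": "neutral",
    "surprise": "neutral",
}


def to_sentiment(emotion: str) -> str:
    """Map one predicted emotion label to a coarse sentiment label."""
    return EMOTION_TO_SENTIMENT[emotion.lower()]


def sentiment_shares(emotions: list[str]) -> dict[str, float]:
    """Return the fraction of comments falling under each sentiment label."""
    counts = Counter(to_sentiment(e) for e in emotions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Hypothetical predictions for four comments.
print(sentiment_shares(["joy", "joy", "anger", "surprise"]))
```

Computing these shares separately for comments that mention each feature, and for each period around the event, would give the kind of sentiment distribution the analysis compares.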
Research Questions Addressed
- What are the most frequently used words related to iPhone 16, and do they reveal any key product features?
- How did sentiment around iPhone 16 features shift before, during, and after the announcement?
- Did the density of spam comments change over time, and which accounts generated the most spam?
- Are central nodes in the engagement network driven by genuine interaction or spam activity?
- Do spam-generating accounts target similar videos, and how does sentiment behavior compare across communities?
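The first question, on the most frequently used words, can be approached with a simple term count. The sketch below is one possible way to do it, not the project's actual code. The stop-word list is deliberately tiny, and both it and the function name `top_words` are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "i", "this"}


def top_words(comments: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Count non-stop-word tokens across all comments and return the n most common."""
    counts = Counter()
    for comment in comments:
        for token in re.findall(r"[a-z0-9]+", comment.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    return counts.most_common(n)


# Hypothetical comments, for illustration only.
sample = [
    "The camera is great",
    "Camera button and battery",
    "battery life, camera",
]
print(top_words(sample, 2))
```

Terms that rise to the top of such a count (here, made-up ones like "camera" and "battery") are candidates for the product features that commenters discuss most.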
Key Insights & Outcomes
- Spam Analysis:
- Found a significant increase in spam activity around the announcement and identified specific accounts as prolific spam producers.
- Insights into spam patterns provided actionable data for improving content moderation.
- Sentiment Analysis:
- Revealed a generally positive shift in sentiment after the announcement, driven by discussion of popular features.
- Highlighted emotion trends associated with specific features, aiding in understanding user excitement or frustration.
- Community Dynamics:
- Uncovered engagement clusters with varying sentiment and spam levels.
- Demonstrated how spam impacts overall network structure and user interaction patterns.
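To make the notion of spam density around the event concrete, the sketch below computes the share of spam comments before, during, and after the announcement day. It assumes each comment has already been labeled by the spam classifier and carries a publication date. Splitting by calendar day, the function names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import date

EVENT_DAY = date(2024, 9, 9)  # announcement date given in the project description


def phase(day: date) -> str:
    """Assign a comment date to a period relative to the event (assumed split)."""
    if day < EVENT_DAY:
        return "before"
    if day == EVENT_DAY:
        return "during"
    return "after"


def spam_density(comments: list[tuple[date, bool]]) -> dict[str, float]:
    """Return spam comments divided by total comments for each period."""
    totals = defaultdict(int)
    spam = defaultdict(int)
    for day, is_spam in comments:
        p = phase(day)
        totals[p] += 1
        spam[p] += int(is_spam)
    return {p: spam[p] / totals[p] for p in totals}


# Made-up labeled comments: (publication date, flagged as spam).
sample = [
    (date(2024, 9, 8), False),
    (date(2024, 9, 8), True),
    (date(2024, 9, 9), True),
    (date(2024, 9, 10), False),
]
print(spam_density(sample))
```

Grouping the same counts by author instead of by period would give a per-account ranking of spam volume.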
Technologies & Tools Used
- Programming & Data Processing: Python, Pandas, NumPy, YouTube API
- Machine Learning & NLP: Scikit-Learn, Hugging Face Transformers, OpenAI API
- Data Visualization: Matplotlib, Seaborn, NetworkX
- Model Development: Custom transformers, semi-supervised learning, cross-validation techniques
Takeaways
- Impact: Showcased the application of data science in a real-world scenario, providing insights into social media trends and engagement patterns.
- Scalability: The methodologies used can be scaled and adapted for similar data mining and NLP projects.
- Business Relevance: Insights gained from this project can inform marketing strategies, content moderation systems, and community engagement approaches.
Download the presentation here

Grade obtained:
100% with honours (30+/30)
Client:
University of Milan
Project Duration:
2 months
Place:
Remote
Language:
English