Data science project

Description:

Objective

Analyzed shifts in user behavior, sentiment, and engagement in response to a significant tech event: the iPhone 16 announcement on September 9, 2024. The goal was to apply advanced data science techniques to extract insights from large volumes of unstructured data.


Skills Demonstrated

  • Data Collection & Preprocessing: Collected and cleaned YouTube comment data from both regular videos and a live stream.
  • Machine Learning: Developed and fine-tuned classifiers for spam detection, sentiment analysis, and anomaly detection using custom transformers and semi-supervised learning.
  • Natural Language Processing (NLP): Leveraged state-of-the-art transformer models to perform emotion and sentiment analysis.
  • Network Analysis: Applied graph theory concepts to study engagement patterns and community structures.
  • Data Visualization & Reporting: Created clear, insightful visualizations to communicate complex data-driven findings.

Project Workflow

  1. Data Collection:
    • Collected over 300,000 data samples from YouTube comments spanning before, during, and after the event using the YouTube API.
    • Covered both static data (500 videos before and after the event) and dynamic data (1 live stream during the event).
  2. Data Preprocessing:
    • Handled data cleaning, tokenization, and removal of noise.
    • Structured data for analysis, ensuring quality and consistency.

Models & Techniques Implemented

  1. Spam Detection:
    • Semi-Supervised Labeling: Used OpenAI’s GPT-3.5 to label a sample of comments, followed by manual verification.
    • Modeling Approach: Developed custom transformers for spam detection, implemented nested stratified cross-validation, and performed model selection and fine-tuning.
    • Results: Achieved high accuracy and precision in identifying spam, tracking shifts in spam density around the event timeline.
  2. Emotion & Sentiment Analysis:
    • Pre-Trained Models: Used Hugging Face’s j-hartmann/emotion-english-distilroberta-base for emotion classification.
    • Emotion to Sentiment Conversion: Mapped emotions to sentiment labels and adjusted ambiguous cases (e.g., treating “surprise” as neutral).
    • Sentiment Trends: Analyzed sentiment distribution across different iPhone 16 features, showing how user emotions evolved.
  3. Anomaly Detection:
    • Isolation Forest: Applied to identify accounts with anomalous behavior, with feature extraction and hyperparameter tuning.
    • Insights: Identified patterns and flagged suspicious accounts contributing to engagement or spam.
  4. Network Analysis:
    • Conducted a comprehensive network analysis to explore user interaction patterns, employing measures such as degree distribution, centrality, and transitivity.
    • Community Detection: Analyzed how communities formed and whether spam influence was prevalent in high-engagement clusters.
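
The emotion-to-sentiment conversion described above can be sketched as a simple lookup. The j-hartmann/emotion-english-distilroberta-base model predicts seven labels (anger, disgust, fear, joy, neutral, sadness, surprise). Only the "surprise" to neutral rule comes from this write-up; the remaining assignments and the helper names are assumptions for illustration.

```python
from collections import Counter

# Mapping from the model's emotion labels to coarse sentiment labels.
EMOTION_TO_SENTIMENT = {
    "joy": "positive",       # assumed
    "anger": "negative",     # assumed
    "disgust": "negative",   # assumed
    "fear": "negative",      # assumed
    "sadness": "negative",   # assumed
    "neutral": "neutral",    # assumed
    "surprise": "neutral",   # stated in the write-up: ambiguous, so neutral
}

SENTIMENTS = ("positive", "neutral", "negative")


def to_sentiment(emotion: str) -> str:
    """Convert one predicted emotion label to a sentiment label."""
    return EMOTION_TO_SENTIMENT[emotion.lower()]


def sentiment_shares(emotions: list[str]) -> dict[str, float]:
    """Return the fraction of comments per sentiment label."""
    counts = Counter(to_sentiment(e) for e in emotions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: counts[label] / total for label in SENTIMENTS}
```

Computing sentiment_shares separately for the comments that mention each feature, and for each time window, gives the kind of per-feature sentiment distribution mentioned above.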

Research Questions Addressed

  • What are the most frequently used words related to iPhone 16, and do they reveal any key product features?
  • How did sentiment around iPhone 16 features shift before, during, and after the announcement?
  • Did the density of spam comments change over time, and which accounts generated the most spam?
  • Are central nodes in the engagement network driven by genuine interaction or spam activity?
  • Do spam-generating accounts target similar videos, and how does sentiment behavior compare across communities?
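
As an illustration of the third question, spam density per phase and the most active spam accounts can be computed from labeled comments. This is a sketch under stated assumptions: each comment is a hypothetical (timestamp, author, is_spam) tuple, "during" means the calendar day of the announcement, and timestamps are assumed to be already expressed in a single time zone.

```python
from collections import Counter, defaultdict
from datetime import date, datetime

EVENT_DAY = date(2024, 9, 9)  # iPhone 16 announcement date


def phase_of(timestamp: datetime) -> str:
    """Classify a timestamp as before, during, or after the event day."""
    day = timestamp.date()
    if day < EVENT_DAY:
        return "before"
    if day == EVENT_DAY:
        return "during"
    return "after"


def spam_density(comments):
    """Return the share of spam comments within each phase that has data."""
    totals = defaultdict(int)
    spam = defaultdict(int)
    for timestamp, _author, is_spam in comments:
        phase = phase_of(timestamp)
        totals[phase] += 1
        spam[phase] += int(is_spam)
    return {phase: spam[phase] / totals[phase] for phase in totals}


def top_spammers(comments, n=10):
    """Return the n authors with the most spam comments, as (author, count)."""
    counts = Counter(author for _ts, author, is_spam in comments if is_spam)
    return counts.most_common(n)
```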

Key Insights & Outcomes

  1. Spam Analysis:
    • Found a significant increase in spam activity around the announcement, with specific accounts identified as prolific spam producers.
    • Insights into spam patterns provided actionable data for improving content moderation.
  2. Sentiment Analysis:
    • Revealed a generally positive shift in sentiment after the announcement, driven by discussion of popular features.
    • Highlighted emotion trends associated with specific features, aiding in understanding user excitement or frustration.
  3. Community Dynamics:
    • Uncovered engagement clusters with varying sentiment and spam levels.
    • Demonstrated how spam impacts overall network structure and user interaction patterns.
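
Two of the network measures used in the project, degree centrality and transitivity, can be illustrated in plain Python. The project itself lists NetworkX among its tools; this sketch only spells out the definitions, and it assumes an undirected user-to-user interaction graph built from hypothetical (user, user) reply pairs.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs, skipping self-loops."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def transitivity(adjacency):
    """Fraction of connected triples that are closed into triangles."""
    closed = 0   # triples whose two end nodes are also linked
    triples = 0  # all pairs of neighbours around each centre node
    for nbrs in adjacency.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
    return closed / triples if triples else 0.0
```

Comparing the centrality of accounts flagged as spam with that of other accounts is one way to approach the question of whether central nodes reflect genuine interaction.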

Technologies & Tools Used

  • Programming & Data Processing: Python, Pandas, NumPy, YouTube API
  • Machine Learning & NLP: Scikit-Learn, Hugging Face Transformers, OpenAI API
  • Data Visualization: Matplotlib, Seaborn, NetworkX
  • Model Development: Custom transformers, semi-supervised learning, cross-validation techniques
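
To show how a custom transformer and nested stratified cross-validation fit together in Scikit-Learn, here is a minimal sketch. The UrlMasker transformer, the TF-IDF plus logistic regression pipeline, the parameter grid, and the fold counts are all assumptions chosen for illustration; they are not the project's actual model or settings.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class UrlMasker(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: lowercase text and mask URLs."""

    def __init__(self, placeholder="urltoken"):
        self.placeholder = placeholder

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [re.sub(r"https?://\S+", self.placeholder, t.lower()) for t in X]


def nested_cv_scores(texts, labels, seed=0):
    """Return one F1 score per outer fold; labels are 0 (ham) or 1 (spam)."""
    pipeline = Pipeline([
        ("mask", UrlMasker()),
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Inner loop: hyperparameter selection on each outer training split.
    inner = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                          cv=inner, scoring="f1")
    # Outer loop: performance estimate on data unseen by the inner search.
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    return cross_val_score(search, texts, labels, cv=outer, scoring="f1")
```

Keeping the hyperparameter search inside the outer loop means the reported scores are not biased by tuning on the same folds used for evaluation.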

Takeaways

  • Impact: Showcased the application of data science in a real-world scenario, providing insights into social media trends and engagement patterns.
  • Scalability: The methodologies used can be scaled and adapted for similar data mining and NLP projects.
  • Business Relevance: Insights gained from this project can inform marketing strategies, content moderation systems, and community engagement approaches.

Download the presentation here

Grade obtained:

100% with honours (30+/30)

Client:

University of Milan

Project Duration:

2 months

Place:

Remote

Language:

English