Tuesday, July 1, 2025

Project on Analyzing Fake vs Real News Patterns

Introduction

The rise of fake news is a grave concern for the media industry. Fake news erodes trust, can influence public opinion on critical matters like elections, and creates fear among the public. This project explored fake-news patterns by comparing them with true-news patterns and drew conclusions applicable to the media industry in general. It used the “fake-and-real-news-dataset” from Kaggle, available as two CSV files (Fake.csv, True.csv) containing roughly equal numbers of fake and real news articles. Python was used for the data cleaning, transformation, and merging tasks, and a Tableau dashboard was then created to visually analyze fake-news patterns and provide suggestions for spotting fake news.


Business Impact

This analysis is expected to help media strategists and content moderators to:

  • Identify Sensationalism: Detect fake news characteristics (e.g., longer titles, sensational words) to enhance content filtering or fact-checking systems.
  • Understand Publication Trends: Understand when fake news spikes (e.g., election periods) to optimize editorial planning during major political events.
  • Enhance Content Strategy: Leverage true news characteristics (e.g., neutral language) to craft credible content.

Data

  • Dataset Name: “fake-and-real-news-dataset” from Kaggle
  • File Names: Fake.csv, True.csv
  • Description: Fake and Real News articles from various sources, covering 2015–2018.
  • Dataset Details: 44,898 rows initially (23,420 Fake, 21,478 True); 6 columns after merging and processing. After cleaning (removing 41 null dates and 6,165 duplicate titles), ~38,688 rows remain in the merged file news_merged.csv.
  • Size: ~100 MB (cleaned CSV file after merging).
  • Target Features:
    • title (text): Article titles for word frequency analysis.
    • date (datetime): Publication date for trend analysis.                 
    • subject (categorical): Article category (merged to 5: worldnews, politicsNews, Government News, US_News, left-news).
    • label (categorical): Fake or True, the primary target for comparison.
    • title_length (numerical): Word count of titles, new feature introduced.

 

These features addressed the problem by enabling analysis of content, timing, and characteristics of fake vs. true news.
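The cleaning and merging steps above can be sketched in pandas. This is a minimal sketch, not the project's exact script: the function signature, column handling, and example usage paths are assumptions, while the file names, dropped-row criteria, and the title_length feature come from the description above.

```python
import pandas as pd

def clean_and_merge(fake: pd.DataFrame, true: pd.DataFrame) -> pd.DataFrame:
    """Merge the Fake/True frames, drop null dates and duplicate titles,
    and add the engineered title_length word-count feature."""
    fake = fake.assign(label="Fake")
    true = true.assign(label="True")
    df = pd.concat([fake, true], ignore_index=True)
    # Unparseable dates become NaT and are dropped with the null dates
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df.dropna(subset=["date"]).drop_duplicates(subset=["title"]).copy()
    df["title_length"] = df["title"].str.split().str.len()
    return df

# Typical usage (file paths are assumptions; adjust to your layout):
# df = clean_and_merge(pd.read_csv("Fake.csv"), pd.read_csv("True.csv"))
# df.to_csv("news_merged.csv", index=False)
```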


Data Analysis & Computation

There were 5 major analyses done on the dataset to obtain insights into the data:

 

Analysis #1: Prevalence of fake news and real news in the dataset

Post-cleaning, the news_merged.csv dataset shows that fake and real news are almost equally distributed, providing a robust dataset for analysis. Here is a Tableau worksheet snapshot for the same:



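The same split can be computed directly in pandas as a sanity check on the Tableau view. The helper below is a sketch (its name is an assumption); it assumes the merged frame with its label column is already loaded.

```python
import pandas as pd

def label_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of articles per label, mirroring the Tableau worksheet."""
    return df["label"].value_counts(normalize=True).mul(100).round(1)

# Typical usage (path is an assumption):
# print(label_share(pd.read_csv("news_merged.csv")))
```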
Analysis #2: Title Length Distribution

Histograms of title_length by label, plotted with “matplotlib.pyplot” and Tableau, reveal:

  • Fake: Right-skewed, median ~14 words; titles run long due to sensational phrasing (e.g., “Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’”).
  • True: Very slightly left-skewed, median ~10 words; titles are concise (e.g., “Senate Approves New Policy”).

The histogram also shows that fake titles have higher variability and more outliers (3 to 42 words), supporting the assumption that fake news uses longer titles as a sensational, attention-grabbing tactic.
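A matplotlib version of this histogram can be sketched as below. The function name and styling choices (bin count, overlay with transparency) are assumptions; the title_length-by-label comparison is from the analysis above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the chart can be saved to a file
import matplotlib.pyplot as plt
import pandas as pd

def plot_title_lengths(df: pd.DataFrame, out_path: str = "title_length_hist.png") -> str:
    """Overlaid histograms of title word count, one per label."""
    fig, ax = plt.subplots()
    for label, group in df.groupby("label"):
        ax.hist(group["title_length"], bins=30, alpha=0.5, label=label)
    ax.set_xlabel("Title length (words)")
    ax.set_ylabel("Number of articles")
    ax.legend(title="Label")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```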


Analysis #3: Label and Subject Distribution

The analysis of label and subject (after merging the “News” subject into “worldnews” and “politics” into “politicsNews”) via df.groupby(['subject', 'label']).size() / len(df) * 100 shows:

  • Label: ~46% Fake (~17,862 articles), ~54% True (~20,826 articles).
  • Subject:
    • worldnews: ~48% (half Fake, half True, per curation).
    • politicsNews: ~45% (mixed Fake/True, True dominant).
    • Government News: ~1.5% (Fake only).
    • US_News: ~2% (Fake only).
    • left-news: ~2% (Fake only).

This near-even Fake/True split, with worldnews and politicsNews dominating, highlights source differences (Fake.csv vs. True.csv). The stacked bar chart in Tableau visualizes this distribution.
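The groupby expression above, together with the subject merging, can be packaged as a small helper. This is a sketch (the function name and the replace-based merge are assumptions); the category mapping and the percentage formula are from the text.

```python
import pandas as pd

def subject_label_share(df: pd.DataFrame) -> pd.Series:
    """Percent of all articles per (subject, label) pair, after folding
    the overlapping subject names into the five merged categories."""
    df = df.replace({"subject": {"News": "worldnews", "politics": "politicsNews"}})
    return (df.groupby(["subject", "label"]).size() / len(df) * 100).round(1)
```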


Analysis #4: Date Trends

Grouping date by year and label (via df.groupby([df['date'].dt.year, 'label']).size()) shows temporal patterns:

  • Fake Articles: Spike in 2016 (~67% of Fake articles), likely due to the U.S. presidential election, then decline in 2017–2018.
  • True Articles: None in 2015, relatively few (4,650) in 2016, and many more (16,176) in 2017.

 

Below is a line chart, created in Python, that visualizes this trend, confirming the hypothesis that fake news surges during high-profile events and suggesting extra editorial vigilance during election periods.



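The yearly trend chart can be sketched with the same groupby expression quoted above. The function name and plot styling are assumptions; the group-by-year-and-label computation is from the analysis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the chart can be saved to a file
import matplotlib.pyplot as plt
import pandas as pd

def plot_yearly_trend(df: pd.DataFrame, out_path: str = "yearly_trend.png") -> pd.DataFrame:
    """Article counts per year and label, plotted as one line per label."""
    counts = df.groupby([df["date"].dt.year, "label"]).size().unstack(fill_value=0)
    ax = counts.plot(marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of articles")
    ax.figure.savefig(out_path)
    plt.close(ax.figure)
    return counts
```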
Analysis #5: Word Frequency Analysis

Using NLTK, the top 50 words in title by label were extracted into a new CSV file, word_frequencies.csv. A word cloud was then generated in Python to visualize the words that recur under the Fake and True labels.

  • Fake: Sensational words dominate (e.g., “trump”, “video”, “breaking”), suggesting clickbait tactics.
  • True: Neutral words prevail (e.g., “official”, “senate”, “police”), indicating factual reporting.

The word cloud in Python (and the corresponding bubble chart in Tableau, with word size mapped to frequency and color to label) highlights these stylistic differences, supporting the hypothesis that fake news uses attention-grabbing language, though some words overlap.
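The per-label word counting can be sketched as below. Note this is not the project's NLTK pipeline: a small hand-written stop-word set stands in for NLTK's stopwords corpus so the sketch runs without nltk.download(), and the function name and CSV columns are assumptions.

```python
import re
from collections import Counter
import pandas as pd

# Minimal stop-word list (an assumption; the project used NLTK's corpus)
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "for", "and", "is", "at"}

def top_title_words(df: pd.DataFrame, label: str, n: int = 50) -> list:
    """Most frequent non-stop-words in titles carrying the given label."""
    titles = df.loc[df["label"] == label, "title"].str.lower().str.cat(sep=" ")
    words = [w for w in re.findall(r"[a-z']+", titles) if w not in STOPWORDS]
    return Counter(words).most_common(n)

# Example export resembling the project's word_frequencies.csv:
# rows = [(w, c, lab) for lab in ("Fake", "True") for w, c in top_title_words(df, lab)]
# pd.DataFrame(rows, columns=["word", "count", "label"]).to_csv("word_frequencies.csv", index=False)
```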


Conclusion & Future Work

After carefully analyzing the data and charts, we reached the following conclusions for spotting fake news:

  • The titles of fake news articles are longer (~14 words) than those of true news (~10 words).
  • Fake news is not specific to any subject and appears across diverse categories.
  • Fake news and misinformation are rampant during major geopolitical events, such as the 2016 US election, so news monitoring is especially necessary during those times.
  • The presence of words like ‘trump’, ‘video’, or ‘watch’ in a news article is a red flag; the article’s authenticity should be re-checked or the content moderated.


This project used only one dataset, which may contain errors or biases. In the future, other data sources could be merged with this dataset, and sentiment analysis could be added for deeper insights.


Reference

https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset