Data Wrangling, Insight and Visualization Reports

by Audrey S Tan

Data Wrangle Report

1) Gather data from three different sources:

  • WeRateDogs Twitter archive. Provided by Udacity as a csv file, it contains basic tweet data for 5000+ tweets, including each dog's rating, name, and "stage".
  • Tweet image predictions. Also provided by Udacity, as a tsv file that I downloaded programmatically from the Udacity site. It contains dog breed predictions (from a neural network classifier) for every dog image in the WeRateDogs Twitter archive.
  • Additional Twitter data. This data resides on Twitter and can be pulled through the Twitter API using the tweepy library. I used the API to query additional data (in JSON format) and saved it to a file named tweet_json.txt. The file holds the favorite and retweet counts for each tweet ID in the WeRateDogs Twitter archive, which are crucial for the dog rating analysis.
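
As an illustration of the last source, the favorite and retweet counts can be pulled out of a line-delimited JSON file with the standard library alone. This is a minimal sketch, not the code used in the project: it assumes tweet_json.txt holds one JSON object per line with the keys `id`, `favorite_count` and `retweet_count`, and the helper name `read_tweet_counts` and the sample data are hypothetical.

```python
import io
import json

def read_tweet_counts(lines):
    """Extract id, favorite and retweet counts from line-delimited tweet JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return records

# Made-up sample standing in for the contents of tweet_json.txt.
sample = io.StringIO(
    '{"id": 1, "favorite_count": 120, "retweet_count": 30}\n'
    '{"id": 2, "favorite_count": 45, "retweet_count": 7}\n'
)
counts = read_tweet_counts(sample)
```

With a real file, the same function could be called as `read_tweet_counts(open("tweet_json.txt"))`.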

2) Assess data for quality and tidiness:

  • Inspected the three datasets visually and programmatically to produce a list of quality and tidiness issues.
  • Quality issues include:
    • various issues pertaining to incorrect rating numerator and denominator values in the main twitter dataset.
    • improper data types for tweet id, timestamp, rating numerator and denominator in the main twitter dataset.
    • invalid dog names and inconsistent dog naming convention in the main and secondary twitter datasets.
    • presence of retweet and reply-to data in the main and secondary twitter datasets.
    • superfluous columns in the main twitter dataset.
  • Tidiness issues include:
    • dog stages span four different columns in the main twitter dataset, which can and should be combined into one.
    • three types of observations (dog, non dog and partial) in the prediction dataset.
    • the three datasets can be combined into one single dataset.
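
To make the first tidiness issue concrete, the sketch below collapses four stage columns into a single value per tweet. It is an illustration under stated assumptions, not the project's code: the column names `doggo`, `floofer`, `pupper` and `puppo`, the `"None"` placeholder string, and the helper `combine_stages` are all assumed for the example.

```python
# Assumed names of the four dog stage columns; "None" is assumed to be
# the placeholder string used when no stage was recorded.
STAGE_COLUMNS = ("doggo", "floofer", "pupper", "puppo")
MISSING = (None, "", "None")

def combine_stages(row):
    """Collapse the four stage columns of one row into a single value."""
    stages = [row[col] for col in STAGE_COLUMNS if row.get(col) not in MISSING]
    # Join with a comma when a tweet mentions more than one stage.
    return ",".join(stages) if stages else None

# Made-up rows for illustration.
rows = [
    {"tweet_id": 1, "doggo": "None", "floofer": "None", "pupper": "pupper", "puppo": "None"},
    {"tweet_id": 2, "doggo": "doggo", "floofer": "None", "pupper": "pupper", "puppo": "None"},
    {"tweet_id": 3, "doggo": "None", "floofer": "None", "pupper": "None", "puppo": "None"},
]
tidy = [{"tweet_id": r["tweet_id"], "stage": combine_stages(r)} for r in rows]
```

Keeping multi-stage tweets as a comma-joined value is one possible design choice; dropping or flagging them would be equally valid.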

3) Clean data to fix quality and tidiness issues identified:

  • for each issue identified in each dataset, defined a code fix, then implemented, executed and tested it.
  • combined the three datasets into a single master dataset and stored it in a csv file and a Python database table.
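
The combining step can be sketched as an inner join on the tweet ID followed by a csv export. The example below uses only the standard library and made-up rows; the key name `tweet_id`, the other column names, and the helper `inner_join` are assumptions for illustration, and the database step is left out.

```python
import csv
import io

def inner_join(*tables, key="tweet_id"):
    """Merge lists of dict rows on `key`, keeping only keys found in every table."""
    indexed = [{row[key]: row for row in table} for table in tables]
    shared = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for k in sorted(shared):
        combined = {}
        for index in indexed:
            combined.update(index[k])
        merged.append(combined)
    return merged

# Made-up rows standing in for the three cleaned datasets.
archive = [{"tweet_id": 1, "rating_numerator": 13}, {"tweet_id": 2, "rating_numerator": 10}]
predictions = [{"tweet_id": 1, "p1": "Labrador_retriever"}, {"tweet_id": 2, "p1": "pug"}]
counts = [{"tweet_id": 1, "favorite_count": 120, "retweet_count": 30}]

master = inner_join(archive, predictions, counts)

# Write the master dataset as csv text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(master[0]))
writer.writeheader()
writer.writerows(master)
csv_text = buffer.getvalue()
```

An inner join drops tweets that are missing from any of the three sources, which is why the example ends up with a single row.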

4) Analyze and visualize the wrangled data:

  • looked at the cleaned master dataset and produced three insights with visualizations.
  • Insights and visualizations produced include:

    • correlation between favorite and retweet counts.
    • the trend of favorite and retweet counts with respect to time and classification of dog species.
    • performance of the dog image classifier.

Insight and Visualization Report

Overview

Using the cleaned dataset built from Twitter data on WeRateDogs [ref 1], the popular dog rating account on Twitter, we analyzed the data and produced the following insights and visualizations:

1) correlation between favorite and retweet counts.

2) the trend of favorite and retweet counts with respect to time.

3) the trend of favorite and retweet counts with respect to classification of dog species.

4) performance of the dog image classifier.


Correlation between favorite counts and retweet counts

Favorite counts and retweet counts are positively correlated. Most tweets are concentrated below 40k favorite counts, and most of the points lie below the regression line.
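
For reference, the strength of such a relationship is commonly summarized with the Pearson correlation coefficient. The sketch below computes it from scratch on made-up numbers; it is not the project's code or data, and the helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up favorite and retweet counts for illustration only.
favorites = [100, 200, 300, 400]
retweets = [30, 55, 95, 120]
r = pearson_r(favorites, retweets)
```

A value of `r` close to 1 indicates a strong positive linear relationship, close to -1 a strong negative one, and close to 0 no linear relationship.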

fav_vs_retwt_cnt2.png

From the beginning until 2016-04, the intensity of favorite and retweet counts is similar, although favorite counts are higher than retweet counts. The gap becomes more conspicuous over time, with favorite counts rising steadily above retweet counts from around 2016-09. Interestingly, the bulk of retweet counts remains below 10k.

fav_Retwt_cnt3.png

Within each of the three species, favorite and retweet counts are positively correlated. The bulk of tweets comes from the dog and hybrid species, in line with their respective counts: 1194 dog and 472 hybrid.

dog_species.png

fav_Retwt_cnt_dogclass2.png

Given that WeRateDogs is all about dogs, it is surprising that the neural network dog breed classifier labels some tweets as hybrid or not dog, which led us to take a look at the performance of the classifier next.
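
The report does not spell out how a tweet is assigned to dog, hybrid or not dog. One plausible rule, offered here purely as an assumption, looks at whether each of the top three predictions is a dog breed: all three means dog, none means not dog, and a mix means hybrid. The function name and the boolean inputs (`p1_dog`, `p2_dog`, `p3_dog`) are hypothetical.

```python
def classify_species(p1_dog, p2_dog, p3_dog):
    """Label a tweet from three booleans: is prediction 1, 2, 3 a dog breed?

    Assumed rule, not confirmed by the report:
    all True -> "dog", all False -> "not dog", otherwise -> "hybrid".
    """
    flags = (p1_dog, p2_dog, p3_dog)
    if all(flags):
        return "dog"
    if not any(flags):
        return "not dog"
    return "hybrid"

labels = [
    classify_species(True, True, True),
    classify_species(True, False, True),
    classify_species(False, False, False),
]
```

Under this rule, hybrid simply means the classifier's top three guesses disagree on whether the image shows a dog.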

Performance of the dog image classifier

First, we gathered some descriptive statistics on the rating, the favorite and retweet counts, and the model's top 3 prediction confidences.

DogDescStats.png

Next, we took a look at the top 10 dog ratings and breeds.

top10_dog_rating.png

top10_dog_breeds.png

From the descriptive statistics, we identified the tweets with the highest

  • favorite count: 132810.0
  • retweet count: 79515.0
  • p1 prediction confidence: 1.0

and then retrieved their pictures from Twitter.

The tweet with the highest favorite count is:

fav_cnt.png

It's a dog (Labrador Retriever) with no name and a p1 prediction confidence of 0.196015, and it looks like this ...

topfav.png

The tweet with the highest retweet count is:

retwt_cnt.png

It's a hybrid (Labrador Retriever) with no name and a p1 prediction confidence of 0.825333, and it looks like this ...

topRetwt.png

The tweet with the highest p1 prediction confidence (1.0) is:

p1_conf.png

It's a not dog named Shaggy (Spanish Water Dog) with a p1 prediction confidence of 1.0, and it looks like this ...

topP1Conf.png

So what is the performance of the image classifier after all?

The dog and the hybrid share a rating of 13, while the not dog has a rating of 10; these ratings rank 4th and 2nd respectively among the top 10 dog ratings. Both the dog and the hybrid are actually Labrador Retrievers, a breed that ranks 3rd among the top 10 dog breeds. The not dog was misclassified: it is actually a Spanish Water Dog, although that breed is not among the top 10 dog breeds.
So of the 1971 entries in the twitter_archive_master dataset, 305 were misclassified as not dog and 472 as hybrid (i.e. may be dog). Only 1194 were correctly classified as dog, yet none of them attains the highest p1 prediction confidence of 1. Ironically, the not dog has the highest prediction confidence of 1.
These numbers represent a 60.58% (= 1194/1971 x 100%) chance that the model correctly identifies a dog as a dog. Clearly, the model has room for improvement.
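
The percentage can be reproduced from the three species counts reported above. This short sketch only redoes the arithmetic; the variable names are chosen for illustration.

```python
# Species counts as reported for the twitter_archive_master dataset.
counts = {"dog": 1194, "hybrid": 472, "not dog": 305}

total = sum(counts.values())  # 1971 entries in total

# Share of each species as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

The dog share printed by this snippet matches the 60.58% quoted in the text.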