Let’s get started. First, we can write text to describe what we’re doing next. For example, let’s load tidyverse.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Next, let’s use the read_csv() function to load our dataset.
tweets <- read_csv("../data/CharlotteTweets20Sample.csv")
## Parsed with column specification:
## cols(
## body = col_character(),
## postedTime = col_datetime(format = ""),
## actor.id = col_double(),
## displayName = col_character(),
## actor.postedTime = col_datetime(format = ""),
## summary = col_character(),
## friendsCount = col_integer(),
## followersCount = col_integer(),
## statusesCount = col_integer(),
## actor.location.displayName = col_character(),
## generator.displayName = col_character(),
## geo.type = col_character(),
## point_long = col_double(),
## point_lat = col_double(),
## urls.0.expanded_url = col_character(),
## klout_score = col_integer(),
## hashtags = col_character(),
## user_mention_screen_names = col_character()
## )
Recall – what’s going on with the path?
Why have we not needed to set our working directory?
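One quick way to see why (a minimal check, not part of the original output) is to print the working directory that the relative path is resolved against; by default, knitr runs each chunk with the working directory set to the folder containing the .Rmd file.
getwd()  # the directory that the relative path is interpreted against
file.exists("../data/CharlotteTweets20Sample.csv")  # confirm the relative path resolves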
We can use ### to start a new section (in R Markdown, ### creates a third-level header). For example, in this section, let’s look at the first three tweets.
head(tweets$body, n = 3)
## [1] "Treon to WR is a really good move by Mac, very familiar with playbook. Very good runner in open space too"
## [2] "primus #vsco #vscocam #primus #primussucks #charlotte #bojanglescoliseum #livemusic… https://t.co/IqI6BIPVqn"
## [3] "WOAH!!!!!!"
Next, let’s convert the date to allow us to plot our data.
# convert the YYYY-MM-DD HH:MM:SS timestamp into a Date (YYYY-MM-DD)
tweets$day <- as.Date(tweets$postedTime)
# recall from dplyr
dayCount <- tweets %>%
  group_by(day) %>%
  summarise(Count = n())
Next, we’re going to use ggplot2 to plot our time series. We’ll introduce ggplot2 more formally next week.
ggplot(dayCount, aes(x = day, y = Count)) +
  geom_line()
Not bad. We can add a few more layers to make this plot prettier.
ggplot(dayCount, aes(x = day, y = Count)) +
  geom_line() +
  labs(x = "Posted Day", y = "Tweet Count", title = "Charlotte Geo Tweets per Day") +
  expand_limits(y = 0) # set y axis to start at zero
What the heck are those spikes?
One way we can check is to look at the most prominent hashtags and mentions.
To do this, we can write two helper functions that use regular expressions to keep only the hashtags or handles from the body of each tweet.
getCommonHashtags <- function(text, n = 20){
  # extract every hashtag (# followed by letters, digits, or underscores)
  hashtags <- regmatches(text, gregexpr("#(\\d|\\w)+", text))
  # flatten the list of per-tweet matches into one vector
  hashtags <- unlist(hashtags)
  # count each hashtag and return the n most frequent
  tab <- table(hashtags)
  return(head(sort(tab, dec = TRUE), n = n))
}
getCommonHandles <- function(text, n = 20){
  # extract every handle (@ followed by letters, digits, or underscores)
  handles <- regmatches(text, gregexpr('@([0-9_A-Za-z]+)', text, perl = TRUE))
  # flatten the list of per-tweet matches into one vector
  handles <- unlist(handles)
  # count each handle and return the n most frequent
  tab <- table(handles)
  return(head(sort(tab, dec = TRUE), n = n))
}
For example, now we can run:
getCommonHashtags(tweets$body)
## hashtags
## #KeepPounding #Charlotte #NC #realestate #charlotte
## 487 458 443 429 256
## #keeppounding #traffic #trndnl #photo #listing
## 234 212 205 185 159
## #clt #Repost #realtor #CIAA #SB50
## 154 144 133 123 109
## #Panthers #CIAA2016 #Concord #CLT #panthers
## 95 89 88 86 86
or …
getCommonHandles(tweets$body)
## handles
## @cltairport @Panthers @midnight @panthers
## 218 212 180 130
## @EthanDolan @ness @hornets @GraysonDolan
## 117 80 78 77
## @SportsCenter @F3theFort @realDonaldTrump @nodabrewing
## 69 64 64 63
## @CampersHaven @marcuslemonis @F3Isotope @marielawtf
## 58 52 48 43
## @oakley @LifeTimeFitness @ChickenNGreens @wcnc
## 39 38 36 36
Ah… maybe they’re Carolina Panthers related!
Let’s use a different regular expression and flag only the tweets that include any of three Panthers-related keywords…
panthers <- c("#keeppounding", "#panthers", "@panthers")
# find the tweets whose (lower-cased) body contains any of the keywords above
hit <- grepl(paste(panthers, collapse = "|"), tolower(tweets$body))
# create a column to separate panther tweets and non-Panther tweets
tweets$pantherFlag <- ifelse(hit, "Panther Tweet", "Non-Panther Tweet")
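As a quick sanity check (this step is not in the original output), we could tabulate the new flag to see how many tweets were matched:
# count Panther vs. non-Panther tweets
table(tweets$pantherFlag)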
With our new flag, let’s rerun our time series but differentiate between Panther and non-Panther tweets.
dayCount <- tweets %>%
  group_by(day, pantherFlag) %>%
  summarise(Count = n())
We can again use ggplot2 to plot the time series, this time coloring the lines by our new flag.
ggplot(dayCount, aes(x = day, y = Count, color = pantherFlag)) +
  geom_line() +
  labs(x = "Posted Day", y = "Tweet Count", title = "Charlotte Geo Tweets per Day") +
  expand_limits(y = 0) # set y axis to start at zero
This is a good start, but it still looks like we may have missed some Tweets.
Querying Twitter data is an incredibly hard problem (much harder than many researchers realize!).
For example, King, Lam, and Roberts (2017) investigate this problem and find that the choice of keywords can drastically affect results.
For a deeper analysis, see my Fall 2016 Twitter Workshop Day 1 Exercise.
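To get a feel for this sensitivity, here is a minimal sketch: it expands the keyword list with two illustrative additions (#sb50, which appears in the hashtag table above, and the phrase "keep pounding"; these are assumptions, not a vetted query) and compares how many tweets each list captures.
# an expanded, still case-insensitive keyword list (extra terms are illustrative only)
panthersWide <- c("#keeppounding", "#panthers", "@panthers", "#sb50", "keep pounding")
hitWide <- grepl(paste(panthersWide, collapse = "|"), tolower(tweets$body))
# compare the number of matches under each keyword list
sum(hit)     # original three keywords
sum(hitWide) # expanded keyword list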