Download Link

update 6th Feb

1 hour shift, 3 hour shift, various topic graphs

update 2nd Feb

10 step ahead prediction, using only the number of unique users talking about the topic as the feature
Mean squared error = 2.48397e-06 (regression)
Squared correlation coefficient = 0.853021 (regression)
Next step prediction
Mean squared error = 6.32844e-08 (regression)
Squared correlation coefficient = 0.997544 (regression)
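Naive prediction, 10 step ahead
MSE = 3.0702156848e-06
Naive prediction, next step
MSE = 5.0463118889e-08
On these numbers the learned model beats the naive baseline at 10 steps ahead (2.48397e-06 vs 3.0702156848e-06) but not at the next step (6.32844e-08 vs 5.0463118889e-08).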
Link to topic graph
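
To make the comparison above reproducible, here is a minimal sketch of the two reported metrics and of a naive baseline. It assumes the naive predictor simply repeats the last observed value (predicting the value at t + h as the value at t) and that the squared correlation coefficient is the squared Pearson correlation between actual and predicted values. Neither assumption is stated in these notes, and all function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def squared_correlation(actual, predicted):
    """Squared Pearson correlation between the two sequences (assumed metric)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        return float("nan")  # undefined when either sequence is constant
    return cov * cov / (var_a * var_p)


def naive_forecast_pairs(series, horizon):
    """Assumed naive baseline: predict series[t + horizon] as series[t]."""
    actual = series[horizon:]
    predicted = series[:len(series) - horizon]
    return actual, predicted
```

Calling `mean_squared_error(*naive_forecast_pairs(series, 10))` on a topic's time series would give the 10 step ahead naive MSE under these assumptions.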

update 30th Jan

Around 30648 users in our SCC tweeted a hashtag or URL in the relevant period. For learning and training we consider the top 5000 topics that have more than 50 unique users tweeting about them. We also disambiguated a number of hashtags manually. The top hashtags and URLs are now:

   4970 followfriday
   4673 haiyan
   3553 rt
   3537 affordablecareact
   3504 emazing
   3350 np
   3324 tbt
   2762 veteransday
   2598 sfbatkid
   2530 http://Unfollowers.me
   2395 peoplechoice
   2357 mtvstars
   2277 tcot
   2171 wcw
   1888 lestweforget
   1794 oomf
   1748 believemovie
   1687 love
   1591 catchingfire
   1564 obama
   1564 facebook
   1454 benghazi
   1451 xfactor
   1420 breaking
   1411 truth
   1377 ifwedate
   1349 mcm
   1338 p2
   1294 twitter
   1274 christmas
   1264 fail
   1194 mentionatruefriend
   1168 oneofmyfavoritemoviesis
   1122 jfk
   1099 playstation4
   1096 selfie
   1087 nyc
   1075 news
   1069 sorrynotsorry
   1069 http://www.justunfollow.com/?r=tw
   1058 gop
   1056 uniteblue
   1049 iran
   1045 music
   1040 t
   1038 amas
   1031 s
   1019 tlot
   1005 blessed
   1002 scandal
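
The disambiguation and filtering steps behind the listing above can be sketched as follows. This is only an illustration: the alias table in the usage note is hypothetical (the actual manual mapping is not recorded here), and the input is assumed to be a dictionary from topic to the set of user ids that tweeted it.

```python
def merge_aliases(users_by_topic, aliases):
    """Union the user sets of topics that map to the same canonical name."""
    merged = {}
    for topic, users in users_by_topic.items():
        name = aliases.get(topic, topic)  # topics without an alias keep their name
        merged.setdefault(name, set()).update(users)
    return merged


def top_topics(users_by_topic, min_users=50, limit=5000):
    """Return (unique_user_count, topic) pairs, largest first.

    Only topics with strictly more than `min_users` unique users are kept,
    and at most `limit` topics are returned.
    """
    ranked = [
        (len(users), topic)
        for topic, users in users_by_topic.items()
        if len(users) > min_users
    ]
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked[:limit]
```

For example, `top_topics(merge_aliases(users_by_topic, {"ff": "followfriday"}))` would count a user who tweeted both tags only once, because the user sets are unioned rather than the counts added.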

update 29th Jan

To expand the dataset, we tried collecting the neighbours of the strongly connected component that we have and putting them in a map. It turns out that there are 47483801 such neighbours.
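
A minimal sketch of the neighbour collection described above, assuming the follower graph is available as an adjacency dictionary (node to iterable of neighbouring nodes) and using a set instead of a map. The function name is hypothetical, and whether the reported count excludes nodes already inside the component is not stated; this sketch excludes them.

```python
def external_neighbours(adjacency, component):
    """Collect nodes adjacent to the component that are not members of it."""
    component = set(component)
    neighbours = set()
    for node in component:
        for other in adjacency.get(node, ()):
            if other not in component:
                neighbours.add(other)
    return neighbours
```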

Geolocation can also be found in a coordinates tag in the tweet response. Out of the 17377085 tweets that are relevant to us, 261862 (roughly 1.5%) have this tag. The location of a tweet can also be taken from the user object in the tweet that we receive. This might not be very trustworthy, but we can use it nonetheless. Refer: https://dev.twitter.com/docs/faq#6981 .
The user child object contains a time_zone and a location. Location is described as "the user-defined location for this account's profile. Not necessarily a location nor parseable. This field will occasionally be fuzzily interpreted by the Search service." Almost all tweets have this information!
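
The fields mentioned above can be read from a decoded tweet as in the sketch below. It assumes the tweet has already been parsed from JSON into a dictionary and that the coordinates tag, when present, is a GeoJSON point with a [longitude, latitude] pair. The function name and output keys are hypothetical.

```python
def extract_location_info(tweet):
    """Pull the location-related fields out of a decoded tweet dictionary."""
    user = tweet.get("user") or {}
    point = tweet.get("coordinates") or {}  # the tag is null on most tweets
    lon_lat = point.get("coordinates")      # GeoJSON order: [longitude, latitude]
    return {
        "longitude": lon_lat[0] if lon_lat else None,
        "latitude": lon_lat[1] if lon_lat else None,
        # free-text profile field; may be empty or not a real place
        "profile_location": user.get("location") or None,
        "time_zone": user.get("time_zone"),
    }
```

Because the profile location is free text, any downstream use would still need its own parsing or geocoding step.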

The dataset contains around 404524 URLs, of which around 11K have more than 15 unique users tweeting about them. There are also around 470377 hashtags, of which around 13K have more than 15 unique users.
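
Counts like the ones above can be obtained by collecting the set of user ids per hashtag and per URL. The sketch below assumes each tweet is a decoded dictionary with an entities object holding hashtags (each with a text field) and urls (each with an expanded_url field). Lower-casing hashtags is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def unique_users_per_topic(tweets):
    """Map each hashtag and each URL to the set of user ids that tweeted it."""
    users_by_hashtag = defaultdict(set)
    users_by_url = defaultdict(set)
    for tweet in tweets:
        user_id = tweet["user"]["id"]
        entities = tweet.get("entities", {})
        for tag in entities.get("hashtags", []):
            # assumption: hashtags are compared case-insensitively
            users_by_hashtag[tag["text"].lower()].add(user_id)
        for url in entities.get("urls", []):
            users_by_url[url["expanded_url"]].add(user_id)
    return users_by_hashtag, users_by_url


def count_above(users_by_topic, threshold=15):
    """Number of topics with strictly more than `threshold` unique users."""
    return sum(1 for users in users_by_topic.values() if len(users) > threshold)
```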

update 27th Jan

The size of the largest connected component, after removing users whose data was incomplete, is 62519. Top few hashtags are:
   4819 ff
   3101 haiyan
   2896 rt
   2694 tbt
   2527 obamacare
   2357 mtvstars
   2350 peopleschoice
   2288 emazing
   2064 veteransday
   2021 mtvema
   2000 wcw
   1986 np
   1937 sfbatkid
   1922 tcot
   1904 philippines
   1794 oomf
   1687 love
   1596 nowplaying
   1564 obama
   1420 breaking
   1411 truth
   1390 benghazi
   1377 ifwedate
   1353 aca
   1342 batkid
   1338 p2
   1294 twitter
   1289 xfactor
   1274 christmas
   1264 fail
   1251 teaparty
   1205 retweet
   1194 mentionatruefriend
   1168 oneofmyfavoritemoviesis
   1154 catchingfire

Top few urls are: 
   2530 http://Unfollowers.me
   2188 https://www.healthcare.gov/
    981 https://twitter.com/rx
    809 http://www.justunfollow.com/?r=td
    741 http://www.peopleschoice.com/pca/votenow.jsp
    648 https://twitter.com/nsm
    623 https://twitter.com/bhaggs
    571 http://fllwrs.com
    493 http://nypost.com/2013/11/18/census-faked-2012-election-jobs-report/
    435 https://twitter.com/minimalist
    407 http://newsfeed.time.com/2013/11/07/interactive-this-is-how-much-money-twitter-owes-you/
    403 http://movies.yahoo.com/video/justin-biebers-believe-trailer-171023965.html?soc_src=mediacontentsharebuttons
    383 https://about.twitter.com/download
    362 http://smarturl.it/MidnightMemoriesiT
    360 http://socialbuzz.mtvema.com/
    355 http://pbs.twimg.com/media/BY0Rl0FIQAAXtpo.jpg
    355 http://paulstamatiou.com
    353 http://www.gerryeisenhaur.com
    351 http://pbs.twimg.com/media/BYqhnW_IAAADSQr.jpg
    346 http://www.justinbieberbelieve.com/
    343 http://www.youtube.com/watch?v=YLriiVE1OWc
    330 https://vine.co/v/htbdjZAPrAX
    329 https://twitter.com/gnaphos
    318 https://twitter.com/cg
    312 http://pbs.twimg.com/media/BZNUrQsCQAAAMQt.jpg
    310 http://www.youtube.com/watch?v=A7JRa4zyl2M&feature=youtu.be

Tweet mining details

A total of 107711 users were mined. Of these, 8641 users had their timelines locked or their accounts deleted, which is roughly 8%.

Also, out of the 97752 unique users for which I have the tweet data, I have the follower relations of 96909 users.

No. of SCCs : 31799
Sizes of the Top 5 SCCs : 64045, 56, 54, 13, 10
No. of nodes with no outgoing edge : 26552
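
Statistics of this kind can be computed from an edge list as in the following sketch. It uses an iterative version of Kosaraju's algorithm, which is one standard way to find strongly connected components; these notes do not say which method or library was actually used. The function names are hypothetical, `edges` is assumed to be a list of (source, destination) pairs, and nodes that appear in no edge are not counted.

```python
def strongly_connected_components(edges):
    """Return a list of SCCs (as sets) using an iterative Kosaraju pass."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
        nodes.update((src, dst))

    # First pass: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(forward.get(start, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in seen:
                    seen.add(child)
                    stack.append((child, iter(forward.get(child, ()))))
                    break
            else:
                # all children explored, so this node is finished
                order.append(node)
                stack.pop()

    # Second pass: flood-fill the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for parent in backward.get(node, ()):
                if parent not in assigned:
                    assigned.add(parent)
                    component.add(parent)
                    stack.append(parent)
        components.append(component)
    return components


def graph_summary(edges, top=5):
    """SCC count, largest SCC sizes, and number of nodes with no outgoing edge."""
    components = strongly_connected_components(edges)
    sizes = sorted((len(c) for c in components), reverse=True)
    sources = {src for src, _ in edges}
    nodes = sources | {dst for _, dst in edges}
    return {
        "scc_count": len(components),
        "top_sizes": sizes[:top],
        "no_outgoing": len(nodes - sources),
    }
```

The explicit stack avoids Python's recursion limit, which a recursive depth-first search would hit on a graph of this size.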

9th September 2013

All the simulations for this day are on a graph of 10000 nodes running for 1000 timesteps. At present the only features are the numbers of nodes that the topic has spread to at past timesteps.
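
A minimal sketch of how past spread counts can be turned into training examples for next step or 10 step ahead prediction. The window length is an assumption (these notes do not state how many past values are used), and the function name is hypothetical.

```python
def make_windows(series, window, horizon):
    """Pair each window of past values with the value `horizon` steps ahead.

    horizon=1 gives next step targets; horizon=10 gives 10 step ahead targets.
    """
    features, targets = [], []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        end = start + window
        features.append(list(series[start:end]))
        # the target sits `horizon` steps after the last value in the window
        targets.append(series[end + horizon - 1])
    return features, targets
```

The resulting feature rows and targets can then be handed to any regressor, for example the epsilon-SVR and nu-SVR models compared in the figures below.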

Figure 1: Next step prediction using epsilon SVR -- Mean squared error = 0.00299039

Figure 2: 10 step ahead prediction using epsilon SVR -- Mean squared error = 0.00394066

Figure 3: Single topic next step prediction but using nu-SVR -- Mean squared error = 0.000188936. It was here that I realized nu-SVR performs better, so I switched to it.

Figure 4: Next step prediction using nu-SVR -- Mean squared error = 0.000188578 (about an order of magnitude lower than the 0.00299039 of epsilon-SVR)

Figure 5: 10 step ahead prediction using nu SVR -- Mean squared error = 0.00164919

Figure 6: 10 step ahead prediction using nu SVR but for a single topic -- Mean squared error = 0.000532476

Figure 7: Maxpeak prediction. Training and testing are done only during the rising period of the topic, i.e. until the max peak is reached -- Mean squared error = 0.00176657

Figure 8: Maxpeak prediction for a single topic -- Mean squared error = 0.000714208
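
The rising-period restriction used for Figures 7 and 8 can be sketched as below: the series is cut at its maximum and only that prefix is used. Taking the first occurrence of the maximum when there are ties is an assumption, and the function name is hypothetical.

```python
def rising_period(series):
    """Return the prefix of the series up to and including its first maximum."""
    if not series:
        return []
    # max() with a key returns the first index that attains the maximum
    peak_index = max(range(len(series)), key=lambda i: series[i])
    return list(series[:peak_index + 1])
```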