Around 30648 users in our SCC tweeted a hashtag or URL during the relevant period. For learning and training we consider the top 5000 topics that have more than 50 unique users tweeting about them. We also disambiguated a number of hashtags manually. The top hashtags and URLs are now:
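The topic selection described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name `top_topics`, the `(user_id, topic)` record layout, and the `aliases` table for manual disambiguation are all assumptions made for the example.

```python
from collections import defaultdict

def top_topics(records, min_users=50, limit=5000, aliases=None):
    """Rank topics (hashtags or URLs) by number of unique users.

    records: iterable of (user_id, topic) pairs (assumed layout).
    aliases: optional dict mapping a topic to its canonical name,
             standing in for the manual disambiguation step (hypothetical).
    """
    aliases = aliases or {}
    users_by_topic = defaultdict(set)
    for user_id, topic in records:
        canonical = aliases.get(topic, topic)
        users_by_topic[canonical].add(user_id)

    # Keep only topics with strictly more than min_users unique users.
    counts = {t: len(u) for t, u in users_by_topic.items() if len(u) > min_users}

    # Sort by descending user count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]
```

A call such as `top_topics(records, aliases={"ff": "followfriday"})` would merge the two spellings before counting; that particular alias is only an illustrative guess.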
4970 followfriday 4673 haiyan 3553 rt 3537 affordablecareact 3504 emazing 3350 np 3324 tbt 2762 veteransday 2598 sfbatkid 2530 http://Unfollowers.me 2395 peoplechoice 2357 mtvstars 2277 tcot 2171 wcw 1888 lestweforget 1794 oomf 1748 believemovie 1687 love 1591 catchingfire 1564 obama 1564 facebook 1454 benghazi 1451 xfactor 1420 breaking 1411 truth 1377 ifwedate 1349 mcm 1338 p2 1294 twitter 1274 christmas 1264 fail 1194 mentionatruefriend 1168 oneofmyfavoritemoviesis 1122 jfk 1099 playstation4 1096 selfie 1087 nyc 1075 news 1069 sorrynotsorry 1069 http://www.justunfollow.com/?r=tw 1058 gop 1056 uniteblue 1049 iran 1045 music 1040 t 1038 amas 1031 s 1019 tlot 1005 blessed 1002 scandal
To expand the dataset, we tried collecting the neighbours of the strongly connected component that we have and putting them in a map. It turns out that the neighbour set has a size of 47483801.
Geolocation can also be found in a coordinates tag in the tweet response. Out of the 17377085 tweets that are relevant to us, 261862 (about 1.5%) have this tag. Location information for a tweet can also be found in the user object of the tweet that we receive. This might not be very trustworthy, but we can use it nonetheless. Refer: https://dev.twitter.com/docs/faq#6981 . The user child object contains a time_zone and a location. The location is the user-defined location for the account's profile; it is not necessarily a location, nor parseable, and it will occasionally be fuzzily interpreted by the Search service. Almost all tweets have this information!
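Reading these fields from a tweet could look something like the sketch below. It assumes the tweet has already been parsed into a Python dict in the v1.1 JSON layout, where `coordinates` is a GeoJSON point stored as [longitude, latitude]; the function name `extract_location` and the shape of the returned dict are made up for the example.

```python
def extract_location(tweet):
    """Collect the location hints carried by one parsed tweet dict."""
    point = None
    coords = tweet.get("coordinates")  # GeoJSON object, or None when absent
    if coords and coords.get("type") == "Point":
        lon, lat = coords["coordinates"]  # GeoJSON order is longitude, latitude
        point = (lat, lon)

    user = tweet.get("user") or {}
    return {
        "point": point,                                       # exact, but rare
        "profile_location": user.get("location") or None,     # free text, may be junk
        "time_zone": user.get("time_zone") or None,
    }
```

The exact point is preferred when present; the profile location and time zone are only a fallback, since the profile text is user-defined and may not name a real place.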
The dataset that we have contains around 404524 URLs, of which around 11K have more than 15 unique users tweeting about them. There are also around 470377 hashtags, of which around 13K have more than 15 unique users.
After removing the users whose data was incomplete, the size of the largest connected component is 62519. The top few hashtags are:
4819 ff 3101 haiyan 2896 rt 2694 tbt 2527 obamacare 2357 mtvstars 2350 peopleschoice 2288 emazing 2064 veteransday 2021 mtvema 2000 wcw 1986 np 1937 sfbatkid 1922 tcot 1904 philippines 1794 oomf 1687 love 1596 nowplaying 1564 obama 1420 breaking 1411 truth 1390 benghazi 1377 ifwedate 1353 aca 1342 batkid 1338 p2 1294 twitter 1289 xfactor 1274 christmas 1264 fail 1251 teaparty 1205 retweet 1194 mentionatruefriend 1168 oneofmyfavoritemoviesis 1154 catchingfire
Top few urls are: 2530 http://Unfollowers.me 2188 https://www.healthcare.gov/ 981 https://twitter.com/rx 809 http://www.justunfollow.com/?r=td 741 http://www.peopleschoice.com/pca/votenow.jsp 648 https://twitter.com/nsm 623 https://twitter.com/bhaggs 571 http://fllwrs.com 493 http://nypost.com/2013/11/18/census-faked-2012-election-jobs-report/ 435 https://twitter.com/minimalist 407 http://newsfeed.time.com/2013/11/07/interactive-this-is-how-much-money-twitter-owes-you/ 403 http://movies.yahoo.com/video/justin-biebers-believe-trailer-171023965.html?soc_src=mediacontentsharebuttons 383 https://about.twitter.com/download 362 http://smarturl.it/MidnightMemoriesiT 360 http://socialbuzz.mtvema.com/ 355 http://pbs.twimg.com/media/BY0Rl0FIQAAXtpo.jpg 355 http://paulstamatiou.com 353 http://www.gerryeisenhaur.com 351 http://pbs.twimg.com/media/BYqhnW_IAAADSQr.jpg 346 http://www.justinbieberbelieve.com/ 343 http://www.youtube.com/watch?v=YLriiVE1OWc 330 https://vine.co/v/htbdjZAPrAX 329 https://twitter.com/gnaphos 318 https://twitter.com/cg 312 http://pbs.twimg.com/media/BZNUrQsCQAAAMQt.jpg 310 http://www.youtube.com/watch?v=A7JRa4zyl2M&feature=youtu.be
A total of 107711 users were mined. Of these, 8641 users (roughly 8%) had their timelines locked or their accounts deleted.
Also, out of the 97752 unique users for which I have the tweet data, I have the follower relations of 96909 users.
No. of SCCs : 31799
Sizes of the Top 5 SCCs : 64045, 56, 54, 13, 10
No. of nodes with no outgoing edge : 26552
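Statistics of this kind can be computed with any SCC algorithm. The sketch below uses an iterative Kosaraju two-pass search in plain Python; it is not necessarily the tool used for the numbers above. The function names and the edge-list input are assumptions, and the meaning of an edge direction (follower to followee or the reverse) is left to the caller.

```python
from collections import defaultdict

def strongly_connected_components(edges, nodes=()):
    """Return the SCCs of a directed graph as a list of node lists."""
    graph, reverse = defaultdict(list), defaultdict(list)
    all_nodes = set(nodes)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        all_nodes.update((u, v))

    # Pass 1: iterative DFS on the graph, recording nodes in finish order.
    visited, order = set(), []
    for start in all_nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all neighbours explored, so this node is finished
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

def summarize(edges, nodes=()):
    """SCC count, five largest SCC sizes, and number of nodes with out-degree 0."""
    edges = list(edges)
    components = strongly_connected_components(edges, nodes)
    sizes = sorted((len(c) for c in components), reverse=True)
    all_nodes = set(nodes) | {n for edge in edges for n in edge}
    with_outgoing = {u for u, _ in edges}
    return {
        "num_sccs": len(components),
        "top_sizes": sizes[:5],
        "no_outgoing": len(all_nodes - with_outgoing),
    }
```

The search is written iteratively because a recursive DFS would exceed Python's default recursion limit on a graph of this size.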
All the simulations for this day were run on a graph of 10000 nodes for 1000 timesteps. At present the features include only the numbers of nodes that the topic has spread to in the past.
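One plausible way to turn such a spread history into regression samples is a sliding window, sketched below. The window length, the scaling by the number of nodes, and the name `make_samples` are assumptions for illustration; the log does not say how the features were actually laid out.

```python
def make_samples(spread_counts, window=5, horizon=1, num_nodes=10000):
    """Build (features, target) pairs from one topic's spread history.

    spread_counts: number of nodes reached at each timestep.
    window:  how many past values form one feature vector (assumed).
    horizon: how many steps ahead the target lies (1 = next step).
    num_nodes: graph size used to scale counts into [0, 1] (assumed scaling).
    """
    scaled = [count / num_nodes for count in spread_counts]
    features, targets = [], []
    for t in range(window, len(scaled) - horizon + 1):
        features.append(scaled[t - window:t])     # the last `window` values
        targets.append(scaled[t + horizon - 1])   # value `horizon` steps later
    return features, targets
```

With `horizon=1` this gives the next step setup, and `horizon=10` gives the 10 step ahead setup; the resulting lists can be passed to any regressor that accepts a feature matrix and a target vector.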
Figure 1: Next step prediction using epsilon SVR -- Mean squared error = 0.00299039
Figure 2: 10 step ahead prediction using epsilon SVR -- Mean squared error = 0.00394066
Figure 3: Single topic next step prediction but using nu-SVR -- Mean squared error = 0.000188936. It was here that I realized nu-SVR performs better, so I switched to it.
Figure 4: Next step prediction using nu SVR -- Mean squared error = 0.000188578 (about an order of magnitude lower than epsilon-SVR)
Figure 5: 10 step ahead prediction using nu SVR -- Mean squared error = 0.00164919
Figure 6: 10 step ahead prediction using nu SVR but for a single topic -- Mean squared error = 0.000532476
Figure 7: Maxpeak prediction. Training and testing are done only during the rising period of the topic, i.e. until the max peak is reached -- Mean squared error = 0.00176657
Figure 8: Maxpeak prediction for a single topic -- Mean squared error = 0.000714208
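The rising-period restriction used for the maxpeak figures could be implemented along these lines. This is a sketch under the assumption that each topic is a per-timestep activity series and that the peak is its first maximum; the name `rising_period` is hypothetical.

```python
def rising_period(activity):
    """Return the series truncated at its (first) maximum, inclusive."""
    if not activity:
        return []
    # max() returns the first index attaining the maximum value.
    peak_index = max(range(len(activity)), key=activity.__getitem__)
    return activity[:peak_index + 1]
```

For example, `rising_period([1, 3, 7, 7, 2])` keeps `[1, 3, 7]` and drops everything after the first peak.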