Web Archives and Data Challenges - Archives Unleashed

  Overview presentation for Archives Unleashed 2016, outlining the data challenges of working with large-scale Web archive data.
Transcript
  • 1. Put Hacks to Work: Archives in Research
  • 2. Can we use what we make? (Credit: Flickr @ilovecology)
  • 4. Who is the audience?
  • 5. What matters?
  • 7. Filtering to what matters
       Source | Destination | Date | Frequency | Content Type | Bytes | Content
       Link Data: http://gawker.com/5953665/mitt-romneys-staff-played-the-media-covering-them-in-a-friendly-game-of-flag-football | Mitt Romney's Staff Played the Media Covering Them in a Friendly Game of Flag | http://gawker.com | 2012-10-22
  • 8. News Media on the Web (Weber, Ognyanova, Kosterich & Nguyen, 2015)
  • 9. NJ Local News: 2007 - 2012
  • 10. [Figure: NJ.com Domain Analysis, 2007 to 2012. Axes: Avg. Number of Webpages; Avg. MB per Webpage. Series: Number of Pages, Avg MB.]
  • 11. Dataset | Research Potential | Dates | Captures | Unique URLs
       Hurricane Katrina | Online networks and organizational resilience (Chewning, Lai and Doerfel, 2012; Perry, Taylor and Doerfel, 2003) in the wake of disasters; information dissemination | 2003 – 2012 | 1,694,236 | 663,740
       Superstorm Sandy | | 2003 – 2012 | 41,703,112 | 20,013,455
       US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
       US House | | | 51,840,777 | 12,410,014
       Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
       US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, In Press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
  • 12. What about reliability?
  • 13. Validity?
  • 17. [Figure: Potential vs. Actual URLs. Axes: t; Count of URLs / Count of Pages. Series: Potential, Actual, Difference.]
  • 18. [Figure: Changes in Crawl Completeness. Axes: t; Count of URLs / Count of Pages. Series: OWS, House, Senate, Katrina; existing vs. potential.]
       b = set a unit of time for analysis, c choosing n periods across a total time T
  • 19. Challenges are not unique to these data (Courtesy of Marc Smith, NodeXL)
  • 21. Research support from: NSF Award #1244727; Additional support from the NetSCI Lab @ Rutgers
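
The link-data fields named on the "Filtering to what matters" slide (Source, Destination, Date, Frequency, Content Type, Bytes, Content) suggest one record per extracted hyperlink. As a rough sketch only, assuming that layout, filtering to a single source domain and a date range might look like the following in Python. The `LinkRecord` class and the `filter_links` helper are hypothetical names, and every sample value except the `http://gawker.com` source and the 2012-10-22 date is a placeholder rather than data from the presentation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class LinkRecord:
    """One extracted hyperlink; field names mirror the slide's column list."""
    source: str          # site the link was found on
    destination: str     # URL the link points to
    crawl_date: date     # date of the capture
    frequency: int       # how often the link was observed
    content_type: str    # MIME type of the captured resource
    size_bytes: int      # size of the captured resource
    content: str         # extracted text or title


def filter_links(records: Iterable[LinkRecord], source: str,
                 start: date, end: date) -> List[LinkRecord]:
    """Keep records from one source whose capture date is in [start, end]."""
    return [r for r in records
            if r.source == source and start <= r.crawl_date <= end]


# Placeholder records: only the gawker.com source and its date come from the slide.
records = [
    LinkRecord("http://gawker.com", "http://example.org/story",
               date(2012, 10, 22), 1, "text/html", 1024, "placeholder text"),
    LinkRecord("http://example.com", "http://example.org/other",
               date(2011, 5, 1), 3, "text/html", 2048, "placeholder text"),
]

kept = filter_links(records, "http://gawker.com",
                    date(2012, 1, 1), date(2012, 12, 31))
print(len(kept), kept[0].crawl_date.isoformat())
```

A list comprehension is enough at this scale; for collections the size of those on slide 11, the same predicate would more plausibly run inside a distributed or streaming pipeline.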