A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data

  We present the challenges a data scientist faces when exploring and analyzing heterogeneous open geospatial data. This work explains the initial steps of a data exploration process, focused on discovering the similarities and differences conveyed by diverse sources through correlation analysis. We also explore how spatial resolution influences the strength of dependence between heterogeneous urban sources, to pave the way to a meaningful information fusion.
Transcript
  • 1. A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data Gloria Re Calegari and Irene Celino CEFRIEL – Politecnico di Milano Como, July 17th 2015 Free and Open Source Software for Geospatial - FOSS4G Europe 2015
  • 2. Digital information about cities • Open data (a large number of data sources available on the web): • Urban planning (land cover, public registers) • Demographics and statistics about the municipality • Closed data sources produced and maintained by enterprises: • Phone activity data (but sometimes made open!) • User-generated information: • Volunteered geographic information and crowdsourced information (Open Street Map) • Location-based social networks (Foursquare check-ins and geolocated information) • Real-time and streaming information: • Sensors (e.g. temperature, energy consumption, ...)
  • 3. Data exploration process and case study A lot of data can describe the urban environment from different perspectives -> great wealth for data scientists. Managing, processing and comparing those data can be cumbersome -> smarter solutions are required. Data exploration of heterogeneous urban information sources related to the city of Milano in Italy: • Possible issues • Best practices • Data exploration through correlation analysis (understand whether diverse information sources mirror the same picture of a city)
  • 4. Milano datasets Demographics: • Population density • Spatial resolution: census area (6079 census areas; median size of a census area 12,000 m²) • Source: Milano open data Points of interest (POIs): • Transport, schools, sports facilities, amenities, shops ... • Spatial resolution: lat-long points • Source: Milano open data (official, 6718 points) and Open Street Map (user generated, 44351 points)
  • 5. Milano datasets Land use cover: • Type of land use according to the CORINE taxonomy (3-level hierarchy, up to 40 types of land use defined) • CORINE taxonomy: http://swa.cefriel.it/ontologies/corine.html# • 2 types selected (those that best characterize a metropolitan area such as Milan): 1. Residential 2. Agricultural • Spatial resolution: building level • Source: Lombardy region open data
  • 6. Milano datasets Call data records: • 5 phone activities • Incoming SMS • Outgoing SMS • Incoming calls • Outgoing calls • Internet • Recorded every 10 minutes (144 values a day for each activity) for 2 months (Nov-Dec 2013) • Spatial resolution: grid of 3538 square cells of 250 m • Source: Telecom Italia, provided for their Big Data Challenge http://theodi.fbk.eu/openbigdata/
  • 7. Challenges • Varying spatial resolution of information sources (census area for population, single points for POIs, ...) • Different time frames (population census done every 10 years, telecom data recorded every 10 minutes) • Reliability (to what extent the sources can be trusted; data from public authorities or from crowdsourcing)
  • 8. Best practices adopted 1) Data transformation, cleansing or normalization (standard operations) 2) Making spatial resolution uniform. Spatial resolutions used: • District level with 88 official subdivisions • Grid level with 3538 square cells of 250 m. New datasets generated: • Density of POIs in each cell/district • Weighted sum of population density in each cell/district • Percentage share of each land use over each cell/district area
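The grid aggregation named on slide 8 (density of POIs in each 250 m cell) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes the points are already in a projected coordinate system measured in metres, the grid origin and the function names `cell_index` and `poi_density_per_cell` are hypothetical, and the coordinates are made up.

```python
from collections import Counter
from math import floor

CELL_SIZE_M = 250.0  # side of a square grid cell, as stated on the slide


def cell_index(x, y, origin_x=0.0, origin_y=0.0, cell_size=CELL_SIZE_M):
    """Return the (column, row) of the grid cell containing point (x, y).

    Assumption: x and y are projected coordinates in metres,
    not latitude/longitude.
    """
    col = floor((x - origin_x) / cell_size)
    row = floor((y - origin_y) / cell_size)
    return (col, row)


def poi_density_per_cell(points, cell_size=CELL_SIZE_M):
    """Count the points falling in each cell and divide by the cell area (km^2)."""
    counts = Counter(cell_index(x, y, cell_size=cell_size) for x, y in points)
    cell_area_km2 = (cell_size / 1000.0) ** 2
    return {cell: n / cell_area_km2 for cell, n in counts.items()}


# Made-up coordinates, for illustration only.
pois = [(10.0, 20.0), (240.0, 249.0), (260.0, 10.0), (510.0, 600.0)]
density = poi_density_per_cell(pois)
```

Cells without any point are simply absent from the returned dictionary; a real analysis would probably fill them with zero so that every dataset yields one value per cell.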
  • 9. Best practices adopted 3) Data compression (pre-processing large-scale time series to get a more manageable compressed representation). Telecom data: footprint/temporal signature for each cell/district (average activity over all the 60 days, distinguishing between weekdays and weekend days)
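One possible reading of the footprint on slide 9 is an average daily profile per cell, computed separately for weekdays and weekend days. The sketch below follows that reading; the input layout (a mapping from date to 144 ten-minute values), the function name `temporal_footprint` and the sample data are assumptions made for illustration.

```python
from datetime import date, timedelta

SLOTS_PER_DAY = 144  # one value every 10 minutes


def temporal_footprint(daily_series):
    """Average the daily profiles of one cell, split by weekday/weekend.

    daily_series maps a datetime.date to a list of SLOTS_PER_DAY values.
    Returns {"week": [...], "weekend": [...]}; a profile is an empty list
    when no day of that kind is present.
    """
    sums = {"week": [0.0] * SLOTS_PER_DAY, "weekend": [0.0] * SLOTS_PER_DAY}
    days = {"week": 0, "weekend": 0}
    for day, values in daily_series.items():
        if len(values) != SLOTS_PER_DAY:
            raise ValueError(f"{day}: expected {SLOTS_PER_DAY} values")
        # date.weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if day.weekday() >= 5 else "week"
        days[kind] += 1
        for slot, value in enumerate(values):
            sums[kind][slot] += value
    return {
        kind: [total / days[kind] for total in sums[kind]] if days[kind] else []
        for kind in sums
    }


# Synthetic week: constant activity 1.0 on weekdays, 3.0 on weekend days.
start = date(2013, 11, 4)
series = {}
for offset in range(7):
    day = start + timedelta(days=offset)
    level = 3.0 if day.weekday() >= 5 else 1.0
    series[day] = [level] * SLOTS_PER_DAY

footprint = temporal_footprint(series)
```

With 60 days of data this reduces 60 × 144 values per cell and activity to two profiles of 144 values each.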
  • 10. Correlation analysis Try to identify possible correspondences between different datasets. Measure whether and how two variables change together using a correlation index -> Pearson’s correlation coefficient r, with -1 ≤ r ≤ 1 (r > 0: positive correlation; r < 0: negative correlation)
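For reference, Pearson's coefficient from slide 10 can be computed directly from its definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the sample vectors are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative vectors: one value per district or cell.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

The first pair is perfectly linearly increasing and the second perfectly linearly decreasing, so they sit at the two ends of the range of r.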
  • 11. Correlation analysis - datasets Pairwise comparisons between 1-dimensional vectors: • POIs municipality: density • POIs OSM: density • Population: density • Telecom: first principal component, explaining 90% of the variability • Land use data: residential and agricultural used separately, as the percentage of each district/cell area they cover
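Slide 11 reduces the telecom profiles to a single vector by taking the first principal component. The slides do not say how it was computed; the sketch below uses plain power iteration on the covariance matrix as one textbook way to obtain it. The function name, the starting vector and the toy data are assumptions, and a numerical library would be the practical choice for 144 time slots and thousands of cells.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (scores, explained_ratio) for the first principal component.

    rows: list of equally long numeric lists, one row per cell/district and
    one column per time slot. Uses power iteration on the covariance matrix.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    total_variance = sum(cov[i][i] for i in range(d))
    if total_variance == 0:
        raise ValueError("all rows are identical; no variance to explain")

    # Arbitrary non-uniform start; power iteration can fail in the rare case
    # where the start is exactly orthogonal to the leading eigenvector.
    v = [1.0 + j for j in range(d)]
    norm = sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Rayleigh quotient v^T C v gives the variance along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return scores, eigenvalue / total_variance


# Toy data: 4 "cells" with 2 strongly related columns.
toy = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
scores, explained = first_principal_component(toy)
```

The returned `scores` vector has one value per row, which is the kind of 1-dimensional vector that can then be compared pairwise with the other datasets.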
  • 12. Correlation analysis at district level • Correlation between • Telecom and residential • Telecom and POIs can actually exist. The data fit quasi-linear models. • Negative correlation between agricultural land use and the other datasets -> human activity is inversely related to agricultural areas. [Figure; variable labels: tlc, resid, agric, POI mun, POI OSM, pop]
  • 13. Correlation analysis at cell level • All coefficients are lower than at the district level • Higher values again between Telecom and residential and POIs => the choice of resolution level can have a significant impact on the correlation results. • Some phenomena causing the correlation are independent of the resolution level (0.76 residential-population). [Figure; variable labels: tlc, resid, agric, POI mun, POI OSM, pop]
  • 14. Correlation analysis: phone calls and population • Could the correlation change during the day according to everyday human behaviour patterns (get up, go to work, come back home in the evening)? • Call activity at 6 different times of day • Week and weekend profiles are different -> mirroring people’s different habits • Average correlation is higher at the weekend (phone activity related to the actual presence of people at home) • Weekday profile -> human behaviour pattern [Figure panels: DISTRICT, CELL; WEEK, WEEKEND]
  • 15. Conclusions and future work To sum up... • Presentation of best practices for the data exploration process, applied to urban datasets of Milano • The approach is presented in an urban environment but can also be applied in different environments • Correlation between different sources exists and is strongly related to the resolution level adopted What is coming next? • Extending our investigation toward a predictive approach • Would it be possible to use one or more ‘cheap’ datasets (like open data) as a proxy for more ‘expensive’ data sources? • Explorative analysis => statistical and machine learning techniques.
  • 16. Predictive analysis (not in the paper) • Support Vector Machine to classify the CORINE classes using the POIs as predictors. • Accuracy > 83% • Errors (black dots) on the boundary => promising results, go on in this direction!
  • 17. Thank you! Any questions? A Data Scientist Exploration in the World of Heterogeneous Open Geospatial Data Gloria Re Calegari and Irene Celino CEFRIEL – Politecnico di Milano