Unveiling the Depths: Navigating Data Science Beyond the Surface

Published by: Yen Tung Chin

Source – NewScientist

 

Data science arises from the intersection of several fields, combining mathematics, statistics, computer science, and domain expertise. It draws on these disciplines to navigate the complex, intricate landscape of data analysis, interpretation, and decision-making. Its goal is to find patterns, derive insights, and support informed decisions based on data.

 

In today's data-driven world, data science has become almost synonymous with machine learning models, statistical analysis, and predictive algorithms. However, beneath the surface lies a mix of interconnected disciplines that extends far beyond what we conventionally learn at university. In this piece, we explore one of the fundamentals of data science, algorithms and data structures, as well as a lesser-known extension of the field: geospatial data science. But before you continue, take everything I’ve written here with a grain of salt. Some of the topics or use cases mentioned may align more closely with related fields like data engineering or big data analytics. The reality is that most companies still do not have the frameworks or resources needed for good-quality data science work, and some facing budget constraints consolidate the roles of data engineer, data scientist, and data analyst into a single position, commonly called a data scientist. With that said, let's get right into the untold world of data science.

 

Algorithms: The Foundation of Data Science

When diving into data science, the focus often gravitates toward statistics and machine learning. But beyond helping you pass technical interviews and get into top companies, knowing data structures and algorithms has real use in data science work. While numbers and statistics often take the spotlight, algorithms remain the unsung heroes. They form the backbone of most data science processes, from extracting enormous datasets to analyzing them, and they enable the transformation of raw data into actionable insights.

 

The role of algorithms extends beyond pattern recognition and prediction; they optimize processes, enhance efficiency, and contribute to real-time decision-making across industries. Given how large and complex the datasets in data science are, big data would not be practical without algorithms that optimize its extraction and analysis. Understanding algorithmic complexity (for example, Big-O notation and time complexity) is useful for comparing the performance of two algorithms, especially when time is of the essence in obtaining results. Computers are fast, but not infinitely fast. Knowing how to write and read efficient code is a very useful skill for processing data efficiently.

Source: Geeks for Geeks - Analysis of Algorithms
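To make the idea of algorithmic complexity concrete, here is a minimal sketch in Python. It is not taken from any particular project: the function names and the ID data are made up for illustration. Both functions answer the same question (how many query IDs appear in a collection of known IDs), but one scans a list for every query, which is O(n) per lookup, while the other builds a set once and then looks up each query in O(1) time on average.

```python
import timeit


def count_known_ids_list(known_ids, queries):
    """Membership test on a list scans it each time: O(n) per query."""
    return sum(1 for q in queries if q in known_ids)


def count_known_ids_set(known_ids, queries):
    """Build a set once (O(n)), then each lookup is O(1) on average."""
    known = set(known_ids)
    return sum(1 for q in queries if q in known)


# Hypothetical data: 5,000 known IDs and 1,000 queries, half of which match.
known_ids = list(range(5_000))
queries = list(range(0, 10_000, 10))

list_time = timeit.timeit(lambda: count_known_ids_list(known_ids, queries), number=1)
set_time = timeit.timeit(lambda: count_known_ids_set(known_ids, queries), number=1)

print(f"list scan:  {list_time:.4f} s")
print(f"set lookup: {set_time:.4f} s")
```

Both functions return the same count, but the gap in running time widens as the data grows, which is exactly what Big-O analysis predicts. The actual timings depend on your machine.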

 

Understanding how different algorithms support the foundations of data science is essential for choosing the right approach to each problem. It also lets us judge whether an algorithm will give the result we are looking for, or whether it needs to be modified to suit our use case. In most basic situations, off-the-shelf black-box tools, libraries, and existing algorithms deliver the answers and the performance we need. However, although rare, there are cases where we need to modify an existing algorithm or create an entirely new one to solve the specific problem we’re tackling. This is especially true when business knowledge or domain-specific applications are involved. This is where knowledge of data structures and algorithms shines, helping us optimize the performance of the algorithms we write or use.

 

With that being said, there are still plenty of data science roles out there (in fact, most) where greater emphasis is placed on skill sets other than algorithmic proficiency. However, I personally think that knowing and excelling at algorithms can certainly elevate your abilities as a data scientist.

 

Geospatial Data Science 

We’ve seen a lot of examples where data science can be applied. From a quick Google search to example cases used in assignments, common examples include things like sales analysis, healthcare analytics, recommender systems, and many more. Yet, amidst these prevalent applications lies an underrepresented domain – one that didn't cross my radar until my Industry-Based Learning (IBL) internship – known as Geospatial Data Science.

 

What is Geospatial Data Science?

Geospatial Data Science is a multidisciplinary field where GIS (Geographic Information System) intersects with data science. It focuses on location-based analysis and harnesses the geographical component of data to unlock insights and solutions. Simply having latitude and longitude coordinates in the data does not make what we do “Geospatial Data Science”. Rather, geospatial data science uses the physical locations of data points to understand the connections or correlations that exist between them. It treats the distance, spatial interaction, and location of each data point as important to the analysis.
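As a small illustration of treating distance as part of the analysis, the sketch below uses the haversine formula to estimate the great-circle distance between two latitude/longitude points, then picks the nearest site to a query location. This is only one common way of measuring distance on a sphere, chosen here for illustration. The function names are made up for this example, the coordinates are approximate city-centre values, and the Earth is assumed to be a sphere with a mean radius of 6,371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_site(lat, lon, sites):
    """Return the name of the site closest to (lat, lon)."""
    return min(sites, key=lambda name: haversine_km(lat, lon, *sites[name]))


# Approximate coordinates, for illustration only.
sites = {
    "Melbourne": (-37.8136, 144.9631),
    "Sydney": (-33.8688, 151.2093),
}
canberra = (-35.2809, 149.1300)

print(nearest_site(*canberra, sites))
print(round(haversine_km(*sites["Melbourne"], *sites["Sydney"])), "km")
```

The coordinates only become useful once we compute something spatial from them, such as which site is closest. That step is what separates geospatial analysis from a table that merely happens to contain two extra numeric columns.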

 

The significance of spatial data science lies in its ability to discover patterns, correlations, and insights that are otherwise concealed in traditional datasets. It lets us perceive relationships based on geographical proximity or dispersion, which sets it apart from conventional data science, and it helps us answer questions like where things happen and why they happen there.

Source: NASA - Mapping Wildlife Corridors in Costa Rica with NASA Earth Observations

 

Tools and Technologies Used 

Geospatial data science relies on a variety of tools that help uncover hidden geographical patterns and insights. Some tools in this domain include:

1. GIS software – these tools allow you to easily visualize maps. Some examples are:

●      ArcGIS – ArcGIS contains built-in analysis tools for extracting insights from spatial data. It also has a Python library called ArcPy, which allows us to develop models using Python. GIS developers can also use APIs and SDKs to connect data and build interfaces that let users interact with the data. However, ArcGIS requires a paid license.

●      QGIS – QGIS has a highly active community, with support available through GIS Stack Exchange and plugins built by community members for other developers and users. Best of all, QGIS is free to use.

2. Programming languages – such as R and Python with geospatial libraries such as GeoPandas.

Note that this list is not exhaustive; there are more tools out there for geospatial data science, covering areas like databases, cloud platforms, GPS technologies and receivers, IoT devices, and more.

 

Challenges of Spatial Data Science

1. Huge volume of data

Spatial data science deals with huge amounts of information, often sourced from sensors, IoT devices, surveys, and satellites. This data poses challenges in storage, management, and processing. High-resolution satellite imagery, for example, can generate terabytes of data. Managing these volumes requires robust infrastructure, efficient storage solutions, and optimized processing techniques. Analyzing and extracting insights from these datasets needs scalable computing resources and specialized algorithms capable of handling large spatial data.

2. Long processing time

Resources are limited, and there is always a cap on how much computing power we have. Faster, more capable processors are often expensive and not readily available. Optimized workflows for processing and manipulating large spatial data are needed to avoid bottlenecks. Addressing this challenge involves strategies like data partitioning, distributed computing, and specialized software optimized for large spatial datasets. There is also the question of whether to preprocess data up front so that later steps cost less, or to optimize on the go so that there is more flexibility when answering questions.

3. Lack of knowledge

Only a limited fraction of individuals possess the proficiency to handle geospatial data effectively. If analysts lack proficiency and experience in this field, they won't be able to derive value from the data or contribute to their organization's business objectives. Furthermore, the distinct behaviour of geospatial data poses a challenge for organizations seeking to integrate it into their workflows.

Note that this list of challenges is also not exhaustive.
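The data partitioning strategy mentioned under long processing time can be sketched with a simple grid. The idea is to bucket points into fixed-size latitude/longitude cells so that a "what is near this location?" query only inspects a few cells instead of the whole dataset. This is a simplified illustration, not how any specific GIS product does it: the function names, the 0.5-degree cell size, and the sample points are all made up, and the sketch ignores edge cases such as wrap-around at the antimeridian and the poles.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, cell_deg):
    """Map a coordinate to the (row, col) index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def build_grid(points, cell_deg):
    """Partition (lat, lon, name) tuples into grid cells."""
    grid = defaultdict(list)
    for point in points:
        grid[cell_of(point[0], point[1], cell_deg)].append(point)
    return grid


def candidates_near(grid, lat, lon, cell_deg):
    """Return points in the query's cell and its 8 neighbouring cells."""
    row, col = cell_of(lat, lon, cell_deg)
    found = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            found.extend(grid.get((row + d_row, col + d_col), []))
    return found


# Hypothetical sample points: two close together, two far away.
CELL_DEG = 0.5
points = [
    (-37.81, 144.96, "a"),
    (-37.80, 144.97, "b"),
    (-33.87, 151.21, "c"),
    (51.51, -0.13, "d"),
]
grid = build_grid(points, CELL_DEG)
nearby = candidates_near(grid, -37.82, 144.95, CELL_DEG)
print(sorted(name for _, _, name in nearby))
```

Because each cell is independent, the same partitioning also lends itself to distributed computing: different cells can be handed to different workers. Choosing the cell size is the preprocessing trade-off described above, since smaller cells mean fewer candidates per query but more cells to build and store.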

 

What Problems Can We Solve?

Geospatial data science allows us to tackle urban planning challenges by optimizing city layouts for efficiency, improving transportation networks, and identifying ideal locations for infrastructure development. In environmental conservation, it helps in monitoring ecosystems, predicting natural disasters, and planning resource allocation for conservation efforts. In epidemiology, it plays a pivotal role by analyzing disease spread patterns, optimizing healthcare resource distribution, and aiding in disaster response during pandemics or emergencies. These applications are not exhaustive, and more ways of applying geospatial data science are still being explored. Who knows? Maybe geospatial data science could help us better understand extraterrestrial bodies and play a role in identifying potential locations for future human settlement.

Source: NASA - Image of the star-forming region

The Bottom Line

Data science is a multifaceted discipline. While machine learning and statistical analysis often take the spotlight, the underappreciated heroes are algorithms and data structures. Understanding their role is crucial: they optimize processes, enhance efficiency, and give you more flexibility in how you use data science as a tool. Geospatial data science, on the other hand, is a growing field with the potential to revolutionize various industries. Embracing it will not only broaden the scope of traditional data analysis but also pave the way for discoveries and advancements in fields that rely heavily on spatial information. In reality, it may be difficult to adopt geospatial data science, given that it is such a niche field and many organizations have limited expertise in geospatial analytics. However, we live in good times, with technology growing and innovation at the forefront. What is niche today may not be niche tomorrow. I encourage you to explore your interests and delve into these exciting fields as we progress into a more sophisticated future!

 

 


 

 

This article is published by CCA, a student association affiliated with Monash University. Opinions published are not necessarily those of the publishers. CCA and Monash University do not accept any responsibility for the accuracy of information contained in the publication.

CCA