1 Introduction

The term Smart City has many different definitions, which have evolved over the years together with technologies and lifestyles. One of the first definitions dates back to 2007. In [1], the authors describe a Smart City as a city with the following six characteristics:

  • Smart Economy (Competitiveness)

  • Smart People (Social and Human Capital)

  • Smart Governance (Participation)

  • Smart Mobility (Transport and ICT)

  • Smart Environment (Natural resources)

  • Smart Living (Quality of life)

These features are described in detail in Fig. 1 below.

Nowadays, Smart City refers mainly to smart environments integrated into city life, that is, cities in which objects such as sensors, devices, appliances, and integrated systems provide services, producing and manipulating complex data, thus focusing more on the aspect of Information Technology (IT) serving inhabitants to improve their lives and make urban development more sustainable.

Therefore, Smart Cities are places where traditional networks and services are made more efficient through digital solutions that benefit residents and businesses, as stated by the European Commission. In addition, a Smart City must also have a more interactive and responsive city government that keeps citizens informed and involves them in decision-making processes.

Fig. 1. Characteristics and factors of a Smart City according to [1]

In Goal 11, the United Nations 2030 Agenda [2] sets out an action plan to achieve participatory, integrated, and sustainable territorial planning and management.

Citizens are expected to participate in the administration of the city and in its sustainable development, taking an active interest in its governance and management.

The inclusion of citizens in decision-making processes, to shape public policies and services, is also strongly endorsed by all countries that are partners in the Open Government Partnership (OGP) initiative [3]. OGP claims that a transparent and open government can build trust, fight corruption, tackle inequality, and create more resilient democracies [4, 5].

To promote a culture of transparency in public administration, OGP pursues Open Data policies.

Open Data must be available under a license or regulatory provision that allows anyone to use it, including for commercial purposes; it must be provided in a disaggregated format, accessible through digital technologies, accompanied by relevant metadata, and made available free of charge.

The European Commission (EC) also states that “Data-Driven innovation is a key enabler of growth and jobs in Europe.”

How do Smart Cities play an important role in Data-Driven innovation?

As will be explained in Sect. 5, Smart Cities are an inexhaustible source of data that delineate and shape the well-being of their inhabitants. These data are produced by both public and, more importantly, private entities. Since the United Nations estimates that 57% of the world population lives in urban areas today and that this percentage will reach 68.4% by 2050 [6], the data produced by Smart Cities are highly significant for the sustainable development of human welfare.

The fact that Open Data, produced mainly by government organizations for the common good, is only a part of the available Big Data raises the need for its coexistence with data produced by the private sector. As a result, the European Commission has established guidelines for European Data Spaces to encourage the sharing of private sector data in the European data economy. This new paradigm has thus forced a new evolution of data management systems.

2 An Overview of Data Management Evolution

Since the 1980s, a Data Warehouse has been an extensive collection of business data supporting the decisions of an organization. It belongs to a single organization and usually contains private and sensitive data that are not to be shared with others. The data sources of data warehouses are mainly internal applications such as marketing, sales, finance, and customer-facing apps, plus a few external partner systems.

In data warehouses, the data are periodically fetched from their sources, processed, and stored ready for decision support systems. The only data available with this solution are those autonomously chosen by individual departments, losing a holistic vision of the organization.

In 2010, the term Data Lake was born. A Data Lake is centralized storage for raw, unstructured, semi-structured, and structured data taken from multiple sources, mostly external to the organization. The data are simply stored in one owned place from which all departments can Extract, Transform, and Load (ETL) them for their own purposes, as sketched below. The processing is distributed among departments, each of which can extract the information relevant to it without excluding the information relevant to the others.
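
The following is a minimal sketch of this pattern, assuming a local directory stands in for the lake and that the raw files are CSV; all paths, field names, and the mobility-department transform are illustrative assumptions.

```python
# Minimal sketch of the Data Lake pattern described above: raw files are kept
# in one shared store, and each department runs its own Extract-Transform-Load
# step over them. Paths and field names are hypothetical.
import csv
import json
from pathlib import Path

LAKE = Path("data_lake/raw")  # hypothetical shared raw storage


def extract(source_name: str) -> list[dict]:
    """Read a raw CSV file from the lake without altering it."""
    with open(LAKE / f"{source_name}.csv", newline="") as f:
        return list(csv.DictReader(f))


def transform_for_mobility(rows: list[dict]) -> list[dict]:
    """One department's view: keep only the fields it cares about."""
    return [
        {"sensor_id": r["sensor_id"], "vehicles_per_hour": int(r["count"])}
        for r in rows
        if r.get("type") == "traffic"
    ]


def load(rows: list[dict], target: str) -> None:
    """Store the department-specific result (here, a JSON file)."""
    Path(target).write_text(json.dumps(rows, indent=2))


if __name__ == "__main__":
    raw = extract("city_sensors")                    # Extract
    mobility_view = transform_for_mobility(raw)      # Transform
    load(mobility_view, "mobility_department.json")  # Load
```

Another department would reuse the same `extract` step on the same raw files and apply its own transform, which is precisely the distribution of processing described above.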

The following Fig. 2 summarizes the relevant evolution of data paths, where (Dep1 … Depn) are internal departments and (Ds1 … Dsn) are generic external data sources. In Fig. 2(a), enterprises were almost isolated, except for some partnerships among them. Data processing was centralized for cost optimization, due to the high cost of processing resources and storage. In Fig. 2(b), enterprises take data from various heterogeneous sources as raw data, store them in centralized storage resources, and distribute the data processing among the internal consumers (departments). In this way, each department can process the data according to its interests.

Fig. 2. From Warehouse to Data Lake

Undoubtedly, scenarios have changed today. Sharing data is a win-win situation. Data provide information that enables better decisions. Those who have data can offer them for free if it benefits them, or sell them for a profit.

3 Data Sharing in the EU: Data Governance

The EC aims to create a single market for data, where data from public bodies, businesses, and citizens can be used safely and fairly for the common good.

This initiative will develop common rules and guidelines for European Data Spaces. The goals are to make better use of public data for the common good, support voluntary data sharing by individuals, and create structures that enable critical organizations to share data.

Data Spaces should be places where it is possible to find the sources of relevant data. They are not intended as central storage of data but as a central repository from which to connect to the data sources, which are widely distributed.

This paradigm shift stems from an awareness of two factors. First, with the large number of heterogeneous data sources, it is no longer feasible to pursue an integration that imposes a unique structure on all the data. Second, the sheer volume of data, produced continuously, does not allow for centralized storage, which in many cases would merely duplicate the data.

In addition, the different mix of sources that each consumer uses to extract information is an important competitive factor that allows each stakeholder to prepare its own secret recipe.

Therefore, Data Spaces are endpoints from which to obtain valuable, secure, and validated data, and endpoints through which to securely offer the data one possesses.

In addition, data owners can determine and keep track of the permitted access and use of the data provided and any compensation charged.

Figure 3 depicts the Data Spaces stakeholders as prosumers, i.e., Producers and Consumers at the same time.

In a Data Space, Producers open access to non-personal data, specifying the license and all conditions of use and access.

Fig. 3. Data Spaces
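
To make the idea concrete, the following is a minimal sketch of the kind of record a producer might register in a Data Space; the class name, fields, and values (including the endpoint URL) are illustrative assumptions rather than any standardized schema.

```python
# Sketch of a Data Space offering: no data is copied, only a description of
# the source, its license, the conditions of use, and the endpoint where
# consumers can connect. All field names and values are hypothetical.
from dataclasses import dataclass, field


@dataclass
class DataSpaceOffering:
    dataset_id: str
    producer: str
    endpoint: str              # where the distributed source can be reached
    license: str               # open or commercial license identifier
    permitted_uses: list[str]  # conditions of use declared by the owner
    compensation: str          # "free" or a pricing scheme
    access_log: list[str] = field(default_factory=list)

    def record_access(self, consumer: str) -> None:
        """Let the data owner keep track of who accessed the offering."""
        self.access_log.append(consumer)


offer = DataSpaceOffering(
    dataset_id="air-quality-2023",
    producer="city-environment-office",
    endpoint="https://example.org/api/air-quality",  # hypothetical URL
    license="CC-BY-4.0",
    permitted_uses=["research", "commercial"],
    compensation="free",
)
offer.record_access("local-startup")
```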

Although data produced with public funds must be Open Data, according to the EC, with the Data Spaces approach private companies also have the opportunity to profit from the investment made in data collection and processing while at the same time supporting Data-Driven innovation.

Whenever a company shares data produced by its processes or services, it needs to trust that it can control the limits of use of those data and that the licenses associated with them are respected. The EC Data Governance Act [7] tries to address these concerns, making the Data Spaces paradigm feasible by strengthening data-sharing mechanisms across the European Union (EU) and creating a robust framework to generate digital trust.

Furthermore, the document “Guidance on sharing private sector data in the European data economy” [8] provides a toolbox for data producers, data consumers, and data prosumers, guiding them on the legal, business, and technical aspects of data sharing that can be applied in practice when considering and preparing data transfers between companies or from companies to the public sector.

4 Data Sharing in the EU: Technical Aspects

Guidelines for business-to-business and business-to-government data interactions, from both a legal and a technical point of view, are defined in [8, 9].

In [10], a factsheet of [9], the authors highlight how data processing is expected to move from a centralized architecture in 2018 to a distributed architecture in 2025. This estimated change is depicted in the following Fig. 4.

Fig. 4. Data processing changes

Indeed, Data Spaces will essentially be marketplaces where stakeholders interact and exchange data in a safe and controlled way, while data processing and storage are carried out by data producers and data consumers.

5 Smart and Intelligent Cities

As seen in the Introduction, Smart City refers mainly to smart environments. Smart environments imply smart objects connected to the Internet, i.e., the Internet of Things (IoT).

The Cisco Annual Internet Report [11] estimates that by 2023 there will be 29.3 billion networked devices. These devices will directly or indirectly produce an unprecedented amount of precious data, most of it in continuous streams. Integrating these data with conventional structured data is not feasible in the traditional way, and especially not in a centralized way. The IoT infrastructure should instead be seen as a system of systems (SOS) in which information must be extracted from coexisting heterogeneous systems producing data.
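
As a small illustration of extracting information from such streams incrementally, rather than storing every reading centrally, consider the sketch below; the stream generator and record format are assumptions standing in for real networked devices.

```python
# Sketch of incremental extraction of information from a continuous sensor
# stream: only per-device aggregates are kept, raw readings are not retained.
import random
import time
from typing import Iterator


def sensor_stream(n: int = 10) -> Iterator[dict]:
    """Stand-in for networked devices emitting readings continuously."""
    for i in range(n):
        yield {"device": f"air-{i % 3}", "pm10": random.uniform(5, 80)}
        time.sleep(0.01)


def running_average(stream: Iterator[dict]) -> dict[str, float]:
    """Fold each reading into a running average as it arrives."""
    totals: dict[str, tuple[float, int]] = {}
    for reading in stream:
        s, c = totals.get(reading["device"], (0.0, 0))
        totals[reading["device"]] = (s + reading["pm10"], c + 1)
    return {dev: s / c for dev, (s, c) in totals.items()}


print(running_average(sensor_stream()))
```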

In [12] a data ecosystem is depicted as a socio-technical system for extracting value from data with the support of interacting organizations and individuals. Data Spaces should be exactly such an ecosystem.

Although smart and intelligent are almost synonymous in natural languages, in technical language they express significant differences.

Intelligent systems (IS) are systems that use artificial intelligence (AI) technologies and are a complement to the smart environment. Most often, intelligent environments govern smart environments, thus joining the SOS and becoming a new type of stakeholder in the data ecosystem.

This complexity entails two important requirements that this chapter seeks to emphasize:

  • Broaden the audience of participants in Data Spaces as much as possible.

  • Create data access solutions that are both human- and machine-readable.

6 Conclusions

6.1 Broaden the Audience

In [13] it is observed that 50–80% of the costs of data projects in enterprises relate to data integration and preparation activities. These costs are high and not sustainable for small enterprises or individuals. If we want a broad and sustainable effort that enables small stakeholders to engage and leverage the value available in the data, this must change. The solutions envisaged involve the adoption of new data management systems other than the traditional Data Base Management System (DBMS). In fact, the term Data Base itself no longer fits the situation described.

Pay-as-you-go data management is the most promising approach to Data Space implementation [14,15,16]. Instead of the classic one-time integration of datasets, which incurs a significant initial cost, the pay-as-you-go paradigm supports an incremental approach to data management and the principle that the data publisher is responsible for paying the costs of joining the Data Space. That is, the publisher is responsible for finding a solution that makes its data accessible. In addition, the data are managed using a tiered approach, where an increase in the level of active data management results in a corresponding increase in associated costs [13].
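
A minimal sketch of this tiered, pay-as-you-go idea is shown below; the tier names, the bundled services, and the costs are illustrative assumptions, not figures taken from [13].

```python
# Sketch of tiered, pay-as-you-go data management: a dataset enters the Data
# Space with minimal (cheap) management and only moves to more actively
# managed tiers, at a higher cost, when the publisher opts in.
TIERS = {
    "registered": {"services": ["listed in catalog"], "monthly_cost": 0},
    "described": {"services": ["metadata", "schema published"], "monthly_cost": 10},
    "curated": {"services": ["validation", "quality monitoring"], "monthly_cost": 50},
    "integrated": {"services": ["mappings to shared vocabularies"], "monthly_cost": 200},
}


def upgrade_cost(current: str, target: str) -> int:
    """Incremental cost the publisher pays to raise the management level."""
    order = list(TIERS)
    if order.index(target) <= order.index(current):
        return 0
    return TIERS[target]["monthly_cost"] - TIERS[current]["monthly_cost"]


print(upgrade_cost("registered", "curated"))  # publisher pays only the difference
```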

Another concern for effective and broad participation in Data Spaces is the constraint on data validity. This constraint primarily affects data producers. Data cleaning and validation are very expensive tasks, especially for large amounts of data produced in real time. Although it has traditionally been said that only reliable data are good data, in the current evolution of Big Data even this taboo needs to be reconsidered. Many sources cannot be validated, especially when made available voluntarily by citizens eager to contribute to the sustainable development of their territory. Yet precluding the participation of these individuals would go against the principles of the United Nations’ Big Data for Sustainable Development (BDSD) [17].

Thus, the invitation is to consider what Pat Helland of Microsoft expressed in 2011: “If you have too much data, then ‘Good Enough’ is good enough.” In [16] he argues that it is possible to switch from an absolutely accurate approach to a more “lossy” response when the amount of data is huge. This principle could allow the involvement of a significant amount of data from unvalidated sources, particularly networked sensors, or smartphone applications, but with some concerns.

A first concern is that when the data are incorrect, the information will not be accurate, and the decisions made may not be correct. Another concern is that the data could be maliciously wrong, steering decisions that serve personal interests more than the community.

The proposal is to equip datasets with a reputation metric that can be continuously reevaluated by non-arbitrary mechanisms entrusted to the community. In this way, unreliable datasets will tend to disappear because their reputation will be dubious, while those with more reliable reputations will emerge. In any case, consumers of the data will be informed in advance of the reliability of the data and will be able to decide, according to their own goals, the right cost to pay to obtain the information they need. In addition, as already happens, private companies could perform data cleaning and validation as a service, taking data from the Data Space and providing the results back into it.
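
A minimal sketch of how such a reputation metric could be maintained follows; the exponential moving average update and its smoothing factor are assumptions for illustration, not a mechanism prescribed in the text.

```python
# Sketch of a dataset reputation metric: each new community evaluation updates
# the score, so persistently unreliable datasets drift down and reliable ones
# emerge. The update rule and smoothing factor are hypothetical choices.
class DatasetReputation:
    def __init__(self, initial_score: float = 0.5, alpha: float = 0.1):
        self.score = initial_score  # 0 = unreliable, 1 = fully reliable
        self.alpha = alpha          # weight given to each new evaluation

    def add_evaluation(self, rating: float) -> float:
        """Fold a community rating in [0, 1] into the running reputation."""
        self.score = (1 - self.alpha) * self.score + self.alpha * rating
        return self.score


rep = DatasetReputation()
for community_rating in [0.9, 0.8, 0.2, 0.9]:
    rep.add_evaluation(community_rating)
print(round(rep.score, 3))  # consumers see this before deciding what to pay
```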

One example of this strategy is the Kaggle website [18]. It offers a huge repository of data published by the community. It allows users to publish and find datasets and gives each dataset a usability rating. This rating is a single number, calculated for each dataset, that scores it on several factors, such as the level of documentation, the availability of related public content as references, the file type, and the coverage of key metadata.

While there are many data profiling tools that assess data quality ([19] and [20] are just two examples), they may need to be set up by experts in the field. Thus, any solution should also consider tools for volunteer collaboration and engagement, incentivizing their participation and making it as easy as possible for them to assess data quality.

Data quality is often defined in terms of measuring specific dimensions, including completeness, validity, uniqueness, timeliness, consistency, and accuracy.

Many data profiling tools combine the results of the various data quality dimensions into an aggregate data quality score. This approach could provide a simple metric that assigns an initial level of confidence to the data element.
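
The following sketch shows how such an aggregate score could be computed from the dimensions listed above; the equal default weights and the sample profile are assumptions for illustration.

```python
# Sketch of an aggregate data quality score built from standard quality
# dimensions. Weights and sample values are hypothetical.
DIMENSIONS = ["completeness", "validity", "uniqueness",
              "timeliness", "consistency", "accuracy"]


def aggregate_quality(scores: dict[str, float],
                      weights: dict[str, float] | None = None) -> float:
    """Combine per-dimension scores in [0, 1] into one initial confidence value."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores.get(d, 0.0) for d in DIMENSIONS) / total_weight


profile = {"completeness": 0.92, "validity": 0.88, "uniqueness": 1.0,
           "timeliness": 0.70, "consistency": 0.85, "accuracy": 0.60}
print(round(aggregate_quality(profile), 2))  # initial confidence for the dataset
```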

A similar score should be given to the producers of the data themselves so that the reputation of the sources is also valued. Such a score would encourage data producers to strive to produce better data and gain, if not financially, at least in reputation.

Their score would go up when their data have a good rating over a long period and go down when their data have a persistently bad rating. To this should be added the reviews of the users who have used the data, which will affect the overall score of the data producer.

In the latter case, since reviews are subject to 'fake reviews,' 'deceptive reviews,' 'deceptive opinion spam,' 'review spam,' or 'review fraud,' i.e., practices aimed at artificially altering the real outcome of reviews, one of several mitigation methods should be established. In [21] the authors give an example of a solution to this issue.

Thus, data consumers should opt for adequate data cleaning and validation processing depending on the rating of the data source and the context of use. Of course, private and validated data would have a high rating but could be costly and come with a restrictive license. Open Data would have a high rating and be free with a permissive license but would not cover all needs.

Low-rated data could need a cleaning and validation process but could be free, have a permissive license, and cover many domains. All of them should be available from Data Spaces.

6.2 Humans and Machine Accessibility

As mentioned before, data providers should take care of making their data accessible.

In terms of Open Data, current examples are Spatial Data Infrastructures (SDI). In [22] an SDI is defined as “a framework of spatial data, metadata, users, and tools that are interactively connected in order to use spatial data in an efficient and flexible way”. As an example, the INSPIRE Directive [23] aims to create a European Union SDI for the purpose of environmental policies or activities that may have an impact on the environment. This European SDI should enable the sharing of environmental spatial information among governmental organizations and facilitate public access to spatial information throughout Europe. Unfortunately, in practice, the derived national platforms are not easily used by non-experts because of the methods and technologies adopted.

One of the reasons that most affects the usability of such solutions is the use of separate catalogs in which the metadata needed to understand the structure and content of datasets are described. Such catalogs are not searchable via Web search engines, and the user must first be familiar with their repositories, which is unsuitable for users who are not domain experts. The catalogs then provide instructions on how and where to access the actual datasets, which are often found in other repositories. Once the user has located the dataset of interest, he or she can download it as a file. In the best cases, the data are also offered through traditional Web services with which specially programmed applications can interface.

Instead, the proposal is that Data Spaces should be designed so that even users who are not experts in the field can find and use the data, just as they do with other documents available on the Internet, while applications can discover the datasets and use them mostly on their own.

An example can be found on the Internet by turning to private companies that offer data via Web Application Programming Interfaces (APIs). A Web API allows external applications to access datasets using common web protocols and technologies. Furthermore, access is controlled according to the assigned license and in an authenticated manner, so that the identity of the user is known.

Most importantly, the API documentation is embedded in the API itself, is human- and machine-readable, and can be searched by standard Web search engines. This means that people can search for data sources using common search engines, read the documentation, and learn about metadata, data structure, and retrieval methods. Any application that decides to use the API can do the same automatically.
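
As a hedged illustration of this idea, the sketch below uses FastAPI (one possible framework; any stack that publishes an OpenAPI document would serve) to expose a hypothetical air-quality dataset together with its own human- and machine-readable documentation. The dataset, path, and metadata are assumptions for illustration.

```python
# Sketch of a self-describing Web API: the data and its documentation are
# served by the same application, readable by both people and machines.
from fastapi import FastAPI

app = FastAPI(
    title="City Air Quality Data",
    description="Hourly PM10 readings; license CC-BY-4.0.",  # human-readable docs
)

# Hypothetical sample data standing in for a real sensor-backed store.
SAMPLE = [{"station": "centre", "hour": "2023-05-01T10:00", "pm10": 21.4}]


@app.get("/air-quality", summary="Hourly PM10 readings")
def air_quality(station: str | None = None) -> list[dict]:
    """Return readings, optionally filtered by station."""
    return [r for r in SAMPLE if station is None or r["station"] == station]

# Running the app (e.g., `uvicorn module:app`, where `module` is the file name)
# exposes the data at /air-quality, human-readable documentation at /docs, and
# a machine-readable OpenAPI description at /openapi.json, which standard Web
# tooling and crawlers can consume.
```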