1 Introduction

In recent years, the amount of information accessible through the World Wide Web (in the following “the Web”) has been growing exponentially. More and more people across the globe are turning to the Web to obtain the information they require for personal, social, or professional reasons. Thus, access to information via the Web has become a prerequisite for conducting many kinds of transactions, and it seems only slightly exaggerated to state that “everything is on the Web.”

However, the Web is not a structured medium in which extant information is labelled or classified for easy retrieval. It is not a database in which each piece of data is set in a field and then defined with attributes and properties. Instead, the Web is a heterogeneous assortment of webpages that present manifold kinds of information in very diverse ways. Its enormous wealth of content requires navigation by means of hyperlinks. This digital environment gives precedence to style and appearance over the structure of data, and often displays content without context or hierarchy. Nevertheless, the inherent decontextualization of the Web is partially remedied by its increasing use as a dissemination channel for data that are suitable for research purposes.

The concept of open data refers to the fact that many organizations are making a vast range of big and complex datasets freely available with a view to spurring the development of new applications and services, as well as to fostering research on globally-pressing issues such as climate change, mobility patterns, etc. As a side-effect of this unprecedented possibility of sharing and reusing data, the scholarly community is demanding more transparency in scientific research and a higher degree of availability of research data that permit the checking, and eventually the reproduction, of scientific advances more easily. Ultimately, these developments have contributed to a more general call for “open science” (Bartling & Friesike, 2014).

Access to this enormous volume of data requires specific skills that enable the location, extraction, processing, and analysis of large datasets of different types from different sources. Specific data management capacities are necessary for putting structured datasets to use for research purposes and for initiating new research approaches to unstructured data, for example, the data embedded in websites. To highlight two crucial examples, web scraping is a data extraction technique that harvests plain text from patterns in webpage source code and stores it in structured files or relational databases, whereas Application Programming Interfaces (APIs) enable the development of scripts that automatically retrieve data from a website.

Such data management competencies are especially beneficial for migration studies, a field of research that deals with complex, large-scale patterns of human mobility and interaction, which oftentimes elude timely, encompassing, and rigorous analysis by the traditional instruments of empirical social research (Dodge, 2019; Giannotti et al., 2016; Sirbu et al., 2020). The varied assortment of administrative records, censuses, and surveys that has dominated the field for decades does not guarantee the availability of reliable, complete, comparable, timely, and disaggregated data on migrant stocks and flows, let alone the patterns of interactions with autochthonous populations and the manifold processes of economic, social, and cultural integration. Notorious issues include diverging definitions of key concepts, insufficient periodicity of sources, insufficient size of (sub-)samples, coverage and/or participation biases, and event-based administrative records that impede cross-referencing, among others (Kupiszewska et al., 2010; Font & Méndez, 2013).

Such difficulties and shortcomings suggest that an enormous potential exists for innovative, large-scale datasets to generate relevant knowledge in the migration studies field. To mention some examples, mobile phones’ Call Detail Records (CDRs) can track seasonal mobility between and within countries (Ahas et al., 2007; Deville et al., 2014) as well as post-disaster and war-induced displacements (Wilson et al., 2016; Salah et al., 2019); search engine terms can contribute to forecasting migration trends (Wladyka, 2017; Lin et al., 2019); social network messages can help with the analysis of the perceptions of, and opinions about, foreign-born populations (Greco et al., 2017); and geo-located social media data can provide information about personal (Twitter, Facebook) or professional movements (LinkedIn) (Barslund & Busse, 2016; Spyratos et al., 2019; State et al., 2014; Zagheni et al., 2014). Often, the insights emerging from such unconventional data sources can be validated with traditional sources, albeit typically with lower granularity and timeliness.

These examples illustrate how the data revolution is being seized by relevant parts of the migration research community. However, many migration scholars lack the necessary skills for locating, extracting, and managing the large volumes of raw data available on the Web. This chapter provides a glimpse into the new universe of opportunities granted to migration researchers by the exponential increase of web-based data on human mobility, behavior patterns, relationships, preferences, and opinions. While obviously not a substitute for proper data-management training, this chapter seeks to encourage migration researchers to build the necessary skills, individually or at a team level. The chapter has been co-authored by a data scientist who is a novice regarding migration studies (JLO) and a migration scholar who is a novice regarding data science (SR). We hope that the result of this trans-disciplinary co-authorship, a process that taught the two of us a lot, proves useful to the migration research community.

We begin by introducing several key concepts regarding the extraction and management of web information (“Key concepts” Section). Next, we describe selected services and repositories specializing in demography and migration studies that supply open data for research projects, as well as more generalist data sources (“Data sources” Section). The “Data extraction” Section presents various data extraction techniques that make it possible to obtain data from structured and unstructured Web sources, transform textual information into tabulated data, and carry out queries of linked data repositories. We conclude with an outlook on the promise the Web offers migration scholars, as well as related challenges.

2 Key Concepts

In this section, in addition to the central concept of big data, we introduce the related concepts of open data and linked data.

2.1 Big Data

The term big data, which began to emerge at the turn of the twenty-first century, reflects the exponential growth of digital data due to the increasingly ubiquitous deployment of sensor-equipped devices such as mobile phones, the generation of ever more copious metadata such as software logs or internet clicks and, more recently, the inter-connection of all kinds of gadgets, appliances, and procedures (“Internet of Things”). Rather than referring to any particular size of datasets, big data refers to amounts of data so massive that they transcend the capacity of customary processing facilities and techniques. Thus, in terms of actual data volume, the notion of what counts as “big” is highly dynamic, given that computing systems also have evolved at a remarkably fast pace over these past decades, albeit with an increasing difficulty of living up to “Moore’s Law” that processing power will double every 18 months or so. Therefore, any meaningful definition of the term has to refer to advanced techniques of data management, processing, and analysis (Mayer-Schönberger & Cukier, 2013), rather than data volume per se.

In addition to the astounding growth of data generation and data processing capacities, big data has been favored decisively by the unprecedented interoperability provided by the Internet. Access to a multiple and varied range of data types through the Web enables the integration and analysis of data dispersed across diverse sources. For example, to get to know their clients, a bank does not rely only on monitoring their account activity, but also can combine this information with streams of data on their preferences in social networks, their shopping patterns, their location, etc. Apart from their innumerable commercial applications, such data offer new analytical possibilities for understanding all kinds of social behavior and phenomena, such as transportation needs, climate change, and the risk of exposure to pathogens, among many others.

The initial characterizations of the big data concept (Laney, 2001) used to rely on the “three Vs”, i.e., volume (huge), velocity (near “real time”), and variety (comprising non-structured data with temporal and spatial references). Additional traits of fundamental importance include exhaustivity (data captured from entire populations rather than samples), flexibility (favoring scalability and the addition of new sources), high resolution (descending to fine-grained detail to permit deep analysis), and relationality (defining universal IDs that enable cross-referencing data from different sources). The point of all these characteristics is captured by two further “Vs”: veracity and value, i.e., the overall quality of the data and hence its utility.

2.2 Open Data

Digital data are collected for all kinds of purposes by an ever-increasing range of organizations and institutions—commercial, governmental, and not-for-profit. Depending on the nature of these data and the objectives pursued in gathering them, they may or may not be made publicly available by default. For obvious reasons, data entailing information on any particular individual will not be eligible for public release on privacy grounds, nor will the data that affect national security. Private enterprises often prefer to restrict access to data that may confer a competitive advantage, unless potential higher-order gains can be obtained by granting free access.

Similar to the movements in the domains of software (Open Source) or scientific publishing (Open Access), the Open Data movement demands that data be freely available and re-usable without prior permission (Charalabidis et al., 2018). This demand refers primarily to the data produced by public administrations and governmental organizations, due to the absence of commercial interests and the intrinsic value of transparency in democratic systems of government. Since the benefits of publishing specific kinds of data may be obvious (as is the case with epidemiological information or the spending details of public budgets) or largely unpredictable (for example, when information on specific resources is converted into mobile apps targeting people with special needs), the baseline demand of the Open Data movement is that all non-personal data be freely accessible.

The Open Data concept is usually associated with three interrelated characteristics, the first of which regards the full availability of, and access to, complete datasets (rather than samples) preferably via download from the Internet. The idea is to make the data available free of charge; any fees must be limited to reasonable reproduction costs and be well justified. Second, the data should be re-usable and redistributable both from a legal viewpoint (cf. user agreement) and in terms of technical specifications; ideally, this reuse and redistribution should include the possibility of incorporation in other datasets. Third, the notion of universal participation alludes to the absence of restrictions regarding the kinds of data (re-)users and their fields of endeavor, be they commercial, educational, scientific, or otherwise.

The combination of these traits basically amounts to conceiving digital data as a public good. To unfold their virtues, the data need to be ready for processing by different hardware and software environments (interoperability) and also favor aggregation with other data (linked data).

2.3 Linked Data

The third crucial concept concerning data management, linked data, refers to the possibility of interconnecting and sharing data between different sources in an open and transparent way (Heath & Bizer, 2011). This concept was first coined by the British computer scientist Tim Berners-Lee—the inventor of the World Wide Web—to allude to a World Wide Web Consortium (W3C) project about developing a technology for linking data from different sources (Berners-Lee, 2006). Faced with the problem of different web sites offering information about organizations, events, objects, and persons in ways that could not easily be linked to one another, the W3C proposed to establish a huge and dynamic data network that could be navigated with hyperlinks. Since then, the successful implementation of that proposal has enabled data repositories across the globe to group and connect information (for an illustrative example, see https://lod-cloud.net/clouds/geography-lod.svg, a network graph that shows the connections between data repositories).

Linked data technology defines each object through a uniform resource identifier (URI), a unique tag that, similar to a web address, enables the identification of an object in a data network. For example, the URI http://www.wikidata.org/entity/Q30 corresponds to the entity United States in Wikidata. While Web content is written in hypertext markup language (HTML), the linked-data Web operates according to specific protocols of its own for publishing and querying the data. We will postpone an introduction to querying to a later section but briefly introduce the resource description framework (RDF), a convention for publishing machine-readable data and linking them to other datasets (Curé & Blin, 2014).

RDF is not a computing language, but rather a framework in which other languages are written (Carroll & Stickler, 2004) through a process called serialization, i.e., a scheme for defining sets of basic information about the nature of an entity’s relations to other entities. The RDF structure is based on triples that comprise three fundamental elements: subject, predicate, and object. For example, a triple can express that the entity Q12418 (“Mona Lisa”) stands in a relationship of the type creator to the resource “Leonardo da Vinci.”
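Rendered in Turtle notation, such a triple might look as follows. This is a sketch modeled on the well-known W3C primer example; the Dublin Core creator property and the DBpedia identifier for Leonardo da Vinci are assumptions of the illustration.

```turtle
# Subject: the Wikidata entity for the Mona Lisa (Q12418)
# Predicate: the Dublin Core "creator" property (assumed for this sketch)
# Object: the DBpedia resource for Leonardo da Vinci (assumed for this sketch)
<http://www.wikidata.org/entity/Q12418>
    <http://purl.org/dc/terms/creator>
    <http://dbpedia.org/resource/Leonardo_da_Vinci> .
```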

The RDF data model epitomizes the mutual advantages of cooperation via common and open standards (Bergman, 2009). Since this model enables linking data from different platforms, it improves their interoperability; for example, the above triple connects objects deposited in three different locations. Also, the RDF syntax is unambiguous, and hence efficient, given that only one definition of each entity or relationship is provided. A third advantage of the RDF data model is scalability: there is no limitation regarding the volume of data that can be stored as linked data, or the number of repositories that can be interconnected in various ways.

3 Data Sources

This section reviews and describes some outstanding data sources for migration studies. Rather than personal preferences, our selection reflects objective criteria such as the volume of web traffic generated by these platforms and the data’s use in the scientific literature. However, it seems prudent to remark that these and other platforms’ data portfolios are subject to more or less continuous innovations and improvements. Also, we do not wish to suggest that the platforms presented here are the only ones useful for migration scholars, since our selection is just a glimpse of the wealth of the extant web resources, rather than an exhaustive compilation.

We distinguish between four categories of sources, which require different degrees of data management skills (ranging from basic to advanced) to make them useful for substantive research. First, the subsection “Specialized sites” presents specific portals that compile information on migration and related phenomena. These portals have the advantage of providing contextual analyses and graphs that make the data more understandable. However, in some cases, the data are not very fine-grained, and they tend to be heterogeneous and of diverse origin, which may decrease their rigor and reliability. Second, the portals covered in the “Generalist data portals” subsection provide data on a wide range of topics, including but not limited to migration-related facts and figures. The offerings of inter-governmental organizations such as the European Union, OECD, and the UN feature prominently in this category. Disaggregation usually is provided at the country and/or regional level, in line with each institution’s membership. Third, “data repositories” are platforms for distributing research data, mostly survey files uploaded by researchers to guarantee transparency, quality control by peers, and reproducibility, as well as to facilitate re-utilization for additional studies. Such data are appreciated because they are original, disaggregated to high geographical detail (counties, municipalities, etc.), and often dedicated to aspects (habits, beliefs, etc.) not easily found in official sources. However, these data sometimes are limited to small samples, narrow purposes, and incompatible formats. Fourth and finally, we also draw attention to a dedicated dataset search engine.

3.1 Specialized Sites

The three specialized migration-data platforms we describe here are characterized by providing open access, without registration, to their data in interchangeable formats. The platform World Pop provides geospatial information on demographic issues in low- and middle-income countries across three continents; the Migration Data Portal offers a starting point for understanding human movements at a global scale; and the Migration Policy Institute’s Migration Data Hub provides a general outlook on immigration to the United States.

World Pop (http://www.worldpop.org/) is the leading open-access platform for spatial demography, providing finely grained geospatial datasets on population growth, distribution, and characteristics in Central and South America, Africa, and Asia. Launched in 2013 as a platform integrating various specialized portals, World Pop basically pursues the mission of making top-of-the-line geospatial mapping available for the global South, where such data would otherwise be missing or insufficient. The ultimate aim of this portal is to foster scientific research and better-informed policy interventions regarding economic development, ecological sustainability, health care, and other issues. Its high-resolution spatial distributions, rigorous methodologies, and open-source documentation have been recognized as vital input for development projects. The provision of up-to-date information is an important goal, yet achieving it is occasionally hindered by time-lagged source data.

The World Pop portal integrates a wide range of sources, including censuses, surveys, satellite data, administrative statistics, mobile phone data, and others to produce fine-gridded maps of population distributions. It employs advanced data management techniques such as machine learning to disaggregate information regarding administrative units of varying and often excessive size to grid units of just 100 × 100 m. Building on the static snapshots of population distributions and characteristics at certain time points, the portal also elaborates high-resolution maps of population dynamics. Its current data line-up covers 11 substantive areas, which include spatial population distributions by continent and country, internal migration, global settlement growth, and a host of development indicators.

World Pop’s estimation and imputation procedures—documented extensively and made available as metadata—are in constant flux to seize on evolving technological options. World Pop’s data can be accessed in two ways: through a dedicated API (see Sect. 7.4.1) or directly by downloading datasets from the website.

The Migration Data Portal (http://migrationdataportal.org/) caters to a broad audience (including policy makers, statistics officers, journalists, and the general public) with a view to supporting evidence-based migration policy and contributing to a more facts-driven public debate on migration and its manifold effects. The portal was established with backing from the German government in the wake of the 2015 surge in refugee flows, and it is hosted at the Global Migration Data Analysis Centre (GMDAC), a Berlin-based research outfit belonging to the International Organization for Migration (IOM). It provides international migration data obtained from a range of sources with the stated aim of making international migration data and information more accessible, visible, and easier to understand.

Such emphasis on user-friendliness translates into a penchant for infographics, data sheets, and clickable maps with links to definitions and additional information. The portal’s “Data” Section offers dynamic access to dozens of indicators pertaining to a vast variety of thematic groupings such as stock and flow statistics, integration processes, and public opinion, among others. When selecting any particular indicator and geographic area, related values such as the highest- and lowest-scoring countries are displayed, and a timeline ranging from 2000 through to the present adapts to the periodicity of the respective source data. The “Themes” and “Resources” Sections provide various kinds of background information on measurement, data sources, context, and analysis of migration data. The portal also offers data regarding the United Nation’s Sustainable Development Goals.

While specialized scholars will find fault with some of the portal’s details, anyone aware of the challenges of integrating data from so many diverse sources, covering such a diverse range of aspects, cannot but admire the portal’s accomplishments. Researchers may especially savor offerings such as a searchable database of innovations in migration data and statistics.

Our third platform entry, the Migration Policy Institute (MPI), is a Washington-based think tank that aims to foster liberal (“pragmatic”) migration policies. Focusing mainly on North America and Europe, with special emphasis on the United States, the MPI seeks to engage policy-makers, economic stakeholders, the media, and the general public. It conducts research on migration management and integration policies, and strives to make its publications accessible to non-specialist audiences. The MPI website includes a Migration Data Hub (https://www.migrationpolicy.org/programs/migration-data-hub) that supplies tables, graphs, and maps with recent and historical data on migration flows and stocks, residence status, integration markers, economic performance, employment, and remittances. For the international migration research community, this portal’s data offerings are relevant mostly when fine-grained information on the US is needed, which can be provided at the state or even county level. Regarding a range of demographic, economic, and integration indicators, data output can be customized for user-defined comparisons between states.

3.2 Generalist Data Portals

In addition to specialist websites, an enormous wealth of migration-related information can be obtained from generalist data portals. The world’s primary international or inter-governmental organizations predominate in this category, thanks largely to their capacity to leverage the vast statistical input provided by their respective member states. The flip-side of this advantage is that each organization’s membership tends to define the geographical coverage of its data portfolio.

The European Union has merged all its data offerings into a single platform (http://data.europa.eu) that provides both metadata from public sector portals throughout Europe at any geographical level (from international to local) and datasets collected and published by European institutions, prominently including Eurostat but also many other EU agencies and organizations. This portal merits extensive exploration, since it constitutes a veritable dataset library of tens of thousands of entries. Downloads are facilitated in a vast range of formats.

The Organization for Economic Co-operation and Development (OECD) maintains three databases relevant for migration studies (https://www.oecd.org/migration/mig/oecdmigrationdatabases.htm). Its International Migration Database provides recent data and historic series of migration flows and stocks of foreign-born people and foreign nationals in OECD countries as well as data on acquisitions of nationality. The Database on Immigrants in OECD Countries (DIOC) includes comparative information on demographic and labor market characteristics of immigrants living in OECD countries and a number of non-OECD countries (DIOC extended or DIOC-E). Finally, its Indicators of Immigrant Integration cover employment, education and skills, social inclusion, civic engagement, and social cohesion. Data are displayed in user-defined tables and charts and can be downloaded.

The United Nations’ sprawling Internet presence includes tables, maps, and graphs based on estimates of international migration flows and migrant populations elaborated by the UN Department of Economic and Social Affairs (UN DESA). The bonus of global coverage comes at the price of varying definitions and data quality. Such issues have delayed the launch of the United Nations’ Global Migration Database (UNGMD) (https://population.un.org/unmigration/index_sql.aspx), which draws on data from about 200 countries. At the time of writing, this database was still being tested.

Special mention is due to Our World in Data (https://ourworldindata.org), which is a website that, thanks to the collaborative effort of numerous academics, delivers open-access data and analyses on a vast range of issues, including the root causes of international migrations such as global income inequality or population growth. This portal features concise contextualization and impactful data visualizations.

3.3 Data Repositories

Apart from the efforts to make the data generated by governmental organizations and public administrations widely accessible, the scientific community also has expressed an increasingly strong commitment to the public availability of research data. This concern is motivated, on one hand, by a quest for transparency, which demands that third parties be offered the opportunity to reproduce results to verify scientific discoveries or claims. On the other hand, a growing insistence has arisen that the very nature of scientific inquiry, being cumulative, demands open access to all the data used in research (Murray-Rust, 2008; Molloy, 2011). Both lines of reasoning are adopted increasingly by funding agencies and publication outlets, thus converting open data access into a requirement both for research projects to be financed, and their results to be published. This trend has led a number of different data repositories to flourish: scholars upload datasets and complementary information for other researchers, so they can freely reuse that data. Among the most important sites, by volume of uploaded datasets, are the Dryad Digital Repository, figshare, Harvard Dataverse, and Zenodo. The Registry of Research Data Repositories (http://www.re3data.org) database gathers the most extensive list of data repositories, arranged by type, language, country, and subject. At this time, the IOM’s aforementioned Migration Data Portal is the only specialized data repository for migration studies found in the Registry of Research Data Repositories.

Dataverse, repository management software designed at Harvard University, is one of two outstanding resources that should be highlighted. Initially conceived as a data repository of this particular institution, Harvard Dataverse (https://dataverse.harvard.edu/) has evolved into a platform that integrates repositories from all over the world that employ its software. Overall, at this time, Harvard Dataverse has gathered more than 114,000 datasets, about half of which belong to the Social Sciences, of which approximately 1600 datasets can be retrieved by a “migration” query. Note that all these numbers are increasing at an astonishing pace. Zenodo (https://zenodo.org/), another outstanding resource, was created in 2013, thanks to the collaboration between the research project OpenAIRE and CERN (the European Organization for Nuclear Research). This relatively novel, multidisciplinary repository is structured in some 7000 “communities,” each of which represents an organization, research group, or subject-matter; however, few such communities are related to migration studies. This repository, which currently offers more than 115,000 datasets, also is growing fast.

To complete this roundup of web resources for locating research-based datasets on migration, two recent initiatives—geographically centered on Europe—that favor expert-defined taxonomies over algorithms are worth mentioning. The Migration Research Hub (http://www.migrationresearch.com), which is sponsored by the research network IMISCOE, aims to become a platform for identifying migration-related expertise across a wide range of topics. Although academic publications account for most of its content, the database also contains references to hundreds of datasets. For its part, the EthMigSurveyDataHub (https://ethmigsurveydatahub.eu/) focuses specifically on improving the access, usability, and dissemination of survey data on the economic, social, and political integration of ethnic and migrant minorities. Currently, this project is developing online databases of such surveys, as well as of their questionnaire items.

3.4 Dataset Search Engine

To locate specific datasets across the growing and diverse range of extant data repositories, a specific search engine, Google Dataset Search (https://datasetsearch.research.google.com), is available. This search engine indexes datasets in many different formats, on the one condition that the metadata are written according to Google’s stated instructions (cf. schema.org). Query results are listed in the left-hand margin of the screen, while the right side displays detailed information about a chosen item, such as title, contents, link to the repository, last update, author, license, format, etc. Searches can be customized with various parameters. This recently launched service will prove enormously useful to researchers in any thematic domain, including migration studies.

4 Data Extraction

Many repositories do not require any particular extraction procedure: after downloading the data (in widely used formats such as .csv or .xlsx), a user just has to clean the file to retrieve the precise information required, or perhaps adapt the data to the processing system that they are using. In other cases, however, the platforms provide endpoints that automate data extraction and allow the development of specific applications for data analysis. In the following discussion, we briefly present three data extraction techniques: Application Programming Interfaces (APIs), web scraping, and the SPARQL language.

Each of these options has specific advantages for particular types of data. APIs and SPARQL are suitable for obtaining and analyzing structured data, such as governmental statistics, because these endpoints are commonly created by the data providers themselves for dissemination purposes. Both techniques are appropriate for large and frequently updated datasets. By contrast, scraping techniques are recommended for extracting the non-structured data available on the Web, such as textual information (e.g., public opinion blogs or media reports) or links on social network sites.

While we anticipate that the following section might strike some readers as somewhat more arcane than the preceding sections, we would like to stress that the procedures sketched here do not necessarily require prolonged training to be put to use. Indeed, a key goal of this chapter is to encourage migration scholars to become acquainted with these techniques (for further reading, see Salah et al., forthcoming).

4.1 Application Programming Interfaces (APIs)

In general terms, an application programming interface (API) refers to a set of functions and procedures that enable software to interact with other applications (Blokdyk, 2018). A Web API is designed specifically to provide direct access to web servers with a view to using their data in other applications in a massive and automated form. Web APIs are offered by the data provider, which means that data can only be accessed in this way if the server has implemented a public API; fortunately, this is the case with many of the data portals mentioned previously. Commonly, access is obtained through the HTTP protocol by specifying a URL that points to the route on the server where the data are stored. Normally, a data provider supplies a handbook or guide with information about the content, structure, and types of queries that its API supports.

An API can be accessed in two ways: by the representational state transfer (REST) procedure, on one hand, and the simple object access protocol (SOAP) method, on the other. REST uses different parameters in the data provider’s URL to retrieve and filter the needed information. This is the most widespread method because it grants more freedom when designing a query, enabling the selection of the exact elements required for a given research purpose. SOAP requests are not defined in terms of target URLs; rather, they send structured request messages, which may embed query languages such as SQL or SPARQL (see Sect. 7.4.3), to query the database.
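To make the REST approach concrete, the following minimal R sketch (using the httr package) sends a request whose URL parameters filter the returned data. The endpoint and query parameters are hypothetical placeholders, not those of any of the portals discussed above; the real values must be taken from the provider’s API documentation.

```r
# Minimal REST request sketch using the httr package.
# The endpoint and query parameters below are hypothetical placeholders.
library(httr)

response <- GET(
  url   = "https://api.example-provider.org/v1/migration/flows",  # hypothetical endpoint
  query = list(country = "ES", year = 2019, format = "json")      # filters passed as URL parameters
)

status_code(response)                                             # 200 indicates success
raw_json <- content(response, as = "text", encoding = "UTF-8")    # raw JSON payload
```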

APIs supply data in several formats that facilitate the understanding of the structure of the dataset and its subsequent processing and management. The most important data formats are JavaScript object notation (JSON), which is supported by the REST protocol, and extensible markup language (XML), which is employed by REST and SOAP.

JSON is an open data-interchange format that facilitates both reading by humans and parsing and generating by machines (Crockford, 2006). This format displays the information in a structured, user-friendly, and comprehensible way, which makes it possible to, for example, contract and expand items, depict hierarchical object structures, color objects by type, and filter the text. In addition, JSON is suitable even for non-structured data because it is not necessary to define fields and attributes in advance. These advantages have made JSON the most widespread format, and it is implemented in most extant Web APIs. However, the XML format remains common as well; it displays information in a structured form using marks that define each object and their associated attributes (Abiteboul et al., 2000). Similar to JSON, XML enables the contraction or expansion of hierarchical data trees. XPath is the language used for the extraction and processing of data in XML.
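As a brief, self-contained illustration of how the two formats are handled in practice, the following R sketch parses a tiny JSON document with the jsonlite package and queries a tiny XML fragment with an XPath expression via the xml2 package; both documents (and the figures in them) are invented solely for this example.

```r
# Parsing the two main API response formats in R.
# The JSON and XML snippets below are invented for illustration only.
library(jsonlite)
library(xml2)

json_txt <- '{"country": "Spain", "year": 2019, "immigrants": 750000}'
record   <- fromJSON(json_txt)              # returns a named list
record$immigrants                           # -> 750000

xml_txt <- "<flows><flow country='Spain' year='2019'>750000</flow></flows>"
doc  <- read_xml(xml_txt)
node <- xml_find_first(doc, "//flow[@country='Spain']")   # XPath expression
xml_text(node)                              # -> "750000"
```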

The use of APIs has important advantages for users and data providers. First and foremost, APIs permit access to huge data volumes automatically and rapidly. Second, they do so in ways that adapt to the requirements of the user’s data processing system by enabling structured queries and filters; by capturing only the specific data needed for a given purpose; and by saving time, storage space, and system capacity. Third, APIs facilitate access to real-time data: retrieval is possible as soon as the data are generated. This advantage is a huge bonus when compared to closed data files, especially with regard to short-lived or high-frequency data such as online social network streams. Finally, APIs favor free-flowing web traffic, since they are much less taxing in terms of file size than the full pages, images, and applications commonly disseminated on the Web, which reduces the risk of server saturation and failure.

However, APIs have some limitations that impede or restrict their use for research purposes. The principal problem is that some data providers, especially commercial ones that offer paid services on demand, do not allow API access to all the information displayed on their respective websites, since they consider some of that information to be sufficiently valuable to prevent it from being massively processed by third parties. For example, Facebook and Twitter only make a fraction of all their data freely available. Another limitation is that APIs require some programming skills. Although many sites provide an endpoint that helps with writing queries, it is sometimes necessary to know a programming language (JavaScript, SQL, Python, R, etc.) to design a query and store the data.

4.2 Web Scraping

Web scraping is a technique for extracting data from web pages. This is achieved by simulating the navigation of a web user and capturing the information displayed on the screen (Vanden Broucke & Baesens, 2018; Mitchell, 2018). Technically, this procedure extracts text patterns from the HTML structure of a web page and then creates a structured file (.csv, .xls, etc.) with the harvested data. Among the most popular applications for web scraping are Heritrix (Internet Archive), Import.io (a web-based solution), and Scrapy (Python). The main advantage of web scraping compared to API access is that the number and type of data retrievable are almost unlimited. Thus, this tool is recommended when API access is rated as too restrictive with respect to the research objectives being pursued.

In addition to the possibility of seeking information from specific and pre-identified websites, the Web also can be tracked automatically by following the network of links that connect each web page; the applications employed to achieve this tracking are variously called web crawlers, spiders, or bots. Such applications explore all the links they identify according to a range of parameters (links from specific sites, depth, content, etc.) and extract basic information about the webpages accessed (Olston & Najork, 2010). These instruments have long been used by search engines (Googlebot, Bingbot) to index content from the Web, as well as by webmasters for detecting broken links (link rot) and design errors. With respect to web scraping, the role of bots is to run the navigation, whereas the scraper extracts information from each visited page. This joint activity enables a massive extraction of data from large websites and an integration of information from various web pages.

The R language, which is widely used in the scientific community for statistical analysis and data processing, provides an easy option for creating a crawler, among other reasons because it is freely available and does not require advanced programming knowledge (Munzert et al., 2014; Aydin, 2018). The most important tool is Rcrawler, a package that permits the development of a crawler to extract information from the Web. This package includes a specific module for scraping data from patterns in HTML and Cascading Style Sheets (CSS) tags (Khalil & Fakir, 2017).

Rcrawler can be customized by defining different values for a range of arguments. For example, the following script tracks the United Nations’ site on Refugees and Migrants with depth 1 (only the main page of the site), 10 threads, a request delay of 2 seconds, and a timeout of 10 seconds. These parameters define the behavior of the crawler from a technical viewpoint.

  • Rcrawler(Website = "https://refugeesmigrants.un.org/", MaxDepth = 1, no_conn = 10, RequestsDelay = 2, Timeout = 10, urlregexfilter = "/refugees-compact/", KeywordsFilter = c("Syria"))

Rcrawler also permits the filtering of the crawling process by addressing and retrieving only pages that fit specific criteria. For example, the previous script surfs only on pages in the “Refugees” Section and extracts information only from pages that include the word Syria. However, this process only harvests information about web pages (title, url, inlinks, outlinks, etc.), so another module is necessary to extract text patterns from the content of these pages. ContentScraper is a function of Rcrawler that extracts text from the HTML by using XPath and CSS tags. In the following example, the script extracts the table “Number of first-time asylum applicants” from the Wikipedia entry “European migrant crisis.”
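A minimal sketch of such a call is shown below; the XPath expression is an illustrative guess (the table’s actual position in the page’s HTML must be verified), and only the Url and XpathPatterns arguments of ContentScraper are used.

```r
# Sketch of a ContentScraper call (Rcrawler package).
# The XPath expression is an illustrative guess and should be adapted to the
# page's current HTML structure before use.
library(Rcrawler)

asylum_table <- ContentScraper(
  Url = "https://en.wikipedia.org/wiki/European_migrant_crisis",
  XpathPatterns = "//table[caption[contains(., 'first-time asylum applicants')]]"
)
```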

The argument “XpathPatterns” indicates in which part of the HTML code the text to be extracted is located. Applications designed with R offer much more flexibility than any commercial bot: Rcrawler enables users to define each parameter with any value, thus focusing the process on the needed data. However, it requires some programming and is limited to the functionalities of each package; also, the data obtained need to be cleaned and processed before being analyzed.

While advantageous to the proficient user, the extraction of massive data from the Web has important legal and technical implications that require careful consideration. In many countries, legal constraints tend to be concerned with the subsequent use of data, rather than the extraction process as such (Kienle et al., 2004; Marres & Weltevrede, 2013). For example, United States case law considers data duplication and re-publishing to be legal (Feist Publications vs. Rural Telephone Service; Associated Press vs. Meltwater U.S. Holdings, Inc.) (Hamilton, 1990; Schonwald, 2014), while considering massive data extraction that inflicts commercial damage (eBay vs. Bidder’s Edge; Cvent, Inc. vs. Eventbrite, Inc.) to be illegal (Chang, 2001); copyrighted material remains protected in any case. In Europe, the General Data Protection Regulation (GDPR) limits the web scraping of EU residents’ personal data (email and physical addresses, full names, birthdates, etc.), even if publicly displayed on web pages, by requiring the affected subjects’ consent; importantly, it does not interfere with the extraction of non-personal data, and thus mostly affects the data extraction from online social networks. These rulings and regulations suggest that the automated extraction of non-personal data from websites is generally legal, especially when serving scholarly purposes.

However, web scraping can inflict functional damage on websites. Automated high-frequency, large-volume requests of web pages can occupy an excessive amount of bandwidth, thereby slowing down the service, impeding access by other users, or even triggering server failure, thus causing potentially serious economic harm. Also, web scraping may distort statistics about visits, downloads, likes, etc. Consequently, in their terms and conditions, some corporations and entities explicitly prohibit the scraping of their web resources.

Such problems can be largely prevented by following simple courtesy rules that permit data extraction in a respectful way. The most important rule is to identify any crawling/scraping process in the “user-agent” field of the application, stating that it is a robot and revealing its IP, which makes it possible to trace the person responsible in case of malfunction or misconduct. Second, appropriate time intervals (ideally several seconds) should be set between requests (many crawlers/scrapers include options to define the request rate). Also, the number of threads (simultaneous requests launched to the server) should be chosen conservatively. Lastly, the extraction period and corresponding use of bandwidth can be reduced by the efficient design of crawlers/scrapers, with a view to visiting only the pages necessary for achieving well-specified objectives. These criteria should be modulated according to the size and importance (i.e., number of daily visits) of a site: while goliaths such as Google, LinkedIn, or YouTube maintain high-volume servers that are barely affected by the activity of a bot, less well-resourced sites can collapse due to invasive crawling.
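By way of illustration, these courtesy rules translate directly into crawler settings. The following sketch uses Rcrawler arguments for that purpose; the user-agent string is an invented example, and the available argument names should be checked against the package version in use.

```r
# A "polite" crawl: identify the bot, obey robots.txt, limit concurrency,
# and pause between requests. The user-agent string is an invented example.
library(Rcrawler)

Rcrawler(
  Website       = "https://refugeesmigrants.un.org/",
  MaxDepth      = 1,
  Obeyrobots    = TRUE,                                             # respect the site's robots.txt
  Useragent     = "ResearchBot/0.1 (contact: researcher@university.example)",
  no_conn       = 2,                                                # few simultaneous connections
  RequestsDelay = 3                                                 # seconds between requests
)
```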

4.3 SPARQL Language

SPARQL is the recursive acronym of SPARQL Protocol and RDF Query Language. As its name indicates, it is a language designed to query data repositories that contain RDF data (Feigenbaum, 2009). The results of a query can be displayed in several formats: XML, JSON, RDF, and HTML. Different interfaces (YASGUI, Virtuoso, Stardog, SPARQL Playground, etc.) can be used to access a SPARQL endpoint and to help write the queries.

The main advantage of SPARQL is that the language is simple and easy to use once the RDF philosophy is understood. Another important benefit is that it enables the integration of different datasets in the same query by linking data held in different repositories. However, SPARQL has important drawbacks: the RDF model is not widely adopted, not all data repositories provide their data in RDF format, and it is necessary to be acquainted with the structure of a SPARQL query.

The SPARQL structure is defined by five elements: (1) a prefix declaration (PREFIX) of the URIs that will be used in the query if we want to abbreviate them, (2) a dataset definition (FROM) stating what RDF graphs are being queried, (3) one or several result clauses (SELECT, ASK, CONSTRUCT, DESCRIBE) identifying what information to return from the query, (4) a query pattern (WHERE) specifying what to query for in the underlying dataset, and (5) query modifiers that slice, order, and otherwise rearrange the query results; for example, a user can limit the number of results (LIMIT), rank them by some criteria (ORDER BY), or define from what point they are visualized (OFFSET).
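Put together, these five elements yield the generic query skeleton sketched below; the prefix, graph, and property names are placeholders rather than the identifiers of any real repository.

```sparql
# Generic SPARQL query skeleton; all IRIs below are placeholders.
PREFIX ex: <http://example.org/ontology#>      # (1) prefix declaration

SELECT ?item ?label                            # (3) result clause
FROM <http://example.org/dataset>              # (2) dataset definition (often optional)
WHERE {                                        # (4) query pattern
  ?item ex:label ?label .
}
ORDER BY ?label                                # (5) query modifiers
LIMIT 10
OFFSET 0
```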

Since the repository commonly establishes the RDF graph by default, the dataset definition is optional. The prefix declaration, in turn, is needed whenever abbreviated URIs are used in the query; SPARQL accepts as many prefixes as the user wants, even if they are not used in the query. To find the correct prefix for any URI, the user can employ prefix.cc, a search engine that retrieves the full URI of any prefix. Variables in query patterns are defined by placing a question mark before the corresponding name.

To illustrate, the following example selects two fields: Title and Title_Subject. First, it returns the articles in the category Human migration; next, it takes the title of each article; and then, the title of the subject.
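The query itself is not reproduced in the text; a sketch of such a query against the DBpedia endpoint (https://dbpedia.org/sparql) could read as follows, assuming the Dublin Core subject property and DBpedia’s category resource for Human migration.

```sparql
# Sketch of the query described above, written for the DBpedia endpoint.
# The property and category IRIs are assumptions of this reconstruction.
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbc:  <http://dbpedia.org/resource/Category:>

SELECT ?Title ?Title_Subject
WHERE {
  ?article dct:subject dbc:Human_migration .   # articles in the category Human migration
  ?article rdfs:label  ?Title .                # title of each article
  ?article dct:subject ?subject .
  ?subject rdfs:label  ?Title_Subject .        # title of the subject (category)
  FILTER (lang(?Title) = "en" && lang(?Title_Subject) = "en")
}
LIMIT 50
```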

In this example, a filter is set to limit the results to the English language (“en”).

5 Concluding Remarks

The terms big data, open data, and linked data allude to an ongoing transformation of scientific research (Crosas et al., 2015). Troves of previously unavailable data are opening up innovative research opportunities, improving the reproducibility of results, and interconnecting datasets across the globe. This data revolution contributes to making scientific research more collaborative and transparent than seemed conceivable just a few years ago. The center of gravity of the research process is inexorably being displaced from (relatively small-scale) data production toward (markedly large-scale) data extraction.

Although this data revolution does not mean that empirically minded social scientists will cease to conduct fieldwork of their own, it does mean (at least in our opinion) that the value of primary research data will be benchmarked increasingly against data collected by third parties via the Internet. Even conceding, as we are inclined to do, that large volumes of data do not necessarily convey rigorous knowledge, not to mention wisdom, it would be foolish to ignore the huge opportunities afforded to social researchers by the transformative leap of pervasive digitization with respect to the timeliness, cost, variety, versatility, comprehensiveness, and ubiquity of data. The Web provides access to real-time, free-of-charge, population-level, and finely grained data on an ever-expanding array of facets of social reality, everywhere on the globe. The ensuing research opportunities are especially obvious in the field of migration studies, given that, more often than not, migrant populations are “hard-to-reach” (Font & Méndez, 2013) with traditional research methods due to the combination of cultural diversity and geographic dispersion, on the one hand, and precarious administrative status and under-coverage by official sources, on the other.

In this chapter, we have reviewed some outstanding examples of the current line-up of web resources relevant for migration research, including specialized websites, generalist data portals, dataset repositories, and a dedicated dataset search engine. In addition to providing access to a continuously growing trove of data, these resources facilitate the sharing of our own primary data so as to make them reviewable and reusable. This snapshot should not be mistaken for a permanent inventory: since the Internet is in constant evolution, additional web resources for migration scholars are sure to emerge sooner rather than later, potentially eclipsing some of the offerings mentioned here. Thus, the search for data sources relevant to a specific research project constitutes a vital (if obvious) precondition for benefiting from the ongoing data revolution.

However, to seize the opportunities granted by the process of ever-more ubiquitous digitization, source identification is only the first step, since scholars also require some technical skills (Light et al., 2014). In most research institutions, data processing is not (yet) a task for specialized programmers, but rather a set of abilities that researchers need to develop to effectively manage their projects. In this chapter, we have sketched three of the most important ways of extracting data from the Web: APIs, web scraping, and SPARQL. More and more organizations (governments at all levels, intergovernmental and international institutions, for-profit companies, etc.) are creating open endpoints for accessing the data they produce, thereby generating new research possibilities. Presently, APIs are the best and most common mode of data provision, although the use of linked data technologies and SPARQL endpoints is becoming more frequent, mainly among governmental and statistical offices. Yet, in some cases, data are not easily accessible and web scraping techniques are necessary for obtaining the required information. This technique enables the harvesting of valuable non-structured data from different websites, and can adapt to any web structure, although it must be used responsibly to prevent disruption.

The risk of automated data requests overwhelming the capacity of target servers is one of the challenges arising from researchers’ piggybacking on the ongoing process of global digitization. A second challenge of paramount importance is the preservation of privacy and the protection of personal data. This is a serious problem given that vast amounts of personal information, including highly sensitive data, are easily accessible on social networks and other web platforms. In spite of regulations (GDPR) and case law that prohibit abusive scraping, the risk of security breaches is evident (Isaak & Hanna, 2018). The responsibility to protect non-public personal data by employing encryption and anonymization procedures falls primarily on platform owners and web managers. The GDPR forbids the scraping of publicly available personal data without subjects’ consent, which contrasts with a more laissez-faire approach in the US (cf. hiQ Labs, Inc. vs. LinkedIn Corp.). Regarding both server saturation and privacy protection, it seems fair to say that scholars’ web crawling poses much less of a challenge than web mining for commercial or political purposes. Still, the imperative of ethical conduct requires researchers to proactively prevent any harm that potentially derives from their data collections.

Third, the many advantages of web-based data extraction must not obscure the most basic of methodological precautions—not to confuse data coverage with truth. Any data category or research result contains some trace of its conditions of production, a context that inevitably shapes patterns of visibility, intelligibility, and knowability. Even with regard to population-level observational data, concepts such as coverage bias and selection bias continue to be pertinent: cases in point are the socio-demographically skewed distributions of body-sensor wearables and Internet penetration rates, as well as the digital divide between the global North and South. Since the increasing pervasiveness of digitization seems prone to breed hubris, we pointedly recommend that big-data analysts be humble instead.

Based on the expectation that these challenges will prove manageable, we would like to conclude by insisting on the strategic importance of migration scholars building, either individually or collectively, the skills necessary for successfully navigating this emerging new world of big, open, and linked data.