1 Introduction

Global climate change is one of the greatest challenges to achieving sustainability goals. Transportation accounts for a fourth of total CO2 emissions [2], so transforming this sector plays a significant role. Diverse transport policies need to be assessed for their effectiveness, sustainability, and feasibility before being implemented in real life. Estimating mobility behaviour is crucial for developing tools and simulation environments that enable decision-makers to properly assess policies in advance of their implementation. Nevertheless, transport user behaviour is complex and can adapt rapidly to different trends and events, such as the pandemic [24].

Traditional methods for capturing these behaviours are travel surveys and travel diaries. These face several challenges and potential data inaccuracies that can impact the performance of transport models [38], as the data might not reflect the real behaviour of the population [82]. Major challenges include low response and completion rates [79, 93] and underestimation of short trips [13, 94]. Some of the reasons might be associated with the survey duration, forgetfulness of respondents, selective omission of some trips, or incorrect understanding of trip or activity definitions [12, 38, 79, 94], as well as difficulties in reaching potential respondents and their unwillingness to participate, which could be linked to the increasing screening of marketing calls [86].

New technologies and other data sources may not only improve the quality and quantity of the data collected by traditional travel survey methods, but also increase the variety of the data and provide data over a longer period [16, 96]. Social media platforms such as Twitter and Instagram have emerged as significant sources of location-based information due to their geo-tagging capabilities for user posts. Although this data is presently available for research, its future accessibility is uncertain due to proprietary business strategies and general data protection regulations. Additionally, the data relies on user-generated posts, which may not represent all demographic groups and can be challenging to use for identifying individual trips. Nonetheless, Twitter has become a popular dataset for researchers, being a source for behaviour analysis, opinion mining, trend tracking [92], and sentiment analysis [32]. Twitter has previously been used to analyse population density, for example by estimating the population distribution throughout the day [20] or in relation to land use [30, 47]. Luo et al. [59] related spatiotemporal features to demographic information, and Shelton et al. [72] observed socio-spatial inequalities among users. Twitter data has also been used to complement other data sources to predict large-scale human mobility [83].

Several countries are currently encountering difficulties with their traditional travel surveys, and there has been limited research on the use of Twitter data as a complementary source of information. This paper aims to bridge this gap by exploring the feasibility and reliability of leveraging large-scale Twitter data to analyse human mobility patterns. Over 12 million georeferenced Tweets from Norway, spanning from 2012 to 2022, were analysed to estimate origin–destination (OD) trips between municipalities. These estimates were then compared with data from traditional Norwegian travel surveys to assess the potential of Twitter data as a supplementary resource, contributing valuable insights into the integration of social media data with traditional transportation research methodologies.

A more detailed literature review is included in Sect. 2. The data sources, including their collection and cleaning, are described in Sect. 3, and the methodology for processing and analysing the data in Sect. 4. The results are gathered in Sect. 5, followed by the discussion and implications in Sect. 6. Finally, a short summary with the main conclusions is given in Sect. 7.

2 Literature review

In recent years, the drawbacks of traditional travel surveys have become more evident, highlighting the need to seek alternative data sources for understanding travel behaviour [71]. Traditional surveys, typically based on self-reported data gathered through telephone or computer-assisted interviews, encounter numerous issues such as decreasing response rates, recall bias, and high costs [82]. Integrating other data sources with traditional surveys can enhance the overall quality and depth of travel behaviour research, leading to better-informed transportation planning and policy decisions, but the challenge lies in digitalizing and consolidating data from multiple sources for effective exploitation. Liu et al. [52, 53] emphasised that big data must be used carefully due to challenges related to unrepresentativeness, inconsistency, and unreliability. Several big data sources could be further explored. Li et al. [50] divided them into three main types depending on how the data is generated: from transactions, from devices, or from users.

Transaction data, such as web search data or bank transactions, is of limited use for travel behaviour due to privacy policies. However, electronic fare payment systems in public transportation might be used for estimating travel behaviour [5]. A necessary assumption is that each card is used by a single person, which might not be the case; this was only tested by Chu and Chapleau [18]. Hussain et al. [43] used these data for estimating OD matrices, although some shortcomings were stated that could be overcome by integrating several data sources. Unlike with traditional travel surveys, it is challenging to obtain sociodemographic data from transaction data, as well as reliable data on travel demand for different transport modes. Nevertheless, its suitability for model calibration could be explored.

Several types of device data have been explored to complement traditional travel surveys, such as data generated by the Global Positioning System (GPS), sensors, smartphones, or mobile roaming.

GPS-based surveys might have the potential to replace or supplement traditional methods [16], which may enable large-scale surveys at lower cost [9] and provide a more flexible method to capture rapid behaviour changes [38]. GPS positioning captures the spatio-temporal movements of travellers with higher precision [13], which reduces the underreporting of short trips seen in traditional travel diaries [64, 73]. Rasmussen et al. [69] derived other trip attributes, such as trip purpose, with high accuracy. Nevertheless, people may forget to carry the device, and there might be signal losses due to obstructions between the device and the satellites, for instance underground. Moreover, the travel attributes depend to a large extent on the post-processing [76], although the underlying assumptions might be validated by recall surveys [54].

Other data types are automatic vehicle location, automatic passenger counting, and traffic counts. Automatic vehicle location uses GPS to record the position of vehicles in the network in real time [58]. The tracked vehicles could belong to private owners, freight companies, or public transport companies. This data is not linked to a person; however, Chapleau et al. [15] highlighted that it could complement traditional travel surveys, especially in the context of estimating public transportation demand. Within this setting, automatic passenger counting is also a data source to be considered. This sensor technology counts the number of passengers, mainly those boarding and alighting at each stop. Nonetheless, sociodemographic information is also missing [31], so these data could be more useful for model calibration than as a complement to traditional travel surveys.

Traffic count data refer to the number of vehicles, which could be aggregated by length, passing a point during a specified period of time. The technology to capture these data could be sensors or video recordings. This data has several limitations when compared to travel surveys, as it does not refer to a unique person and the complete trip patterns are not known, since the data is only associated with a point. Although numerous studies have concentrated on optimizing the placement of traffic counts to estimate origin–destination (OD) patterns, such as Fu et al. [28], the topic is still open to further research. Nevertheless, its use for model calibration is widely recognised [42].

Bluetooth technology, using fixed or mobile sensors, might be used to get insights into travel behaviour. Some examples are Mei et al. [61], who used these data for estimating travel times, Abedi et al. [1], who estimated spatio-temporal movements of cyclists and pedestrians in combination with other data sources, and Yang and Wu [97], who estimated travel mode, although with some limitations. The lack of personal information, trip patterns, and transport mode reduces its use for travel demand estimation or in combination with travel surveys, making it more relevant for specific model calibrations.

Today’s technology allows collecting travel data using smartphones. A rich data set can be derived and computed from multiple built-in sensors, such as motion sensors (accelerometer, gravity sensor and gyroscope), environmental sensors (barometer, photometer and thermometer) and position sensors (GPS and magnetometer) [6]. Smartphone applications may be divided into two main types, active or passive. The former requires the user to interact, i.e. to select the start and end of each trip, as well as the mode, and potentially the purpose. Limitations are that users forget to activate and deactivate the application before and after the trip [96], and that it is more time consuming. A passive application does not require interaction, as it runs in the background of the phone. A set of algorithms automatically detects trips and modes [9, 16]. Ferrer and Ruiz [26] detected travel modes using raw accelerometer data, with over 89% match for all modes. Alexander et al. [3, 4] showed representative results for daily origin–destination matrices by purpose. An advantage is that the loss of GPS signal in urban or indoor areas can be overcome by using the accelerometer or connections to Wi-Fi access points [26, 51], but the increased sensor usage leads to high battery consumption [51, 96]. Discussions are being held in different countries to assess the potential of replacing the data collection of traditional travel surveys with smartphone applications; however, there are concerns related to lack of standardization and reproducibility [7]. Additionally, passive tracking of people’s behaviour introduces privacy concerns that may set restrictions on the survey design. There is also scepticism about participating among certain groups of people, which might bias representativity [81].

Another example of a data source is cellular network signalling, which might provide more information in terms of sample variety and duration. It is based on cell tower triangulation from call detail records (CDR) from any telephone type [34]. With CDR data there is no need for users to do anything, so it is the most battery-efficient method. Previous research showed the possibility of identifying movements (origin–destination) [27, 85], transport modes with a precision between 80 and 97% [96], and activities [10] to feed directly into transport models. Bachir et al. [8] estimated travel mode and OD trips, which were validated against traditional travel surveys. On the other hand, Šulíková et al. [80] explored this data source to complement data from the Slovakian traditional travel survey for transport modelling purposes; however, several challenges rule out this option. Individual trips cannot be tracked under the European General Data Protection Regulation (GDPR). In addition, some countries might have stricter rules at national level: in Norway, each mobile ID is renamed every day, making it very difficult to detect work or residential locations [23]. Moreover, the location precision depends on the tower density, making the method less suitable for rural areas [51].

User-generated data could be online photo data or online textual data. People’s movements can be extracted from their photo posts on Flickr [11]; this platform is especially interesting for tourist behaviour [95], including visited places, crowded areas, or trajectories [22, 40, 55, 99]. Instagram was also used to identify the most visited places [35]; however, this data is no longer openly available, which reduces its research interest. Similarly, Panoramio was scarcely explored when it was active [44].

Georeferenced Tweets represent a powerful and high-quality data source to gain a new perspective for estimating mobility patterns [36], and are also valuable for continuous monitoring and trend detection [98]. These can be place-referenced or coordinate-referenced Tweets. The former refer to places at different levels (municipality, city or town, or neighbourhood), whilst the latter can have a precision down to 5 m under open sky [89]. The granularity of the Tweets makes it possible to observe not only residence and work locations but also visited places or specific routes [47]. Lenormand et al. [49] compared the spatiotemporal distribution of people and individual mobility patterns using data from Twitter, cell phones, and census, and concluded that the three data sources are feasibly interchangeable. Work by Kurkcu et al. [46] and Lee et al. [48] comparing the mobility patterns to travel surveys also confirmed their similarities. Previous research on mobility patterns from Twitter data has estimated origin–destination mobility flows [29, 52, 53], the next position in human trajectories [21], traffic events [68], preferred visited places [45], tourist flows [19], mobility patterns and dynamics in retail locations [57], mobility patterns between residence locations and public spaces in a medium-size city [70], differences in mobility patterns between visitors and residents [56], commuting patterns [60, 65], and mobility dynamics before and after the pandemic [41, 74, 98].

Recent literature highlights the potential of georeferenced Tweets for analysing travel behaviour, yet few studies have investigated this data as a complement to traditional travel surveys. This paper aims to address this gap by exploring the feasibility of integrating user-generated data with conventional travel surveys to enhance the estimation of travel behaviour and improve the development of transport models.

3 Data

The geographical focus of this research was Norway. Although Twitter data is globally accessible, data pertaining to national travel surveys is restricted due to general data protection regulations. For this study, data from the Norwegian Travel Survey was made available. This section describes the two datasets used in this research: the National Travel Survey in Sect. 3.1 and the Twitter data in Sect. 3.2, including the data collection and cleaning process.

3.1 National travel survey

The transport pattern data in Norway is mainly collected through computer-assisted telephone interviews (CATI). The survey is distributed every 4 years among a population sample that is representative in terms of sociodemographic features. Since 1985, the response rate has dropped from 77 to 20% [37]. Wilson [93] found similar decreasing response rates in the traditional data collection methods of other national household surveys. The total numbers of respondents for the national travel survey for 2014 and 2019 were 61,314 and 88,548, respectively (Hjorthol et al. [37]; Grue et al. [33]). The next Norwegian national travel survey will probably be conducted using CATI and computer-assisted web interviews (CAWI), introducing new challenges. With web interviews there is no interaction with an interviewer, so it becomes more important that participants understand the definitions from written explanations. Despite that, Christiansen et al. [17] found an increase in the short trips reported with this method. However, the low response rate, amongst other limitations, may not be overcome using this method.

The information collected through the travel survey is divided into 8 sections: (1) residence location, (2) access to different transport modes, (3) job/study information, (4) short trips, (5) long trips, (6) commuting trips, (7) family structure and home options regarding parking and public transport availability, and (8) sociodemographic information.

Each respondent must describe all the short trips (less than 100 km) made on the day before the interview, including origin, destination, purpose, transport mode, number of people travelling together, access to a car, and public transport card. In addition, respondents report how often per week they use the different transport modes in the current season. For long trips (over 100 km, or to and from abroad), each respondent states the number of such trips in the last 30 days. The most recent long trip is described in more detail, including day of the week, purpose, transport mode, origin, destination, number of people travelling together, number of overnight stays, type of accommodation, payer of the trip, frequency of long trips due to work, and some characteristics of these trips [33, 37].

3.2 Twitter data

The Twitter streaming Application Programming Interface (API) [87] was used to collect all georeferenced Tweets posted in Norway from 2012 (the earliest available Tweets in the API) to 2022. This period was selected in order to obtain information from several days to compensate for the potential spatial sparsity of the sample [39].

The total dataset from January 2012 to December 2022 consisted of 12,727,651 Tweets generated by 224,096 unique users. The characteristics of the extracted data are as shown in Table 1.

Table 1 Characteristics of extracted data from the Twitter streaming API

Nevertheless, some of the Tweets did not come from actual people using this social media platform as intended. As already identified in the literature, repeated Tweets might be spam [36, 91]. The number of Tweets per user could also indicate potential fake accounts. Lansley and Longley [47] set a maximum of 3000 Tweets in 1 year, whilst Osorio-Arjona and García-Palomares [65] set the limit at 1000 Tweets in 2 years. A minimum number of Tweets was also considered in previous studies, although the limits vary to a great extent: 2.5 Tweets per day [70] or 5 Tweets during a period of 2 years [65]. In relation to spatial information, users that did not move [65] and Tweets with uncertain coordinates [47] were also removed in previous studies focusing on identifying mobility patterns.

The data cleaning process for our dataset is summarized in Table 2. After the process, users who still had more than 1500 Tweets per year (less than 0.1% of the total) were closely inspected to identify potential fake accounts that could trigger unrealistic mobility patterns; however, these accounts turned out to belong to real users.
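
The inspection of high-volume users described above starts from a count of Tweets per user and year. The following is only a minimal sketch of such a count, not the cleaning pipeline summarized in Table 2; the function name, the input layout of (user, year) pairs, and the toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def high_volume_users(tweets: Iterable[Tuple[str, int]],
                      max_per_year: int = 1500) -> List[Tuple[str, int]]:
    """Return the (user_id, year) pairs with more than max_per_year tweets.

    Each input element is one tweet, given as a (user_id, year) pair.
    The default threshold is the 1500 tweets per year mentioned in the text.
    """
    counts = Counter(tweets)
    return sorted(key for key, n in counts.items() if n > max_per_year)


# Toy data (hypothetical): user "a" posts 4 tweets in 2014, user "b" posts 2.
sample = [("a", 2014)] * 4 + [("b", 2014)] * 2
flagged = high_volume_users(sample, max_per_year=3)
```

The flagged pairs are only candidates for manual inspection; the sketch does not decide whether an account is fake.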

Table 2 Data cleaning process in relation to number of Tweets and users (2012–2022)

Each tweet has a limited number of characters. Thus, users might need to post multiple consecutive Tweets to express their thoughts. The final dataset consisted of 92,785 users.

4 Methodology

This section describes the processing of the Twitter data, including the estimation of the location of the Tweets at municipality level (Sect. 4.1), the trip patterns (Sect. 4.2), and the demographic information of the Twitter users (Sect. 4.3). Additionally, the residence location of the Twitter users is validated against the data from the National Travel Survey (Sect. 4.4).

4.1 User location

Coordinate-referenced Tweets provide the location of the user at the time the tweet is posted. There are two types of locations on Tweets: exact coordinates, or bounding box coordinates, i.e., the tweet was posted within the borders of a polygon area. Both types are represented in Fig. 1. Tweets with exact coordinates accounted for around 60% of the Tweets until Twitter changed its policy on sharing spatial information in 2015 [14], and for around 10% after that.

Fig. 1 Example of georeferenced Tweets with exact and bounding box coordinates (background map source: OpenStreetMap)

Tweets’ locations were assigned to municipalities for further estimation of the mobility patterns. Tweets with exact coordinates were mapped to the municipality that contained them. In the bounding box cases, each nearby municipality (m) was given a match score for a given bounding box (bb), score = intersection(bb, m)/union(bb, m), and the municipality with the highest score was picked. Using this score instead of a simple overlap test solved some challenges for concave municipality shapes and even for municipalities that surround others.
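
The match score above is the ratio of the intersection area to the union area of the two shapes. The sketch below illustrates the idea under a strong simplifying assumption: both the bounding box and the municipalities are treated as axis-aligned rectangles (min_x, min_y, max_x, max_y), whereas real municipality borders are polygons and would need a geometry library. All names and coordinates are hypothetical.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def area(r: Rect) -> float:
    """Area of a rectangle; zero if the rectangle is empty or inverted."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def match_score(bb: Rect, m: Rect) -> float:
    """score = intersection(bb, m) / union(bb, m), on rectangles."""
    inter = area((max(bb[0], m[0]), max(bb[1], m[1]),
                  min(bb[2], m[2]), min(bb[3], m[3])))
    union = area(bb) + area(m) - inter
    return inter / union if union > 0 else 0.0


def best_municipality(bb: Rect,
                      municipalities: Dict[str, Rect]) -> Tuple[Optional[str], float]:
    """Return the municipality with the highest match score and that score."""
    best, best_score = None, 0.0
    for name, shape in municipalities.items():
        s = match_score(bb, shape)
        if s > best_score:
            best, best_score = name, s
    return best, best_score


# Toy example: large municipality "A" encloses the small municipality "B".
municipalities = {"A": (0.0, 0.0, 10.0, 10.0), "B": (4.0, 4.0, 6.0, 6.0)}
tweet_bb = (4.0, 4.0, 6.0, 7.0)
name, score = best_municipality(tweet_bb, municipalities)
```

In this toy case the bounding box overlaps both rectangles, so a simple overlap test would be ambiguous, while the score favours "B" because the box and "B" have nearly the same extent.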

A random sample was taken to verify the matching between the bounding box and the municipality. This was possible as some Tweets were also place-referenced, which could be a town or a city, within a municipality. Tweets with an accuracy of less than 0.3 out of 1 were disregarded for the estimation of the mobility patterns, corresponding to 1.5% of the sample.

Several studies apply similar but slightly different methods to detect the origin and destination of trips. Some studies use frequency counts to identify the home or work location of the user, e.g. the most frequent tweet location as ‘home’ and the second most frequent location as ‘work’ [60], or a combination of frequency and temporal (day and night) filtering [4, 67]. In this study, we defined the night as between 21:00 and 07:00, and identified the residence location at municipality level as the location of most of a user’s Tweets posted during weekday nights, i.e. between 21:00 and 07:00 from Monday to Thursday. This was estimated for each year, as some users might have changed their residence. Only users who made at least one trip in the observed year and posted Tweets during the night period were considered. The residence location was estimated for 84% and 76% of the users for the years 2014 and 2019, respectively.
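
The residence rule above can be sketched as a per-user, per-year count of night-time Tweets by municipality. This is a minimal illustration, not the implementation used in the study. It assumes that timestamps are in local time and that the weekday is taken from the calendar day of the timestamp (so a Tweet at 02:00 on a Friday is excluded); the text does not specify how nights crossing midnight are assigned. The filter on users with at least one trip is omitted, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Tweet = Tuple[str, datetime, str]  # (user_id, local timestamp, municipality)


def is_weekday_night(ts: datetime) -> bool:
    """True between 21:00 and 07:00 on Monday to Thursday (calendar day)."""
    night = ts.hour >= 21 or ts.hour < 7
    return night and ts.weekday() <= 3  # Monday = 0, Thursday = 3


def residence_by_year(tweets: Iterable[Tweet]) -> Dict[Tuple[str, int], str]:
    """Most frequent night-time municipality for each (user, year)."""
    counts = defaultdict(Counter)
    for user, ts, municipality in tweets:
        if is_weekday_night(ts):
            counts[(user, ts.year)][municipality] += 1
    # Ties are resolved in favour of the municipality seen first.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


tweets = [
    ("u1", datetime(2014, 3, 3, 22, 30), "Oslo"),   # Monday night
    ("u1", datetime(2014, 3, 4, 6, 15), "Oslo"),    # Tuesday early morning
    ("u1", datetime(2014, 3, 5, 12, 0), "Bergen"),  # Wednesday midday, ignored
    ("u1", datetime(2014, 3, 5, 13, 0), "Bergen"),  # ignored
    ("u1", datetime(2014, 3, 5, 14, 0), "Bergen"),  # ignored
    ("u1", datetime(2014, 3, 8, 23, 0), "Bergen"),  # Saturday night, ignored
]
homes = residence_by_year(tweets)
```

Although most of this user's Tweets come from Bergen, only the weekday-night Tweets count, so the estimated residence is Oslo.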

4.2 Trip patterns

In the Twitter data, there is no explicit information about trip patterns; therefore, some assumptions were made.

Using a well-known trip-extraction procedure [67, 84], two subsequent Tweets from the same user were considered a trip if they were posted from two different municipalities and within a given time limit. A person might have started the real trip in one municipality but not posted a tweet until passing through another, which would give a biased starting point. The same could happen at the destination. However, the likelihood of this was assumed to be low. In this study, the time limit was set to 12 h to allow for the long trips that are possible within Norway. Terroso-Saenz et al. [84] assumed this limit to be 24 h in Spain.
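
The trip rule above can be sketched as a scan over each user's Tweets in chronological order. This is a minimal illustration under stated assumptions, not the procedure of [67, 84]: the 12 h limit is treated as inclusive, each Tweet already carries its assigned municipality, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Tweet = Tuple[str, datetime, str]      # (user_id, timestamp, municipality)
Trip = Tuple[str, str, str, datetime]  # (user_id, origin, destination, arrival)


def extract_trips(tweets: Iterable[Tweet],
                  max_gap: timedelta = timedelta(hours=12)) -> List[Trip]:
    """Pair consecutive tweets of the same user into inter-municipal trips."""
    trips = []
    ordered = sorted(tweets, key=lambda t: (t[0], t[1]))  # by user, then time
    for prev, curr in zip(ordered, ordered[1:]):
        same_user = prev[0] == curr[0]
        moved = prev[2] != curr[2]
        within_limit = curr[1] - prev[1] <= max_gap
        if same_user and moved and within_limit:
            trips.append((curr[0], prev[2], curr[2], curr[1]))
    return trips


tweets = [
    ("u1", datetime(2019, 5, 1, 8, 0), "Oslo"),
    ("u1", datetime(2019, 5, 1, 15, 0), "Bergen"),    # 7 h later: a trip
    ("u1", datetime(2019, 5, 1, 16, 0), "Bergen"),    # same municipality
    ("u1", datetime(2019, 5, 3, 9, 0), "Trondheim"),  # 41 h later: too long
    ("u2", datetime(2019, 5, 3, 10, 0), "Oslo"),      # different user
]
trips = extract_trips(tweets)
```

Only the Oslo to Bergen movement qualifies here. Passing `max_gap=timedelta(hours=24)` would reproduce the longer limit mentioned for Spain.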

The average number of trips per user per day was estimated as the total number of trips in the studied period divided by the total number of users travelling per day in that period. Only Twitter users that travelled were included in the estimation, i.e. users without trips were disregarded. The average trip distance was estimated based on the distance between the centroids of the origin and destination municipalities of each trip.
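
Both indicators above can be sketched in a few lines. Two assumptions are made here that the text does not state: the denominator is read as the number of distinct (user, day) pairs with at least one trip, and the centroid distance is computed as a great-circle (haversine) distance. The coordinates are approximate city locations used for illustration only, not actual municipality centroids, and all names are hypothetical.

```python
import math
from datetime import date
from typing import Dict, List, Tuple

Trip = Tuple[str, date, str, str]  # (user_id, day, origin, destination)
LatLon = Tuple[float, float]       # (latitude, longitude) in degrees


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in km, using a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def trips_per_user_per_day(trips: List[Trip]) -> float:
    """Total trips divided by the number of (user, day) pairs with a trip."""
    user_days = {(user, day) for user, day, _, _ in trips}
    return len(trips) / len(user_days) if user_days else 0.0


def average_trip_distance_km(trips: List[Trip],
                             centroids: Dict[str, LatLon]) -> float:
    """Mean centroid-to-centroid distance over all trips."""
    if not trips:
        return 0.0
    total = sum(haversine_km(centroids[o], centroids[d]) for _, _, o, d in trips)
    return total / len(trips)


# Approximate, illustrative coordinates (assumption, not true centroids).
centroids = {"Oslo": (59.91, 10.75), "Bergen": (60.39, 5.32)}
day = date(2019, 5, 1)
trips = [("u1", day, "Oslo", "Bergen"),
         ("u1", day, "Bergen", "Oslo"),
         ("u2", day, "Oslo", "Bergen")]
rate = trips_per_user_per_day(trips)  # 3 trips over 2 user-days
avg_km = average_trip_distance_km(trips, centroids)
```

Because users without trips never enter the denominator, the rate cannot fall below one trip per travelling user per day.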

4.3 Demographic information

In addition, the biography description of each user was analysed to identify the gender of the user. Although there are methods in the literature that use deep neural networks or traditional machine learning to identify demographic aspects of Twitter users, these methods require semantic analysis of the tweet texts [75, 90], which is beyond the scope of this paper. Using a word detection method on the bio description of the users and a manual quality check, the gender of about six percent of the users was identified (within this share, 56% were men and 44% were women). Yet, the lack of more detailed demographic information makes it challenging to ensure representativity. Nevertheless, the use of Twitter in Norway is spread among the age groups: 41% 18–29 years old, 31% 30–39 years old, 31% 40–49 years old, 27% 50–59 years old, and 12% more than 60 years old, with about 1.1 million Twitter users in 2021 [78].

4.4 Validation

To validate the sample, the residence location of the users was compared to the population census and to the stated residence location in the travel survey for the years 2014 and 2019. This method has been used in previous studies [66, 70], whilst other studies compare the density distribution of Tweets to population to assess their validity [25, 52, 53].

To make the population distributions comparable, the number of respondents or users in each municipality was divided by the total number of respondents or users, respectively. Figures 2 and 3 represent these distributions, as well as the sample number and the number of municipalities included.
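
The normalisation above turns counts per municipality into shares that sum to one, so that samples of different sizes can be compared. A minimal sketch with hypothetical names and toy data:

```python
from collections import Counter
from typing import Dict, Iterable


def municipality_shares(residences: Iterable[str]) -> Dict[str, float]:
    """Share of the sample living in each municipality (shares sum to 1)."""
    counts = Counter(residences)
    total = sum(counts.values())
    return {m: n / total for m, n in counts.items()}


# Toy samples (hypothetical): one residence municipality per person.
survey = ["Oslo", "Oslo", "Bergen", "Tromsø"]
twitter = ["Oslo"] * 6 + ["Bergen"] * 3 + ["Tromsø"]

survey_shares = municipality_shares(survey)
twitter_shares = municipality_shares(twitter)

# Difference in share per municipality between the two samples.
diff = {m: twitter_shares.get(m, 0.0) - survey_shares.get(m, 0.0)
        for m in set(survey_shares) | set(twitter_shares)}
```

The same function can be applied to census counts by repeating each municipality once per inhabitant, or more practically by dividing the census counts by their total directly.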

Fig. 2 Population distribution in 2014 of (a) census, (b) travel survey, (c) Twitter (background maps source: OpenStreetMap)

Fig. 3 Population distribution in 2019 of (a) census, (b) travel survey, (c) Twitter (background maps source: OpenStreetMap)

The validation shows that the residence location distribution of the users resembles the population distribution from the census. Data for 2014 covered 360 out of 422 municipalities, where more than 97% of the population lives. In 2019, there was less Twitter data, probably due to the information-sharing policy [88], but the estimated residence locations still included 301 municipalities, where more than 92% of the census population lives, covering more territory than the national travel survey.

5 Results

In this paper the origin–destination (OD) trips between municipalities from the travel survey and from the Twitter data for the years 2014 and 2019 were estimated to assess if social media data could complement traditional data collection methods. It is important to acknowledge the disparity in trip definitions between the two data sources, rendering a direct comparison inappropriate.

Trips from the travel survey were already reported in the data by respondents; both short and long trips between different municipalities were considered. Trips from the Twitter data were estimated as previously described in Sect. 4.2. The number of OD trips between municipalities, the number of persons performing these trips, and the average distance of the trips are described in Table 3. Figures 4 and 5 visually represent the OD trips and the distance distribution with respect to the number of trips for 2014 and 2019, respectively.

Table 3 Number of persons, OD trips and average distance for both datasets

Fig. 4 OD trips and histogram of distance for Travel Survey and Twitter data for 2014 (background maps source: OpenStreetMap)

Fig. 5 OD trips and histogram of distance for Travel Survey and Twitter data for 2019 (background maps source: OpenStreetMap)

In 2014, almost 7500 Twitter users made more than 45,000 OD trips, with an average trip length of 83 km. In the travel survey, the reported trips were not evenly distributed over the year, resulting in less information for some months. Even though no direct comparison should be made, the mobility behaviour between the large cities was similar, although there was a slight underrepresentation of trips between the capital and municipalities 50–300 km towards the west in the Twitter data.

In 2019, more than 4000 Twitter users made more than 23,000 OD trips, with an average trip length of 84 km. Even though fewer trips were represented than in 2014, the mobility patterns remained similar. In contrast, the mobility behaviour reported in the travel survey differed between 2014 and 2019, as most of the trips in 2019 were shorter, i.e. the average distance was less than 60 km, and concentrated in the densely populated areas of the country, where most respondents lived.

Twitter data was further explored for the whole dataset 2012–2022. Fig. 6 displays the average number of OD trips between municipalities per user per day. The average was around 1.5 trips per user per day, and the yearly variations within the studied period were lower than 5%. This indicates a steady data source, although this metric could not be correlated with external trends.

Fig. 6 Average trip distance and average number of OD trips per user per day (2012–2022)

Fig. 6 also shows the yearly distribution of the average trip distance. In this case there were three trend changes, which define four time periods. (1) Prior to 2013, the trip distance remained relatively constant. However, due to the limited data spanning only two years, statistical significance could not be determined.

(2) Between 2013 and 2017, there was a notable upward trend in the average trip distance, rising from 76 to 109 km, an increase of over 40%. This increase could potentially be attributed to a shift in trip destinations to municipalities located farther away. (3) Conversely, from 2017 to 2020, the opposite trend emerged, with average trip distances declining during these years. (4) From 2020 onwards, the trend of decreasing trip distances persisted, reaching the lowest values of the studied period, although the decline was less pronounced than in the previous period.

Fig. 7 represents the monthly distribution for the studied periods. In general, the average trip distance is slightly larger in the winter (January–March) and summer (July–September) periods, which could be associated with vacation periods and more trips to cabin areas.

Fig. 7 Number of OD trips per user and average trip distance (2012–2022)

The temporal data distribution made it possible to detect a significant trend change in September 2017, associated with a significant increase in the average trip distance, although the number of OD trips and unique users was similar to that of the same month in previous years. When the spatial distribution is taken into account, several origins and destinations became more popular in Nordland county, shown in Fig. 8 with yellow borders, where the Lofoten islands are situated among other tourist areas.

Fig. 8 OD trips for September 2015 (left) and September 2017 (right) (background maps source: OpenStreetMap)

6 Discussion

Although this study only used Tweets from Norway, the data is available worldwide and could be relevant for any other country or region. The number of georeferenced Tweets in Norway for a ten-year period was only six times the number of Tweets in Australia for 1 week [63], and less than double the number for 9 months in Spain [84], which is among the 20 leading countries based on Twitter users, though still with ten and seven times fewer users than the United States and Japan, respectively [77]. This emphasises the relevance of this data source in other countries.

A potential limitation of this study is the number of georeferenced Tweets, as an average of 1.5 trips between municipalities per user per day might be low compared to other data sources, which may make trend detection challenging. Despite that, a slight reduction in the number of trips and the average trip length was detected from 2020, when the pandemic restrictions started. Zhong et al. [98] also detected that users in London were making fewer trips, although these were longer. In the United States, Twitter data was also used as a source to detect mobility changes, although different trends were found between states [41].

The movements in this study were limited to movements between municipalities. This aggregation might limit the full exploitation of the data, although few Tweets were georeferenced with exact coordinates. Recent research is investigating how to estimate the coordinates of Tweets without geographically identified data, which could expand the data sample at any location [62] and allow more detailed analyses. Thus, further work towards a finer spatial resolution is desirable, which could also improve the detection of the trip purpose. Commuting trips could also be further explored, as in McNeill et al. [60] and Osorio-Arjona and García-Palomares [65], given that the residence location was estimated; however, several commuting trips take place within the same municipality.

Twitter data proved to be a more stable data source than the national travel surveys across both years, with similar population distribution and average trip length. Some previous work comparing Twitter-based mobility patterns to travel surveys confirmed their similarities for New York City [46] and for California, where similarities in spatial distributions and trip lengths were also detected [48]. However, the latent mobility behaviour could not be captured in Spain [84]. Nevertheless, the behaviour could not be associated with different sociodemographic groups of the population. Simple analyses were used to detect the user gender, resulting in a share similar to that of the travel survey, although further research should focus both on expanding the estimated sample and on estimating other features, such as age.

The Norwegian travel survey is conducted quadrennially, covering varying time periods in each cycle. The 2014 survey collected data from January to October, with limited responses after August, whereas the 2019 survey encompassed the entire year. Additionally, respondents only report trips from the previous day, resulting in the absence of panel data. Consequently, capturing and analysing longitudinal trend changes becomes challenging. National travel surveys reflect the mobility patterns of the population residing in the country, i.e., the mobility patterns of non-residents are not included. As a result, in some areas the real traffic volumes generated by people’s movements differ considerably from those reflected in these surveys, especially in tourist areas. Social media data could potentially improve the representation of non-residents’ mobility and serve a complementary role to national travel surveys for residents’ travel behaviour. It could also aid in uncovering trends through sentiment analysis of the post contents.

7 Conclusion

This paper contributes an assessment of the feasibility of integrating user-generated data with conventional travel surveys to enhance the estimation of travel behaviour and improve the development of transport models. Compared to the travel survey, Twitter data presented a broad and stable spatial and temporal distribution of users’ movements, despite its limitations regarding the socio-demographic information of the users. In addition, the availability of these data in real time could serve as a tool to detect trend changes resulting from diverse policies or other events at micro or macro level, such as recessions, pandemics, or wars.

Further work should concentrate on refining the spatial resolution of the data from municipality level to the spatial units used in the transport models, as well as on improving the detection of the socio-economic characteristics of the users. These improvements would help ensure representativeness and provide more detailed information towards a potential data fusion.

Integrating user-generated data with traditional travel surveys has significant policy implications. This approach enhances data accuracy and granularity, providing policymakers with more precise insights into travel behaviour and enabling rapid detection of trend changes. The integration supports the further development of transport models for evaluating policy impacts and targeted interventions, and might be a cost-effective complement to traditional surveys, allowing for more frequent and up-to-date data collection.