
1 Introduction

In recent years, the importance of short-form travel videos (STVs) in destination marketing has been widely acknowledged [1]. These videos offer engaging and immersive travel experiences that can quickly evoke emotional responses in potential tourists, such as inspiration, memories and a desire to travel [2, 3]. STVs have become an integral part of the tourist experience and have a significant influence on tourists’ decision-making processes [2, 4]. In China, one of the world’s largest tourism source markets, 407 million users on Douyin, the Chinese version of TikTok, are interested in travelling [5]. More and more tourists are using short videos to share their travel experiences, seeking both social expression and emotional satisfaction [1, 2].

While many previous studies have focused on visual reference cues in STVs (e.g., [3]), the important role of music as an auditory reference cue has not been sufficiently emphasised. As noted by Aufderheide [4], music can create an immersive environment, evoking experiences and emotions that drive purchase intentions and foster a sense of belonging. Recognised as a powerful sensory cue and environmental stimulus [6], music could act as a crucial bridge between the communication in short-form video content and the emotional engagement of the viewer. Specifically, the congruity between music and the elements in STVs, such as destination attributes and video types, may strongly influence video aesthetics and shape tourists’ perceptions of destination-related emotions [2]. A classic example is the travel challenge hashtag ‘#fallingtrend’ on TikTok. User-generated 15-second STVs, precisely synchronised to the beat of Taylor Swift’s song ‘Love Story’, motivated a significant number of potential tourists to spontaneously visit destinations and create travel-related content. However, the exact mechanism through which music operates in STVs remains unclear. Considering the above, this study constructs a research model based on SOR theory and emotional resonance, to be tested through a scenario-based experiment, aiming to address the following research questions:

RQ1: Does music congruity affect tourists’ perception of the audiovisual environment (music cognition, video aesthetics) in virtual space?

RQ2: How do video aesthetics and music cognition evoke emotional resonance and influence the travel decision-making process of potential tourists?

2 Literature Review

2.1 The Audiovisual Environment in Short-Form Travel Videos

Mehrabian and Russell [7] introduced an influential theoretical framework known as the Stimulus-Organism-Response (SOR) model. This theory proposes that environmental stimuli can affect an individual’s behaviour by influencing their emotional state. In the field of tourism, it has become an essential framework for researchers seeking to understand tourists’ behaviour and emotions [6]. Although previous studies have recognised that stimuli in STVs include more than visual aesthetics, such as music-induced sensory experiences [3], and have confirmed the relationship between these stimuli and potential tourists’ emotions, there is still a lack of quantitative studies investigating the relationship between the audiovisual environment in STVs and potential tourists’ behavioural intentions. Furthermore, the interaction and impact of two key environmental factors, music [6, 8] and video aesthetics [3], in STVs remain unclear.

To address this research gap, this study conceptualises the audiovisual environment in short videos as ‘Stimulus (S)’, considers emotions as ‘Organism (O)’, and examines tourists’ sharing intentions and impulsive travel intentions as ‘Response (R)’.

2.2 Music Congruity

Music congruity is a complex and multidimensional concept. In the fields of psychology and marketing, many studies highlight the powerful emotional impact of music [e.g., 9, 10]. Specifically, the congruity between music and visual advertising could strongly influence positive emotional responses and attitudes towards the advertising [9]. Additionally, Baumgartner et al. [11] were the first to use neuropsychological techniques to confirm that emotionally congruent musical excerpts significantly enhanced the emotional experience induced by visual stimuli. Although the definition of music congruity varies across these research contexts, its practical value has been confirmed by its application in retail and offline tourism marketing [6, 8].

In recent years, tourists’ travel content preferences have shifted towards virtual tourism on social media and digital audiovisual travel-related content [12]. This change in tourists’ habits has highlighted the importance of congruity between music and visual elements in videos [13]. In the context of tourism, Fang et al. [3] argue that it is essential to achieve a coordinated audiovisual experience. STVs can effectively evoke emotional responses in viewers, such as feelings of inspiration, only when there is optimal coordination. To achieve a coordinated audiovisual experience in STVs, music may need to be coordinated with aesthetic design components such as copywriting, animation, and rhythm [3]. Therefore, in this study, music congruity is defined as the congruence between music and various video elements, including copywriting, video graphics, and post-editing. Furthermore, while Raja et al. [8] introduced a conceptual framework for understanding the impact of music congruity on tourists’ emotional resonance, further exploration of the mechanisms and quantitative models is needed.

2.3 Emotional Resonance and Behaviour Intentions

Previous studies have highlighted the critical role of emotional resonance in STVs for effective destination marketing [2]. Therefore, this study argues that emotions evoked by the audiovisual environment of STVs, as the key component of the ‘environment-emotion-behaviour’ pathway, are a necessary driver for the success of destination marketing campaigns. However, measuring emotional responses in STVs is challenging because of the complex virtual audiovisual environment on social media platforms and the variety of video content available [14]. These factors may limit the effectiveness of traditional measures, such as the PAD scale, in accurately assessing the impact of emotions on STV users. To address this issue, Cheng et al. [2] investigated four main factors related to emotional resonance (entertainment, inspiration, escapism, and self-congruence) in the context of STVs. They also confirmed the relationship between emotional resonance and the engagement of STV users. However, they did not consider the impact of music in STVs and did not use a scenario-based experiment. Furthermore, previous studies suggest that tourists’ moods or emotional states influence their impulsive travel intentions [15] and sharing intentions [16].

Based on the above, Fig. 1 shows our research framework.

3 Proposed Methodology

The offline pilot and main surveys will be conducted in China, Japan and South Korea in October 2023. These countries were chosen because they are recognised by the UNWTO as major international tourist source countries. They also have rich cultural music resources and many short video users, making them ideal for our research design.

Music congruity is inherently subjective, and participants’ cultural backgrounds and prior perceptions may introduce bias. We will take three steps to minimise this. First, we will employ convenience sampling and multigroup analysis (MGA), aiming to recruit 600 short video users (200 from each country) with travel experience, excluding those who have visited the destinations shown in the sample videos. Second, following Hadinejad et al. [10], we will introduce music familiarity and demographic variables as control variables. Third, we will collaborate with professional video production companies to produce and select the sample videos. Participants will be asked to watch three sample videos, each featuring a different themed tourist attraction. Each video will be paired with two types of background music: one congruent with the elements of the video and one deliberately incongruent. This procedure serves as the experimental manipulation of music congruity.
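As an illustration only, the allocation of participants to music conditions could be sketched as follows. The text does not state whether congruity is manipulated between or within participants, so this sketch assumes a between-subjects design with an equal split inside each country and a randomised video order. The function name, the video labels and the participant identifiers are hypothetical placeholders.

```python
import random
from collections import Counter

COUNTRIES = ["China", "Japan", "South Korea"]
VIDEOS = ["video_1", "video_2", "video_3"]   # placeholder labels for the three sample videos
CONDITIONS = ["congruent", "incongruent"]    # the two background-music versions


def assign_conditions(n_per_country=200, seed=0):
    """Balanced random allocation (assumed between-subjects design).

    Within each country, half of the participants receive the congruent
    music version and half the incongruent version; the order in which the
    three videos are shown is shuffled independently for every participant.
    n_per_country is assumed to be even.
    """
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    plan = []
    for country in COUNTRIES:
        labels = CONDITIONS * (n_per_country // 2)
        rng.shuffle(labels)
        for i, condition in enumerate(labels):
            order = VIDEOS[:]
            rng.shuffle(order)
            plan.append({
                "participant": f"{country}-{i:03d}",
                "country": country,
                "condition": condition,
                "video_order": order,
            })
    return plan


if __name__ == "__main__":
    plan = assign_conditions()
    print(len(plan))
    print(Counter((p["country"], p["condition"]) for p in plan))
```

Balancing within each country keeps the congruity manipulation from being confounded with nationality, which matters for the planned multigroup comparison.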

To validate our music congruity manipulations, we will conduct manipulation checks focusing on two key aspects. First, we will measure participants’ awareness of the music changes in the sample videos using music perception measures [6]. Second, participants will rate the appropriateness of the background music by considering elements such as video tempo, visual style, copywriting, and destination attributes.
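One possible way to evaluate the second manipulation check is to compare appropriateness ratings between the congruent and incongruent versions. The sketch below computes Welch's t statistic and its approximate degrees of freedom using only the Python standard library. The choice of Welch's test is an assumption, not a procedure stated in this study, and the ratings shown are made-up toy values for illustration, not data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Compares the means of two independent groups without assuming
    equal variances.
    """
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances
    se2 = va / na + vb / nb                        # squared standard error
    t = (mean(group_a) - mean(group_b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Toy seven-point appropriateness ratings (illustrative values only).
    congruent = [6, 5, 7, 6, 6]
    incongruent = [3, 2, 4, 3, 3]
    t, df = welch_t(congruent, incongruent)
    print(f"t = {t:.3f}, df = {df:.1f}")
```

A clearly positive t value would suggest that participants perceive the congruent music as more appropriate, which is what a successful manipulation requires.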

Measurement items were developed from existing literature using a seven-point Likert scale (1 = strongly disagree; 7 = strongly agree). The measurement and structural models will be evaluated via Mplus or SmartPLS (version

Fig. 1. Research Model.

4 Expected Results and Potential Implications

This study will examine the impact of multisensory congruity in online virtual environments, specifically how congruity between music and various video elements (e.g., tempo, copywriting, style, destination attributes) influences the emotional resonance and behavioural intentions of potential tourists. The following expected findings could provide valuable insights for marketers and content creators.

First, music congruity and incongruity may have opposite effects on the constitutive elements of the audiovisual environment of STVs (video aesthetics and music cognition). Specifically, higher levels of congruity are expected to increase the stimulation users perceive when interacting with the music and video content. Such findings would be consistent with previous studies [e.g., 9, 10] and could demonstrate the effectiveness of scenario-based experiments, thereby extending music congruity research to the context of STVs. Marketers should therefore consider the strategic use of music in STVs; for example, they could fine-tune the congruity between music and the various video elements through test broadcasts before formally releasing an STV. Second, the emotional resonance evoked by the audiovisual stimuli in virtual environments could have a mediating effect. This would be consistent with the SOR theory’s proposition of an ‘environment-emotion-behaviour’ framework. Third, music and video are expected to be key factors in evoking emotional resonance and user behaviour, with emotional resonance directly influencing users’ sharing intentions and impulsive travel intentions. Furthermore, marketers may need to consider factors such as nationality, cultural background and music familiarity when developing precise marketing strategies for different demographics.
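The mediating effect anticipated in the second finding can be written, as a simplified sketch, in the standard single-mediator form. The symbols are assumptions introduced here for illustration: X denotes a perception of the audiovisual environment (e.g., video aesthetics), M denotes emotional resonance, and Y denotes a behavioural intention (e.g., sharing intention). The full structural model in Fig. 1 may include more paths than this reduced form.

```latex
\begin{align}
M &= i_M + aX + \varepsilon_M \\
Y &= i_Y + c'X + bM + \varepsilon_Y \\
\text{indirect effect} &= ab, \qquad \text{total effect} = c' + ab
\end{align}
```

Under this specification, a mediating role for emotional resonance corresponds to an indirect effect ab that differs from zero.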