# RecSys 2020 Recommender Systems Paper: Recommending the Video to Watch Next: An Offline and Online Evaluation at YOUTV.de

## Abstract

The task “recommend a video to watch next?” has been in the focus of recommender systems’ research for a long time. However, adequately exploiting the clues hidden in the sequences of actions of user sessions in order to reveal users’ short-term intentions moved only recently into the focus of research. Based on a real-world application scenario, in this paper, we propose a Markov Chain-based transition probability matrix to efficiently reveal the short-term preferences of individuals. We experimentally evaluated our proposed method by comparing it against state-of-the-art algorithms in an offline as well as a live evaluation setting. In both cases our method not only demonstrated its superiority over its competitors, but exposed a clearly stronger engagement of users on the platform. In the online setting, our method improved the click-through rate by up to 93.61%. This paper therefore contributes real-world evidence for improving the recommendation effectiveness, by considering sequence-awareness, since capturing the short-term preferences of users is crucial in the light of items with a short life span such as tv programs (news, tv shows, etc.).

## 1. Introduction

YOUTV.de operates as an online video recording service that allows users to record high-quality German television programs of their choice and view them whenever and wherever they want. Users of YOUTV.de can watch broadcasts directly after recording them online without any time delay (via streaming) or on the go by using an iOS or Android device. Personalized recommendations are therefore becoming a core functionality in order to enhance customer engagement and retention for such a service provider [11]. YOUTV.de provides the functionality of recommending which videos to watch next based on the sequence of tv programs that the users have interacted with (see the bottom of Figure 1). Consumer studies suggest that typical visitors of an Internet TV website lose their interest after clicking on 10 to 20 titles (perhaps reviewing 3 in more detail) [6]. That is, users either find something interesting within a short time span or the likelihood that they abandon the service increases significantly.

Collaborative Filtering (CF) helps users in exploring videos that other users with similar behavior and taste have already identified in the past. That is, given a target user and her positively rated tv programs, a CF algorithm will identify their neighbourhood of similar users, i.e. user-based CF (UBCF) [27]. Another CF algorithmic variation is the item-based CF (IBCF) [28], where given a target user and her positively rated tv programs, the algorithm relies on the items’ similarities for the formation of a neighborhood of nearest items. Recently, session-based CF (SBCF) has been proposed by [12] also known as session-knn, where similar sessions to the current session of a user are identified. Please note that a session is a short-term interaction of a user with the system.

However, all the aforementioned methods are considered as non-sequential, since they learn a user’s preference on each individual item, and then they rank these items based on their score to provide recommendations. This functionality makes these methods good at capturing the general tastes of users by using the whole historical information (i.e. by aggregating a user’s complete log history). In contrast, sequence-aware recommender systems capture the transition relationship between two or more adjacent items in user log sequences (i.e. user sessions) so as to learn the sequential dynamics among items [24]. For example, many sophisticated sequence-based approaches were proposed that implement some form of sequence modelling based on Markov Chain Models (MCM) [1,21,23], which can capture the item transition probabilities and the very last user intentions inside a user session.

In this paper, we use MCM to provide session-based video recommendations, by trying to capture the short-term preferences of users. To reveal the user’s most recent intention, we analyse the item interactions inside her latest sessions (i.e., MCM-based item similarity) by building a Transition Probability Matrix (TPM). Moreover, we track the evolution of these preferences by using a sliding time window to assign smaller importance to videos that are outdated or old. Thus, our model is continuously updated with the latest user clicks, which lets it adapt to changes in user preferences. We have run offline and online experiments with our proposed method and other baselines. In both settings, our method performs best. In particular, in the online setting, our method, which profiles users in real time, improves the click-through rate by up to 93.61%.

The rest of the paper is organized as follows. Section 2 summarizes the related work. Section 3 provides the problem formulation, whereas Section 4 presents a motivating toy example. Section 5 presents our proposed methodology. Experimental results are given in Section 6, whereas we discuss some methodological challenges in Section 7. Finally, Section 8 concludes the paper.

## 2. Related Work

There are several works [1, 3, 4, 7, 21, 23, 34] in recommender systems that use sequential modelling based on MCMs. Zimdars et al. [34] were the first to suggest treating the recommendation process as sequential. Esiyok et al. [3] studied users’ behaviour in the context of news categories by building an MCM based on the Plista data set and by describing patterns in the evolution of news categories while users browse news articles online. Moreover, approaches for recommender systems that use Markov Decision Processes (MDPs) were published by Moling et al. [22] and Shani et al. [29]. A hybrid model that combines MCM with matrix factorization (MF) was proposed by Rendle et al. [26] and is denoted as Factorized Personalized Markov Chains (FPMC). FPMC is used for the next-basket recommendation problem, where the task is to predict the content of a user’s next basket given the history of past shopping baskets. This approach combines MCM with MF using a three-dimensional tensor (user, current item, next item). Each entry in the tensor corresponds to an observed transition between two items performed by a specific user. The method then uses pairwise factorisation to predict the unobserved entries in the sparse tensor [20]. However, this approach is computationally intensive and does not scale well in real-world online situations.

Session-based recommendation usually refers to the case in which we only have anonymous sessions and are not able to build a user’s profile. Recently, session-based recommendations have been modelled with Recurrent Neural Networks (RNNs). Hidasi et al. [9] presented a recommender system based on the Gated Recurrent Unit (GRU), known as GRU4Rec, which learns when and how much to update the hidden state of the GRU model. However, a more recent study [12] has shown that a simple k-nearest neighbor scheme adapted for session-based recommendations (session-kNN) often outperforms the GRU4Rec model. Nevertheless, several adjustments that improve the performance of the initial RNN model were proposed in recent years [8, 10, 25, 30, 31]. The Neural Attentive Recommendation Machine (NARM) [16] proposed an attention mechanism to capture both the user’s (i) sequential behavior and (ii) main purpose in a session. The Short-Term Attention/Memory Priority (STAMP) [19] model proposed another attention mechanism, which can effectively capture the long-term and short-term interests of a user in a session. Yuan et al. [33] proposed a convolutional generative model, which combines masked filters with 1D dilated convolutions to model the long-range item dependencies among user sessions.

Another research direction is the combination of collaborative filtering with content-based filtering for providing item recommendations. Related work [17] has shown that one way to increase accuracy is to consider the context of the user (i.e., time, location, mood, etc.). For example, Das et al. [2] generated recommendations based on collaborative filtering that takes into consideration the co-visitation count of articles, which is the number of times a news story was co-visited with other news stories in the user’s click history. Later, Liu et al. [18] combined a content-based method with the collaborative filtering method previously developed for Google News [2] to generate personalised news recommendations. The hybrid method develops a Bayesian framework for predicting users’ current news interests based on profiles learned from: (i) the target user’s activity and (ii) the news trends demonstrated in the activity of all users, based on the categories that the news articles belong to. We use this method as a comparison partner in our experiments, since videos also have content that describes them and that could be combined with the collaborative data.

## 3. Problem Formulation

We are interested in building a recommender system that suggests videos to interested internet television viewers (see the “you may also like” section in Figure 1). The internet tv provider may update a small personalised top-N list of video recommendations (which may be shown inside a widget) every time the viewer selects a video, because the provider wants to keep the user engaged on the website longer, both for advertising reasons and to fulfil the user’s viewing desires. The system monitors how visitors react to the received recommendations in order to drive better suggestions and to predict their next click/item inside a session.

Let $U$ denote the increasing set of users that visit the online web site, and $I$ represent the increasing set of incoming tv programs/items. We keep track of users’ actions over items on the website. In particular, whenever a user views one or more videos in a short period of time, we store these interactions in the database as a user’s session. These interactions with items have a sequence. That is, we know for every item that belongs to a session whether it was selected first, second or last, and how long the user interacted with it. For instance, session $S_1(\text{user} = u_1, \text{TimeStarted} = t_1 \mid \{i_1, 20\,\text{sec}\}, \{i_2, 145\,\text{sec}\})$ indicates that within session $S_1$, which was started at timepoint $t_1$ by user $u_1$, item $i_1$ was selected first and viewed for 20 seconds, and $i_2$ was selected second and held the user’s attention for 145 seconds. Table 1 summarises some basic symbols and notations that will be used later.

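Table 1: Basic symbols and notations.

| Symbol | Description |
|---|---|
| $U$ | Set of users, $U = \{U_1, U_2, ..., U_n\}$ |
| $I$ | Set of items, $I = \{i_1, i_2, ..., i_m\}$ |
| $S$ | Set of sessions, $S = \{S_1, S_2, ..., S_N\}$ |
| $S_u$ | Set of sessions that belong to user $u$ |
| $S_i$ | Set of sessions that item $i$ belongs to |
| $S^{t_p}$, $S^{t_y}$ | Sets of sessions within a time window |
| $w$ | Size of the sliding time window |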

## 4. Motivating Example

Our session-based recommender consists of two modules. The first one is the user profile updater module, which reads instances from the stream of sessions and combines them with earlier recorded information. In particular, our user profile updater assigns validity intervals to elements of the sessions stream $S$. A sliding time window of size $w$ states that the processing at a point in time $t$ should respect all events not older than $t-w$. Therefore, the profile updater assigns a (half-open) validity interval $(t-w, t]$ to an event that arose at time $t$. The second module is the recommender, which runs on top of the profile updater to deliver the top-$N$ recommended items to each user.
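The half-open validity interval can be illustrated with a minimal sketch. The session records, their identifiers (`A` to `D`), the timestamps and the helper name `valid_sessions` are all hypothetical and are not taken from the paper or from Figure 2; the sketch only assumes that each session carries a start time.

```python
from datetime import datetime, timedelta

def valid_sessions(sessions, t, w):
    """Keep sessions whose start time lies in the half-open interval (t - w, t]."""
    return [s for s in sessions if t - w < s["started"] <= t]

# Hypothetical data: the current time t and a 12-hour sliding window w.
t = datetime(2018, 5, 25, 12, 0)
w = timedelta(hours=12)
sessions = [
    {"id": "A", "started": t - timedelta(hours=30), "items": ["i2", "i3"]},  # too old
    {"id": "B", "started": t - timedelta(hours=12), "items": ["i4", "i5"]},  # exactly t - w: excluded
    {"id": "C", "started": t - timedelta(hours=3), "items": ["i4", "i6"]},   # inside the window
    {"id": "D", "started": t, "items": ["i2", "i3"]},                         # exactly t: included
]

kept_ids = [s["id"] for s in valid_sessions(sessions, t, w)]
print(kept_ids)
```

Because the interval is open on the left, a session that started exactly `w` before `t` is already dropped, while one that starts at `t` itself is kept.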

To better explain our approach, we will use the following graphical representation for our running example, which is shown in Figure 2. We have 3 users and want to predict the video that user 2 will click next in his unfinished session (i.e. session $S_7$).

For computing similarities between target user 2 and the two other users, please note that sessions $S_1$ and $S_2$ are ignored, because they fall outside the valid time window interval $(t-w, t]$. This sliding window captures the notion of recency of tv programs. That is, a video may have a life span after which it quickly becomes obsolete (e.g. the news broadcast of the previous day is already old). Thus, we should avoid recommending videos that are not recent. Please note that entertainment content, such as movies, tv series, and tv shows, can remain interesting for more extended time periods and have a longer life span. When two or more items are selected within one session, these items can be considered more similar to each other than items that were selected in different sessions by the same user. For instance, by taking the actions of user 1 (i.e., $U_1$) into account, we can infer that item $i_4$ is more similar to items $i_6$ and $i_7$ than to item $i_9$, since $i_6$ and $i_7$ were selected inside the same session $S_4$ together with item $i_4$.

In our running example, depicted in Figure 2, session $S_7$ is still open and running for user $U_2$. Thus, items $i_2$ and $i_3$ of session $S_7$ can be matched in order to make item recommendations to user $U_2$. As shown, user $U_3$ has also selected the same items as $U_2$ (inside the valid time window interval). He has also selected $i_8$, which could be a good recommendation for $U_2$. Please note that $U_1$ has also selected exactly the same items as those in session $S_7$ of $U_2$. However, we cannot use session $S_1$ for training recommendations, since it is not inside the specified time window.

In summary, items that are selected within the same session (intra-session Markov Chain-based item similarity) are considered to be more similar than those items which are selected by the same user within different sessions (inter-session item similarity). Intra-session Markov Chain-based item similarity thus reveals the short-term preference of the user and his intentions inside a session independently of other sessions.

## 5. Methodology

In this Section, we will introduce our Markov Chain-based algorithm in more detail. In particular, for each user session, we update an item transition probability matrix based on the subsequences between items.

### 5.1. Markov Chain

A Markov chain is a stochastic process of possible events that satisfies the Markov property, where the probability of each event depends only on the present state and not on the previous states. A variation of MCM, denoted as Markov Chain Model of Order $m$, states that the future state depends on the past $m$ states. A Hidden Markov Model (HMM) is also an MCM, with hidden states. Moreover, Markov Decision Processes (MDPs) extend MCMs: at each timepoint $t$, when the process is in state $x_t$, the decision maker may choose any action $a\in A_{x_t}$. The MDP reacts at the next time step by randomly moving into a new state $x_{t+1}$ and giving the decision maker a corresponding reward $R(x_t, a, x_{t+1})$.

One of the ways that a Markov Chain Model $\{X\}$ can be represented is by using a transition matrix $P_{i,j}$, where each row contains the probabilities of transition between states:

$$P_{i,j}=\mathbb{P}(X_{t+1}=x_j \mid X_t=x_i) \qquad\qquad\text{(1)}$$

Each row of the matrix is a probability vector, and the sum of its components is equal to 1. For instance, a transition matrix that represents a model with three possible states:

$$P=\begin{bmatrix} 0.7 & 0.05 & 0.25 \\ 0.25 & 0.6 & 0.15 \\ 0.35 & 0.1 & 0.55 \\ \end{bmatrix}$$

If the current state of the process is $x_2$, i.e. the second row of the matrix is considered, then the probability of transition into the state $x_3$ equals $P(X_{t+1}=x_3 \mid X_t=x_2)=0.15$. Another way to represent an MCM is with a transition diagram, that is, a weighted directed graph where each vertex represents a state of the MCM and there is a directed edge from vertex $x_i$ to vertex $x_j$ if the transition probability $P_{i,j}>0$; this edge has the weight/probability $P_{i,j}$. An example of such a diagram can be found in Figure 3.
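The three-state example can be checked with a few lines of Python. This is only an illustrative sketch: the matrix values come from the example above, while the helper names `transition_prob` and `step` are hypothetical.

```python
# Transition matrix of the three-state example; row i holds P(X_{t+1} = x_j | X_t = x_i).
P = [
    [0.70, 0.05, 0.25],
    [0.25, 0.60, 0.15],
    [0.35, 0.10, 0.55],
]

# Each row is a probability vector, so its components must sum to 1.
row_sums = [sum(row) for row in P]

def transition_prob(P, i, j):
    """Probability of moving from state x_i to state x_j (1-based indices, as in the text)."""
    return P[i - 1][j - 1]

def step(dist, P):
    """Advance a distribution over states by one time step: dist' = dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

print(transition_prob(P, 2, 3))   # the x_2 -> x_3 entry discussed in the text
print(step([0.0, 1.0, 0.0], P))   # starting in x_2 reproduces the second row
```

Starting from a distribution concentrated on $x_2$, one step of the chain simply returns the second row of the matrix, which is the row the text reads the value 0.15 from.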

### 5.2. Proposed Method

Based on Bayesian inference that assumes independence among pieces of evidence, we can predict the items that will be included in the last session $S_N$ of a user $u$ based on the items that are already included in $S_N$. In particular, we can use the following formula to build the Markov Chain-based transition probabilities between any two subsequent items in each distinct session in time window $t_p$:

$$p(j\in S_N \mid i_{1:m}\in S_N)\propto\prod_{i_k\in S_N,\,k=1...m}p(j\in S_N \mid i_k\in S_N)\qquad\qquad\text{(2)}$$

where $i_{1:m}$ is the set of items that user $u$ has already clicked in the current session $S_N$ (with $i_k$ its $k$-th element), and $j$ is the item to be predicted as the next recommended item in $S_N$.
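A minimal sketch of the product in Equation 2 follows, assuming the pairwise conditional probabilities $p(j \mid i_k)$ are already available as a lookup table. The table `cond_prob`, its values and the function name `next_item_scores` are hypothetical illustrations, not values from the paper.

```python
from math import prod

def next_item_scores(current_items, cond_prob, candidates):
    """Unnormalised score of Eq. (2): the product over clicked items i_k of p(j | i_k)."""
    return {
        j: prod(cond_prob.get((i, j), 0.0) for i in current_items)
        for j in candidates
    }

# Hypothetical pairwise probabilities p(j | i), keyed as (i, j).
cond_prob = {
    ("i2", "i8"): 0.5,
    ("i3", "i8"): 1.0,
    ("i2", "i9"): 0.5,
    ("i3", "i9"): 0.0,
}

scores = next_item_scores(["i2", "i3"], cond_prob, ["i8", "i9"])
print(scores)
```

Because the evidence is combined by multiplication, a single zero factor removes a candidate entirely, which is one reason a practical implementation might smooth the probabilities; the sketch deliberately leaves that out.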

The Markov Chain-based item Transition Probability Matrix (TPM) captures the transition probability between two subsequent events in a session. That is, we can simply count how often users viewed item $i_b$ immediately after viewing item $i_a$.

Let a session $S_n$ be a chronologically ordered set of item click events $S_n=(i_1,i_2,...,i_m)$ and $S$ be the set of all sessions $S=\{S_1,S_2,...,S_N\}$. Given a user’s current session $S_N$ with $i_m$ being the last item in $S_N$, we can define the score for a recommendable item $j$ as follows:

$$\begin{aligned} p(j\in S_N \mid i_m\in S_N) & = score(j,i_m) \\ & = \frac{\sum_{S_n\in S,\,n=1...N}\sum_{i_k\in S_n,\,n=1...N}isSame(i_m,i_k)\cdot isSame(j,i_{k+1})}{\sum_{S_n\in S,\,n=1...N}\sum_{i_k\in S_n,\,n=1...N}isSame(i_m,i_k)} \end{aligned}\qquad\text{(3)}$$

where the function $isSame(i_a, i_b)$ indicates whether $i_a$ and $i_b$ refer to the same item:

$$isSame(i_a,i_b) = \begin{cases} 1, & \text{if } i_a=i_b \\ 0, & \text{if } i_a\neq i_b \end{cases}$$
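The counting behind Equation 3 can be sketched as follows. The sessions below are hypothetical and are not the sessions of Figure 2. The sketch also assumes that the denominator counts only the occurrences of an item that are followed by another item, which is the reading the worked example for $i_4$ suggests, so that every row sums to 1.

```python
from collections import defaultdict

def build_tpm(sessions):
    """Estimate first-order transition probabilities from ordered item sessions."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for items in sessions:
        # Count every pair of adjacent clicks (i_k, i_{k+1}) inside a session.
        for a, b in zip(items, items[1:]):
            pair_counts[a][b] += 1
    tpm = {}
    for a, followers in pair_counts.items():
        total = sum(followers.values())  # occurrences of a that have a successor
        tpm[a] = {b: count / total for b, count in followers.items()}
    return tpm

# Hypothetical sessions inside the valid time window.
sessions = [
    ["i4", "i5"],
    ["i4", "i6", "i7"],
    ["i2", "i3", "i8"],
]

tpm = build_tpm(sessions)
print(tpm["i4"])  # i4 is followed once by i5 and once by i6
```

Storing the matrix as a dictionary of dictionaries keeps only the non-zero entries, which suits the sparse transition data of a large item catalogue.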

Based on Equation 3, in our running example of Figure 2, the transition probability from item $i_4$ to item $i_6$ is equal to $\frac{1}{2}$: among all the sessions of time window $t_p$ there is only one case where $i_4$ is followed by $i_6$ (session $S_4$), and the denominator is equal to two, since there are two sessions where $i_4$ is followed by any other item (sessions $S_3$ and $S_4$). The Markov Chain-based transition probability matrix of our running example is presented in Table 2 (rows and columns with zeros are not shown).

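Table 2: Markov Chain-based transition probability matrix of the running example.

| Item | | | | | | |
|---|---|---|---|---|---|---|
| $i_2$ | 0 | 0 | $\frac{1}{2}$ | 0 | 0 | $\frac{1}{2}$ |
| $i_3$ | 0 | 1 | 0 | 0 | 0 | 0 |
| $i_4$ | $\frac{1}{2}$ | 0 | 0 | $\frac{1}{2}$ | 0 | 0 |
| $i_6$ | 0 | 0 | 0 | 0 | 1 | 0 |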

To summarise, the Markov Chain-based TPM infers similarity among items inside each session independently of other sessions. Please note that the Markov Chain-based TPM similarity is more effective with smaller time window sizes $w$, which makes it more suitable for capturing short-term user preferences.

### 5.3. Recommendation List Creation

Our recommender module provides recommendations based on the Markov Chain-based TPM presented in the previous section. For each target user $u$, the recommender checks the set of her recently viewed items $I_{t_p,u}$ (i.e., the ones she has interacted with in the current time period $t_p$) and computes $K_i$, which is the set of the $k$ nearest items to each item $i$ that belongs to $I_{t_p,u}$. Next, for each target user $u$ in $t_p$ and for each item $j$ we compute a ranking score $score(t_p,u,j)$ as follows:

$$score(t_p,u,j)=\sum_{i\in I_{t_p,u}}TPM(i,j) \cdot 1_{(j,K_i)}\qquad\text{(4)}$$

where $1_{(j,K_i)}$ is an indicator function that is equal to $1$ if the item $j$ is present within the $k$ nearest neighbors of item $i$, and $0$ otherwise. Moreover, $TPM(i,j)$ is a function that returns a similarity score for two items $i$ and $j$ based on the Markov Chain-based TPM (transition probability matrix) that we computed in the previous section. Then, for each user we sort the items by decreasing score and recommend to her the top-N ones.
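The scoring and ranking step can be sketched as below. Two points are assumptions of this sketch rather than statements from the paper: the $k$ nearest items $K_i$ are taken to be the $k$ items with the highest transition probability from $i$, and ties in the final ranking are broken alphabetically. The example matrix and the function name `recommend` are hypothetical.

```python
import heapq

def recommend(tpm, recent_items, k=2, top_n=5):
    """Sum TPM(i, j) over recently viewed items i, restricted to the k nearest items of i."""
    scores = {}
    for i in recent_items:
        row = tpm.get(i, {})
        # K_i: assumed here to be the k items with the largest transition probability from i.
        neighbours = heapq.nlargest(k, row.items(), key=lambda kv: kv[1])
        for j, p in neighbours:
            scores[j] = scores.get(j, 0.0) + p
    # Sort by decreasing score; break ties by item name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [j for j, _ in ranked[:top_n]]

# Hypothetical transition probabilities, stored as {source item: {target item: probability}}.
example_tpm = {
    "i2": {"i3": 0.5, "i8": 0.5},
    "i3": {"i8": 1.0},
}

top_items = recommend(example_tpm, ["i2", "i3"], k=2, top_n=5)
print(top_items)
```

In this toy input, `i8` collects evidence from both recently viewed items and therefore ranks above `i3`. The sketch does not filter out items the user has already viewed; whether to do so is a design choice the text leaves open.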

## 6. Experimental Evaluation

In this Section, we will perform off-line and on-line evaluation of our proposed method together with other baselines and state-of-the-art comparison partners.

### 6.1. Offline Evaluation

#### 6.1.1. Data Set Characteristics

For the offline evaluation, the data set was collected over 2 weeks in May 2018 (18/5/2018-1/6/2018). It contains 1,146,452 interactions/events on 63,897 videos by 18,447 unique users. The interactions of each session are logged with the following information: the user session’s identifier, the interaction’s time stamp and duration, and the tv program’s textual content. As shown in Figure 4(a), fifty percent (50%) of user sessions in the offline data set have only one video interaction. However, as can be seen in Figure 4(b), most users interact with two videos before leaving the service in the online evaluation scenario. As will be explained later, this happens because user satisfaction increased thanks to the new, better recommendation algorithm in the A/B testing phase. Please note that the data set used in the offline evaluation went through a cleaning procedure, which consisted of removing the sessions that contain only one video interaction, as no recommendations can be tested on such sessions and no video co-occurrence patterns can be identified to build a model. Detailed general statistics of the data sets used for the offline and the online evaluation are summarized in Table 3.

Please note that for the offline evaluation, the average number of interactions per session is 3.6, which seems adequate for building a recommendation model.

#### 6.1.2. Prequential Evaluation Protocol

In this Section, we present the evaluation protocol used for the offline evaluation, which follows the one introduced by Jannach et al. [12, 20] for predicting the next item inside a session, also known as prequential evaluation in stream mining [24, 32].

As shown in Figure 5, the time frame of the available data is split into $N_t$ equal time periods $t_p$, and the data are split so that each period contains only the sessions that were made during that time period. We later use this splitting to aggregate the evaluation results for each time period. Please note that $t_p$ is the size of the future horizon that we try to predict. In the next section, we try to identify which future horizon we can predict most effectively (e.g. some hours or some days into the future).

We also use parameter $w$ to specify the window size on which the model is trained, which is defined as “Train data” in Figure 5. Parameter $w$ controls how far back into the past we go to exploit information. Please note that if $w$ is too large, the system is not sensitive to changes (concept drifts). If it is too small, there is not enough data to build a model that predicts the next items in a session. In the test phase we also use a parameter $v$ to specify how many views of the currently evaluated session are revealed before a recommendation is made. As shown in the “Test phase” rectangle in Figure 5, after the first prediction is tested, the data point is added to the model, and the process repeats as long as there are data points left in the session to be tested. Finally, we evaluate the precision (i.e., the number of hits divided by the number of recommended items) we get when we recommend the top-5 videos for each next-item prediction inside a session.
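The test-then-train loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the model is a plain first-order transition counter rather than the full method, the sliding window $w$ is left out (the training sessions are assumed to be the ones already inside the window), and the function name and the toy sessions are hypothetical.

```python
def prequential_precision(train_sessions, test_sessions, v=1, top_n=5):
    """Test-then-train evaluation: predict each next item, score it, then learn from it."""
    counts = {}  # counts[a][b] = how often item b directly followed item a

    def update(a, b):
        counts.setdefault(a, {})
        counts[a][b] = counts[a].get(b, 0) + 1

    def recommend(last_item):
        row = counts.get(last_item, {})
        return sorted(row, key=lambda j: (-row[j], j))[:top_n]

    for session in train_sessions:
        for a, b in zip(session, session[1:]):
            update(a, b)

    hits = recommended = 0
    for session in test_sessions:
        # The first v views are revealed; every later item is predicted in turn.
        for pos in range(v, len(session)):
            recs = recommend(session[pos - 1])
            hits += session[pos] in recs
            recommended += len(recs)
            update(session[pos - 1], session[pos])  # add the tested data point to the model
    return hits / recommended if recommended else 0.0

# Hypothetical toy data.
train = [["a", "b", "c"], ["a", "b", "d"]]
test = [["a", "b", "c"]]
precision = prequential_precision(train, test, v=1, top_n=5)
print(round(precision, 3))
```

In this sketch precision divides the hits by the number of items actually recommended, so a prediction that returns fewer than `top_n` items contributes a smaller denominator; dividing by a fixed `top_n` instead would be an equally plausible reading of the definition.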

#### 6.1.3. Sensitivity Analysis of the Proposed Method

In this Section, we study the accuracy of the Markov Chain-based Transition Probability Matrix (TPM). We explore how the precision of this method changes as we vary different parameters: (i) different time period splits: $N_t=14$ time periods of $24$ hours, $28$ time periods of $12$ hours, $55$ time periods of $6$ hours, and $331$ time periods of $1$ hour; (ii) various time window sizes: $w = 1, 3, 9, 12, 15, 18, 21$. For all experiments we provide the top-5 recommended items.

As shown in Figure 6(a), when we split the timeline of the data (i.e., the 14 days) into different time periods, recommendation accuracy changes drastically. As discussed earlier, parameter $N_t$ controls how big the future horizon (that we try to predict) is. With this experiment, we want to identify how short or long this future horizon should be to achieve the best recommendation accuracy. Of course, this is related to the life span of the items and to how often users re-appear in the system or change their preferences. With the time window size set to $w=1$, when $N_t=331$ and $t_p=1$ hour, the precision we get when we recommend 5 items (i.e., precision@5) is less than $0.04$. When $N_t=28$ and $t_p=12$ hours, we get the best precision ($0.051$). Our prediction accuracy then drops again for $N_t=14$ and $t_p=24$ hours. Henceforth, we set $N_t=28$ and $t_p=12$ hours for the remaining experiments, which means that we predict best when we try to predict the next 12 hours.

#### 6.1.4. Comparison with Other Methods

In this Section, we compare our Markov Chain-based TPM algorithm with the following baselines and state-of-the-art comparison partners, which are representatives of different algorithmic families such as collaborative filtering (IBCF), session-based filtering (Session-knn), GRU4Rec and hybrid collaborative with content-based filtering (Cat-TPM).

(i) Recently Most Popular Items (Recently POPULAR): The Recently POPULAR baseline recommends the top-N most clicked videos of the active/valid time period $t_p$.

(ii) Item-based Collaborative Filtering (IBCF) [2]: Based on IBCF, two items are considered similar if they are selected by similar users. In [2], IBCF considers the co-visitation count of news articles, which counts the number of times an item was co-visited (clicked before or after) with another item.

(iii) Session-knn [13]: The Session-knn method takes the set of user actions in the current session, e.g. two view events for certain items, and in a first step determines the $k$ most similar past sessions in the training data. Then, given the current session $s$, the set of $k$ nearest neighbors $N_s$, and a function $sim(s_1,s_2)$ that returns a similarity score for two sessions $s_1$ and $s_2$, the score of a recommendable item $i$ is:

$$score_{KNN}(i,s)=\sum_{n\in N_s}sim(s,n)\times 1_n(i)\qquad\qquad\text{(5)}$$

where $1_n(i)=1$ if $n$ contains $i$ and $0$ otherwise. The similarity measure used by Jannach et al. [13] in their experiments is cosine similarity, as the best results were found to be achieved when encoding sessions as binary vectors of the item space.

(iv) GRU4Rec [9]: GRU4Rec is a neural network-based recommender system that uses Gated Recurrent Units (GRU) and learns when and how much to update the hidden state of the GRU model. In particular, GRU4Rec is a recurrent neural network that modifies the basic GRU to fit the prediction task better by introducing session-parallel mini-batches, mini-batch output negative sampling and a pairwise ranking loss function.

(v) Category-based TPM (Cat-TPM) [18]: Based on Cat-TPM, when a user selects two videos in a row, a transition from the category of the first video to the category of the second video is recorded. Cat-TPM combines the content-based and the collaborative filtering methods used to generate the personalized Google news recommendations [18].
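The Session-knn score from item (iii) can be sketched as below, using cosine similarity over binary item vectors, which for sets reduces to the overlap size divided by the square root of the product of the set sizes. The past sessions, the value of `k` and the function names are hypothetical illustrations and are not taken from the cited implementation.

```python
from math import sqrt

def cosine(s1, s2):
    """Cosine similarity of two sessions encoded as binary vectors over the item space."""
    a, b = set(s1), set(s2)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def session_knn_scores(current, past_sessions, k=2):
    """score(i, s): sum of sim(s, n) over the k nearest past sessions n that contain item i."""
    neighbours = sorted(past_sessions, key=lambda n: cosine(current, n), reverse=True)[:k]
    scores = {}
    for n in neighbours:
        sim = cosine(current, n)
        for item in set(n):
            scores[item] = scores.get(item, 0.0) + sim
    return scores

# Hypothetical current session and past sessions.
current = ["i2", "i3"]
past = [["i2", "i3", "i8"], ["i4", "i6", "i7"], ["i2", "i9"]]

knn_scores = session_knn_scores(current, past, k=2)
print(sorted(knn_scores.items()))
```

Unlike the transition-based score, this measure ignores the order of items inside a session, which is the property the comparison in this section turns on.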

The parameters we used to evaluate the performance of all the aforementioned comparison partners are in accordance with those reported in the original papers and for our data set were tuned so as to get the best results for these methods.

Figure 7 reports the average precision of the compared algorithms over the two weeks, for $N_t=28$ time splits, $t_p=12$ hours and sliding time window $w=18$ on the offline data set. We ran experiments with top-5 recommended tv programs. The reported results were tested for the difference of means between the Markov Chain-based TPM and each of the remaining comparison partners and were found statistically significant based on a one-sided t-test at the 0.05 level. As shown in Figure 7, our proposed approach has the best average precision over the 2 weeks among all comparison partners. This happens because we take into consideration the sequence of the items clicked inside a user session. This allows the Markov Chain-based TPM to better capture the recency of the user’s interest (i.e. short-term preferences).

As far as the remaining comparison partners are concerned, the “Recently POPULAR” baseline attains the worst results in terms of precision, since it cannot provide personalized recommendations. As expected, Session-knn attains better results than GRU4Rec, as already reported in [14]. IBCF does not attain very good results because there are not enough data to build its prediction model. That is, many users re-appear irregularly and very rarely, at timepoints that are far apart, which means that collaborative filtering cannot always build a model, since users should appear in two consecutive time slots. In other words, IBCF should be better at capturing the long-term preferences of a user and is not so effective for items with a short life span such as news stories, tv programs, etc. We analyze this further in the discussion section. Moreover, as expected, Session-knn is far worse than the Markov Chain-based TPM, because it cannot adequately capture the latent associations among items inside the same session. The Cat-TPM, which tries to combine collaborative filtering with content-based filtering (i.e., the category of the video is used), is also very ineffective, probably because it is able to build only the long-term profile of users but fails to identify their most recent short-term intentions. Please note that in our experiments we also tested a combination of our Markov Chain-based TPM with the Cat-TPM, but this surprisingly did not result in better recommendation accuracy.


### 6.2. 在线评估 Online Evaluation

The online A/B testing experiment was performed over 30 days between June and July 2019 (5/6/2019-4/7/2019). From the offline evaluation, we selected the two best single (non-hybrid) methods in terms of precision, which were our Markov Chain-based TPM and the IBCF method. We wanted to check their performance in a live experiment, to understand the user experience with our personalized tv program recommendations. In particular, we conducted experiments on a fraction of the live traffic at YOUTV, similarly to the procedure followed in the personalized Google News recommendations work [18]. The users were randomly assigned to a control group and a test group of about the same size. In total, 22,375 users had 2,630,008 interactions on 132,895 videos, as can be seen from Table 3. Please note that the engagement of users with the service increased drastically, since the average number of interactions per session rose to 7.24 from 3.6 in the previous year. That is, at least half of the users (i.e., the test group) in the online evaluation increased their interaction with the service due to the existence of a new recommendation algorithm. As can be seen in Figure 4(b), users interact with two videos before leaving the service. In addition, the average number of interactions per session (+3.64) and per user (+55.39) markedly increased (almost doubled) compared to the offline data set, as can also be seen in Table 3.

The evaluation protocol was the following: when a logged-in YOUTV user (who had also explicitly enabled web history) visited the website, we recommended tv programs to them. In our experiment, the users in the control group received video recommendations from the existing collaborative filtering method (i.e., the IBCF method), while the new Markov Chain-based TPM provided the recommendations for the test group. The metric used to measure the performance of both recommendation algorithms is the click-through rate (CTR), which is the ratio of the number of clicks to the number of views of the recommended tv programs. We calculated the CTR for each user on a daily basis. The performance of the control and test groups was derived by averaging the measurements of all the users in the corresponding group.

The CTR of the recommended videos is computed from the clicks on the recommended videos each time the user visits the YOUTV website. It directly measures the quality of the recommendations in terms of how many of them are clicked on, and thus liked, by the user. Figure 8 shows the CTR of the recommended videos for the control and test groups over the 30 days. The values are scaled so that the CTR of the control group on the first day is 1. As shown in the figure, the CTR in the test group is clearly higher than the CTR in the control group. This demonstrates that the proposed Markov Chain-based TPM recommendation method considerably improved the quality of the video recommendations. On average, the Markov Chain-based TPM method improves the CTR over the pre-existing collaborative IBCF method by 93.61% in this real-world setting. The difference of CTR means between the Markov Chain-based TPM and IBCF was tested and found statistically significant based on a one-sided t-test at the 0.05 level. This is very strong evidence for the improvement due to the exploitation of the sequential interactions of users.
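The CTR protocol described above (per-user daily CTR, averaged per group, scaled so that the control group starts at 1) can be sketched as follows. The log format, the function names, and the toy numbers are hypothetical, and computing the relative improvement from the mean of the daily group CTRs is an assumption about how the reported average was obtained.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical log row: (user_id, day, views_of_recommendations, clicks_on_recommendations)
Row = Tuple[str, int, int, int]


def daily_group_ctr(rows: Iterable[Row]) -> Dict[int, float]:
    """Per-user daily CTR (clicks / views), averaged over the users of one group for each day."""
    per_day: Dict[int, List[float]] = defaultdict(list)
    for _user, day, views, clicks in rows:
        if views > 0:  # a user who saw no recommendations has no defined CTR
            per_day[day].append(clicks / views)
    return {day: sum(vals) / len(vals) for day, vals in per_day.items()}


def relative_improvement(test: Dict[int, float], control: Dict[int, float]) -> float:
    """Percentage improvement of the mean test CTR over the mean control CTR."""
    mean_test = sum(test.values()) / len(test)
    mean_control = sum(control.values()) / len(control)
    return 100.0 * (mean_test - mean_control) / mean_control


control = daily_group_ctr([("u1", 1, 10, 1), ("u2", 1, 20, 2), ("u1", 2, 10, 2)])
test = daily_group_ctr([("u3", 1, 10, 2), ("u4", 1, 10, 2), ("u3", 2, 10, 4)])

# Scale both curves so that the control group's CTR on day 1 equals 1.
scaled_control = {day: ctr / control[1] for day, ctr in control.items()}
scaled_test = {day: ctr / control[1] for day, ctr in test.items()}
improvement = relative_improvement(test, control)
```

Scaling by the control group's first-day CTR hides the absolute click rates while keeping the ratio between the two groups intact, which is all that the relative improvement depends on.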

## 7. 讨论 Discussion

In this section, we discuss the challenges of evaluating recommendations with an offline and an online evaluation protocol. We used the offline evaluation to select the two best recommendation algorithms, along with their parameters, that could most likely generate good recommendations in a live experiment. In contrast to what is reported in [5], in our case we have verified that our method is better than IBCF in both the online and the offline experiments. Thus, based on our own findings, the offline experiments can reflect the relative performance of these two techniques. Of course, not many researchers have access to the log server of a real-life company. To solve this, NewsReel [15] enabled an offline evaluation scenario similar to the online one by representing the data as a stream, which allows researchers to “replay” it offline. In other words, researchers can emulate the server used in online scenarios and carry out A/B tests for the news recommendation task.

In our live experiments we used the CTR metric, whereas in our offline evaluation we used the precision metric. Generally, the two metrics are considered similar, but of course there are differences. The main difference is that precision is computed for a specific number of recommended items, whereas the CTR has no such parameter and instead uses the number of video views in the denominator. This means that precision generally drops as we increase the number of top-N recommended items, whereas the CTR shows no such trend; the acquired results therefore differ, but they should follow the same relative trends. The CTR is used in industry because it is related to the revenue generated by online advertisement (i.e., pay per click over views), which is a way to justify the performance of a recommender system through its generated revenue or other engagement objectives such as user clicks, user likes, and positive ratings. For example, to measure the engagement of users in social media (e.g., LinkedIn, Facebook, etc.) with a business's post, we simply compute the ratio of the sum of comments, clicks, and likes on this post to the number of impressions.

One of the important findings of this work is that the Markov Chain-based TPM, which exploits information from sessions, proved to be very effective in capturing the short-term preferences of the users, whereas IBCF is mainly adequate for capturing long-term user preferences. The fact that in our case study the items have a very short life span (e.g., news and tv programs) makes the Markov Chain-based TPM more suitable for capturing these short-term dynamics. Of course, for cases where items have a longer life span, such as movies or books, capturing the long-term and short-term dynamics together may be more effective. As discussed in Section 3, the intra-session Markov Chain-based item similarity reveals the short-term preference of the user. From this standpoint, inter-session item similarity could probably also help in finding item similarities when sessions have a very small number of item interactions (i.e., a low average number of item interactions per session) and could capture the long-term user preferences. In a future extended version of this paper, we plan to also consider the inter-session item similarity.

Running online experiments on a real-world system with real users in real time is extremely complicated and tricky to configure and interpret. One could argue that the offline data set does not record what else the user saw on the screen when deciding on the next item. For example, if an existing system was providing recommendations on the screen, this could have biased the data depending on “what” was recommended and “where” on the user's screen it was shown (e.g., center bias). Moreover, one could argue that the online experiments may have been affected by confounding factors, such as latency differences between the recommendation algorithms or pre-existing differences in engagement between the two user groups under comparison. We should mention that the interface used to present the recommended items is shown in Figure 1, which depicts the items that are similar to the target item the user interacts with.

Another challenge for the offline evaluation of recommender systems is choosing the size of the future horizon for our predictions and the size of the time window, i.e., how far back into the past we should go to gather information for training our models. In this work, we have seen some evidence that these thresholds are very sensitive to the items' life span and to how often users re-appear in the system. However, further investigation is needed to confirm this correlation. In any case, these factors should be taken into consideration when further optimising the recommendation accuracy.
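The interplay between the prediction horizon and the training window can be made concrete with a small sliding-split sketch. It assumes that both the horizon and the window are expressed in hours and that timestamps are plain numbers; the event format, the function name, and the toy data are hypothetical and not taken from the paper's protocol.

```python
from typing import List, Tuple

# Hypothetical event: (timestamp in hours, session_id, item_id)
Event = Tuple[float, str, str]


def sliding_splits(events: List[Event], start: float, n_splits: int,
                   horizon: float, window: float) -> List[Tuple[List[Event], List[Event]]]:
    """Return (train, test) pairs: train on the last `window` hours, test on the next `horizon` hours."""
    splits = []
    for k in range(n_splits):
        t = start + k * horizon  # boundary between past and future for split k
        train = [e for e in events if t - window <= e[0] < t]
        test = [e for e in events if t <= e[0] < t + horizon]
        splits.append((train, test))
    return splits


# Toy log: one event per hour for two days.
log = [(float(h), f"s{h // 6}", f"item{h % 4}") for h in range(48)]
splits = sliding_splits(log, start=18.0, n_splits=2, horizon=12.0, window=18.0)
```

A shorter window keeps only recent behaviour, which suits short-lived items, but leaves less data for methods that need users or items to re-appear across time slots.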

Our method is content agnostic and can be applied to different recommendation domains, e.g., news recommendations in social media or even medicine recommendations for patients based on their Electronic Health Records (EHRs). Please note that for each domain our method only needs to sequentially process the user-item interactions. For example, in healthcare we can process the therapeutics (sessions) of patients (users) to consider the medicines (items) that were prescribed in each therapeutic.
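The content-agnostic nature of the approach can be illustrated with a generic first-order Markov chain over item sequences. This is a minimal sketch, not the exact formulation of Section 3: it simply row-normalises the counts of consecutive item pairs inside each session, and the function names and placeholder item identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_tpm(sessions: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Transition probabilities estimated from consecutive item pairs inside each session."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    tpm: Dict[str, Dict[str, float]] = {}
    for prev, row in counts.items():
        total = sum(row.values())
        tpm[prev] = {nxt: c / total for nxt, c in row.items()}  # each row sums to 1
    return tpm


def recommend_next(tpm: Dict[str, Dict[str, float]], last_item: str, n: int = 5) -> List[str]:
    """Top-n successors of the last item, ordered by transition probability."""
    row = tpm.get(last_item, {})
    return sorted(row, key=lambda item: row[item], reverse=True)[:n]


# Sessions only need to be ordered lists of item identifiers, whatever the domain.
sessions = [["med_a", "med_b"], ["med_a", "med_b", "med_c"], ["med_a", "med_c"]]
tpm = build_tpm(sessions)
top = recommend_next(tpm, "med_a", n=1)
```

Nothing in the sketch inspects what an item is, which is why the same code would run unchanged on tv programs, news articles, or prescription sequences.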

## 8. 结论 Conclusion

In this paper, we proposed a Markov Chain-based TPM method to reveal the short-term intentions of individuals. We evaluated our method experimentally and compared it against baselines and state-of-the-art algorithms in an offline and an online evaluation setting. We have shown the superiority of our method over its competitors, specifically for the case where the items' life span is very short (e.g., news and tv programs). As future work, we would like to test how users perceive different explanation styles (e.g., item-based, user-based, session-based, etc.) when they receive an explanation along with a recommendation for reasons of transparency and accountability. Moreover, we plan to extend our algorithm to consider both the intra- and the inter-session item similarities, to deal with the problem of extreme data sparsity in user sessions.

