SIGIR 2020 paper notes: How to Retrain Recommender System? A Sequential Meta-Learning Approach

Introduction

Recommender systems play an increasingly important role in the current Web 2.0 era, which faces serious information overload issues. The key technique in a recommender system is the personalization model, which estimates the preference of a user on items based on the historical user-item interactions [14, 33]. Since users keep interacting with the system, new interaction data is collected continuously, providing the latest evidence on user preference. Therefore, it is important to retrain the model with the new interaction data, so as to provide timely personalization and avoid being stale [36]. With the increasing complexity of recommender models, it is technically challenging to apply real-time updates on the models in an online fashion, especially for expressive but computationally expensive deep neural networks [13, 26, 43]. As such, a common practice in industry is to perform model retraining periodically, for example, on a daily or weekly basis. Figure 1 illustrates the model retraining process.

Intuitively, the historical interactions provide more evidence on user long-term (e.g., inherent) interest, and the newly collected interactions are more reflective of user short-term preference. To date, three retraining strategies are most widely adopted, depending on the data utilization:

• Fine-tuning, which updates the model based on the new interactions only [35, 41]. This approach is memory- and time-efficient, since only new data needs to be handled. However, it ignores the historical data that contains long-term preference signal, and thus can easily cause overfitting and forgetting issues [6].
• Sample-based retraining, which samples historical interactions and adds them to the new interactions to form the training data [6, 42]. The sampled interactions are expected to retain long-term preference signal, so they need to be carefully selected to be representative. In terms of recommendation accuracy, it is usually worse than using all historical interactions, due to the information loss caused by sampling [42].
• Full retraining, which trains the model on the whole data, including all historical and new interactions. This method costs the most resources and training time, but it provides the highest model fidelity since all available interactions are utilized.
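The difference between the three strategies is only in which interactions reach the trainer. A minimal sketch, assuming interactions are stored as (user, item) pairs; the function name, the `sample_ratio` argument, and the uniform sampling are illustrative assumptions, not part of the paper (sample-based methods usually select interactions more carefully):

```python
import random

def build_training_set(strategy, history, new_data, sample_ratio=0.1, seed=0):
    """Return the interactions used for retraining under one strategy."""
    if strategy == "fine_tuning":
        # Only the newly collected interactions are used.
        return list(new_data)
    if strategy == "sample_based":
        # A (here: uniformly drawn) subset of history is mixed with the new data.
        rng = random.Random(seed)
        k = int(len(history) * sample_ratio)
        return rng.sample(history, k) + list(new_data)
    if strategy == "full":
        # Everything collected so far is used.
        return list(history) + list(new_data)
    raise ValueError(f"unknown strategy: {strategy}")

history = [(u, i) for u in range(10) for i in range(10)]  # 100 old interactions
new_data = [(u, 100 + u) for u in range(10)]              # 10 new interactions

sizes = {s: len(build_training_set(s, history, new_data))
         for s in ("fine_tuning", "sample_based", "full")}
print(sizes)  # {'fine_tuning': 10, 'sample_based': 20, 'full': 110}
```

The training-set sizes make the cost trade-off visible: full retraining grows with the whole history, while fine-tuning stays proportional to one period of data.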
While the above three strategies have their pros and cons, we argue that a key limitation is that they lack an explicit optimization towards the retraining objective, i.e., the retrained model should serve well for the recommendations of the next time period. In practice, user interactions of the next time period provide the most important evidence on the generalization performance of the current model, and are usually used for model selection or validation. As such, an effective retraining method should take this objective into account and formulate the retraining process towards optimizing the objective, which is a much more principled way than manually crafting heuristics to select data examples [6, 35, 40, 42].
In this work, we explore the central theme of model retraining in recommendation, a topic that has high practical value in industry recommender systems but has received relatively little scrutiny in research. Although full model retraining provides the highest fidelity, we argue that it is not necessary. The key reason is that the historical interactions have already been used in the previous training, which means the model has already distilled the “knowledge” from the historical data. If there is a way to retain the knowledge well and transfer it to the training on new interactions, we should be able to keep the same performance level as the full retraining, even though we do not use the historical data during model retraining. Furthermore, if the knowledge transfer is “smart” enough to capture more patterns, for example that recent data is more reflective of near-future performance, we even have the opportunity to improve over the full retraining in recommendation accuracy.
To this end, we propose a new retraining method with two major considerations: (1) building an expressive component that transfers the knowledge gained in previous training to the training on new interactions, and (2) optimizing the transfer component towards the recommendation performance in the near future. To achieve the first goal, we devise the transfer component as a convolutional neural network (CNN), which takes the previous model parameters as constant input and the present model as trainable parameters. The rationale is that the knowledge gained in previous training is condensed in the model parameters, such that an expressive neural network should be able to distill the knowledge towards the desired purpose. To achieve the second goal, in addition to normal training on newly collected interactions, we further train the transfer CNN on the future interactions of the next time period. As such, the CNN can learn how to combine the old parameters with the present parameters, with the objective of predicting the user interactions of the near future.
The whole architecture can be seen as an instance of meta-learning [9]: the retraining of each time period is a task, which has the new interactions of the current period as the training set and the future interactions of the next period as the testing set. By learning to train historical tasks well, we expect the method to perform well for future tasks. Since our meta-learning mechanism operates on sequential data, we name it Sequential Meta-Learning (SML).
The main contributions of this work are summarized as follows:
• We highlight the importance of recommender retraining research and formulate the sequential retraining process as an optimizable problem.
• We propose a new retraining approach that is 1) efficient, by training on new interactions only, and 2) effective, by optimizing for the future recommendation performance.
• We conduct experiments on two real-world datasets, Adressa (news) and Yelp (business). Extensive results demonstrate the effectiveness and rationality of our method.

Problem formulation

In real-world recommender systems, user interaction data streams in continuously. To keep the predictive model fresh with recent data, a common choice is to retrain the model periodically. We represent the data as {D0, D1, …, Dt, Dt+1, …}, where Dt denotes the data newly collected in time period t. Assume each retraining is triggered right after Dt is collected. A period can be any length of time, e.g., a day, a week, or until a certain number of interactions are collected, depending on the system requirements and implementation ability.
In the retraining of time period t, the system has access to all previous data, i.e., {D0, D1, …, Dt−1}, and the new data Dt. Since the retrained model is used to serve the near future, it is reasonable to judge its effectiveness based on Dt+1, the data collected in the next time period. As such, we set the recommendation performance on Dt+1 as the generalization goal of the t-th period retraining. Let the model parameters after the t-th period retraining be Wt. We treat each retraining as a task, formulating it as:
$$(\{D_m : m \le t\},\; W_{t-1}) \xrightarrow{\text{get}} W_t \xrightarrow{\text{test}} D_{t+1} \quad (1)$$
That is, based on all accessible data at the time of retraining and the model parameters of the previous retraining, we aim to get a new set of model parameters that can perform well on the near-future data Dt+1. The most widely used solution in industry is to perform a full retraining on the whole data with Wt−1 as the initialization. This solution is straightforward to implement. However, the drawback is that it consumes a large amount of computation resources, takes a relatively long retraining time, and requires the computation power to be enlarged as time goes by. Another limitation is that the full retraining lacks explicit optimization for the performance on Dt+1. This is non-trivial to address, since directly using Dt+1 in training would cause information leakage and hurt the generalization ability.
In this work, we aim to utilize only the newly collected data Dt plus the previous model parameters Wt−1, so as to pursue a good retrained model as evaluated on Dt+1. Thus we reformulate the retraining process as:
$$(D_t,\; W_{t-1}) \xrightarrow{\text{get}} W_t \xrightarrow{\text{test}} D_{t+1} \quad (2)$$
which we denote as the task Tt. For T0, the previous model parameters are just a random initialization. A straightforward solution is to perform stochastic gradient descent (SGD) updates on Dt with Wt−1 as initialization. However, it is easy to encounter the forgetting issue of user long-term interest, since the effect of the initialization weakens with more updates. Moreover, this solution also lacks an optimization scheme towards serving Dt+1.
Distinct from the definition of task in standard meta-learning [9, 21], the tasks here naturally form a sequence {T0, …, Tt, Tt+1, …}. In online serving (testing), we can move to Tt+1 only after Tt has been completed. As such, the offline training should follow a similar sequential manner to ensure the method can generalize well in future serving. Lastly, addressing the problem can be seen as an instance of meta-learning, since the learning target is how to solve the tasks well (i.e., with a good generalization ability on future tasks), which is a higher-level problem than simply learning model parameters on Dt.
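The sequential task structure can be sketched as a loop in which each task trains on one period and is judged on the next. The helper names (`make_tasks`, `run_sequentially`) and the toy "model" (a single number tracking the period mean) are hypothetical and serve only to show the data flow:

```python
def make_tasks(periods):
    """Pair each period D_t (training set) with D_{t+1} (testing set)."""
    return [(periods[t], periods[t + 1]) for t in range(len(periods) - 1)]

def run_sequentially(periods, retrain, evaluate, w_init):
    """Solve T_0, T_1, ... in order; task T_t starts from the weights of T_{t-1}."""
    w, scores = w_init, []
    for train, test in make_tasks(periods):
        w = retrain(w, train)             # uses only D_t and W_{t-1}
        scores.append(evaluate(w, test))  # judged on D_{t+1}
    return w, scores

def mean(xs):
    return sum(xs) / len(xs)

# Toy setup: the "model" is one number; retraining moves it halfway to the new mean.
periods = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
retrain = lambda w, d: 0.5 * w + 0.5 * mean(d)
evaluate = lambda w, d: abs(w - mean(d))  # absolute error on the next period

w_final, errors = run_sequentially(periods, retrain, evaluate, w_init=0.0)
print(w_final, errors)  # 1.25 [1.5, 1.75]
```

Three periods yield two tasks, because the last period has no following period to test on.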

Method

First, we present the model overview for solving the task Tt, the core of which is to design a transfer component that effectively converts the old model Wt−1 into a new model Wt. Then, we elaborate our design of the transfer. Next, we discuss how to train the model to achieve good performance on the current data Dt as well as good generalization to the future data Dt+1. Lastly, we demonstrate how to instantiate our generic method on matrix factorization, one of the most classic and representative models for collaborative filtering.

Model overview

We aim to solve the task Tt defined in Equation (2), which leverages only the new data Dt to achieve a comparable or even better performance than the full retraining. The belief is that the past data {D0, …, Dt−1} have been seen in previous training, such that the “knowledge” useful for recommendation has been gained and stored in the model parameters Wt−1. Another consideration is to make our method technically applicable to many recommender models, rather than a specific one.
To this end, we design a generic model framework, as illustrated in Figure 2. It has three components: 1) Wt−1 represents the previous recommender model that is trained from past data, 2) Ŵt denotes a new recommender model that needs to be learned from the current data Dt, and 3) Transfer is the module that combines the “knowledge” contained in Wt−1 and Ŵt to form a new recommender model Wt, which is used for serving next-period recommendations. In the t-th period retraining, Wt−1 is set as constant input, and the retraining consists of two main steps:
1. Obtaining Ŵt, which is expected to contain useful signal for recommendation from Dt. This step can be done by optimizing a standard recommendation loss, denoted as Lr(Ŵt | Dt).
2. Obtaining Wt, which is the output of the transfer module:
$$W_t = \mathrm{Transfer}(W_{t-1},\, \hat{W}_t)$$
In this framework, Wt−1 and Ŵt can be any differentiable recommender model, as long as they are of the same architecture (i.e., the parameter number and semantics are the same). Only the transfer component needs to be carefully designed, which is our contribution to be introduced next.
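The two steps can be sketched with a toy linear scorer standing in for the recommender. Everything concrete here is an assumption made for illustration: the squared-error loss as a stand-in for Lr, the zero initialization of Ŵt, and the plain averaging passed in as the transfer.

```python
import numpy as np

def fit_new_model(w_init, X, y, lr=0.1, steps=200):
    """Step 1: learn W_hat_t from D_t = (X, y) by gradient descent on a squared error."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def retrain_period(w_prev, X, y, transfer):
    """One period of retraining: W_{t-1} is a constant input and is never modified."""
    w_hat = fit_new_model(np.zeros_like(w_prev), X, y)  # step 1: obtain W_hat_t
    return transfer(w_prev, w_hat)                      # step 2: obtain W_t

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # current-period data generated by "true" weights
w_prev = np.array([0.8, -1.5, 0.0])  # parameters kept from the previous period

w_t = retrain_period(w_prev, X, y, transfer=lambda a, b: 0.5 * a + 0.5 * b)
print(w_t.round(3))
```

Because the transfer is passed in as a callable, the same skeleton works for any combination rule that returns parameters of the same shape.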

Transfer Design

Functionally speaking, the transfer combines the parameters Wt−1 and Ŵt to form a new group of parameters Wt. As the most basic requirement, Wt needs to be of the same shape as Wt−1 and Ŵt. This requirement can be easily satisfied by operations like the weighted sum:
$$W_t = \alpha W_{t-1} + (1 - \alpha)\, \hat{W}_t$$
where α is the combination coefficient, which can be either pre-defined or learned. The method is simple to interpret, as it pays different attention to the previously and currently trained knowledge; it is also easy to train, since few parameters are introduced. However, it has limited representation ability; for example, it cannot account for the relations between different dimensions of the parameters.
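A minimal sketch of the weighted-sum transfer, assuming the convex-combination form Wt = α·Wt−1 + (1 − α)·Ŵt; the function name is hypothetical:

```python
import numpy as np

def weighted_sum_transfer(w_prev, w_hat, alpha):
    """Combine old and new parameters; the output keeps the input shape."""
    return alpha * w_prev + (1.0 - alpha) * w_hat

w_prev = np.array([[1.0, 2.0], [3.0, 4.0]])  # W_{t-1}
w_hat = np.array([[3.0, 2.0], [1.0, 0.0]])   # W_hat_t

w_t = weighted_sum_transfer(w_prev, w_hat, alpha=0.25)
print(w_t)
# [[2.5 2. ]
#  [1.5 1. ]]
```

Each output entry depends only on the two input entries at the same position, which is exactly the limitation noted above: no information flows between different dimensions.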
For expressiveness of the transfer, a multi-layer perceptron (MLP) can be another option:
$$W_t = \mathrm{MLP}(W_{t-1} \,\|\, \hat{W}_t)$$
where ∥ denotes concatenation. Despite the universal approximation theorem of MLP [19], it may be practically difficult to train well [1, 13]. Another limitation is that it does not emphasize the interactions between the parameters of the same dimension, which could be important for understanding parameter evolution. As an example, suppose the model is matrix factorization and the parameters are user embeddings. Then the difference Ŵt − Wt−1 represents the parameter change, which can capture interest drift; and each dimension of the product Wt−1 ⊙ Ŵt indicates the importance of that dimension in reflecting both the short-term and long-term interest of the user. However, MLP lacks mechanisms to explicitly capture such patterns.
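To make the contrast concrete, the sketch below applies a one-hidden-layer MLP to the concatenated parameters and then computes the two dimension-wise signals directly. The layer sizes, the ReLU activation, and the random untrained weights are assumptions chosen only to show shapes; nothing here is trained.

```python
import numpy as np

def mlp_transfer(w_prev, w_hat, W1, b1, W2, b2):
    """One-hidden-layer MLP on the concatenation [w_prev, w_hat]; output has size d."""
    x = np.concatenate([w_prev, w_hat])
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return W2 @ h + b2

d, hidden = 3, 8
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(hidden, 2 * d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(d, hidden)), np.zeros(d)

w_prev = np.array([0.9, 0.1, -0.4])  # a user embedding from the previous period
w_hat = np.array([0.8, 0.6, -0.5])   # the same embedding after training on D_t

w_t = mlp_transfer(w_prev, w_hat, W1, b1, W2, b2)  # shape (d,), values are arbitrary

# Dimension-wise signals that the MLP would have to discover on its own:
drift = w_hat - w_prev      # per-dimension change, here largest at index 1
agreement = w_prev * w_hat  # per-dimension product of old and new values
print(drift, agreement)
```

The MLP sees the two vectors only as one flat input of length 2d, so pairing dimension k of the old vector with dimension k of the new one is something it has to learn rather than something built into its structure.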
To this end, we design the transfer component to be capable of not only emphasizing the relation between Wt−1 and Ŵt at each dimension, but also capturing the relations among different dimensions. Inspired by the success of convolutional neural networks (CNN) in capturing local-region features in image processing, we design the transfer based on CNN. The CNN architecture can be found in the green box of Figure 2; it consists of a stack layer, two convolution layers, and a fully connected layer for output.
Next we detail the CNN design. Without loss of generality, we treat Wt−1 and Ŵt as row vectors, denoted as wt−1 and ŵt, respectively, even though their original form can be a matrix or tensor. This makes it easy to perform dimension-wise operations when combining the two models.
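Treating parameters as a row vector is a reshape that can be undone after the transfer. A minimal sketch with hypothetical helper names:

```python
import numpy as np

def to_row_vector(params):
    """Flatten a parameter matrix or tensor; also return the shape needed to restore it."""
    return params.reshape(-1), params.shape

def from_row_vector(vec, shape):
    """Restore the original layout after dimension-wise processing."""
    return vec.reshape(shape)

W = np.arange(6, dtype=float).reshape(2, 3)  # e.g. 2 users with 3-dimensional embeddings
w, shape = to_row_vector(W)
print(w.shape)                               # (6,)
restored = from_row_vector(w, shape)
```

Both models must be flattened in the same order so that position k in wt−1 and position k in ŵt refer to the same parameter.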
Stack layer
This layer stacks wt−1, ŵt, and their element-wise product interaction vector as a 2D matrix, which serves as an “image” to be processed by the later convolution layers. Specifically, we formulate it as:
$$H^0 = \begin{bmatrix} w_{t-1} \\ \hat{w}_t \\ w_\odot \end{bmatrix}$$
where w⊙ is the element-wise product wt−1 ⊙ ŵt after normalization. The product wt−1 ⊙ ŵt captures which dimension values are enlarged or diminished when wt−1 evolves to ŵt. The denominator of w⊙ is used for normalization, and ε = 10^−15 is a small number that prevents the denominator from being zero. The size of H0 is 3 × d, where d denotes the size of wt−1 and ŵt.
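The stack layer can be sketched as follows. The exact normalizer is not recoverable from the text here, so the denominator used below (the product of the two vector norms plus ε) is an assumption; only the 3 × d layout and the role of ε come from the description above.

```python
import numpy as np

def stack_layer(w_prev, w_hat, eps=1e-15):
    """Build the 3 x d input "image" H0 from two flattened parameter vectors."""
    prod = w_prev * w_hat                                        # element-wise product
    norm = np.linalg.norm(w_prev) * np.linalg.norm(w_hat) + eps  # assumed normalizer
    return np.stack([w_prev, w_hat, prod / norm])                # rows: old, new, product

w_prev = np.array([3.0, 4.0])
w_hat = np.array([4.0, 3.0])
H0 = stack_layer(w_prev, w_hat)
print(H0.shape)  # (2 parameters -> shape (3, 2))

# eps keeps the result finite even when both vectors are all zeros.
H0_zero = stack_layer(np.zeros(2), np.zeros(2))
```

With this choice of normalizer, the third row sums to the cosine similarity between the old and new vectors, so its scale does not depend on the magnitude of the embeddings.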