Learning User Interest Model for Content-based Filtering in Personalized Recommendation System

Songjie Gong

Zhejiang Business Technology Institute, Ningbo 315012, China

E-mail: eizolz@163.com

Original: http://www.aicit.org/JDCTA/ppl/JDCTA%20Vol6%20No11_part20.pdf

Abstract

With the emergence and evolution of networks, the amount of information on the Internet has increased greatly. Retrieving useful information from this large volume has become a key technology in the information field. The application of personalized recommendation on the Internet has effectively improved its services, especially e-commerce services. Traditional search engines do not take the interests of different users into consideration, so the results they retrieve cannot satisfy a user’s specific needs. To solve this problem, this paper presents a personalized recommendation system that employs a user interest model for content-based filtering. The paper analyzes five components of the system: document information extraction, document vector representation, user interest model representation, matching algorithms, and user feedback update. This personalized recommendation system can describe a user’s interest type and interest degree well, and can enhance the efficiency of personalized information services.

Keywords: Personalized Service, Information Retrieval, Information Filtering, Recommender System, Content-Based Filtering, User Interest Model

1. Introduction

With the popularization of the Internet and the development of e-commerce, the structure of e-commerce systems becomes more complicated as they provide more and more choices for users [1,2,3]. Recommender systems can alleviate the resulting information overload [4,5,6]. Many personalized recommendation systems have been proposed in various fields. Two main technologies are usually adopted in personalized recommendation systems: content-based filtering and collaborative filtering [7,8,9].

Traditional search engines do not take the interests of different users into consideration, so the results they retrieve cannot satisfy a user’s specific needs [10,11]. To solve this problem, this paper presents a personalized recommendation system that employs a user interest model for content-based filtering. The paper analyzes five components of the system: document information extraction, document vector representation, user interest model representation, matching algorithms, and user feedback update. This personalized recommendation system can describe a user’s interest type and interest degree well, and can enhance the efficiency of personalized information services.

2. Content-based Filtering

There exist two main approaches in information filtering: collaborative and content-based. In collaborative filtering, the system selects and rank-orders items for a user based on the similarity of the user to other users who read/liked similar items in the past. In content-based filtering, the system selects and rank-orders items based on the similarity of the user's profile and the items' profiles.

2.1. Framework of content-based filtering

Content-based filtering has five parts: document information extraction, document vector representation, user interest model representation, matching algorithms, and user feedback update. The framework of content-based filtering is shown in Figure 1.

Figure 1. Framework of content-based filtering

3. Extracting feature of items

To represent a text as a feature vector, feature items must first be extracted from it. Feature extraction selects, from all candidate words, the subset of feature items that best expresses the content of the text. This serves two purposes. First, it improves efficiency: it streamlines processing and increases running speed. Second, the tens of thousands of words in a text collection differ in how much they contribute to its meaning, and widely used common terms contribute little. To improve the accuracy of the recommendation system, weakly expressive words should be removed and an optimal set of feature items selected for each text [12,13,14,15].

The best feature items are the terms that have the highest mutual information with the relevant text set rel(Q). The mutual information between a word and the relevant text set is calculated as follows:

$$\log MI(w_i, rel(Q)) = \log \frac{P(w_i \mid w_i \in rel(Q))}{p(w_i)}$$

where $w_i$ is the i-th word in the text; $P(w_i \mid w_i \in rel(Q))$ is the proportion of the word $w_i$ in the relevant text set rel(Q); and $p(w_i)$ is the proportion of the word $w_i$ in all the text being processed.
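The mutual information score above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes both proportions are estimated as relative word frequencies, that each document is already a list of word tokens, and that `all_docs` includes the relevant documents. The function names `log_mutual_information` and `select_features` are hypothetical.

```python
import math
from collections import Counter


def log_mutual_information(relevant_docs, all_docs):
    """Return {word: log(P(w | rel(Q)) / p(w))} for words in relevant_docs.

    Assumption of this sketch: both probabilities are estimated as relative
    word frequencies, and all_docs contains every relevant document.
    """
    rel_counts = Counter(w for doc in relevant_docs for w in doc)
    all_counts = Counter(w for doc in all_docs for w in doc)
    rel_total = sum(rel_counts.values())
    all_total = sum(all_counts.values())
    scores = {}
    for word, count in rel_counts.items():
        p_rel = count / rel_total              # proportion in rel(Q)
        p_all = all_counts[word] / all_total   # proportion in all text
        scores[word] = math.log(p_rel / p_all)
    return scores


def select_features(relevant_docs, all_docs, k):
    """Keep the k words with the highest log mutual information."""
    scores = log_mutual_information(relevant_docs, all_docs)
    return sorted(scores, key=scores.get, reverse=True)[:k]


relevant = [["python", "code", "the"]]
corpus = relevant + [["the", "cat", "the"]]
print(select_features(relevant, corpus, k=2))
```

In this toy corpus the common word "the" receives a negative score because it is relatively more frequent in the whole collection than in the relevant set, so it is dropped from the feature set.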

4. Item presentation model

4.1. Vector space model

The traditional way to represent information resources and user interests is the vector space model (VSM), which is a text representation model. It takes the feature items (terms) of all texts as basic units, and together these form the term set. Each item can be expressed as a vector whose dimension is the number of terms in the term set. This dimension is generally not fixed, although a fixed size can also be specified. Because the frequency of a feature word in a document reflects the theme of the document to a certain extent, each component of the vector is the number of occurrences of the corresponding feature term in the document. Each resource in the collection can thus be expressed as a vector over the term set [16,17].
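As a small illustration of this representation, the following Python sketch builds a term set from a toy collection and maps each document to a vector of term counts. The helper names `build_term_set` and `to_vector` are hypothetical, and tokenization is assumed to have been done already.

```python
from collections import Counter


def build_term_set(documents):
    """Collect the distinct terms of all documents in a fixed (sorted) order."""
    return sorted({word for doc in documents for word in doc})


def to_vector(document, term_set):
    """Represent a document as term counts, one component per term in term_set."""
    counts = Counter(document)
    return [counts[term] for term in term_set]


docs = [["filter", "user", "filter"], ["user", "model"]]
terms = build_term_set(docs)           # ['filter', 'model', 'user']
vectors = [to_vector(d, terms) for d in docs]
print(terms, vectors)                  # vectors: [[2, 0, 1], [0, 1, 1]]
```

The dimension of every vector equals the size of the term set, so it grows with the collection unless the term set is capped at a fixed size.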

4.2. Probability model

The probability model first establishes a classification model for the domain and then calculates the probability distribution of every document and every user interest over the classes. Using a probability distribution to denote documents and users' interests better reflects the diversity of user interest and is easy to implement. The classification model is trained using the Bayesian method. User interests and documents are expressed in the same way.
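One way to picture this representation is a multinomial naive Bayes classifier that turns a document into a probability distribution over categories. The sketch below is an assumption about what "the Bayesian method" could look like, not the paper's classifier: the Laplace smoothing, the category labels, and the function names `train_naive_bayes` and `class_distribution` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents and words per category from (category, tokens) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, tokens in labeled_docs:
        class_docs[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def class_distribution(tokens, model):
    """Return P(category | tokens) for every category seen in training."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for category, n_docs in class_docs.items():
        total_words = sum(word_counts[category].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokens:
            if word in vocab:  # words unseen in training are ignored
                # Laplace-smoothed word likelihood (assumption of this sketch)
                likelihood = (word_counts[category][word] + 1) / (total_words + len(vocab))
                score += math.log(likelihood)
        log_scores[category] = score
    # Normalize in log space so the probabilities sum to 1.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


model = train_naive_bayes([("sports", ["ball", "goal"]), ("tech", ["code", "bug"])])
print(class_distribution(["ball", "ball"], model))
```

Because a user interest is expressed in the same way as a document, the same distribution could be computed for the text that describes the interest.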

4.3. Improved probabilistic model

The vector space model can express only the keywords of a user's interest; it cannot distinguish between different user interests. The probability model can distinguish between interests and so reflects the diversity of the user's interests, but it cannot express the user's degree of interest. The improved probability model therefore expresses both the keywords of user interest and the level of user interest.

5. User Interest Model

To calculate the match between a user's interest and candidate documents, we first need to define how the user's interest and the candidate documents are represented [18,19,20]. We use the classic VSM to represent a candidate document; that is, a candidate document D can be expressed as follows:

$$D = (w_1, w_2, \ldots, w_n)$$

where $w_i$ is the weight of the i-th feature term of document D. We select words as feature items and use the relative term frequency as the feature term weight. The relative term frequency of a word is calculated with the tf-idf formula as follows:

where tf is the number of times the word occurs and N is the total number of words in that document.
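A minimal sketch of such a weight, assuming it is the plain ratio of a word's occurrence count to the total number of words in the document (no inverse document frequency factor is applied here, since the text defines only those two quantities). The function name `relative_term_frequency` is hypothetical.

```python
from collections import Counter


def relative_term_frequency(tokens):
    """Map each word to (occurrences of the word) / (total words in the document)."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}


weights = relative_term_frequency(["user", "model", "user", "filter"])
print(weights)  # {'user': 0.5, 'model': 0.25, 'filter': 0.25}
```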

5.1. User Interest Model based on Interest Document Vector

In the simplest approach, the user U is expressed as the set of documents I that the user is interested in, namely:

The match between the user interest and a candidate document D can then be expressed as the sum of the matches between each interest document I and D, namely:

The match between an interest document I and the candidate document D can be calculated with the VSM similarity formula, namely:
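The matching scheme of this model can be sketched as follows. The sketch assumes that the VSM similarity is the cosine of the angle between two weight vectors and that vectors are stored sparsely as `{term: weight}` dictionaries; the function names `cosine` and `match_user` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_user(interest_docs, candidate):
    """Sum the similarity of every interest document I with candidate D."""
    return sum(cosine(doc, candidate) for doc in interest_docs)


interest_docs = [{"python": 1.0}, {"climbing": 1.0}]
print(match_user(interest_docs, {"python": 1.0}))  # 1.0
```

Every interest document is stored and compared separately, which is the storage and time overhead that the next model addresses.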

5.2. User Interest Model based on Interest Vector

Although the model of Section 5.1 can be used to match candidate documents against the user's interest, it incurs considerable overhead in storage space and matching time. The model in this section remedies these space and time deficiencies. Here the user's interest model U is also defined as a VSM vector, i.e.:

where the i-th component is the weight of the i-th feature term of the user interest U.

The words contained in the user interest model U are selected as feature items, and the relative term frequency is used as the feature term weight. We define the set K of words contained in U as follows:

Here I denotes an interest document included in the user interest model.

The relative word frequency of a word k is defined as follows:

Once the user interest vector is defined, the VSM similarity formula can be used to calculate the match between the user interest and a candidate document, as follows:

It can be seen that, for a user interest model containing n interest documents, this model reduces the storage space and matching time to 1/n of those of the model in Section 5.1 without reducing matching accuracy.
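A sketch of this single-vector model is given below. It rests on two assumptions that are not confirmed by the text: the relative word frequency of a word is its pooled count over all interest documents divided by the pooled word total, and the VSM similarity is the cosine measure. The names `build_interest_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def build_interest_vector(interest_docs):
    """Merge tokenized interest documents into one {word: relative frequency} vector."""
    pooled = Counter()
    for tokens in interest_docs:
        pooled.update(tokens)
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in pooled.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


user = build_interest_vector([["python", "code"], ["python", "test"]])
print(user)                            # {'python': 0.5, 'code': 0.25, 'test': 0.25}
print(cosine(user, {"python": 1.0}))   # one comparison instead of one per document
```

Only one vector is stored per user and only one similarity is computed per candidate document, regardless of how many interest documents were merged.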

5.3. User Interest Model based on Multi-Interest Vector

A user's query Q often reflects the interest that the user is currently concerned with. To take this into account, we extend the model so that a user maintains multiple interests. When matching a candidate document, we first match the user's query Q against each of the user's interests V. Only when this match is greater than a threshold L do we consider V to be an interest that the current query is concerned with. In that case, the match between V and the candidate document D, weighted by the match between V and the query Q, is added to the match between the user interest U and D. The matching algorithm used in the model is as follows [21,22,23]:

With this matching algorithm, the model can identify the interests that the user is currently concerned with and perform the matching calculation according to them, so that the results returned to the user reflect the user's current interests.
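The procedure described in this section can be sketched as follows, assuming cosine similarity as the underlying match and sparse `{term: weight}` vectors. The function name `multi_interest_match` and the sample threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_interest_match(interests, query, candidate, threshold):
    """Match candidate D against only the interests V relevant to query Q.

    An interest contributes match(V, D) weighted by match(V, Q), and only
    when match(V, Q) exceeds the threshold L.
    """
    score = 0.0
    for interest in interests:
        relevance = cosine(interest, query)
        if relevance > threshold:
            score += relevance * cosine(interest, candidate)
    return score


interests = [{"python": 1.0}, {"climbing": 1.0}]
query = {"python": 1.0}
candidate = {"python": 1.0, "code": 1.0}
print(multi_interest_match(interests, query, candidate, threshold=0.5))
```

In this example the climbing interest does not match the query, so it contributes nothing to the score even if the candidate document were about climbing.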

5.4. User Interest Model based on Role

In real life, each user belongs to one or several roles. For example, if Joe Smith works as a programmer and his hobby is mountain climbing, then Joe Smith's roles are programmer and climber. If a user has a certain interest, the roles that the user belongs to also include that interest. Moreover, the interests included in a role are more accurate than those included in the individual user's model, because a user sometimes cannot express his or her interests accurately; the roles that the user belongs to can then effectively correct the user's interest.

Given a user U and the roles R1, R2, ..., Rn that the user belongs to, the degree of match between the candidate document D and the user interest is calculated as follows:

where P1 is the model's matching formula, and α and β are weight coefficients. A multi-layer role model can also be defined on this basis, but in practical applications a single layer of roles is able to identify the user interest effectively.
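A possible reading of the role-based match is sketched below. The combination rule is an assumption: the user's own match is weighted by α and the sum of the role matches by β, with cosine similarity standing in for P1. The function name `role_based_match` and the default weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def role_based_match(user, roles, candidate, alpha=0.7, beta=0.3, p1=cosine):
    """Combine the user's own match with the matches of the user's roles.

    Assumed form: alpha * P1(U, D) + beta * sum of P1(Ri, D) over all roles.
    """
    role_score = sum(p1(role, candidate) for role in roles)
    return alpha * p1(user, candidate) + beta * role_score


user = {"python": 1.0}
roles = [{"python": 1.0}, {"climbing": 1.0}]   # e.g. programmer, climber
print(role_based_match(user, roles, {"python": 1.0}))
```

Passing `p1` as a parameter keeps the sketch independent of which of the earlier matching formulas is used.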

6. Matching Algorithm

A similarity measure is a metric of relevance between two vectors, and a set of such measures is presented in this section. A similarity measure can be used to balance the significance of ratings in a prediction algorithm and therefore to improve accuracy [24,25,26].

There are several similarity algorithms that have been used in the recommendation algorithm: Pearson correlation, cosine vector similarity, adjusted cosine vector similarity, mean-squared difference and Spearman correlation.

Pearson’s correlation measures the linear correlation between two vectors of ratings.

The cosine measure looks at the angle between two vectors of ratings where a smaller angle is regarded as implying greater similarity.

The adjusted cosine is used in some filtering methods for similarity among users where the difference in each user’s use of the rating scale is taken into account.
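Two of the measures listed above, Pearson correlation and cosine similarity, can be written directly from their standard definitions. The sketch below works on plain Python lists of equal length; the function names are hypothetical, and returning 0.0 for a zero-variance or zero-length vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Linear correlation between two equal-length rating vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```

Pearson correlation subtracts each vector's mean before comparing, so two users who rate on different parts of the scale can still be judged similar, whereas plain cosine compares the raw vectors.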

7. Feedback and Update of User Interest Model

Once the user interest model has been built, it can be updated either actively by the user or dynamically by tracking the user's behavior. In the latter case, different user actions produce different updates [27,28,29]. User actions include adding a bookmark, downloading a document, viewing a summary, ignoring a document, and deleting a bookmark or document. These actions reflect different degrees of user interest and therefore carry different meanings [30,31,32,33]; see Table 1.

7.1. Short-term interest

The user's short-term interest is shown in Figure 2.

The short-term interest Ps has two parts, P(s1) and P(s2). P(s1) is the part of the user interest obtained from the search records accumulated since the first day; P(s2) is the latest part of the user interest, obtained from the current search. Ps is defined as:

where x + y = 1.

7.2. Long-term interest

The user's long-term interest is shown in Figure 3.

Figure 3. User long interest

The long-term interest Pl is defined as:

7.3. Updating user interest

Then we can update the user interest model.

where a + b = 1 and x + y = 1.
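The two weight constraints suggest convex combinations. The sketch below assumes that the short-term interest is x·P(s1) + y·P(s2) and that the updated model is a·Ps + b·Pl, with interests stored as sparse `{term: weight}` vectors. These forms, the function names, and the sample weights are assumptions of the sketch; the long-term interest Pl is taken as given.

```python
def weighted_sum(u, v, weight_u, weight_v):
    """Return weight_u * u + weight_v * v for sparse {term: weight} vectors."""
    terms = set(u) | set(v)
    return {t: weight_u * u.get(t, 0.0) + weight_v * v.get(t, 0.0) for t in terms}


def check_weights(first, second):
    """Enforce the constraint that the two weights sum to 1."""
    if abs(first + second - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")


def short_term_interest(ps1, ps2, x, y):
    """Assumed form: Ps = x * P(s1) + y * P(s2), with x + y = 1."""
    check_weights(x, y)
    return weighted_sum(ps1, ps2, x, y)


def update_interest(ps, pl, a, b):
    """Assumed form: P = a * Ps + b * Pl, with a + b = 1."""
    check_weights(a, b)
    return weighted_sum(ps, pl, a, b)


ps = short_term_interest({"python": 1.0}, {"climbing": 1.0}, x=0.4, y=0.6)
model = update_interest(ps, {"python": 1.0}, a=0.5, b=0.5)
print(ps, model)
```

A larger y makes the model react faster to the current search, while a larger b keeps it closer to the user's long-term interest.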

8. Conclusions

The application of personalized recommendation on the Internet has effectively improved e-commerce services [34,35,36,37]. In this paper, we presented a personalized recommendation system that employs a user interest model for content-based filtering. We analyzed five components of the system: document information extraction, document vector representation, user interest model representation, matching algorithms, and user feedback update. This personalized recommendation system can describe a user’s interest type and interest degree well, and can enhance the efficiency of personalized information services.

9. Acknowledgment

This work was supported by the Scientific Research Fund of Zhejiang Provincial Education Department (Grant No. Y201121981).

10. References

[1] M. Staring, S. Klein, J. P. Pluim, “Nonrigid registration with adaptive, content-based filtering of the deformation field”, In Proceedings of SPIE Medical Imaging, pp. 212-221, 2005.
[2] Shoval, P., Maidel, V., and Shapira, B., “An ontology content based filtering method”, International Journal of Information Theories and Applications, pp. 51-63, 2008.
[3] Gunduz S., Ozsu M.T., “Recommendation Models for User Accesses to Web Pages”, In Proceedings of ICANN, pp. 1003-1010, 2003.
[4] Polcicová, G., and Návrat, P., “Semantic similarity in content-based filtering”, In Proceedings of Advances in Databases and Information Systems, pp. 80-85, 2002.
[5] Gündüz, S., “Recommendation models for Web users: User interest model and click-stream tree”, PhD Thesis, Institute of Science and Technology, Istanbul Technical University, Turkey, 2003.
[6] HE Weihong, CAO Yi, “An E-commerce recommender system based on content-based filtering”, Wuhan University Journal of Natural Sciences, Vol. 11, No. 5, pp. 1091-1096, 2006.
[7] Amato G., Straccia U., “User Profile Modeling and Applications to Digital Libraries”, In Proceedings of the 3rd European conference on research and advanced technology for digital libraries, pp. 184-197, 1999.
[8] G. Luo, P. S. Yu, “Content-based Filtering for Efficient Online Materialized View Maintenance”, In Proceedings of CIKM, pp. 163-172, 2008.
[9] G.Y. SU, J.H. LI, Y.H. MA, S.H. LI, “Improving the precision of the keyword-matching pornographic text filtering method using a hybrid model”, Journal of Zhejiang University SCIENCE, Vol. 5, No. 9, pp. 1106-1113, 2004.
[10] R.V. Meteren, M.V. Someren, “Using content-based filtering for recommendation”, In Proceedings of ECML Workshop: Machine Learning in New Information Age, pp. 47-56, 2000.
[11] Gao Fengrong, Xing Chunxiao, Du Xiaoyong, Wang Shan, “Personalized Service System Based on Hybrid Filtering for Digital Library”, Tsinghua Science and Technology, Vol. 12, No. 1, pp. 1-8, 2007.
[12] Pasi, G., Bordogna, G., Villa, R., “A multi-criteria content-based filtering system”, In Proceedings of the 30th Annual international ACM SIGIR Conference on Research and Development in information Retrieval, pp. 775-776, 2007.
[13] Golemati M., Katifori A., Vassilakis C., Lepouras G., Halatsis C., “Creating an Ontology for the User Profile: Method and Applications”, In Proceedings of the First RCIS Conference, pp. 407-412, 2007.
[14] Hurwitz, J. B., “Empirical Evaluation of Content-Based Filtering for Personalization”, In Proceedings of 20th International Symposium on Human Factors in Telecommunication, pp. 1-8, 2006.
[15] Bollacker KD, Lawrence S, Giles CL, “Discovering relevant scientific literature on the Web”, IEEE Intelligent Systems, Vol. 15, No. 2, pp. 42-47, 2000.
[16] Buckley C, Salton G, Allan J, Singhal A, “Automatic query expansion using SMART”, In Proceedings of the 3rd Text Retrieval Conference (TREC-3), pp. 69-80, 2005.
[17] Champa Jayawardana, K., Priyantha Hewagamage., Massahito Hirakawa., “A Personalized Information Environment for Digital Libraries”, Information Technology and Libraries, Vol. 20, No. 4, pp. 185-195, 2001.
[18] Dumais ST, Platt J, Heckerman D, Sahami M, “Inductive learning algorithms and representations for text categorization”, In Proceedings of the International Conference on Information and Knowledge Management, pp. 148-155, 1998.
[19] G Salton, M J McGill, “Introduction to Modern Information Filtering”, Massachusetts Inst of Technology, 1994.
[20] Hofmann T, “Probabilistic latent semantic analysis”, In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 289-296, 1999.
[21] Joachims T, “A probabilistic analysis of the rocchio algorithm with TFIDF for text categorization”, In Proceedings of the 14th International Conference on Machine Learning, pp. 143-151, 1997.
[22] Lawrence, R. D., Almasi, G. S., Kotlyar, V., Viveros, M. S., Duri, S. S., “Personalization of supermarket product recommendations”, Data Mining and Knowledge Discovery, Vol. 5, No. 1-2, pp. 11-32, 2001.
[23] Belkin, N. J., Croft, W. B., “Information filtering and information retrieval: two sides of the same coin?”, Communications of the ACM, Vol. 35, No. 12, pp. 29-38, 1992.
[24] Pretschner, A., Gauch, S., “Ontology-Based Personalized Search”, In Proceedings of ICTAI, pp. 391-398, 1999.
[25] Ricardo Baeza Yates, Berthier Ribeiro Neto, “Modern Information Retrieval”, Addison Wesley, 1999.
[26] Robertson S, Hull D., “The TREC-9 filtering track final report”, In Proceedings of the 9th Text Retrieval Conference (TREC-9), pp. 25-40, 2001.
[27] Seo Y, Zhang B, “Learning user's preferences by analyzing Web-browsing behaviors”, Artificial Intelligence, Vol. 15, No. 6, pp. 381-387, 2001.
[28] Sugiyama K, Hatano K, Yoshikawa M, “Adaptive web search based on user profile constructed without any effort from users”, In Proceedings of WWW2004, pp. 675-684, 2004.
[29] Sugiyama K, “Studies on Improving Retrieval Accuracy in Web Information Retrieval”, Tokyo: Nara Institute of Science and Technology, 2004.
[30] Turney PD, “Learning algorithms for key phrase extraction”, Information Retrieval, Vol. 2, No. 4, pp. 303-336, 2000.
[31] Hanani, U., Shapira, B., Shoval, P., “Information Filtering: Overview of Issues, Research and Systems”, User Modeling and User-Adapted Interaction, Vol. 11, No. 3, pp. 203-259, 2001.
[32] Witten IH, Paynter GW, Frank E, Gutwin C, Nevill Manning, “KEA: practical automatic key phrase extraction”, In Proceedings of 4th ACM Conference on Digital Library, pp. 254-255, 1999.
[33] Zeng C, Xing CX, Zhou LZ, “A survey of personalization technology”, Journal of Software, Vol. 13, No. 10, pp. 1952-1961, 2002.
[34] Elena Vlahu-Gjorgievska, Vladimir Trajkovik, “Personal Healthcare System Model using Collaborative Filtering Techniques”, AISS: Advances in Information Sciences and Service Sciences, Vol. 3, No. 3, pp. 64-74, 2011.
[35] Heng-Li Yang, Hsiao-Fang Yang, “Recommendation Mechanism Based on Multi-attribute Utility Theory”, JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 5, No. 3, pp. 373-382, 2011.
[36] Hochul Jeon, Taehwan Kim, Joongmin Choi, “Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering”, AISS: Advances in Information Sciences and Service Sciences, Vol. 2, No. 4, pp. 134-142, 2010.
[37] Zhimin Chen, Yi Jiang, Yao Zhao, “A Collaborative Filtering Recommendation Algorithm Based on User Interest Change and Trust Evaluation”, JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 4, No. 9, pp. 106-113, 2010.