Journal of Computers, Vol 7, No 7 (2012), 1545-1554, Jul 2012
doi:10.4304/jcp.7.7.1545-1554

A Fast and Efficient Algorithm for Finding Frequent Items over Data Stream

Ling Chen, Yixin Chen, Li Tu

Abstract


We investigate the problem of finding the frequent items in a continuous data stream. We present an algorithm called λ-Count for computing frequency counts over a user specified threshold on a data stream. To emphasize the importance of the more recent data items, a fading factor l is used. Our algorithm can detect ε-approximate frequent items of a data stream using O(logλε) memory space and O(1) time to process each data record. The computation time for answering each query is O( ), and for answering a query about the frequentness of a given data item is O(1). Experimental study shows that λ-Count outperforms other methods in terms of accuracy, memory requirement, and processing speed.

 



Keywords


Data mining; data stream; frequent items

References


 

[1] G. S. Manku and R. Motwani, Approximate frequency counts over data streams. In Proc. of 28th Intl. Conf on Very Large Data Bases, pages, 2002, pp.346-357

[2] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems, vol.28,2003,pp.51-55
http://dx.doi.org/10.1145/762471.762473

[3] E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proceeding of the 10th Annual European Symposium on Algorithms, 2002,pp.348-360

[4] J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, vol.2, 1982,pp.143-152
http://dx.doi.org/10.1016/0167-6423(82)90012-0

[5] P. Flajolet, G.N.Martin, Probabilistic counting algorithms for data base applications, Journal of Computer and System Sciences, vol. 31,1985, pp. 182-209
http://dx.doi.org/10.1016/0022-0000(85)90041-8

[6] K. Y. Whang, B. T. Vander-Zanden, H.M. Taylor, A linear-time probabilistic counting algorithm for a database applications, ACM Trans. Database Systems, vol.15, 1990, pp. 208-229
http://dx.doi.org/10.1145/78922.78925

[7] L. Golab, D. DeHaan, A. Lopez-Ortiz, and E. D. Demaine. Finding frequent items in sliding windows with multinomially-distributed item frequencies. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004, pp. 425-426
http://dx.doi.org/10.1109/SSDM.2004.1311242

[8] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answer, Proc. SIGMOD, 1998, pp.331-341

[9] Hongyan Liu, Ying Liu, Jiawei Han, Jun He, Error-adaptive and time-aware maintenance of frequency counts over data streams, Proceeding of WAIN 2006, INCS 4016, pp. 484-495

[10] C. Estan and G. Varghese. New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice. ACM Transactions on Computer System, vol. 21, 2003,pp. 270-313
http://dx.doi.org/10.1145/859716.859719

[11] Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Widmayer P, Ruiz FT, Bueno RM, Hennessy M, Eidenbenz S, Conejo R, eds. Proc. of the Int’l Colloquium on Automata, Languages and Programming. Malaga: Springer-Verlag, 2002. 693-703.

[12] Cormode G, Muthukrishnan S. What’s hot and what’s not: Tracking most frequent items dynamically. In: Halevy AY, Ives ZG, Doan AH, eds. Proc. of the 22nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems. San Diego: ACM Press, 2003. 296-306.

[13] Jin C, Qian W, Sha C, Yu JX, Zhou A. Dynamically maintaining frequent items over a data stream. In: Carbonell J, ed. Proc. of the 2003 ACM CIKM Int’l Conf. on Information and Knowledge Management. New Orleans: ACM Press, 2003, pp. 287-294.

[14] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proceedings of 24th International Conference on Very Large Data Bases (VLDB 1998), New York, USA, Aug. 1998.

[15] Toon Calders, Nele Dexers, Bart Goethals, Mining frequent itemsets in a stream, ACM

[16] Bill Lin, Wai-Shing Ho, Ben Kao, Chun-Kit Chui, Adaptive frequency counting over bursty data streams, Proceedings of the 2007 IEEE Symposium on Computational Intelligence and data mining, 2007, pp. 516-523
http://dx.doi.org/10.1109/CIDM.2007.368918

[17] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. SIGMOD, 2001.

[18] Wei-Ping WANG, Jian-Zhong LI, Dong-Dong ZHANG, Long-Jiang GUO, An efficient algorithm for mining approximate frequent item over data streams, Journal of Software, vol.18, no.4, 2007, pp.884-892.
http://dx.doi.org/10.1360/jos180884

[19] A. Arasu and G. Manku. Approximate counts and quantiles over sliding windows. In Proceedings of the 23rd ACM Symposium on Principles of Database Systems, 2004, pp. 286-296

[20] L. Golab, D. DeHaan, E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Identifying frequent items in sliding windows over on-line packet streams. In Proceedings of the Internet Measurement Conference, 2003, pp.173-178
http://dx.doi.org/10.1145/948205.948227

[21] L. Lee and H. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2006), Chicago, USA, June 2006.
http://dx.doi.org/10.1145/1142351.1142393

[22] Linfeng Zhang, Yong Guan, Frequency estimation over sliding windows, Proceedings of SIGKDD 2007,

[23] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. Proc. PODS, 2002.

[24] S. Muthukrishnan. Data streams: Algorithms and applications. Technical Report, Rutgers University, Piscataway, USA, 2003.

[25] Hüseyin Akcan, Alex Astashyn, Hervé BrÅ‘nnimann,Deterministic algorithms for sampling count data, Data & Knowledge Engineering,vol. 64, 2008, pp.405–418
http://dx.doi.org/10.1016/j.datak.2007.07.011

[26] Nishad Manerikar, Themis Palpanas, Frequent items in streaming data: An experimental evaluation of the state-of-the-art, Data & Knowledge Engineering, Volume 68, Issue 4, April 2009, pp.415-430
http://dx.doi.org/10.1016/j.datak.2008.11.001

[27] Bibudh Lahiri, Srikanta Tirthapura Identifying frequent items in a network using gossip, Journal of Parallel and Distributed Computing, Volume 70, Issue 12, December 2010, pp.1241-1253.
http://dx.doi.org/10.1016/j.jpdc.2010.07.006

[28] 4Nuno Homem, Joao Paulo Carvalho, Finding top-k elements in data streams, Information Sciences, Volume 180, Issue 24, 15 December 2010, pp. 4958-4974.

[29] Nuno Homem and Joao Paulo Carvalho, Finding top-k elements in a time-sliding window, Evolving Systems, 2011, Volume 2, Number 1, Pages 51-70.
http://dx.doi.org/10.1007/s12530-010-9020-z

[30] Regant Y. S. Hung and Hingfung F. Ting, An Ω(1/εlog(1/ε)) Space Lower Bound for Finding ε-Approximate Quantiles in a Data Stream, Lecture Notes in Computer Science, 2010, Volume 6213, Frontiers in Algorithmics, 2010, pp.89-100.

[31] C. Busch and S. Tirthapura. A deterministic algorithm for summarizing asynchronous streams over sliding windows. Proceedings of the 24th International Symposium on Theoretical Aspects of Computer Science (STACS 2007), Aachen, Germany, Feb. 2007.

[32] Zhang, Shan; Chen, Ling; Tu, Li, Frequent Items Mining on Data Stream Based on Time Fading Factor, 2009. AICI '09. International Conference on Artificial Intelligence and Computational Intelligence, Volume: 4, 2009, pp. 336 - 340.
http://dx.doi.org/10.1109/AICI.2009.369

[33] http://ita.ee.lbl.gov/html/contrib/WorldCup.html.2004

[34] Hongyan Liu, Ying Lu, Jiawei Han, et al. Error-Adaptive and Time-Aware Maintenance of Frequency Counts over Data Streams. Advances in Web-Age Information Management (WAIM 2006). 2006. pp. 484-495.

[35] A. Manjhi et al., Finding (recently) frequent items in distributed data streams, CMU-CS-04-121, Technical report.

[36] G. Cormode and M. Hadjieleftheriou, Finding frequent items in data streams, VLDB'08, 2009, pp. 1530-1541

[37] H. Liu, Y. Liu, J. Han, Methods for mining frequent items in data streams: an overview. Knowledge Information System, vol. 26, 2011. pp.1-30
http://dx.doi.org/10.1007/s10115-009-0267-2


Full Text: PDF


Journal of Computers (JCP, ISSN 1796-203X)

Copyright @ 2006-2013 by ACADEMY PUBLISHER – All rights reserved.