Googling the internet: profiling internet endpoints via the world wide web

Published: 01 April 2010

Abstract

Understanding Internet access trends at a global scale, i.e., how people use the Internet, is a challenging problem that is typically addressed by analyzing network traces. However, obtaining such traces presents its own set of challenges, owing either to privacy concerns or to other operational difficulties. The key hypothesis of our work is that most of the information needed to profile Internet endpoints is already available around us: on the Web. In this paper, we introduce a novel approach for profiling and classifying endpoints. We implement and deploy a Google-based profiling tool that accurately characterizes endpoint behavior by collecting and strategically combining information freely available on the Web. Our Web-based "unconstrained endpoint profiling" (UEP) approach shows advances in the following scenarios: 1) even when no packet traces are available, it can accurately infer application and protocol usage trends at arbitrary networks; 2) when network traces are available, it outperforms state-of-the-art classification tools such as BLINC; 3) when sampled flow-level traces are available, it retains high classification capabilities. We explore other complementary UEP approaches, such as p2p- and reverse-DNS-lookup-based schemes, and show that they can further improve the results of the Web-based UEP. Using this approach, we perform unconstrained endpoint profiling at a global scale for clients in four different world regions (Asia, South and North America, and Europe). We provide the first-of-its-kind endpoint analysis that reveals fascinating similarities and differences among these regions.

Published In

IEEE/ACM Transactions on Networking, Volume 18, Issue 2
April 2010
339 pages

Publisher

IEEE Press

Publication History

Published: 01 April 2010
Revised: 06 January 2009
Received: 23 June 2008
Published in TON Volume 18, Issue 2

Author Tags

  1. clustering
  2. endpoint profiling
  3. google
  4. traffic classification
  5. traffic locality

Qualifiers

  • Article
