SmartPub: A Platform for Long-Tail Entity Extraction from Scientific Publications

Authors:
Sepideh Mesbah, Delft University of Technology, Delft, Netherlands
Alessandro Bozzon, Delft University of Technology, Delft, Netherlands
Christoph Lofi, Delft University of Technology, Delft, Netherlands
Geert-Jan Houben, Delft University of Technology, Delft, Netherlands

WWW '18: Companion Proceedings of the The Web Conference 2018, April 2018, Pages 191–194
https://doi.org/10.1145/3184558.3186976

Published: 23 April 2018

ABSTRACT

This demo presents SmartPub, a novel web-based platform that supports the exploration and visualization of shallow meta-data (e.g., author list, keywords) and deep meta-data (long-tail named entities, which are rare and often relevant only in a specific knowledge domain) from scientific publications. The platform collects documents from different sources (e.g., DBLP and arXiv) and extracts domain-specific named entities from the text of the publications using Named Entity Recognizers (NERs), which we can train with minimal human supervision even for rare entity types. The platform further enables interaction with the crowd for filtering or training data generation, and provides extended visualization and exploration capabilities. SmartPub will be demonstrated using a sample collection of scientific publications from the computer science domain and will address the entity types Dataset (i.e., a dataset presented or used in a publication) and Methods (i.e., algorithms used to create, enrich, or analyse a data set).
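
To make the idea of generating NER training data with little human supervision more concrete, the following is a minimal sketch of seed-based labelling in Python. It is not SmartPub's implementation: the seed terms, the `SEEDS` dictionary, and the `tokenize` and `bio_tag` helpers are all hypothetical names introduced here for illustration. The sketch assumes a small hand-written list of known entity names per type and marks their occurrences in a sentence with BIO tags, the kind of output that could then be used to train a recognizer.

```python
import re

# Hypothetical seed terms per entity type (illustrative, not from SmartPub).
SEEDS = {
    "DATASET": ["ImageNet", "Penn Treebank"],
    "METHOD": ["random forest"],
}


def tokenize(text):
    """Split text into words (hyphens kept inside words) and punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def bio_tag(sentence, seeds=SEEDS):
    """Label each token with a BIO tag by case-insensitive matching of seed terms."""
    tokens = tokenize(sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    for label, terms in seeds.items():
        for term in terms:
            term_tokens = [t.lower() for t in tokenize(term)]
            n = len(term_tokens)
            for i in range(len(tokens) - n + 1):
                window_free = all(tag == "O" for tag in tags[i:i + n])
                if lowered[i:i + n] == term_tokens and window_free:
                    tags[i] = "B-" + label          # first token of the mention
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + label      # remaining tokens of the mention
    return list(zip(tokens, tags))


if __name__ == "__main__":
    for token, tag in bio_tag("We train a random forest on ImageNet."):
        print(f"{token}\t{tag}")
```

Exact string matching of this kind only finds entities already in the seed list, so in practice it would serve as a starting point for a learned recognizer that generalizes to unseen names, and its output would need filtering for false matches.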

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
      2. Document structure
  2. Information systems applications
    1. Digital libraries and archives

Published in
WWW '18: Companion Proceedings of the The Web Conference 2018
April 2018, 2023 pages
ISBN: 9781450356404
General Chairs: Pierre-Antoine Champin (Université Claude Bernard Lyon 1, France), Fabien Gandon (Inria, Université Côte d'Azur, CNRS, I3S, France), Lionel Médini (Université Claude Bernard Lyon 1, CNRS, LIRIS, France)
Program Chairs: Mounia Lalmas (Spotify, UK), Panagiotis G. Ipeirotis (New York University, USA)
Copyright © 2018 ACM
Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland
Author Tags
information extraction
document metadata
long-tail entity types
named entity recognition
training data generation
Qualifiers
- demonstration
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%