We have often heard that data is the new oil. In particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. Several approaches have been proposed to extract knowledge from textual documents by extracting named entities, classifying them according to pre-defined taxonomies, and disambiguating them through URIs that identify real-world entities. As a step towards interconnecting the Web of documents via those entities, different extractors have been proposed. Although they share the same main purpose (extracting named entities), they differ in numerous aspects, such as their underlying dictionary or their ability to disambiguate entities. NERD proposes a web framework that unifies numerous named entity extractors using the NERD ontology, which provides a rich set of axioms aligning the taxonomies of these tools.
Extractors supported
NERD API documentation
POST http://nerd.eurecom.fr/api/document
Request parameters
key: A UTF-8 string, the NERD API key.
text: A UTF-8 string, the text file which will be processed to extract entities. Although the field is optional, it is required when neither timedtext nor uri is declared.
timedtext: A UTF-8 string, the SRT file which will be processed to extract entities. Although the field is optional, it is required when neither text nor uri is declared.
uri: A UTF-8 string, the URI of the article. Although the field is optional, it is required when neither timedtext nor text is declared.
Response parameters
idDocument: A UTF-8 encoded value, the document identifier.
Example
curl -i -X POST http://nerd.eurecom.fr/api/document -d "uri=http://www.bbc.co.uk/news/world-us-canada-19644448&key=YOUR_API_KEY"
{
"idDocument":164
}
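The same request can be prepared from Python. The following is a minimal sketch using only the standard library, not an official client: the helper name `build_document_request` is hypothetical, and the form fields `key` and `uri` are taken from the curl example above. It builds the request without sending it, so the network call is left to the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request

DOCUMENT_ENDPOINT = "http://nerd.eurecom.fr/api/document"


def build_document_request(api_key, uri):
    """Prepare (but do not send) a POST that submits an article by its URI.

    Hypothetical helper: it sets only the `uri` and `key` form fields,
    mirroring the curl example.
    """
    body = urlencode({"uri": uri, "key": api_key}).encode("utf-8")
    return Request(DOCUMENT_ENDPOINT, data=body, method="POST")


request = build_document_request(
    "YOUR_API_KEY",
    "http://www.bbc.co.uk/news/world-us-canada-19644448",
)
print(request.get_method(), request.full_url)
```

To actually send it, the request could be passed to `urllib.request.urlopen` and the JSON body parsed to read `idDocument`, assuming the service answers as in the example response.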
POST http://nerd.eurecom.fr/api/annotation
Request parameters
key: A UTF-8 string, the NERD API key.
idDocument: A UTF-8 string, the document identifier.
extractor: A UTF-8 string, the name of an extractor. The accepted values are: {combined, alchemyapi, dandelionapi, dbspotlight, lupedia, opencalais, saplo, semitags, textrazor, thd, wikimeta, yahoo, zemanta}.
ontology: A UTF-8 string. The accepted values are: core, extended. The default value is core.
timeout: A UTF-8 string, the maximum time, in seconds, allowed for performing the annotation.
Response parameters
idAnnotation: A UTF-8 encoded value, the annotation identifier.
Example
curl -i -X POST http://nerd.eurecom.fr/api/annotation -d "key=YOUR_API_KEY&idDocument=164&extractor=alchemyapi&ontology=core&timeout=10"
{
"idAnnotation":427
}
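Because the extractor and ontology fields accept only a fixed set of values, a client may want to check them before calling the service. The sketch below is one way to do that in Python, under stated assumptions: `build_annotation_request` is a hypothetical helper, the field names come from the curl example above, and the accepted values are copied from the request parameters. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ANNOTATION_ENDPOINT = "http://nerd.eurecom.fr/api/annotation"

# Accepted values, copied from the request parameters described above.
EXTRACTORS = {
    "combined", "alchemyapi", "dandelionapi", "dbspotlight", "lupedia",
    "opencalais", "saplo", "semitags", "textrazor", "thd", "wikimeta",
    "yahoo", "zemanta",
}
ONTOLOGIES = {"core", "extended"}


def build_annotation_request(api_key, id_document, extractor, timeout, ontology="core"):
    """Prepare (but do not send) a POST asking one extractor to annotate a document.

    Hypothetical helper: it rejects values outside the documented sets
    before any network traffic happens.
    """
    if extractor not in EXTRACTORS:
        raise ValueError(f"unknown extractor: {extractor!r}")
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    params = {
        "key": api_key,
        "idDocument": id_document,
        "extractor": extractor,
        "ontology": ontology,
        "timeout": timeout,
    }
    return Request(ANNOTATION_ENDPOINT, data=urlencode(params).encode("utf-8"), method="POST")


annotation_request = build_annotation_request("YOUR_API_KEY", 164, "alchemyapi", timeout=10)
print(annotation_request.data.decode("utf-8"))
```

Validating on the client side keeps typos such as a misspelled extractor name from turning into a round trip to the server.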
GET http://nerd.eurecom.fr/api/entity
Request parameters
key: A UTF-8 string, the NERD API key.
idAnnotation: A UTF-8 string, the annotation identifier.
A UTF-8 string. Accepted values: oen | oed. With oen (One Entity per Name), all the entities found in the document are returned. With oed (One Entity per Document), duplicates are removed (a duplicate occurs when two or more entities have the same NE, type, and URI) and only one occurrence is returned.
Response parameters
An array of entity objects. The extractor field takes one of the following values: alchemyapi, dandelionapi, dbspotlight, opencalais, lupedia, saplo, semitags, wikimeta, yahoo, zemanta (names of the services supported) or combined. For further details, see the example below.
Example
curl -i -X GET -H "Accept: application/json" "http://nerd.eurecom.fr/api/entity?key=YOUR_API_KEY&idAnnotation=427"
[
{
idEntity: 120,
label: "BBC",
startChar: 138,
endChar: 141,
extractorType: "Company",
nerdType: "http://nerd.eurecom.fr/ontology#Organization",
uri: "http://dbpedia.org/resource/BBC",
confidence: 0.0582796,
relevance: 0.5,
extractor: "dbspotlight"
},
...
]
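To make the difference between oen and oed concrete, the sketch below builds the entity URL from the curl example and applies an oed-style filter on the client side. It is an illustration only, not the service's implementation: the helper names are hypothetical, and it assumes that "NE", "type" and "URI" in the duplicate rule correspond to the `label`, `nerdType` and `uri` fields of an entity object. The second entity in the sample list is invented to show a repeated mention.

```python
from urllib.parse import urlencode

ENTITY_ENDPOINT = "http://nerd.eurecom.fr/api/entity"


def build_entity_url(api_key, id_annotation):
    """Build the GET URL used in the curl example (hypothetical helper)."""
    return ENTITY_ENDPOINT + "?" + urlencode({"key": api_key, "idAnnotation": id_annotation})


def one_entity_per_document(entities):
    """Keep the first entity for each (label, nerdType, uri) triple.

    Assumption: these three fields stand for the NE, type and URI
    mentioned in the duplicate rule.
    """
    seen = set()
    unique = []
    for entity in entities:
        signature = (entity.get("label"), entity.get("nerdType"), entity.get("uri"))
        if signature not in seen:
            seen.add(signature)
            unique.append(entity)
    return unique


# First item follows the example response; the second is made up for illustration.
entities = [
    {"idEntity": 120, "label": "BBC", "startChar": 138, "endChar": 141,
     "nerdType": "http://nerd.eurecom.fr/ontology#Organization",
     "uri": "http://dbpedia.org/resource/BBC", "extractor": "dbspotlight"},
    {"idEntity": 121, "label": "BBC", "startChar": 305, "endChar": 308,
     "nerdType": "http://nerd.eurecom.fr/ontology#Organization",
     "uri": "http://dbpedia.org/resource/BBC", "extractor": "dbspotlight"},
]

entity_url = build_entity_url("YOUR_API_KEY", 427)
unique_entities = one_entity_per_document(entities)
print(entity_url)
print([entity["idEntity"] for entity in unique_entities])
```

Keeping the first occurrence is an arbitrary choice in this sketch; the documentation only states that a single occurrence is returned.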
NERD API libraries
- nerd4java - Java client
- source code
- documentation: refer to the README.md
- nerd4python - Python client
- source code
- documentation: refer to the README.md
- nerd4node - Node.js client
- source code
- documentation: refer to the README.md
- nerdier - Ruby client