Knowledge_discovery

Knowledge extraction

Creation of knowledge from structured and unstructured sources

Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must be represented in a manner that facilitates inferencing. Although it is methodologically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.

The RDB2RDF W3C group ^[1] standardized R2RML, a language for extracting Resource Description Framework (RDF) data from relational databases. Another popular example of knowledge extraction is the transformation of Wikipedia into structured data and its mapping to existing knowledge (see DBpedia and Freebase).

Examples

Extraction from structured sources to RDF

Survey of methods / tools

Name	Data Source	Data Exposition	Data Synchronisation	Mapping Language	Vocabulary Reuse	Mapping Automation	Requires Domain Ontology	Uses GUI
A Direct Mapping of Relational Data to RDF	Relational Data	SPARQL/ETL	dynamic	—	false	automatic	false	false
CSV2RDF4LOD	CSV	ETL	static	RDF	true	manual	false	false
CoNLL-RDF	TSV, CoNLL	SPARQL/ RDF stream	static	none	true	automatic (domain-specific, for use cases in language technology, preserves relations between rows)	false	false
Convert2RDF	Delimited text file	ETL	static	RDF/DAML	true	manual	false	true
D2R Server	RDB	SPARQL	bi-directional	D2R Map	true	manual	false	false
DartGrid	RDB	own query language	dynamic	Visual Tool	true	manual	false	true
DataMaster	RDB	ETL	static	proprietary	true	manual	true	true
Google Refine's RDF Extension	CSV, XML	ETL	static	none		semi-automatic	false	true
Krextor	XML	ETL	static	XSLT	true	manual	true	false
MAPONTO	RDB	ETL	static	proprietary	true	manual	true	false
METAmorphoses	RDB	ETL	static	proprietary XML-based mapping language	true	manual	false	true
MappingMaster	CSV	ETL	static	MappingMaster	true	GUI	false	true
ODEMapster	RDB	ETL	static	proprietary	true	manual	true	true
OntoWiki CSV Importer Plug-in - DataCube & Tabular	CSV	ETL	static	The RDF Data Cube Vocabulary	true	semi-automatic	false	true
Poolparty Extraktor (PPX)	XML, Text	LinkedData	dynamic	RDF (SKOS)	true	semi-automatic	true	false
RDBToOnto	RDB	ETL	static	none	false	automatic, the user furthermore has the chance to fine-tune results	false	true
RDF 123	CSV	ETL	static	false	false	manual	false	true
RDOTE	RDB	ETL	static	SQL	true	manual	true	true
Relational.OWL	RDB	ETL	static	none	false	automatic	false	false
T2LD	CSV	ETL	static	false	false	automatic	false	false
The RDF Data Cube Vocabulary	Multidimensional statistical data in spreadsheets			Data Cube Vocabulary	true	manual	false
TopBraid Composer	CSV	ETL	static	SKOS	false	semi-automatic	false	true
Triplify	RDB	LinkedData	dynamic	SQL	true	manual	false	false
Ultrawrap	RDB	SPARQL/ETL	dynamic	R2RML	true	semi-automatic	false	true
Virtuoso RDF Views	RDB	SPARQL	dynamic	Meta Schema Language	true	semi-automatic	false	true
Virtuoso Sponger	structured and semi-structured data sources	SPARQL	dynamic	Virtuoso PL & XSLT	true	semi-automatic	false	false
VisAVis	RDB	RDQL	dynamic	SQL	true	manual	true	true
XLWrap: Spreadsheet to RDF	CSV	ETL	static	TriG Syntax	true	manual	false	false
XML to RDF	XML	ETL	static	false	false	automatic	false	false

Extraction from natural language sources

The largest portion of information contained in business documents (about 80%^[10]) is encoded in natural language and is therefore unstructured. Because unstructured data is harder to process, knowledge extraction from it requires more sophisticated methods, which generally yield worse results than extraction from structured data. The potential for large-scale acquisition of extracted knowledge, however, should compensate for the increased complexity and decreased quality of extraction. In the following, natural language sources are understood as sources of information in which the data is given in an unstructured fashion as plain text. If the text is additionally embedded in a markup document (e.g. an HTML document), the systems mentioned here normally remove the markup elements automatically.
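The markup removal step can be illustrated with a minimal sketch. It uses only the `html.parser` module from the Python standard library; the class and function names (`TextExtractor`, `strip_markup`) are chosen for this illustration and do not correspond to any of the tools discussed in this article, which may handle markup quite differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores tags, scripts and styles."""

    SKIPPED = {"script", "style"}  # elements whose content is not running text

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse runs of whitespace.
        # Simplification: a tag inside a word would split that word.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html: str) -> str:
    """Return the plain text contained in an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_markup("<p>Peter is <b>married</b> to Mary.</p>"))
```

The resulting plain text is what a downstream NLP or extraction step would typically receive as input.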

Linguistic annotation / natural language processing (NLP)

As a preprocessing step to knowledge extraction, it can be necessary to perform linguistic annotation with one or more NLP tools. Individual modules in an NLP workflow normally build on tool-specific formats for input and output, but in the context of knowledge extraction, structured formats for representing linguistic annotations have been applied.

Typical NLP tasks relevant to knowledge extraction include:

part-of-speech (POS) tagging
lemmatization (LEMMA) or stemming (STEM)
word sense disambiguation (WSD, related to semantic annotation below)
named entity recognition (NER, also see IE below)
syntactic parsing, often adopting syntactic dependencies (DEP)
shallow syntactic parsing (CHUNK): if performance is an issue, chunking yields a fast extraction of nominal and other phrases
anaphora resolution (see coreference resolution in IE below, but seen here as the task of creating links between textual mentions rather than between the mention of an entity and an abstract representation of the entity)
semantic role labelling (SRL, related to relation extraction; not to be confused with semantic annotation as described below)
discourse parsing (relations between different sentences, rarely used in real-world applications)

In NLP, such data is typically represented in TSV formats (CSV formats with TAB as separators), often referred to as CoNLL formats. For knowledge extraction workflows, RDF views on such data have been created in accordance with the following community standards:

NLP Interchange Format (NIF, for many frequent types of annotation)^[11]^[12]
Web Annotation (WA, often used for entity linking)^[13]
CoNLL-RDF (for annotations originally represented in TSV formats)^[14]^[15]

Other, platform-specific formats include

LAPPS Interchange Format (LIF, used in the LAPPS Grid)^[16]^[17]
NLP Annotation Format (NAF, used in the NewsReader workflow management system)^[18]^[19]
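To make the idea of an RDF view on TSV annotations concrete, the following is a minimal sketch that turns CoNLL-style rows into subject-predicate-object triples. It is not the CoNLL-RDF implementation: the namespaces (`http://example.org/doc#`, `http://example.org/conll#`), the `nextWord` property name and the function name are assumptions made for illustration. Each token becomes a resource, each column becomes a property, and consecutive tokens in a sentence are linked so that the relation between rows is preserved.

```python
def escape_literal(value: str) -> str:
    """Quote a string as a simple RDF literal (backslash and quote escaped)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def conll_to_triples(tsv, columns, base="http://example.org/doc#"):
    """Turn CoNLL-style TSV into (subject, predicate, object) triples.

    Sentences are separated by blank lines; each non-empty row is one token.
    `columns` names the TSV columns in order, e.g. ["WORD", "POS"].
    """
    vocab = "http://example.org/conll#"  # hypothetical vocabulary namespace
    triples = []
    sent_no, tok_no, previous = 1, 0, None
    for line in tsv.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tok_no:
                sent_no, tok_no, previous = sent_no + 1, 0, None
            continue
        if line.startswith("#"):
            continue  # comment line
        tok_no += 1
        token = f"<{base}s{sent_no}_{tok_no}>"
        # One triple per annotation column.
        for name, value in zip(columns, line.split("\t")):
            triples.append((token, f"<{vocab}{name}>", escape_literal(value)))
        # Link to the preceding token so that row order is not lost.
        if previous is not None:
            triples.append((previous, f"<{vocab}nextWord>", token))
        previous = token
    return triples


if __name__ == "__main__":
    sample = "Peter\tNNP\nsleeps\tVBZ\n\nMary\tNNP\n"
    for s, p, o in conll_to_triples(sample, ["WORD", "POS"]):
        print(s, p, o, ".")
```

Once the annotations are available as triples, they can be queried and combined with other RDF data in the same way as the output of the structured-source tools.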

Tools

The following criteria can be used to categorize tools that extract knowledge from natural language text.

Source	Which input formats can be processed by the tool (e.g. plain text, HTML or PDF)?
Access Paradigm	Can the tool query the data source, or does it require a whole dump for the extraction process?
Data Synchronization	Is the result of the extraction process synchronized with the source?
Uses Output Ontology	Does the tool link the result with an ontology?
Mapping Automation	How automated is the extraction process (manual, semi-automatic or automatic)?
Requires Ontology	Does the tool need an ontology for the extraction?
Uses GUI	Does the tool offer a graphical user interface?
Approach	Which approach (IE, OBIE, OL or SA) is used by the tool?
Extracted Entities	Which types of entities (e.g. named entities, concepts or relationships) can be extracted by the tool?
Applied Techniques	Which techniques are applied (e.g. NLP, statistical methods, clustering or machine learning)?
Output Model	Which model is used to represent the result of the tool (e.g. RDF or OWL)?
Supported Domains	Which domains are supported (e.g. economy or biology)?
Supported Languages	Which languages can be processed (e.g. English or German)?

The following table characterizes some tools for knowledge extraction from natural language sources.

Name	Source	Access Paradigm	Data Synchronization	Uses Output Ontology	Mapping Automation	Requires Ontology	Uses GUI	Approach	Extracted Entities	Applied Techniques	Output Model	Supported Domains	Supported Languages
AeroText ^[24]	plain text, HTML, XML, SGML	dump	no	yes	automatic	yes	yes	IE	named entities, relationships, events	linguistic rules	proprietary	domain-independent	English, Spanish, Arabic, Chinese, Indonesian
AlchemyAPI ^[25]	plain text, HTML				automatic		yes	SA					multilingual
ANNIE ^[26]	plain text	dump				yes	yes	IE		finite state algorithms			multilingual
ASIUM ^[27]	plain text	dump			semi-automatic		yes	OL	concepts, concept hierarchy	NLP, clustering
Attensity Exhaustive Extraction ^[28]					automatic			IE	named entities, relationships, events	NLP
Dandelion API	plain text, HTML, URL	REST	no	no	automatic	no	yes	SA	named entities, concepts	statistical methods	JSON	domain-independent	multilingual
DBpedia Spotlight ^[29]	plain text, HTML	dump, SPARQL	yes	yes	automatic	no	yes	SA	annotation to each word, annotation to non-stopwords	NLP, statistical methods, machine learning	RDFa	domain-independent	English
EntityClassifier.eu	plain text, HTML	dump	yes	yes	automatic	no	yes	IE, OL, SA	annotation to each word, annotation to non-stopwords	rule-based grammar	XML	domain-independent	English, German, Dutch
FRED ^[30]	plain text	dump, REST API	yes	yes	automatic	no	yes	IE, OL, SA, ontology design patterns, frame semantics	(multi-)word NIF or EarMark annotation, predicates, instances, compositional semantics, concept taxonomies, frames, semantic roles, periphrastic relations, events, modality, tense, entity linking, event linking, sentiment	NLP, machine learning, heuristic rules	RDF/OWL	domain-independent	English, other languages via translation
iDocument ^[31]	HTML, PDF, DOC	SPARQL		yes			yes	OBIE	instances, property values	NLP		personal, business
NetOwl Extractor ^[32]	plain text, HTML, XML, SGML, PDF, MS Office	dump	no	yes	automatic	yes	yes	IE	named entities, relationships, events	NLP	XML, JSON, RDF-OWL, others	multiple domains	English, Arabic, Chinese (Simplified and Traditional), French, Korean, Persian (Farsi and Dari), Russian, Spanish
OntoGen ^[33]					semi-automatic		yes	OL	concepts, concept hierarchy, non-taxonomic relations, instances	NLP, machine learning, clustering
OntoLearn ^[34]	plain text, HTML	dump	no	yes	automatic	yes	no	OL	concepts, concept hierarchy, instances	NLP, statistical methods	proprietary	domain-independent	English
OntoLearn Reloaded	plain text, HTML	dump	no	yes	automatic	yes	no	OL	concepts, concept hierarchy, instances	NLP, statistical methods	proprietary	domain-independent	English
OntoSyphon ^[35]	HTML, PDF, DOC	dump, search engine queries	no	yes	automatic	yes	no	OBIE	concepts, relations, instances	NLP, statistical methods	RDF	domain-independent	English
ontoX ^[36]	plain text	dump	no	yes	semi-automatic	yes	no	OBIE	instances, datatype property values	heuristic-based methods	proprietary	domain-independent	language-independent
OpenCalais	plain text, HTML, XML	dump	no	yes	automatic	yes	no	SA	annotation to entities, annotation to events, annotation to facts	NLP, machine learning	RDF	domain-independent	English, French, Spanish
PoolParty Extractor ^[37]	plain text, HTML, DOC, ODT	dump	no	yes	automatic	yes	yes	OBIE	named entities, concepts, relations, concepts that categorize the text, enrichments	NLP, machine learning, statistical methods	RDF, OWL	domain-independent	English, German, Spanish, French
Rosoka	plain text, HTML, XML, SGML, PDF, MS Office	dump	yes	yes	automatic	no	yes	IE	named entity extraction, entity resolution, relationship extraction, attributes, concepts, multi-vector sentiment analysis, geotagging, language identification	NLP, machine learning	XML, JSON, POJO, RDF	multiple domains	multilingual (200+ languages)
SCOOBIE	plain text, HTML	dump	no	yes	automatic	no	no	OBIE	instances, property values, RDFS types	NLP, machine learning	RDF, RDFa	domain-independent	English, German
SemTag ^[38]^[39]	HTML	dump	no	yes	automatic	yes	no	SA		machine learning	database record	domain-independent	language-independent
smart FIX	plain text, HTML, PDF, DOC, e-Mail	dump	yes	no	automatic	no	yes	OBIE	named entities	NLP, machine learning	proprietary	domain-independent	English, German, French, Dutch, Polish
Text2Onto ^[40]	plain text, HTML, PDF	dump	yes	no	semi-automatic	yes	yes	OL	concepts, concept hierarchy, non-taxonomic relations, instances, axioms	NLP, statistical methods, machine learning, rule-based methods	OWL	domain-independent	English, German, Spanish
Text-To-Onto ^[41]	plain text, HTML, PDF, PostScript	dump			semi-automatic	yes	yes	OL	concepts, concept hierarchy, non-taxonomic relations, lexical entities referring to concepts, lexical entities referring to relations	NLP, machine learning, clustering, statistical methods			German
ThatNeedle	plain text	dump			automatic		no		concepts, relations, hierarchy	NLP, proprietary	JSON	multiple domains	English
The Wiki Machine ^[42]	plain text, HTML, PDF, DOC	dump	no	yes	automatic	yes	yes	SA	annotation to proper nouns, annotation to common nouns	machine learning	RDFa	domain-independent	English, German, Spanish, French, Portuguese, Italian, Russian
ThingFinder ^[43]								IE	named entities, relationships, events				multilingual

This article uses material from the Wikipedia article Knowledge_discovery, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1] RDB2RDF Working Group, Website: http://www.w3.org/2001/sw/rdb2rdf/, charter: http://www.w3.org/2009/08/rdb2rdf-charter, R2RML: RDB to RDF Mapping Language: http://www.w3.org/TR/r2rml/
[2] LOD2 EU Deliverable 3.1.1 Knowledge Extraction from Structured Sources http://static.lod2.eu/Deliverables/deliverable-3.1.1.pdf Archived 2011-08-27 at the Wayback Machine
[3] "Life in the Linked Data Cloud". www.opencalais.com. Archived from the original on 2009-11-24. Retrieved 2009-11-10. Wikipedia has a Linked Data twin called DBpedia. DBpedia has the same structured information as Wikipedia – but translated into a machine-readable format.
[4] Tim Berners-Lee (1998), "Relational Databases on the Semantic Web". Retrieved: February 20, 2011.
[5] Hu et al. (2007), "Discovering Simple Mappings Between Relational Database Schemas and Ontologies", In Proc. of 6th International Semantic Web Conference (ISWC 2007), 2nd Asian Semantic Web Conference (ASWC 2007), LNCS 4825, pages 225–238, Busan, Korea, 11–15 November 2007. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.6934&rep=rep1&type=pdf
[6] R. Ghawi and N. Cullot (2007), "Database-to-Ontology Mapping Generation for Semantic Interoperability". In Third International Workshop on Database Interoperability (InterDB 2007). http://le2i.cnrs.fr/IMG/publications/InterDB07-Ghawi.pdf
[7] Li et al. (2005) "A Semi-automatic Ontology Acquisition Method for the Semantic Web", WAIM, volume 3739 of Lecture Notes in Computer Science, pages 209–220. Springer. doi:10.1007/11563952_19
[8] Tirmizi et al. (2008), "Translating SQL Applications to the Semantic Web", Lecture Notes in Computer Science, Volume 5181/2008 (Database and Expert Systems Applications). http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=15E8AB2A37BD06DAE59255A1AC3095F0?doi=10.1.1.140.3169&rep=rep1&type=pdf
[9] Farid Cerbah (2008). "Learning Highly Structured Semantic Repositories from Relational Databases", The Semantic Web: Research and Applications, volume 5021 of Lecture Notes in Computer Science, Springer, Berlin / Heidelberg http://www.tao-project.eu/resources/publications/cerbah-learning-highly-structured-semantic-repositories-from-relational-databases.pdf Archived 2011-07-20 at the Wayback Machine
[10] Wimalasuriya, Daya C.; Dou, Dejing (2010). "Ontology-based information extraction: An introduction and a survey of current approaches", Journal of Information Science, 36(3), p. 306 - 323, http://ix.cs.uoregon.edu/~dou/research/papers/jis09.pdf (retrieved: 18.06.2012).
[11] "NLP Interchange Format (NIF) 2.0 - Overview and Documentation". persistence.uni-leipzig.org. Retrieved 2020-06-05.
[12] Hellmann, Sebastian; Lehmann, Jens; Auer, Sören; Brümmer, Martin (2013). Alani, Harith; Kagal, Lalana; Fokoue, Achille; Groth, Paul; Biemann, Chris; Parreira, Josiane Xavier; Aroyo, Lora; Noy, Natasha; Welty, Chris (eds.). "Integrating NLP Using Linked Data". The Semantic Web – ISWC 2013. Lecture Notes in Computer Science. 7908. Berlin, Heidelberg: Springer: 98–113. doi:10.1007/978-3-642-41338-4_7. ISBN 978-3-642-41338-4.
[13] Verspoor, Karin; Livingston, Kevin (July 2012). "Towards Adaptation of Linguistic Annotations to Scholarly Annotation Formalisms on the Semantic Web". Proceedings of the Sixth Linguistic Annotation Workshop. Jeju, Republic of Korea: Association for Computational Linguistics: 75–84.
[14] acoli-repo/conll-rdf, ACoLi, 2020-05-27, retrieved 2020-06-05
[15] Chiarcos, Christian; Fäth, Christian (2017). Gracia, Jorge; Bond, Francis; McCrae, John P.; Buitelaar, Paul; Chiarcos, Christian; Hellmann, Sebastian (eds.). "CoNLL-RDF: Linked Corpora Done in an NLP-Friendly Way". Language, Data, and Knowledge. Lecture Notes in Computer Science. 10318. Cham: Springer International Publishing: 74–88. doi:10.1007/978-3-319-59888-8_6. ISBN 978-3-319-59888-8.
[16] Verhagen, Marc; Suderman, Keith; Wang, Di; Ide, Nancy; Shi, Chunqi; Wright, Jonathan; Pustejovsky, James (2016). Murakami, Yohei; Lin, Donghui (eds.). "The LAPPS Interchange Format". Worldwide Language Service Infrastructure. Lecture Notes in Computer Science. 9442. Cham: Springer International Publishing: 33–47. doi:10.1007/978-3-319-31468-6_3. ISBN 978-3-319-31468-6.
[17] "The Language Application Grid | A web service platform for natural language processing development and research". Retrieved 2020-06-05.
[18] newsreader/NAF, NewsReader, 2020-05-25, retrieved 2020-06-05
[19] Vossen, Piek; Agerri, Rodrigo; Aldabe, Itziar; Cybulska, Agata; van Erp, Marieke; Fokkens, Antske; Laparra, Egoitz; Minard, Anne-Lyse; Palmero Aprosio, Alessio; Rigau, German; Rospocher, Marco (2016-10-15). "NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news". Knowledge-Based Systems. 110: 60–85. doi:10.1016/j.knosys.2016.07.013. ISSN 0950-7051.
[20] Cunningham, Hamish (2005). "Information Extraction, Automatic", Encyclopedia of Language and Linguistics, 2, p. 665 - 677, http://gate.ac.uk/sale/ell2/ie/main.pdf (retrieved: 18.06.2012).
[21] Chicco, D; Masseroli, M (2016). "Ontology-based prediction and prioritization of gene functional annotations". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 13 (2): 248–260. doi:10.1109/TCBB.2015.2459694. PMID 27045825. S2CID 2795344.
[22] Erdmann, M.; Maedche, Alexander; Schnurr, H.-P.; Staab, Steffen (2000). "From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools", Proceedings of the COLING, http://www.ida.liu.se/ext/epa/cis/2001/002/paper.pdf (retrieved: 18.06.2012).
[23] Rao, Delip; McNamee, Paul; Dredze, Mark (2011). "Entity Linking: Finding Extracted Entities in a Knowledge Base", Multi-source, Multi-lingual Information Extraction and Summarization, http://www.cs.jhu.edu/~delip/entity-linking.pdf (retrieved: 18.06.2012).
[24] Rocket Software, Inc. (2012). "technology for extracting intelligence from text", http://www.rocketsoftware.com/products/aerotext Archived 2013-06-21 at the Wayback Machine (retrieved: 18.06.2012).
[25] Orchestr8 (2012). "AlchemyAPI Overview", http://www.alchemyapi.com/api Archived 2016-05-13 at the Wayback Machine (retrieved: 18.06.2012).
[26] The University of Sheffield (2011). "ANNIE: a Nearly-New Information Extraction System", http://gate.ac.uk/sale/tao/splitch6.doc#chap:annie (retrieved: 18.06.2012).
[27] ILP Network of Excellence. "ASIUM (LRI)", http://www-ai.ijs.si/~ilpnet2/systems/asium.doc (retrieved: 18.06.2012).
[28] Attensity (2012). "Exhaustive Extraction", http://www.attensity.com/products/technology/semantic-server/exhaustive-extraction/ Archived 2012-07-11 at the Wayback Machine (retrieved: 18.06.2012).
[29] Mendes, Pablo N.; Jakob, Max; Garcia-Sílva, Andrés; Bizer, Christian (2011). "DBpedia Spotlight: Shedding Light on the Web of Documents", Proceedings of the 7th International Conference on Semantic Systems, p. 1 - 8, http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Jakob-GarciaSilva-Bizer-DBpediaSpotlight-ISEM2011.pdf Archived 2012-04-05 at the Wayback Machine (retrieved: 18.06.2012).
[30] Gangemi, Aldo; Presutti, Valentina; Reforgiato Recupero, Diego; Nuzzolese, Andrea Giovanni; Draicchio, Francesco; Mongiovì, Misael (2016). "Semantic Web Machine Reading with FRED", Semantic Web Journal, doi:10.3233/SW-160240, http://www.semantic-web-journal.net/system/files/swj1379.pdf
[31] Adrian, Benjamin; Maus, Heiko; Dengel, Andreas (2009). "iDocument: Using Ontologies for Extracting Information from Text", http://www.dfki.uni-kl.de/~maus/dok/AdrianMausDengel09.pdf (retrieved: 18.06.2012).
[32] SRA International, Inc. (2012). "NetOwl Extractor", http://www.sra.com/netowl/entity-extraction/ Archived 2012-09-24 at the Wayback Machine (retrieved: 18.06.2012).
[33] Fortuna, Blaz; Grobelnik, Marko; Mladenic, Dunja (2007). "OntoGen: Semi-automatic Ontology Editor", Proceedings of the 2007 conference on Human interface, Part 2, p. 309 - 318, http://analytics.ijs.si/~blazf/papers/OntoGen2_HCII2007.pdf (retrieved: 18.06.2012).
[34] Missikoff, Michele; Navigli, Roberto; Velardi, Paola (2002). "Integrated Approach to Web Ontology Learning and Engineering", Computer, 35(11), p. 60 - 63, http://wwwusers.di.uniroma1.it/~velardi/IEEE_C.pdf (retrieved: 18.06.2012).
[35] McDowell, Luke K.; Cafarella, Michael (2006). "Ontology-driven Information Extraction with OntoSyphon", Proceedings of the 5th international conference on The Semantic Web, p. 428 - 444, http://turing.cs.washington.edu/papers/iswc2006McDowell-final.pdf (retrieved: 18.06.2012).
[36] Yildiz, Burcu; Miksch, Silvia (2007). "ontoX - A Method for Ontology-Driven Information Extraction", Proceedings of the 2007 international conference on Computational science and its applications, 3, p. 660 - 673, http://publik.tuwien.ac.at/files/pub-inf_4769.pdf (retrieved: 18.06.2012).
[37] semanticweb.org (2011). "PoolParty Extractor", http://semanticweb.org/wiki/PoolParty_Extractor Archived 2016-03-04 at the Wayback Machine (retrieved: 18.06.2012).
[38] Dill, Stephen; Eiron, Nadav; Gibson, David; Gruhl, Daniel; Guha, R.; Jhingran, Anant; Kanungo, Tapas; Rajagopalan, Sridhar; Tomkins, Andrew; Tomlin, John A.; Zien, Jason Y. (2003). "SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation", Proceedings of the 12th international conference on World Wide Web, p. 178 - 186, http://www2003.org/cdrom/papers/refereed/p831/p831-dill.doc (retrieved: 18.06.2012).
[39] Uren, Victoria; Cimiano, Philipp; Iria, José; Handschuh, Siegfried; Vargas-Vera, Maria; Motta, Enrico; Ciravegna, Fabio (2006). "Semantic annotation for knowledge management: Requirements and a survey of the state of the art", Web Semantics: Science, Services and Agents on the World Wide Web, 4(1), p. 14 - 28, http://staffwww.dcs.shef.ac.uk/people/J.Iria/iria_jws06.pdf (retrieved: 18.06.2012).
[40] Cimiano, Philipp; Völker, Johanna (2005). "Text2Onto - A Framework for Ontology Learning and Data-Driven Change Discovery", Proceedings of the 10th International Conference of Applications of Natural Language to Information Systems, 3513, p. 227 - 238, http://www.cimiano.de/Publications/2005/nldb05/nldb05.pdf (retrieved: 18.06.2012).
[41] Maedche, Alexander; Volz, Raphael (2001). "The Ontology Extraction & Maintenance Framework Text-To-Onto", Proceedings of the IEEE International Conference on Data Mining, http://users.csc.calpoly.edu/~fkurfess/Events/DM-KM-01/Volz.pdf (retrieved: 18.06.2012).
[42] Machine Linking. "We connect to the Linked Open Data cloud", http://thewikimachine.fbk.eu/doc/index.doc Archived 2012-07-19 at the Wayback Machine (retrieved: 18.06.2012).
[43] Inxight Federal Systems (2008). "Inxight ThingFinder and ThingFinder Professional", http://inxightfedsys.com/products/sdks/tf/ Archived 2012-06-29 at the Wayback Machine (retrieved: 18.06.2012).
[44] Frawley, William J. et al. (1992), "Knowledge Discovery in Databases: An Overview", AI Magazine (Vol 13, No 3), 57-70 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1011 Archived 2016-03-04 at the Wayback Machine)
[45] Fayyad, U. et al. (1996), "From Data Mining to Knowledge Discovery in Databases", AI Magazine (Vol 17, No 3), 37-54 (online full version: http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/1230 Archived 2016-05-04 at the Wayback Machine)
[46] Cao, L. (2010). "Domain driven data mining: challenges and prospects". IEEE Transactions on Knowledge and Data Engineering. 22 (6): 755–769. CiteSeerX 10.1.1.190.8427. doi:10.1109/tkde.2010.32. S2CID 17904603.

Source	Which data sources are covered: text, relational databases, XML, CSV?
Exposition	How is the extracted knowledge made explicit (ontology file, semantic database)? How can it be queried?
Synchronization	Is the knowledge extraction process executed once to produce a dump (static), or is the result kept synchronized with the source (dynamic)? Are changes to the result written back to the source (bi-directional)?
Reuse of vocabularies	Is the tool able to reuse existing vocabularies in the extraction? For example, the table column 'firstName' can be mapped to foaf:firstName. Some automatic approaches are not capable of mapping to existing vocabularies.
Automatization	To what degree is the extraction assisted or automated (manual, GUI, semi-automatic, automatic)?
Requires a domain ontology	Is a pre-existing ontology needed as the mapping target? Either a mapping to an existing ontology is created, or a schema is learned from the source (ontology learning).

Name	marriedTo	homepage	status_id
Peter	Mary	http://example.org/Peters_page	1
Claus	Eva	http://example.org/Claus_page	2

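A simple row-to-triple mapping of the sample table above can be sketched as follows. This is an illustration in the spirit of a direct mapping, not the W3C Direct Mapping itself: the base IRI, the table name `person`, the IRI patterns and the function name are assumptions. Each row becomes a subject, each column a predicate and each cell an object. The optional `reused` dictionary shows vocabulary reuse by mapping the `homepage` column to the existing FOAF property `foaf:homepage`.

```python
import csv
import io

SAMPLE = (
    "Name\tmarriedTo\thomepage\tstatus_id\n"
    "Peter\tMary\thttp://example.org/Peters_page\t1\n"
    "Claus\tEva\thttp://example.org/Claus_page\t2\n"
)

BASE = "http://example.org/"         # assumed base IRI for generated terms
FOAF = "http://xmlns.com/foaf/0.1/"  # FOAF vocabulary namespace


def rows_to_triples(tsv, table="person", key="Name", reused=None):
    """Map every cell of a TSV table to a (subject, predicate, object) triple.

    `reused` maps column names to IRIs of existing vocabulary terms;
    all other columns get a predicate generated from the column name.
    """
    reused = reused or {}
    triples = []
    for row in csv.DictReader(io.StringIO(tsv), delimiter="\t"):
        subject = f"<{BASE}{table}/{row[key]}>"
        for column, value in row.items():
            generated = f"{BASE}{table}#{column}"
            predicate = f"<{reused.get(column, generated)}>"
            if value.startswith(("http://", "https://")):
                obj = f"<{value}>"  # treat web addresses as resources
            else:
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                obj = f'"{escaped}"'  # everything else as a plain literal
            triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    mapping = {"homepage": FOAF + "homepage"}
    for s, p, o in rows_to_triples(SAMPLE, reused=mapping):
        print(s, p, o, ".")
```

Note that this sketch keeps `marriedTo` and `status_id` as plain literals. Turning them into links to other resources (for example, resolving `status_id` against a second table) is the kind of step that requires a more complex mapping.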