## Knowledge graphs: Introduction, history, and perspectives
> [!Abstract]-
> Knowledge graphs (KGs) have emerged as a compelling abstraction for organizing the world's structured knowledge and for integrating information extracted from multiple data sources. They are also beginning to play a central role in representing information extracted by AI systems, and for improving the predictions of AI systems by giving them knowledge expressed in KGs as input. The goals of this article are to (a) introduce KGs and discuss important areas of application that have gained recent prominence; (b) situate KGs in the context of the prior work in AI; and (c) present a few contrasting perspectives that help in better understanding KGs in relation to related technologies.
> [!Cite]-
> Chaudhri, Vinay K., Chaitanya Baru, Naren Chittar, et al. “Knowledge Graphs: Introduction, History, and Perspectives.” _AI Magazine_ 43, no. 1 (2022): 17–29. [https://doi.org/10.1002/aaai.12033](https://doi.org/10.1002/aaai.12033).
>
> [link](https://onlinelibrary.wiley.com/doi/abs/10.1002/aaai.12033) [online](http://zotero.org/users/17587716/items/CSLG46JA) [local](zotero://select/library/items/CSLG46JA) [pdf](file://C:\Users\erikt\Zotero\storage\URNENGD7\Chaudhri%20et%20al.%20-%202022%20-%20Knowledge%20graphs%20Introduction,%20history,%20and%20perspectives.pdf)
## Notes
%% begin notes %%
%% end notes %%
%% begin annotations %%
### Imported: 2025-08-05 1:09 pm
A KG is a directed labeled graph in which domain-specific meanings are associated with nodes and edges. A node could represent any real-world entity, for example, people, companies, and computers. An edge label captures the relationship of interest between the two nodes: for example, a friendship relationship between two people, a customer relationship between a company and a person, or a network connection between two computers.
There are multiple approaches for associating meanings with the nodes and edges. At the simplest level, the meanings could be stated as documentation strings expressed in a human understandable language such as English. At a computational level, the meanings can be expressed in a formal specification language such as first-order logic.
Information can be added to a KG via a combination of human-driven, semiautomated, and/or fully automated methods. Regardless of the method, it is expected that the recorded information can be easily understood and verified by humans.
Search and query operations on KGs can be reduced to graph navigation.
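The directed-labeled-graph view above can be sketched in a few lines of code. This is a hedged toy example (the entity and relation names are hypothetical, not from the article) showing how search reduces to navigating labeled edges:

```python
# Minimal sketch of a KG as a set of (subject, relation, object) triples.
# All entity and relation names here are hypothetical illustrations.
kg = {
    ("alice", "friend_of", "bob"),          # friendship between two people
    ("acme_corp", "customer", "alice"),     # company-person relationship
    ("server1", "connected_to", "server2"), # network connection
}

def neighbors(graph, node, relation):
    """Follow outgoing edges with a given label -- query as graph navigation."""
    return {o for (s, r, o) in graph if s == node and r == relation}

print(neighbors(kg, "alice", "friend_of"))  # {'bob'}
```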
Practical systems adapt the directed labeled graph representation to suit specific application requirements.
For example, a KG model prominently used over the World Wide Web, called the Resource Description Framework (RDF) (Cyganiak, Wood, and Lanthaler 2014), uses Internationalized Resource Identifiers (IRIs) to uniquely identify “things” (entities).
Property graph models (Robinson, Webber, and Eifrem 2015) associate properties and values with each node and each edge.
With the recent emphasis on responsible AI, annotating the edges with information on how they were obtained plays a key role in explaining inferences based on the KG. For example, an edge property of confidence could be used to represent the probability with which that relationship is known to be true.
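A property-graph edge carrying such provenance annotations might be sketched as below; the field names and values are hypothetical, chosen only to illustrate the confidence idea:

```python
# Hedged sketch of a property-graph edge with provenance metadata: the
# 'confidence' property records the probability with which the relation
# is known to be true, supporting explanation of downstream inferences.
edge = {
    "source": "acme_corp",
    "target": "alice",
    "label": "customer",
    "properties": {
        "confidence": 0.87,                  # probability the edge holds
        "extracted_by": "extraction_run_42", # hypothetical provenance tag
    },
}

# A consumer might filter out low-confidence edges before reasoning:
keep = edge["properties"]["confidence"] >= 0.5
print(keep)  # True
```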
Finally, query languages, such as SPARQL (Pérez et al. 2006) for RDF and the Graph Query Language for property graph models, provide the ability to query the information in RDF and property graph KGs, respectively.
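The core of such query languages is matching triple patterns against the graph. A hedged Python sketch in the spirit of a SPARQL basic graph pattern (the data and the wildcard convention are illustrative assumptions, not the SPARQL syntax itself):

```python
# Triple-pattern matching over plain Python tuples; None marks a
# variable position, loosely analogous to a SPARQL variable like ?x.
triples = [
    ("Winterthur", "instance_of", "city"),
    ("Winterthur", "located_in", "Switzerland"),
    ("Zurich", "located_in", "Switzerland"),
]

def match(pattern, data):
    """Return all triples matching the pattern; None is a wildcard."""
    return [t for t in data
            if all(p is None or p == v for p, v in zip(pattern, t))]

# "Which entities are located in Switzerland?"
print([s for (s, _, _) in match((None, "located_in", "Switzerland"), triples)])
# ['Winterthur', 'Zurich']
```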
Two key applications that have led to a surge in popularity of KGs are: (1) integration and organization of information about known “entities,” either as an openly accessible resource on the web, or as a proprietary resource within an enterprise/organization; and (2) representation of input and output information for AI/ML algorithms.
Wikidata is a collaboratively edited open KG that provides data for Wikipedia and for other uses on the web (Vrandečić and Krötzsch 2014).
By using unique internal identifiers for distinct entities (for example, Winterthur), drawn from a variety of sources such as the Library of Congress, the information about an entity can be easily linked together.
Wikidata makes it easy to integrate the different data sources by publishing a mapping of the Wikidata relations to the schema.org ontology.
As per a recent estimate, 31% of all websites and over 12 million data providers are currently using the vocabulary of schema.org to publish annotations on their web pages (Guha, Brickley, and Macbeth 2016).
This is because a relational system is typically modeled to support the application (McComb 2018), and thus, schema changes often require database reorganization. On the other hand, in a KG system, the schema is modeled to represent the enterprise (McComb 2019), and its representation in triples remains fixed.
Due to the relative ease of creating and visualizing the schema and the availability of built-in analytics operations, KGs are becoming a popular solution for turning data into intelligence in enterprises.
KGs are an essential technology for natural language processing (NLP), computer vision (CV), and commonsense reasoning.
In CV, an image is represented as a set of objects with a set of properties, where each object corresponds to a bounding box, identified by an object detector, and the objects are interconnected by a set of named relationships that are predicted by a model trained for identifying visual relationships. In Figure 4, a CV algorithm produces the KG shown to the right with objects such as a woman, a cow, and a mask, and relationships such as holding, feeding, and others. In modern CV research, such a KG is referred to as a scene graph (Chen et al. 2019), which has become a central tool for achieving compositional behavior in CV algorithms. That is, once a CV algorithm has been trained to recognize certain objects, then by leveraging scene graphs, it can be trained to recognize any combination of those objects with fewer examples. Scene graphs also provide the foundation for tasks such as visual question answering (Zhu et al. 2016).
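The Figure 4 scene graph can be represented with the same triple machinery as any other KG. A hedged sketch (the bounding-box coordinates are invented for illustration; the objects and relationships follow the article's example):

```python
# Hedged sketch of a scene graph: detected objects, each tied to a
# bounding box from an object detector, linked by predicted visual
# relationships. Coordinates are hypothetical (x, y, width, height).
objects = {
    "woman": (12, 30, 140, 260),
    "cow":   (150, 90, 200, 180),
    "mask":  (40, 35, 50, 40),
}
relationships = [
    ("woman", "feeding", "cow"),
    ("woman", "holding", "mask"),
]

# Compositionality: any combination of known objects can be described
# by recombining the same vocabulary of objects and relationships.
print(sorted({r for (_, r, _) in relationships}))  # ['feeding', 'holding']
```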
The earliest research in AI used frame representations, known as semantic networks, which were directed labeled graphs (Woods 1975).
In some data models, given a triple (A, B, C), we refer to A, B, C as the subject, the predicate, and the object of the triple, respectively. For example, given the triple (“Biden,” “President,” “USA”), “Biden” is the subject, “President” is the predicate, and “USA” is the object of the triple. A directed labeled graph containing data and taxonomy is often referred to as an ontology.
While some researchers used first-order logic (FOL) to computationally understand semantic networks (Hayes 1981), others advocated that FOL was required to represent the knowledge needed for AI agents (McCarthy 1989). Because of the computational difficulty of reasoning with FOL, different subsets of FOL, such as description logics (Brachman and Levesque 1984) and logic programs (Kowalski 2014), were investigated. There was an analogous development in databases where the initial data systems were based on a network data model (Taylor and Frank 1976), but a desire to achieve independence between the data model and the query processing eventually led to the development of relational data model (Codd 1982), which shares its mathematical core with logic programming. A need to handle semistructured data (Buneman 1997) inspired the investigation of “schema-free” systems or triple stores that capture an important class of problems addressed by modern KG systems.
This trajectory of development in AI can be loosely characterized as starting from the need for explicit representations (McCarthy 1989; Newell 1982) to expert systems (Feigenbaum 1984) to large common sense knowledge bases (Lenat 1995).
The mid-1990s saw an explosion of information on the web, and better methods to access and search this information were needed. There was a tremendous success in using information retrieval methods such as the Page Rank algorithm (Page et al. 1999), and yet it was felt that more was possible if there was a way for us to convey the semantics to our search algorithms (Berners-Lee, Hendler, and Lassila 2001). That vision is coming to fruition with the improvement in search results with the help of resources such as Wikidata and Data Commons which use representations heavily influenced by an earlier language called the Meta Content Format (Guha 1996).
With the increasing adoption and use of KGs in different scenarios and use cases, three contrasting perspectives have emerged: symbolic representation versus vector representation, human curation versus machine curation, and “little semantics” versus “big semantics.”
Machine learning algorithms used for NLP and CV rely on a vector representation of text and images. The recent success of deep learning on multiple tasks has prompted many to reject the need for any symbolic representation.
Graph embedding is a generalization of word embedding, but for graph-structured input (Hamilton 2020).
The use of graph embeddings with a neural network—also known as machine learning with graphs—is being used for handling unseen actions in the cause-effect KGs we considered earlier.
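One way to see how embeddings let a model score unseen combinations is a translation-style scoring function in the spirit of TransE: a triple (h, r, t) is plausible when the vector h + r lands near t. This is a hedged toy sketch with hand-picked (not learned) vectors and hypothetical cause-effect entities:

```python
# Hedged sketch of translation-based graph embedding scoring: each
# entity and relation is a vector, and a triple (h, r, t) scores high
# when h + r is close to t (negative L2 distance).
def score(h, r, t):
    """Higher (closer to 0) means the triple is judged more plausible."""
    return -sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

emb = {
    "rain":    [0.9, 0.1],
    "flood":   [1.0, 0.9],
    "drought": [-0.8, -0.2],
    "causes":  [0.1, 0.8],  # relation vector
}

# Even a (rain, causes, ?) edge absent from the training graph can be
# scored via the vectors alone:
plausible = score(emb["rain"], emb["causes"], emb["flood"])
implausible = score(emb["rain"], emb["causes"], emb["drought"])
print(plausible > implausible)  # True
```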
Neuro-symbolic reasoning is a fast-emerging area of research that leverages the benefits of automatic calculation of embeddings while recognizing the need for a discrete KG to produce a human-understandable representation.
Industrial KGs, such as the Google KG, Amazon Product Graph (APG), and Microsoft Academic Graph (MAG) are of unprecedented scale (Noy et al. 2019).
The MAG team used machine curation to solve the problem of uniquely identifying authors and their publications (Wang et al. 2020).
A human curation strategy advocates setting up standards such as the Digital Object Identifier (DOI) for uniquely identifying publications, and the Open Researcher and Contributor ID (ORCID) for uniquely identifying authors. This approach relies on authors and publishing organizations contributing manual effort to annotate documents with DOIs and ORCIDs.
Wikidata has leveraged standard published identifiers, including the International Standard Name Identifier (ISNI), China Academic Library and Information System (CALIS), International Air Transport Association (IATA), MusicBrainz for albums and performers, and North Atlantic Basin’s Hurricane Database (HURDAT). Wikidata itself publishes a list of standard identifiers for items that appear in its corpus, which are now increasingly being used in commercial KGs.
The Cyc knowledge base was largely created through human curation because the project aims to capture “hidden” knowledge that is not explicitly written down in text and, thus, cannot be automatically extracted.
The big semantics perspective may be viewed as one that advocates capturing richer meaning about concepts, whereas the little semantics perspective focuses on capturing and recording basic facts rather than deep concept meanings.
A KG defined as a directed labeled graph is a representative technique of the little semantics approach. The representation language CycL is a representative technique of the big semantics approach.
Wikidata, Data Commons, MAG, and APG all employ a directed labeled graph representation at their core, and their existence and commercial usefulness are strong evidence that a little semantics goes a long way (Hendler 2007).
Early semantic networks were created by top-down design methods and manual knowledge engineering processes. They never reached the size and scale of today’s KGs. In contrast, modern KGs tend to be large in scale; employ bottom-up development techniques; and employ manual as well as automated strategies for their construction.
The vast proliferation of available data, the difficulty of arriving at a top-down schema design for data integration, and the data-driven nature of machine learning have all led to a bottom-up methodology for creating KGs.
However, we posit that modern KG construction methods should also learn the lessons from classical knowledge representation, as there is much to benefit from the substantial body of prior research without reinventing available methods and tools.
Setting a use-inspired context enables us to justify the need and helps specify the requirements for the specific innovations for KGs to have the maximum societal and scientific impact.
%% end annotations %%
%% Import Date: 2025-08-05T13:10:03.387-06:00 %%