Work package 8: Association analyser implementation

Objectives: To provide tools to identify various types of referential and semantic connections between different items in the content repositories, as described in the Metadata Repository, and also between such items and external resources. This will allow simple navigation and browsing through networks of related and interconnected documents and linking of elements (e.g. names of people, theorems or concepts) in these items to and from external resources such as encyclopaedia entries (including Wikipedia), historical information and cultural references.

  • Deliverable 8.1 – Association analyzer implementation: State of the art. This report focuses on two key technologies: Citation Indexing and Document Clustering. Citation Indexing concerns the automatic parsing and linking of citations to create a network of documents within the collection. This technology is well established in digital libraries and searchable archives such as CiteSeerX and Google Scholar, in general projects such as DRIVER, and in mathematics-specific digital libraries such as NUMDAM and DML-CZ or the reviewing databases Zentralblatt MATH and Mathematical Reviews. Document Classification and Clustering are also established technologies within Information Retrieval but have not to date been widely used within digital libraries. In particular, there is very little previous work applying classification and clustering techniques to mathematical documents. However, initial research appears promising, and we believe that adding these technologies will enable facilities beyond the current state of the art.
  • Deliverable 8.2 – Toolset for entity and semantic associations – initial release. In this document we describe the initial release of the toolset for entity and semantic associations, integrating Unsupervised Document Clustering (initially implemented by the partner MU) and Citation Indexing and Matching (as provided by partners ICM and UJF/CMD). We give a brief description of each tool and report some initial evaluation.
  • Deliverable 8.3 – Toolset for entity and semantic associations – value release. In this document we describe the value release of the toolset for entity and semantic associations, integrating Unsupervised Document Similarity implemented by MU (using the gensim tool) and Citation Indexing and Matching (as provided by ICM and UJF/CMD). We give a brief description of the tools and provide some initial evaluation.
  • Deliverable 8.4 – Toolset for entity and semantic associations – final release. In this document we describe the final release of the toolset for entity and semantic associations, integrating two versions (language dependent and language independent) of Unsupervised Document Similarity implemented by MU (using the gensim tool) and Citation Indexing, Resolution and Matching (UJF/CMD). We give a brief description of the tools, explain the rationale behind the decisions made, and provide an elementary evaluation.
    The tools are integrated into the main project result, the EuDML website, and deliver the functionality needed for exploratory searching and browsing of the collected documents. EuDML users and content providers thus benefit from millions of algorithmically generated similarity and citation links, produced using state-of-the-art machine learning and matching methods.
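
To illustrate what an unsupervised document similarity link is, the following minimal sketch ranks documents by TF-IDF cosine similarity using only the Python standard library. It is an assumption-laden illustration, not the gensim-based implementation delivered by MU: the function names (tfidf_vectors, cosine, most_similar), the whitespace tokenisation and the three sample titles are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Naive whitespace tokenisation; a real system would handle formulae, stemming, etc.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * math.log(n_docs / df[t]) for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(docs, index, top_n=3):
    """Return up to top_n (document index, score) pairs most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [(j, cosine(vectors[index], vectors[j]))
              for j in range(len(docs)) if j != index]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical sample titles, used only to exercise the sketch.
docs = [
    "compact operators on hilbert spaces",
    "spectral theory of compact operators",
    "prime numbers in arithmetic progressions",
]
links = most_similar(docs, 0)
```

Here links lists the other documents in decreasing order of similarity to the first title, so the second title (which shares the terms "compact" and "operators") ranks ahead of the third, which shares none. Generating such links for every document in a collection yields the kind of similarity network described above, although at EuDML scale an indexed approach would be needed rather than this pairwise comparison.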