Authors: Dimitris Papadopoulos

Noima: A social media and Web analysis suite for Greek corpora

Noima (Greek for “meaning”) is Infili’s text analytics and information extraction platform. It uses web crawling and computational linguistics techniques to collect and process thousands of data sources, including unstructured text (websites, blogs, forums, etc.). It then applies open information extraction models to isolate the valuable information in these sources, collecting named entities together with the direct and latent semantic relationships between them. The platform is primarily used to map the ecosystems of entities (enterprises, individuals, products), using graph databases for entity linking and NLP techniques for syntactic parsing, part-of-speech tagging, entity extraction, sentiment analysis, etc. Over the course of the INODE project, Infili aims to enrich the suite’s toolkit by integrating state-of-the-art deep-learning models for language modelling, coreference resolution, text comprehension and summarization.

Noima’s graph construction is based on the open-source version of the Neo4j graph database management system, an ACID-compliant transactional database with native graph storage and processing that is well suited to deep or variable-length traversals and path queries. A combination of open-source statistical and neural NLP tools and libraries (e.g. spaCy, AllenNLP, Hugging Face) is used for language-processing tasks, enabling information extraction from unstructured corpora. Although Noima was originally intended for Greek Web sources, its NER-related capabilities could be repurposed for entity and relation extraction from English unstructured text without major modifications.
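
To make the graph-construction idea more concrete, here is a minimal Python sketch of how entities extracted from documents might be turned into a co-occurrence graph and expressed as parameterized Cypher statements. It is not Noima’s actual implementation: the sample entities, the `Entity` label, the `CO_OCCURS` relationship type, the `weight` property and all function names are assumptions made for illustration, and the NER step is replaced by hard-coded input.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: entities already extracted per document by an NER step
# (for example a spaCy pipeline). Names and labels are illustrative only.
DOCS = [
    [("Infili", "ORG"), ("Noima", "PRODUCT"), ("INODE", "PROJECT")],
    [("Infili", "ORG"), ("Noima", "PRODUCT")],
    [("Noima", "PRODUCT"), ("Neo4j", "PRODUCT")],
]


def cooccurrence_edges(docs):
    """Count how often each unordered pair of entity names shares a document."""
    counts = Counter()
    for entities in docs:
        # Deduplicate and sort so each pair has one canonical ordering.
        names = sorted({name for name, _label in entities})
        counts.update(combinations(names, 2))
    return counts


# Parameterized Cypher: MERGE creates a node or relationship only if it is
# not already present. Label, relationship type and property are assumptions.
MERGE_EDGE = (
    "MERGE (a:Entity {name: $src}) "
    "MERGE (b:Entity {name: $dst}) "
    "MERGE (a)-[r:CO_OCCURS]-(b) "
    "SET r.weight = $weight"
)

# A variable-length path query (1 to 3 hops) between two named entities.
PATH_QUERY = (
    "MATCH p = (a:Entity {name: $src})-[:CO_OCCURS*1..3]-(b:Entity {name: $dst}) "
    "RETURN p"
)


def to_statements(edge_counts):
    """Pair the MERGE query with one parameter dict per co-occurrence edge."""
    return [
        (MERGE_EDGE, {"src": src, "dst": dst, "weight": weight})
        for (src, dst), weight in sorted(edge_counts.items())
    ]


edges = cooccurrence_edges(DOCS)
statements = to_statements(edges)
```

Each `(query, parameters)` pair in `statements` could then be sent to a Neo4j instance, for example through the official Python driver’s `session.run(query, parameters)`. `PATH_QUERY` shows the kind of variable-length traversal that a native graph store is designed to handle.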