Deep Mining for Information Discovery


DIMRC researches AI deep learning technologies to assist in curating online information resources. This research supports the NLM Strategic Plan Goal 1: Accelerate discovery and advance health by providing the tools for data-driven research.


Disaster Information Curation

Figure 1: Monitoring disaster information 


Maintaining up-to-date collections of high-quality, highly relevant and timely content published on the Web by authoritative sources is labor intensive. This is particularly challenging in the context of disaster health, where timeliness is of the essence. New content is usually announced via social media and other communication channels. To find new content, information specialists scan a large number of sources, including a curated list of hundreds of selected content providers on Twitter, RSS feeds and more.


Figure 2: DisasterLit curation process 


Machine Learning Approach

We are using an artificial intelligence approach based on Bidirectional Encoder Representations from Transformers (BERT) to automatically detect Twitter posts promoting new content of interest. The model uses a pre-trained deep learning network of over 340 million parameters fine-tuned on our data. One advantage of this state-of-the-art approach is that the model does not require a large number of examples to learn the desired behavior.
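The final step of such a classifier, turning the network's raw output scores into a probability, can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a binary classifier that emits two scores (one for "not a lead", one for "lead"), which the text does not specify, and the function name `lead_probability` is hypothetical.

```python
import math


def lead_probability(logits):
    """Softmax over two scores [not_lead, lead]; returns the probability of 'lead'.

    Assumption: the fine-tuned model ends in a two-way classification layer.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


# Equal scores mean the model is undecided.
undecided = lead_probability([0.0, 0.0])

# A much higher "lead" score yields a probability close to 1.
confident = lead_probability([-1.0, 2.0])
```

Fine-tuning adjusts the pre-trained network so that these scores separate posts announcing new content from all others.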



Figure 3: BERT-based document classifier 


DisasterTweet Miner

DisasterTweet Miner (DTM), a tool under development, reduces the time and effort required to find promising new document leads in social media.


Figure 4: DisasterTweet Miner user interface 


The posts are shown in descending order of relevance, which in this case we define as the probability that a post links to relevant content (i.e., the post is a “good lead”). The information specialist can set a probability threshold to automatically discard posts that are not relevant enough. The tool also automatically discards posts that do not meet certain key conditions. In addition, DTM has features for collecting examples to further fine-tune the machine learning model.
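The ranking and filtering described above can be sketched as follows. This is an illustration under stated assumptions, not DTM's actual code: the `Post` type, the `rank_leads` function, the default threshold, and the `has_link` condition (a post must contain a URL) are all hypothetical, since the text does not say which key conditions the tool checks.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    probability: float  # model's estimate that the post is a "good lead"


def has_link(post):
    # Hypothetical key condition: a lead must point at some content.
    return "http://" in post.text or "https://" in post.text


def rank_leads(posts, threshold=0.5, conditions=(has_link,)):
    """Keep posts that meet the threshold and every condition, most probable first."""
    kept = [
        p for p in posts
        if p.probability >= threshold and all(cond(p) for cond in conditions)
    ]
    return sorted(kept, key=lambda p: p.probability, reverse=True)


posts = [
    Post("New guidance on flood recovery https://example.org/a", 0.91),
    Post("Thinking of everyone affected today", 0.12),
    Post("Updated report available https://example.org/b", 0.67),
    Post("Report coming soon, stay tuned", 0.80),
]

# The specialist raises the threshold to 0.6; the fourth post scores well
# but is discarded because it fails the (assumed) link condition.
leads = rank_leads(posts, threshold=0.6)
```

Raising the threshold shortens the list the specialist has to review, at the cost of possibly missing leads the model scored lower.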