Data Centricity and NLP: Improving the Quality of Biomolecular Data

Data and the model (code) are the two basic components of any Artificial Intelligence (AI) based system.

Both components play an important role in producing the desired results. The AI community, in both academic and industry research, has focused far more on improving and iterating on models against benchmark datasets than on improving the quality of the data itself. In a model-centric approach, improving the model or code is the central objective while the data is kept the same. This focus on the model also holds for Natural Language Processing (NLP) workflows. It has led to models such as the Pathways Language Model (PaLM) by Google, with about 540 billion parameters, and GPT-3 by OpenAI, with 175 billion parameters. Even bigger models from OpenAI, NVIDIA, Microsoft and others are already in the pipeline.

While the focus stays on improving NLP models, data is frequently overlooked. Most researchers view data collection as a one-time event, which leads to crucial data being mishandled or mislabelled. According to research (Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks), roughly 3.3% of samples in the test sets of commonly used computer vision, natural language, and audio datasets are mislabelled, and these errors mostly affect large models. Data collected within an organization may have a higher chance of mislabelling, as the data and process may not have been reviewed by a wider body of researchers. Training large models like GPT-3 costs a lot of time and money. The large size of such models also makes it inherently more difficult to remove biases that may be present in the training data and affect the analysis.

Effective implementation of NLP on biomolecular data requires highly customized models that are trained on good quality healthcare datasets. To increase the quality of biomolecular data, one can focus on:

Improving label quality
Quality checks by subject matter experts
Data augmentation
Processing data

Improving Label Quality

Labels are specific values that are assigned to the data. During manual curation, incorrect labels can be assigned to the datasets, which reduces model accuracy. Below are some examples of text data labelled differently by two manual curators:

Consistent data annotation and labels can significantly improve data quality and model performance. Several strategies, such as a double-blind review process, already exist to improve the labelling process.
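
One common way to quantify how consistently two curators label the same items is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is only an illustration: the curator labels are hypothetical and the function name cohen_kappa is our own, not part of any particular library.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two curators for the same six text snippets.
curator_1 = ["disease", "gene", "gene", "drug", "disease", "gene"]
curator_2 = ["disease", "gene", "drug", "drug", "disease", "disease"]

kappa = cohen_kappa(curator_1, curator_2)
# Indices where the curators disagree; candidates for a second review.
disagreements = [i for i, (a, b) in enumerate(zip(curator_1, curator_2)) if a != b]
print(f"kappa = {kappa:.2f}, items to re-review: {disagreements}")
```

A kappa close to 1 indicates strong agreement, while values near 0 suggest the agreement is no better than chance. The list of disagreeing items gives a natural queue for a second review.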

Machine learning researchers have proposed various methodologies, such as self-supervision and weak supervision, to increase the efficiency of label utilization and to improve the label annotation process. We will cover how to improve label quality using a variety of automation approaches in subsequent blogs.

Quality Checks by Subject Matter Experts

The right domain knowledge and expertise are crucial for generating high-quality labelled data, especially for biomolecular data. Subject matter experts (SMEs) can detect small discrepancies in the data that data scientists or machine learning engineers may miss. Below is an example:

To a layperson, FISH and chips would seem like a popular British dish consisting of fried fish in crispy batter, served with chips. However, in the biological context, FISH and chips would mean Fluorescence in situ hybridization (FISH), a laboratory technique for detecting and locating a specific DNA sequence on a chromosome, performed on microfluidic chips.

Hence, having the data reviewed by SMEs and having quality check protocols in place are crucial for improving the performance of the model using a data-centric approach.

Data Augmentation

There is a lot of biomolecular data, but high-quality biomolecular data for training NLP models is hard to find. Data augmentation strategies can help overcome this challenge, although augmentation is harder in the biomolecular NLP context than in general text. Data augmentation increases the amount of high-quality data by adding slightly modified copies of existing data or by creating synthetic data from existing data. It can be achieved by replacing words or phrases with their biomolecular synonyms, by back-translation (translating English text into a different language and then translating it back into English), and by similar techniques. One can also explore XLNet, a generalized autoregressive pretraining method that can learn bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. Generative adversarial networks (GANs) can also be used to generate a much broader set of augmented data. However, adding more noisy data may not be the best option, as it may do more harm than good.
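
As a rough illustration of synonym replacement, the sketch below swaps one term at a time for a synonym and returns each variant as a new training sentence. The synonym table and the function name augment are hypothetical. In practice the table would come from a curated biomedical vocabulary reviewed by SMEs. Matching is case-sensitive on purpose, so that the acronym FISH is not confused with the food.

```python
import re

# Hypothetical synonym table; a real one would be curated by domain experts.
SYNONYMS = {
    "tumor": ["neoplasm"],
    "FISH": ["fluorescence in situ hybridization"],
}


def augment(sentence, synonyms=SYNONYMS):
    """Return variants of `sentence`, each with one term replaced by a synonym."""
    variants = []
    for term, alternatives in synonyms.items():
        # Whole-word, case-sensitive match so "FISH" does not match "fish".
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if not pattern.search(sentence):
            continue
        for alt in alternatives:
            # A function replacement avoids treating `alt` as a regex template.
            variants.append(pattern.sub(lambda _m, alt=alt: alt, sentence))
    return variants


original = "The tumor sample was analysed by FISH."
augmented = augment(original)
for variant in augmented:
    print(variant)
```

Each variant keeps the original label, so this only makes sense when the swapped terms are true synonyms in context. That is one reason augmented data should go through the same quality checks as the original data.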

Processing Data

When creating a good NLP model, clean, high-quality biomolecular data is very hard to come by. Preparing the raw data and making it suitable for the NLP model is a crucial step. It is important to clean the data, deal with missing values, and convert it to an acceptable format. Following best practices for data pre-processing can also help improve the quality of the data.

These are some of the standard heuristic pre-processing steps that are applied to datasets before training NLP models on them:

Selecting sentences that end with punctuation marks such as a period, exclamation mark, question mark, or end quotation mark.
Splitting an abstract into sentences, which might help in finding relationships between sentences.
Anonymizing target entities and chemical compounds.
Removing code from the text.
Cleaning text using regex patterns.
Text lemmatization (converting a word to its normalized form or its base root form).
Removing stop words.
Making sure the content is in the desired language.
Removing references to discrimination.
Eliminating text that refers to links that do not work or personal web pages.
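
A few of the steps above can be sketched with the Python standard library alone: splitting an abstract into sentences, keeping only sentences that end with punctuation, stripping URLs with a regex, and removing stop words. This is a minimal sketch under stated assumptions. The stop-word list, the naive sentence splitter, and all function names are illustrative choices of ours. Lemmatization and language detection are left out because they typically rely on an external NLP library, and URLs are simply stripped rather than checked for whether they still work.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are",
              "was", "were", "and", "to", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Naive splitter: break after ., ! or ? when followed by whitespace.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
SENTENCE_END = (".", "!", "?", '"')


def split_sentences(abstract):
    """Split an abstract into sentences using a simple punctuation rule."""
    return [s.strip() for s in SENTENCE_SPLIT_RE.split(abstract) if s.strip()]


def clean_sentence(sentence):
    """Strip URLs, lowercase the text, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", sentence)
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(abstract):
    """Keep sentences that end with punctuation and return their clean tokens."""
    sentences = split_sentences(abstract)
    kept = [s for s in sentences if s.endswith(SENTENCE_END)]
    return [clean_sentence(s) for s in kept]


# Hypothetical abstract; the trailing fragment has no end punctuation.
abstract = ("BRCA1 is expressed in breast tissue. "
            "See http://example.com/data for details. "
            "Table of contents")
result = preprocess(abstract)
print(result)
```

The trailing fragment "Table of contents" is dropped because it does not end with a punctuation mark, which is the kind of page residue the first heuristic is meant to filter out.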

We have seen how model-centric approaches have led to the creation of large NLP models. There is now a need to adopt data-centric approaches to create effective NLP models for biomolecular data. We explored a few methodologies and examples for improving data labels, data reviews, and quality checks that can help enhance the quality of the data. Applying these approaches to build high-quality biomolecular datasets can lead to customized NLP models with superior accuracy and efficiency.
