A defining moment: automating dictionaries for social science research

Chaire de leadership en enseignement des sciences sociales numériques (CLESSN), a digital social sciences lab at Laval University, has partnered with NovaceneAI to automate the analysis of its data with greater efficiency and accuracy.

Social scientists are living in a time when data is everywhere. But it wasn’t always like this.

In Machine Learning for Social Science: An Agnostic Approach, the authors describe how difficult it was to find data, how expensive it was to conduct surveys, and how storing records was “close to impossible.”

But in today’s world, data can be found by simply logging onto social media. Computers can easily store millions of records. While the digital era has provided more opportunities for social scientists, they now face a new problem: how can they possibly sort and analyze this data to answer their research questions?

A new partnership between NovaceneAI and the Chaire de leadership en enseignement des sciences sociales numériques (CLESSN), a digital social sciences lab at Laval University, shows that machine learning can help social scientists unlock the answers to their research questions, and save them time by automating the mining of this data.

Machine learning (ML): Fuelling the discovery of information

CLESSN is made up of academics, including undergraduate and graduate students, whose research focuses on the “three pillars of democracy”: the media, public opinion, and the political arena.

The research team publishes their findings in academic papers, and their insights help develop social science teaching programs and courses at Laval. 

Patrick Poncet, a data scientist at CLESSN, describes the lab as an institution that gives social science students real-world experience in digital social sciences – teaching them how to conduct social science research in the digital era.

“At CLESSN, we deal with a lot of text data and harvest data from public institutions, parliament, from the web, and social networks,” he explains.

When Novacene and CLESSN connected to discuss a potential partnership, Poncet says he immediately saw an opportunity to collaborate.

“Novacene had very interesting, enriching algorithms that could fuel the thinking process and the discovery of information in the data that our researchers had at their disposal,” he says.

Laval University Research Chair Holder Yannick Dufresne, who created CLESSN and heads the lab, says he was pleased to partner with Novacene and saw the collaboration as an excellent opportunity for the lab’s social scientists.  

“The digital era is here to stay, and it’s imperative that the next generation of social scientists learn to work with companies such as Novacene which can provide valuable machine learning technology,” he says. “These digital tools, validated by academic research, will be vital to informing their work, and provide more efficiency as they investigate their critical research questions.”

Marcelo Bursztein, founder and CEO of NovaceneAI, saw the partnership as a valuable opportunity for the AI company to apply its Platform to a real-world challenge.

“As a technology company in the AI, ML and natural language processing space, we need to constantly challenge our technical merit,” he says. “As we joined forces with the bright minds behind CLESSN, it was clear that there was no better place to validate our work.”

Creating dictionaries: a ‘labour-intensive’ process

Poncet says that through this initial project, Novacene helped automate the creation of dictionaries, which help the lab’s researchers track specific topics such as health care or the environment.

He says these dictionaries are important for narrowing down research questions to specific topics within the social sciences.

Before the collaboration with Novacene, each researcher would be responsible for a particular topic and would perform text analysis on sources such as parliamentary debates or tweets.

But it was not a straightforward task, Poncet explains.

For instance, a researcher responsible for building the housing dictionary would have to manually analyze texts from these sources and highlight the relevant information. They would then add this information to their housing dictionary, which would inform current and future research related to that topic.

Poncet describes this process as “labour-intensive” and says it introduces errors and biases into the research data.
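To make the idea of a topic dictionary concrete, here is a minimal sketch in Python of how a dictionary might be applied to a piece of text once it exists. It is an illustration only, not CLESSN's or Novacene's actual method: the dictionary terms, the `TOPIC_DICTIONARIES` name and the `tag_topics` function are all assumptions made for the example.

```python
import re

# Hypothetical topic dictionaries. The terms are illustrative only and are
# not taken from CLESSN's real dictionaries.
TOPIC_DICTIONARIES = {
    "housing": {"housing", "rent", "tenant", "mortgage"},
    "health care": {"hospital", "nurse", "doctor", "clinic"},
    "environment": {"climate", "emissions", "pollution"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tag_topics(text, dictionaries=TOPIC_DICTIONARIES):
    """Return, in alphabetical order, the topics whose dictionary shares
    at least one exact term with the text."""
    tokens = set(tokenize(text))
    return sorted(topic for topic, terms in dictionaries.items() if tokens & terms)


if __name__ == "__main__":
    print(tag_topics("Rent is rising faster than wages"))
    print(tag_topics("A new hospital and lower emissions"))
```

Because the sketch matches exact words only, a text mentioning "nurses" would not match the term "nurse". Gaps like this are one reason building a dictionary by hand takes so much effort.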

Enabling future research through automation

As a starting point for creating these dictionaries and automating them, CLESSN formulated the following question: “What are the 2022 Quebec provincial election candidates talking about on Twitter?”

Novacene’s Platform helped CLESSN researchers harvest Twitter data from candidates across all of the province’s political parties, which totaled roughly 10,000 tweets. Poncet was specifically trained on the Platform, and learned how to send tweets to the Platform, cluster those tweets, and retrieve themes.
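The article does not describe how the Platform clusters tweets or derives themes, so the following Python sketch swaps in a deliberately simple technique to show the general shape of that workflow: greedy single-pass clustering on word overlap (Jaccard similarity), with each cluster's most frequent words standing in for a theme. The sample tweets are invented placeholders, and the function names, stopword list and threshold are assumptions.

```python
import re
from collections import Counter

# Invented placeholder texts, not real candidate tweets.
TWEETS = [
    "We will build affordable housing for families",
    "Affordable housing is our priority",
    "Protect the climate and cut emissions",
    "Cut emissions to protect our climate",
]

# A tiny, hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "we", "will", "our", "is", "on"}


def tokens(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.2):
    """Greedy single-pass clustering. Each text joins the first cluster whose
    first member is similar enough; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`."""
    clusters = []
    for i, text in enumerate(texts):
        for members in clusters:
            if jaccard(tokens(text), tokens(texts[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def theme(texts, members, top_n=3):
    """Return the top_n most frequent words in a cluster (ties broken alphabetically)."""
    counts = Counter(w for i in members for w in tokens(texts[i]))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


if __name__ == "__main__":
    for members in cluster(TWEETS):
        print(members, theme(TWEETS, members))
```

A production system would likely rely on richer text representations than raw word overlap. The point here is only the sequence of steps: send texts in, group them, and read a theme off each group.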

“It was reassuring to be accompanied by the Novacene team to understand how to maximise the benefits we extracted from the Platform,” he says. “Having the help of experts who understand what text analysis is about, and how it works, is definitely a time saver. And it’s nice to know that you’re enriching your data in a way that makes sense.”

The Platform helped the lab’s social scientists narrow down specific topics within their questions. In this project, for example, that meant what candidates were saying about housing, education, or the environment.

Poncet says there were about 10 specific topics that CLESSN researchers wanted to address, and they had to narrow the candidates’ tweets down to those that addressed these topics.

“It’s important to be able to identify several factors, for example who is talking about which topics, which party they are from, their gender, their position on that topic and whether they are speaking about it positively or negatively,” he explains.
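Once tweets carry attributes like these, questions such as "which party talks about which topic" reduce to counting. The Python sketch below assumes each tweet has already been tagged. The records, field names and `count_by` helper are invented for illustration and do not reflect CLESSN's data or the Platform's output format.

```python
from collections import Counter

# Invented placeholder records. Each one stands for a single tweet that has
# already been tagged with the attributes Poncet mentions.
TAGGED_TWEETS = [
    {"party": "Party A", "gender": "F", "topic": "housing", "tone": "positive"},
    {"party": "Party A", "gender": "M", "topic": "housing", "tone": "negative"},
    {"party": "Party B", "gender": "F", "topic": "environment", "tone": "positive"},
    {"party": "Party B", "gender": "F", "topic": "housing", "tone": "negative"},
]


def count_by(records, *fields):
    """Count records for each combination of values of the given fields."""
    return Counter(tuple(record[field] for field in fields) for record in records)


if __name__ == "__main__":
    # How often does each party tweet about each topic?
    for key, n in sorted(count_by(TAGGED_TWEETS, "party", "topic").items()):
        print(key, n)
    # How is each topic discussed, positively or negatively?
    for key, n in sorted(count_by(TAGGED_TWEETS, "topic", "tone").items()):
        print(key, n)
```

The same helper works for any combination of fields, for example `count_by(TAGGED_TWEETS, "gender", "topic")`.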

In addition to creating dictionaries based on this information, the Platform also automated the entire process, which can serve as a foundation for future automations and save researchers time.

“Automatically classifying and grouping together a series of texts about a specific topic means there will be a lot of time saved,” Poncet says. “Automation will also make it easier for researchers to come across and explore unexpected topics that could be of interest to their research.”

Going forward, Poncet says this automation will be key to other areas of research where the lab’s social scientists require dictionaries to track their topics, which would benefit social science researchers for years to come.