Mar 25, 2024

Happy birthday, Lady Florence Davies

Happy birthday, mum

Today (Sunday) is my mother’s birthday; she would have been 89.

And she would have been, as she would have put it, ‘tickled pink’ to see the astounding long-term impact of the data-driven approach to language of which she was one of the pioneers.

For, although she was a psycho-linguist rather than a computer scientist, she was a key member of the team that laid the foundation for large language models (LLMs), such as ChatGPT, by gathering large amounts of data about the reality of how words and language are used.

I would have loved to talk with her about how one of my fields, artificial intelligence and machine learning, is now intersecting with her world of words and language, and her great love of reading. Unfortunately, she passed away ten years ago in June, just as the potential impact of AI was becoming apparent. Although we did get to spend some time talking about machine learning in the early days, as we sipped champagne in Farndon, I wish we had had more time together.

Building COBUILD

In the early 1980s, working with Professor John Sinclair at the University of Birmingham, she was one of the team that built COBUILD—the Collins Birmingham University International Language Database.

“…the COBUILD project in lexical computing, funded by Collins, … revolutionized lexicography in the 1980s and resulted in the creation of the largest Corpus of English language texts in the world.”

“Having Corpus data allowed Professor Sinclair and his team to find out how people really use the English language and to develop new ways of structuring dictionary entries.

“For example, frequency information allowed the team to rank [the various possible] senses [of a word] by importance and usefulness to the learner (thus the most common meaning should be put first). The Corpus also highlights collocates (the words which go together), information which had only been sketchily covered in previous dictionaries. Under his guidance, Professor Sinclair’s team also developed a full-sentence defining style, which not only gave the user the sense of a word, but showed that word in grammatical context.”

The Corpus was the very first instance of focusing on the linkages (the ‘edges’ in the knowledge graph) between words (the ‘nodes’ in the knowledge graph).
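To make the nodes-and-edges idea concrete, here is a minimal Python sketch. It is purely illustrative and is not how the COBUILD team worked: the three-sentence toy corpus, the function name `collocation_edges`, and the two-word window are all my own assumptions. It counts how often each word appears (the frequency information mentioned above) and how often pairs of words occur near one another (the collocates), treating each pair as a weighted edge between two word nodes.

```python
from collections import Counter


def collocation_edges(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other.

    Each pair is an 'edge' between two word 'nodes'; the count is its weight.
    Hypothetical helper for illustration only.
    """
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Look ahead at the next `window` tokens in the same sentence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    edges[tuple(sorted((left, right)))] += 1
    return edges


# A made-up toy corpus; a real corpus holds millions of words.
toy_corpus = [
    "she was tickled pink",
    "he was tickled pink by the news",
    "the pink dress",
]

# Word frequencies: how often each node appears.
word_counts = Counter(w for s in toy_corpus for w in s.lower().split())

# Collocation weights: how often each pair of nodes appears close together.
edges = collocation_edges(toy_corpus)

if __name__ == "__main__":
    print(word_counts.most_common(3))
    print(edges.most_common(3))
```

Even on this tiny sample, ‘tickled’ and ‘pink’ show up together more than once, which is the kind of pattern a real corpus surfaces at scale.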

It has become the foundation for the Bank of English (BoE; not to be confused with the Bank of England, a very different institution*), a representative subset of 4.5 billion words of primarily British and other Commonwealth English, “as she is written and spoken”.

And as one of her specialties was ‘English as a Second Language’ (ESL), she would have found ChatGPT’s hallucinations hilarious, in some ways all too familiar, and deeply intriguing. She took great delight in sharing examples of the ‘unusual usage’ of non-native English speakers, which, as she would have insisted, were not ‘wrong’ per se, but different.

The Corpus was the first step on the road to a data-driven approach to the reality of language, reflected in what we now see from ChatGPT.

At that time, I was embarking on my own career, pursuing jobs in robotics and in what we then called cybernetics and now know as ‘artificial intelligence’. Our focus was on explicit rules rather than on mining the data using ‘machine learning’, and it wouldn’t have occurred to us that about four decades later we’d just be using computers to crunch all of the words we could find in the world to create ‘bots that seem able to talk, even if they are oft-times ‘stochastic parrots’.

Much of my work nowadays is about harnessing the power of these rapidly evolving technologies, and teaching people how to do so, such as through the program I run at London Business School, on The Business of AI. We’ve just updated it to reflect recent advances in generative AI.

 


*Speaking of acronymic confusion, I recall waiting for her to emerge from Customs and Immigration here in Boston, in the United States, where she had been delayed long past almost all of the other passengers disembarking from British Airways. It turned out that when asked what was the purpose of her visit to the United States of America, she blithely announced “I’m here for a very important meeting of the IRA.” Unsurprisingly this created some alarm, and led to her being detained for secondary screening, notwithstanding that she was a Lady, her husband a knight of the realm, and she was one of the least likely people on the planet to be a supporter, let alone a leader, of the Irish Republican Army. It took some time to clear up the confusion, and to establish that she was in fact attending a meeting of the International Reading Association. I note that in 2013 this IRA reformed itself as the International Literacy Association (ILA).