Researchers at Microsoft have developed a framework that makes it easier for large language models (LLMs) to analyze the contents of spreadsheets and perform data management and analysis tasks. Because why not?
In a research paper published on the open access repository arXiv, the Redmond team explains that spreadsheet formatting poses a significant challenge for LLMs.
According to the researchers, large spreadsheets often contain “numerous homogeneous rows or columns,” which “minimally contribute to understanding the layout and structure” and also make analysis difficult for humans.
To address this, their LLM-driven analysis tool, SpreadsheetLLM, serializes the data, incorporating cell addresses, values, and formats into a data stream. That approach runs into another problem, though: the token limits of many LLMs – tokens being the chunks of text, whether words, word fragments, or symbols, that a model processes and counts against its context window.
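For a sense of why that matters, here's a rough Python sketch of what a naive cell-by-cell encoding might look like – the field names and layout are our own illustration, not the paper's exact scheme:

```python
# Rough illustration of a naive cell-by-cell spreadsheet encoding.
# Field names and layout are our own; the paper's exact format may differ.

cells = {
    "A1": {"value": "Region",   "format": "text, bold"},
    "B1": {"value": "Q1 Sales", "format": "text, bold"},
    "A2": {"value": "North",    "format": "text"},
    "B2": {"value": 10500,      "format": "currency"},
    "A3": {"value": "South",    "format": "text"},
    "B3": {"value": 9800,       "format": "currency"},
}

# Every cell is spelled out in full - address, value, and format - so a
# sheet with hundreds of near-identical rows balloons into tens of
# thousands of tokens once it is handed to a model.
serialized = "|".join(
    f"{addr},{cell['value']},{cell['format']}" for addr, cell in cells.items()
)
print(serialized)
```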
To solve this conundrum, the research team had to develop yet another framework, called SheetCompressor, to compress the data. It consists of three modules: one to analyze the spreadsheet structure and strip out everything that sits outside a table; another to translate the data into a more efficient representation; and a third to aggregate cells that share the same format.
The first does its job by identifying “structural anchors” such as table boundaries and removing the other rows and columns to produce a “skeleton” version of the spreadsheet. The second converts the data into an inverted index in JSON format, so that each distinct cell value is recorded once alongside the addresses where it appears, and empty cells drop out entirely. Finally, adjacent cells with the same data format are clustered together rather than listed one by one.
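As a rough sketch of that inverted-index idea – our own illustration in Python, not the team's code – the per-cell listing collapses into a dictionary keyed by value, so repeated entries cost tokens only once and empty cells vanish:

```python
# Rough sketch of inverted-index translation: map each distinct value to
# the cells where it appears, instead of listing every cell in turn.
# This is our own illustration of the concept, not the paper's implementation.

cells = {
    "A1": "Year", "B1": "Status",
    "A2": "2023", "B2": "Paid",
    "A3": "2023", "B3": "Paid",
    "A4": "2024", "B4": "Paid",
    # empty cells simply never appear in the input
}

inverted = {}
for addr, value in cells.items():
    inverted.setdefault(value, []).append(addr)

print(inverted)
# {'Year': ['A1'], 'Status': ['B1'], '2023': ['A2', 'A3'],
#  '2024': ['A4'], 'Paid': ['B2', 'B3', 'B4']}
# The paper's version reportedly goes further, merging runs of addresses
# into compact ranges such as 'B2:B4'.
```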
The result is that a spreadsheet with, say, 576 rows and 23 columns, which would otherwise yield 61,240 tokens, can be boiled down to a representation of just 708 tokens, according to the example in the paper. In fact, the team claims its experiments show that SheetCompressor typically reduces token usage for spreadsheet encoding by 96 percent.
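For what it's worth, the figures in that example work out to an even steeper cut than the headline average – a quick check:

```python
# Token counts quoted in the paper's example, before and after compression.
before, after = 61_240, 708
print(f"Reduction in the example: {1 - after / before:.1%}")  # about 98.8%
```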
According to the Microsoft team, SpreadsheetLLM appears to significantly reduce the token usage and computational overhead of processing spreadsheet data. This could, in principle, enable practical applications even for large datasets.
Fine-tuning various LLMs to improve their spreadsheet understanding could also transform spreadsheet data management and analysis tasks, paving the way for more intelligent and efficient user interactions, the paper said.
Given the widespread use of spreadsheets in business, this move by a Microsoft research team could have a significant impact, if it lives up to its promise. There’s no word on whether this will ever see release as a product or developer resource – something baked into Microsoft Excel or Copilot, say – and for now it appears to be a lab-level effort.
As VentureBeat notes, SpreadsheetLLM, if ever made public, could allow non-technical users to access and edit spreadsheet data using natural language prompts.
A UCLA associate professor responded on Twitter/X by stating that there is “billions of dollars of value in this, given that much of the finance and accounting world still runs on spreadsheets and manual processes.”
We have to admit we’re not convinced by the idea of mixing unreliable, hallucination-prone LLMs with grids of numbers. Neural networks like to predict outputs and loosely interpret inputs, which is not exactly what you want from a spreadsheet, in our humble opinion – unless you’re hoping for some kind of computer-aided creative accounting.
And whether this technology can prevent the blunders seen in businesses’ use (or abuse) of spreadsheets is perhaps questionable. Last year, it emerged that junior doctors had been wrongly marked “unappointable” because of mistakes made when transferring data from one spreadsheet to another. And then there’s the infamous Excel blunder that led to the under-reporting of thousands of coronavirus cases in England during the pandemic.
The Microsoft research team also points out that SpreadsheetLLM currently has a number of limitations that need to be addressed. Format details such as cell background color are ignored because including them would require too many tokens, even though background color is sometimes used to encode information.
SheetCompressor also doesn’t currently support semantic compression for natural language cells, so it can’t categorize terms like “China” and “America” under a uniform label like “Country.” So maybe all those data analysts and other Excel experts can breathe a sigh of relief. ®