Automated archiving tool helps sort millions of government files

The Cabinet Office has developed an algorithm that can analyse, sort and delete the government’s growing pile of digital documents faster and much more accurately than humans.

Developed by the department’s Digital Knowledge and Information Management team (DKIM) in 2022, the Automated Digital Document Review has now been used to review 5.1 million documents, according to a newly published report on the automated tool.

A further 300,000 files are expected to be analysed over the course of the 2023-24 year. After this time, the DKIM team said, the algorithm will be used on an annual basis to review approximately 30,000-80,000 files per year.

The tool is being used to deal with the growing volumes of unorganised, legacy digital information in government that typically requires reviewers to manually identify what should be kept as a historic record and what should be deleted. This method, while "fairly accurate", is also "extremely slow," the DKIM team said. While a human can review up to 200,000 documents per year, they found that the automated tool can review several million files with no increase in human resource.

The automation not only performs checks at a far greater scale than human reviewers could, but also does so more accurately: it is "consistently more accurate than humans at making decisions about the record's value," the report said.

How does it work? 

The algorithm, which is based on technology from specialist supplier Automated Intelligence, works by identifying patterns of language – key words or phrases commonly used by civil servants when organising documents – then creating a relevance score based on the occurrence of these terms.
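
To make the idea of keyword-based scoring concrete, here is a minimal sketch in Python. It is not the Cabinet Office's or Automated Intelligence's implementation, which the report does not detail: the terms, weights, threshold and function names below are all hypothetical, chosen only to illustrate how counting occurrences of key words and phrases could produce a score.

```python
import re

# Hypothetical terms and weights, for illustration only; the real tool's
# vocabulary and scoring are not described in the source.
KEYWORD_WEIGHTS = {
    "minutes": 2.0,
    "ministerial submission": 3.0,
    "policy": 1.5,
    "draft": -1.0,
}


def relevance_score(text, weights=KEYWORD_WEIGHTS):
    """Sum weight * number of occurrences for each term found in the text."""
    lowered = text.lower()
    score = 0.0
    for term, weight in weights.items():
        # Match whole words or phrases so "policy" does not match "policyholder".
        pattern = r"\b" + re.escape(term) + r"\b"
        score += weight * len(re.findall(pattern, lowered))
    return score


def recommend(text, keep_threshold=2.0):
    """Flag a document to keep or to pass to a human for deletion review.

    The threshold is an assumed value, not one taken from the report.
    """
    if relevance_score(text) >= keep_threshold:
        return "keep"
    return "review for deletion"
```

In this sketch, a document titled "Minutes of the policy meeting" would match two weighted terms and be kept, while "Draft lunch menu" would fall below the assumed threshold and be flagged for a human to check.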

There is still an element of human involvement in the process. Once the tool completes its analysis, a report detailing the files recommended for deletion is created and sent to a digital archivist, who can review the final results to ensure they are in line with departmental governance (less than 1% of files are incorrectly identified for deletion).
