This is a simple tool to remove duplicate text messages from XML backups of the "SMS Backup & Restore" format.
It also supports removal of more complicated duplicates than other tools while taking extreme care not to identify any false positives.
For example, we handle instances where
- One message contains a data attachment (e.g., images sent via text message), but the other does not
- The phone numbers inconsistently include or exclude country codes
- The internal timestamps have inconsistent millisecond vs. second precision
- The internal ordering of phone numbers is inconsistent between messages
- The internal SMIL data format varies, but the message content and data are otherwise identical
- The internal storage fields are inconsistently omitted or
null
These conflicts tend to occur when using multiple backup agents over time or simultaneously. E.g., accidentally recovering data from Google's backups and Samsung's backups or simply changing manufacturers or carriers.
If you intend to use this to remove duplicated messages on your device (rather than in your backup location), please read "An important warning about deduplicating messages on a device in practice".
The usage of this tool is extremely simple and can handle files of several gigabytes in a few seconds.
For example,
python3 dedupe_texts.py example-input.xml example-output.xml deduplication-results.log
This will produce output of the following form.
Reading 'example-input.xml'... Done in 8.1 s.
Preparing log file 'deduplication-results.log'.
Searching for duplicates... Done in 5.9 s.
Deduplication Summary:
Message Type | Original Count | Removed | Deduplicated Count
mms | 24893 | 10325 | 14568
sms | 19828 | 0 | 19828
Writing 'example-output.xml'... Done in 3.8 s
or
Reading 'example-input.xml'... Done in 8.1 s.
Preparing log file 'deduplication-results.log'.
Searching for duplicates... Done in 5.8 s.
Deduplication Summary:
Message Type | Original Count | Removed | Deduplicated Count
mms | 14341 | 0 | 14341
sms | 19676 | 0 | 19676
No duplicate messages found. Skipping writing of output file.
The log file contains sections of the following form for each removed message.
Removing mms:
date: 1680729606000
address: <REDACTED #1> | <REDACTED #2>
text: look at this amazing picture!
m_type: 128
type: 137 | 151
In favor of keeping mms:
date: 1680729606000
address: <REDACTED #1> | <REDACTED #2>
text: look at this amazing picture!
m_type: 128
type: 137 | 151
data: <LENGTH 539706 OMISSION>
The full usage information with a few optional features is below.
usage: dedupe_texts.py [-h] [--default-country-code [DEFAULT_COUNTRY_CODE]]
[--ignore-date-milliseconds]
[--ignore-whitespace-differences] [--aggressive]
input_file [output_file] [log_file]
Deduplicate text messages from XML backup.
positional arguments:
input_file The input XML to deduplicate.
output_file The output file to save deduplicated entries. Defaults
to the input filepath with "_deduplicated" appended to
the filename.
log_file The log file to record details of each removed
message. Defaults to the input filepath with
"_deduplication.log" appended to the filename.
options:
-h, --help show this help message and exit
--default-country-code [DEFAULT_COUNTRY_CODE]
Default country code to assume if a phone number has
no country code. Treat phone numbers as identical if
they include this country code or none at all.
Defaults to +1 (United States / Canada).
--ignore-date-milliseconds
Ignore millisecond precision in dates if timestamps
are slightly inconsistent. Treat identical messages as
duplicates if received in the same second.
--ignore-whitespace-differences
Ignore whitespace differences in text messages. Treat
identical messages as duplicates if they differ only
in the type of whitespace or leading/trailing spaces.
--aggressive Only consider timestamp and body/text/data in
identifying duplicates. Treat any matching messages as
duplicates, regardless of address, messaging protocol
(SMS, MMS, RCS, etc.), or other fields.
Note that
- SMS Backup & Restore avoids restoring duplicates by default, and
- Most messaging clients/apps actually hide deleted conversations before they are deleted internally (they continue the deletion work in the background)
Thus, if you flag conversations for deletion and then start restoring from backup (without verifying the message deletion has completed internally), you may lose messages!
In these cases, the backup restoration essentially detects duplicates of messages that were mid-deletion and only completes a partial restore. E.g.,
- With duplicates where some messages have data attachments and others do not, you may lose images, shared contacts, etc. from text messages
- Some messaging clients may continue the deletion after the backup is restored, in which case you will simply lose entire messages or conversations
With this in mind, to deduplicate messages on a device itself, you should
- Perform the backup and deduplicate it (retain both versions, just in case)
- Confirm you have not received any new messages in the meantime (consider using airplane mode)
- Mass-delete your text messages
- Wait a few minutes (the time required depends on your phone's processing speed, the number of messages, etc.)
- Clear your messaging client's data (
App Info > Storage > Clear Data
) to force a refresh of the text message view - Confirm that no messages appear in the view. Otherwise, return to step #2
- Restore from the deduplicated backup and verify that all messages were restored before removing the original (non-deduplicated) backup file
For step #7, consider keeping the restoration's duplicate check enabled. If it detects any duplicates when restoring to a phone that appears to have zero existing text messages, that should be a major warning that something has gone wrong.
Moreover, if you create a new backup afterward, it should not be much smaller than the one you restored from!
This tool is somewhat inspired by legacy work by radj, but is significantly simpler, much more up to date, requires far fewer dependencies/setup, and supports MMS messages.