Adventures in WhatsApp DB — extracting messages from backups (with code examples)
Getting your messages without giving a third party your credentials and data
This post will explain how to recover messages from WhatsApp using Python. In particular I’ll be explaining how I went about finding and extracting conversations from the WhatsApp’s sqlite database and parsing the fields and data there. This is in no way a comprehensive reverse engineering / forensic analysis work — the only reason I’m even publishing this is that I couldn’t find freely available information and/or open source tools that you can trust for this purpose so I thought I’d share the work I did to save others some time. You can probably use a very similar process to analyze other (messaging) apps.
- Step 1 — Backup and extract files from your iPhone
- Step 2 — Connecting to the DB and reading the messages table
- Step 3 — Extracting specific conversations
- Closing words and a note on media
I will be using Jupyter Notebooks and Pandas throughout this article because it makes life easy and is visually “pleasing” but pretty much all code can be used independently of Jupyter. The notebook is available on github here.
Background (feel free to skip this paragraph)
Due to some legal trouble with a housing developer (are there any honest developers out there?) I ended up needing to recover WhatsApp messages from an old iPhone device that sat in a closet for two years.
I’m not an iPhone user normally so I started looking for ways to do this and I discovered iCloud doesn’t give you access to the actual backup contents except what Apple decided you need (at least not if you surf into your online account). If you want there are tools that will take your login credentials and allow you to browse the full backup contents but I’m not into giving unknown tools my (wife’s) login credentials. So, I went ahead and figured out how to do this without compromising the security of that account which led to the process described below.
As an interesting side note for this article I had to recover said iPhone battery, start it and discover that WhatsApp doesn’t work on it due to a weird message
Your phone date is inaccurate! Adjust your clock and try again.I have no idea what do date issues have to do with not starting WhatsApp but fixing the date and time didn’t help and reading around about this most solutions seem to suggest uninstall-reinstall which is ok in general but not something I wanted to risk when I needed these messages for evidence in court.
Step 1 — Backup and extract files from your iPhone
This step is more Mac OS specific although it should work similarly in other OSs. The idea is simple:
- Create a virtual machine and install iTunes on it. Make sure the VM is not connected to the Internet. (If you have no idea how to create a VM you can try this article).
- Connect your iPhone to the VM. Follow common steps to backup iPhone locally on the VM. To those interested on Mac you can find the backups at:
The backup format is described here
- Install an iPhone backup extraction tool (such as iExplorer by macroplant) in the VM. It doesn’t matter which tool as it won’t have internet access and therefore your data is safe.
- Finally use the backup extraction tool to extract
ChatStorage.sqlitethat is the main WhatsApp conversation database. To get to it you need to browse to
“App Group” → “group.net.whatsapp.WhatsApp.shared”
- As a bonus, you can also extract the media by extracting the folder:
“App” → “net.whatsapp.WhatsApp” → “Library” → “Media”
Step 2 — Connecting to the DB and reading the messages table
WhatsApp seems to store all conversations in a Sqlite DB which makes our life fairly easy. As a first step let’s try to use the sqlite3 built-in python library + Pandas to see if we can connect to the DB and get the list of tables:
As you can see there are 18 tables. Digging through the content I’ve found that the interesting one (i.e. the one with your messages) seems to be
ZWAMESSAGE. Let’s open it:
There is a total of 34 columns that we can see:
Let’s see what we’ve got. The below is my analysis based on the data I’ve found in the tables in my own DB and my inferences about it:
1. Z_PK — seems like a serial number
2. Z_ENT to ZFILTEREDRECIPIENTCOUNT — seem less important
3. ZFLAGS — seems to indicate message state
4. ZGROUPEVENTTYPE — seems to be related to group chats
5. ZISFROMME — message is from me… it is 1 for messages sent by this user and 0 for messages received
6. ZMESSAGEERRORSTATUS to ZSPOTLIGHTSTATUS — seems like general statuses
7. ZSTARRED — did we star the message
8. ZCHATSESSION — unique identifier denoting a chat session
9. ZGROUPMEMBER — haven’t gotten to look at this one yet
10. ZLASTSESSION — last chat session? didn’t dig into it
11. ZMEDIAITEM — seems related to media item indexing, might be an identifier to one of the other tables
12. ZMESSAGEINFO and ZPARENTMESSAGE — seem simple enough to figure out from the names
13. ZMESSAGEDATE — message creation date probably (see date format discussion below)
14. ZSENTDATE — message sent date probably (see date format discussion below)
15. ZFROMJID — from who did we get it (if it is an incoming message)
16. ZMEDIASECTIONID — seems related to media storage for media messages, doesn’t show in messages without media
17. ZPHASH - hmmm... not sure
18. ZPUSHNAME — seems like the contact name on your phone
19. ZSTANZAID — some conversation / media id indicator. Format seems different in media messages and text messages
20. ZTEXT — message text
21. ZTOJID — to whom did we send it (if it is an outgoing message)
Step 3 — Extracting specific conversations
Our base goal is to extract specific conversations or communications. So we can search for a contact by name or phone number or search for a specific conversation by messages if we remember part of the text and eventually extract the complete conversation based on the conversation ID.
But, before we do any of that we should figure out how to get the dates and times to a human readable format. So intuitively I assumed that
ZMESSAGEDATE is a unix-time timestamp. So I converted it to a date and this is what I got:
Reading around it seems that Apple, in their infinite unique wisdom have decided to use dates starting from 1.1.2001 on iPhone so let’s see what happens if we add an offset from time 0 of unix-time to 1.1.2001:
That seems to have solved the issue. Along the way I also added a date index to the data-frame to make it more comfortable to use. Note that if you are importing from Android or another OS you probably don’t need to use this shift in dates.
Now that we have dates lets extract conversations. First let’s find a conversation we care about, such as by looking for a specific phone number:
you should notice that the lambda I created is used to search both the from and the to columns. An interesting side note is that there is a disappointing Python “Wat” moment here because
False | None != None | False which just does not make sense. Anyways, moving on if we look at the
ZCHATSESSION column we’ll notice that in my case the conversation id will be “104.0” for this number (if there are also group chats you may see more than one chat session id). So out next step will be to extract all the messages the chat session. To do this let’s add proper accessors for that and for the other search methods I mentioned above:
Now, to sum everything up, let’s get all messages from a chat session and display them (enabling multi-line format to see all text):
Part 4 — Closing words and a note on extracting media
Media files are stored in the folder mentioned in step 1 — as best as I can tell the subfolders are based on phone number of the contact that shared a file (and a suffix denoting group chats? — this is a guess). The subfolders under each phone number seem to follow a similar logic to the iTunes backup format and I suspect the messages table has some reference to links in the media table but I didn’t spend time on actually figuring the details out. One last note some folders will also contain thumb files which are actually just jpg files (rename them and open normally).
I have removed personal information from it but the code and samples are exactly the same. The notebook might also contain some extra details that I didn’t bother including here.
Finally, if you have the patience and do figure out more details or have any improvements to suggest to the code send a pull request on github and I’ll be happy to include it with relevant credits.
Please leave a comment if you enjoyed this. You can find me @gal_diskin on twitter or @Disk1n on github.