Datamule Cloud V2
New model, new features
I need to rework datamule cloud to support a larger user base. The following text goes into why I need a new pricing model, what it will look like, the offerings, and the basic architecture.
Trust
Datamule was built with trust in mind. Specifically that I do not trust you, and that you shouldn’t trust me.
Me
When I started datamule, I had just left my PhD program. I had no reputation, and there was a possibility that I would stop maintaining the project. It was rational not to trust me, especially for commercial processes.
To build trust, among other reasons, I decided to open source my code and document how my infrastructure worked. This was to enable people to see how my stuff worked, and to spin up their own infrastructure in case I exited. Many commercial projects have since used datamule.
I hope in the future to have your trust based on my reputation, but until then I need to earn it.
You
Users of datamule range from extremely technical to not at all, and from rich to grad students. This makes trust an important factor in providing data and services. I need to ensure that I am protected, and that my users are too.
Let me give you an example. At 23 I was a researcher at Berkeley using Google Cloud. I accidentally broke the guard rails and racked up a $3,000 bill. I was nervous, but mollified: a friend of mine had left code running over Spring Break, racked up a $10,000 AWS bill, and had it forgiven. My bill was also (mostly) forgiven.
This is why, when I launched my SEC archive in December 2024, I put it behind a Stripe paywall. Users had to load credits into their account, which let them limit max spend. Unlike similar APIs, mine required no subscription, because I wanted to make the SEC Archive accessible. Put $5 in, test it out; if you like it, spend more; if you don't, there's no cleanup.
Note: I chose the PAYG model as I had previously used OpenAI’s PAYG API which I found quite relaxing for experimentation compared to AWS or GCP.
This worked well. Some users did run into issues, mostly PhD students who accidentally used up their balance running the wrong query. I refilled their balances, which was a relatively cost-free action since the marginal cost of provision was negligible.
This will not be the case with Datamule Cloud V2. Marginal costs will not be negligible for many of the new offerings. So I need to design the pricing structure so that, if something goes wrong, you can trust that you're going to be fine and that I'm going to be fine.
New Pricing Model
To deal with the trust issue, I’m thinking of offering multiple subscription tiers. Here are my thoughts, badly worded.
- No subscription. Load credits as you need. Access to basic APIs.
- Regular — $20/month. $20 credits added each month + PAYG. Access to more APIs.
- Plus — $100/month. $100 credits added each month + PAYG. Access to even more APIs.
- Enterprise — custom contract. Bulk discounts on credits. Access to experimental APIs.
The idea behind Regular is that a $20 subscription fee forces a conscious choice: "Is it worth it to me to spend $20/month on this service?" It also grounds expectations: here is how much I expect to spend, and for what services.
Customer filtering is the idea behind Plus. One hundred dollars a month is a lot of money. The price acts as a signal to certain customer segments, such as PhD students, that these APIs might not be for them. Danger danger danger — do not click this button! It helps limit users to people who can afford a mistake.
I haven’t quite figured out Enterprise.
Note: I probably will give free Plus subscriptions to people who ask nicely and have a relevant technical background. This is how I got access to alpha Google table parsing tools in 2019, and I’d like to pay it forward.
Subscription UI
- Unsubscribing will take two clicks.
- Status will be available here.
Margins
My general goal is to make stuff really cheap, then charge 10x or more of the marginal cost of provision. For example, the marginal cost of providing every 10-K submission (600gb) from my SEC archive is $0. I charge $2. A comparable provider charges $200; their likely cost of provision is about $60.
Higher margins give me room to screw up.
Note: J. Presper Eckert and John Mauchly were pioneers in early computing. Despite making amazing breakthroughs, their company ran out of money and had to be sold, largely because they signed fixed-price contracts instead of cost-plus. I think about this whenever I make pricing decisions.
New Cloud Architecture
I now have some experience building with AWS and Cloudflare. The sections below detail the planned architecture of Version 2. Many of the features already exist in V1.
General Overview
The SEC filing archive updates as soon as new filings come out, as does the websocket. Other data offerings will initially update daily.
The general flow for daily updated offerings is:
- Update S3.
- Use S3 to update databases.
Cloud Offerings
R2 (NoSub)
R2 offerings are priced at $1/100,000 downloads.
Filings SGML R2
SEC submissions stored individually in sgml.zst form. datamule handles decompression and SGML parsing, so you will likely never interact with this.
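A minimal sketch of what happens under the hood, assuming a hypothetical public URL layout (the endpoint and key scheme below are placeholders, not the real R2 paths):

```python
# Fetch one sgml.zst submission and decompress it client-side.
# In practice datamule does this for you.
import io

import requests
import zstandard as zstd

ACCESSION = "0000320193-24-000123"  # placeholder accession number
url = f"https://r2.example-datamule.com/sgml/{ACCESSION}.sgml.zst"  # hypothetical endpoint

resp = requests.get(url, timeout=30)
resp.raise_for_status()

# stream_reader avoids relying on the content size being present in the zstd frame header
sgml_bytes = zstd.ZstdDecompressor().stream_reader(io.BytesIO(resp.content)).read()
print(sgml_bytes[:200])  # raw SGML for the submission
```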
Filings Tar R2
SEC submissions stored individually where each document is compressed and combined into a tar file. This allows for HTTP range requests. You will also likely never interact with this, except to enjoy faster downloads.
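A minimal sketch of the range-request idea. The URL and byte offsets are placeholders; in practice the offsets come from the lookup database (see secsgml_size_bytes below), and each tar member is itself compressed:

```python
# Pull a single document out of a submission tar with an HTTP range request.
import requests

url = "https://r2.example-datamule.com/tar/0000320193-24-000123.tar"  # hypothetical endpoint
offset, size = 1536, 48_210  # placeholder: start byte and length of one tar member's payload

resp = requests.get(url, headers={"Range": f"bytes={offset}-{offset + size - 1}"}, timeout=30)
assert resp.status_code == 206  # 206 Partial Content: the server honored the range
document_bytes = resp.content   # just this document, not the whole submission
```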
R2 Transfer
Transfer from R2 to your preferred S3 provider. Runs on your machine, so your credentials are secure. Will likely support filtering such as cik, date, document type as well as submission type, sic code, etc.
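A minimal sketch of the client-side transfer, assuming read credentials for the R2 bucket and your own AWS credentials. The endpoint, bucket names, and prefix filter are placeholders; the point is that both clients run locally, so credentials never leave your machine:

```python
import boto3

r2 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # R2 is S3-compatible
    aws_access_key_id="YOUR_R2_KEY",
    aws_secret_access_key="YOUR_R2_SECRET",
)
s3 = boto3.client("s3")  # your own AWS credentials, picked up from the environment

SRC_BUCKET, DST_BUCKET = "datamule-filings", "my-own-bucket"  # placeholders
paginator = r2.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix="sgml/2025/"):  # placeholder filter
    for obj in page.get("Contents", []):
        body = r2.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
        s3.put_object(Bucket=DST_BUCKET, Key=obj["Key"], Body=body)
```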
S3 (Regular+)
S3 offerings are planned to be priced at $10/gb downloaded.
Example: a user wants to download the text from every 2025 10-K, which is ballpark 3gb. This data is compressed to ballpark 300mb and streamed to the user at a cost of $3.
SEC Documents Data S3
SEC documents converted into dictionary form, stored in S3.
SEC Documents Data Tuples S3
Data unnested into (id, type, content, level) form. Uses Documents Data S3 combined with doc2dict's unnest_dict instead of duplicating in S3.
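Illustrative only, not doc2dict's actual unnest_dict: a toy walk that shows the rough shape of the tuples. The nested dict keys used here ('type', 'content', 'children') are assumptions for the example:

```python
def unnest(node, level=0, counter=None, out=None):
    """Flatten a nested document tree into (id, type, content, level) tuples."""
    counter = counter if counter is not None else {"next_id": 1}
    out = out if out is not None else []
    node_id = counter["next_id"]
    counter["next_id"] += 1
    out.append((node_id, node.get("type"), node.get("content"), level))
    for child in node.get("children", []):
        unnest(child, level + 1, counter, out)
    return out

doc = {
    "type": "section",
    "content": "Item 1A. Risk Factors",
    "children": [{"type": "paragraph", "content": "We face risks...", "children": []}],
}
print(unnest(doc))
# [(1, 'section', 'Item 1A. Risk Factors', 0), (2, 'paragraph', 'We face risks...', 1)]
```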
SEC Documents Text S3
Text extracted from SEC documents. Uses Documents Data S3, combined with doc2dict's flatten_dict instead of duplicating in S3.
SEC XBRL S3
XBRL extracted from SEC documents. See: secxbrl (MIT License).
SEC Structured Output S3
Datasets constructed from text within SEC filings such as dividends per share, proposal results, and executive compensation. See: Structured SEC. Daily dumps will likely be made available for free on GitHub.
SEC Tables S3
Tables extracted from SEC documents, stored in S3 Tables.
prefix: {accession}/{filename}/{tablename}.parquet
metadata: table description
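A minimal sketch of reading one extracted table, assuming the parquet objects are directly readable under the prefix scheme above. The bucket name, accession, filename, and table name are placeholders:

```python
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")
key = "0000320193-24-000123/aapl-20240928.htm/executive_compensation.parquet"  # hypothetical key
obj = s3.get_object(Bucket="datamule-sec-tables", Key=key)  # hypothetical bucket

table = pq.read_table(io.BytesIO(obj["Body"].read()))
print(table.to_pandas().head())
```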
SEC NLP S3
NLP such as entity detection and similarity scores stored in S3.
prefix: {nlp type}/{accession}/{filename}.json
SEC Full Text Index
SEC data tuples converted into a full text index. Useful for search.
S3 Transfer
Transfer from S3 to your preferred S3 provider. Runs on your machine, so your credentials are secure. Still thinking about implementation to prevent high egress fees.
Databases
Needs update to current version of MySQL.
Submissions and Documents Lookup (NoSub)
Note: I have reserved ‘lookup’ for databases that act primarily as redirects to other content, such as R2 or S3.
accession_cik
- accession
- cik
submission_details
- filing date
- is inline xbrl
- is xbrl
- report date
- acceptance date time
- act
- file number
- film number
- items
submission_documents
- document type
- sequence
- filename
- description
- secsgml_size_bytes (used for HTTP range requests)
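A sketch of how these lookup tables might relate, using sqlite3 as a local stand-in for the MySQL schema. Column names follow the field lists above; the types and the join key are assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accession_cik (accession TEXT, cik INTEGER);
    CREATE TABLE submission_documents (
        accession TEXT, document_type TEXT, sequence INTEGER,
        filename TEXT, description TEXT, secsgml_size_bytes INTEGER
    );
""")
con.execute("INSERT INTO accession_cik VALUES ('0000320193-24-000123', 320193)")
con.execute("""INSERT INTO submission_documents
               VALUES ('0000320193-24-000123', '10-K', 1, 'aapl-20240928.htm', 'FORM 10-K', 48210)""")

# Given a CIK, find each document plus its size (the size feeds HTTP range requests).
rows = con.execute("""
    SELECT d.accession, d.filename, d.secsgml_size_bytes
    FROM accession_cik AS a
    JOIN submission_documents AS d ON a.accession = d.accession
    WHERE a.cik = ?
""", (320193,)).fetchall()
print(rows)
```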
SEC XBRL DB (NoSub)
Database created from extracted SEC XBRL. Type validation functions will be made open source.
SEC Fundamentals DB (NoSub)
Company fundamentals database. Reuses the XBRL database combined with client side calculations to convert xbrl into fundamentals data. See: companyfundamentals (MIT License).
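Illustrative only, not the companyfundamentals API: the kind of client-side calculation that turns XBRL facts into a fundamental. Concept names follow us-gaap; the values are made up:

```python
facts = {
    "AssetsCurrent": 143_566_000_000,
    "LiabilitiesCurrent": 145_308_000_000,
}

def current_ratio(facts: dict) -> float:
    """Current ratio = current assets / current liabilities."""
    return facts["AssetsCurrent"] / facts["LiabilitiesCurrent"]

print(round(current_ratio(facts), 3))  # ~0.988
```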
SEC Tables DB (NoSub)
Tables extracted from SEC documents, uploaded to a database. Due to the number and size of tables, only high demand tables will be stored in RDS.
SEC Athena Tables (Plus)
Tables extracted from SEC documents, stored in S3 Tables, made SQL query-able using Athena. Queries will be a bit slower and more expensive than SEC Tables DB.
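A minimal sketch of querying through Athena with boto3. The database, table, and output location are placeholders that depend on the final catalog setup:

```python
import time

import boto3

athena = boto3.client("athena")
qid = athena.start_query_execution(
    QueryString="SELECT * FROM executive_compensation LIMIT 10",        # hypothetical table
    QueryExecutionContext={"Database": "sec_tables"},                    # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # your results bucket
)["QueryExecutionId"]

# Poll until the query leaves the queued/running states, then print the rows.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])
```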
SEC Documents Data RDS (Regular)
SEC Documents Data Tuples S3 ingested into a database. Tables within are converted to markdown form. New column for paths, e.g. id 1 -> id 33 -> id 55.
Still working on the design for this.
Other APIs
SEC Websocket (NoSub)
Be notified as soon as new SEC submissions come out. Free. Currently users are only allowed one connection to preserve the resource. Considering adding the ability to pay for more than one connection.
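A minimal sketch of a websocket listener, assuming a JSON message per new submission. The endpoint URL and the message fields are placeholders, not the documented interface:

```python
import asyncio
import json

import websockets

async def listen():
    async with websockets.connect("wss://ws.example-datamule.com") as ws:  # hypothetical endpoint
        async for message in ws:
            submission = json.loads(message)
            print(submission.get("accession"), submission.get("submission_type"))

asyncio.run(listen())
```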
SEC Full Text Search (Regular)
Search SEC documents for keywords. Returns the context, such as a sentence where the keywords are mentioned. Requested by a user here.
Likely architecture will be to preprocess the data tuples of SEC documents, excluding filler words, and store results in a full text index. This should fit inside the memory of a small AWS instance.
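A sketch of that idea: strip filler words from each tuple's content and build an in-memory inverted index mapping term -> (accession, tuple id). The stopword list and tokenizer are deliberately tiny placeholders:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "and", "of", "to", "a", "in", "for", "on", "is", "that"}

def index_tuples(accession, tuples, index=None):
    """Add (id, type, content, level) tuples for one submission to an inverted index."""
    index = index if index is not None else defaultdict(set)
    for tuple_id, _type, content, _level in tuples:
        for term in re.findall(r"[a-z0-9]+", (content or "").lower()):
            if term not in STOPWORDS:
                index[term].add((accession, tuple_id))
    return index

idx = index_tuples("0000320193-24-000123", [(1, "paragraph", "Dividends per share increased", 0)])
print(idx["dividends"])  # {('0000320193-24-000123', 1)}
```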
Doc2Dict API (Plus)
Takes html, txt and pdf files and converts them into dictionaries. Useful for larger workloads in the >100gb range.
Probably will use ECS Fargate for auto scaling; doc2dict is quite cheap to run, so further hardware optimizations are likely not needed.
Note: I need this for internal use, so might as well make it public.
NLP Persons API (Plus)
Takes html, text, and pdf files, converts them into data tuples using doc2dict, then applies persons detection on the content.
Uses a multistage pipeline:
- Filtering using pattern matching (names within formal documents have a consistent structure).
- Caching.
- Classification using zero-shot models such as DeBERTa.
Runs the classification step on older GPU spot instances at an 80–90% discount.
For formal/legal documents such as SEC filings, output quality should be quite good.
Should cost me ~$50 to do what Google charges $1 million for. Pricing TBD.
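A sketch of the multistage idea above: a cheap regex pass for name-like candidates, a cache so repeated strings are only classified once, then zero-shot classification for the rest. The checkpoint name is a placeholder for whichever DeBERTa NLI model ends up in production:

```python
import re
from functools import lru_cache

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")  # placeholder checkpoint
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+\b")  # crude "First (M.) Last" pattern

@lru_cache(maxsize=100_000)  # caching step: identical candidate spans are classified once
def is_person(span: str) -> bool:
    result = classifier(span, candidate_labels=["person name", "not a person name"])
    return result["labels"][0] == "person name"

text = "Pursuant to the agreement, John A. Smith was appointed Chief Financial Officer."
print([span for span in CANDIDATE.findall(text) if is_person(span)])
```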
Note: I need this for internal use, so might as well make it public.
Internal APIs + Misc
validate s3 bucket — returns the contents of an S3 bucket. Need to think through ListObjectsV2 to improve parallelization.
validate r2 bucket — my R2 buckets are already set up, so I have to work with that. Parallelization with range-based listing seems to have issues, so I will have to debug or accept a minimal loss.
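A sketch of prefix-sharded parallel listing with ListObjectsV2: split the keyspace by a leading character, list each shard in its own thread, and merge. The bucket name and shard alphabet are placeholders; R2 is reached through the same S3 API with a custom endpoint_url:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads
BUCKET = "datamule-filings"  # placeholder

def list_prefix(prefix):
    """List every key under one prefix shard."""
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

shards = [str(digit) for digit in range(10)]  # accession-numbered keys start with a digit
with ThreadPoolExecutor(max_workers=10) as pool:
    all_keys = [key for shard_keys in pool.map(list_prefix, shards) for key in shard_keys]
print(len(all_keys))
```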
lambda api — users currently access offerings via a Cloudflare Workers API. This is not currently an issue for internal use of my APIs, as Cloudflare R2 does not charge egress. However, as I build more offerings in AWS, this does matter, as the flow would go AWS → R2 → AWS, which is both slower and more expensive.
It likely makes sense for me to move away from Cloudflare workers to AWS Lambda, including migrating my user databases.
user data — I now need to keep track of more user data, for example API usage. Probably best to redesign around DynamoDB? I also need to keep track of user subscriptions.
Forthcoming Open Source Packages
New infrastructure requires new open source packages.
xml2tables — ingests many XML files and generates a schema to convert them into tabular form at scale.
table2standard — takes tables across many documents and converts them to a standardized form using the table data and adjacent context. Intended input is doc2dict tables applied to the SEC HTML corpus.
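Illustrative only, since xml2tables does not exist yet: the core idea is that repeated elements with consistent children become rows, and the union of child tags becomes the inferred schema:

```python
import xml.etree.ElementTree as ET

xml = """<filings>
  <filing><accession>0000320193-24-000123</accession><cik>320193</cik></filing>
  <filing><accession>0000789019-24-000045</accession><cik>789019</cik></filing>
</filings>"""

root = ET.fromstring(xml)
rows = [{child.tag: child.text for child in filing} for filing in root]  # one row per repeated element
schema = sorted({key for row in rows for key in row})                    # union of child tags
print(schema)  # ['accession', 'cik']
print(rows)
```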
Timeline
Given my past performance, I expect to complete most of datamule cloud v2 within a few months, if not earlier.
Note: A few months ago I got into AWS Activate, thanks to help from George Lin. He cold emailed me to ask if I needed resources, I said that would be cool, and I soon had quite a bit of cloud credit. Thank you George & AWS.
