The Age of Dark Data
I remember the time right after graduating from university, when I quit my job and started looking for a new one where I could grow and cultivate my skills in the big data field. Since I had already put a lot of hours and effort into the subject during my final year, I was paying closer attention than most to the conversations happening in this increasingly popular space, and I was following the people who were thinking out loud about it. Gartner’s Hype Cycle had a solid track record of not lying to us, and it wouldn’t lie about the fate of this hot topic either. By nature, humans tend toward optimism bias, acting with inflated confidence that things will work out in their favor. Sometimes we develop FOMO, convinced we’ll miss a trend, and end up taking roads that look different but lead to the exact same dead end.
The Big Data Illusion
Back then, fueled by industry rivalry, tech leaders across all sorts of fields kept saying things like “We’re stockpiling data for our big data transformation” or “We just onboarded Cassandra clusters.” Given how green I was at the time, I couldn’t make sense of the real motivations behind these moves. I told myself there had to be grand strategic chess games happening behind the scenes that I just wasn’t experienced enough to see. But a few questions kept nagging at me:
“What exactly are we waiting for?”, “What are we stockpiling?”, “Why?”, “And what’s the price tag?”
And it’s not just companies sitting on mountains of data, is it? Think about your own iCloud or Google Drive. How often do you actually open those synced files, photos, and old videos? How much do you pay every year just to keep them there? When you generate images, texts, or videos with AI tools, do they only live on the platform’s servers? Or do you also save copies to your phone, your laptop, and three other cloud apps? And how many of those files do you ever go back to? At some point, our relationship with data started looking a lot like that Netflix watchlist we swear we’ll get to but never will.
Before we can answer any of these questions properly, we need to give the problem a name.
What Is Dark Data?
Solving a problem starts with defining it. This particular issue can be framed from financial, psychological, and technical angles, but for the purpose of this piece, I’ll go with the term “Dark Data.”
Gartner's Definition
Dark data is the information assets that organizations collect, process, and store during routine business operations but fail to use for analytics, business improvement, or any other purpose that generates direct value.
The industry usually frames dark data as untapped potential, a goldmine waiting to be discovered and converted into revenue through analytics. In that reading, dark data is simply “data that hasn’t been monetized yet.”
I think framing it that way alone is a textbook case of optimism bias. Not all data is valuable. You can’t put a price tag on every byte. And even when you can convert a piece of data into an analytical insight, there’s no guarantee the revenue it generates will outweigh the cost of extracting it. A huge chunk of stored data, given the format and conditions it’s kept in, will never become an insight at all, or doing so would cost more than it’s worth.
Part of the blame falls on that “data is the new oil” mantra we’ve all dropped into a slide deck at some point. To be fair, some of that data genuinely can fuel operational gains and smarter predictions. But that optimism shouldn’t give us a free pass to hoard everything.
Let’s put some numbers to this:
According to Splunk’s “The State of Dark Data” report:
Here’s the irony: nearly everyone agrees data is strategically important, and nearly everyone admits three-quarters of it just sits there collecting dust. The standard reaction to studies like these is predictable: “Great, let’s mine the hidden value!”
My read is different. People hoard because they believe data is priceless. But what they actually end up with isn’t revenue, it’s a recurring line item on their cost sheet, a growing security blind spot, and a significant environmental problem. The lesson I take from this isn’t “let’s squeeze every last drop of value from our data.” The real lesson is that storage should be designed with intent from day one, architected around clear goals and planned for the future. Instead of treating data like cash stuffed under a mattress, we need a storage strategy that serves our actual operational needs and keeps improving over time.
So what does this strategic blindness actually cost an organization in practice? A scenario is the best way to show it.
The Bill for Data Hoarding
Let’s walk through a real-world-style cost model. We’ll look at what happens when an organization doesn’t plan its storage strategy, and break down both the financial and environmental fallout.
Let’s call our fictional company CompW, a leading e-commerce platform. To keep things clear, we’ll group their storage into three buckets:
In-Memory Databases: The high-performance layer powering core business logic and real-time operations.
SQL & NoSQL Databases: The workhorses behind analytics pipelines and various operational needs.
Object Storage: The catch-all for raw data, integration layers, IoT feeds, lakehouse data, BLOBs, snapshots, and everything else.
I’m going to be generous with these estimates. Say their in-memory layer holds 10 TB total, and 90% of it is genuinely accessed on a regular basis. That leaves just 1 TB of dark data. I’ll skip the dollar figure here since costs vary wildly with licensing and deployment models.
Assume their combined SQL and NoSQL footprint is another 10 TB. Being optimistic again, let’s say 80% is actively used. That’s about 2 TB sitting dark, and I’ll skip the pricing for the same reasons.
Now the big one. Let’s say their object storage totals 4 PB across all departments, apps, integration layers, and everything else. If 75% of that is dark, we’re looking at 3 PB of data that nobody has touched in the last 15 to 30 days, just accumulating. For simplicity, I’m not even counting cross-region or cross-continent replication.
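To make the three estimates easy to compare, here is a minimal Python sketch of the arithmetic. The sizes and dark shares come straight from the scenario above; the layer names, the `dark_tb` helper, and the use of decimal units (1 PB = 1,000 TB) are my own assumptions for illustration.

```python
# Storage layers from the CompW scenario: (total size in TB, share that is dark).
TB_PER_PB = 1000  # decimal units; an assumption, the scenario does not specify

LAYERS = {
    "in-memory": (10, 0.10),                  # 90% regularly accessed
    "sql-nosql": (10, 0.20),                  # 80% actively used
    "object-storage": (4 * TB_PER_PB, 0.75),  # 75% untouched
}


def dark_tb(total_tb, dark_share):
    """Return the volume of dark data in TB for one storage layer."""
    return total_tb * dark_share


dark_by_layer = {
    name: dark_tb(total, share) for name, (total, share) in LAYERS.items()
}
total_dark_tb = sum(dark_by_layer.values())
object_share_of_dark = dark_by_layer["object-storage"] / total_dark_tb

for name, dark in dark_by_layer.items():
    print(f"{name}: {dark:,.0f} TB dark")
print(f"object storage share of all dark data: {object_share_of_dark:.1%}")
```

Under these numbers, object storage accounts for almost all of the dark data, which is why the rest of the cost model concentrates on that layer.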
How organizations deal with this matters a lot. Cloud vendors typically offer features under the umbrella of Data Lifecycle Management (DLM). In plain terms, DLM moves data between storage tiers based on age and access frequency: the idea is that data “ages out” over time. The tiers you’ll hear about most are Hot, Warm, Cold, and Archive.
Hot storage is expensive to keep but cheap to read from. Archive storage is the opposite: dirt cheap to store, but it costs you every time you need to pull something out. The whole game is matching your storage technology to your actual access patterns so you’re not burning money on data nobody looks at.
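The trade-off between storage price and retrieval price can be sketched in a few lines of Python. This is an illustration, not any vendor’s pricing model: the `Tier` class, the `cheapest_tier` helper, and every price below are hypothetical placeholders chosen only to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    storage_per_gb: float    # USD per GB stored per month (placeholder)
    retrieval_per_gb: float  # USD per GB read back (placeholder)


# Hypothetical prices: storage gets cheaper and reads get pricier as data "ages out".
TIERS = [
    Tier("Hot", 0.020, 0.000),
    Tier("Warm", 0.010, 0.010),
    Tier("Cold", 0.005, 0.030),
    Tier("Archive", 0.001, 0.200),
]


def monthly_cost(tier, stored_gb, read_gb):
    """Monthly cost of keeping stored_gb in a tier while reading read_gb back."""
    return stored_gb * tier.storage_per_gb + read_gb * tier.retrieval_per_gb


def cheapest_tier(stored_gb, read_gb):
    """Pick the tier with the lowest monthly cost for this access pattern."""
    return min(TIERS, key=lambda t: monthly_cost(t, stored_gb, read_gb))


# The same 1,000 GB lands in a different tier depending on how much is read.
for read_gb in (2000, 500, 100, 0):
    print(f"{read_gb} GB read per month -> {cheapest_tier(1000, read_gb).name}")
```

Real pricing adds factors this sketch leaves out, such as minimum retention periods, per-request charges, and retrieval latency, so treat it as a way to reason about access patterns rather than a billing calculator.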
The Komprise 2026 State of Unstructured Data Management report found that 47% of respondents lack visibility into what their departments spend on storage, while 74% of organizations now manage over 5 PB. FinOps for storage is still in its early days. With separate storage accounts and containers spun up for every team, just figuring out how much data is properly tiered across the org takes real effort. In my own career, across several companies, I’ve rarely seen DLM tracked consistently or even factored into workload design. Even at companies I directly advised on this, the topic was rarely taken seriously.
Back to CompW. Without any DLM strategy, using list prices at the time of writing, the monthly bill looks like this:

Now, if we simply move 66% of the data to the Cold tier based on access patterns and keep the rest in Hot, the bill drops to:
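For readers who want to rerun this comparison with their own numbers, here is a hedged Python sketch of the calculation. The two per-GB prices are placeholders I picked for illustration, not the list prices behind the figures in this article, and the model deliberately ignores retrieval, transaction, and replication charges.

```python
GB_PER_PB = 1_000_000  # decimal units; an assumption

# Placeholder prices in USD per GB per month; substitute your vendor's list prices.
HOT_PRICE = 0.018
COLD_PRICE = 0.0045


def monthly_bill(total_pb, cold_fraction):
    """Storage-only monthly bill when cold_fraction of the data sits in the Cold tier."""
    total_gb = total_pb * GB_PER_PB
    cold_gb = total_gb * cold_fraction
    hot_gb = total_gb - cold_gb
    return hot_gb * HOT_PRICE + cold_gb * COLD_PRICE


baseline = monthly_bill(4, 0.0)   # everything in Hot, no DLM
tiered = monthly_bill(4, 0.66)    # 66% moved to Cold
saving = 1 - tiered / baseline

print(f"no tiering:   ${baseline:,.0f} per month")
print(f"with tiering: ${tiered:,.0f} per month")
print(f"saving:       {saving:.1%}")
```

With these placeholder prices, the Cold tier costs a quarter of the Hot tier per GB, so moving two-thirds of the data cuts the storage bill roughly in half.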

That’s nearly a 50% cut. And here’s the thing: what we call “optimization” today is on track to become a survival requirement. Right now, most stored data is created by humans or their devices. But according to Gartner, the volume of synthetic data generated by AI is expected to overtake real-world data in the near future.

The writing is on the wall. What feels optional today will soon be the difference between profitable and unprofitable workloads.
The Carbon Footprint of Digital Waste
Everything so far has been about money. But unused data isn’t just a financial drain, it’s also a security liability. And maybe the part that should worry us most: it’s an environmental disaster hiding in plain sight.
Veritas reported back in 2020 that the energy consumed just to store dark data produces 6.4 million tons of CO2 per year. That same report projected 175 ZB of total stored data by 2025. Current estimates suggest we’ve blown past that number, landing somewhere between 200 and 250 ZB. Factoring in today’s dark data ratios, we’re likely looking at 9.5 to 10 million tons of CO2 every single year.
Absorbing that much carbon annually would take roughly 500 million mature trees. A single tree takes years to grow. Meanwhile, AI is generating data at a rate that’s only accelerating.
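The tree count depends entirely on how much CO2 a single mature tree absorbs per year. The short Python sketch below assumes about 20 kg per tree per year, which is the value that reproduces the rough figure above; that absorption rate and the use of metric tonnes are my assumptions, and real rates vary widely by species and climate.

```python
KG_PER_TONNE = 1000

# Assumption: one mature tree absorbs about 20 kg of CO2 per year.
ASSUMED_KG_CO2_PER_TREE_PER_YEAR = 20


def trees_needed(tonnes_co2_per_year):
    """Number of mature trees needed to absorb the given annual CO2 emissions."""
    return tonnes_co2_per_year * KG_PER_TONNE / ASSUMED_KG_CO2_PER_TREE_PER_YEAR


for tonnes in (9_500_000, 10_000_000):
    print(f"{tonnes:,} t CO2 per year -> {trees_needed(tonnes):,.0f} trees")
```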
Mahatma Gandhi
“The world has enough for everyone’s need, but not enough for everyone’s greed.”
See you in the next post of this series.