How much data is enough data — Part 1: Beyond numbers

Ilko Masaldzhiyski
Xeneta
Oct 31, 2023 · 11 min read

Navigating the fine line between too little and too much in the quest for actionable insights ⚖️

TL;DR

  • 📊 Explore the data and its characteristics
  • 🧑‍🏫 Understand if the distribution is normal
  • 💪 Use bootstrapping if the data is not normally distributed
  • 👥 Choose a sample size to represent the population accurately
  • 🌎 Consider the systemic biases in the data
  • 🔜 Ensure observations are independent

Background

At Xeneta, we reveal the true picture of the ocean & air freight markets through crowdsourced, real-time data. This transforms how freight is bought and sold by increasing the efficiency and alignment of logistics teams and their ability to uncover revenue opportunities in all market conditions.

Hence, data collection is a core part of Xeneta’s business model. Having become the leading ocean benchmarking provider, with over 700 data providers and billions of prices, we faced a luxury problem: having too much data.

We faced a prioritization challenge: expanding real-time coverage while maintaining accuracy. It was essential to determine when to delay importing more data for one trade lane and shift focus to another.

Instead of oversaturating a few thousand lanes with a lot of data, we balance data-collection efforts across more lanes, each with the optimal amount of data.

This is where, in 2020, the question “How much data is enough data?” came to be. My partner in crime, Ahmet Vural, and I took a shot at answering it for Xeneta.

This question has a different degree of importance depending on the context. For your case, consider the following:

  • How expensive is data collection?
  • How much time does data collection take?
  • How much data do you need before it’s too much data?

If data is cheap and easy to collect, there’s no reason not to take it. However, your business must still consider where the point of diminishing returns lies.

Clearly, expensive or slow data collection will make the question of how much data you need more contentious. For us, there were scenarios where we would want to improve benchmarks on exotic trade lanes. In those cases, we might have to go after all customers and check if they have pricing information on those trade lanes.

No matter what your business is, if you can’t get all the data, you have to make sure you collect enough of the right kind to give a representative picture.

In our case, this translated to finding the right amount of data that will represent real-world market pricing for shipping containers from port A to port B.

This is the first part in a series of posts describing how we solved this challenge.

Understanding your data and the assumptions surrounding it

Customers provide us with the cost of shipping a container from port A to port B. Once this data is cleaned, anonymized, normalized, validated, and aggregated, we can give our customers the price range they will receive when negotiating with their suppliers. Information asymmetry is no longer hindering performance.

An essential factor in our case is understanding that the data we collect, regardless of the amount, should be considered a sample from a larger population of unknown size. It is unrealistic to collect pricing information for every single container out there.

Depending on the characteristics of our samples, we had to decide which type of statistical methods we should use — parametric or non-parametric.

Nonparametric methods are the kind we use when our data doesn’t fit neatly into common patterns or distributions. In contrast, parametric methods assume our data follows a specific distribution like the Normal or Gaussian distribution.

If a data sample is not normally distributed, then the assumptions of parametric statistical tests are violated, and nonparametric statistical methods must be used.

Various techniques, called normality tests, can check whether your data deviates from a Normal distribution. Beyond histograms, box plots, and QQ plots, the Kolmogorov-Smirnov and Shapiro-Wilk tests are worth using.
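
As a rough illustration in Python with SciPy (using simulated prices as a stand-in for one trade lane’s contracts on a given day), the two tests can be run like this:

```python
import numpy as np
from scipy import stats

# Simulated, right-skewed "prices" standing in for one trade lane on one day.
prices = np.random.default_rng(0).lognormal(mean=7.3, sigma=0.4, size=500)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a Normal distribution.
sw_stat, sw_p = stats.shapiro(prices)

# Kolmogorov-Smirnov against a Normal fitted to the sample's mean and standard deviation.
ks_stat, ks_p = stats.kstest(prices, "norm", args=(prices.mean(), prices.std(ddof=1)))

print(f"Shapiro-Wilk p-value: {sw_p:.4f}")
print(f"Kolmogorov-Smirnov p-value: {ks_p:.4f}")
# A small p-value (e.g. < 0.05) means we reject the hypothesis of normality.
```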

Our data goes through multiple cleaning stages and is finally stored within a regional grouping called Geo Hierarchy. The assumption is that these steps reduce noise, remove outliers, and reduce the variance in our data.

Check if the distribution of your data is Normal

It’s always a good practice to visually inspect your data first.

Before jumping into analyzing all of our pricing data (more than 4.3 billion prices as of June 30th, 2020), we visually explored some of the most densely saturated trade lanes.

Annual development of distributions on five trade lanes with a lot of data

None of these five distributions is easily classified. It is easy to visually confirm that there are no Normal distributions among them, and after running the Shapiro-Wilk test, we comfortably rejected the hypothesis that they are Normal.

That’s too small a sample to draw conclusions from, though. You should always ensure that your findings are based on enough good data. Hence, we proceeded to explore all of the available pricing data.

When looking at our data, we found that 24% of all analyzed trade lanes had a Normal distribution. However, those accounted for only 2.6% of all prices. No trade lane exceeding 500 contracts on a given day had a Normal distribution.

Thus, even though there were some indications that the data could follow a Normal distribution, it was concluded that this is highly unlikely.

What were the most common types of distributions we saw in our data?

We plotted the goodness of fit and, using the least squares error (LSE) method, estimated which candidate among the most common distributions best summarizes our observed values.

We wanted to understand if a given statistical model is more frequently seen across our data.

Estimating goodness of fit using LSE

We tested for the following 32 distributions across hundreds of densely saturated trade lanes:

lognorm, exponnorm, gamma, alpha, beta, weibull_max, weibull_min, loggamma, rice, rayleigh, powerlognorm, powernorm, pareto, maxwell, loglaplace, logistic, laplace, invweibull, invgauss, invgamma, genlogistic, gengamma, gausshyper, fisk, exponweib, exponpow, expon, cosine, chi2, burr, cauchy, norm
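
A minimal sketch of how such an LSE comparison can be set up in Python with SciPy, using a simulated price array and only a handful of the candidates above for brevity:

```python
import numpy as np
from scipy import stats

# Simulated prices standing in for one trade lane on a given day.
rng = np.random.default_rng(42)
prices = stats.exponnorm.rvs(K=2.0, loc=1500, scale=300, size=2000, random_state=rng)

candidates = ["norm", "lognorm", "exponnorm", "gamma", "weibull_min", "logistic"]

# Empirical density that each fitted pdf is compared against.
hist, bin_edges = np.histogram(prices, bins=50, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

lse = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(prices)              # maximum-likelihood parameter estimates
    pdf = dist.pdf(bin_centers, *params)   # fitted density at the bin centers
    lse[name] = np.sum((hist - pdf) ** 2)  # least squares error against the histogram

best = min(lse, key=lse.get)
print(f"Best-fitting distribution by LSE: {best}")
```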

The heatmap below visualizes the number of corridors that minimize the LSE function on a given day. There are a few contenders, but it’s apparent that the exponnorm distribution is ahead of the rest in the number of trade lanes it best describes.

Heatmap illustrating how many trade lanes were described by a given distribution

The exponnorm distribution, or exponential normal distribution, is unique in its flexibility and capacity to fit many data types. In our case, this distribution consistently proved to be the most commonly observed in our data for 2019, reflecting a large number of trade lanes.

This suggests that our data, and possibly the ocean freight market behaviors it represents, share similar patterns of variance and skewness that align with the characteristics of an exponentially modified Gaussian distribution.

The same observations were also valid when we looked at how many contracts were described by a given distribution type.

Heatmap illustrating how many contracts on a given day were described by a distribution type

So far, we have seen that our data could not be modeled as Normal, but it conforms to Normal-like models such as exponnorm.

What if the distribution was Normal?

We would be able to calculate the minimum sample size based on the following criteria:

  • Confidence level
  • Error margin
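
In the notation below (the standard minimum-sample-size result, assuming the worst-case proportion p = 0.5), the formula is:

$$ n = \frac{z^2 \cdot 0.25}{\mathrm{MOE}^2} $$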

Where n is the sample size, z is the value based on the confidence level, and MOE is the margin of error.

The z value can be read from the standard Normal table at the back of any statistics book: z = 1.645 for 90% confidence, 1.96 for 95%, and 2.576 for 99%.

If we further expand the above formula for different confidence levels and margins of error, we get the number of observations needed in our samples; a few combinations are computed in the sketch after the example below.

You can choose a sample size that works for your case based on the confidence interval and margin of error. For example:

If we want to be 90% confident, with an error margin of 10%, that our prices represent the market on a given day, we would need at least 68 contracts on a given trade lane.
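
A short Python sketch of this calculation (the confidence levels and margins of error below are just illustrative combinations):

```python
import math
from scipy import stats

def min_sample_size(confidence: float, moe: float) -> int:
    """Minimum sample size for a given confidence level and margin of error,
    assuming the worst-case proportion p = 0.5."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value
    return math.ceil(z ** 2 * 0.25 / moe ** 2)

for confidence in (0.90, 0.95, 0.99):
    for moe in (0.10, 0.05):
        n = min_sample_size(confidence, moe)
        print(f"{confidence:.0%} confidence, {moe:.0%} margin of error -> n >= {n}")

# 90% confidence with a 10% margin of error gives n >= 68, matching the example above.
```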

Use Bootstrap estimation if your data is not Normal

In the “What if?” scenario, we assumed that our distributions were Normal and used a simple formula to obtain the required sample size based on a standard error.

However, as we saw in our data, sometimes it’s hard to meet assumptions like the one above.

Normal distribution vs. one seen in our data

This is where the bootstrap method addresses these kinds of problems. Bootstrap estimation is a sampling-with-replacement method that allows you to understand your data better without making too many assumptions.

It works a bit like a lottery draw. Imagine you have a bag full of all possible prices we’ve seen on a trade lane. Now, you draw a handful of prices, write them down, and then put them back in the bag. You repeat this a certain number of times, each time writing down the average price of your handful. The result is a whole range of averages, which gives you a good idea of what to expect from the prices in the market.

Estimating population statistics using multiple samples. Image by Trist’n Joseph

We could write down the algorithm as follows:

  1. Draw a random sample of size n from a given trade lane
  2. Compute summary statistics for the sample
  3. Repeat steps 1 and 2 X times to get X summary statistics
  4. Take the mean and variance of these X statistics, resulting in an approximation of the market

The number of repetitions must be large enough that meaningful statistics, such as the mean, standard deviation, and standard error, can be calculated from the resulting samples. Aim for a minimum of 30 repetitions; the more, the better.
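
A minimal Python sketch of the algorithm above, assuming `prices` holds the observed contract prices for one trade lane on one day (simulated here for illustration):

```python
import numpy as np

def bootstrap_market_estimate(prices, sample_size, n_repeats=1000, seed=0):
    """Bootstrap (sampling with replacement) estimate of the market mean price."""
    rng = np.random.default_rng(seed)
    means = np.empty(n_repeats)
    for i in range(n_repeats):
        # Step 1: draw a random sample of size n with replacement.
        sample = rng.choice(prices, size=sample_size, replace=True)
        # Step 2: compute the summary statistic for this sample.
        means[i] = sample.mean()
    # Steps 3-4: the X statistics approximate the market; summarize their spread.
    lower, upper = np.percentile(means, [5, 95])  # 90% interval
    return means.mean(), means.std(ddof=1), (lower, upper)

# Simulated prices standing in for one trade lane on one day.
prices = np.random.default_rng(1).lognormal(mean=7.3, sigma=0.3, size=5000)
estimate, spread, interval = bootstrap_market_estimate(prices, sample_size=68)
print(f"Estimated market mean: {estimate:.0f}, 90% interval: {interval}")
```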

Why would this work?

The classic Law of Large Numbers theorem states:

The average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.

The Law of Large Numbers is like tossing a coin. In a few throws, you might get heads 80% of the time, which doesn’t reflect the true probability of 50%. But if you toss the coin hundreds or thousands of times, the percentage of heads will get closer and closer to 50%. This is the essence of the law: the more trials you perform, the closer you get to the true probability or expected value.

Results

Due to the amount of data and how computationally expensive this was, we randomly chose 400 trade lanes for every 1st and 15th day of 2019 and ran 100 trial samplings with 5, 10, 30, 45, 68, 100, 130, 200, and 350 prices on any given day. We simulated whether we could approximate the market metrics of our trade lanes using different sample sizes.

The difference between the estimated and actual values with a confidence level of 90% is visualized below:

It was interesting that the parametric and non-parametric methods produced almost identical results.

Hence, the almost identical phrasing:

If we want to be 90% confident, with a margin of error at 10%, that our prices represent the market on a given day, we would need at least 71 contracts on a given trade lane for that day.

The results guided our decisions on the optimal number of prices to import immediately. This strategy expanded our coverage across trade lanes while maintaining accuracy.

We could now offer accurate real-world benchmarks for an even larger number of trade lanes than before.

Consider the systemic biases in the data

The underlying assumption when simulating these results is that the pricing information we receive is composed of independent observations. However, this needs further investigation.

You should challenge such assumptions for your business case. The premise of independence in our pricing information is crucial because if observations are dependent, it may bias our results.

In our case, data from different customers might be dependent because of shared suppliers or alliances, creating correlations within the data. These correlations can distort our analysis if not accounted for, making it look like we have more information than we do. (This will be covered in Part 2.)

Furthermore, the data itself comprises pricing provided by Freight Forwarders and Shippers. Ensuring that the composition is representative depends on the use case.

In addition to testing for statistical significance, we considered factors such as outlier detection, the Gini coefficient, the number of containers shipped, and our customers' industries.

Outlier Detection

Our process for outlier detection involves both statistical and domain-specific methods. Statistically, we employ techniques such as the IQR method and clustering. But beyond the numbers, understanding the maritime industry’s intricacies and temporal components was essential.
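
As a rough sketch, the statistical part of that filter (Tukey’s IQR rule) might look like the snippet below; the domain-specific and temporal checks are not shown:

```python
import numpy as np

def iqr_outlier_mask(prices, k=1.5):
    """Flag prices outside the IQR fences (Tukey's rule). k=1.5 is the
    conventional multiplier; domain review still has the final say."""
    q1, q3 = np.percentile(prices, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (prices < lower) | (prices > upper)

# Hypothetical prices for one trade lane; the 9000 entry is the obvious outlier.
prices = np.array([1500, 1550, 1480, 1520, 9000, 1490])
print(prices[iqr_outlier_mask(prices)])  # -> [9000]
```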

Gini coefficient

The Gini coefficient is a measure of inequality; in our case, we used it to assess how our data points are distributed across data providers within each trade lane. A low Gini coefficient indicates that our data points are evenly spread across data providers, while a high value suggests an unequal distribution, with one customer responsible for most of the pricing data we receive.
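
A minimal sketch of how such a provider-level Gini check can be computed, using hypothetical contract counts per data provider:

```python
import numpy as np

def gini(counts):
    """Gini coefficient of how contracts are spread across data providers:
    0 means perfectly even contributions, values near 1 mean one provider dominates."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

print(gini([10, 12, 9, 11, 10]))  # close to 0: contributions are evenly spread
print(gini([95, 2, 1, 1, 1]))     # close to 1: one provider dominates the lane
```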

Number of containers being shipped

When considering representativeness, the volume being moved by a given customer should also be accounted for. We should have a representative spread across this dimension as well.

Different industries

We also factored in the variety of industries represented in our data. The pricing data could vary significantly across industries due to the differing requirements and market conditions. Hence, ensuring diverse industry representation would make our benchmarks more robust and applicable.

As mentioned before, there is a delicate balance in ensuring that our data is representative, which requires optimizing many parameters.

Stay tuned for the next part, which further explores the independence of data.

As we’ve journeyed through the intriguing landscape of data collection, we’ve discovered the balance between too little and too much data, the importance of understanding the characteristics of data, and the novel ways in which we can make sense of it all.

However, our expedition into the world of data doesn’t end here. The independence of our pricing data is a bold assumption. In the upcoming second part of this series, we will dive into this critical aspect of data analysis, exploring how we navigated across this potential pitfall.

Xeneta is always on the lookout for great talent. If you want to join us on our journey to revolutionize how freight is bought and sold, submit your application here.

Ilko Masaldzhiyski is a Data Science Director at Xeneta by day and an avid chess player by night. He’s also a dad who enjoys going to the cinema, playing video games, and reading.

Ahmet Vural is a Data Infrastructure Tech Lead at Xeneta — a real “human Swiss knife” when it comes to data. He’s a dad-to-be who enjoys wine, tinkering with all kinds of code, and can’t operate without music.

If you have passion and curiosity for data or want to discuss and exchange ideas, please don’t hesitate to reach out.

Thanks to Rajesh Bhol, Miguel Jimenéz, Abdalla Elbedwihi, Emil Korzeń, and Dayna Goldman for reviewing and helping finalize this article.
