Alexey Novakov
May 1, 2018 · 5 min read

ScalaCheck is a Scala library for property-based testing. To be honest, I am not yet using it for its direct purpose; however, I have found it very handy for generating random data.

Why would you need to generate random data?

You often need to start your work before you have the real data. That work might be something like:

  • Create UI to visualize data in tabular, graph or in any other form
  • Write SQL query or any specific DSL query to debug it in advance
  • Measure the performance of your application when a certain amount of data sits in it

There are many other reasons, too. Of course, once the real data is available, you need to test your work against it and adjust where necessary.

Data Domain

Let’s pick a domain to focus our data generation on, one that is easy to reason about: transportation. Imagine a company that offers car sharing, bicycle rental and taxi rides. We have three types of transactions, modelled as VehicleType, and a single model, Trip.
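The model listing is not reproduced in this text. As a rough sketch only, the definitions below are reconstructed from the order of arguments passed to Trip in the final generator; the field names startLocation, endLocation, zipCode and costPerHour, as well as all field types, are assumptions rather than the author's exact code.

```scala
import java.time.LocalDateTime

// Hypothetical reconstruction: names and types are inferred from the
// generators in this article, not copied from the author's project.
sealed trait VehicleType
case object Bike extends VehicleType
case object Taxi extends VehicleType
case object CarSharing extends VehicleType

final case class Trip(
  tripId: Long,
  vehicleType: VehicleType,
  city: String,
  state: String,
  zipCode: Int,
  customerId: Long,
  startLocation: String,
  endLocation: String,
  completed: Boolean,
  startTime: LocalDateTime,
  endTime: Option[LocalDateTime], // None while the trip is still in progress
  duration: Long,                 // minutes
  costPerHour: Double,
  requestTime: LocalDateTime,
  distance: Int                   // kilometers
)
```

Modelling VehicleType as a sealed trait lets the compiler check that pattern matches over vehicle types are exhaustive, which is useful for the vehicle-dependent generators later on.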

Looking at the field names, you can guess what each of them means. If a trip is still in progress, its endTime field is None; otherwise it holds the time the trip ended. I hope the other fields are clear.

Trips generator

First of all, we need to import the ScalaCheck Gen class, org.scalacheck.Gen. It is the only ScalaCheck class we are going to use in this article.

Let’s create a random generator of Trip case class instances. The trick is to have a random generator for each field of the case class and then compose them all to produce an instance of the target case class.

1. tripId. We use the constant value 0 because we rely on an auto-incremented database column for tripId. However, we still need an instance of Gen for tripId so that we can compose it with the other field generators. Keep reading.

val tripId: Gen[Long] = Gen.const(0L)

2. vehicleType. We want bike transactions to be generated more often than taxi transactions, and taxi transactions more often than car sharing transactions. The numbers in front of the values are weights. They can be any positive integers; what matters is their ratio to each other, which determines the resulting frequency.

val vehicleType = Gen.frequency(
  5 -> Bike, 3 -> Taxi, 2 -> CarSharing
)
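To see how weights turn into observed frequencies, here is a small sketch that draws many values with a fixed seed and counts them. The object FrequencyDemo, the helper counts and the stand-in VehicleType definitions are made up for illustration; the generators are wrapped in Gen.const explicitly so the example does not depend on implicit conversions.

```scala
import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object FrequencyDemo {
  // Stand-in for the article's VehicleType (assumed to be a sealed trait).
  sealed trait VehicleType
  case object Bike extends VehicleType
  case object Taxi extends VehicleType
  case object CarSharing extends VehicleType

  val vehicleType: Gen[VehicleType] = Gen.frequency[VehicleType](
    5 -> Gen.const(Bike),
    3 -> Gen.const(Taxi),
    2 -> Gen.const(CarSharing)
  )

  // Draw n values with a fixed seed so the run is repeatable,
  // then count how often each vehicle type was produced.
  def counts(n: Int, seed: Long): Map[VehicleType, Int] =
    Gen.listOfN(n, vehicleType)
      .pureApply(Gen.Parameters.default, Seed(seed))
      .groupBy(identity)
      .map { case (vehicle, hits) => vehicle -> hits.size }
}
```

With weights 5, 3 and 2, a large sample should contain roughly 50% Bike, 30% Taxi and 20% CarSharing values; the exact counts depend on the seed.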

3. stateCityZipCode. We want some cities to appear more often than others in the overall table of trips. The target value is a triple with the state name as the first element, the city name as the second and the zip code of the city as the third. We simplify zip codes by assigning a single zip code to the entire city.

val stateCityZipCode = Gen.frequency(
  1 -> ("Baden-Württemberg", "Stuttgart", 72160),
  4 -> ("Bayern", "Munich", 80333),
  5 -> ("Berlin", "Berlin", 14167),
  1 -> ("Brandenburg", "Potsdam", 14469),
  3 -> ("Bremen", "Bremen", 28195),
  4 -> ("Hamburg", "Hamburg", 20095),
  2 -> ("Hessen", "Wiesbaden", 65185),
  3 -> ("Niedersachsen", "Hannover", 30159),
  1 -> ("Mecklenburg-Vorpommern", "Schwerin", 19055),
  4 -> ("Nordrhein-Westfalen (NRW)", "Düsseldorf", 40213),
  1 -> ("Rheinland-Pfalz", "Mainz", 55128),
  3 -> ("Saarland", "Saarbrücken", 66111),
  2 -> ("Sachsen", "Dresden", 1067),
  1 -> ("Sachsen-Anhalt", "Magdeburg", 39104),
  1 -> ("Schleswig-Holstein", "Kiel", 24103),
  1 -> ("Thüringen", "Erfurt", 99084)
)

4. customerId. We want to generate trips for customers with ids between 1 and 1000. The same customer may appear zero or many times, but the id must always be inside the range.

val customerId = Gen.choose(1L, 1000L)

5. location. We simplify the location value to <street name> + <house number>. We already know Gen.choose from the previous field. Gen.oneOf picks one of the provided values, with a uniform distribution.

val location = for {
  street <- Gen.oneOf(
    "Landstrasse", "Kettenhofweg",
    "Frankenallee", "Taunusstrasse",
    "Ohmstrasse", "Goethestrasse")
  number <- Gen.choose(1, 100)
} yield street + " " + number
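As a quick sanity check of the two building blocks, the sketch below keeps the street and the house number as a tuple so that each part can be inspected separately. LocationDemo and its sample helper are illustrative names, and the street list is shortened.

```scala
import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object LocationDemo {
  val streets: List[String] = List("Landstrasse", "Kettenhofweg", "Frankenallee")

  // Gen.oneOf picks uniformly from the collection;
  // Gen.choose includes both bounds of the range.
  val location: Gen[(String, Int)] = for {
    street <- Gen.oneOf(streets)
    number <- Gen.choose(1, 100)
  } yield (street, number)

  // Fixed seed, so repeated runs return the same list.
  def sample(n: Int, seed: Long): List[(String, Int)] =
    Gen.listOfN(n, location).pureApply(Gen.Parameters.default, Seed(seed))
}
```

Every generated pair should contain a street from the list and a house number between 1 and 100 inclusive.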

6. completed. We want the vast majority of trips to be completed. With weights 10 and 1, about 10 out of every 11 trips are completed and 1 in 11 is not. This field is a simple Boolean.

val completed = Gen.frequency(10 -> true, 1 -> false)
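Weights translate into probabilities by dividing each weight by the sum of all weights. The helper below (probabilities is a made-up name, and it uses no ScalaCheck code at all) is only a sketch of that arithmetic, so you can check which weights give the ratio you have in mind.

```scala
object Weights {
  // Probability of each outcome under Gen.frequency-style weights:
  // the weight divided by the sum of all weights.
  def probabilities[A](weighted: Seq[(Int, A)]): Map[A, Double] = {
    val total = weighted.map(_._1).sum.toDouble
    weighted.map { case (weight, value) => value -> (weight / total) }.toMap
  }

  // Weights 10 and 1: true has probability 10/11, about 0.909.
  val tenToOne: Map[Boolean, Double] = probabilities(Seq(10 -> true, 1 -> false))

  // Weights 9 and 1: true has probability 9/10.
  val nineToOne: Map[Boolean, Double] = probabilities(Seq(9 -> true, 1 -> false))
}
```

So weights of 10 and 1 mean roughly 91% completed trips, while weights of 9 and 1 would give exactly 9 out of 10.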

7. requestTime. This is the timestamp at which the customer made a service request. We want all transactions to start between six months ago and now. I am using the java.time API to calculate the start and end of the range and pass their epoch-second Long values to Gen.choose as min and max. The generated Long value is then used to create an instance of LocalDateTime, which is later mapped to a database column of timestamp type.

import java.time.LocalDateTime
import java.time.ZoneOffset.UTC

// Range bounds as epoch seconds: from six months ago until now.
// They are defined before requestTime so that they are already
// initialized when the generator is built.
val rangeStart = LocalDateTime.now(UTC)
  .minusMonths(6).toEpochSecond(UTC)
val rangeEnd = LocalDateTime.now(UTC).toEpochSecond(UTC)

private def localDateTimeGen: Gen[LocalDateTime] =
  Gen.choose(rangeStart, rangeEnd).map(i =>
    LocalDateTime.ofEpochSecond(i, 0, UTC)
  )

val requestTime = localDateTimeGen
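The same idea can be packaged as a small reusable function that takes the bounds as arguments, which also makes it easy to test with fixed dates. dateTimeBetween and TimeRangeDemo are hypothetical names used only for this sketch; they are not part of ScalaCheck or of the project described here.

```scala
import java.time.LocalDateTime
import java.time.ZoneOffset.UTC
import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object TimeRangeDemo {
  // Hypothetical helper: timestamps between two bounds (inclusive),
  // with one-second resolution.
  def dateTimeBetween(from: LocalDateTime, to: LocalDateTime): Gen[LocalDateTime] =
    Gen.choose(from.toEpochSecond(UTC), to.toEpochSecond(UTC))
      .map(seconds => LocalDateTime.ofEpochSecond(seconds, 0, UTC))

  // Fixed bounds instead of "now", so the example is repeatable.
  val to: LocalDateTime = LocalDateTime.of(2018, 5, 1, 0, 0, 0)
  val from: LocalDateTime = to.minusMonths(6)

  val samples: List[LocalDateTime] =
    Gen.listOfN(100, dateTimeBetween(from, to))
      .pureApply(Gen.Parameters.default, Seed(1L))
}
```

Note that the lower bound must not be greater than the upper bound; Gen.choose cannot produce a value from an empty range.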

8. distance. An integer value for the trip distance in kilometers.

val distanceKm = Gen.choose(1, 500)

The remaining fields are calculated from the values produced by the generators above. They are:

  • startTime — depends on requestTime
  • endTime — depends on trip completion, duration and startTime
  • costPerHour — depends on vehicleType
  • duration — depends on vehicleType

Dependent generators

private val bikeWaitingTimeMins = Gen.choose(0, 1)
private val taxiWaitingTimeMins = Gen.choose(0, 20)
private val carWaitingTimeMins = Gen.choose(0, 2)

private val bikeDurationMins = Gen.choose(1, 250L)
private val taxiDurationMins = Gen.choose(1, 120L)
private val carDurationMins = Gen.choose(1, 300L)

private val costPerHourBike = Gen.choose(0.5, 2)
private val costPerHourTaxi = Gen.choose(50, 100.0)
private val costPerHourCarSharing = Gen.choose(15, 20.0)

We define functions that return the generator matching a given vehicleType, plus a helper that derives endTime from the other values:

import java.time.temporal.ChronoUnit.MINUTES

// endTime is only known for completed trips
private def getEndTime(completed: Boolean, durationMin: Long, startTime: LocalDateTime) = {
  if (completed) Some(startTime.plus(durationMin, MINUTES))
  else None
}

private def getWaitingTime(vehicle: VehicleType) = vehicle match {
  case Bike => bikeWaitingTimeMins
  case Taxi => taxiWaitingTimeMins
  case CarSharing => carWaitingTimeMins
}

private def getDuration(vehicle: VehicleType) = vehicle match {
  case Bike => bikeDurationMins
  case Taxi => taxiDurationMins
  case CarSharing => carDurationMins
}

private def getCost(vehicle: VehicleType) = vehicle match {
  case Bike => costPerHourBike
  case Taxi => costPerHourTaxi
  case CarSharing => costPerHourCarSharing
}
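A dependent generator is one whose choice depends on a previously generated value. The sketch below shows the mechanism in isolation with two vehicle types: flatMap passes the generated vehicle to a function that selects the next generator. DependentDemo, waitingTimeMins and vehicleWithWait are illustrative names, and the VehicleType here is a reduced stand-in.

```scala
import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object DependentDemo {
  sealed trait VehicleType
  case object Bike extends VehicleType
  case object Taxi extends VehicleType

  val vehicleType: Gen[VehicleType] = Gen.oneOf[VehicleType](Bike, Taxi)

  // The waiting-time range depends on the vehicle.
  def waitingTimeMins(vehicle: VehicleType): Gen[Int] = vehicle match {
    case Bike => Gen.choose(0, 1)
    case Taxi => Gen.choose(0, 20)
  }

  // flatMap feeds the generated vehicle into the function
  // that selects the next generator.
  val vehicleWithWait: Gen[(VehicleType, Int)] =
    vehicleType.flatMap(v => waitingTimeMins(v).map(w => (v, w)))

  val samples: List[(VehicleType, Int)] =
    Gen.listOfN(200, vehicleWithWait)
      .pureApply(Gen.Parameters.default, Seed(7L))
}
```

Every Bike in the samples should come with a waiting time of 0 or 1 minutes, and every Taxi with a value between 0 and 20.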

Final composition via for-comprehension

Gen behaves like a monad: it has a flatMap method and a unit function (Gen.const). A for-comprehension is translated by the compiler into flatMap and map calls, so it is all we need to compose the generators together and yield the generated values as arguments of the target case class, Trip.

def tripGen: Gen[Trip] =
  for {
    tId <- tripId
    vehicle <- vehicleType
    (state, city, zip) <- stateCityZipCode
    cId <- customerId
    aLocation <- location
    bLocation <- location
    done <- completed
    rTime <- requestTime
    waitingTime <- getWaitingTime(vehicle)
    duration <- getDuration(vehicle)
    startTime = rTime.plus(waitingTime.toLong, MINUTES)
    endTime = getEndTime(done, duration, startTime)
    cost <- getCost(vehicle)
    distance <- distanceKm
  } yield
    Trip(
      tId,
      vehicle,
      city,
      state,
      zip,
      cId,
      aLocation,
      bLocation,
      done,
      startTime,
      endTime,
      duration,
      cost,
      rTime,
      distance
    )
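To make the link between the for-comprehension and flatMap concrete, here is a two-generator sketch written both ways. DesugarDemo and run are illustrative names; run evaluates a generator with a fixed seed via pureApply, so both variants can be compared on the same random input.

```scala
import org.scalacheck.Gen
import org.scalacheck.rng.Seed

object DesugarDemo {
  val street: Gen[String] = Gen.oneOf("Landstrasse", "Goethestrasse")
  val number: Gen[Int] = Gen.choose(1, 100)

  // For-comprehension style, as used for tripGen.
  val viaFor: Gen[String] = for {
    s <- street
    n <- number
  } yield s + " " + n

  // Roughly what the compiler generates for the for-comprehension above.
  val viaFlatMap: Gen[String] =
    street.flatMap(s => number.map(n => s + " " + n))

  // Evaluate a generator with a fixed seed.
  def run(gen: Gen[String], seed: Long): String =
    gen.pureApply(Gen.Parameters.default, Seed(seed))
}
```

Given the same seed, both generators produce the same string. To get a batch of values from any generator, Gen.listOfN(n, gen).sample returns an Option holding a list of n generated values.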

Conclusion

  1. Sometimes we have to generate fake data because the real data for our work is not yet accessible. We cannot wait, so we need to simulate it.
  2. Generating random data is not the direct purpose of ScalaCheck. However, when you need fake data that follows certain rules (frequency, range, one of several values, or even generation that depends on previously generated values), ScalaCheck is a great tool for the job. For me, it is much more pleasant to write such utility code in a statically typed language like Scala than in ad hoc bash scripts or even Ruby or Python scripts.

Links:

  1. ScalaCheck User Guide: https://github.com/rickynils/scalacheck/blob/master/doc/UserGuide.md
  2. GitHub project for trips generation: https://github.com/novakov-alexey/trips-gen
  3. Functional Programming in Scala (Monads). Coursera videos: https://www.coursera.org/learn/progfun2/lecture/98tNE/lecture-1-4-monads
  4. Head image: https://www.pexels.com/photo/bet-black-and-white-casino-chance-278943/

SE Notes by Alexey Novakov

Software Engineering notes on Scala, JVM and other goodies
