Reading 16GB File in Seconds, Golang

Ohm Patel
Jul 7, 2020 · 4 min read

Any computer system in today’s world generates a very large amount of logs or data daily. As the system grows, it is not feasible to store this debugging data in a database, since it is immutable and only ever used for analytics and fault-resolution purposes. So organisations tend to store it in files, which reside on local disk storage.

We are going to extract logs from a 16 GB .txt or .log file, containing millions of lines, using Golang.

Let’s Code…!

Let’s open the file first. We will use the standard Go os.File for all file IO.

f, err := os.Open(fileName)
if err != nil {
    fmt.Println("cannot read the file", err)
    return
}
defer f.Close() //do not forget to close the file; defer it only after checking the error

Once the file is open, we have two options for how to proceed:

  1. Read the file line by line. This reduces the strain on memory but takes more time in IO.
  2. Read the entire file into memory at once and process it. This consumes much more memory but cuts the time down significantly.

Since the file is very large, i.e. 16 GB, we cannot load the whole thing into memory. But the first option is not feasible for us either, as we want to process the file within seconds.
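For reference, option 1 needs only a few lines with bufio.Scanner. Here is a minimal sketch, assuming the file has already been opened as f above and that bufio and fmt are imported:

//option 1 sketch: read the file line by line with bufio.Scanner
scanner := bufio.NewScanner(f)
for scanner.Scan() {
    line := scanner.Text() //one log line, without the trailing newline
    _ = line               //filter/process the line here
}
if err := scanner.Err(); err != nil {
    fmt.Println("error while scanning the file", err)
}

A single goroutine doing millions of small reads like this stays sequential, which is exactly why it is too slow for our target.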

But guess what, there is a third option. Voila…! Instead of loading the entire file into memory, we will load the file in chunks, using bufio.NewReader(), available in Go.

r := bufio.NewReader(f)
for {
    buf := make([]byte, 4*1024) //the chunk size
    n, err := r.Read(buf)       //loading chunk into buffer
    buf = buf[:n]
    if n == 0 {
        if err != nil {
            fmt.Println(err)
            break
        }
        if err == io.EOF {
            break
        }
        return err
    }
}

Once we have a chunk, we will fork a thread, i.e. a goroutine, to process each chunk concurrently with the others. The above code then becomes:

//sync pools to reuse the memory and decrease the pressure on the Garbage Collector
linesPool := sync.Pool{New: func() interface{} {
    lines := make([]byte, 500*1024)
    return lines
}}
stringPool := sync.Pool{New: func() interface{} {
    lines := ""
    return lines
}}
slicePool := sync.Pool{New: func() interface{} {
    lines := make([]string, 100)
    return lines
}}

r := bufio.NewReader(f)
var wg sync.WaitGroup //wait group to keep track of all the spawned goroutines

for {
    buf := linesPool.Get().([]byte)

    n, err := r.Read(buf)
    buf = buf[:n]
    if n == 0 {
        if err != nil {
            fmt.Println(err)
            break
        }
        if err == io.EOF {
            break
        }
        return err
    }

    nextUntilNewline, err := r.ReadBytes('\n') //read up to the end of the current line so no log line is split across chunks
    if err != io.EOF {
        buf = append(buf, nextUntilNewline...)
    }

    wg.Add(1)
    go func() {
        //process each chunk concurrently
        //start -> log start time, end -> log end time
        ProcessChunk(buf, &linesPool, &stringPool, &slicePool, start, end)
        wg.Done()
    }()
}
wg.Wait()

The above code introduces two new optimizations:

  1. sync.Pool is a powerful pool of instances that can be reused to reduce the pressure on the garbage collector. We reuse the memory allocated to the various slices, which lowers memory consumption and makes the processing significantly faster (a minimal sketch follows this list).
  2. Goroutines, which let us process each buffer chunk concurrently and significantly increase the processing speed.
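To make the first point concrete, here is a minimal, illustrative sketch of how a sync.Pool hands out and takes back a reusable buffer (the 500 KB size simply mirrors linesPool above; it is not a requirement):

//a pool that allocates a fresh 500 KB buffer only when no old one is free for reuse
bufPool := sync.Pool{New: func() interface{} {
    return make([]byte, 500*1024)
}}

buf := bufPool.Get().([]byte) //either a recycled buffer or a brand new one from New
//... fill buf from the reader and process it ...
bufPool.Put(buf) //hand the buffer back so the next Get can reuse it instead of allocating

Buffers taken with Get and returned with Put skip a fresh allocation on most iterations, which is what keeps the garbage collector quiet during the read loop.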

Now let’s implement the ProcessChunk function, which processes log lines of the following format:

2020-01-31T20:12:38.1234Z, Some Field, Other Field, And so on, Till new line,...\n

We will be extracting logs based on the timestamps provided at the command line.
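ProcessChunk below receives start and end as time.Time values. The post does not show how they are read from the command line, so here is one possible sketch with hypothetical -start and -end flags (the flag names and variables are assumptions, not part of the original code); it needs the flag, log, and time packages:

//hypothetical flags; the original only states the timestamps come from the command line
startFlag := flag.String("start", "", "start of the period, e.g. 2020-01-31T20:12:38.1234Z")
endFlag := flag.String("end", "", "end of the period, e.g. 2020-01-31T21:12:38.1234Z")
flag.Parse()

layout := "2006-01-02T15:04:05.0000Z" //Go reference layout matching the log format above
start, err := time.Parse(layout, *startFlag)
if err != nil {
    log.Fatalln("cannot parse the start time:", err)
}
end, err := time.Parse(layout, *endFlag)
if err != nil {
    log.Fatalln("cannot parse the end time:", err)
}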

func ProcessChunk(chunk []byte, linesPool *sync.Pool, stringPool *sync.Pool, slicePool *sync.Pool, start time.Time, end time.Time) {

    //another wait group to process every chunk further
    var wg2 sync.WaitGroup

    logs := stringPool.Get().(string)
    logs = string(chunk)
    linesPool.Put(chunk) //put the chunk back into the pool

    //split the string by "\n", so that we have a slice of log lines
    logsSlice := strings.Split(logs, "\n")
    stringPool.Put(logs) //put the string back into the pool

    chunkSize := 100 //process a bunch of 100 logs per goroutine
    n := len(logsSlice)
    noOfThread := n / chunkSize
    if n%chunkSize != 0 { //account for the last, partial bunch
        noOfThread++
    }

    length := len(logsSlice)
    //traverse the chunk, one bunch of chunkSize lines per goroutine
    for i := 0; i < noOfThread; i++ {

        wg2.Add(1)
        //process each bunch of lines in a separate goroutine
        go func(s int, e int) {
            defer wg2.Done() //deferred so an early return cannot skip it and deadlock wg2.Wait()
            for i := s; i < e; i++ {
                text := logsSlice[i]
                if len(text) == 0 {
                    continue
                }

                logParts := strings.SplitN(text, ",", 2)
                logCreationTimeString := logParts[0]
                logCreationTime, err := time.Parse("2006-01-02T15:04:05.0000Z", logCreationTimeString)
                if err != nil {
                    fmt.Printf("\n Could not parse the time :%s for log : %v", logCreationTimeString, text)
                    return
                }
                //check if the log's timestamp is in between our desired period
                if logCreationTime.After(start) && logCreationTime.Before(end) {
                    fmt.Println(text)
                }
            }
        }(i*chunkSize, int(math.Min(float64((i+1)*chunkSize), float64(length))))
        //passing the start and end indexes of the bunch
    }
    wg2.Wait() //wait for all the bunches of this chunk to finish
    logsSlice = nil
}

The above code was benchmarked against a 16 GB log file.

The time taken to extract the logs is around 25 seconds.
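If you want to reproduce a measurement like this, the simplest approach is to wrap the whole extraction in time.Since (extractLogs here is a hypothetical wrapper around the chunked read loop shown earlier):

s := time.Now()
extractLogs(fileName, start, end) //hypothetical wrapper around the chunked read loop
fmt.Println("time taken to extract the logs:", time.Since(s))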

The entire working code is given below.

You can reach out to me on ohm.patel1997@gmail.com.

Any queries and improvements are most welcome.😉

You can also comment below with any further doubts, and applause is always welcome.😁🙈
