OCR: Part 1 — Dataset Generation

Vijendra Singh
Jul 14, 2018


OCR is a useful concept with an enormous number of applications. This post will get you started with OCR. Most strong OCR architectures use a CNN followed by an RNN: the CNN extracts features from the image, which are then fed to the RNN to produce the final output (the text in your image).

When I decided to learn OCR, I chose not to jump directly to the best available architectures, because I would have missed a big part of the learning in between. The strategy I followed was to first perform OCR on a self-generated dataset using only a CNN, and then move to the full-fledged architecture (CNN + RNN). Another thing I usually do is implement everything from scratch using a low-level ML framework like TensorFlow. The reasons behind this strategy are:

  1. A CNN alone is much simpler and easier to debug than CNN + RNN, so building a good CNN first ensures that the first half of your code is solid. By the time you are done with it, the initial setup will also be in place: well-tested basic custom functions, a rather simple generated dataset (optional), dataset pre-processing, conversion of the dataset to whichever file format suits you best, and a pipeline to feed data to your network.
  2. While working with just the CNN, you will get a pretty good idea of what is actually going on inside the CNN with your images, and of what will eventually be fed to the RNN in the final architecture.
  3. Being well aware of each and every step of your implementation gives you the freedom to experiment more confidently, which leads to more effective learning.

In the first two parts of this series we will perform OCR using just a CNN on random strings (without any segmentation); in later parts we will segment the data and use a CNN to classify each character, and in the end a full-fledged OCR architecture will be discussed. Part 1 focuses mainly on custom data generation and getting the data into the desired file format. Here we will create a very simple dataset to recognize with a CNN: random strings of 3 to 8 characters, drawn from all letters (lowercase and uppercase), the digits, and the space character. Let's get started!
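That character set works out to 63 classes (52 letters + 10 digits + 1 space), a number we will need again later as class_count. A minimal sketch to confirm it (char_list mirrors the list built inside the generator below):

import string

# 52 letters + 10 digits + 1 space = 63 label classes
char_list = list(string.ascii_letters) + list(string.digits) + [' ']
print(len(char_list))  # 63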

Generate dataset

In order to generate a dataset you need to have two things:

  1. A function that generates images containing random text of random size, thickness, and font (if you want to increase the complexity of your dataset; otherwise you could use just one font). Sample code is given below; you can modify it to your heart's content if you want to generate an even more complex dataset.
import string
import cv2
import numpy as np
from datetime import datetime as dt

def gen_rand_string_data(data_count,
                         min_char_count = 3,
                         max_char_count = 8,
                         max_char = 8,   # label padding length; must match the reader below
                         x_pos = 'side',
                         img_size = (32, 256, 1),
                         font = [cv2.FONT_HERSHEY_SIMPLEX],
                         font_scale = np.arange(0.7, 1, 0.1),
                         thickness = range(1, 3, 1)):
    '''
    random string data generation
    '''
    start_time = dt.now()
    images = []
    labels = []
    color = (255, 255, 255)
    count = 0
    char_list = list(string.ascii_letters) \
                + list(string.digits) \
                + list(' ')
    while(1):
        for fs in font_scale:
            for thick in thickness:
                for f in font:
                    # draw a random string on a blank image
                    img = np.zeros(img_size, np.uint8)
                    char_count = np.random.randint(min_char_count,
                                                   (max_char_count + 1))
                    rand_str = ''.join(np.random.choice(char_list,
                                                        char_count))
                    # generate image data
                    text_size = cv2.getTextSize(rand_str, f, fs, thick)[0]
                    if(x_pos == 'side'):
                        org_x = 0
                    else:
                        org_x = (img_size[1] - text_size[0]) // 2
                    org_y = (img_size[0] + text_size[1]) // 2
                    cv2.putText(img, rand_str, (org_x, org_y), f, fs,
                                color, thick, cv2.LINE_AA)
                    # pad the label with spaces up to max_char, then
                    # encode each character as its index in char_list
                    label = list(rand_str) \
                            + [' '] * (max_char - len(rand_str))
                    for i, t in enumerate(label):
                        label[i] = char_list.index(t)
                    label = np.uint8(label)
                    images.append(img)
                    labels.append(label)
                    count += 1
                    if count == data_count:
                        break
                else: continue
                break
            else: continue
            break
        else: continue
        break
    end_time = dt.now()
    print("time taken to generate data", end_time - start_time)
    return images, labels
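To make sure the label encoding round-trips, you can generate a few samples and decode one label back into its string. A quick sketch (note that stripping trailing spaces will also trim any real trailing spaces the random string happened to contain):

images, labels = gen_rand_string_data(4)
char_list = list(string.ascii_letters) + list(string.digits) + [' ']
# map label indices back to characters and strip the space padding
print(''.join(char_list[i] for i in labels[0]).rstrip())
print(images[0].shape)  # (32, 256, 1)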

2. A function that writes your generated images into TFRecords. The following code will do the job for you:

import tensorflow as tf

def _bytes_feature(value):
    # wrap a byte string in a tf.train.Feature
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecords(all_features, all_labels, file):
    '''
    write data to a tfrecords file
    '''
    start_time = dt.now()
    writer = tf.python_io.TFRecordWriter(file)
    for features, labels in zip(all_features, all_labels):
        # store the raw bytes of each label and image
        feature = {'labels': _bytes_feature(tf.compat.as_bytes(
                       np.array(labels).tostring())),
                   'images': _bytes_feature(tf.compat.as_bytes(
                       np.array(features).tostring()))}
        example = tf.train.Example(
            features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()
    end_time = dt.now()
    print("time taken to write data", end_time - start_time)

With the above two functions, you can generate the dataset. Sample code for doing so is given below:

folder_path = '<path of folder where you want to store tfrecords>'  # replace with your own path
file_count = 2
train_data_count = 8192
test_data_count = 2048
print('total train data =', file_count * train_data_count)
print('total test data =', file_count * test_data_count)
keyword = '3to8'
for i in range(file_count):
    index = i + 1
    train_filename = folder_path + "train_" + keyword + "_%d.tfrecords" % index
    test_filename = folder_path + "test_" + keyword + "_%d.tfrecords" % index
    print('generating train file number %d' % index)
    images, labels = gen_rand_string_data(train_data_count)
    write_tfrecords(images, labels, train_filename)
    print('train file number %d generated' % index)
    print('generating test file number %d' % index)
    images, labels = gen_rand_string_data(test_data_count)
    write_tfrecords(images, labels, test_filename)
    print('test file number %d generated' % index)

Visualize generated dataset

In order to visualize the generated data, you need functions for reading TFRecords and batching the results. You can find them below.

def read_data(file_list):
    '''
    read data from tfrecords files
    '''
    file_queue = tf.train.string_input_producer(file_list)
    feature = {'images': tf.FixedLenFeature([], tf.string),
               'labels': tf.FixedLenFeature([], tf.string)}
    reader = tf.TFRecordReader()
    _, record = reader.read(file_queue)  # read a record
    features = tf.parse_single_example(record, features=feature)
    # decode the raw bytes back into uint8 tensors
    img = tf.decode_raw(features['images'], tf.uint8)
    label = tf.decode_raw(features['labels'], tf.uint8)
    return img, label

def minibatch(batch_size, filename, file_count,
              img_size, max_char, class_count):
    '''
    create a minibatch
    (capacity, min_after_dequeue and num_of_threads are
    module-level variables, defined in the snippet below)
    '''
    file_list = [filename + '%d.tfrecords' % i
                 for i in range(1, file_count + 1)]
    img, label = read_data(file_list)
    img = tf.cast(tf.reshape(img, img_size), dtype=tf.float32)
    # one-hot encode the label: [1, max_char] -> [class_count, max_char]
    label = tf.reshape(label, [1, max_char])
    label = tf.one_hot(label, class_count, axis=1)
    label = tf.reshape(label, tf.shape(label)[1:])
    img_batch, label_batch = tf.train.shuffle_batch(
        [img, label], batch_size, capacity, min_after_dequeue,
        num_threads=num_of_threads)
    return img_batch, tf.cast(label_batch, dtype=tf.int64)

Now we will use these functions to read the TFRecords and visualize the data, its data type, and its shape. You can use the code given below for this purpose:

import matplotlib.pyplot as plt

folder_path = '<path of folder having tfrecords>'  # replace with your own path
keyword = '3to8'
train_filename = folder_path + 'train_' + keyword + '_'
test_filename = folder_path + 'test_' + keyword + '_'
file_count = 2
img_size = [32, 256, 1]
max_char = 8
class_count = 63
batch_size = 32
num_of_threads = 16
min_after_dequeue = 5000
capacity = min_after_dequeue + (num_of_threads + 1) * batch_size

with tf.Graph().as_default():
    image_batch, label_batch = minibatch(batch_size, train_filename,
                                         file_count, img_size,
                                         max_char, class_count)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        sess.run(tf.local_variables_initializer())
        # start the queue runners that feed the input pipeline
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(5):
            image_b, label_b = sess.run([image_batch, label_batch])
            if(i == 0):
                print('data type of image:', type(image_b[0][0, 0, 0]))
                print('data type of label:', type(label_b[0][0, 0]))
                print("shape of image_batch:", image_b.shape)
                print('shape of label_batch:', label_b.shape)
            plt.imshow(np.reshape(image_b[0], [32, 256]), cmap='gray')
            plt.show()
            print(sess.run(tf.transpose(label_b[0])))
        coord.request_stop()
        coord.join(threads)
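The transposed label printed above is a one-hot matrix of shape [max_char, class_count]. To read it back as text, take the argmax over the class axis and map the indices through char_list. A small sketch (label_b comes from the loop above):

# decode a one-hot label [class_count, max_char] back into a string
char_list = list(string.ascii_letters) + list(string.digits) + [' ']
indices = np.argmax(label_b[0], axis=0)  # argmax over the class axis
print(''.join(char_list[i] for i in indices))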

Result

You will get results similar to what is shown below:

[Image: sample generated data]

With this, you are all set to go ahead and implement a CNN to perform OCR. You can find the source code for this part here. In the next part of this series, we will learn how to design and train a CNN for OCR using the data generated in this part. Enjoy!
