Reinforcement Learning — Dots and Lines — Snake — 2/3

Ajith Kumar V
3 min read · Dec 24, 2022


This blog is a continuation of my previous post on developing RL agents with Snake. In this part (2/3) we will look at how to build an RL environment on top of the basic game.

To develop an RL agent we can choose to let it move in a discrete or a continuous manner. In the discrete case the agent has only four possible moves: up, down, right and left. The agent should reach the goal using these moves, so we declare the action space as Discrete(4). The observation space is the input from which the agent decides which step to take. We feed in the snake's head position (head_x, head_y), the food position relative to the head (apple_delta_x, apple_delta_y), the snake's score and the previous moves, giving an observation vector of length 5 + FOOD_FETCH_GOAL.

    def __init__(self):
        super(SnakeEnv, self).__init__()
        # Define action and observation space
        # Using discrete actions
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(
            low=-500, high=500, shape=(5 + FOOD_FETCH_GOAL,), dtype=np.float64)
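
These snippets assume a small preamble that is not shown in the excerpts: the gym/spaces interface, OpenCV for drawing, and a FOOD_FETCH_GOAL constant that sets how many previous moves the agent sees. A minimal sketch of that preamble follows; the gym flavour and the exact constant value are assumptions, with 15 chosen to match the 15 previous moves used later.

# Assumed preamble for the snippets in this post (not shown in the excerpts).
# The gym flavour (gym vs gymnasium) and FOOD_FETCH_GOAL value are assumptions.
import copy
import random
from collections import deque

import numpy as np
import cv2 as cv
import gym
from gym import spaces

FOOD_FETCH_GOAL = 15  # how many previous moves are included in the observation


class SnakeEnv(gym.Env):
    ...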

Once the class is initialized, we call the reset() method to set the desired parameters. self.done denotes whether the episode has finished and self.img is the playground, a 512x512-pixel black screen.

    def reset(self):
        self.done = False
        self.img = np.zeros((512, 512, 3), dtype="uint8")

Then we place the food at a random location and initialize the score, the reward, the snake's position and the snake's starting direction.

        # Food
        self.foodLoc = [random.randint(1, 511), random.randint(1, 511)]
        cv.rectangle(self.img, self.foodLoc, self.foodLoc, (0, 255, 0), 10)
        self.score = 0
        self.reward = 0
        self.dot = [410, 320]
        self.img = snakeBody(self.dot, self.img)
        self.key = 0  # default key: left

Next we set the observation parameters: the snake's head position (head_x, head_y), the food position relative to the head (apple_delta_x, apple_delta_y), snake_score and the previous moves (the last FOOD_FETCH_GOAL = 15 actions, initialized to -1).

        # observation
        # head_x, head_y, apple_delta_x, apple_delta_y, snake_score, previous_moves
        head_x = self.dot[0]
        head_y = self.dot[1]
        apple_delta_x = head_x - self.foodLoc[0]
        apple_delta_y = head_y - self.foodLoc[1]
        snake_score = self.score
        self.prev_actions = deque(maxlen=FOOD_FETCH_GOAL)
        for _ in range(FOOD_FETCH_GOAL):
            self.prev_actions.append(-1)

        self.observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_score] + list(self.prev_actions)
        self.observation = np.array(self.observation)
        return self.observation
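
With FOOD_FETCH_GOAL at 15, the observation returned by reset() is a flat vector of length 20 (five state values plus fifteen previous actions). A quick, hypothetical sanity check of the reset path:

env = SnakeEnv()
obs = env.reset()
print(obs.shape)  # (20,) when FOOD_FETCH_GOAL == 15
print(obs[:5])    # head_x, head_y, apple_delta_x, apple_delta_y, snake_score
print(obs[5:])    # previous moves, all -1 immediately after reset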

After reset(), the step() function is executed until the episode ends. Each action is appended to the self.prev_actions deque and, based on the action, the snake is moved left, right, up or down. Next we detect whether the snake has collided with the boundaries or whether the food has been fetched; if the food is fetched we increase the score and set the currFoodFetch flag.

    def step(self, action):
        self.prev_actions.append(action)
        cv.imshow("The Slytherin Dot game", self.img)
        self.key = action
        previousArr = copy.deepcopy(self.dot)

        # Possible actions: 0 = left, 1 = right, 2 = up, 3 = down
        if self.key == 0:
            self.dot[0] = self.dot[0] - 10
        elif self.key == 1:
            self.dot[0] = self.dot[0] + 10
        elif self.key == 2:
            self.dot[1] = self.dot[1] - 10
        elif self.key == 3:
            self.dot[1] = self.dot[1] + 10

        # Detect collision with the boundaries
        if dtCollisionBoundaries(self.dot):
            self.done = True
            print("collision with boundaries")

        # Check if the snake has found the food
        currFoodFetch = False
        if dtFood(self.dot, self.foodLoc):
            # Todo: Add time limit for food fetch
            self.score += 1
            currFoodFetch = True
            print("Food fetched")

        self.img = np.zeros((512, 512, 3), dtype="uint8")
        cv.rectangle(self.img, self.foodLoc, self.foodLoc, (0, 255, 0), 10)

        self.img = snakeBody(self.dot, self.img)
        cv.waitKey(150)
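
The step() code leans on three helpers, snakeBody, dtCollisionBoundaries and dtFood, that live elsewhere in the repository. A rough sketch of what they might look like is below; the 10-pixel dot size, the 512x512 bounds and the food-proximity threshold are assumptions rather than the original implementation.

def snakeBody(dot, img):
    # Draw the snake's head as a thick point (roughly a 10-pixel square)
    cv.rectangle(img, tuple(dot), tuple(dot), (255, 255, 255), 10)
    return img

def dtCollisionBoundaries(dot):
    # True when the head leaves the 512x512 playground (assumed bounds)
    return dot[0] < 0 or dot[0] > 511 or dot[1] < 0 or dot[1] > 511

def dtFood(dot, foodLoc):
    # True when the head is within ~10 pixels of the food on both axes (assumed threshold)
    return abs(dot[0] - foodLoc[0]) < 10 and abs(dot[1] - foodLoc[1]) < 10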

Based on the Euclidean distance between the snake and the food we give a positive or negative reward. If the snake has collided with the boundaries we award a -20 penalty; if the current food is fetched we award a large positive reward and continue the episode; otherwise the reward reflects whether the move brought the snake closer to the food. Finally the snake's current position and its offset from the food are collected and returned as the observation. The GitHub link for the complete code is here.

        # Using distance parameters and awarding reward
        currDistToFood = np.linalg.norm(np.array(self.dot) - np.array(self.foodLoc))
        prevDistToFood = np.linalg.norm(np.array(previousArr) - np.array(self.foodLoc))

        if self.done:
            # For colliding with boundaries
            self.reward -= 20
        elif currFoodFetch:
            # Fetching the current food
            self.reward += (self.score * 100)
        elif currDistToFood < prevDistToFood:
            # Staying alive and moving towards the food
            self.reward += 1
        else:
            # Staying alive and moving away from the food
            self.reward -= 1

        # head_x, head_y, apple_delta_x, apple_delta_y, snake_score, previous_moves
        head_x = self.dot[0]
        head_y = self.dot[1]
        apple_delta_x = head_x - self.foodLoc[0]
        apple_delta_y = head_y - self.foodLoc[1]
        snake_score = self.score
        self.observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_score] + list(self.prev_actions)
        self.observation = np.array(self.observation)
        if currFoodFetch:
            self.foodLoc = [random.randint(1, 511), random.randint(1, 511)]
        info = {}
        return self.observation, self.reward, self.done, info
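
Before handing this environment to a training library it is worth driving it with random actions for a few episodes to check that episodes terminate and rewards behave as intended. A minimal, hypothetical smoke test (step() renders with OpenCV and waits 150 ms per frame, so it runs slowly but lets you watch the dot move):

env = SnakeEnv()
for episode in range(3):
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # random move in {0, 1, 2, 3}
        obs, reward, done, info = env.step(action)
    print(f"episode {episode} done, score={env.score}, reward={env.reward}")
cv.destroyAllWindows()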

In the next blog (3/3) we will look at how to train the agent. The link for the next post is here.
