Reinforcement Learning — Dots and Lines — Snake — 2/3

Ajith Kumar V
3 min read · Dec 24, 2022


This blog is a continuation of my previous post on developing RL agents with Snake. In this part (2/3) we will look at how to build an RL environment on top of the basic game.

To develop an RL agent we can choose to let it move in a discrete or a continuous manner. In the discrete case the agent has only four possible moves: up, down, right and left. The agent should reach the goal using these moves, so we declare the action space as Discrete(4). The observation space is the input from which the agent decides which step to take. We feed in the snake's head position (head_x, head_y), the food position relative to the head (apple_delta_x, apple_delta_y), the snake's score and the previous moves, giving an observation vector of length 5 + FOOD_FETCH_GOAL.

    def __init__(self):
        super(SnakeEnv, self).__init__()
        # Define action and observation space
        # Using discrete actions
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(
            low=-500, high=500, shape=(5 + FOOD_FETCH_GOAL,), dtype=np.float64)
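
These snippets assume a small preamble that is not shown in the excerpts: the gym/spaces interface, OpenCV for drawing, and a FOOD_FETCH_GOAL constant that sets how many previous moves the agent sees. A minimal sketch of that preamble follows; the gym flavour and the exact constant value are assumptions, with 15 chosen to match the 15 previous moves used later.

# Assumed preamble for the snippets in this post (not shown in the excerpts).
# The gym flavour (gym vs gymnasium) and FOOD_FETCH_GOAL value are assumptions.
import copy
import random
from collections import deque

import numpy as np
import cv2 as cv
import gym
from gym import spaces

FOOD_FETCH_GOAL = 15  # how many previous moves are included in the observation


class SnakeEnv(gym.Env):
    ...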

Once the class is initialized, we call the reset() method to set the desired parameters. self.done denotes whether the episode has finished and self.img is the playground, a 512x512-pixel black screen.

    def reset(self):
        self.done = False
        self.img = np.zeros((512, 512, 3), dtype="uint8")

Then we place the food at a random location and initialize the score, the reward, the snake's position and the snake's starting direction.

        # Food
        self.foodLoc = [random.randint(1, 511), random.randint(1, 511)]
        cv.rectangle(self.img, self.foodLoc, self.foodLoc, (0, 255, 0), 10)
        self.score = 0
        self.reward = 0
        self.dot = [410, 320]
        self.img = snakeBody(self.dot, self.img)
        self.key = 0  # default key: left

Next we set the observation parameters: the snake's head position (head_x, head_y), the food position relative to the head (apple_delta_x, apple_delta_y), snake_score and the previous moves (the last FOOD_FETCH_GOAL = 15 actions, initialized to -1).

        # observation
        # head_x, head_y, apple_delta_x, apple_delta_y, snake_score, previous_moves
        head_x = self.dot[0]
        head_y = self.dot[1]
        apple_delta_x = head_x - self.foodLoc[0]
        apple_delta_y = head_y - self.foodLoc[1]
        snake_score = self.score
        self.prev_actions = deque(maxlen=FOOD_FETCH_GOAL)
        for _ in range(FOOD_FETCH_GOAL):
            self.prev_actions.append(-1)

        self.observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_score] + list(self.prev_actions)
        self.observation = np.array(self.observation)
        return self.observation
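
With FOOD_FETCH_GOAL at 15, the observation returned by reset() is a flat vector of length 20 (five state values plus fifteen previous actions). A quick, hypothetical sanity check of the reset path:

env = SnakeEnv()
obs = env.reset()
print(obs.shape)  # (20,) when FOOD_FETCH_GOAL == 15
print(obs[:5])    # head_x, head_y, apple_delta_x, apple_delta_y, snake_score
print(obs[5:])    # previous moves, all -1 immediately after reset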

After reset(), the step() function is executed until the episode ends. Each action is appended to the self.prev_actions deque and, based on the action, the snake is moved left, right, up or down. Next we detect whether the snake has collided with the boundaries or whether the food has been fetched; if the food is fetched we increase the score and set the currFoodFetch flag.

    def step(self, action):
        self.prev_actions.append(action)
        cv.imshow("The Slytherin Dot game", self.img)
        self.key = action
        previousArr = copy.deepcopy(self.dot)

        # Possible actions: 0 = left, 1 = right, 2 = up, 3 = down
        if self.key == 0:
            self.dot[0] = self.dot[0] - 10
        elif self.key == 1:
            self.dot[0] = self.dot[0] + 10
        elif self.key == 2:
            self.dot[1] = self.dot[1] - 10
        elif self.key == 3:
            self.dot[1] = self.dot[1] + 10

        # Detect collision with the boundaries
        if dtCollisionBoundaries(self.dot):
            self.done = True
            print("collision with boundaries")

        # Check if the snake has found the food
        currFoodFetch = False
        if dtFood(self.dot, self.foodLoc):
            # Todo: Add time limit for food fetch
            self.score += 1
            currFoodFetch = True
            print("Food fetched")

        self.img = np.zeros((512, 512, 3), dtype="uint8")
        cv.rectangle(self.img, self.foodLoc, self.foodLoc, (0, 255, 0), 10)

        self.img = snakeBody(self.dot, self.img)
        cv.waitKey(150)
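
The step() code leans on three helpers, snakeBody, dtCollisionBoundaries and dtFood, that live elsewhere in the repository. A rough sketch of what they might look like is below; the 10-pixel dot size, the 512x512 bounds and the food-proximity threshold are assumptions rather than the original implementation.

def snakeBody(dot, img):
    # Draw the snake's head as a thick point (roughly a 10-pixel square)
    cv.rectangle(img, tuple(dot), tuple(dot), (255, 255, 255), 10)
    return img

def dtCollisionBoundaries(dot):
    # True when the head leaves the 512x512 playground (assumed bounds)
    return dot[0] < 0 or dot[0] > 511 or dot[1] < 0 or dot[1] > 511

def dtFood(dot, foodLoc):
    # True when the head is within ~10 pixels of the food on both axes (assumed threshold)
    return abs(dot[0] - foodLoc[0]) < 10 and abs(dot[1] - foodLoc[1]) < 10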

Based on the Euclidean distance between the snake and the food we give a positive or negative reward. If the snake has collided with the boundaries we award a -20 penalty; if the current food is fetched we award a large positive reward and continue the episode; otherwise the reward reflects whether the move brought the snake closer to the food. Finally the snake's current position and its offset from the food are collected and returned as the observation. The GitHub link for the complete code is here.

        # Using distance parameters and awarding reward
        currDistToFood = np.linalg.norm(np.array(self.dot) - np.array(self.foodLoc))
        prevDistToFood = np.linalg.norm(np.array(previousArr) - np.array(self.foodLoc))

        if self.done:
            # For colliding with boundaries
            self.reward -= 20
        elif currFoodFetch:
            # Fetching the current food
            self.reward += (self.score * 100)
        elif currDistToFood < prevDistToFood:
            # Staying alive and moving towards the food
            self.reward += 1
        else:
            # Staying alive and moving away from the food
            self.reward -= 1

        # head_x, head_y, apple_delta_x, apple_delta_y, snake_score, previous_moves
        head_x = self.dot[0]
        head_y = self.dot[1]
        apple_delta_x = head_x - self.foodLoc[0]
        apple_delta_y = head_y - self.foodLoc[1]
        snake_score = self.score
        self.observation = [head_x, head_y, apple_delta_x, apple_delta_y, snake_score] + list(self.prev_actions)
        self.observation = np.array(self.observation)
        if currFoodFetch:
            self.foodLoc = [random.randint(1, 511), random.randint(1, 511)]
        info = {}
        return self.observation, self.reward, self.done, info
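
Before handing this environment to a training library it is worth driving it with random actions for a few episodes to check that episodes terminate and rewards behave as intended. A minimal, hypothetical smoke test (step() renders with OpenCV and waits 150 ms per frame, so it runs slowly but lets you watch the dot move):

env = SnakeEnv()
for episode in range(3):
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # random move in {0, 1, 2, 3}
        obs, reward, done, info = env.step(action)
    print(f"episode {episode} done, score={env.score}, reward={env.reward}")
cv.destroyAllWindows()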

In the next blog (3/3) we will look at how to train the agent. The link for the next post is here.
