TY - GEN
T1 - An Efficient Deep Q-learning Strategy for Sequential Decision-making in Game-playing
AU - Chang, Oscar
AU - Morocho-Cayamcela, Manuel Eugenio
AU - Pineda, Israel
AU - Cardenas, Kevin
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - This paper presents a deep reinforcement learning model that efficiently learns a sequential decision-making policy to play tic-tac-toe intelligently, directly from high-dimensional video input. To produce a stable, sparse neural representation of the states of the tic-tac-toe board, a pre-trained convolutional neural network is used, followed by a fully connected sigmoidal network. This assembly behaves as a Q-matrix and produces the final state-decision pairs that control a robotic arm placing physical tokens on the board. The hyperparameters of the whole network are tuned to produce a stable, trainable array of elements. An internal clock composed of internal neurons is integrated to give the agent a sense of sequential timing. To solve the max(·) function, a novel algorithm is introduced to search the Q-network values. The algorithm uses a dedicated sigmoidal net initialized with random parameters. Under backpropagation, it iteratively converges to a stable plateau that mimics the all-zeros condition of an initial Q-matrix. The agent then uses Bellman's reinforcement principles to learn an optimal policy with a noticeable look-ahead capability. Computer simulations driving a physical robot confirmed the convergence and effectiveness of the proposed methodology and demonstrated a marked ability in sequential decision-making, taking raw video frames as input.
KW - Artificial intelligence
KW - deep Q-learning
KW - game-playing
KW - reinforcement learning
KW - sequential decision-making
UR - http://www.scopus.com/inward/record.url?scp=85151326273&partnerID=8YFLogxK
U2 - 10.1109/ICI2ST57350.2022.00032
DO - 10.1109/ICI2ST57350.2022.00032
M3 - Conference contribution
AN - SCOPUS:85151326273
T3 - Proceedings - 3rd International Conference on Information Systems and Software Technologies, ICI2ST 2022
SP - 172
EP - 177
BT - Proceedings - 3rd International Conference on Information Systems and Software Technologies, ICI2ST 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd International Conference on Information Systems and Software Technologies, ICI2ST 2022
Y2 - 8 November 2022 through 10 November 2022
ER -