TensorFlow ResourceExhaustedError after first batch
Summary and test cases
The core issue is that TensorFlow throws an OOM allocation error on a batch that is not the first one, which is where I would have expected it. I therefore believe there is a memory leak, since apparently not all memory is being freed after each batch.
num_units: 50, batch_size: 1000; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 800; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 750; fails OOM (gpu) after 10th batch (???)
num_units: 50, batch_size: 500; fails OOM (gpu) after 90th batch (???)
num_units: 50, batch_size: 300; fails OOM (gpu) after 540th batch (???)
num_units: 50, batch_size: 200; computer freezes after around 900 batches with 100% ram use
num_units: 50, batch_size: 100; passes 1 epoch -- may fail later (unknown)
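To put the GPU OOM cases above on a common scale, here is a small sketch that multiplies each batch size by the number of batches completed before the failure. The numbers are copied from the list above; the names `oom_cases` and `examples_before_failure` are only illustrative, and the batch_size 200 and 100 cases are left out because they did not end in a GPU OOM.

```python
# Reported GPU OOM cases: batch_size -> batches completed before failure.
# Values are copied from the test cases above; names are illustrative.
oom_cases = {750: 10, 500: 90, 300: 540}


def examples_before_failure(cases):
    """Return {batch_size: batch_size * batches_completed} for each case."""
    return {bs: bs * n for bs, n in cases.items()}


totals = examples_before_failure(oom_cases)
for bs in sorted(totals, reverse=True):
    print(f"batch_size={bs}: ~{totals[bs]} examples before OOM")
```

The totals come out as 7,500, 45,000 and 162,000 examples, so the failure does not seem to be tied to a fixed number of examples processed.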
Explanation:
Basically, with a batch size of 500 it runs 144 batches before failing on the 145th, which seems strange. If it cannot allocate enough memory for the 145th batch, why should it work for the first 144? The behavior is reproducible.
Note that the batches do vary in size: each one has dimensions [BATCH_SIZE, MAX_SEQUENCE_LENGTH], and the maximum sequence length depends on which sequences were sampled. However, the program does not fail on the largest batch; it fails later, on a smaller one. I therefore conclude that a single oversized batch is not causing the memory error; it appears to be a memory leak.
With a larger batch size, the program fails earlier; with a smaller batch size, it fails later.
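As a rough sanity check on the idea that no single batch is too large, the sketch below estimates the size of one dense tensor from its shape. It assumes 4 bytes per element, since the error message reports DT_FLOAT (32-bit floats); the helper name `tensor_bytes` is made up for illustration. The shape [500, 80] is the one named in the OOM message.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # DT_FLOAT in the error message is a 32-bit float


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of one dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


failing = tensor_bytes([500, 80])  # shape from the OOM message
print(f"{failing} bytes (~{failing / 1024:.1f} KiB)")
```

The allocation that fails is only 160,000 bytes (roughly 156 KiB). That is consistent with the allocator already being nearly full by that step, rather than with one oversized request, though it does not by itself show where the memory went.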
The full error is here (from models.py):
Traceback (most recent call last):
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 89, in <module>
_ = sess.run([update_step])
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Caused by op 'decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul', defined at:
File "/home/nave01314/IdeaProjects/tf-nmt/main.py", line 49, in <module>
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 309, in dynamic_decode
swap_memory=swap_memory)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2819, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2643, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2593, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 254, in body
decoder_finished) = decoder.step(time, inputs, state)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 138, in step
cell_outputs, cell_state = self._cell(inputs, state)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 290, in __call__
return base_layer.Layer.__call__(self, inputs, state, scope=scope)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 618, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 567, in call
array_ops.concat([inputs, h], 1), self._kernel)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1993, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2532, in _mat_mul
name=name)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
op_def=op_def)
File "/home/nave01314/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Code snippet
import tensorflow as tf
from tensorflow.python.layers import core as layers_core


class NMTModel:
    def __init__(self, hparams, iterator, mode):
        source, target_in, target_out, source_lengths, target_lengths = iterator.get_next()
        true_batch_size = tf.size(source_lengths)

        # Lookup embeddings
        embedding_encoder = tf.get_variable("embedding_encoder", [hparams.src_vsize, hparams.src_emsize])
        encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, source)
        embedding_decoder = tf.get_variable("embedding_decoder", [hparams.tgt_vsize, hparams.tgt_emsize])
        decoder_emb_inp = tf.nn.embedding_lookup(embedding_decoder, target_in)

        # Build and run Encoder LSTM
        encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_emb_inp, sequence_length=source_lengths, dtype=tf.float32)

        # Build and run Decoder LSTM with Helper and output projection layer
        decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        projection_layer = layers_core.Dense(hparams.tgt_vsize, use_bias=False)
        # if mode == 'TRAIN' or mode == 'EVAL':  # then decode using TrainingHelper
        #     helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inp, sequence_length=target_lengths)
        # elif mode == 'INFER':  # then decode greedily
        #     helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
        outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=tf.reduce_max(target_lengths))
        logits = outputs.rnn_output

        # Compare mode strings with ==, not `is` (identity comparison on strings is unreliable)
        if mode == 'TRAIN' or mode == 'EVAL':  # then calculate loss
            crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_out, logits=logits)
            target_weights = tf.sequence_mask(target_lengths, maxlen=tf.shape(target_out)[1], dtype=logits.dtype)
            self.loss = tf.reduce_sum((crossent * target_weights)) / tf.cast(true_batch_size, tf.float32)

        if mode == 'TRAIN':  # then calculate/clip gradients, then optimize model
            params = tf.trainable_variables()
            gradients = tf.gradients(self.loss, params)
            clipped_gradients, _ = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)
            optimizer = tf.train.AdamOptimizer(hparams.l_rate)
            self.update_step = optimizer.apply_gradients(zip(clipped_gradients, params))

        if mode == 'EVAL' or mode == 'INFER':  # then allow access to input/output tensors to printout
            self.src = source
            self.tgt = target_out
            self.preds = tf.argmax(logits, axis=2)

        # Designate a saver operation
        self.saver = tf.train.Saver(tf.global_variables())

    def train(self, sess):
        return sess.run([self.update_step, self.loss])

    def eval(self, sess):
        return sess.run([self.loss, self.src, self.tgt, self.preds])

    def infer(self, sess):
        return sess.run([self.src, self.tgt, self.preds])  # tgt should not exist (temporary debugging only)
Full code (very similar to the NMT tutorial, simplified).
The model code is in models.py, the iterator code is in data_pipeline.py, and the main script is main.py.
https://github.com/nave01314/tf-nmt
Can you debug at the sess.run([update_step]) line in the PyCharm IDE? Run it twice and check the available variables. I am expecting something to grow in size (especially in the optimizer).
https://imgur.com/a/t8w1o Here is a picture of the optimizer variables. I'm not quite sure what to look for here.
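One way to act on the suggestion above without an IDE is to record some size measurement after every training step and see whether it keeps rising. The sketch below is generic Python rather than TensorFlow code: `track_growth`, `step_fn`, `probe` and `leaky_step` are hypothetical names, `step_fn` stands in for a call such as sess.run([update_step]), and the default probe uses `tracemalloc`, which only sees Python-level allocations on the host (not GPU memory or TensorFlow's native allocations).

```python
import tracemalloc


def python_heap_bytes():
    """Bytes currently allocated by Python objects, as seen by tracemalloc."""
    current, _peak = tracemalloc.get_traced_memory()
    return current


def track_growth(step_fn, num_steps, probe=python_heap_bytes):
    """Run step_fn num_steps times and return the probe reading after each."""
    readings = []
    for _ in range(num_steps):
        step_fn()
        readings.append(probe())
    return readings


# Stand-in for a leaky training step (an assumption for illustration):
# it keeps a reference to new data on every call, so readings keep rising.
_retained = []


def leaky_step():
    _retained.append(bytearray(100_000))


tracemalloc.start()
readings = track_growth(leaky_step, num_steps=5)
tracemalloc.stop()
print(readings[-1] - readings[0])  # positive when memory keeps growing
```

For a TensorFlow 1.x session, a probe such as `lambda: len(sess.graph.get_operations())` would show whether new ops are being added to the graph on every iteration, and calling `sess.graph.finalize()` before the loop makes any such addition raise an error. Whether that is what happens in this program is not established by the question.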