2017-02-27 15 views
9

TensorFlow OOM on GPU

I'm training an LSTM-RNN in TensorFlow on some music data and have run into a problem with GPU memory allocation that I don't understand: I get an OOM error when there actually seems to be just about enough VRAM still available. Some background: I'm working on Ubuntu Gnome 16.04 with a GTX 1060 6GB, an Intel Xeon E3-1231 V3 and 8GB of RAM.

So first, here is the part of the error message that I think I understand. I'll add the whole error message again at the end for anyone who asks for it in order to help:

I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 5484118016
InUse: 4670717952
MaxInUse: 5484118016
NumAllocs: 29
MaxAllocSize: 1612612352

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000]

So what I read from this is that there is a maximum of 5484118016 bytes to be allocated, 4670717952 bytes are already in use, and another 775.72 MB = 775720000 bytes are to be allocated. 5484118016 bytes - 4670717952 bytes - 775720000 bytes = 37680064 bytes according to my calculator. So there should still be 37 MB of free VRAM even after allocating the space for the new tensor it wants to push there. That also seems quite legit to me, since TensorFlow probably (I guess?) wouldn't try to allocate more VRAM than is still available, and would keep the rest of the data on hold in RAM or something.
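As a cross-check of the arithmetic above, here is a small sketch that redoes the calculation using only numbers copied from the allocator log in this post. It assumes that "MiB" in the log means 2**20 bytes and that the tensor is float32 (the node in the traceback is listed with T=DT_FLOAT). Under those assumptions the total free memory roughly matches the request, but it is split across the two "Free at ..." chunks, and neither chunk is large enough on its own.

```python
# Figures copied from the allocator log in this question.
limit = 5484118016                      # "Limit"
in_use = 4670717952                     # "InUse"
free_chunks = [659049984, 154350080]    # the two "Free at ..." entries

MIB = 1024 * 1024                       # assumption: the log's MiB is 2**20 bytes

# Size of the tensor being allocated: shape [14525, 14000], 4 bytes per float32.
request = 14525 * 14000 * 4

free_total = limit - in_use
print(free_total, round(free_total / MIB, 2))      # 813400064 775.72
print(request, round(request / MIB, 2))            # 813400000 775.72
print(sum(free_chunks) == free_total)              # True
print([round(c / MIB, 2) for c in free_chunks])    # [628.52, 147.2]

# A single tensor needs one contiguous region, so check each free chunk alone.
fits_in_one_chunk = any(chunk >= request for chunk in free_chunks)
print(fits_in_one_chunk)                           # False
```

The two free chunks are not adjacent in the chunk listing (in-use chunks sit between them), which would be consistent with the allocator being unable to merge them into one region.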

Now I guess there is just some major mistake in my thinking, but I'd be very thankful if someone could explain to me what that mistake is. The obvious strategy for solving my problem is to make my batches a bit smaller; at about 1.5 GB each they are probably too big. Still, I'd love to know what the actual problem is.
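For what it's worth, the tensor named in the error is not a batch but the weight matrix of the LSTM cell: the traceback goes through `_linear([inputs, h], 4 * self._num_units, ...)`, which creates "weights" with shape [total_arg_size, output_size]. The sketch below is only an inference from that traceback and from the shape [14525, 14000]; the helper names are made up for illustration and `num_units` and `input_depth` are derived values, not something stated in the question.

```python
def lstm_weight_shape(input_depth, num_units):
    """Shape of the fused weight matrix built by _linear([inputs, h], 4 * num_units)."""
    return (input_depth + num_units, 4 * num_units)


def float32_bytes(shape):
    """Bytes needed for a float32 tensor of the given 2-D shape."""
    rows, cols = shape
    return rows * cols * 4


# Inferred from shape [14525, 14000] in the error message:
# 4 * num_units = 14000 and input_depth + num_units = 14525.
num_units = 14000 // 4
input_depth = 14525 - num_units

shape = lstm_weight_shape(input_depth, num_units)
print(num_units, input_depth, shape)    # 3500 11025 (14525, 14000)
print(float32_bytes(shape))             # 813400000

# Hypothetical comparison: the same layer with half as many units.
print(float32_bytes(lstm_weight_shape(input_depth, num_units // 2)))    # 357700000
```

The batch size does not appear anywhere in this formula, so for this particular tensor the size is driven by the input depth and the number of LSTM units.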

Edit: I found something telling me to try:

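import tensorflow as tf  # TensorFlow 1.x API

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'  # ask for the BFC GPU allocator
with tf.Session(config=config) as s:
    pass  # build and run the graph here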

That still doesn't work, and since the TensorFlow documentation lacks any explanation of what

gpu_options.allocator_type = 'BFC'

would do, I'd love to ask you.

Sorry for the long copy/paste, but maybe someone will need or want to see it.

Thank you very much in advance, Leon

For anyone interested, here is the rest of the error message:

(gputensorflow) [email protected]:~/Tensorflow$ python Netzwerk_v0.5.1_gamma.py 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1060 6GB 
major: 6 minor: 1 memoryClockRate (GHz) 1.7335 
pciBusID 0000:01:00.0 
Total memory: 5.93GiB 
Free memory: 5.40GiB 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0) 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728):  Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456):  Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin. 
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 775.72MiB was 256.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev: Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next: Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000000 of size 1280 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000500 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000600 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e100 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e200 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208018f00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019000 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019100 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387d1100 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387dec00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b11e00 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cb00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cc00 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cd00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102722d4d00 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b615a00 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620700 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620800 of size 256 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620900 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102abdd8900 of size 813400064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc590900 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc59e400 of size 56064 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc5abf00 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102e58df100 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec12300 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec1d000 of size 44288 
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec27d00 of size 1612612352 
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1024ae4ff00 of size 659049984 
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x102722e2800 of size 154350080 
I tensorflow/core/common_runtime/bfc_allocator.cc:693]  Summary of in-use Chunks by size: 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB 
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:     5484118016 
InUse:     4670717952 
MaxInUse:    5484118016 
NumAllocs:      29 
MaxAllocSize:   1612612352 

W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx 
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state. 
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000] 
Traceback (most recent call last): 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call 
    return fn(*args) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn 
    status, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/contextlib.py", line 66, in __exit__ 
    next(self.gen) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status 
    pywrap_tensorflow.TF_GetCode(status)) 
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "Netzwerk_v0.5.1_gamma.py", line 171, in <module> 
    session.run(tf.global_variables_initializer()) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run 
    run_metadata_ptr) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run 
    feed_dict_string, options, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run 
    target_list, options, run_metadata) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call 
    raise type(e)(node_def, op, message) 
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 

Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at: 
    File "Netzwerk_v0.5.1_gamma.py", line 94, in <module> 
    initial_state=initial_state, time_major=False)  # time_major = FALSE currently 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 545, in dynamic_rnn 
    dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 712, in _dynamic_rnn_loop 
    swap_memory=swap_memory) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2626, in while_loop 
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2459, in BuildLoop 
    pred, body, original_loop_vars, loop_vars, shape_invariants) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2409, in _BuildLoop 
    body_result = body(*packed_vars_for_body) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 697, in _time_step 
    (output, new_state) = call_cell() 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 683, in <lambda> 
    call_cell = lambda: cell(input_t, state) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 179, in __call__ 
    concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 747, in _linear 
    "weights", [total_arg_size, output_size], dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable 
    custom_getter=custom_getter) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable 
    custom_getter=custom_getter) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable 
    validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter 
    caching_device=caching_device, validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable 
    validate_shape=validate_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__ 
    expected_shape=expected_shape) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args 
    initial_value(), name="initial_value", dtype=dtype) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda> 
    shape.as_list(), dtype=dtype, partition_info=partition_info) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py", line 360, in __call__ 
    dtype, seed=self.seed) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 246, in random_uniform 
    return math_ops.add(rnd * (maxval - minval), minval, name=name) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add 
    result = _op_def_lib.apply_op("Add", x=x, y=y, name=name) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op 
    op_def=op_def) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op 
    original_op=self._default_original_op, op_def=op_def) 
    File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__ 
    self._traceback = _extract_stack() 

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000] 
    [[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]] 
+0

I ran into this problem recently, hitting the resource-exhausted error in the middle of training. I followed https://github.com/tensorflow/tensorflow/issues/4735 and solved the problem by reducing the validation batch size. – RyanLiu

Answers

3

Try taking a look at this:

Be careful not to run the evaluation and training binary on the same GPU or else you might run out of memory. Consider running the evaluation on a separate GPU if available, or suspending the training binary while running the evaluation on the same GPU.

https://www.tensorflow.org/tutorials/deep_cnn

1

I solved this problem by reducing the batch size to batch_size=52. The only way to reduce memory usage is to reduce batch_size.

The right batch_size depends on your GPU, the size of its VRAM, cache memory, etc.
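Since the workable batch size depends on the card, one way to find it is to halve it until a step succeeds. The following is a minimal sketch of that idea, not anything TensorFlow provides: `largest_fitting_batch_size`, `run_step` and `fake_step` are made-up names, and the fake step only simulates an OOM. With TensorFlow you would presumably pass `oom_errors=(tf.errors.ResourceExhaustedError,)` and a `run_step` that runs one real training step.

```python
def largest_fitting_batch_size(run_step, start, minimum=1, oom_errors=(MemoryError,)):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except oom_errors:
            batch_size //= 2    # try again with half the batch
    raise RuntimeError("even the minimum batch size does not fit")


# Stand-in for a real training step: pretends that anything above
# 60 samples per batch runs out of memory.
def fake_step(batch_size):
    if batch_size > 60:
        raise MemoryError("simulated OOM")


print(largest_fitting_batch_size(fake_step, start=416))    # 52
```

Halving keeps the number of attempts small; a finer search between the last failing and first working size is possible but rarely worth the extra runs.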

See also this: Another Stack Overflow Link

0

When you run into an OOM on the GPU, I believe changing the batch size is the right option to try first.

For different GPUs you may need different batch sizes, depending on the GPU memory.

I recently faced a similar problem and tweaked a lot of things while running different kinds of experiments.

Here is the link to the question (it also includes some tricks).

However, when you reduce the batch size you may find that your training gets slower. So if you have multiple GPUs, you can use them. To check on your GPUs, you can type this in the terminal:

nvidia-smi 

It will show you the necessary information about your GPU rack.
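If you want the same numbers from inside a script, here is a rough sketch that calls nvidia-smi with its CSV query mode and parses the result. The function names are made up for illustration, and it assumes nvidia-smi is on the PATH and reports plain integers for memory; if the tool is missing or the call fails it simply returns an empty list.

```python
import csv
import io
import shutil
import subprocess

QUERY = "index,name,memory.used,memory.total"


def parse_gpu_memory(csv_text):
    """Parse output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue    # skip blank lines
        index, name, used, total = row
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus


def query_gpu_memory():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
            check=True, capture_output=True, text=True)
        return parse_gpu_memory(result.stdout)
    except (subprocess.CalledProcessError, OSError, ValueError):
        return []


for gpu in query_gpu_memory():
    free = gpu["total_mib"] - gpu["used_mib"]
    print("GPU {index} ({name}): {free} MiB free".format(free=free, **gpu))
```

Checking this right before starting a training run also shows whether another process (for example an evaluation job) is already holding memory on the same GPU.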
