Numpy to TFrecords: Is there a simpler way to handle batch inputs from tfrecords?

My question is about how to get batch inputs from multiple (or sharded) tfrecords. I’ve read the example https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L410. The basic pipeline, taking the training set as an example, is: (1) generate a series of tfrecords (e.g., train-000-of-005, train-001-of-005, …); (2) build a list from these filenames and feed it into tf.train.string_input_producer to get a queue; (3) simultaneously create a tf.RandomShuffleQueue to do other stuff; (4) use tf.train.batch_join to generate batch inputs.

I think this is complex, and I’m not sure I understand the logic of this procedure. In my case, I have a list of .npy files, and I want to generate sharded tfrecords (multiple separate tfrecords, not just one single large file). Each of these .npy files contains a different number of positive and negative samples (2 classes). A basic method is to generate one single large tfrecord file, but that file is too large (~20 GB), so I resort to sharded tfrecords. Is there any simpler way to do this? Thanks.

Answers:

Method 1

The whole process is simplified using the Dataset API. Here are both parts: (1) convert the numpy array to tfrecords, and (2, 3, 4) read the tfrecords to generate batches.

1. Creation of tfrecords from a numpy array:

    import tensorflow as tf

    # The argument names below are placeholders; the original answer elided them.
    def npy_to_tfrecords(X_all, y_all, output_file):
        # Write records to a tfrecords file
        writer = tf.python_io.TFRecordWriter(output_file)

        # Loop through all the samples you want to write
        for X, y in zip(X_all, y_all):
            # Say X is a float array like np.array([[...], [...]])
            # and y is an integer label array like np.array([0]) or np.array([1])

            # Feature contains a map of string to feature proto objects
            feature = {}
            feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten()))
            feature['y'] = tf.train.Feature(int64_list=tf.train.Int64List(value=y.flatten()))

            # Construct the Example proto object
            example = tf.train.Example(features=tf.train.Features(feature=feature))

            # Serialize the example to a string
            serialized = example.SerializeToString()

            # Write the serialized object to disk
            writer.write(serialized)
        writer.close()
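Since the question asks for sharded output rather than one large file, here is a minimal sketch of how the samples could be divided among shards. It uses only the standard library and the `train-000-of-005` naming pattern mentioned in the question. The helper names `shard_filename`, `shard_ranges`, and `plan_shards` are hypothetical and not part of TensorFlow, and the contiguous, near-equal split is an assumption; one shard per .npy file would work just as well.

```python
def shard_filename(prefix, index, num_shards):
    """Return a shard name such as 'train-000-of-005' (hypothetical helper)."""
    return "{}-{:03d}-of-{:03d}".format(prefix, index, num_shards)


def shard_ranges(num_samples, num_shards):
    """Split range(num_samples) into num_shards contiguous, near-equal (start, stop) pairs."""
    base, extra = divmod(num_samples, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards each take one additional sample.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def plan_shards(prefix, num_samples, num_shards):
    """Pair each shard filename with the sample index range it should hold."""
    return [
        (shard_filename(prefix, i, num_shards), start, stop)
        for i, (start, stop) in enumerate(shard_ranges(num_samples, num_shards))
    ]


if __name__ == "__main__":
    for name, start, stop in plan_shards("train", 10, 3):
        print(name, start, stop)
```

Each planned slice could then be written with one call to a writer function like `npy_to_tfrecords` above, so that no single file has to hold the whole dataset.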

2. Read the tfrecords using the Dataset API (tensorflow >=1.2):

    # Creates a dataset that reads all of the examples from filenames.
    filenames = ["file1.tfrecord", "file2.tfrecord"]  # ..., "fileN.tfrecord"
    dataset = tf.contrib.data.TFRecordDataset(filenames)
    # for version 1.4 and above use tf.data.TFRecordDataset

    # Decode one serialized Example proto.
    # shape_of_npy_array is a placeholder: the shape of X before it was flattened.
    def _parse_function(example_proto):
        keys_to_features = {'X': tf.FixedLenFeature(shape_of_npy_array, tf.float32),
                            'y': tf.FixedLenFeature((), tf.int64, default_value=0)}
        parsed_features = tf.parse_single_example(example_proto, keys_to_features)
        return parsed_features['X'], parsed_features['y']

    # Parse the record into tensors.
    dataset = dataset.map(_parse_function)

    # Shuffle the dataset
    dataset = dataset.shuffle(buffer_size=10000)

    # Repeat the input indefinitely
    dataset = dataset.repeat()

    # Generate batches (batch_size is a placeholder for your batch size)
    dataset = dataset.batch(batch_size)

    # Create a one-shot iterator
    iterator = dataset.make_one_shot_iterator()

    # Get batch X and y
    X, y = iterator.get_next()
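With sharded files, the `filenames` list does not have to be typed by hand. The following is a small sketch, using only the standard library, of collecting every shard that follows the `train-000-of-005` naming pattern from the question. The function name `list_shards` and the assumption that all shards sit in one directory are illustrative and do not come from the original answer.

```python
import glob
import os


def list_shards(directory, prefix):
    """Return the sorted paths of files named '<prefix>-NNN-of-MMM' in directory."""
    pattern = os.path.join(directory, prefix + "-*-of-*")
    # Sorting makes the order deterministic; glob itself gives no ordering guarantee.
    return sorted(glob.glob(pattern))
```

The returned list could be passed as `filenames` to `TFRecordDataset` in the reading code above.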

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
