Block和SSTable的构建 
 
我们在之前提到了SSTable和内部的各类Block的结构,也初步分析了其在源码中的类级别的组织结构,但是我们也能注意到类结构和实际写入文件的结构是存在明显差异的,所以我们现在就需要进一步了解levelDB是如何构建这些Block与SSTable的。
            源码位置与说明:
table/block_builder.cc table/block_builder.h : Block构建相关 
include/leveldb/table_builder.h table/table_builder.cc :
table构建相关 
include/leveldb/table.h table/table.cc :
读取一个Table的内容,产生迭代器,或者根据一个键读取一个值
           
Block的构建--BlockBuilder 
我们之前看到的Block的类其实只有存储的数据类型、迭代器和查询相关的东西,那么Block获取到的数据是哪里来的呢?也就是哪个类来负责把一个一个kv插入到data_中呢?其实这个类就是BlockBuilder。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 class  BlockBuilder { private :    const  Options *options_;              std::string buffer_;                  std::vector<uint32_t > restarts_;      int  counter_;                         bool  finished_;                       std::string last_key_;            public :         explicit  BlockBuilder (const  Options *options)  ;     BlockBuilder (const  BlockBuilder &) = delete ;     BlockBuilder &operator =(const  BlockBuilder &) = delete ;               void  Reset ()  ;          void  Add (const  Slice &key, const  Slice &value)  ;               Slice Finish ()  ;          size_t  CurrentSizeEstimate ()  const  ;          bool  empty ()  const   { return  buffer_.empty (); } }; 
 
我们可以看到一个BlockBuilder的作用其实就是不断把key-value添加到缓冲区中,然后在完成构建时返回这个缓冲区。由于没有必须要为每个Block都生成一个BlockBuilder类,所以BlockBuilder类是可以复用的。
下面我们来看一下具体的实现。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 BlockBuilder::BlockBuilder (const  Options *options)     : options_ (options), restarts_ (), counter_ (0 ), finished_ (false ) {     assert (options->block_restart_interval >= 1 );     restarts_.push_back (0 );  } void  BlockBuilder::Reset ()  {    buffer_.clear ();     restarts_.clear ();     restarts_.push_back (0 );      counter_ = 0 ;     finished_ = false ;     last_key_.clear (); } size_t  BlockBuilder::CurrentSizeEstimate ()  const  {    return  (buffer_.size () +                                   restarts_.size () * sizeof (uint32_t ) +              sizeof (uint32_t ));                     } Slice BlockBuilder::Finish ()   {         for  (size_t  i = 0 ; i < restarts_.size (); i++)     {         PutFixed32 (&buffer_, restarts_[i]);     }     PutFixed32 (&buffer_, restarts_.size ());     finished_ = true ;     return  Slice (buffer_); } void  BlockBuilder::Add (const  Slice &key, const  Slice &value)  {    Slice last_key_piece (last_key_)  ;      assert (!finished_);     assert (counter_ <= options_->block_restart_interval);     assert (buffer_.empty ()             || options_->comparator->Compare (key, last_key_piece) > 0 );     size_t  shared = 0 ;          if  (counter_ < options_->block_restart_interval)     {                  const  size_t  min_length = std::min (last_key_piece.size (), key.size ());         while  ((shared < min_length) && (last_key_piece[shared] == key[shared]))         {             shared++;         }     }     else      {                  restarts_.push_back (buffer_.size ());         counter_ = 0 ;     }     const  size_t  non_shared = key.size () - shared;                PutVarint32 (&buffer_, shared);     PutVarint32 (&buffer_, non_shared);     PutVarint32 (&buffer_, value.size ());          buffer_.append (key.data () + shared, non_shared);     buffer_.append (value.data (), value.size ());          last_key_.resize (shared);     last_key_.append (key.data () + shared, non_shared);     assert (Slice (last_key_) == key);     counter_++; } 
 
我们可以看到,BlockBuilder还是很简单的,主要就是创建重启点、插入key-value,不过需要注意的是这个插入key-value的时候需要保证keys是有序的。
Table类 
在之前我们讲述SSTable的时候,我们其实是没有提到Table这个类的,因为前面的内容有点多,下面我们就来具体看一下这个类。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 class  LEVELDB_EXPORT  Table{ private :    friend  class  TableCache ;      struct  Rep ;                        static  Iterator *BlockReader (void  *, const  ReadOptions &, const  Slice &)  ;          explicit  Table (Rep *rep)  : rep_(rep) { }                    Status InternalGet (const  ReadOptions &, const  Slice &key, void  *arg,                          void  (*handle_result)(void  *arg, const  Slice &k,                                              const  Slice &v))  ;         void  ReadMeta (const  Footer &footer)  ;     void  ReadFilter (const  Slice &filter_handle_value)  ;     Rep *const  rep_; public :                                                                               static  Status Open (const  Options &options, RandomAccessFile *file,                         uint64_t  file_size, Table **table)  ;         Table (const  Table &) = delete ;     Table &operator =(const  Table &) = delete ;     ~Table ();          Iterator *NewIterator (const  ReadOptions &)  const  ;          uint64_t  ApproximateOffsetOf (const  Slice &key)  const  ; }; 
 
头文件中定义了缓存用的友元类、迭代器这些,说明Table类其实也没有定义怎么构建SSTable,而主要的目的是从文件中读取一个SSTable出来。
由于这其中涉及到了迭代器和缓存类,所以具体的实现我们就先略过,等到后面再详细看。
Table的构建--TableBuilder 
Block的构建是通过BlockBuilder类来完成的,而Table的构建则是通过TableBuilder类来完成的。我们先来看一下其头文件
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 class  LEVELDB_EXPORT  TableBuilder{ private :         bool  ok ()  const   { return  status ().ok (); }          void  WriteBlock (BlockBuilder *block, BlockHandle *handle)  ;          void  WriteRawBlock (const  Slice &data, CompressionType, BlockHandle *handle)  ;     struct  Rep ;      Rep *rep_; public :         TableBuilder (const  Options &options, WritableFile *file);     TableBuilder (const  TableBuilder &) = delete ;     TableBuilder &operator =(const  TableBuilder &) = delete ;     ~TableBuilder ();          Status ChangeOptions (const  Options &options)  ;          void  Add (const  Slice &key, const  Slice &value)  ;          void  Flush ()  ;          Status status ()  const  ;          Status Finish ()  ;          void  Abandon ()  ;          uint64_t  NumEntries ()  const  ;          uint64_t  FileSize ()  const  ; }; 
 
我们可以看到TableBuilder的头文件中定义了将Key写入文件的部分。以及,虽然这个类中也存在一个类型为Req的东西,但是和Table类中的不是同一个东西,我们看一下其定义。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 struct  TableBuilder ::Rep{     Rep (const  Options &opt, WritableFile *f)         : options (opt),           index_block_options (opt),           file (f),           offset (0 ),           data_block (&options),           index_block (&index_block_options),           num_entries (0 ),           closed (false ),           filter_block (opt.filter_policy == nullptr                             ? nullptr                             : new  FilterBlockBuilder (opt.filter_policy)),           pending_index_entry (false )     {         index_block_options.block_restart_interval = 1 ;     }     Options options;                       Options index_block_options;           WritableFile *file;                    uint64_t  offset;                       Status status;                         BlockBuilder data_block;               BlockBuilder index_block;              std::string last_key;                  int64_t  num_entries;                   bool  closed;                           FilterBlockBuilder *filter_block;                                                   bool  pending_index_entry;        BlockHandle pending_handle;      std::string compressed_output;  }; 
 
这个结构体中定义了table中包含的各个部分,这些部分没有直接在Table类中定义,是因为那样就会让Table类十分冗长,使用类内的一个结构体会清晰很多。实际上,TableBuilder类在进行构建的时候也是直接处理这个结构体的。
下面我们来看一下TableBuiler类的具体实现。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 TableBuilder::TableBuilder (const  Options &options, WritableFile *file)     : rep_ (new  Rep (options, file)) {     if  (rep_->filter_block != nullptr )     {         rep_->filter_block->StartBlock (0 );     } } TableBuilder::~TableBuilder () {     assert (rep_->closed);      delete  rep_->filter_block;     delete  rep_; } Status TableBuilder::ChangeOptions (const  Options &options)   {    if  (options.comparator != rep_->options.comparator)     {         return  Status::InvalidArgument ("changing comparator while building table" );     }     rep_->options = options;     rep_->index_block_options = options;     rep_->index_block_options.block_restart_interval = 1 ;     return  Status::OK (); } void  TableBuilder::Add (const  Slice &key, const  Slice &value)  {    Rep *r = rep_;     assert (!r->closed);     if  (!ok ())         return ;          if  (r->num_entries > 0 )     {         assert (r->options.comparator->Compare (key, Slice (r->last_key)) > 0 );     }               if  (r->pending_index_entry)     {         assert (r->data_block.empty ());                  r->options.comparator->FindShortestSeparator (&r->last_key, key);         std::string handle_encoding;         r->pending_handle.EncodeTo (&handle_encoding);                  r->index_block.Add (r->last_key, Slice (handle_encoding));         r->pending_index_entry = false ;     }     if  (r->filter_block != nullptr )     {         r->filter_block->AddKey (key);     }          r->last_key.assign (key.data (), key.size ());     r->num_entries++;     r->data_block.Add (key, value);          const  size_t  estimated_block_size = r->data_block.CurrentSizeEstimate ();     if  (estimated_block_size >= r->options.block_size)     {         Flush ();     } } void  TableBuilder::Flush ()  {    Rep *r = rep_;     assert (!r->closed);     if  (!ok ())         return ;     if  (r->data_block.empty ())         return ;     assert (!r->pending_index_entry);     WriteBlock (&r->data_block, &r->pending_handle);      if  (ok ())     {         r->pending_index_entry = true ;         r->status = r->file->Flush ();      }     if  (r->filter_block != nullptr )     {         r->filter_block->StartBlock (r->offset);      } } void  TableBuilder::WriteBlock (BlockBuilder *block, BlockHandle *handle)  {         assert (ok ());     Rep *r = rep_;     Slice raw = block->Finish ();      Slice block_contents;     CompressionType type = r->options.compression;      switch  (type)                                       {     case  kNoCompression:          block_contents = raw;         break ;     case  kSnappyCompression:      {         std::string *compressed = &r->compressed_output;         if  (port::Snappy_Compress (raw.data (), raw.size (), compressed) &&             compressed->size () < raw.size () - (raw.size () / 8u ))         {             block_contents = *compressed;         }         else          {                                       block_contents = raw;             type = kNoCompression;         }         break ;     }     }          WriteRawBlock (block_contents, type, handle);     r->compressed_output.clear ();     block->Reset (); } void  TableBuilder::WriteRawBlock (const  Slice &block_contents,                                  CompressionType type, BlockHandle *handle)  {    Rep *r = rep_;     handle->set_offset (r->offset);     handle->set_size (block_contents.size ());     r->status = r->file->Append (block_contents);      if  (r->status.ok ())     {         char  trailer[kBlockTrailerSize];          trailer[0 ] = type;         uint32_t  crc = crc32c::Value (block_contents.data (), block_contents.size ());         crc = crc32c::Extend (crc, trailer, 1 );          EncodeFixed32 (trailer + 1 , crc32c::Mask (crc));                  r->status = r->file->Append (Slice (trailer, kBlockTrailerSize));         if  (r->status.ok ())         {             r->offset += block_contents.size () + kBlockTrailerSize;         }     } } Status TableBuilder::status ()  const   { return  rep_->status; }Status TableBuilder::Finish ()   {    Rep *r = rep_;     Flush ();     assert (!r->closed);     r->closed = true ;     BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;          if  (ok () && r->filter_block != nullptr )     {         WriteRawBlock (r->filter_block->Finish (), kNoCompression,                       &filter_block_handle);     }          if  (ok ())     {         BlockBuilder meta_index_block (&r->options)  ;         if  (r->filter_block != nullptr )         {             std::string key = "filter." ;             key.append (r->options.filter_policy->Name ());             std::string handle_encoding;             filter_block_handle.EncodeTo (&handle_encoding);             meta_index_block.Add (key, handle_encoding);         }         WriteBlock (&meta_index_block, &metaindex_block_handle);     }          if  (ok ())     {         if  (r->pending_index_entry)         {             r->options.comparator->FindShortSuccessor (&r->last_key);             std::string handle_encoding;             r->pending_handle.EncodeTo (&handle_encoding);             r->index_block.Add (r->last_key, Slice (handle_encoding));             r->pending_index_entry = false ;         }         WriteBlock (&r->index_block, &index_block_handle);     }          if  (ok ())     {         Footer footer;         footer.set_metaindex_handle (metaindex_block_handle);         footer.set_index_handle (index_block_handle);         std::string footer_encoding;         footer.EncodeTo (&footer_encoding);         r->status = r->file->Append (footer_encoding);         if  (r->status.ok ())         {             r->offset += footer_encoding.size ();         }     }     return  r->status; } void  TableBuilder::Abandon ()  {    Rep *r = rep_;     assert (!r->closed);     r->closed = true ; } uint64_t  TableBuilder::NumEntries ()  const   { return  rep_->num_entries; }uint64_t  TableBuilder::FileSize ()  const   { return  rep_->offset; }
 
通过这个TableBuilder类,我们就可以很清晰地看出一个Table是怎么构建的了。下面给出构建一个Table的流程。
TableBuilder类首先在类内创建出各类BlockBuilder,包括dataBlock,metadataBlock,filterBlock等等。 
然后当每次需要插入一个key时就会调用TableBuilder的add方法,这个方法会自动负责处理index_block、重启点、filter_block以及对每个key进行长度压缩(这个压缩不是共享key,而是调用comparator中的FindShortestSeparator方法)。另外TableBuilder类插入key-value会首先写入内存中,只有当内存部分的大小超过一定长度才会写持久化到硬盘。 
Flush函数负责将目前的数据区追加写到磁盘中,它会调用WriteBlock函数,注意Flush不会把非data_block的block持久化。 
WriteBlock函数是对WriteRawBlock函数的封装。它主要是处理了数据压缩的部分,然后把压缩后的数据和压缩类型以及数据的crc校验码传递给WriteRawBlock函数。 
WriteRawBlock函数是真正写入数据的函数,每个Block都由数据区和偏置区、压缩类型以及crc校验组成。 
当一个table完成构建时就会调用finish函数,这个函数会负责追加写filter_block、metaindex_block、index_block和Footer。其中写入每个Block时会调用对应handler的encode函数来将每个block编码成字符串。 
 
我们继续来审视一下SSTable和Block的结构图,这样我们会更加清楚其结构。
 
 
总结 
不得不说,Block的统一的设计还是非常的巧妙,所有的Block结构都是一样的,只是提供了统一的接口(Add),内部设计包含了key的前缀压缩、偏置区、重启点等等优化设计。而filterBlock和IndexBlock等等也是包含这些偏置区和重启点的,可能会感觉有些奇怪,因为这些存储的都是过滤器或者是索引。但是由于我们不需要关心这些内部的细节,所以可以看成是数据结构内部优化即可。