Feature
-------
Support geo-replication for sharded volumes.

Summary
-------
This feature helps geo-replicate the large files stored on a sharded volume. The
requirement is that the slave volume should also be sharded.

Owners
------



Current status
--------------
Traditionally, the changelog xlator, sitting just above posix, records changes
at the brick level, and geo-replication picks up the files that are
modified/created and syncs them over a gluster mount to the slave. This works
well as long as a file in a gluster volume is represented by a single file at
the brick level. But with the introduction of sharding in gluster, a file in a
gluster volume can be represented by multiple files at the brick level, spanning
different bricks. Hence the traditional way of syncing files using the changelog
results in related files being synced as altogether different files. So there
has to be some understanding between geo-replication and sharding to convey that
all those sharded files are related. Hence this feature.

Related Feature Requests and Bugs
---------------------------------
 1. [Mask sharding translator for geo-replication client](https://bugzilla.redhat.com/show_bug.cgi?id=1275972)
 2. [All other related changes for geo-replication](https://bugzilla.redhat.com/show_bug.cgi?id=1284453)

Detailed Description
--------------------
Sharding breaks a file into multiple small files based on an agreed-upon
shard size (usually 4MB, 64MB, ...) and helps distribute one big file well
across sub-volumes. Say the shard size is 4MB: the first 4MB of the file is
saved with the actual filename, say file1. The next 4MB becomes its first
shard with the filename <GFID>.1, and so on. So shards are saved as
<GFID>.1, <GFID>.2, <GFID>.3, ..., <GFID>.n, where GFID is the gfid of file1.

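To make the naming concrete, here is a minimal sketch. Python is used only for
illustration; the helper name, the 4MB shard size, and the generated gfid are
assumptions for the example, not part of the design above.

```python
import uuid

SHARD_SIZE = 4 * 1024 * 1024  # assumed 4MB shard size for this example

def extra_shard_names(gfid, file_size, shard_size=SHARD_SIZE):
    """Illustrative only: names of the extra shard files expected for a file
    of file_size bytes; the first shard_size bytes stay in the original file."""
    if file_size <= shard_size:
        return []
    remaining = file_size - shard_size
    num_shards = -(-remaining // shard_size)   # ceiling division
    return ["%s.%d" % (gfid, i) for i in range(1, num_shards + 1)]

gfid = uuid.uuid4()                            # stand-in for the real gfid of file1
print(extra_shard_names(gfid, 10 * 1024 * 1024))
# e.g. ['<gfid>.1', '<gfid>.2'] -- file1 itself holds the first 4MB
```
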
The shard xlator is placed just above DHT in the client stack. It determines
which shard a write/read belongs to based on the offset and hands the specific
<GFID>.n file to DHT. Each of the sharded files is stored under a special
directory called ".shard" in the respective sub-volume, as hashed by DHT.

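A minimal sketch of that routing decision, again assuming a 4MB shard size; the
function and constants below are illustrative, and the real logic lives in the
shard xlator itself.

```python
SHARD_SIZE = 4 * 1024 * 1024  # assumed 4MB shard size

def route(gfid, offset, shard_size=SHARD_SIZE):
    """Illustrative only: which backend file an I/O at `offset` targets."""
    index = offset // shard_size
    if index == 0:
        return "<base file>"                  # first shard_size bytes: original file
    return ".shard/%s.%d" % (gfid, index)     # later bytes: .shard/<GFID>.<n>

print(route("fa11ed00-0000-0000-0000-000000000000", 9 * 1024 * 1024))
# .shard/fa11ed00-0000-0000-0000-000000000000.2
```
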
For more information on Gluster sharding, please go through the following links.
 1. <https://gluster.readthedocs.org/en/release-3.7.0/Features/shard>
 2. <http://blog.gluster.org/2015/12/introducing-shard-translator>
 3. <http://blog.gluster.org/2015/12/sharding-what-next-2>

To make geo-rep work with sharded files, there are two options.

 1. Somehow record only the main gfid and bname on changes to any shard:
    This would simplify the design but lacks performance, as geo-rep has to
    sync all the shards from a single brick, and rsync might take more time
    calculating checksums to find the delta if shards of the file are
    placed on different nodes by DHT.

 2. Let geo-rep sync the main file and each sharded file separately:
    This approach overcomes the performance issue, but the solution needs
    to be implemented carefully, considering all the cases. For this, the
    geo-rep client is given access by the sharding xlator to sync each
    shard as a separate file; hence rsync need not calculate checksums
    over the wire for the whole large file and syncs each shard as if it
    were a standalone file. The xattrs maintained on the main file to
    track the shard size and file size are also synced. Here multiple
    bricks participate in syncing the shards, depending on where each
    shard is hashed.

 Keeping performance in mind, the second approach was chosen.

So the key here is that the sharding xlator is masked for geo-replication
(the gsyncd client). It syncs all the sharded files as separate files, as if no
sharding xlator were loaded. Since the xattrs of the main file are also synced
from the master, data read by non-geo-rep clients on the slave remains intact.
It is possible that geo-rep has not yet synced all the shards of a file from
the master; during that window, reads on the slave can return inconsistent
data, which is expected since geo-rep is in any case an eventually consistent
model.

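As an illustration, the bookkeeping xattrs on the main file could be compared
between a master and a slave copy. The brick paths, file name, and the xattr
names trusted.glusterfs.shard.block-size / trusted.glusterfs.shard.file-size
below are assumptions (the document only says that shard size and file size are
tracked in xattrs), so treat this purely as a sketch.

```python
import os

# Hypothetical brick backend paths on a master and a slave node; reading
# trusted.* xattrs directly from the brick requires root access.
MASTER_BRICK = "/bricks/master/brick1"
SLAVE_BRICK = "/bricks/slave/brick1"

# Shard bookkeeping xattrs kept on the main file (names assumed, see above).
XATTRS = ("trusted.glusterfs.shard.block-size",
          "trusted.glusterfs.shard.file-size")

def shard_xattrs(path):
    """Return the raw shard xattrs of a file (Linux-only: os.getxattr)."""
    return {name: os.getxattr(path, name) for name in XATTRS}

name = "vm1.img"   # hypothetical large file
master = shard_xattrs(os.path.join(MASTER_BRICK, name))
slave = shard_xattrs(os.path.join(SLAVE_BRICK, name))
print("shard xattrs in sync:", master == slave)
```
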
So this brings in certain prerequisite configurations:

 1. If the master is a sharded volume, the slave also needs to be a sharded volume.
 2. The geo-rep sync engine must be 'rsync'. tarssh is not supported for sharded
    configurations.

Benefit to GlusterFS
--------------------
Sharded volumes can be geo-replicated. The main use case is the
hyperconvergence scenario, where large VM images are stored in sharded
gluster volumes and need to be geo-replicated for disaster recovery.

Scope
-----
#### Nature of proposed change
No new translators are written as part of this feature.
The modification spans the sharding and gfid-access translators
and geo-replication:

 1. <http://review.gluster.org/#/c/12438>
 2. <http://review.gluster.org/#/c/12732>
 3. <http://review.gluster.org/#/c/12729>
 4. <http://review.gluster.org/#/c/12721>
 5. <http://review.gluster.org/#/c/12731>
 6. <http://review.gluster.org/#/c/13643>

#### Implications on manageability
No implications on manageability. There is no change in the way geo-replication
is set up.

#### Implications on presentation layer
No implications on NFS/SAMBA/UFO/FUSE/libglusterfsclient.

#### Implications on persistence layer
No implications on LVM/XFS/RHEL.

#### Implications on 'GlusterFS' backend
No implications on the brick's data format or layout.

#### Modification to GlusterFS metadata
No modifications to metadata. No new extended attributes or internal hidden
files are used to keep the metadata.

#### Implications on 'glusterd'
None

How To Test
-----------
 1. Set up the master gluster volume and enable sharding.
 2. Set up the slave gluster volume and enable sharding.
 3. Create a geo-replication session between the master and slave volumes.
 4. Make sure the geo-rep config 'use_tarssh' is set to false.
 5. Make sure the geo-rep config 'sync_xattrs' is set to true.
 6. Start geo-replication.
 7. Write a large file greater than the shard size and check for the same
    on the slave volume.

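A rough sketch of these steps, driving the gluster CLI from Python. The volume
and host names are placeholders, and the push-pem session creation assumes the
usual geo-replication setup (passwordless SSH already configured); none of this
is prescribed by the document above.

```python
import subprocess

MASTER = "mastervol"                 # placeholder master volume name
SLAVE = "slavehost::slavevol"        # placeholder slave host and volume

def gluster(*args):
    """Run a gluster CLI command and echo it."""
    cmd = ["gluster", "volume"] + list(args)
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Enable sharding on the master volume (run on the master cluster).
gluster("set", MASTER, "features.shard", "on")
# 2. Enable sharding on the slave volume (run the same command on the
#    slave cluster for "slavevol").

# 3. Create the geo-replication session (assumes passwordless SSH is set up).
gluster("geo-replication", MASTER, SLAVE, "create", "push-pem")

# 4./5. Required geo-rep configuration for sharded volumes.
gluster("geo-replication", MASTER, SLAVE, "config", "use_tarssh", "false")
gluster("geo-replication", MASTER, SLAVE, "config", "sync_xattrs", "true")

# 6. Start geo-replication.
gluster("geo-replication", MASTER, SLAVE, "start")

# 7. Write a file larger than the shard size on a master mount, then verify
#    the same file (size and checksum) on a slave mount once it is synced.
```
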
User Experience
---------------
The following configuration should be done:

 1. If the master is a sharded volume, the slave also needs to be a sharded volume.
 2. The geo-rep sync engine must be 'rsync'. tarssh is not supported for sharded
    configurations.
 3. The geo-replication config option 'sync_xattrs' should be set to true.

Dependencies
------------
No dependencies apart from the sharding feature :)

Documentation
-------------

Status
------
Completed

Comments and Discussion
-----------------------