Note
The problem and the fixes described in this article also apply to Tencent Cloud Elasticsearch Service (ES).
This article has two follow-ups: "What to Do When an Elasticsearch Index Shard Is Corrupted? (Part 2)"
and "What to Do When an Elasticsearch Index Shard Is Corrupted? (Part 3)".
Background
- In the earlier article on analyzing abnormal Elasticsearch cluster states (RED, YELLOW), we learned that when a primary shard cannot be brought online the cluster turns RED, and read and write requests to the affected RED indices are severely impacted.
- Here we look at shard corruption. When an index shard is corrupted, the corresponding primary shard cannot be allocated and its state is also RED. Corruption, however, comes in many forms: some cases are only superficial and can be recovered with the right steps, while others are genuine physical damage that cannot be repaired, and part of the data, or even the whole shard, has to be discarded.
Problem
Scenario: shard corruption caused by a physical power-off of the server
This situation is fairly common, and it can usually be confirmed with the explain API:
[root@sh ~]# curl -s -XGET localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "index-net-20210902-3",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-09-28T03:10:58.099Z",
    "failed_allocation_attempts" : 5,
    "details" : """failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, failure RecoveryFailedException[[device_search_20201204][3]: Recovery failed on {reading_9.10.126.164_node2}{LwWiAwmdQCiEibtiF7oqxQ}{YVadGK2FSDKbR69l0Wu0xg}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=539647844352, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")))]; """,
    "last_allocation_status" : "no"
  }
}
Or it can be confirmed from the logs:
[o.e.c.r.a.AllocationService] [1612339152002813032] failing shard [failed shard, shard [index-net-20210902-3][7], node[6sTEWvTlTlWZutgb_sK8ZA], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[INITIALIZING], a[id=y1Gvnr_hTuaaVIqm9TKaFA], unassigned_info[[reason=ALLOCATION_FAILED], at[2021-09-28T03:10:58.099Z], failed_attempts[4], failed_nodes[[6sTEWvTlTlWZutgb_sK8ZA]], delayed=false, details[failed shard on node [6sTEWvTlTlWZutgb_sK8ZA]: failed recovery, failure RecoveryFailedException[[index-net-20210902-3][7]: Recovery failed on {1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; ], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[index-net-20210902-3][7]: Recovery failed on {1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st"))]; ], markAsStale [true]] org.elasticsearch.indices.recovery.RecoveryFailedException: [index-net-20210902-3][7]: Recovery failed on {1612339152002810932}{6sTEWvTlTlWZutgb_sK8ZA}{JLZ-DDlmQoiw3MUHcxYydQ}{9.10.126.164}{9.10.126.164:9300}{dil}{ml.machine_memory=67210133504, rack=cvm_1_100003, xpack.installed=true, set=100003, ip=9.10.126.164, temperature=hot, ml.max_open_jobs=20, region=1} at 
org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2604) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.5.1.jar:7.5.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_181] at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_181] at java.lang.Thread.run(Unknown Source) [?:1.8.0_181] Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1] ... 4 more Caused by: org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:167) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:423) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1] ... 
4 more Caused by: java.io.IOException: failed to read /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:417) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1] ... 
4 more Caused by: java.io.IOException: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")) at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:316) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:413) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1] ... 
4 more Caused by: org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path="/data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st")) at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:523) ~[lucene-core-8.3.0.jar:8.3.0 6305aea4e5929f262e9c07fcf16d3afe2b4bb9f5 - danielhuang - 2020-11-10 17:11:38] at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:299) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadGeneration(MetaDataStateFormat.java:413) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:442) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:463) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.seqno.ReplicationTracker.loadRetentionLeases(ReplicationTracker.java:460) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.loadRetentionLeases(IndexShard.java:2263) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1633) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1607) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:422) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:95) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:93) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1904) ~[elasticsearch-7.5.1.jar:7.5.1] at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$16(IndexShard.java:2600) ~[elasticsearch-7.5.1.jar:7.5.1]
The key message shared by both is: file truncated?
Problem analysis
So why does this happen? Keep in mind that an index shard does not become corrupted for no reason; the node holding the shard must have gone through some kind of failure.
- We first checked in cerebro which node had recently gone offline, and then logged on to that node;
- uptime showed that the machine hosting the node appeared to have restarted around 12:21:57 that day. An ordinary restart rarely causes this kind of damage, so this was most likely not a graceful reboot: the machine was almost certainly not shut down and restarted via shutdown or a similar command (a quick way to check this is sketched after this list);
- We then reached out to the colleagues responsible for hardware operations and got confirmation: the host machine had indeed been failed over that day, and automatic failover forcibly cuts the power, so the culprit was this physical power-off;
- Why does a physical power-off corrupt shards? A reboot issued from inside the OS stops the services running in the guest first. But when the host machine restarts, the guest is unaware; it is stopped forcibly, so files that are in the middle of being written cannot be closed properly, and their data can no longer be read back correctly.
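As a side note, one quick way to tell whether a restart was graceful is to compare the boot records with the shutdown records on the data node. This is only a minimal sketch; the exact output naturally depends on the machine:
[root@sh ~]# uptime
[root@sh ~]# last -x reboot shutdown | head -20
A reboot entry without a matching shutdown entry just before it usually means the machine was powered off or reset rather than shut down cleanly.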
Solutions
Option 1: Repair the shard
The corruption of retention-leases-1518.st is related to the file having been unavailable for a period of time, which is to say, to the machine restart. To recover, manually delete the corrupted file and then retry allocating the shard:
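The path of the corrupted file can be read directly from the error message above. A minimal sketch of moving it out of the way on the affected data node (the path below is the one from this incident and will differ on other clusters; renaming is slightly safer than deleting outright):
[root@sh ~]# mv /data1/containers/1612339152002810932/es/data/nodes/0/indices/b0Ar9gFpQc6_oHhrdHYxGQ/7/_state/retention-leases-1518.st /tmp/retention-leases-1518.st.bak
With the file out of the way, the reroute call below with retry_failed=true asks the master to retry the previously failed allocations.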
[root@sh ~]# curl -s -XPOST localhost:9200/_cluster/reroute?retry_failed=true
{
  "acknowledged": true,
  "state": {
    "cluster_uuid": "LOk2L8k5RsmCC7eg2y3h8A",
    "version": 533752,
    "state_uuid": "jVm_8aAIT6ug9NBJazjVig",
    "master_node": "kHbBiclxR5-c-rsra2A5Jg",
    "blocks": { },
    "nodes": {
      "m5eloUNuTJak4xDRqf3FeA": {
        "name": "1625799512002116132",
        "ephemeral_id": "dqHmYahLSbuqvSkRXy2IPg",
        "transport_address": "9.27.34.96:9300",
        "attributes": {
          "ml.machine_memory": "134587404288",
          "rack": "cvm_33_330002",
          "xpack.installed": "true",
          "set": "330002",
          "transform.node": "true",
          "ip": "9.27.34.96",
          "temperature": "hot",
          "ml.max_open_jobs": "20",
          "region": "33"
        }
      },
      "security_tokens": { }
    }
  }
}
Option 2: Allocate the stale primary
If deleting the corrupted .st file does not bring the shard online, consider using the reroute API to allocate the stale primary. Before calling this API, we need a few pieces of information (a quick way to collect them is sketched after this list):
- the index name and shard ID, which can be read directly from the explain API output;
- the node name, which can be found in unassigned_info.details.
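A minimal sketch of collecting these values from the command line; the jq filter is only an illustration and assumes jq is available on the machine:
[root@sh ~]# curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
[root@sh ~]# curl -s localhost:9200/_cluster/allocation/explain?pretty | jq '{index, shard, details: .unassigned_info.details}'
The node name holding the last copy of the data can then be picked out of the details text (it appears in the "failed shard on node [...]" / "Recovery failed on {...}" fragments).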
With this information in hand, we can call the reroute API:
[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "{index name}",
        "shard": "{shard ID}",
        "node": "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'
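After the command returns, it is worth confirming that the shard has actually come back online, for example (the index name here is a placeholder):
[root@sh ~]# curl -s "localhost:9200/_cat/shards/{index name}?v&h=index,shard,prirep,state,node"
If the primary shows up as STARTED, the stale copy has been promoted and the index should be readable and writable again.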
Option 3: Discard the shard
If allocating the stale primary still cannot bring the shard online, then to keep the index's read and write requests unaffected the only option left is to discard the corrupted shard. This is the worst case:
[root@sh ~]# curl -s -H "Content-Type:application/json" -XPOST "localhost:9200/_cluster/reroute?pretty" -d '
{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "{index name}",
        "shard" : "{shard ID}",
        "node" : "{node name}",
        "accept_data_loss": true
      }
    }
  ]
}'
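Keep in mind that allocate_empty_primary brings the shard online as an empty primary, so every document that lived on that shard is permanently lost. Once it has been executed, the cluster should return to GREEN, which can be confirmed with, for example:
[root@sh ~]# curl -s localhost:9200/_cluster/health?pretty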