incompatible clusterID Hadoop #55

Open
Atahualkpa opened this issue May 11, 2018 · 4 comments

Comments

@Atahualkpa

Hi,
every time I reboot the swarm I have this problem:

java.io.IOException: Incompatible clusterIDs in /hadoop/dfs/data: namenode clusterID = CID-b25a0845-5c64-4603-a2cb-d7878c265f44; datanode clusterID = CID-f90183ca-4d87-4b49-8fb2-ca642d46016c
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:777)

FATAL datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to namenode/10.0.0.7:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:574)

I solved this problem by deleting this Docker volume:

sudo docker volume inspect hadoop_datanode

[ { "CreatedAt": "2018-05-10T19:35:31Z", "Driver": "local", "Labels": { "com.docker.stack.namespace": "hadoop" }, "Mountpoint": "/data0/docker_var/volumes/hadoop_datanode/_data", "Name": "hadoop_datanode", "Options": {}, "Scope": "local" } ]
but this volume contains the files that I put into HDFS, so I have to put the files into HDFS again every time I deploy the swarm. I'm not sure this is the right way to solve this problem.
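For reference, a minimal sketch of that workaround, assuming the stack is deployed under the name hadoop (as the com.docker.stack.namespace label above suggests) and the compose file is called docker-compose.yml:

# remove the stack, drop the stale datanode volume, then redeploy
sudo docker stack rm hadoop
sudo docker volume rm hadoop_datanode
sudo docker stack deploy -c docker-compose.yml hadoop

Removing the volume deletes the HDFS block data (which is exactly the drawback described above), and on a multi-node swarm the docker volume rm has to be repeated on every node that holds a copy of the volume.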
Googling, I found one solution, but I don't know how to apply it before the swarm reboots. This is the solution:
The problem is with the property name dfs.datanode.data.dir; it is misspelt as dfs.dataode.data.dir. This prevents the property from being recognised, and as a result the default location of ${hadoop.tmp.dir}/hadoop-${USER}/dfs/data is used as the data directory.
hadoop.tmp.dir is /tmp by default; on every reboot the contents of this directory are deleted, which forces the datanode to recreate the folder on startup, and thus the incompatible clusterIDs.
Edit this property name in hdfs-site.xml before formatting the namenode and starting the services.
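For illustration, the correctly spelled entry in hdfs-site.xml would look like this; the file:///hadoop/dfs/data value is an assumption taken from the container path in the logs above:

<property>
  <!-- a misspelling such as dfs.dataode.data.dir is silently ignored and the /tmp default is used instead -->
  <name>dfs.datanode.data.dir</name>
  <value>file:///hadoop/dfs/data</value>
</property>

As far as I can tell, the bde2020 images generate hdfs-site.xml from the HDFS_CONF_* variables in hadoop.env, so that file is the place to check for a misspelled property name.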

thanks.

@earthquakesan
Member

@Atahualkpa Hi!

Which docker-compose are you using? Or what is your setup? Do you persist the data to the local drive from your Docker containers, e.g. by having a volumes key?

services:
  namenode:
    volumes:
      - /path/to/the/folder:/hadoop/dfs/name
  datanode:
    volumes:
      - /path/to/the/folder:/hadoop/dfs/data

@Atahualkpa
Author

Hi @earthquakesan, thanks for your answer.
This is my docker-compose setup:

version: '3'
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
    networks:
      - workbench
    volumes:
      - namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      labels:
        traefik.docker.network: workbench
        traefik.port: 50070
    ports:
      - 8334:50070
    volumes:
      - /data0/reference/hg19-ucsc/:/reference/hg19-ucsc/
      - /data0/output/:/output/
      - /data/ngs/:/ngs/
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
    networks:
      - workbench
    volumes:
      - datanode:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:50070"
    env_file:
      - ./hadoop.env
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
    labels:
      traefik.docker.network: workbench
      traefik.port: 50075

volumes:
  datanode:
  namenode:

networks:
  workbench:
    external: true

but I notice that I have not set a local path for HDFS. I tried to set a local path, but the problem is still present.
I checked the path and found a file called VERSION inside a directory named current. This is what is written in the file:

storageID=DS-6e863e5f-34a1-4d09-bcf2-58f6badc7dba
clusterID=CID-4a2c4782-785b-4b8c-be8f-e0d7cef85b24
cTime=0
datanodeUuid=48dc924c-fea1-40d8-9da2-7faeb2ee28b9
storageType=DATA_NODE
layoutVersion=-56

Also, while checking the directory, I found the folder BP-1651631011-10.0.0.12-1527073017748/current, and in this folder there is another file called VERSION, but it contains this:

namespaceID=1025220048
cTime=0
blockpoolID=BP-1651631011-10.0.0.12-1527073017748
layoutVersion=-56

This is the exception that is generated:

namenode clusterID = CID-37f14517-46c8-430a-803d-5fe2b0d047fc; datanode clusterID = CID-4a2c4782-785b-4b8c-be8f-e0d7cef85b24
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:777)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:300)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:416)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:395)
	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:573)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1386)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1351)
	at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:313)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:216)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:637)
	at java.lang.Thread.run(Thread.java:748)
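
One quick way to compare the two IDs directly, assuming the default image paths and with the container names as placeholders:

# print the clusterID recorded by each side
sudo docker exec <namenode-container> grep clusterID /hadoop/dfs/name/current/VERSION
sudo docker exec <datanode-container> grep clusterID /hadoop/dfs/data/current/VERSION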

Thanks for your support.

@earthquakesan
Member

@Atahualkpa How many nodes do you have in your swarm cluster? Are the containers always allocated on the same nodes?

@Atahualkpa
Author

Atahualkpa commented May 23, 2018

Now I have three nodes in the swarm. On the leader, 6 containers are running; they are:

  • bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
  • bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  • atahualpa/spark-master:4.1.2 (this container is based on bde2020/spark-master:2.2.0-hadoop2.8-hive-java8, where I installed GATK 4)
  • bde2020/spark-worker:2.2.0-hadoop2.8-hive-java8
  • traefik:v1.1.0

and on the others are running:

  • bde2020/spark-worker:2.2.0-hadoop2.8-hive-java
  • bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
  • vzzarr/reference:hg19_img

Are the containers always allocated on the same nodes?
Yes, but I have to start the leader first, because it holds the files I put into HDFS; if I join the other nodes first, the Spark master and the namenode are placed on a random node.

Moreover, every time I deploy the swarm, this hadoop_datanode volume is present on every node of the swarm.
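
For what it's worth, one way to make sure a service always lands on the same node (and therefore always sees the same local volume) is a Swarm placement constraint; a minimal sketch for the namenode, assuming a single manager node:

  namenode:
    deploy:
      placement:
        constraints:
          - node.role == manager

This should avoid the namenode being rescheduled onto a node where the hadoop_namenode volume is empty and getting formatted there with a new clusterID.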

thanks.
