# Hadoop: installation and configuration
This guide walks through an example installation and configuration of Apache Hadoop.
## Installation
- Download the Hadoop distribution from https://hadoop.apache.org/releases.html and uncompress it:

```bash
cd /tmp
wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
cd /opt
tar zxf /tmp/hadoop-3.1.3.tar.gz
ln -s hadoop-3.1.3 hadoop
```
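If you want to verify the integrity of the downloaded archive, a minimal check is to print its digest and compare it with the `.sha512` file published next to the tarball on downloads.apache.org:

```bash
# Print the SHA-512 digest of the archive; compare it by eye with the
# digest published alongside the release on downloads.apache.org.
sha512sum /tmp/hadoop-3.1.3.tar.gz
```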
- Create a `hadoop` user (with a home directory, needed for the SSH key and `.bashrc` steps below) and update permissions:

```bash
useradd -m hadoop
chown hadoop:root -R /opt/hadoop*
```
- Create a datastore directory and set permissions:

```bash
mkdir -p /opt/thp/thehive/hdfs
chown hadoop:root -R /opt/thp/thehive/hdfs
```
- Create SSH keys for the `hadoop` user:

```bash
su - hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
```
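Hadoop's start scripts connect to each node over SSH, so passwordless login to the local host should now work. A quick check, assuming an SSH server is running locally (`accept-new` requires OpenSSH 7.6 or later):

```bash
# Still as the hadoop user: this should complete without a password prompt.
ssh -o StrictHostKeyChecking=accept-new localhost true && echo "passwordless SSH OK"
```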
- Update the `.bashrc` file for the `hadoop` user. Add the following lines:

```bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
```
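To confirm the environment is taken into account, reload the profile and check that the Hadoop binaries resolve:

```bash
# As the hadoop user:
source ~/.bashrc
hadoop version   # should report Hadoop 3.1.3
which hdfs       # should print /opt/hadoop/bin/hdfs
```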
Note: Apache provides detailed documentation for more advanced Hadoop configurations.
## Configure the Hadoop master
Configuration files are located in `etc/hadoop` (`/opt/hadoop/etc/hadoop`). They must be identical on all nodes.
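Once you have several nodes, one way to keep the configuration identical is to push it from the master. A minimal sketch, assuming hypothetical node names `hadoop2` and `hadoop3` reachable over SSH and `rsync` installed everywhere:

```bash
# Push the master's configuration directory to each additional node.
for node in hadoop2 hadoop3; do
  rsync -a /opt/hadoop/etc/hadoop/ "${node}:/opt/hadoop/etc/hadoop/"
done
```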
Notes:

- The configuration described here is for a single-node server. This node is the master node, namenode and datanode (refer to the Hadoop documentation for more information). After validating that this node runs successfully, refer to the related guide to add nodes;
- Ensure you update the port value to something other than `9000`, as it is already reserved for the TheHive application service (the example below uses `10000`);
- Edit the file `core-site.xml`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://thehive1:10000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/thp/thehive/hdfs/temp</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
    <value>true</value>
  </property>
</configuration>
```
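You can check that Hadoop reads the value back as intended with `hdfs getconf`, run as the `hadoop` user:

```bash
# Should print hdfs://thehive1:10000
/opt/hadoop/bin/hdfs getconf -confKey fs.defaultFS
```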
- Edit the file `hdfs-site.xml`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/thp/thehive/hdfs/namenode/data</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/thp/thehive/hdfs/datanode/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/opt/thp/thehive/hdfs/checkpoint</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value>
  </property>
  <!--
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
    <value>true</value>
  </property>
  -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>NEVER</value>
  </property>
</configuration>
```
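The directories referenced above are created on demand, but you can pre-create them so ownership is correct before the first start:

```bash
# As root; paths match those declared in core-site.xml and hdfs-site.xml.
mkdir -p /opt/thp/thehive/hdfs/{temp,namenode/data,datanode/data,checkpoint}
chown -R hadoop:root /opt/thp/thehive/hdfs
```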
## Format the volume and start services
- Format the volume:

```bash
su - hadoop
cd /opt/hadoop
bin/hdfs namenode -format
```
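Before wiring Hadoop into systemd, you can check that HDFS comes up cleanly. A quick sanity check, still as the `hadoop` user in `/opt/hadoop`:

```bash
sbin/start-dfs.sh           # start the namenode and datanode daemons
bin/hdfs dfsadmin -report   # should list one live datanode
sbin/stop-dfs.sh            # stop them again; the systemd unit below takes over
```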
## Run it as a service
Create the `/etc/systemd/system/hadoop.service` file with the following content:
```ini
[Unit]
Description=Hadoop
Documentation=https://hadoop.apache.org/docs/current/index.html
Wants=network-online.target
After=network-online.target

[Service]
WorkingDirectory=/opt/hadoop
Type=forking
User=hadoop
Group=hadoop
Environment=JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop
Environment=YARN_HOME=/opt/hadoop
Environment=HADOOP_COMMON_HOME=/opt/hadoop
Environment=HADOOP_HDFS_HOME=/opt/hadoop
Environment=HADOOP_MAPRED_HOME=/opt/hadoop
Restart=on-failure
TimeoutStartSec=2min
ExecStart=/opt/hadoop/sbin/start-all.sh
ExecStop=/opt/hadoop/sbin/stop-all.sh
StandardOutput=null
StandardError=null
# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65536
# Disable timeout logic and wait until the process is stopped
TimeoutStopSec=0
# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM
# Java process is never killed
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
```
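Reload systemd so it picks up the new unit and, optionally, enable it at boot:

```bash
systemctl daemon-reload
systemctl enable hadoop
```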
## Start the service
```bash
service hadoop start
```
You can check the cluster status at http://thehive1:9870.
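From the command line, you can also check that the service is active and that the namenode web UI answers (replace `thehive1` with your hostname):

```bash
systemctl status hadoop --no-pager
curl -s -o /dev/null -w "%{http_code}\n" http://thehive1:9870/   # expect 200
```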
## Add nodes
To add Hadoop nodes, refer to the related guide.