title | description | ms.service | ms.topic | ms.custom | ms.date |
---|---|---|---|---|---|
Customize Azure HDInsight cluster configurations using bootstrap | Learn how to customize HDInsight cluster configuration programmatically using .NET, PowerShell, and Resource Manager templates. | hdinsight | how-to | hdinsightactive, devx-track-azurepowershell | 05/31/2022 |
Bootstrap scripts allow you to install and configure components in Azure HDInsight programmatically.
You can set configuration file settings while your HDInsight cluster is being created in three ways:
- Use Azure PowerShell
- Use .NET SDK
- Use Azure Resource Manager template
For example, using these programmatic methods, you can configure options in these files:
- clusterIdentity.xml
- core-site.xml
- gateway.xml
- hbase-env.xml
- hbase-site.xml
- hdfs-site.xml
- hive-env.xml
- hive-site.xml
- mapred-site.xml
- oozie-site.xml
- oozie-env.xml
- storm-site.xml
- tez-site.xml
- webhcat-site.xml
- yarn-site.xml
- server.properties (kafka-broker configuration)
For information on installing additional components on an HDInsight cluster at creation time, see Customize HDInsight clusters using Script Action (Linux).
- If using PowerShell, you'll need the Az Module.
The following PowerShell code customizes an Apache Hive configuration:
Important
The parameter Spark2Defaults may need to be used with Add-AzHDInsightConfigValue. You can pass empty values to the parameter as shown in the code example below.
# hive-site.xml configuration
$hiveConfigValues = @{ "hive.metastore.client.socket.timeout"="90s" }
$config = New-AzHDInsightClusterConfig `
| Set-AzHDInsightDefaultStorage `
-StorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
-StorageAccountKey $defaultStorageAccountKey `
| Add-AzHDInsightConfigValue `
-HiveSite $hiveConfigValues `
-Spark2Defaults @{}
New-AzHDInsightCluster `
-ResourceGroupName $existingResourceGroupName `
-ClusterName $clusterName `
-Location $location `
-ClusterSizeInNodes $clusterSizeInNodes `
-ClusterType Hadoop `
-OSType Linux `
-Version "3.6" `
-HttpCredential $httpCredential `
-Config $config
A complete working PowerShell script can be found in the Appendix.
To verify the change:
- Navigate to https://CLUSTERNAME.azurehdinsight.net/ where CLUSTERNAME is the name of your cluster.
- From the left menu, navigate to Hive > Configs > Advanced.
- Expand Advanced hive-site.
- Locate hive.metastore.client.socket.timeout and confirm the value is 90s.
The following samples customize other configuration files:
# hdfs-site.xml configuration
$HdfsConfigValues = @{ "dfs.blocksize"="64m" } #default is 128MB in HDI 3.0 and 256MB in HDI 2.1
# core-site.xml configuration
$CoreConfigValues = @{ "ipc.client.connect.max.retries"="60" } #default 50
# mapred-site.xml configuration
$MapRedConfigValues = @{ "mapreduce.task.timeout"="1200000" } #default 600000
# oozie-site.xml configuration
$OozieConfigValues = @{ "oozie.service.coord.normal.default.timeout"="150" } # default 120
See Azure HDInsight SDK for .NET.
You can use bootstrap in a Resource Manager template:
"configurations": {
"hive-site": {
"hive.metastore.client.connect.retry.delay": "5",
"hive.execution.engine": "mr",
"hive.security.authorization.manager": "org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProvider"
}
}
:::image type="content" source="./media/hdinsight-hadoop-customize-cluster-bootstrap/hdinsight-customize-cluster-bootstrap-arm.png" alt-text="Hadoop customizes cluster bootstrap Azure Resource Manager template":::
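To show where the configurations block sits, the following is a minimal sketch of an HDInsight cluster resource with a hive-site override. It is not a complete, deployable template: the storageProfile and computeProfile sections that a real cluster needs are left out, and the apiVersion value and the parameter names clusterName, clusterLoginUserName, and clusterLoginPassword are assumptions made for illustration. The hive-site value is the same timeout used in the PowerShell example.

```json
{
  "type": "Microsoft.HDInsight/clusters",
  "apiVersion": "2021-06-01",
  "name": "[parameters('clusterName')]",
  "location": "[resourceGroup().location]",
  "properties": {
    "clusterVersion": "3.6",
    "osType": "Linux",
    "clusterDefinition": {
      "kind": "hadoop",
      "configurations": {
        "gateway": {
          "restAuthCredential.isEnabled": true,
          "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
          "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
        },
        "hive-site": {
          "hive.metastore.client.socket.timeout": "90s"
        }
      }
    }
  }
}
```

Each key under configurations names one configuration file without its .xml extension, and each nested key/value pair overrides one setting in that file.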
The following sample Resource Manager template snippet sets spark2-defaults options so that event logs are periodically cleaned up from storage:
"configurations": {
"spark2-defaults": {
"spark.history.fs.cleaner.enabled": "true",
"spark.history.fs.cleaner.interval": "7d",
"spark.history.fs.cleaner.maxAge": "90d"
}
}
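The other configuration files can likely be set the same way. As a sketch, the fragment below carries the hdfs-site, core-site, mapred-site, and oozie-site values from the PowerShell samples into a template. It assumes that each section name is the file name without the .xml extension, as with hive-site and spark2-defaults above. Check the section names against your cluster version before relying on them.

```json
"configurations": {
    "hdfs-site": {
        "dfs.blocksize": "64m"
    },
    "core-site": {
        "ipc.client.connect.max.retries": "60"
    },
    "mapred-site": {
        "mapreduce.task.timeout": "1200000"
    },
    "oozie-site": {
        "oozie.service.coord.normal.default.timeout": "150"
    }
}
```

Several sections can be combined in a single configurations block, so one template can override settings in multiple files at cluster creation.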
- Create Apache Hadoop clusters in HDInsight provides instructions on how to create an HDInsight cluster by using other custom options.
- Develop Script Action scripts for HDInsight
- Install and use Apache Spark on HDInsight clusters
- Install and use Apache Giraph on HDInsight clusters
This PowerShell script creates an HDInsight cluster and customizes a Hive setting. Be sure to enter values for $nameToken, $httpPassword, and $sshPassword.
####################################
# Set these variables
####################################
#region - used for creating Azure service names
$nameToken = "<ENTER AN ALIAS>"
#endregion
#region - cluster user accounts
$httpUserName = "admin" #HDInsight cluster username
$httpPassword = '<ENTER A PASSWORD>'
$sshUserName = "sshuser" #HDInsight ssh user name
$sshPassword = '<ENTER A PASSWORD>'
#endregion
####################################
# Service names and variables
####################################
#region - service names
$namePrefix = $nameToken.ToLower() + (Get-Date -Format "MMdd")
$resourceGroupName = $namePrefix + "rg"
$hdinsightClusterName = $namePrefix + "hdi"
$defaultStorageAccountName = $namePrefix + "store"
$defaultBlobContainerName = $hdinsightClusterName
$location = "East US"
#endregion
####################################
# Connect to Azure
####################################
#region - Connect to Azure subscription
Write-Host "`nConnecting to your Azure subscription ..." -ForegroundColor Green
$sub = Get-AzSubscription -ErrorAction SilentlyContinue
if(-not($sub))
{
Connect-AzAccount
}
# If you have multiple subscriptions, set the one to use
# Select-AzSubscription -SubscriptionId "<SUBSCRIPTIONID>"
#endregion
#region - Create an HDInsight cluster
####################################
# Create dependent components
####################################
Write-Host "Creating a resource group ..." -ForegroundColor Green
New-AzResourceGroup `
-Name $resourceGroupName `
-Location $location
Write-Host "Creating the default storage account and default blob container ..." -ForegroundColor Green
New-AzStorageAccount `
-ResourceGroupName $resourceGroupName `
-Name $defaultStorageAccountName `
-Location $location `
-SkuName Standard_LRS `
-Kind StorageV2 `
-EnableHttpsTrafficOnly $true
# Note: Storage account kind BlobStorage cannot be used as primary storage.
$defaultStorageAccountKey = (Get-AzStorageAccountKey `
-ResourceGroupName $resourceGroupName `
-Name $defaultStorageAccountName)[0].Value
$defaultStorageContext = New-AzStorageContext `
-StorageAccountName $defaultStorageAccountName `
-StorageAccountKey $defaultStorageAccountKey
New-AzStorageContainer `
-Name $defaultBlobContainerName `
-Context $defaultStorageContext #use the cluster name as the container name
####################################
# Create a configuration object
####################################
$hiveConfigValues = @{"hive.metastore.client.socket.timeout"="90s"}
$config = New-AzHDInsightClusterConfig `
| Set-AzHDInsightDefaultStorage `
-StorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
-StorageAccountKey $defaultStorageAccountKey `
| Add-AzHDInsightConfigValue `
-HiveSite $hiveConfigValues `
-Spark2Defaults @{}
####################################
# Create an HDInsight cluster
####################################
$httpPW = ConvertTo-SecureString -String $httpPassword -AsPlainText -Force
$httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName,$httpPW)
$sshPW = ConvertTo-SecureString -String $sshPassword -AsPlainText -Force
$sshCredential = New-Object System.Management.Automation.PSCredential($sshUserName,$sshPW)
New-AzHDInsightCluster `
-ResourceGroupName $resourceGroupName `
-ClusterName $hdinsightClusterName `
-Location $location `
-ClusterSizeInNodes 1 `
-ClusterType Hadoop `
-OSType Linux `
-Version "3.6" `
-HttpCredential $httpCredential `
-SshCredential $sshCredential `
-Config $config
####################################
# Verify the cluster
####################################
Get-AzHDInsightCluster `
-ClusterName $hdinsightClusterName
#endregion