Terraform defaults slowly filling the Azure Function storage account.

When using Azure Functions you need an underlying storage account: blob storage holds your code files, and storage tables back your durable functions. However, when you use Terraform to deploy your infrastructure, there is a strange default behavior you need to be aware of.

TLDR;

When using the Terraform azurerm_function_app resource, make sure to set enable_builtin_logging to false, as sketched below; otherwise the storage table capacity consumed by your Azure Function grows without bound.
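
A minimal sketch of the fix (the other required attributes are omitted here; the full resource appears later in this post):

resource "azurerm_function_app" "hosting" {
  # ... name, resource group, app service plan, storage settings ...

  # Opt out of the deprecated built-in logging, which defaults to true.
  enable_builtin_logging = false
}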

Looking at the signs

Keeping an eye on the resource consumption and finances of your solution is part of your job as a DevOps engineer. You know best what is expected, and therefore what normal behavior looks like, and what is unexpected and abnormal. The other day I was going over a solution for a customer and was blown away by storage costs that were slowly increasing over time. The solution was designed for high throughput but low storage: the Azure Functions processed over 40 million messages an hour, but should only store faults in a fault-processor queue for retrying and then purge them. Yet the storage capacity showed a staggering climb.

The table storage averaged over 635 GB for the past 14 days, and at the time of discovery it was over 700 GB. This storage account runs on the standard tier, which makes transactions, compared to the premium tier, rather expensive. Looking at the daily costs of the resource, write operations make up the majority. If I can reduce the storage, we can move to the premium tier, trading a higher cost for storage against a lower cost for operations. With 700 GB and growing, that is not an option.

How did I end up with it

So, where does the storage increase come from? When Azure Functions first introduced logging, it depended on the AzureWebJobsDashboard setting. To activate it, you added an entry to your app configuration holding a storage connection string it could use to store logging data, as sketched below. That would create two tables: one called AzureWebJobsHostLogscommon and another called AzureWebJobsHostLogs<Date>, with a new one added for every month the function runs.
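
A hedged sketch of how that legacy setup looked when configured by hand (the resource names and references here are illustrative):

resource "azurerm_function_app" "hosting" {
  # ... required attributes ...

  app_settings = {
    # The mere presence of this setting activates the deprecated
    # dashboard logging against the referenced storage account.
    "AzureWebJobsDashboard" = azurerm_storage_account.function.primary_connection_string
  }
}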

Obviously this will increase your storage capacity slowly but continuously if left unmanaged.

The strange thing is that I did not enable this: we use Application Insights for logging, and this option has been marked as deprecated ever since the introduction of Application Insights for Azure Functions. So where did it come from? There were only two possible sources: the application deployment (it did not come from there) or the infrastructure deployment from Terraform. At first glance, there is nothing too fishy going on there.

resource "azurerm_function_app" "hosting" {
  name                      = var.function_app_name
  resource_group_name       = module.environment.resource_group_name
  location                  = var.specific_location
  app_service_plan_id       = azurerm_app_service_plan.hosting.id
  storage_connection_string = azurerm_storage_account.function.primary_connection_string
  https_only                = true
  version                   = "~2"
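  # Note that enable_builtin_logging is not set anywhere in this block.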
  
  identity {
    type = "SystemAssigned, UserAssigned"
    identity_ids = ["/subscriptions/XXXXXXXX/resourceGroups/${module.environment.resource_group_name}/providers/Microsoft.ManagedIdentity/userAssignedIdentities/${var.identity_name}"]
  }
}

When you look into the azurerm_function_app resource in the GitHub repository of the azurerm provider, you can see that AzureWebJobsDashboard is set using the same connection string as the required AzureWebJobsStorage setting.
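
In effect, with the default left in place, the provider configures the function app as if you had written something like the following yourself (an illustration of the behavior, not the provider's literal code):

# Injected by the provider when enable_builtin_logging is true (the default):
app_settings = {
  "AzureWebJobsStorage"   = azurerm_storage_account.function.primary_connection_string
  "AzureWebJobsDashboard" = azurerm_storage_account.function.primary_connection_string
}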

The enable_builtin_logging attribute can be found on the azurerm_function_app documentation page with a short explanation. While from that description it is not entirely clear that it refers to the AzureWebJobsDashboard, from the code it is clear that this is what it enables.

The reason the boolean defaults to true instead of the usual false is, I think, to avoid a breaking change in the resource for people who implemented it back when it was the only option. These days this is a bit strange, because you want your monitoring done in Application Insights instead of in an unmanaged, ever-growing, deprecated feature that you did not opt into by choice.

Conclusion

Besides what is described in the TLDR: by fixing the Terraform resource and deleting all created AzureWebJobsHostLogs tables, the capacity decreased from 700 GB to 20 MB, making it worth investigating the shift to a premium account. The reduction in capacity alone is worth over €14 a day, more than €5,000 a year.

Cost can be a useful signal for identifying wrong behavior in a solution, behavior that may stay hidden at the small scale of usage during development. In the cloud consumption model there are some great tools to give you insight into where your money went; with on-premise solutions this is often a lot harder to determine. Make use of cost to identify unexpected behavior, and be aware of the unexpected default behavior of Terraform.

Have fun,

Erick
