This project automates the provisioning of an Azure Data Lake environment and schedules a periodic refresh of NBA data into the lake. I used Terraform for infrastructure-as-code to create Azure resources (Resource Group, Storage Account, Data Lake Gen2, Synapse Workspace, Key Vault, Function App, Monitor, and related components). An Azure Function, triggered on a timer, fetches NBA data from a configurable API and uploads it to Blob Storage.
- Terraform: Infrastructure-as-code that reliably provisions and configures Azure resources, ensuring consistency and repeatability.
- Azure Data Lake Gen2: Provides a high-performance, hierarchical namespace–enabled storage layer optimized for analytics workloads.
- Azure Synapse Workspace: Offers integrated analytics with both serverless SQL and Spark engines, enabling rapid querying of data stored in the Data Lake for BI and data science scenarios.
- Azure Key Vault: Centralizes and secures application secrets, connection strings, and certificates; integrates with managed identities to enforce least-privilege access without embedding sensitive values in code or Terraform state.
- Azure Function (Python): Serverless compute that runs code on-demand, triggered by a timer for periodic data ingestion without managing servers.
- Application Insights: Provides deep monitoring, distributed tracing, and alerting for the Function App, helping diagnose performance issues and exceptions in real time.
- Azure Monitor (Log Analytics): Collects, aggregates, and analyzes telemetry across resources; enables custom log queries and dashboards for long-term monitoring and operational insights.
Prerequisites:

- Azure Subscription with contributor permissions
- Terraform v1.0+ and AzureRM provider v3.100.0+
- Azure CLI configured to target your subscription
- Python 3.9+ for local testing
- Git for cloning the repo
The Function App expects two application settings:

- `NBA_ENDPOINT`: Sportsdata.io API endpoint
- `SPORTS_DATA_API_KEY`: Sportsdata.io API key
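Inside the Function App these settings are surfaced as environment variables; a minimal local-testing sketch, assuming the names above are also present in `local.settings.json`:

```python
import os

# Application settings appear as environment variables inside the Function App.
endpoint = os.environ["NBA_ENDPOINT"]
# The key may instead be resolved from Key Vault at runtime (see the data flow below).
api_key = os.getenv("SPORTS_DATA_API_KEY")
print(f"NBA endpoint configured: {endpoint}")
```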
- Clone the repository:

  ```bash
  git clone https://github.com/kingdave4/AzureDataLake.git
  cd AzureDataLake
  ```

- Configure variables. Create a `secrets.tfvars` file with:

  ```hcl
  subscription_id    = "<YOUR_SUB_ID>"
  sql_admin_password = "<YOUR_SQL_PASSWORD>"
  apikey             = "<SPORTSDATA_API_KEY>"
  nba_endpoint       = "<NBA_API_ENDPOINT>"
  sp_object_id       = "<TERRAFORM_SP_OBJECT_ID>"
  ```
From the root of the repo, initialize and apply Terraform to provision the infrastructure:

```bash
terraform init
terraform plan -var-file="secrets.tfvars"
terraform apply -var-file="secrets.tfvars" -auto-approve
```

Navigate into the Function App project folder (`myfunctionapp`) and publish the Azure Function:
```bash
cd myfunctionapp
func azure functionapp publish datafunctionapp54
```

Alternatively, deploy from the editor:

- Open the `myfunctionapp` folder in VS Code or Visual Studio.
- Install and sign in with the Azure Functions extension.
- Right-click the function project and select Deploy to Function App.
The Terraform configuration lives in `main.tf` and `variables.tf`:
- Resource Group: `azurerm_resource_group.rg`
- Storage & Gen2: `azurerm_storage_account.ST`, `azurerm_storage_data_lake_gen2_filesystem.fs`
- Container: `azurerm_storage_container.container`
- Synapse: `azurerm_synapse_workspace.syn`
- Key Vault & Secrets: `azurerm_key_vault.kv`, `azurerm_key_vault_secret.*`
- Function App Plan & App: `azurerm_service_plan.func_plan`, `azurerm_linux_function_app.nba_refresh`
- Monitoring: `azurerm_log_analytics_workspace.la`, `azurerm_application_insights.ai`, `azurerm_monitor_diagnostic_setting.func_diagnostics`
variable "resource_group_name" { default = "AzureDatalake-RG" }
variable "location" { default = "East US 2" }
# ... see file for all variablesThe Function is in Python and triggered every 10 minutes ("0 */10 * * * *" in function.json).
```python
import logging, os

import azure.functions as func

from data_operations import fetch_nba_data, upload_to_blob_storage


def main(mytimer: func.TimerRequest):
    logging.info("NBA timer trigger function started...")
    vault = os.getenv("KEY_VAULT_NAME")
    data = fetch_nba_data(vault, "SportsDataApiKey")
    if data:
        upload_to_blob_storage(vault, data)
        logging.info("Data lake refresh complete.")
    else:
        logging.warning("No data fetched; skipping upload.")
```

The data flow inside the helpers:

- Secret Retrieval: `DefaultAzureCredential` + `SecretClient`
- API Call: GET to `NBA_ENDPOINT` with the `Ocp-Apim-Subscription-Key` header
- Blob Upload: Line-delimited JSON to the `nba-datalake` container
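The helper module `data_operations` isn't listed in this README; the sketch below is one plausible implementation of the flow above, not the project's actual code. The `STORAGE_ACCOUNT_URL` setting and the exact function signatures are assumptions; the container, blob path, secret name, and header come from this document.

```python
import json
import os

import requests
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient


def fetch_nba_data(vault_name: str, secret_name: str):
    """Fetch NBA data, using the API key stored in Key Vault."""
    credential = DefaultAzureCredential()
    secrets = SecretClient(
        vault_url=f"https://{vault_name}.vault.azure.net", credential=credential
    )
    api_key = secrets.get_secret(secret_name).value
    resp = requests.get(
        os.environ["NBA_ENDPOINT"],
        headers={"Ocp-Apim-Subscription-Key": api_key},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def upload_to_blob_storage(vault_name: str, data):
    """Write the records as line-delimited JSON to the nba-datalake container."""
    credential = DefaultAzureCredential()
    # Assumption: the storage account URL is available as an app setting.
    service = BlobServiceClient(
        account_url=os.environ["STORAGE_ACCOUNT_URL"], credential=credential
    )
    blob = service.get_blob_client(
        container="nba-datalake", blob="raw-data/nba_player_data.jsonl"
    )
    jsonl = "\n".join(json.dumps(record) for record in data)
    blob.upload_blob(jsonl, overwrite=True)
```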
- Live Logs: `func azure functionapp logstream datafunctionapp54`
- Application Insights: Metrics and traces in the Azure Portal under the Function's Application Insights resource
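Beyond the portal, the Log Analytics workspace can also be queried programmatically. A minimal sketch using the `azure-monitor-query` package, assuming workspace-based Application Insights (the `AppTraces` table) and a placeholder workspace ID:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Pull the last hour of Function traces; <WORKSPACE_ID> is the GUID of the
# Log Analytics workspace provisioned by Terraform.
response = client.query_workspace(
    workspace_id="<WORKSPACE_ID>",
    query="AppTraces | where TimeGenerated > ago(1h) | project TimeGenerated, Message",
    timespan=timedelta(hours=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```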
To query the lake from Synapse, first allow your client IP: in the Synapse workspace, go to Security > Networking, add your IP to the firewall rules, and click Save. You can then query the raw data with serverless SQL:
```sql
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/raw-data/nba_player_data.jsonl',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',
    FIELDQUOTE = '0x0b'
)
WITH (
    line varchar(max)
) AS [result]
```

The `0x0b` terminator/quote trick makes the CSV reader return each JSON line as a single `line` column, which can then be parsed with functions such as `JSON_VALUE`.
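For a quick sanity check outside Synapse, the same file can be read directly with the Blob Storage SDK; a sketch reusing the placeholders from the query above:

```python
import json

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

# Same account/container placeholders as in the Synapse query above.
blob = BlobClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    container_name="<container>",
    blob_name="raw-data/nba_player_data.jsonl",
    credential=DefaultAzureCredential(),
)

# Print the first few line-delimited JSON records.
lines = blob.download_blob().readall().decode("utf-8").splitlines()
for line in lines[:10]:
    print(json.loads(line))
```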
To tear down all resources:

```bash
terraform destroy -var-file="secrets.tfvars" -auto-approve
```

Contributions are welcome:

- Fork the repo
- Create a feature branch
- Open a pull request
MIT © Kingdave4

