Unlock the Full Potential of Your Jira Data: Data Lake Integration with Talend

This article shows you how to unlock the full potential of your Jira data by integrating it into your own data lake using Talend. You will learn which data sources are available, how to extract and process them efficiently, and why having your own data lake makes sense for customized analyses and improved decision-making. Step by step, it explains how the implementation works, which benefits it offers, and how you can make the most of your data.

Motivation

In an era where data-driven decisions significantly influence business success, direct access to your company data is indispensable. Jira, as a central tool for project management and issue tracking, collects a wealth of data that often remains unused. This article shows you why and how you can use your Jira data yourself and what opportunities arise from it. We have selected Talend as the ETL tool, but you can also use another tool.

What Jira Data Does Atlassian Offer?

Companies operating an Atlassian Jira Data Center installation must already provide their own database to run the Jira instance. However, the database schema used by Jira is technically very complex and subject to change by Atlassian, which makes direct access for analytical purposes difficult. It is therefore not advisable to query this database directly. Instead, external ETL tools (such as Talend) or third-party Jira apps like our VIP.LEAN ETL for Reporting are recommended. This solution is successfully used by our customer Samsung for a large Jira instance.

In Jira Cloud, direct access to Jira data is not possible in the Free, Standard, and Premium plans. Only the Enterprise plan offers access to the Atlassian Data Lake (Atlassian Analytics). However, that data is currently only available for Atlassian’s own dashboards, so direct external access to it is not supported.

Reference: "Schema for Jira family of products" (Atlassian Analytics, Atlassian Support).

Why Should You Manage Your Jira Data in Your Own Data Lake?

Here are some reasons:

No Direct Data Access in the Atlassian Cloud: As mentioned in the previous section, direct access to Jira data for your own analyses is limited or unavailable by default.
Customized Analyses: Self-extracted Jira data allows you to create individual reports and dashboards precisely tailored to your needs.
Improved Decision-Making: Detailed insights into projects and processes enable you to make more informed strategic decisions.
Integration with Other Systems: Owning your Jira data facilitates seamless integration with business intelligence tools and other databases.
Data Security and Compliance: Managing your Jira data independently increases security and helps you meet compliance requirements.

Solution Outline with Talend

An ETL solution with Talend could look like this:

Required Solution Components:

One or more Jira instances.
An ETL tool, in this case, Talend Cloud.
A data lake, either in the cloud (GCP/Azure/AWS…) or on-premises.
Possibly other source data you wish to process with the Jira data.
Consuming systems such as Power BI, Tableau, or other tools.

The Jira Data Model - ERD

To store the Jira data in your data lake, a data model must first be defined as an Entity-Relationship Diagram (ERD). It covers two layers:

Staging Layer: In this layer, the REST API JSON responses are stored in raw format so that the original Jira data can be reviewed and reprocessed at any time.
Processed Layer: Define the ERD model for this layer. The model can be based on the structure of the REST API responses. Note, however, that the Data Center version (REST API 2.0) and the Cloud version (REST API 3.0) differ at various points.
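The two layers can be illustrated with a minimal sketch, written here in plain Python rather than as a Talend job. The staging record wraps the unmodified API response together with load metadata, and the processed row picks out a few typed columns. The function names, the instance_id label, and the column names are assumptions made for illustration; the field paths (key, fields.summary, fields.status.name, fields.updated) follow the issue representation returned by the Jira REST API.

```python
import json
from datetime import datetime, timezone


def to_staging_record(instance_id, endpoint, payload, loaded_at=None):
    """Wrap one raw API response for the staging layer; the payload is not modified."""
    loaded_at = loaded_at or datetime.now(timezone.utc)
    return {
        "instance_id": instance_id,  # hypothetical label for the source Jira instance
        "endpoint": endpoint,        # REST resource that produced the payload
        "loaded_at": loaded_at.isoformat(),
        "raw_json": json.dumps(payload, sort_keys=True),
    }


def to_processed_issue(staging_record):
    """Derive a flat row for the processed layer from a staged issue."""
    issue = json.loads(staging_record["raw_json"])
    fields = issue.get("fields", {})
    status = fields.get("status") or {}
    return {
        "instance_id": staging_record["instance_id"],
        "issue_id": issue.get("id"),
        "issue_key": issue.get("key"),
        "summary": fields.get("summary"),
        "status_name": status.get("name"),
        "updated": fields.get("updated"),
    }


# Invented sample data, shaped like a Jira issue response.
sample_issue = {
    "id": "10001",
    "key": "DEMO-1",
    "fields": {
        "summary": "Example issue",
        "status": {"name": "In Progress"},
        "updated": "2024-05-01T10:15:00.000+0000",
    },
}
staged = to_staging_record("jira-dc", "/rest/api/2/issue/DEMO-1", sample_issue)
row = to_processed_issue(staged)
print(row["issue_key"], row["status_name"])
```

Because the processed row is derived only from the staged JSON, the processed layer can be rebuilt at any time without calling Jira again.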

Additionally, further processing steps can be offered to provide added value to the users of the Processed Layer:

Extraction of user data and linking with the Jira user table.
Extraction of different field values: Depending on the type, field values are extracted so that complex parsing of the Jira raw data is not required.
Processing of the changelog: Representation of the entire history of an issue to make changes over time traceable.
Pseudonymization of user data: Replacement of user data with surrogate keys from a central user table, which can be specially protected.
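The pseudonymization step could look roughly like the following Python sketch. It assumes an in-memory registry standing in for the central user table; UserKeyRegistry, pseudonymize, and the "_sk" column suffix are hypothetical names. In practice the identifier would be the accountId in Jira Cloud or the user key in Data Center, and the registry would live in a specially protected table.

```python
import itertools


class UserKeyRegistry:
    """Hypothetical stand-in for the central user table: user id -> surrogate key."""

    def __init__(self):
        self._keys = {}
        self._counter = itertools.count(1)

    def key_for(self, user_id):
        # The same user always receives the same surrogate key.
        if user_id not in self._keys:
            self._keys[user_id] = next(self._counter)
        return self._keys[user_id]


def pseudonymize(row, user_columns, registry):
    """Return a copy of row with user identifiers replaced by surrogate keys."""
    result = dict(row)
    for column in user_columns:
        value = result.pop(column, None)
        result[column + "_sk"] = registry.key_for(value) if value is not None else None
    return result


registry = UserKeyRegistry()
first = pseudonymize(
    {"issue_key": "DEMO-1", "assignee": "alice", "reporter": "bob"},
    ["assignee", "reporter"],
    registry,
)
second = pseudonymize(
    {"issue_key": "DEMO-2", "assignee": "bob", "reporter": None},
    ["assignee", "reporter"],
    registry,
)
print(first)
print(second)
```

Only the registry can map a surrogate key back to a person, so access to the processed rows does not expose user identities.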

Implementation Guide

We recommend the following steps for a successful implementation:

Data Extraction with ETL Tools

Use an ETL tool (Extract, Transform, Load) like Talend to extract data from your Jira instance. You can access all relevant data via the REST API of Jira. Talend offers the possibility to perform REST API queries using Java routines and then start subsequent processes in parallel. It’s important to minimize the load on the Jira instance to avoid affecting its performance.

Loading the results into the target database should also be performed as a bulk operation to ensure high performance.
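The extraction and bulk-load idea can be sketched in Python as follows. The loop assumes the offset-based paging (startAt, maxResults, total) of the search resource in the Data Center REST API; the search endpoints of Jira Cloud have been moving to token-based paging, so the loop would need adapting there. fetch_page is a caller-supplied function (an assumption of this sketch) that performs the actual HTTP request, which keeps the example runnable without a Jira instance. to_ndjson is a hypothetical helper that produces newline-delimited JSON, a format that bulk loaders such as BigQuery load jobs accept.

```python
import json


def fetch_all_issues(fetch_page, jql, page_size=100):
    """Collect all issues for a JQL query by walking startAt/maxResults pages."""
    issues = []
    start_at = 0
    while True:
        page = fetch_page({"jql": jql, "startAt": start_at, "maxResults": page_size})
        batch = page.get("issues", [])
        issues.extend(batch)
        start_at += len(batch)
        # Stop on an empty page or once the reported total has been reached.
        if not batch or start_at >= page.get("total", 0):
            break
    return issues


def to_ndjson(rows):
    """Serialize rows as newline-delimited JSON for a single bulk load."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def fake_fetch_page(params):
    """Test double that imitates a paged search response with five issues."""
    data = [{"key": f"DEMO-{i}"} for i in range(1, 6)]
    start = params["startAt"]
    size = params["maxResults"]
    return {"startAt": start, "maxResults": size, "total": len(data),
            "issues": data[start:start + size]}


all_issues = fetch_all_issues(fake_fetch_page, "project = DEMO", page_size=2)
bulk_payload = to_ndjson(all_issues)
print(len(all_issues))
```

A larger page size means fewer requests against the Jira instance, and writing one file per run lets the target database ingest the result in a single operation instead of row by row.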

Initial Data Loading and Regular Updates

Initial Data Loading: Begin with a complete data extraction of your Jira data to create a comprehensive basis for your analyses.
Regular Updates: Set up regular delta extractions (e.g., every 5 minutes) to keep your data current. The Talend jobs run in Talend Cloud. Assign an adequately sized engine to them to ensure good performance and resource utilization.

The issue data should be updated significantly more frequently (e.g., every 5 minutes) than the administrative data (e.g., every 60 minutes). This ensures that project-related data such as changes to issues, statuses, or comments is captured in near real time, while less dynamic administrative data such as user information or configurations can be refreshed at longer intervals.
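A delta extraction is typically driven by a JQL filter on the updated field. The following Python sketch builds such a filter from the timestamp of the last successful run. build_delta_jql and the overlap_minutes parameter are assumptions of this example. The small overlap is a design choice to avoid missing issues changed while the previous run was in progress; it produces duplicates, so the load step has to be idempotent (for example an upsert on the issue ID).

```python
from datetime import datetime, timedelta


def build_delta_jql(last_run, overlap_minutes=1, project=None):
    """Build a JQL filter for issues updated since the last successful run."""
    since = last_run - timedelta(minutes=overlap_minutes)
    stamp = since.strftime("%Y-%m-%d %H:%M")  # JQL accepts "yyyy-MM-dd HH:mm"
    clauses = []
    if project:
        clauses.append(f'project = "{project}"')
    clauses.append(f'updated >= "{stamp}"')
    return " AND ".join(clauses) + " ORDER BY updated ASC"


last_run = datetime(2024, 5, 1, 10, 15)
jql = build_delta_jql(last_run, overlap_minutes=1, project="DEMO")
print(jql)
# project = "DEMO" AND updated >= "2024-05-01 10:14" ORDER BY updated ASC
```

JQL evaluates date literals in the time zone of the user running the query, so the timestamp stored for the last run should be converted accordingly before it is passed in.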

Processing the Changelog

Read the Changelog: Extract the changelog of the issues to obtain a complete historical view of all changes.
Utilize Native Versioning: Treat the changelog entries as Jira’s native versioning and use them to track how each issue changed over time.
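As a rough illustration, the changelog of an issue (returned by the REST API when the issue is requested with expand=changelog) can be flattened into one row per changed field. The sketch below is plain Python; flatten_changelog and the output column names are assumptions, while histories, items, field, fromString, and toString are the names used in the API response. The author identifier is accountId in Jira Cloud and key in Data Center.

```python
def flatten_changelog(issue):
    """Turn changelog.histories[].items[] into one row per changed field."""
    rows = []
    histories = issue.get("changelog", {}).get("histories", [])
    for history in histories:
        author = history.get("author") or {}
        for item in history.get("items", []):
            rows.append({
                "issue_key": issue.get("key"),
                "history_id": history.get("id"),
                "changed_at": history.get("created"),
                "author_id": author.get("accountId") or author.get("key"),
                "field": item.get("field"),
                "from_value": item.get("fromString"),
                "to_value": item.get("toString"),
            })
    return rows


# Invented sample data, shaped like an issue response with an expanded changelog.
sample = {
    "key": "DEMO-1",
    "changelog": {"histories": [{
        "id": "20001",
        "created": "2024-05-02T10:00:00.000+0000",
        "author": {"accountId": "abc123"},
        "items": [
            {"field": "status", "fromString": "To Do", "toString": "In Progress"},
            {"field": "assignee", "fromString": None, "toString": "Alice"},
        ],
    }]},
}
changes = flatten_changelog(sample)
for change in changes:
    print(change["field"], change["from_value"], "->", change["to_value"])
```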

Use of Surrogate Keys and Business Keys

Jira already provides business keys like the Issue Key or Project Key, which should be used to establish business relationships between entities.

However, to distinguish the individual versions of each data entity on a technical level, it’s advisable to introduce surrogate keys. These surrogate keys allow you to uniquely identify different versions of the same entity (e.g., changes to an issue over time), ensuring consistent and traceable historization (see ERD).
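One possible way to generate such surrogate keys is to hash the business key together with the version timestamp, as in the Python sketch below. This is only one option under stated assumptions (a database sequence would work as well); surrogate_key and its parameters are hypothetical names. The sketch uses the numeric issue ID rather than the issue key, because the key changes when an issue is moved to another project.

```python
import hashlib


def surrogate_key(instance_id, issue_id, version_timestamp):
    """Derive a deterministic surrogate key for one version of an issue."""
    raw = "|".join([instance_id, str(issue_id), version_timestamp])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


v1 = surrogate_key("jira-dc", 10001, "2024-05-01T10:15:00.000+0000")
v2 = surrogate_key("jira-dc", 10001, "2024-05-02T08:30:00.000+0000")
v1_again = surrogate_key("jira-dc", 10001, "2024-05-01T10:15:00.000+0000")
print(v1 != v2, v1 == v1_again)
```

A deterministic key has the advantage that reprocessing the same staged data yields the same keys, so repeated loads do not create duplicate versions.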

Support of Slowly Changing Dimensions

Implement mechanisms to track changes in data over time, which is particularly advantageous for historical analyses. This includes:

Snapshots: Capture the current state of the data at each update. Snapshots allow you to represent the state of the data at a specific point in time and serve as a basis for temporal comparisons.
Changelog: The changelog of the issues enables native versioning and documents all changes to an issue, such as status changes, field updates, or assignments. This allows detailed tracking of the development and history of an issue without requiring a complete snapshot.
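To illustrate how the changelog can replace full snapshots, the following Python sketch turns an ordered list of status changes into validity ranges, in the style of a type 2 slowly changing dimension. status_intervals and the valid_from / valid_to column names are assumptions of this example; the open-ended current version is marked with valid_to set to None.

```python
def status_intervals(created_at, initial_status, status_changes):
    """Build validity ranges from status changes.

    status_changes is a list of (changed_at, new_status) tuples,
    sorted by changed_at in ascending order.
    """
    rows = []
    valid_from, status = created_at, initial_status
    for changed_at, new_status in status_changes:
        # Close the current version at the moment of the change.
        rows.append({"status": status, "valid_from": valid_from, "valid_to": changed_at})
        valid_from, status = changed_at, new_status
    # The last version stays open until the next change arrives.
    rows.append({"status": status, "valid_from": valid_from, "valid_to": None})
    return rows


intervals = status_intervals(
    "2024-05-01T09:00",
    "To Do",
    [("2024-05-02T10:00", "In Progress"), ("2024-05-03T16:30", "Done")],
)
for interval in intervals:
    print(interval)
```

With such ranges, the state of an issue at any point in time can be looked up with a simple range condition instead of storing a snapshot for every update.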

Use of Cloud Databases

Scalable Storage: Store your data in a scalable cloud database like Google BigQuery.
Individual Scaling: Adjust the resources of your cloud database flexibly to your needs.
Security Features: Utilize functions such as masking PII data (Personally Identifiable Information) and security policies to comply with data protection regulations.

Configuration of Processing Chains

Create Processing Chains: Set up processing chains to further process your data after the ETL process.
Data Marts and Dimension Tables: Develop data marts for specific analysis purposes and use tools or scripts to create dimension tables.
Automatic User Referencing: Extract user data and create automatic links to the user table to facilitate user-related analyses.

Integration of Multiple Jira Instances

Consolidate Data: Combine data from multiple Jira instances (e.g., a mix of Data Center and Cloud) into a central data source.
Central Data Source: Obtain a unified view of all projects and processes by merging data from different instances.
Comparative Analyses: Identify differences and similarities between different teams or departments for improved collaboration and transparency.
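When several instances are merged, issue keys are no longer unique on their own, since two instances can both contain a project with the same key. A minimal Python sketch of the consolidation, assuming each row carries an issue_key column and each instance gets a hypothetical instance_id label:

```python
def consolidate(instances):
    """Merge issue rows from several instances, qualifying keys with the instance id."""
    merged = {}
    for instance_id, rows in instances.items():
        for row in rows:
            global_key = f"{instance_id}:{row['issue_key']}"
            merged[global_key] = {**row, "instance_id": instance_id, "global_key": global_key}
    return merged


combined = consolidate({
    "jira-dc": [{"issue_key": "DEMO-1", "status_name": "Done"}],
    "jira-cloud": [{"issue_key": "DEMO-1", "status_name": "To Do"}],
})
print(sorted(combined))
```

Keeping instance_id as a separate column also allows the comparative analyses across instances, teams, or departments.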

Applications and Benefits

Enhanced Reporting: Create reports that go beyond Jira’s standard functions.
Process Optimization: Identify bottlenecks and inefficient workflows to improve processes.
Predictive Analytics: Use historical data to recognize trends and predict future developments.
Resource Management: Plan resources more effectively based on real data rather than estimates.
Big Picture: If your Jira data is linked to other systems like SAP, you can perform comprehensive, cross-system analyses in your data lake. For example, you can compare the planned costs from Jira with the actual costs and commitments from SAP. This provides a holistic view of your projects and allows you to make informed decisions based on consolidated data.

Tips for Getting Started

Transparent Traceability: Store all REST API calls in JSON format in a staging layer to ensure maximum transparency and traceability.
Security First: Always pay attention to security aspects and data protection guidelines when extracting and storing data.
Keep Scalability in Mind: Plan your infrastructure so it can grow with your requirements.

Comparison with Other Solutions

Feature	ETL with Talend & BigQuery	Atlassian Analytics	VIP.LEAN ETL for Reporting
Who implements the solution / Available from	Your company / After development phase (VIP.LEAN Solutions can offer support)	Atlassian / Immediately	VIP.LEAN Solutions / Immediately
Jira DC Support	✓	–	✓
Jira Cloud Support	Free: ✓ Standard: ✓ Premium: ✓ Enterprise: ✓	Free: – Standard: – Premium: – Enterprise: ✓	Free: ✓ Standard: ✓ Premium: ✓ Enterprise: ✓
Data in Your Data Lake	✓	–	✓
Costs	Individual	Included in Cloud Enterprise	From 10.00 USD
Data Structure	Customizable	Fixed	Customizable
Data Level	Technical / Processed Layer	Technical / Processed Layer	Business / Presentation Layer
Pre-built Dashboards	–	✓ (included in Jira)	–
Using Data in Your Own BI Reports	✓	–	✓
Extendable to Jira Apps with REST API (e.g. Tempo)	✓ (e.g. Tempo Data)	✓ (Only App Custom Fields)	✓ (Only App Custom Fields)
DB Support / Planned	BigQuery / Azure, AWS, Oracle, Postgres	–	Azure SQL, Postgres, Oracle / BigQuery & AWS
Link	Book a free Demo	Analytics – Built into the Atlassian platform | Atlassian	VIP.LEAN ETL for Reporting

Get Started

Seamlessly Export Jira Data to BigQuery with Talend (Demo)

Export your Jira data to your own BigQuery data lake using Talend! Request a free live demo session and enjoy a free trial phase if you choose our support.

Request Your Free Demo →