Introduction

Yubi is a data-driven company whose strategy is guided by a data governance framework.

Data governance is the overall management framework that ensures data is managed and utilised in a consistent, secure, and compliant manner.

A Data Governance Program is incomplete without a Data Catalog. Any organisation can solve its data mystery or treasure hunt with a properly implemented data catalog. An enterprise-wide data catalog acts as a guide to identifying and managing data assets.

Importance of Cataloguing

Data cataloging is the process of comprehensively and methodically classifying and organising data assets. It entails producing metadata, or data about data, that describes the traits, attributes, and content of those assets. Data cataloging is crucial for a variety of reasons, including:

  • Data Discovery: Data cataloging makes it easier to locate data assets relevant to the needs of users (Business, BA, and DS). This is especially helpful for an organisation like ours, which has several distinct data sources (Athena, Redshift, MongoDB, and Postgres) and types of data.
  • Data Governance: Data cataloging can help us manage and enforce data governance procedures, including compliance, data security, and privacy. By identifying data assets and their characteristics, Yubi can make sure that data is used lawfully and in line with applicable regulations.
  • Data Quality: By offering a single repository for data definitions, descriptions, and lineage information, data cataloging can help Yubi improve the quality of its data and make sure that the data is reliable and consistent.
  • Data Reuse: By giving Yubi a clear view of the data assets that are available and how they may be used, data cataloging helps it reuse data assets more effectively. Avoiding unnecessary collection and analysis of data saves time and money.

Overall, data cataloging is a crucial component of managing data assets at Yubi and can increase the effectiveness, value, and efficiency of the analysis that drives decisions.

Why is Data Cataloguing important for governance?

Data cataloging is important because it helps Yubi to better manage and use its data assets. It allows Yubi and its subsidiaries to:

  • Improve data discovery: Data cataloging makes it easier for users to find and access the data they need, as they can search for specific data sources and metadata using keywords or tags.
  • Enhance data governance: A data catalog can help Yubi enforce data governance policies and ensure that data is being used appropriately and ethically. It can also help Yubi track and audit data access and usage.
  • Improve data quality: Data cataloging can help Yubi ensure that its data is accurate, up-to-date, and complete. It can also help identify and resolve data quality issues.
  • Enhance data literacy: Data cataloging can improve the data literacy of an organisation by making it easier for users to understand the data they are working with and its context.
  • Support data-driven decision making: Data cataloging can help Yubi make more informed decisions by providing a comprehensive view of the available data and the relationships between different data sources.

Overall, data cataloging is a critical part of any data management strategy and can help Yubi and its subsidiaries make better use of their data assets.
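To make the discovery point concrete, here is a minimal sketch of what searching a catalog by keyword or tag can look like. It is not how DataHub or any specific tool implements search; the `CatalogEntry` schema, the `search` function, and the sample entries are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset plus the metadata describing it (hypothetical schema)."""
    name: str
    platform: str  # e.g. "Athena", "Redshift", "MongoDB", "Postgres"
    description: str = ""
    owner: str = ""
    tags: list = field(default_factory=list)


def search(entries, keyword=None, tag=None):
    """Return entries matching a keyword (in name or description) and/or a tag.

    Both filters are case-insensitive; a filter left as None is ignored.
    """
    results = []
    for entry in entries:
        text = f"{entry.name} {entry.description}".lower()
        if keyword is not None and keyword.lower() not in text:
            continue
        if tag is not None and tag.lower() not in [t.lower() for t in entry.tags]:
            continue
        results.append(entry)
    return results


# Invented sample entries; the asset names and owners are placeholders.
catalog = [
    CatalogEntry("loans_daily", "Redshift", "Daily loan disbursement summary",
                 owner="warehouse-lead", tags=["finance", "reporting"]),
    CatalogEntry("user_events", "MongoDB", "Raw product click events",
                 owner="product-owner", tags=["product"]),
    CatalogEntry("loan_applications", "Postgres", "Loan application records",
                 owner="product-owner", tags=["finance"]),
]

finance_assets = search(catalog, tag="finance")
loan_assets = search(catalog, keyword="loan")
```

The point of the sketch is that discovery depends entirely on the metadata being present: an asset with no description or tags can only be found by its name.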

Tools Selection

We investigated multiple tools and shortlisted two open-source options, Apache Atlas and DataHub. Each tool has its own advantages and disadvantages. The choice was made based on ease of maintenance and the number of data sources each tool supports.

Apache Atlas provides open metadata management and governance capabilities for organisations to build a catalog of their data assets, classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team.

DataHub is an open-source extensible metadata platform, enabling data discovery, data observability, and federated governance.

We ended up using DataHub since it supports almost all the databases we use in our organisation. In addition to supporting Yubi’s data sources, it also supports dbt lineage, data quality, and profiling. We chose DataHub as our data catalog tool based on these three factors.
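For readers unfamiliar with how a source is connected to DataHub, the sketch below builds an ingestion recipe (one source, one sink) as a plain Python dictionary for a Postgres database. The source is not Yubi's actual configuration: the helper name `build_postgres_recipe` and every connection value are placeholders, and the field names follow the DataHub recipe layout as we understand it, so check them against the DataHub documentation for your version.

```python
def build_postgres_recipe(host_port, database, username, password, datahub_server):
    """Build a DataHub-style ingestion recipe as a plain dict (hypothetical helper).

    A recipe pairs one metadata source with one sink that receives the metadata.
    """
    return {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": host_port,
                "database": database,
                "username": username,
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


# Placeholder values only; real credentials should come from a secret store.
recipe = build_postgres_recipe(
    host_port="localhost:5432",
    database="example_db",
    username="catalog_reader",
    password="change-me",
    datahub_server="http://localhost:8080",
)

# With the DataHub Python package installed, a dict like this is typically
# handed to its ingestion pipeline (Pipeline.create(recipe), then run()).
# That step is left out here so the sketch runs without extra dependencies.
```

Other sources such as Athena, Redshift, or MongoDB would use their own source type and connection fields, which is why source coverage mattered in the tool choice.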

How to Drive Adoption?

Yubi’s strategies to drive adoption of a data catalog:

  • Make it easy to use: The data catalog should be user-friendly and intuitive, with a simple interface that makes it easy for users to find and access the data they need.
  • Promote the benefits: Clearly communicate the benefits of using the data catalog, such as improved data discovery, enhanced data governance, and better data quality.
  • Encourage participation: Encourage all users to contribute to the data catalog by adding new data sources and metadata, and by providing feedback and suggestions for improvement.
  • Provide training and support: Offer training and support to help users get started with the data catalog and learn how to use it effectively.
  • Integrate the data catalog into existing workflows: Make the data catalog an integral part of the organisation’s data management processes and workflows, so that it becomes a natural and essential part of users’ daily work.
  • Engage with key stakeholders: Work with key stakeholders, such as data owners, data stewards, and data users, to understand their needs and ensure that the data catalog meets their requirements.

Overall, the key to driving adoption of a data catalog is to make it easy to use, promote the benefits, and engage with key stakeholders to ensure that it meets the needs of the organisation.

Stakeholders in Data Cataloguing 

Multiple stakeholders must be involved in the creation, administration, and maintenance of a data catalogue, including:

  • Data Owners (Data Warehouse Lead, Product Owner): These people or teams are in charge of creating, using, and maintaining particular data sets. They ensure that the data is correct, current, and securely stored.
  • Data Stewards: These individuals or teams serve as a bridge between data owners and users, ensuring that data is appropriately categorised, documented, and available to the right people. They may also enforce the rules and policies pertaining to data governance.
  • Data Governance Committee (Data Governance Team): This team develops and implements data governance rules and procedures, and examines new data sets to ensure they are correctly classified and presented.
  • Data Users (Business Analytics, Data Science, and Product Development teams): These individuals or groups make use of the catalogued data. Users may be internal, such as employees, or external, such as clients or partners, and they may use the data for a variety of purposes, such as market research and business analysis.
  • DevOps Team: This group is in charge of the technical aspects of the data catalogue, including upkeep of the supporting infrastructure, backup and restore, compliance, performance, and security. They ensure that the data catalogue is accessible to and usable by all parties.

Each of these stakeholders is essential to ensuring that the data catalogue is reliable, current, and available to authorised users. By cooperating well and communicating clearly, they can turn the data catalogue into a useful resource for the company.

Our Learning

Key Takeaways for Implementing a Data Catalog in Yubi:

  • Clarify the Objective and Scope: Clearly define the purpose and scope of the data catalog, including what types of data it will encompass and who it will serve. This will ensure that the catalog aligns with Yubi’s goals and stakeholders.
  • Gather Data Sources and Metadata: Identify all relevant data sources and gather the necessary metadata to describe them, such as format, quality, lineage, and any related business or technical context.
  • Establish Governance: Create a governance model for the data catalog, including responsibilities for data owners, stewards, and users, to maintain and update the catalog over time.
  • Choose the Right Platform: Choose a data catalog platform that fits Yubi’s requirements, taking into consideration factors such as capabilities, scalability, security, and compatibility with other systems.
  • Fill the Catalog: Populate the data catalog with the identified data sources and metadata, which could involve importing data from existing systems or manually entering it into the catalog.
  • Encourage Adoption: Promote the data catalog to users, provide training and support, and invite users to add new data and give feedback.

In conclusion, a successful implementation of a data catalog in Yubi requires a thorough approach and attention to detail to meet the needs of Yubi and its subsidiaries.
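One way to keep the "gather metadata" and "fill the catalog" steps honest is to check each entry for required metadata before loading it. The sketch below shows that idea under stated assumptions: the list of required fields, the function names, and the sample entries are hypothetical, not Yubi's actual rules.

```python
# Hypothetical set of fields an entry must have before it enters the catalog.
REQUIRED_FIELDS = ("source", "format", "lineage", "owner", "description")


def missing_metadata(entry):
    """Return the required fields that are absent, None, or blank in an entry."""
    gaps = []
    for field_name in REQUIRED_FIELDS:
        value = entry.get(field_name)
        if value is None or not str(value).strip():
            gaps.append(field_name)
    return gaps


def split_by_completeness(entries):
    """Separate entries that are ready to load from those needing follow-up.

    Returns (ready_names, incomplete), where incomplete maps an entry name
    to the list of fields its owner or steward still has to supply.
    """
    ready, incomplete = [], {}
    for entry in entries:
        gaps = missing_metadata(entry)
        if gaps:
            incomplete[entry["name"]] = gaps
        else:
            ready.append(entry["name"])
    return ready, incomplete


# Invented sample entries for illustration.
entries = [
    {"name": "loans_daily", "source": "Redshift", "format": "table",
     "lineage": "dbt model", "owner": "warehouse-lead",
     "description": "Daily loan summary"},
    {"name": "user_events", "source": "MongoDB", "format": "collection",
     "lineage": "", "owner": None, "description": "Raw click events"},
]

ready, incomplete = split_by_completeness(entries)
```

The `incomplete` mapping doubles as a to-do list for data owners and stewards, which ties the population step back to the governance model.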

Future of Data Catalog

Data cataloging is expected to adopt more advanced technologies and methods in the future to become more effective and efficient.

Data cataloging could undergo the following developments:

  • Increased automation: Data cataloging tools may become increasingly automated, making it simpler for users to find, comprehend, and use data. Machine learning techniques may be used here to extract information and enhance data discovery.
  • Connecting with other systems: Users may be able to access and use data from many sources more easily as data catalogs become increasingly connected with other systems, such as data lakes and data warehouses.
  • Data governance enhancement: Data catalogs may advance in terms of data governance, making it simpler for businesses to enforce data policies and monitor data access and usage.
  • Greater interoperability: Data catalogs may become more interoperable, enabling users to find and access data from other systems and organisations with greater ease.

Overall, advanced technologies and methods are expected to make data cataloging more effective and efficient while helping companies make better use of their data assets.