An earlier study by the Massachusetts Institute of Technology found that, for some companies, big data is turning into bad data and may cost them up to 25% of their revenue, because they keep reusing bad data and paying for it in operating expenses.
Today, data has become corporate currency, but poorly managed data can quickly get out of control.
Dealing with large amounts of data can be a challenge for companies, and it will only get harder as more data is generated and collected. This is why data management, or data governance, is so important today.
Research firm Gartner describes master data management, which it treats as a form of data governance, as “a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets.”
Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise (including customers, prospects, citizens, suppliers, sites, hierarchies, and charts of accounts).
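To make the definition concrete, here is a minimal sketch of what a master record could look like: one identifier per core entity, plus extended attributes gathered from several systems. The class name, field names, and sample rows are hypothetical and only illustrate the idea; this is not how any particular product models master data.

```python
from dataclasses import dataclass, field


@dataclass
class MasterRecord:
    """One core entity (hypothetical shape): an identifier plus extended attributes."""
    entity_id: str
    entity_type: str
    attributes: dict = field(default_factory=dict)


def consolidate(source_rows):
    """Merge rows from different systems that share the same entity_id.

    The first value seen for an attribute wins; later rows only fill gaps.
    """
    master = {}
    for row in source_rows:
        record = master.setdefault(
            row["entity_id"], MasterRecord(row["entity_id"], row["entity_type"])
        )
        for key, value in row["attributes"].items():
            record.attributes.setdefault(key, value)
    return master


# Hypothetical rows coming from a CRM, a billing system, and a supplier list.
rows = [
    {"entity_id": "CUST-001", "entity_type": "customer", "attributes": {"name": "Acme Ltd"}},
    {"entity_id": "CUST-001", "entity_type": "customer", "attributes": {"name": "ACME", "country": "US"}},
    {"entity_id": "SUP-042", "entity_type": "supplier", "attributes": {"name": "Bolt Co"}},
]
master = consolidate(rows)
```

A "first value wins" rule is only one possible policy. Real governance programs usually decide, attribute by attribute, which source system is authoritative.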
Data governance is mainly a solution used inside enterprises. Most of the leaders in this field are traditional software companies, and most of them have already moved to the cloud to some extent. Gartner also believes that data governance itself will move to cloud computing in the next few years.
Competition in this field is fierce, so we have narrowed the scope and listed 10 major market players in the world. Most are traditional data vendors, while others are newcomers to the market.
The world's top ten data governance solution service providers
(1) Amazon Web Services (AWS)
AWS built its data governance offerings on top of its Simple Storage Service (S3). They include Elastic MapReduce and Athena, a metered query engine for data stored in S3. To configure an enterprise's cloud environment, AWS CloudFormation lets enterprises use simple text files to model and provision all the resources their applications need. AWS Systems Manager lets companies monitor all resources and automatically perform common operational tasks. Firebolt, a cloud data warehouse from a separate vendor, promises very fast query speeds for demanding data challenges.
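As an illustration of what "simple text files" means for CloudFormation, the sketch below builds a small template as JSON. The logical ID `GovernedDataBucket` and the description are made up for this example; the resource type `AWS::S3::Bucket` and the `VersioningConfiguration` property are standard CloudFormation names. The script only generates the text and does not deploy anything.

```python
import json


def build_template(logical_id):
    """Return a minimal CloudFormation template describing one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example bucket for governed data (illustrative only)",
        "Resources": {
            # logical_id is a hypothetical name chosen for this example.
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Keep earlier object versions so changes can be audited.
                    "VersioningConfiguration": {"Status": "Enabled"}
                },
            }
        },
    }


template_text = json.dumps(build_template("GovernedDataBucket"), indent=2)
print(template_text)
```

The printed text could be saved to a file and submitted to CloudFormation. The same structure can also be written in YAML.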
(2) IBM
As a long-established mainframe manufacturer, IBM has extensive experience in data governance. It offers standalone DBMS products, including various versions of DB2, IBM PureData System for Analytics, DB2 Analytics Accelerator, and Hadoop, as well as IBM BigInsights, the DataFirst Method, and IBM Watson Data Platform. Its main data management system is IBM Information Server, which provides unified management of data. It helps users find and search assets, explore relationships between assets, search unstructured data sources and structured databases, and automatically discover new data.
(3) Redshift
Redshift is a typical shared-nothing design with locally mounted storage. It makes full use of the basic services of AWS: EC2 instances serve as compute nodes, and S3 is used for storage and failure recovery. Its advantage is outstanding performance through tuning and customization, but its architecture also means that computing and storage cannot be scaled independently.
It supports loading data from multiple data sources and can integrate streaming data, but it only supports structured data. It can query data on S3 directly without ETL. It supports a PostgreSQL dialect, although some data types and functions are not available. Redshift monitors the health of its components and restores them automatically, while the user is responsible for other maintenance work. Daily operation and maintenance tasks are performed manually by the user in the console.
(4) Snowflake
Snowflake is a shared-storage design in which storage and computing are separated. It is built on AWS and makes full use of AWS's basic service capabilities: EC2 instances serve as compute nodes with local caching, and data tables are stored in S3. It introduces the concept of a "virtual warehouse": each query can be assigned to a different virtual warehouse, and different resources are allocated to different warehouses. Warehouses do not affect one another's performance, and each warehouse is highly elastic and can automatically add computing resources.
It supports structured and semi-structured data, which can be ingested without ETL or preprocessing. Although streaming data is not supported natively, you can connect to Spark to receive streaming data. It uses standard SQL with appropriate extensions. Maintenance is relatively simple: there is no need to maintain indexes, clean up data, and so on.
(5) BigQuery
BigQuery is designed to separate storage and computing. It uses Google's basic service capabilities and stores data in the Colossus file system. Its working mechanism is to convert SQL queries into low-level instructions and execute them in sequence. It completely abstracts away the provisioning, allocation, maintenance, and scaling up and down of resources, all of which Google handles automatically. This makes it well suited to scenarios where ease of use is the main requirement. Storage automatically allocates shards according to processing scale and load. Computing resources are not dedicated and are shared by internal and external customers, so you cannot explicitly control the resource usage of a single query. Billing is based on computation, measured in TB of data "processed".
It supports standard SQL, semi-structured data types, and external tables. It supports loading or direct access from Google Cloud and can also ingest data streams. There are no indexes, and almost no maintenance is needed apart from data management.
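Billing by data "processed" comes down to simple arithmetic, sketched below. The price per TB is a placeholder passed in by the caller, not an actual Google rate, and treating one TB as 2**40 bytes is an assumption of this example.

```python
BYTES_PER_TB = 2 ** 40  # assumption for this sketch: 1 TB counted as 2**40 bytes


def estimate_query_cost(bytes_processed, price_per_tb):
    """Estimate the charge for one query from the bytes it processed.

    price_per_tb is a hypothetical rate supplied by the caller.
    """
    if bytes_processed < 0 or price_per_tb < 0:
        raise ValueError("bytes_processed and price_per_tb must be non-negative")
    return bytes_processed / BYTES_PER_TB * price_per_tb


# A query that processes 2 TB at a placeholder rate of 5.0 per TB.
cost = estimate_query_cost(2 * BYTES_PER_TB, 5.0)
```

Because the charge follows the data a query processes rather than how long it runs, selecting fewer columns or filtering on partitions is the usual way to keep the bill down under this model.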
(6) Teradata
Teradata is also known for its analytics platform, which includes a DBMS, data warehouse appliances, and a cloud data warehouse. It connects to Hadoop through Aster Analytics and streams data through Teradata Listener, all of which are designed to present information through a unified interface.
(7) Cloudera
Cloudera is one of the three largest Hadoop distribution companies and is very successful in this area. It provides Cloudera Enterprise, a Hadoop distribution that includes Hadoop for batch analysis, Spark for real-time analysis, Cloudera Navigator for governance, and Cloudera Manager and Cloudera Director for on-premises deployment and cluster management in the cloud. It supports cloud platforms such as AWS, Microsoft Azure, and Google Cloud.
(8) Dell Boomi
Boomi is a business unit acquired by Dell in 2010 that specializes in on-premises and cloud master data management. Boomi offers development with little or no coding through its Boomi process library, which provides examples for building governance applications. It also supports PaaS vendors with connectors for Microsoft Azure, AWS, and Google, provides EDI connectors for connecting with partners, and supports Docker containers for DevOps development methods.
(9) SAS
SAS's whole business is built on analytics. It offers a master data management solution called SAS Data Governance, which helps companies prepare and manage traditional data sources and big data sources. It lets companies maintain and manage data attributes through a common data model, flag changes in metadata, create snapshots, store and manage lists and hierarchies, and create reports on data health and required remediation.
(10) TIBCO Software
TIBCO MDM specializes in providing a unified view of corporate data stored in different silos, enabling companies to see their business data clearly and take action quickly. TIBCO MDM can also visualize data workflows within the enterprise, so that the enterprise can see each process and make improvements as needed.