
The importance of data ingestion and integration for enterprise AI

IBM Journey to AI blog

In both the generative AI and traditional AI development cycles, data ingestion serves as the entry point. Here, raw data tailored to a company’s requirements is gathered, preprocessed, masked and transformed into a format suitable for LLMs or other models. One potential solution is to use remote runtime options.
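As a rough sketch of the gather/preprocess/mask step described above, the snippet below masks basic PII with regex rules and normalizes raw records into a model-ready shape. The patterns, field names, and ingest() helper are illustrative assumptions, not part of IBM's actual pipeline.

```python
import re

# Hypothetical masking rules; a production pipeline would use a proper PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the text reaches an LLM."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def ingest(records: list[dict]) -> list[dict]:
    """Gather raw records, mask sensitive fields, and emit a model-ready format."""
    return [
        {"id": r["id"], "text": mask_pii(r["raw_text"]).strip().lower()}
        for r in records
        if r.get("raw_text")
    ]

if __name__ == "__main__":
    raw = [{"id": 1, "raw_text": "Contact jane.doe@example.com about invoice 42."}]
    print(ingest(raw))
```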


Data4ML Preparation Guidelines (Beyond The Basics)

Towards AI

Table: Research Phase vs. Production Phase Datasets. The contrast highlights the “production data” we’ll call “data” in this post. Data is a key differentiator in ML projects (more on this in my blog post below): we don’t have better algorithms; we just have more data. It involves the following core operations.
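As a hedged illustration of typical core data-preparation operations, the sketch below shows deduplication, length filtering, and a train/validation split; these particular operations are assumptions rather than the author's own list.

```python
import hashlib
import random

def dedupe(samples: list[str]) -> list[str]:
    """Drop exact duplicates by content hash."""
    seen, out = set(), []
    for s in samples:
        h = hashlib.sha256(s.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(s)
    return out

def filter_short(samples: list[str], min_chars: int = 20) -> list[str]:
    """Remove fragments too short to be useful training examples."""
    return [s for s in samples if len(s) >= min_chars]

def split(samples: list[str], train_frac: float = 0.9, seed: int = 0):
    """Deterministic train/validation split."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```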




Simplify automotive damage processing with Amazon Bedrock and vector databases

AWS Machine Learning Blog

This metadata includes details such as make, model, year, area of the damage, severity of the damage, parts replacement cost, and labor required to repair. The information contained in these datasets—the images and the corresponding metadata—is converted to numerical vectors using a process called multimodal embedding.
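A hedged sketch of that multimodal embedding step follows, assuming the Amazon Titan Multimodal Embeddings model on Bedrock; the model ID, request fields, and metadata keys are assumptions based on Bedrock's documented Titan interface rather than details from the article, so verify them against current AWS documentation.

```python
import base64
import json
import boto3

# Assumes AWS credentials are configured and Bedrock access is enabled.
bedrock = boto3.client("bedrock-runtime")

def embed_damage_record(image_path: str, metadata: dict) -> list[float]:
    """Embed a damage photo together with its textual metadata into one vector."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Flatten the structured metadata (make, model, severity, ...) into text.
    metadata_text = ", ".join(f"{k}: {v}" for k, v in metadata.items())

    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed Titan Multimodal Embeddings model ID
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": metadata_text, "inputImage": image_b64}),
    )
    return json.loads(response["body"].read())["embedding"]

# Example usage (requires a local image file and AWS access):
vector = embed_damage_record(
    "front_bumper.jpg",
    {"make": "Toyota", "model": "Camry", "year": 2021, "severity": "moderate"},
)
```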


How Deltek uses Amazon Bedrock for question and answering on government solicitation documents

AWS Machine Learning Blog

Deltek is continuously working on enhancing this solution to better align it with their specific requirements, such as supporting file formats beyond PDF and implementing more cost-effective approaches for their data ingestion pipeline. The first step is data ingestion, as shown in the following diagram. What is RAG?
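As a generic sketch of a RAG data ingestion step like the one mentioned above, the snippet below extracts text from a PDF, chunks it with overlap, embeds each chunk, and stores it with source metadata. The embed callable and index object are placeholders, not Deltek's or Bedrock's actual components.

```python
from pypdf import PdfReader

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks with overlap so context spans chunk borders."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest_pdf(path: str, embed, index) -> None:
    """Read a solicitation PDF, chunk it, embed chunks, and store them with metadata."""
    text = "".join(page.extract_text() or "" for page in PdfReader(path).pages)
    for i, piece in enumerate(chunk(text)):
        # embed() and index.add() stand in for an embeddings model and a vector store.
        index.add(vector=embed(piece), metadata={"source": path, "chunk": i, "text": piece})
```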


Drive hyper-personalized customer experiences with Amazon Personalize and generative AI

AWS Machine Learning Blog

You follow the same process of data ingestion, training, and creating a batch inference job as in the previous use case. Returning recommendations along with their metadata makes it easier to provide additional context to LLMs. You can also use this approach in sequential chains.
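A minimal sketch of that pattern follows: recommendations returned with item metadata are folded into an LLM prompt as grounding context. The recommendation-retrieval step is omitted, and build_prompt(), the metadata keys, and call_llm() are hypothetical names, not Amazon Personalize or Bedrock APIs.

```python
def build_prompt(user_name: str, recommendations: list[dict]) -> str:
    """Turn recommended items plus their metadata into grounded prompt context."""
    lines = [
        f"- {r['title']} (genre: {r['genre']}, year: {r['year']})"
        for r in recommendations
    ]
    return (
        f"Write a short, personalized email to {user_name} highlighting these "
        f"recommended movies:\n" + "\n".join(lines)
    )

# Example usage with made-up recommendation output:
recs = [{"title": "Example Movie", "genre": "sci-fi", "year": 2019}]
prompt = build_prompt("Alex", recs)
# response = call_llm(prompt)  # e.g., a model invocation in the real workflow
```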


Data architecture strategy for data quality

IBM Journey to AI blog

The next generation of big data platforms and long-running batch jobs operated by a central team of data engineers have often led to data lake swamps. Both approaches were typically monolithic, centralized architectures organized around the mechanical functions of data ingestion, processing, cleansing, aggregation, and serving.


Dive deep into vector data stores using Amazon Bedrock Knowledge Bases

AWS Machine Learning Blog

Role of metadata while indexing data in vector databases: metadata plays a crucial role when loading documents into a vector data store in Amazon Bedrock. Metadata fields can serve as identifiers used to uniquely reference and retrieve specific documents from the vector data store.
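One way this is commonly done is with a sidecar metadata file stored alongside each source document; the sketch below assumes Amazon Bedrock Knowledge Bases' <document>.metadata.json convention, and the attribute names are hypothetical, so check the current documentation for the exact schema.

```python
import json

def write_metadata_sidecar(document_key: str, attributes: dict) -> str:
    """Write the sidecar metadata file associated with a source document."""
    # Assumed convention: "<document>.metadata.json" next to the document in S3,
    # with attributes wrapped in a "metadataAttributes" object.
    sidecar_path = f"{document_key}.metadata.json"
    with open(sidecar_path, "w") as f:
        json.dump({"metadataAttributes": attributes}, f, indent=2)
    return sidecar_path

# Example usage with hypothetical attribute names:
write_metadata_sidecar(
    "solicitation-2024-001.pdf",
    {"doc_id": "solicitation-2024-001", "department": "procurement", "year": 2024},
)
```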