In Amazon Kinesis Data Streams, shards are the fundamental units of data storage and throughput capacity. Shards are the partitions of a data stream and allow data records to be distributed and processed in parallel. Each shard is an independent sequence of data records, and the number of shards in a stream directly determines its capacity for data ingestion and retrieval.
Here are some key points to understand about shards in Amazon Kinesis Data Streams:
Shard Capacity:
Each shard has a fixed capacity, defined by two limits:
Data Ingestion Rate: Each shard accepts writes of up to 1 MB per second or 1,000 records per second, whichever limit is reached first.
Data Retrieval Rate: Each shard supports reads of up to 2 MB per second, shared among the consumers that poll it. With enhanced fan-out, each registered consumer gets its own 2 MB per second per shard.
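As a rough illustration of how these two rates translate into a shard count, the sketch below estimates the minimum number of shards for a given workload. The constants are the commonly documented per-shard limits for provisioned streams (1 MB/s and 1,000 records/s for writes, 2 MB/s for shared reads); treat them as assumptions and check the current service quotas. The function name estimate_shards is hypothetical.

```python
import math

# Assumed per-shard limits (verify against current Kinesis quotas).
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def estimate_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


if __name__ == "__main__":
    # 3.5 MB/s in, 2,000 records/s, 7 MB/s out across all polling consumers.
    print(estimate_shards(3.5, 2000, 7.0))
```

Whichever limit requires the most shards wins, so a workload of many small records can need more shards than its byte rate alone suggests.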
Scalability:
You can adjust the number of shards in a stream to accommodate changes in data volume or processing requirements. Increasing the number of shards allows you to increase the overall capacity of the stream.
Partition Key:
When you put data records into a stream, you provide a partition key along with each record. Kinesis hashes the partition key with MD5 into a 128-bit integer and assigns the record to the shard whose hash key range contains that value. As a result, records with the same partition key go to the same shard and are stored and processed together, for as long as the shard layout stays unchanged.
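The mapping from partition key to shard can be sketched locally without calling AWS. The example below assumes the stream's shards split the 128-bit hash key space evenly, which is typical for a freshly created stream but not guaranteed after resharding. The helper names and the big-endian interpretation of the MD5 digest are assumptions made for illustration.

```python
import hashlib

MAX_HASH_KEY = 2**128 - 1  # upper bound of the 128-bit hash key space


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")  # byte order is an assumption


def even_hash_ranges(shard_count):
    """Split the hash key space into contiguous, equally sized ranges."""
    size = 2**128 // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        # The last range absorbs any remainder so the space is fully covered.
        end = MAX_HASH_KEY if i == shard_count - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the range that contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash ranges do not cover the key space")


if __name__ == "__main__":
    ranges = even_hash_ranges(4)
    for key in ["user-1", "user-2", "user-1"]:
        print(key, "->", shard_for_key(key, ranges))
```

Because the hash depends only on the key, "user-1" maps to the same shard on every call. Two different keys can also map to the same shard.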
Data Distribution:
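Shards enable parallel processing of data. Records are spread across shards according to their partition keys, so multiple consumers can each process a different shard at the same time. Different partition keys can still map to the same shard, so use keys with many distinct values to keep the load even and avoid hot shards.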
Data Ordering:
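Data records within a shard are ordered by the sequence number that Kinesis assigns when each record is ingested. Consumers therefore read the records of a shard in the order in which they were written. Ordering is guaranteed only within a shard, not across the shards of a stream.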
Resharding:
If you need to increase or decrease the capacity of your stream, you can perform a resharding operation. Resharding either splits one shard into two, each covering part of the parent's hash key range, or merges two adjacent shards into one, depending on whether you're scaling up or down.
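To make the split and merge operations concrete, here is a small local sketch of the hash key arithmetic. When you split a shard through the SplitShard API, you supply a NewStartingHashKey that marks where the second child shard begins. The midpoint calculation below is one common way to pick an even split. The function names are hypothetical and nothing here calls AWS.

```python
def split_point(start_hash_key, end_hash_key):
    """Return the starting hash key of the second child for an even split."""
    if end_hash_key <= start_hash_key:
        raise ValueError("range is too small to split")
    return start_hash_key + (end_hash_key - start_hash_key + 1) // 2


def split_ranges(start_hash_key, end_hash_key):
    """Return the two child ranges produced by an even split."""
    mid = split_point(start_hash_key, end_hash_key)
    return (start_hash_key, mid - 1), (mid, end_hash_key)


def can_merge(range_a, range_b):
    """Two shards can be merged only if their hash key ranges are adjacent."""
    return range_a[1] + 1 == range_b[0] or range_b[1] + 1 == range_a[0]


if __name__ == "__main__":
    parent = (0, 2**128 - 1)
    left, right = split_ranges(*parent)
    print(left, right)
    print(can_merge(left, right))
```

An uneven split point is also valid and can be useful when one part of the hash key range receives most of the traffic.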
Scaling Considerations:
When deciding on the number of shards, consider factors such as the anticipated data volume, the number of consumers, and the desired level of parallelism in your processing.
Cost Considerations:
The number of shards directly impacts the cost of using Kinesis Data Streams. With provisioned capacity, you are billed for every hour each shard is open, so sizing the stream to your actual processing needs and removing shards you no longer need is important for cost optimization.
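The relationship between shard count and cost can be sketched as simple arithmetic. The example below covers only the per-shard hourly component of a provisioned stream; other charges, such as those for PUT requests or extended retention, are left out. The price is a parameter rather than a constant, and the value in the usage example is a placeholder, not a real AWS price.

```python
def shard_hour_cost(shard_count, price_per_shard_hour, hours=730):
    """Estimate the shard-hour charge for a period (default: about one month)."""
    if shard_count < 1:
        raise ValueError("a stream has at least one shard")
    return shard_count * hours * price_per_shard_hour


if __name__ == "__main__":
    placeholder_price = 0.02  # hypothetical price per shard-hour
    before = shard_hour_cost(10, placeholder_price)
    after = shard_hour_cost(6, placeholder_price)
    print(f"monthly saving from removing 4 shards: {before - after:.2f}")
```

Because the charge is linear in the number of shards, every shard removed saves the same amount per hour.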
Data Retention and Aging:
A stream retains data records for a configurable retention period. The default is 24 hours, and the period can be extended up to 365 days. Records older than the retention period are automatically aged out and can no longer be read.