<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>
ResourceSync Sitemap-based Approach
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="resync.css" media="screen, projection" />
<meta name="robots" content="all" />
</head>
<body>
<p>
<a name="top" id="top"></a>
</p>
<div id="mainContent">
<h1>
<a name="resync_protocol" id="resync_protocol"></a>ResourceSync Sitemap-based Approach
</h1>
<!-- TOC -->
<blockquote>
<b>Jump to:</b><br />
<a href="#definitions">1. Definitions</a><br />
<a href="#baseline_sync">2. Baseline Synchronization</a><br />
<a href="#source_capability">3. Source Capabilities Description</a><br />
<a href="#change_communication">4. Change Communication</a><br />
<a href="#dump_packaging">5. Dump Packaging</a><br />
<a href="#resource_transfer_override">6. Resource Transfer Override</a><br />
</blockquote>
<!-- Intro -->
<p>
ResourceSync is a synchronization framework for Web resources. It allows a <em>Destination</em> to keep in sync with resource changes in a <em>Source</em>.
This document describes the XML Schemas for the ResourceSync components and the messages that are exchanged between Source and Destination.
</p>
<br />
<br />
<br />
<br />
<!-- DEFINITIONS -->
<h2>
<a name="definitions" id="definitions"></a>1. Definitions
</h2>
<ul>
<li><strong>Resource</strong>: an object to be synchronized, a web resource <span class="todo"> identified by a de-referencable URI?</span></li>
<li><strong>Metadata</strong>: information about Resources such as URI, modification time,
checksum, etc. (Not to be confused with Resources that may
themselves be metadata about another resource, e.g. a DC record)</li>
<li><strong>Source</strong>: system with original or master Resource</li>
<li><strong>Destination</strong>: system to which Resources from the Source will be
copied and kept in synchronization</li>
<li><strong>Pull</strong>: process to get information from Source to Destination initiated
by the Destination.</li>
<li><strong>Push</strong>: process to get information from Source to Destination initiated
by the Source.</li>
</ul>
<p>Three distinct needs are considered in scope for the effort:</p>
<ol>
<li><strong>Baseline Synchronization</strong>: Allows a Destination to perform an
initial synchronization with a Source. (We consider only Pull
methods in scope.)</li>
<li><strong>Incremental Synchronization</strong>: Allows a Destination to remain
synchronized with the Source by following changes at the
Source. Two aspects:
<ol type="a">
<li><strong>Change Communication</strong>: Allows a Destination to understand that a
Resource has changed, and what the nature of that change event
is.</li>
<li><strong>Resource Transfer</strong>: Allows a Destination to update its holdings to
reflect a change in a Resource at the Source.</li>
</ol>
</li>
<li><strong>Audit</strong>: Allows checking whether a Destination is in synchronization
with a Source. (We consider only Pull methods in scope.)</li>
</ol>
<p>Additional terminology/concepts:</p>
<ul>
<li><strong>Change Memory</strong>: A record of changes, perhaps as ChangeSets</li>
<li><strong>Resource Memory</strong>: A record or archive of past Resource states</li>
<li><strong>ChangeSet</strong>: a set of events used for Change Communication</li>
<li><strong>Dump</strong>: a package of Resources and associated Metadata</li>
<li><strong>Pull based Incremental Synchronization</strong>: method of Incremental
Synchronization relying on polling by the Destination.</li>
<li><strong>Push based Incremental Synchronization</strong>: method of Incremental
Synchronization relying on changes being pushed from the Source to
the Destination, likely via an intermediary.</li>
</ul>
<!-- INVENTORY -->
<h2>
<a name="baseline_sync" id="baseline_sync"></a>2. Baseline Synchronization
</h2>
<p>To perform an initial synchronization with a Source, a Destination needs to retrieve the list of Resources available at the Source, its <strong>Inventory</strong>.</p>
<p>A Source generates an <a href="http://www.sitemaps.org/" title="sitemaps.org - Home">XML Sitemap</a> to describe its Inventory. All data values in a Sitemap must be <a href="http://www.sitemaps.org/protocol.html#escaping">entity-escaped</a>. The Sitemap serialization must be UTF-8 encoded.</p>
<p>
The Sitemap must:
</p>
<ul>
<li>Begin with an opening <code><<a href="#urlsetdef">urlset</a>></code> tag and end with a closing <code></urlset></code> tag.
</li>
<li>Specify the namespace (protocol standard) within the <code><urlset></code> tag.
</li>
<li>Include a <code><<a href="#urldef">url</a>></code> entry for each URL <span class="rs">(resource)</span>, as a parent XML tag. </li>
<li>Include a <code><<a href="#locdef">loc</a>></code> child entry for each <code><url></code> parent tag.</li>
<li><span class="rs">Include a <code><<a href="#lastmoddef">lastmod</a>></code> child entry for each <code><url></code> parent tag in order to enable baseline synchronization.</span> <span class="todo">Should this be mandatory for ResourceSync?</span></li>
</ul>
<p>
Elements introduced for ResourceSync are in the <code class="rs">http://resourcesync.org/ns/</code> namespace. The <code class="rs">rs</code> prefix is used in this document to indicate that XML tags are defined in that namespace.
</p>
<h3>
2.1 Sample XML Sitemap Inventory
</h3>
<p>
The following example shows a Sitemap inventory using just the minimum required tags for resource synchronization.
</p>
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<<a href="#urlsetdef">urlset</a> xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<<a href="#urldef">url</a>>
<<a href="#locdef">loc</a>>http://www.example.com/res1</loc>
<<a href="#lastmoddef">lastmod</a>>2005-01-01</lastmod>
</url>
</urlset>
</pre>
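<p>
As a rough illustration, a minimal inventory like the one above could be generated with Python's standard <code>xml.etree.ElementTree</code> module. The function name <code>build_inventory</code> and the sample URI with a query string are assumptions made for this sketch; they are not part of ResourceSync or the Sitemap protocol.
</p>

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_inventory(resources):
    """Serialize (uri, lastmod) pairs as a minimal Sitemap inventory (hypothetical helper)."""
    # Emit the Sitemap namespace as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for uri, lastmod in resources:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # ElementTree entity-escapes element text on output; the result is UTF-8 bytes.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# The "&" in this assumed URI is written to the file as "&amp;".
inventory = build_inventory([("http://www.example.com/res1?a=1&b=2", "2005-01-01")])
```

<p>
Leaving the escaping to the serializer, rather than concatenating strings, is one way to meet the entity-escaping and UTF-8 requirements stated above.
</p>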
<p>
The following example shows a Sitemap inventory that contains just one URL and uses all optional tags. The optional tags are in italics.
</p>
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<<a href="#urlsetdef">urlset</a> xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
<span class="rs">xmlns:rs="http://resourcesync.org/ns/"</span>>
<<a href="#urldef">url</a>>
<<a href="#locdef">loc</a>>http://www.example.com/res1</loc>
<<a href="#lastmoddef">lastmod</a>>2005-01-01</lastmod>
<em><<span class="rs"><a href="#sizedef">rs:size</a>>5746</rs:size></span>
<<span class="rs"><a href="#md5def">rs:md5</a>>4415d4a1df0e4bee731db465b04da138</rs:md5></span>
<<span class="rs"><a href="#tagdef">rs:tag</a>>frogs</rs:tag></span>
<<a href="#changefreqdef">changefreq</a>>monthly</changefreq>
<<a href="#prioritydef">priority</a>>0.8</priority></em>
</url>
</urlset>
</pre>
<h3>
<a name="xmlTagDefinitions" id="xmlTagDefinitions"></a>2.2 XML tag definitions
</h3>
<p>
The available XML tags are described below.
</p>
<table width="80%">
<tr>
<th>
Tag
</th>
<th>Requirement</th>
<th>
Description
</th>
</tr>
<tr>
<td>
<a name="urlsetdef" id="urlsetdef"></a><code><urlset></code>
</td>
<td>
required
</td>
<td>
<p>
Encapsulates the file and references the current XML Sitemap <span class="rs">protocol standard</span>.
</p>
</td>
</tr>
<tr class="alt">
<td>
<a name="urldef" id="urldef"></a><code><url></code>
</td>
<td>
required
</td>
<td>
<p>
Parent tag for each URL entry. The remaining tags are children of this tag.
</p>
</td>
</tr>
<tr>
<td>
<a name="locdef" id="locdef"></a><code><loc></code>
</td>
<td>
required
</td>
<td>
<p>
URL of the page. This URL must begin with the protocol (such as http) and end with a trailing slash, if your web server requires it. This value must be less than 2,048 characters.
</p>
</td>
</tr>
<tr class="alt">
<td>
<a name="lastmoddef" id="lastmoddef"></a><code><lastmod></code>
</td>
<td>
optional <span class="todo">required?</span>
</td>
<td>
<p>
The date of last modification of the file. This date should be in <a href="http://www.w3.org/TR/NOTE-datetime">W3C Datetime</a> format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD.
</p>
<p>
Note that this tag is separate from the If-Modified-Since (304) header the server can return, and search engines may use the information from both sources differently.
</p>
</td>
</tr>
<tr class="rs">
<td>
<a name="sizedef" id="sizedef"></a><code><rs:size></code>
</td>
<td>
optional
</td>
<td>
<p>
The size in bytes of the resource.
</p>
</td>
</tr>
<tr class="alt rs">
<td>
<a name="md5def" id="md5def"></a><code><rs:md5></code>
</td>
<td>
optional
</td>
<td>
<p>
The MD5 checksum for the resource expressed in hex, e.g. <code>4415d4a1df0e4bee731db465b04da138</code>.
</p>
</td>
</tr>
<tr class="rs">
<td>
<a name="tagdef" id="tagdef"></a><code><rs:tag></code>
</td>
<td>
optional, repeatable
</td>
<td>
<p>
A keyword or term assigned to a resource, which may originate from existing controlled vocabularies. This element may be repeated to indicate multiple tags. A destination can use this element for selective synchronization. <span class="todo">TODO: use dc:subject instead? Simeon - think no, because dc:subject has semantics that may not match the reason for giving tags. I like rs:tag.</span>
</p>
</td>
</tr>
<tr class="alt">
<td>
<a name="changefreqdef" id="changefreqdef"></a><code><changefreq></code>
</td>
<td>
optional
</td>
<td>
<p>
How frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Valid values are:
</p>
<ul>
<li>always
</li>
<li>hourly
</li>
<li>daily
</li>
<li>weekly
</li>
<li>monthly
</li>
<li>yearly
</li>
<li>never
</li>
</ul>
<p>
The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.
</p>
<p>
Please note that the value of this tag is considered a <i>hint</i> and not a command. Even though search engine crawlers may consider this information when making decisions, they may crawl pages marked "hourly" less frequently than that, and they may crawl pages marked "yearly" more frequently than that. Crawlers may periodically crawl pages marked "never" so that they can handle unexpected changes to those pages.
</p>
</td>
</tr>
<tr>
<td>
<a name="prioritydef" id="prioritydef"></a><code><priority></code>
</td>
<td>
optional
</td>
<td>
<p>
The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites—it only lets the search engines know which pages you deem most important for the crawlers.
</p>
<p>
The default priority of a page is 0.5.
</p>
<p>
Please note that the priority you assign to a page is not likely to influence the position of your URLs in a search engine's result pages. Search engines may use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your most important pages are present in a search index.
</p>
<p>
Also, please note that assigning a high priority to all of the URLs on your site is not likely to help you. Since the priority is relative, it is only used to select between URLs on your site.
</p>
</td>
</tr>
</table>
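<p>
The <code>rs:size</code> and <code>rs:md5</code> values could, for instance, let a Destination check a retrieved representation against the inventory. The sketch below assumes the resource content is already in memory as bytes; the helper name <code>matches_metadata</code> is hypothetical.
</p>

```python
import hashlib


def matches_metadata(content, size=None, md5_hex=None):
    """Return True if content (bytes) agrees with the given rs:size / rs:md5 values.

    Either value may be None, mirroring the fact that both tags are optional.
    """
    if size is not None and len(content) != int(size):
        return False
    if md5_hex is not None:
        # rs:md5 is expressed in hex; compare case-insensitively.
        return hashlib.md5(content).hexdigest() == md5_hex.strip().lower()
    return True


body = b"hello"
ok = matches_metadata(body, size="5", md5_hex="5d41402abc4b2a76b9719d911017c592")
```

<p>
Checking the size first is a cheap way to reject a mismatch before computing the checksum.
</p>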
<h3>
<a name="sitemapXMLExample" id="sitemapXMLExample"></a>2.3 Sample XML Sitemap
</h3>
<p>
The following example shows a Sitemap in XML format. The Sitemap in the example contains a small number of URLs.
</p>
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<<a href="#urlsetdef">urlset</a> xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
<span class="rs">xmlns:rs="http://resourcesync.org/ns/"</span>>
<<a href="#urldef">url</a>>
<<a href="#locdef">loc</a>>http://www.example.com/res1</loc>
<<a href="#lastmoddef">lastmod</a>>2005-01-01</lastmod>
<<span class="rs"><a href="#tagdef">rs:tag</a>>frogs</rs:tag></span>
<<span class="rs"><a href="#tagdef">rs:tag</a>>crocodiles</rs:tag></span>
</url>
<<a href="#urldef">url</a>>
<<a href="#locdef">loc</a>>http://www.example.com/res2</loc>
<<a href="#lastmoddef">lastmod</a>>2006-02-21T18:00:15+00:00</lastmod>
<<span class="rs"><a href="#tagdef">rs:tag</a>>fish</rs:tag></span>
</url>
<<a href="#urldef">url</a>>
<<a href="#locdef">loc</a>>http://www.example.com/res3</loc>
<<a href="#lastmoddef">lastmod</a>>2007-03-23T18:00:15+00:00</lastmod>
<<span class="rs"><a href="#tagdef">rs:tag</a>>humans</rs:tag></span>
</url>
</urlset>
</pre>
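<p>
A Destination interested only in certain tags might filter such an inventory as sketched below. This is one possible reading of "selective synchronization"; the function name <code>select_by_tag</code> and the any-tag-matches rule are assumptions of this example.
</p>

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://resourcesync.org/ns/",
}


def select_by_tag(sitemap_xml, wanted):
    """Return (loc, lastmod) for each <url> carrying at least one wanted rs:tag."""
    root = ET.fromstring(sitemap_xml)
    wanted = set(wanted)
    selected = []
    for url in root.findall("sm:url", NS):
        tags = {t.text.strip() for t in url.findall("rs:tag", NS) if t.text}
        if tags & wanted:
            selected.append((url.findtext("sm:loc", namespaces=NS),
                             url.findtext("sm:lastmod", namespaces=NS)))
    return selected


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://resourcesync.org/ns/">
  <url>
    <loc>http://www.example.com/res1</loc>
    <lastmod>2005-01-01</lastmod>
    <rs:tag>frogs</rs:tag>
    <rs:tag>crocodiles</rs:tag>
  </url>
  <url>
    <loc>http://www.example.com/res2</loc>
    <lastmod>2006-02-21T18:00:15+00:00</lastmod>
    <rs:tag>fish</rs:tag>
  </url>
</urlset>"""

fish_only = select_by_tag(SAMPLE, ["fish"])
```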
<h3>
<a name="sitemap_index_inventory" id="sitemap_index_inventory"></a>2.4 Sitemap Index Inventory (to group multiple Sitemap files)
</h3>
<p>
You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes). If you would like, you may compress your Sitemap files using gzip to reduce your bandwidth requirement; however the Sitemap file once uncompressed must be no larger than 10MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.
</p>
<p>
If you do provide multiple Sitemaps, you should then list each Sitemap file in a Sitemap index file. Sitemap index files may not list more than 50,000 Sitemaps and must be no larger than 10MB (10,485,760 bytes) and can be compressed. You can have more than one Sitemap index file. The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file.
</p>
<p class="todo">
The Sitemaps specification results in a limit of 2.5 billion (2.5x10<sup>9</sup>) resources if the Sitemap limits are followed (and up to 500GB for all the Sitemap files). If necessary, strategies for extension might include relaxation of the size/entry limits of individual Sitemap and Sitemap index files, or extension to three (or more) tiers where a Sitemap index may specify a set of Sitemap indexes (with three levels one could then support 125 trillion (1.25x10<sup>14</sup>) resources and the Sitemaps alone would be extremely large, up to 25PB).
</p>
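<p>
The per-file entry limit could be respected by splitting the inventory as sketched here. The helper name <code>plan_sitemaps</code> and the <code>sitemapN.xml</code> naming scheme are assumptions, and the sketch checks only the 50,000-entry limit, not the 10MB size limit.
</p>

```python
MAX_ENTRIES = 50_000  # per-file entry limit quoted above


def plan_sitemaps(uris, base="http://www.example.com/sitemap", limit=MAX_ENTRIES):
    """Split uris into chunks of at most `limit`; return [(sitemap_url, chunk), ...]."""
    uris = list(uris)
    plan = []
    for start in range(0, len(uris), limit):
        name = f"{base}{start // limit + 1}.xml"
        plan.append((name, uris[start:start + limit]))
    if len(plan) > limit:
        # A single Sitemap index may not list more than `limit` Sitemaps.
        raise ValueError("too many Sitemaps for one Sitemap index")
    return plan


plan = plan_sitemaps(f"http://www.example.com/res{i}" for i in range(120_001))
sizes = [len(chunk) for _, chunk in plan]
```

<p>
Each <code>sitemap_url</code> in the plan would then become one <code>loc</code> entry in the Sitemap index.
</p>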
<p>
The Sitemap index file must:
</p>
<ul>
<li>Begin with an opening <code><<a href="#sitemapIndex_sitemapindex">sitemapindex</a>></code> tag and end with a closing <code></sitemapindex></code> tag.
</li>
<li>Include a <code><<a href="#sitemapIndex_sitemap">sitemap</a>></code> entry for each Sitemap as a parent XML tag.
</li>
<li>Include a <code><<a href="#sitemapIndex_loc">loc</a>></code> child entry for each <code><sitemap></code> parent tag.
</li>
</ul>
<p>
The <code><<a href="#sitemapIndex_lastmod">lastmod</a>></code> and <span class="rs"><code><<a href="#sitemapIndex_tag">rs:tag</a>></code></span> elements are optional in Sitemap index files.
</p>
<p>
<strong>Note:</strong> A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com. As with Sitemaps, your Sitemap index file must be UTF-8 encoded.
</p>
<h3>
<a name="sitemapIndexXMLExample" id="sitemapIndexXMLExample"></a>2.5 Sample XML Sitemap Index
</h3>
<p>
The following example shows a Sitemap index that lists two Sitemaps:
</p>
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<<a href="#sitemapIndex_sitemapindex">sitemapindex</a> xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
<span class="rs">xmlns:rs="http://resourcesync.org/ns/"</span>>
<<a href="#sitemapIndex_sitemap">sitemap</a>>
<<a href="#sitemapIndex_loc">loc</a>>http://www.example.com/sitemap1.xml.gz</loc>
<<a href="#sitemapIndex_lastmod">lastmod</a>>2012-10-01T18:23:17+00:00</lastmod>
<<span class="rs"><a href="#tagdef">rs:tag</a>>frogs</rs:tag></span>
<<span class="rs"><a href="#tagdef">rs:tag</a>>crocodiles</rs:tag></span>
<<span class="rs"><a href="#tagdef">rs:tag</a>>fish</rs:tag></span>
</sitemap>
<<a href="#sitemapIndex_sitemap">sitemap</a>>
<<a href="#sitemapIndex_loc">loc</a>>http://www.example.com/sitemap2.xml.gz</loc>
<<a href="#sitemapIndex_lastmod">lastmod</a>>2012-01-01</lastmod>
<<span class="rs"><a href="#tagdef">rs:tag</a>>humans</rs:tag></span>
</sitemap>
</sitemapindex>
</pre>
<p>
<strong>Note:</strong> Sitemap URLs, like all values in your XML files, must be <a href="http://www.sitemaps.org/protocol.html#escaping">entity-escaped</a>.
</p>
<h3>
<a name="sitemapIndexTagDefinitions" id="sitemapIndexTagDefinitions"></a>2.6 Sitemap Index XML Tag Definitions
</h3>
<table width="80%">
<tr>
<th>
Tag
</th>
<th>Requirement</th>
<th>
Description
</th>
</tr>
<tr>
<td>
<a name="sitemapIndex_sitemapindex" id="sitemapIndex_sitemapindex"></a><code><sitemapindex></code>
</td>
<td>
required
</td>
<td>
Encapsulates information about all of the Sitemaps in the file.
</td>
</tr>
<tr class="alt">
<td>
<a name="sitemapIndex_sitemap" id="sitemapIndex_sitemap"></a><code><sitemap></code>
</td>
<td>
required
</td>
<td>
Encapsulates information about an individual Sitemap.
</td>
</tr>
<tr>
<td>
<a name="sitemapIndex_loc" id="sitemapIndex_loc"></a><code><loc></code>
</td>
<td>
required
</td>
<td>
<p>
Identifies the location of the Sitemap.
</p>
<p>
This location can be a Sitemap, an Atom file, RSS file or a simple text file.
</p>
</td>
</tr>
<tr class="alt">
<td>
<a name="sitemapIndex_lastmod" id="sitemapIndex_lastmod"></a><code><lastmod></code>
</td>
<td>
optional
</td>
<td>
<p>
Identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in <a href="http://www.w3.org/TR/NOTE-datetime">W3C Datetime</a> format.
</p>
<p>
By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index i.e. a crawler may only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.
</p>
</td>
</tr>
<tr class="rs">
<td>
<a name="sitemapIndex_tag" id="sitemapIndex_tag"></a><code><rs:tag></code>
</td>
<td>
optional
</td>
<td>
<p>
A keyword or term assigned to a Sitemap. These terms are aggregated from the tags of the resources listed in that Sitemap. This element may be repeated to indicate multiple tags. A destination can use this element for selective synchronization. TODO: use dc:subject instead?
</p>
<p>
By providing tags, a Source enables clients to retrieve only a subset of the Sitemaps listed in a Sitemap index file.
</p>
</td>
</tr>
</table>
<p class="backtotop">
<a href="#top">Back to top</a>
</p>
<!-- INVENTORY - COMPONENT DISCOVERY -->
<h2>
<a name="source_capability" id="source_capability"></a>3. Source Capabilities Description
</h2>
<p>ResourceSync proposes a <strong>modular</strong> synchronization framework, consisting of several components. The availability of these components depends on a Source's capabilities. A Source can describe its capabilities by providing links to available synchronization components. <span class="todo">TODO: more details on this!</span></p>
<h3>
<a name="dump_discovery" id="dump_discovery"></a>3.1 Dump Discovery
</h3>
<p>A Source may periodically produce Dumps, which are packages of Resources and associated Metadata. Dumps give a Destination the opportunity to obtain the resource content without having to pull each resource separately via HTTP GET requests. A Source can indicate the availability of Dumps by providing links to downloadable dump files in its Sitemap inventory.</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<span class="rs">
<!-- the source produces resource dumps -->
<atom:link href="http://example.com/dump/dump1.tar.gz" rel="rs:dump" rs:tags="crocodiles frogs"/>
<atom:link href="http://example.com/dump/dump2.tar.gz" rel="rs:dump" rs:tags="fish"/>
<atom:link href="http://example.com/dump/dump3.tar.gz" rel="rs:dump" rs:tags="humans"/>
</span>
<url>
<loc>http://www.example.com/res1</loc>
<lastmod>2005-01-01</lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
</url>
<url>
<loc>http://www.example.com/res2</loc>
<lastmod>2006-02-21T18:00:15+00:00</lastmod>
<rs:tag>fish</rs:tag>
</url>
<url>
<loc>http://www.example.com/res3</loc>
<lastmod>2007-03-23T18:00:15+00:00</lastmod>
<rs:tag>humans</rs:tag>
</url>
</urlset>
</pre>
<p class="todo">
Open Issue: Should we conflate tags into a single rs:tags element? Simeon - I think that separate elements is best (with attributes one has to do it though).
</p>
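<p>
A Destination could discover the advertised Dumps roughly as follows. The sketch assumes the draft conventions used in the example above (<code>rel="rs:dump"</code> and a space-separated <code>rs:tags</code> attribute), which are still open issues; the function name <code>find_dumps</code> is hypothetical.
</p>

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
RS = "http://resourcesync.org/ns/"


def find_dumps(sitemap_xml, wanted_tags=None):
    """Return [(href, tags)] for atom:link elements whose rel includes rs:dump.

    If wanted_tags is given, keep only dumps that share at least one tag with it.
    """
    root = ET.fromstring(sitemap_xml)
    dumps = []
    for link in root.findall(f"{{{ATOM}}}link"):
        if "rs:dump" not in link.get("rel", "").split():
            continue
        # A namespaced attribute is keyed as {namespace-uri}localname.
        tags = set(link.get(f"{{{RS}}}tags", "").split())
        if wanted_tags is None or tags & set(wanted_tags):
            dumps.append((link.get("href"), sorted(tags)))
    return dumps


SAMPLE = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://resourcesync.org/ns/"
        xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:link href="http://example.com/dump/dump1.tar.gz" rel="rs:dump" rs:tags="crocodiles frogs"/>
  <atom:link href="http://example.com/dump/dump2.tar.gz" rel="rs:dump" rs:tags="fish"/>
  <url><loc>http://www.example.com/res1</loc></url>
</urlset>"""

fish_dumps = find_dumps(SAMPLE, ["fish"])
```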
<h3>
<a name="change_comm_discovery_pull" id="change_comm_discovery_pull"></a>3.2 Pull-based Change Communication Discovery
</h3>
<p>Pull-based Incremental Synchronization relies on changes being polled by the Destination. Therefore, a Source may provide links to relevant ChangeSets in a Change Memory, which may be hosted by the Source itself or by an external Change Memory service. A ChangeSet is identified by a dereferenceable URI and contains an entry for each change event that occurred on a Source's resources <strong>after</strong> the creation of the inventory, i.e., the Sitemap serialization.</p>
<p>ChangeSets can be static dumps (e.g., files on a Web server)...</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<span class="rs"> <!-- Static changeset file -->
<atom:link href="http://example.com/changes/changeset.xml" rel="rs:changeset"/></span>
<url>...</url>
<url>...</url>
</urlset>
</pre>
<p>...or dynamically generated Web resources (e.g., DB-backed Change Memory).</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<span class="rs"> <!-- Dynamically generated changeset -->
<atom:link href="http://example.com/changes/20/diff" rel="rs:changeset"/></span>
<url>...</url>
<url>...</url>
</urlset>
</pre>
<h3>
<a name="change_comm_discovery_push" id="change_comm_discovery_push"></a>3.3 Push-based Change Communication Discovery
</h3>
<p>Push-based Incremental Synchronization relies on changes being pushed from the Source to the Destination, likely via an intermediary notification service. Therefore, a Source may provide links to one or more intermediary notification services to which a Destination can subscribe.</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<span class="rs"> <!-- The intermediary notification service -->
<atom:link href="xmpp:pubsub.example.org/" rel="rs:notification" rs:profile="http://xmpp.org/extensions/xep-0060.html" rs:pubsubnode="All"/></span>
<url>...</url>
<url>...</url>
</urlset>
</pre>
<p>Push- and Pull-based Incremental Synchronization can be combined. A Source can indicate the availability of an intermediary notification service and a Change Memory service. A Destination can subscribe to the notification service and eventually catch up with change events by polling the Change Memory.</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<span class="rs"> <!-- The used change notification transport protocol -->
<atom:link href="xmpp:pubsub.example.org/" rel="rs:notification" rs:profile="http://xmpp.org/extensions/xep-0060.html" rs:pubsubnode="All"/>
<!-- The change memory for polling changes -->
<atom:link href="http://example.com/changes/changeset.xml" rel="rs:changeset"/></span>
<url>...</url>
<url>...</url>
</urlset>
</pre>
<h3>
<a name="resource_memory_discovery_memento" id="resource_memory_discovery_memento"></a>3.4 Resource Memory Discovery (via Memento)
</h3>
<p>A Source can indicate the availability of archived or past resource states by providing links to Memento TimeGates <span class="todo">TODO: link!</span>.</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<url>
<loc>http://www.example.com/res1</loc>
<lastmod>2005-01-01</lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<span class="rs"> <atom:link href="http://example.com/timegate/http://www.example.com/res1" rel="timegate"/></span>
</url>
<url>
<loc>http://www.example.com/res2</loc>
<lastmod>2006-02-21T18:00:15+00:00</lastmod>
<rs:tag>fish</rs:tag>
<span class="rs"> <atom:link href="http://example.com/timegate/http://www.example.com/res2" rel="timegate"/></span>
</url>
<url>
<loc>http://www.example.com/res3</loc>
<lastmod>2007-03-23T18:00:15+00:00</lastmod>
<rs:tag>humans</rs:tag>
<span class="rs"> <atom:link href="http://example.com/timegate/http://www.example.com/res3" rel="timegate"/></span>
</url>
</urlset>
</pre>
<h3>
<a name="resource_memory_discovery_versioning" id="resource_memory_discovery_versioning"></a>3.5 Resource Memory Discovery (via Resource version URIs)
</h3>
<p>A Source can indicate the availability of archived or past resource states by providing Resource version URIs (e.g., as in Wikipedia).</p>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<url>
<loc>http://www.example.com/res1</loc>
<lastmod>2005-01-01</lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<span class="rs"> <atom:link href="http://example.com/5/res1" rel="self"/></span>
</url>
<url>
<loc>http://www.example.com/res2</loc>
<lastmod>2006-02-21T18:00:15+00:00</lastmod>
<rs:tag>fish</rs:tag>
<span class="rs"> <atom:link href="http://example.com/10/res2" rel="self"/></span>
</url>
<url>
<loc>http://www.example.com/res3</loc>
<lastmod>2007-03-23T18:00:15+00:00</lastmod>
<rs:tag>humans</rs:tag>
<span class="rs"> <atom:link href="http://example.com/3/res3" rel="self"/></span>
</url>
</urlset>
</pre>
<p class="backtotop">
<a href="#top">Back to top</a>
</p>
<!-- CHANGE COMMUNICATION -->
<h2>
<a name="change_communication" id="change_communication"></a>4. Change Communication
</h2>
<p>Change Communication allows a Destination to understand that a Resource has changed at the Source, and what the nature of that change event is. Change events can either be polled from a Change Memory by a Destination or pushed from a Source to a Destination via an intermediary notification mechanism.</p>
<h3>4.1 Pull-based Change Communication</h3>
<p>A Change Memory records change events and allows Destinations to poll for them. In response, a Destination receives a ChangeSet, which contains a (possibly empty) set of change events. Each event has an event identifier, which allows a Destination to recognize change events it already knows.</p>
<p>A <strong>Dynamic Change Memory</strong> can be implemented as a RESTful Web application (probably backed by an RDBMS) that constructs and serves ChangeSets dynamically. From a Source's inventory (the Sitemap), a Destination retrieves a link to a ChangeSet (e.g., http://example.com/changes/20/diff), which contains all change events that have occurred since the Destination retrieved the inventory. In the example below, two events have occurred since the Destination retrieved the Sitemap. The Destination also obtains a link to the "next" ChangeSet, which can contain later change events or be empty if no changes have occurred in the meantime. This mechanism requires the Source to assign sequential identifiers to change events (rs:eventid) that reflect their natural temporal order. If a Destination remembers the "next" ChangeSet URI (e.g., http://example.com/changes/22/diff) that it receives with each ChangeSet response, it can efficiently poll for the changes it has not yet seen and incrementally synchronize Resources. Event identifiers can be dereferenceable Web resources or sequential numeric values.</p>
<pre>
<changeset xmlns:rs="http://resourcesync.org/ns/"
xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/terms/">
<span class="rs"> <atom:link href="http://example.com/changes/20/diff" rel="self rs:changeset"/>
<atom:link href="http://example.com/changes/22/diff" rel="next rs:changeset"/></span>
<sm:url>
<span class="rs"> <rs:eventid>http://example.com/changes/21</rs:eventid></span>
<sm:loc>http://www.example.com/res1</sm:loc>
<sm:lastmod>2012-05-30</sm:lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<span class="rs"> <rs:eventtype>updated</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional --></span>
</sm:url>
<sm:url>
<span class="rs"> <rs:eventid>http://example.com/changes/22</rs:eventid></span>
<sm:loc>http://www.example.com/res4</sm:loc>
<sm:lastmod>2012-05-31</sm:lastmod>
<rs:tag>elephants</rs:tag>
<span class="rs"> <rs:eventtype>created</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional --></span>
</sm:url>
</changeset>
</pre>
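<p>The polling step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the specification: the function names <code>parse_changeset</code> and <code>poll</code>, the <code>fetch</code> callable (anything that returns the ChangeSet XML for a URI) and the <code>seen</code> set of known event identifiers are all hypothetical.</p>

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used in the ChangeSet example above.
NS = {
    "rs": "http://resourcesync.org/ns/",
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "atom": "http://www.w3.org/2005/Atom",
}


def parse_changeset(xml_text):
    """Return (events, next_uri) for one ChangeSet document."""
    root = ET.fromstring(xml_text)
    events = []
    for url in root.findall("sm:url", NS):
        events.append({
            "eventid": url.findtext("rs:eventid", namespaces=NS),
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "eventtype": url.findtext("rs:eventtype", namespaces=NS),
        })
    next_uri = None
    for link in root.findall("atom:link", NS):
        # rel holds space-separated values, e.g. "next rs:changeset".
        if "next" in link.get("rel", "").split():
            next_uri = link.get("href")
    return events, next_uri


def poll(fetch, changeset_uri, seen):
    """Fetch one ChangeSet; return unseen events and the URI to poll next.

    `fetch` is a hypothetical callable mapping a URI to XML text;
    `seen` is a set of event identifiers the Destination already knows.
    """
    events, next_uri = parse_changeset(fetch(changeset_uri))
    fresh = [e for e in events if e["eventid"] not in seen]
    seen.update(e["eventid"] for e in fresh)
    # Without a "next" link, keep polling the same URI.
    return fresh, next_uri or changeset_uri
```

<p>A Destination would store the returned URI and pass it to the next <code>poll</code> call, so that each request only returns events it has not processed yet.</p>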
<p>A <strong>Static Change Memory</strong> can serialize ChangeSets into static files and place them into some Web-accessible directory. The Destination can use the link in the Sitemap to retrieve the relevant ChangeSet (e.g., changeset2.xml). If ChangeSet files are linked with each other, the Destination can walk through these files entry by entry. A Destination can also retrieve the most recent changes ("most_recent.xml") and walk the list of change events backwards (across files) until it reaches an event it already knows, and then apply the changes in the order in which they occurred in the Source.
</p>
<pre>
<changeset xmlns="http://resourcesync.org/ns/"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:dc="http://purl.org/dc/terms/">
<span class="rs"> <atom:link href="http://example.com/changes/changeset2.xml" rel="self rs:changeset"/>
<atom:link href="http://example.com/changes/changeset1.xml" rel="prev rs:changeset"/>
<atom:link href="http://example.com/changes/most_recent.xml" rel="last rs:changeset"/></span>
<sm:url>
<span class="rs"> <rs:eventid>4353dfsgesn431</rs:eventid></span>
<sm:loc>http://www.example.com/res1</sm:loc>
<sm:lastmod>2012-05-30</sm:lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<span class="rs"> <rs:eventtype>updated</rs:eventtype></span>
<span class="rs"> <dc:publisher>http://example.com</dc:publisher> <!-- optional --></span>
</sm:url>
<sm:url>
<span class="rs"> <rs:eventid>sadfn234sfn3f</rs:eventid></span>
<sm:loc>http://www.example.com/res4</sm:loc>
<sm:lastmod>2012-05-31</sm:lastmod>
<rs:tag>elephants</rs:tag>
<span class="rs"> <rs:eventtype>created</rs:eventtype></span>
<span class="rs"> <dc:publisher>http://example.com</dc:publisher> <!-- optional --></span>
</sm:url>
</changeset>
</pre>
<p>In both cases, the Destination must follow an initial Change Memory discovery link from the Inventory and then paginate through ChangeSets by following links to previous or next ChangeSets.</p>
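<p>Walking backwards through linked ChangeSet files can be sketched as follows. The sketch assumes that events inside each file are listed oldest first and that files are chained with <code>prev</code> links as in the static example; the names <code>collect_unseen</code>, <code>link_with_rel</code> and the <code>fetch</code> callable are hypothetical.</p>

```python
import xml.etree.ElementTree as ET

NS = {
    "rs": "http://resourcesync.org/ns/",
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "atom": "http://www.w3.org/2005/Atom",
}


def link_with_rel(root, rel):
    """Return the href of the first atom:link whose rel list contains `rel`."""
    for link in root.findall("atom:link", NS):
        if rel in link.get("rel", "").split():
            return link.get("href")
    return None


def collect_unseen(fetch, start_uri, last_seen_id=None):
    """Walk 'prev' links from start_uri and return unseen event ids, oldest first.

    Assumes each file lists its events oldest first. `fetch` is a
    hypothetical callable mapping a URI to the XML text of that file.
    """
    pending = []      # collected newest first while walking backwards
    visited = set()   # guards against accidental link cycles
    uri = start_uri
    while uri and uri not in visited:
        visited.add(uri)
        root = ET.fromstring(fetch(uri))
        ids = [u.findtext("rs:eventid", namespaces=NS)
               for u in root.findall("sm:url", NS)]
        for event_id in reversed(ids):
            if event_id == last_seen_id:
                return list(reversed(pending))
            pending.append(event_id)
        uri = link_with_rel(root, "prev")
    return list(reversed(pending))
```

<p>Starting from the most recent file, the function stops at the last event the Destination has already applied and returns the remaining events in the order in which they occurred.</p>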
<p><span class="todo">Should ChangeSets allow component discovery or should only the Inventory contain discovery links?</span></p>
<h3>4.2 Push-based Change Communication (via XMPP)</h3>
<p>The Source pushes change events to the Destination via some intermediary mechanism (e.g., XMPP). The intermediary usually packages change events into some kind of message "envelope".</p>
<pre>
<message xmlns:rs="http://resourcesync.org/ns/"
xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:dc="http://purl.org/dc/terms/">
<event>
<items>
<item id="12234">
<stanza>
<!-- payload -->
<sm:url>
<rs:eventid>4353dfsgesn431</rs:eventid>
<sm:loc>http://www.example.com/res1</sm:loc>
<sm:lastmod>2012-05-30</sm:lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<rs:eventtype>updated</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional -->
<!-- publisher info establishes a level of trust: the Destination can retrieve the publisher's Sitemap and check whether the notification link is listed there -->
</sm:url>
</stanza>
</item>
<item id="34456">
<stanza>
<!-- payload -->
<sm:url>
<rs:eventid>sadfn234sfn3f</rs:eventid>
<sm:loc>http://www.example.com/res4</sm:loc>
<sm:lastmod>2012-05-31</sm:lastmod>
<rs:tag>elephants</rs:tag>
<rs:eventtype>created</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional -->
</sm:url>
</stanza>
</item>
</items>
</event>
</message>
</pre>
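<p>Because the envelope format depends on the intermediary, a Destination may prefer to look for the <code>sm:url</code> payloads at any depth instead of relying on the wrapper elements. A minimal sketch, assuming the payload uses the same elements as a ChangeSet entry; <code>unwrap_events</code> is a hypothetical helper name.</p>

```python
import xml.etree.ElementTree as ET

# Namespace URIs in ElementTree's "{uri}localname" notation.
RS = "{http://resourcesync.org/ns/}"
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def unwrap_events(message_xml):
    """Return one dict per sm:url payload found anywhere in the envelope."""
    root = ET.fromstring(message_xml)
    events = []
    # iter() searches the whole tree, so the wrapper structure does not matter.
    for url in root.iter(SM + "url"):
        events.append({
            "eventid": url.findtext(RS + "eventid"),
            "loc": url.findtext(SM + "loc"),
            "eventtype": url.findtext(RS + "eventtype"),
        })
    return events
```

<p>The same function works for a ChangeSet delivered without an envelope, since it only depends on the payload elements.</p>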
<h3>4.3 Push-based Change Communication (via simple HTTP Callback)</h3>
<p>A Source (or its ChangeMemory) can provide a subscription interface, which allows Destinations to register their HTTP Callback URIs. The Source can push ChangeSets to the Destination on a per-event basis or at predefined intervals. This functionality can also be outsourced to some intermediary ChangeMemory, in which case the Source pings the ChangeMemory whenever change events occur.</p>
<pre>
>> Subscription Request <<
POST /subscribe HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded

callbackURI=http%3A%2F%2Faggregator.org%2Fcallback

>> Change Notification <<
POST /callback HTTP/1.1
Host: aggregator.org
Content-Type: application/xml

<rs:changeset xmlns:rs="http://resourcesync.org/ns/"
xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:dc="http://purl.org/dc/terms/">
<!-- don't put any atom links here -->
<sm:url>
<rs:eventid>4353dfsgesn431</rs:eventid>
<sm:loc>http://www.example.com/res1</sm:loc>
<sm:lastmod>2012-05-30</sm:lastmod>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<rs:eventtype>updated</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional -->
</sm:url>
<sm:url>
<rs:eventid>sadfn234sfn3f</rs:eventid>
<sm:loc>http://www.example.com/res4</sm:loc>
<sm:lastmod>2012-05-31</sm:lastmod>
<rs:tag>elephants</rs:tag>
<rs:eventtype>created</rs:eventtype>
<dc:publisher>http://example.com</dc:publisher> <!-- optional -->
</sm:url>
</rs:changeset>
</pre>
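<p>On the Destination side, the subscription request could be built as in the sketch below. The <code>/subscribe</code> path and the <code>callbackURI</code> parameter are taken from the example above; the function name and the form-encoded body are assumptions, since the subscription interface is not defined further here.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_subscription_request(subscribe_url, callback_uri):
    """Build (but do not send) a POST request that registers a callback URI."""
    # Form-encode the single parameter used in the example request.
    body = urlencode({"callbackURI": callback_uri}).encode("ascii")
    return Request(
        subscribe_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_subscription_request(
    "http://example.com/subscribe", "http://aggregator.org/callback"
)
```

<p>Passing the resulting object to <code>urllib.request.urlopen</code> would send the request; the response format is left open here.</p>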
<p class="backtotop">
<a href="#top">Back to top</a>
</p>
<!-- DUMP PACKAGING -->
<h2>
<a name="dump_packaging" id="dump_packaging"></a>5. Dump Packaging
</h2>
<p>A Dump packages Resources and associated Metadata of a Source and provides a Sitemap as the Dump index file (Manifest). The Manifest (manifest.xml) follows the Inventory's structure but provides an additional pointer (rs:key) to the relative location of the serialized resource within a Dump package (archive).</p>
<span class="todo">Q - Simeon - Would remove ./ prefix on rs:key values, I don't think it is of any benefit. Also I wonder whether <rs:path> or even <rs:dumploc> might be a better (more explicit) name.</span>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/">
<url>
<loc>http://www.example.com/res1</loc>
<lastmod>2005-01-01</lastmod>
<rs:size>5746</rs:size>
<rs:md5>asdfhkk23blasdfb3223</rs:md5>
<rs:tag>frogs</rs:tag>
<rs:tag>crocodiles</rs:tag>
<span class="rs"> <rs:key>./resources/res1.txt</rs:key></span>
</url>
<url>
<loc>http://www.example.com/res2</loc>
<lastmod>2006-02-21T18:00:15+00:00</lastmod>
<rs:size>1123</rs:size>
<rs:md5>asdvb3223fvsn1234l34</rs:md5>
<rs:tag>fish</rs:tag>
<span class="rs"> <rs:key>./resources/res2.txt</rs:key></span>
</url>
<url>
<loc>http://www.example.com/res3</loc>
<lastmod>2007-03-23T18:00:15+00:00</lastmod>
<rs:size>769234</rs:size>
<rs:md5>bralerbaagbearlerab2e32</rs:md5>
<rs:tag>humans</rs:tag>
<span class="rs"> <rs:key>./resources/res3.txt</rs:key></span>
</url>
</urlset>
</pre>
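<p>After unpacking a Dump, a Destination might check each packaged file against the Manifest. The sketch below assumes that <code>rs:size</code> is the byte length and that <code>rs:md5</code> is a hexadecimal MD5 digest of the file content (the values in the example above are placeholders); <code>verify_dump</code> and <code>dump_root</code> are hypothetical names.</p>

```python
import hashlib
import os
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://resourcesync.org/ns/",
}


def verify_dump(manifest_xml, dump_root):
    """Return the rs:key values whose file is missing or does not match."""
    failures = []
    for url in ET.fromstring(manifest_xml).findall("sm:url", NS):
        key = url.findtext("rs:key", namespaces=NS)
        if key is None:
            continue  # entry is not packaged in this Dump
        # rs:key is relative to the root of the unpacked Dump.
        path = os.path.normpath(os.path.join(dump_root, key))
        try:
            with open(path, "rb") as handle:
                data = handle.read()
        except OSError:
            failures.append(key)
            continue
        size_ok = str(len(data)) == url.findtext("rs:size", namespaces=NS)
        # Assumption: rs:md5 carries a hex digest.
        md5_ok = hashlib.md5(data).hexdigest() == url.findtext("rs:md5", namespaces=NS)
        if not (size_ok and md5_ok):
            failures.append(key)
    return failures
```

<p>An empty result means that every file referenced by an <code>rs:key</code> is present with the stated size and checksum.</p>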
<p class="backtotop">
<a href="#top">Back to top</a>
</p>
<!-- Resource Transfer -->
<h2>
<a name="resource_transfer_override" id="resource_transfer_override"></a>6. Resource Transfer Override
</h2>
<p>HTTP GET is the default Resource Transfer mechanism used by the Destination to update its holdings to reflect a change in a Resource at the Source. This behavior can be overridden by specifying alternate access URIs or alternate protocols and endpoints.</p>
<h3>6.1 Resource Transfer Override in Sitemap Inventory</h3>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<url>
<loc>http://dbpedia.org/resource/Paris</loc>
<lastmod>2012-05-08T19:59:57Z</lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="alternate">http://dbpedia.org/data/Paris.rdf</atom:link>
</rs:access></span>
</url>
</urlset>
</pre>
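<p>A Destination could select the URIs to fetch for an entry as sketched below: use the alternate access URIs when an <code>rs:access</code> block provides them, and fall back to an HTTP GET on <code>sm:loc</code> otherwise. The function name <code>transfer_uris</code> is hypothetical, and the sketch follows the example above in reading the URI from the text content of <code>atom:link</code>.</p>

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://resourcesync.org/ns/",
    "atom": "http://www.w3.org/2005/Atom",
}


def transfer_uris(url_element):
    """Return the alternate URIs of an entry, or [sm:loc] if there are none."""
    alternates = [
        link.text.strip()
        for link in url_element.findall("rs:access/atom:link", NS)
        if link.get("rel") == "alternate" and link.text
    ]
    # Default Resource Transfer: HTTP GET on the Resource URI itself.
    return alternates or [url_element.findtext("sm:loc", namespaces=NS)]
```

<p>Other link relations, such as a protocol endpoint, would need their own handling and are ignored by this sketch.</p>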
<h3>6.2 Resource Transfer Override in Dump</h3>
<pre>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:atom="http://www.w3.org/2005/Atom">
<url>
<loc>http://dbpedia.org/resource/Paris</loc>
<lastmod>2012-05-08T19:59:57Z</lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="alternate">http://dbpedia.org/data/Paris.rdf</atom:link>
</rs:access>
<rs:key>./data/Paris.rdf</rs:key></span>
</url>
</urlset>
</pre>
<h3>6.3 Resource Transfer Override in Changeset</h3>
<span class="todo"><br />Q - Simeon - The OAI example seems a bit fuzzy in that the endpoint and protocol elements are just next to each other and hence rather loosely tied. Would it be better to simply define an oai-endpoint and omit the rel=protocol part?</span>
<span class="todo"><br />Q - Simeon - Should the example with alternate types use the type="mime/type" attribute of atom:link?</span>
<pre>
<changeset xmlns="http://resourcesync.org/ns/"
xmlns:rs="http://resourcesync.org/ns/"
xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:atom="http://www.w3.org/2005/Atom">
<sm:url>
<sm:loc>http://my.oairepo.org/oai</sm:loc>
<sm:lastmod>2012-05-08T19:59:57Z</sm:lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="endpoint">http://my.oairepo.org/oai</atom:link>
<atom:link rel="protocol">http://www.openarchives.org/OAI/openarchivesprotocol.html</atom:link>
</rs:access></span>
</sm:url>
<sm:url>
<sm:loc>http://dbpedia.org/resource/Paris</sm:loc>
<sm:lastmod>2010-05-08T19:59:57Z</sm:lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="alternate">http://dbpedia.org/data/Paris.rdf</atom:link>
<atom:link rel="alternate">http://dbpedia.org/data/Paris.nt</atom:link>
</rs:access></span>
</sm:url>
<sm:url>
<sm:loc>http://flickr.org/photos/large.jp</sm:loc>
<sm:lastmod>2009-05-08T19:59:57Z</sm:lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="alternate">http://farm1.flickr.com/user1/photos/asdfaslj.jp</atom:link>
</rs:access></span>
</sm:url>
<sm:url>
<sm:loc>http://images.org/photos/image.jp</sm:loc>
<sm:lastmod>2008-05-08T19:59:57Z</sm:lastmod>
<rs:eventtype>updated</rs:eventtype>
<span class="rs"> <rs:access>
<atom:link rel="alternate">http://user1@http://images.org/photos/image.jp</atom:link>