#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
-------------------------------------------------------------------------------
[Version Info]
Version: v18.3
Author: crifan
Contact: http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/
[Details]
BlogsToWordPress: migrate blogs from (new) Baidu Space, Netease 163, Sina, QQ Space (Qzone), Renren, CSDN, Sohu, Blogbus, Tianya, Diandian light blog, etc. to WordPress
http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/
[Usage]
Must-read before using BlogsToWordPress:
http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/before_use/
Usage examples for BlogsToWordPress:
http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/usage_example/
How to extend BlogsToWordPress to support more blog types:
http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/extend_blog_type/
BlogsToWordPress (WordPress blog migration tool) discussion forum:
http://www.crifan.com/bbs/categories/blogstowordpress
[TODO]
1. Add support for friend-only posts.
2. Support exporting specific post types as public or private.
3. Support setting the post order when exporting WXR: ascending or descending.
[Version History]
[v18.3]
[BlogSina.py]
1.fixbug -> support sub comments for some post:
http://blog.sina.com.cn/s/blog_89445d4f0101jgen.html
[v18.2]
[BlogSina.py]
1.fixbug -> support blog author reply comments
[v18.1]
[BlogDiandian.py]
1. fix post content and next perma link for http://remixmusic.diandian.com
2. fix title for post title:
BlogsToWordpress.py -f http://remixmusic.diandian.com/?p=669 -l 1
BlogsToWordpress.py -f http://remixmusic.diandian.com/?p=316 -l 1
BlogsToWordpress.py -f http://remixmusic.diandian.com/?p=18117 -l 1
BlogsToWordpress.py -f http://remixmusic.diandian.com/post/2013-05-13/40051897352 -l 1
3. fix post content for:
BlogsToWordpress.py -f http://remixmusic.diandian.com/post/2013-05-13/40051897352 -l 1
[v17.7]
1. add a note when neither -s nor -f is designated
[BlogNetease.py]
2. add emotion into post
eg:
http://blog.163.com/ni_chen/blog/#m=1
-> 心情随笔 (mood notes)
3. support direct input feeling card url:
BlogsToWordpress.py -f http://green-waste.blog.163.com/blog/#m=1
BlogsToWordpress.py -f http://blog.163.com/ni_chen/blog/#m=1
[BlogSina.py]
4. fix parse sina post comment response json string
http://blog.sina.com.cn/s/blog_4701280b0101854o.html
comment url:
http://blog.sina.com.cn/s/comment_4701280b0101854o_1.html
[BlogDiandian.py]
5. fix bug: now supports http://googleyixia.com/ for finding first perma link, next perma link, and extracting title, tags
[v17.2]
1. [BlogNetease] update to fix bug: can not find first permanent link
[v17.1]
1. fix error when extracting post title and next link for:
http://78391997.qzone.qq.com/
[v17.0]
1.fix csdn pic download
[v16.9]
1.update for only support baidu new space
[v16.8]
1. [BlogBaidu] fix bug for catetory extract, provided by Zhenyu Jiang
2. add template BlogXXX.py for add support for more new blog type
[v16.6]
1. [BlogBlogbus] fix bugs for extract title and date time string
2. [BlogQQ] add support for http://84896189.qzone.qq.com, which contain special content & comments & subComments
[v16.2]
1. csdn: Can not find the first link for http://blog.csdn.net/v_JULY_v, error=Unknown error!
2. fix bug: on ubuntu, AttributeError: 'module' object has no attribute 'getwindowsversion'
[v16.0]
1. add BlogTianya support
2. add BlogDiandian support
3. fix path combine bug on mac, add logRuntimeInfo
[v13.9]
1. BlogRenren add captcha for login
[v13.8]
1. do release include chardet 1.0.1
[v12.8]
1. BlogBaidu update for support new space
[v11.7]
1.move blog modules into sub dir
2.change pic search pattern to support non-capture match
[v11.5]
1. support Blogbus
2. add unified downloadFile and isFileValid during process pic
3. fix pic filter regular pattern to support more type picture, include 3 fields, https, upper suffix
4. support use default pic setting
5. support new baidu space
6. support many template for new baidu space, include:
时间旅程,平行线,边走边看,窗外风景,雕刻时光,粉色佳人,理性格调,清心雅筑,低调优雅,蜕变新生,质感酷黑,经典简洁
7. support non-title post for new baidu space
[v9.2]
1. support modify 163 post via manually input verify code.
[v9.1]
1. export WXR during processing => whole process speed become a little bit faster !
2. change default pic prefix path to http://localhost/wp-content/uploads/pic
[v8.7]
1. support all type of other site pic for BlogSina
[v8.6]
1. support other site pic for BlogSina
2. support quoted filename check for crifanLib
[v8.4]
1. support more type pic for BlogQQ
[v8.3]
1. add Sohu blog support.
2. add auto omit invalid/hidden post which returned by extractTitle.
3. add remove control char for comment author and content
[v7.0]
1. add CSDN blog support.
[v6.2]
1. add RenRen Blog support.
2. For title and category, move repUniNumEntToChar and saxutils.escape from different blog providers into main function
[v5.6]
1. Add log info showing how many comments have been processed so far (useful when comment counts are huge, e.g. posts on Han Han's Sina blog often have 20,000-30,000 comments).
-------------------------------------------------------------------------------
"""
#---------------------------------import---------------------------------------
import os;
import platform;
import re;
import sys;
sys.path.append("libs/crifan");
sys.path.append("libs/crifan/blogModules");
sys.path.append("libs/thirdparty");
import math;
import time;
import codecs;
import logging;
import urllib;
from datetime import datetime,timedelta;
from optparse import OptionParser;
from string import Template,replace;
import xml;
from xml.sax import saxutils;
import crifanLib;
import BlogNetease;
import BlogBaidu;
import BlogSina;
import BlogQQ;
import BlogRenren;
import BlogCsdn;
import BlogSohu;
import BlogBlogbus;
import BlogTianya;
import BlogDiandian;
#Change Here If Add New Blog Provider Support
#--------------------------------const values-----------------------------------
__VERSION__ = "v18.3";
gConst = {
'generator' : "http://www.crifan.com/crifan_released_all/website/python/blogstowordpress/",
'tailUni' : u"""
</channel>
</rss>""",
'picRootPathInWP' : "http://localhost/wp-content/uploads/pic",
'othersiteDirName' : 'other_site',
#Change Here If Add New Blog Provider Support
# for different blog provider
'blogs' : {
'Baidu' : {
'blogModule' : BlogBaidu, # module name, should be the same as the BlogXXX imported above
'mandatoryIncStr' : "hi.baidu.com", # url must contain this
'descStr' : "Baidu Space", # Blog description string
},
'Netease' : {
'blogModule' : BlogNetease,
'mandatoryIncStr' : "blog.163.com",
'descStr' : "Netease 163 Blog",
},
'Sina' : {
'blogModule' : BlogSina,
'mandatoryIncStr' : "blog.sina.com.cn",
'descStr' : "Sina Blog",
},
'QQ' : {
'blogModule' : BlogQQ,
#'mandatoryIncStr' : "qzone.qq.com",
# special one http://blog.qq.com/qzone/622007179/1333268691.htm
'mandatoryIncStr' : ".qq.com",
'descStr' : "QQ Space",
},
'Renren' : {
'blogModule' : BlogRenren,
'mandatoryIncStr' : ".renren.com",
'descStr' : "Renren Blog",
},
'Csdn' : {
'blogModule' : BlogCsdn,
'mandatoryIncStr' : "blog.csdn.net",
'descStr' : "CSDN Blog",
},
'Sohu' : {
'blogModule' : BlogSohu,
'mandatoryIncStr' : "blog.sohu.com",
'descStr' : "Sohu Blog",
},
'Blogbus' : {
'blogModule' : BlogBlogbus,
'mandatoryIncStr' : ".blogbus.com",
'descStr' : "Blogbus Blog",
},
'BlogTianya' : {
'blogModule' : BlogTianya,
'mandatoryIncStr' : "blog.tianya.cn",
'descStr' : "Tianya Blog",
},
'BlogDiandian' : {
'blogModule' : BlogDiandian,
'mandatoryIncStr' : ".diandian.com",
'descStr' : "Diandian Qing Blog",
},
} ,
};
#----------------------------------global values--------------------------------
gVal = {
'blogProvider' : None,
'postList' : [],
'catNiceDict' : {}, # store { catName: catNiceName}
'tagSlugDict' : {}, # store { tagName: tagSlug}
'curItem' : { 'catNiceDict':{},
'tagSlugDict':{},
},
'postID' : 0,
'curPostUrl' : "",
'blogUser' : '',
'blogEntryUrl' : '',
'processedUrlList' : [],
'processedStUrlList' : [],
'replacedUrlDict' : {},
'outputFileName' : '',
'fullHeadInfo' : '', # include : header + category + generator
'statInfoDict' : {}, # store statistic info
'errorUrlList' : [], # store the (pic) url, which error while open
'postModifyPattern' : '', # the string used for each post after replacing the pattern
#----------------------------------
# used to output xml during processing
'wxrHeaderUni' : '',
'wxrHeaderSize' : 0,
'generatorUni' : '',
'generatorSize' : 0,
'tailUni' : '',
'tailSize' : 0,
'categoriesUni' : '',
'categoriesSize' : 0,
'tagsUni' : '',
'tagsSize' : 0,
'itemsUni' : '',
'itemsSize' : 0,
'curGeneratedUni' : '',
'curGeneratedSize' : 0,
'wxrValidUsername' : '',
'curOutputFileIdx' : 0,
'outputFileCreateTime' : '',
'nextCatId' : 1,
'nextTagId' : 1,
#----------------------------------
'curPicCfgDict' : {}, # store current/active/real picture config dict
};
#--------------------------configurable values---------------------------------
gCfg ={
# For the default settings of the following config values, please refer to the parameters.
# where to save the downloaded pictures
# Default (in code) set to: gConst['picRootPathInWP']
'picPathInWP' : '',
# Default (in code) set to: gCfg['picPathInWP'] + '/' + gConst['othersiteDirName']
'otherPicPathInWP' : '',
# process pictures or not
'processPic' : '',
# process other site pic or not
'processOtherPic' : '',
# omit process pic, which is similar before errored one
'omitSimErrUrl' : '',
# do translate or not
'googleTrans' : '',
# process comments or not
'processCmt' : '',
# post ID prefix address
'postPrefAddr' : '',
# max/limit size for output XML file
'maxXmlSize' : 0,
# function execute times == max retry number + 1
# when fail to do something: fetch page/get comment/....)
'funcTotalExecNum' : 1,
'username' : '',
'password' : '',
'postTypeToProcess' : '',
'processType' : '',
#Change Here If Add New Blog Provider Support
# for modify post, auto skip posts rejected as containing sensitive content
# baidu: "文章内容包含不合适内容,请检查", "文章标题包含不合适内容,请检查"
# (i.e. post content/title contains inappropriate content, please check)
# other blogs: TODO
'autoJumpSensitivePost' : '',
};
#--------------------------functions--------------------------------------------
#------------------------------------------------------------------------------
# just print whole line
def printDelimiterLine() :
logging.info("%s", '-'*80);
return ;
#------------------------------------------------------------------------------
# open output file name in rw mode, return file handler
def openOutputFile():
global gVal;
# 'a+': read,write,append
# 'w' : clear before, then write
return codecs.open(gVal['outputFileName'], 'a+', 'utf-8');
#------------------------------------------------------------------------------
# init for output file
def initForOutputFile():
global gVal;
gVal['curOutputFileIdx'] = 0;
gVal['outputFileCreateTime'] = datetime.now().strftime('%Y%m%d_%H%M');
return;
#------------------------------------------------------------------------------
# just create new output file
def createNewOutputFile():
global gVal;
gVal['outputFileName'] = "WXR_" + gVal['blogProvider'] + '_[' + gVal['blogUser'] + "]_" + gVal['outputFileCreateTime'] + '-' + str(gVal['curOutputFileIdx']) + '.xml';
expFile = codecs.open(gVal['outputFileName'], 'w', 'utf-8');
if expFile:
logging.info('Created export WXR file: %s', gVal['outputFileName']);
expFile.close();
# update
gVal['curOutputFileIdx'] += 1;
logging.debug("gVal['curOutputFileIdx']=%d", gVal['curOutputFileIdx']);
else:
logging.error("Can not open writable exported WXR file: %s", gVal['outputFileName']);
sys.exit(2);
return;
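# The filename built at the top of createNewOutputFile can be sketched as a
# standalone helper; a minimal sketch, where the helper name make_wxr_filename
# and the sample provider/user values are hypothetical, for illustration only:

```python
from datetime import datetime

def make_wxr_filename(provider, user, create_time, index):
    # mirrors the concatenation in createNewOutputFile:
    # WXR_<provider>_[<user>]_<createTime>-<index>.xml
    return 'WXR_' + provider + '_[' + user + ']_' + create_time + '-' + str(index) + '.xml'

# create_time comes from initForOutputFile's strftime('%Y%m%d_%H%M')
stamp = datetime.now().strftime('%Y%m%d_%H%M')
first_file = make_wxr_filename('Sina', 'someuser', stamp, 0)
```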
#------------------------------------------------------------------------------
# add CDATA, also validate it for xml
def packageCDATA(info):
#info = saxutils.escape('<![CDATA[' + info + ']]>');
info = '<![CDATA[' + info + ']]>';
return info;
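# packageCDATA simply wraps the text. Note that a literal "]]>" inside the
# content would terminate the CDATA section early; a minimal sketch below,
# where the "safe" variant is an assumed extension (not part of this script)
# using the standard split-CDATA trick:

```python
def package_cdata(info):
    # wrap text so XML parsers treat it literally, as packageCDATA does
    return '<![CDATA[' + info + ']]>'

def package_cdata_safe(info):
    # split any embedded "]]>" across two CDATA sections
    return '<![CDATA[' + info.replace(']]>', ']]]]><![CDATA[>') + ']]>'
```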
#------------------------------------------------------------------------------
# download file
def defDownloadFile(curPostUrl, picInfoDict, dstPicFile) :
curUrl = picInfoDict['picUrl'];
#use common function to download file
return crifanLib.downloadFile(curUrl, dstPicFile);
#------------------------------------------------------------------------------
#check file validation
def defIsFileValid(picInfoDict):
curUrl = picInfoDict['picUrl'];
#use common function to check file validation
return crifanLib.isFileValid(curUrl);
#------------------------------------------------------------------------------
# generate the file name for other pic
# depend on following picInfoDict definition
def defGenNewOtherPicName(picInfoDict):
newOtherPicName = "";
filename = picInfoDict['filename'];
fd1 = picInfoDict['fields']['fd1'];
fd2 = picInfoDict['fields']['fd2'];
newOtherPicName = fd1 + '_' + fd2 + "_" + filename;
return newOtherPicName;
#------------------------------------------------------------------------------
# check whether is self blog pic
# depend on following picInfoDict definition
# here default to set True: consider all pic is self blog pic
def defIsSelfBlogPic(picInfoDict):
isSelfPic = True;
logging.debug("defIsSelfBlogPic: %s", isSelfPic);
return isSelfPic;
#------------------------------------------------------------------------------
# get the found pic info after re.search
# foundPic is MatchObject
def defGetFoundPicInfo(foundPic):
# here should corresponding to singlePicUrlPat in curPicCfgDict
picUrl = foundPic.group(0);
fd1 = foundPic.group("fd1"); # blog user's name / img1
fd2 = foundPic.group("fd2"); # blogbus / blogbuscdn
fd3 = foundPic.group("fd3"); # com
fd4 = foundPic.group("fd4"); #
fd5 = foundPic.group("fd5"); #
fd6 = foundPic.group("fd6"); #
filename= foundPic.group("filename");
suffix = foundPic.group("suffix");
#logging.debug("fd:%s,%s,%s,%s,%s,%s, filename=%s, suffix=%s", fd1,fd2,fd3,fd4,fd5,fd6, filename, suffix);
picInfoDict = {
'isSupportedPic': False,
'picUrl' : picUrl,
'filename' : filename,
'suffix' : suffix,
'fields' :
{
'fd1' : fd1,
'fd2' : fd2,
'fd3' : fd3,
'fd4' : fd4,
'fd5' : fd5,
'fd6' : fd6,
},
'isSelfBlog' : False, # value is set by call isSelfBlogPic
};
if (suffix.lower() in crifanLib.getPicSufList()) :
picInfoDict['isSupportedPic'] = True;
logging.debug("%s is supported pic", picUrl);
return picInfoDict;
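# defGetFoundPicInfo depends on the named groups (fd1..fd6, filename, suffix)
# defined in singlePicUrlPat. A minimal sketch of the same extraction, using a
# simplified hypothetical pattern (not the real singlePicUrlPat):

```python
import re

# simplified stand-in for singlePicUrlPat: two host fields, filename, suffix
SIMPLE_PIC_PAT = r'https?://(?P<fd1>\w+)\.(?P<fd2>\w+)\.com/(?P<filename>[\w\-]+)\.(?P<suffix>\w{3,4})'

found = re.search(SIMPLE_PIC_PAT, 'src="http://img1.blogbus.com/photo_01.jpg"')
pic_info = {
    'picUrl':   found.group(0),
    'filename': found.group('filename'),
    'suffix':   found.group('suffix'),
    'fields':   {'fd1': found.group('fd1'), 'fd2': found.group('fd2')},
}
```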
#------------------------------------------------------------------------------
# 1. generate the default picture config dict
# 2. init config dict
def initPicCfgDict():
global gVal;
# 1. generate the default picture config dict
logging.debug("now to generate the default picture config dict");
picSufChars = crifanLib.getPicSufChars();
logging.debug("picSufChars=%s", picSufChars);
# for more about the pic url types these re patterns correspond to,
# refer to the detailed comments in each blog's getProcessPhotoCfg function
defPicCfgDict = {
#'allPicUrlPat' : r'(?<=src=")http://\w+?\.\w+?\.?\w+?\.?\w+?\.?\w+?\.?\w+?/[\w%\-=]{0,50}[/]?[\w%\-/=]*/[\w\-\.]{1,100}' + r'\.[' + picSufChars + r']{3,4}(?=")',
#'singlePicUrlPat' : r'http://(?P<fd1>\w+?)\.(?P<fd2>\w+?)(\.(?P<fd3>\w+?))?(\.(?P<fd4>\w+?))?(\.(?P<fd5>\w+?))?(\.(?P<fd6>\w+?))?/([\w%\-=]{0,50})[/]?[\w\-/%=]*/(?P<filename>[\w\-\.]{1,100})' + r'\.(?P<suffix>[' + picSufChars + r']{3,4})',
#'allPicUrlPat' : r'(?<=src=")http://\w+?\.\w+?\.?\w*?\.?\w*?\.?\w*?\.?\w*?/[\w%\-=]{0,50}[/]?[\w%\-/=]*/[\w\-\.]{1,100}' + r'\.[' + picSufChars + r']{3,4}(?=")',
#'singlePicUrlPat' : r'http://(?P<fd1>\w+?)\.(?P<fd2>\w+?)(\.(?P<fd3>\w*?))?(\.(?P<fd4>\w*?))?(\.(?P<fd5>\w*?))?(\.(?P<fd6>\w*?))?/([\w%\-=]{0,50})[/]?[\w\-/%=]*/(?P<filename>[\w\-\.]{1,100})' + r'\.(?P<suffix>[' + picSufChars + r']{3,4})',
#'allPicUrlPat' : r'(?<=src=")https?://\w+?\.\w+?\.?\w*?\.?\w*?\.?\w*?\.?\w*?/[\w%\-=]{0,50}[/]?[\w%\-/=]*/[\w\-\.]{1,100}' + r'\.[' + picSufChars + r']{3,4}(?=")',
#'singlePicUrlPat' : r'https?://(?P<fd1>\w+?)\.(?P<fd2>\w+?)(\.(?P<fd3>\w*?))?(\.(?P<fd4>\w*?))?(\.(?P<fd5>\w*?))?(\.(?P<fd6>\w*?))?/([\w%\-=]{0,50})[/]?[\w\-/%=]*/(?P<filename>[\w\-\.]{1,100})' + r'\.(?P<suffix>[' + picSufChars + r']{3,4})',
'allPicUrlPat' : r'(?<=src=")https?://(?:\w+?)\.(?:\w+?)(?:\.(?:\w*?))?(?:\.(?:\w*?))?(?:\.(?:\w*?))?(?:\.(?:\w*?))?/[\w%\-=]{0,50}[/]?[\w%\-/=]*/[\w\-\.]{1,100}' + r'\.[' + picSufChars + r']{3,4}(?=")',
'singlePicUrlPat' : r'https?://(?P<fd1>\w+?)\.(?P<fd2>\w+?)(\.(?P<fd3>\w*?))?(\.(?P<fd4>\w*?))?(\.(?P<fd5>\w*?))?(\.(?P<fd6>\w*?))?/([\w%\-=]{0,50})[/]?[\w\-/%=]*/(?P<filename>[\w\-\.]{1,100})' + r'\.(?P<suffix>[' + picSufChars + r']{3,4})',
# allPicUrlPat: search pattern for all pics, should not include '()'
# singlePicUrlPat: search pattern for a single pic, should include '()'
'getFoundPicInfo' : defGetFoundPicInfo, # function to get the found pic info after re.search
'isSelfBlogPic' : defIsSelfBlogPic, # function to check whether a pic is from the self blog; otherwise it is an other-site pic
'genNewOtherPicName' : defGenNewOtherPicName,# function to generate the new name for other pic
'isFileValid' : defIsFileValid, # function to check the (pic) url/file is valid or not
'downloadFile' : defDownloadFile, # function to download a picture; some blogs' pic download needs special processing:
# 1. QQ: download speed is low
# 2. blogbus: downloading pics needs a Referer header
};
logging.debug("defPicCfgDict=%s", defPicCfgDict);
# 2. init config dict
gotPicCfgDict = getProcessPhotoCfg();
logging.debug("gotPicCfgDict=%s", gotPicCfgDict);
curPicCfgDict = gotPicCfgDict;
for eachCfg in gotPicCfgDict:
if(not gotPicCfgDict[eachCfg]):
# if empty -> use default config
curPicCfgDict[eachCfg] = defPicCfgDict[eachCfg];
gVal['curPicCfgDict'] = curPicCfgDict;
logging.debug("gVal['curPicCfgDict']=%s", gVal['curPicCfgDict']);
return ;
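# The step-2 merge above (any empty entry from getProcessPhotoCfg falls back to
# the default config) can be sketched generically; DEFAULT_PAT and the two
# downloader functions below are hypothetical placeholders:

```python
DEFAULT_PAT = r'https?://\S+\.jpg'   # placeholder default pattern

def default_download(url):
    return 'default:' + url

def blogbus_download(url):
    # a blog-specific override (e.g. blogbus needs a Referer header)
    return 'blogbus:' + url

def merge_with_defaults(got_cfg, default_cfg):
    # any empty/None value in got_cfg falls back to the default config
    merged = dict(got_cfg)
    for key, value in got_cfg.items():
        if not value:
            merged[key] = default_cfg[key]
    return merged

cur_cfg = merge_with_defaults(
    {'allPicUrlPat': '', 'downloadFile': blogbus_download},
    {'allPicUrlPat': DEFAULT_PAT, 'downloadFile': default_download})
```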
#------------------------------------------------------------------------------
# 1. extract picture URL from blog content
# 2. process it:
# remove overlapped
# remove processed
# saved into the gVal['processedUrlList']
# download
# replace url
def processPhotos(blogContent):
global gVal;
global gCfg;
global gConst;
if gCfg['processPic'] == 'yes' :
try :
crifanLib.calcTimeStart("process_all_picture");
logging.debug("Begin to process all pictures");
#logging.debug("before find pic, post Conten=%s", blogContent);
curPicCfgDict = gVal['curPicCfgDict'];
allUrlPattern = curPicCfgDict['allPicUrlPat'];
#print "allUrlPattern=",allUrlPattern;
# if matched, result for findall() is a list when no () in pattern
matchedList = re.findall(allUrlPattern, blogContent);
logging.debug("Len(matchedList)=%d", len(matchedList));
logging.debug("matchedList=%s", matchedList);
if matchedList :
nonOverlapList = crifanLib.uniqueList(matchedList); # remove duplicates
# remove already-processed ones, and get the ones that have been processed before
(filteredPicList, existedList) = crifanLib.filterList(nonOverlapList, gVal['processedUrlList']);
if filteredPicList :
logging.debug("Filtered url list to process:\n%s", filteredPicList);
picNum = 0;
for curUrl in filteredPicList :
# check similarity only when the check is needed and the error list is not empty
if ((gCfg['omitSimErrUrl'] == 'yes') and gVal['errorUrlList']):
(isSimilar, simSrcUrl) = crifanLib.findSimilarUrl(curUrl, gVal['errorUrlList']);
if isSimilar :
logging.warning(" Omit process %s for similar with previous error url", curUrl);
logging.warning(" %s", simSrcUrl);
continue;
logging.debug("Now to process %s", curUrl);
# no matter whether (1) it is a pic or not, (2) the following search fails or not,
# (3) the later pic fetch fails or not -> this url still counts as processed
gVal['processedUrlList'].append(curUrl);
picNum += 1;
# process this url
singleUrlPattern = curPicCfgDict['singlePicUrlPat'];
#print "singleUrlPattern=",singleUrlPattern;
foundPic = re.search(singleUrlPattern, curUrl);
if foundPic :
#print "foundPic=",foundPic;
picInfoDict = {
'isSupportedPic': False,
'picUrl' : "", # the current pic url
'filename' : "", # filename of pic
'suffix' : "", # maybe empty for sina pic url
'fields' : {}, # depend on the implemented functions, normal should contains fd1/fd2/fd3/...
'isSelfBlog' : False,#is self blog pic, otherwise is other site pic
};
picInfoDict = curPicCfgDict['getFoundPicInfo'](foundPic);
#print "picInfoDict=",picInfoDict;
if picInfoDict['isSupportedPic'] :
picUrl = picInfoDict['picUrl'];
filename= picInfoDict['filename'];
suffix = picInfoDict['suffix'];
if(not suffix):
# sina pic urls like:
# http://s14.sinaimg.cn/middle/3d55a9b7g9522d474a84d&690
# have no suffix, so set it to jpg
suffix = 'jpg';
suffix = suffix.lower();
#print "filename=",filename;
#print "suffix=",suffix
#print "picInfoDict['fields']=",picInfoDict['fields'];
# check isSelfBlog first to get info for latter isFileValid
picInfoDict['isSelfBlog'] = curPicCfgDict['isSelfBlogPic'](picInfoDict);
# indeed is pic, process it
#(picIsValid, errReason) = curPicCfgDict['isFileValid'](curUrl);
(picIsValid, errReason) = curPicCfgDict['isFileValid'](picInfoDict);
#print "picIsValid=%s,errReason=%s"%(picIsValid,errReason);
if picIsValid :
# 1. prepare info
nameWithSuf = filename + '.' + suffix;
curPath = os.getcwd();
#dstPathOwnPicOld = curPath + '\\' + gVal['blogUser'] + '\\pic';
dstPathOwnPic = os.path.join(curPath, gVal['blogUser'], 'pic');
# 2. create dir for save pic
if (os.path.isdir(dstPathOwnPic) == False) :
os.makedirs(dstPathOwnPic); # create dir recursively
logging.info("Create dir %s for save downloaded pictures of own site", dstPathOwnPic);
if gCfg['processOtherPic'] == 'yes' :
#dstPathOtherPic = dstPathOwnPic + '\\' + gConst['othersiteDirName'];
dstPathOtherPic = os.path.join(dstPathOwnPic, gConst['othersiteDirName']);
if (os.path.isdir(dstPathOtherPic) == False) :
os.makedirs(dstPathOtherPic); # create dir recursively
logging.info("Create dir %s for save downloaded pictures of other site", dstPathOtherPic);
# 3. prepare info for follow download and save
if(picInfoDict['isSelfBlog']):
#print "++++ yes is self blog pic";
newPicUrl = gCfg['picPathInWP'] + '/' + nameWithSuf;
#dstPicFile = dstPathOwnPic + '\\' + nameWithSuf;
dstPicFile = os.path.join(dstPathOwnPic, nameWithSuf);
else :
# is othersite pic
#print "--- is other pic";
if gCfg['processOtherPic'] == 'yes' :
#newNameWithSuf = fd1 + '_' + fd2 + "_" + nameWithSuf;
newNameWithSuf = curPicCfgDict['genNewOtherPicName'](picInfoDict) + '.' + suffix;
#print "newNameWithSuf=",newNameWithSuf;
newPicUrl = gCfg['otherPicPathInWP'] + '/' + newNameWithSuf;
#dstPicFile = dstPathOtherPic + '\\' + newNameWithSuf;
dstPicFile = os.path.join(dstPathOtherPic, newNameWithSuf);
else :
dstPicFile = ''; # for next not download
#newPicUrl = curUrl
# download pic and replace url
logging.debug("dstPicFile=%s", dstPicFile);
#if dstPicFile and crifanLib.downloadFile(curUrl, dstPicFile) :
if dstPicFile and curPicCfgDict['downloadFile'](gVal['curPostUrl'], picInfoDict, dstPicFile) :
# replace old url with new url
logging.debug("download pic OK, now to replace url");
# urls like:
# http://b306.photo.store.qq.com/psb?/8d8d9a4f-2e9f-4b37-82d4-4559d7ec8472/E8WLK8l*kBjpak.5kg.xPzZ.**38oN517LBfrBNEAaQ!/b/YQN5aLZpBwAAYmuLa7YOBwAA
# contain regex metachars, so the following line would fail:
#blogContent = re.compile(curUrl).sub(newPicUrl, blogContent);
# so use plain string replace instead:
blogContent = blogContent.replace(curUrl, newPicUrl);
# record it
gVal['replacedUrlDict'][curUrl] = newPicUrl;
logging.debug("Replace %s with %s", curUrl, newPicUrl);
#logging.debug("After replac, new blog content:\n%s", blogContent);
logging.info(" Processed picture %3d: %s", picNum, curUrl);
else :
logging.debug("Invalid picture: %s, reason: %s", curUrl, errReason);
if (gCfg['omitSimErrUrl'] == 'yes'): # take all error pic into record
# when this pic occur error, then add to list
gVal['errorUrlList'].append(curUrl);
#logging.debug("Add invalid %s into global error url list.", curUrl);
logging.info("Add invalid %s into global error url list.", curUrl);
else :
logging.debug("Omit unsupported picture %s", curUrl);
# for that processed url, only replace the address
if existedList :
for processedUrl in existedList:
# some pic urls may be invalid, so were not downloaded and replaced;
# here only process the ones that were downloaded and replaced
if processedUrl in gVal['replacedUrlDict'] :
newPicUrl = gVal['replacedUrlDict'][processedUrl];
blogContent = re.compile(processedUrl).sub(newPicUrl, blogContent);
logging.debug("For processed url %s, not download again, only replace it with %s", processedUrl, newPicUrl);
logging.debug("Done processing all pictures");
gVal['statInfoDict']['processPicTime'] += crifanLib.calcTimeEnd("process_all_picture");
logging.debug("Successfully processed all pictures");
except :
logging.warning(' Process picture failed.');
return blogContent;
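# The overall flow of processPhotos (find all pic urls, dedupe, skip
# already-processed ones, rewrite each url in the content) can be sketched
# without the download step; the search pattern and prefix here are simplified
# assumptions, not the script's real ones:

```python
import re

def rewrite_pic_urls(content, processed_urls, new_prefix):
    # find candidate pic urls inside src="..." attributes (simplified pattern)
    found = re.findall(r'(?<=src=")https?://\S+?\.(?:jpg|png|gif)(?=")', content)
    unique = []
    for url in found:                  # dedupe while preserving order
        if url not in unique:
            unique.append(url)
    replaced = {}
    for url in unique:
        if url in processed_urls:      # already handled for an earlier post
            continue
        processed_urls.append(url)
        new_url = new_prefix + '/' + url.rsplit('/', 1)[-1]
        # plain string replace, not re.sub: urls may contain regex metachars
        content = content.replace(url, new_url)
        replaced[url] = new_url
    return content, replaced

html = '<img src="http://s14.sinaimg.cn/middle/abc.jpg">'
new_html, mapping = rewrite_pic_urls(html, [], 'http://localhost/wp-content/uploads/pic')
```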
#------------------------------------------------------------------------------
# post process blog content:
# 1. download pic and replace pic url
# 2. remove invalid ascii control char
def postProcessContent(blogContent) :
processedContent = '';
try :
blogContent = packageCDATA(blogContent);
# 1. extract pic url, download pic, replace pic url
afterProcessPic = processPhotos(blogContent);
# 2. remove invalid ascii control char
afterFilter = crifanLib.removeCtlChr(afterProcessPic);
processedContent = afterFilter;
except :
logging.debug("Fail while post process for blog content");
return processedContent;
#------------------------------------------------------------------------------
# calc the bytes/size of utf-8 string of input unicode
def utf8Bytes(unicodeVal) :
if (unicodeVal):
utf8Val = unicodeVal.encode("utf-8");
bytes = len(utf8Val);
else:
bytes = 0;
return bytes;
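# utf8Bytes measures bytes rather than characters because the maxXmlSize limit
# applies to the output file's UTF-8 bytes, and most CJK characters take 3
# bytes each in UTF-8. A minimal sketch:

```python
def utf8_bytes(text):
    # len() counts characters; encode first to count actual on-disk bytes
    return len(text.encode('utf-8')) if text else 0

ascii_size = utf8_bytes(u'blog')   # 4 characters -> 4 bytes
cjk_size = utf8_bytes(u'博客')     # 2 characters -> 6 bytes
```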
#------------------------------------------------------------------------------
# process each fetched post info
def processSinglePost(infoDict) :
# remove the control chars in title
# eg:
# http://green-waste.blog.163.com/blog/static/32677678200879111913911/
# title contains control chars: DC1, BS, DLE, DLE, DLE, DC1
infoDict['title'] = crifanLib.removeCtlChr(infoDict['title']);
# do translate here -> avoid doing it at the end;
# too many translate requests to google would cause "HTTPError: HTTP Error 502: Bad Gateway"
infoDict['titleForPublish'] = generatePostName(infoDict['title']);
if(gCfg['processType'] == "exportToWxr") :
# do some post process for blog content
infoDict['content'] = postProcessContent(infoDict['content']);
# export single post item if necessary
#--------------------------- start --------------------------------
crifanLib.calcTimeStart("export_posts");
# generate (unicode) strings
category = infoDict['category'];
if(not (category in gVal['catNiceDict'])):
curCatNice = generatePostName(category);
gVal['curItem']['catNiceDict'][category] = curCatNice;
# also add to global dict
gVal['catNiceDict'][category] = curCatNice;
gVal['nextCatId'] = 1;
newCategoriesUni = generateCategories(gVal['catNiceDict']);
else:
gVal['curItem']['catNiceDict'][category] = gVal['catNiceDict'][category];
newCategoriesUni = gVal['categoriesUni'];
# add into global tagSlugDict
# note: input tags should be unicode type
if(infoDict['tags']) :
for eachTag in infoDict['tags'] :
if eachTag : # maybe is u'', so here should check whether is empty
if(eachTag in gVal['tagSlugDict']):
gVal['curItem']['tagSlugDict'][eachTag] = gVal['tagSlugDict'][eachTag];
else :
curTagSlug = generatePostName(eachTag);
gVal['curItem']['tagSlugDict'][eachTag] = curTagSlug;
gVal['tagSlugDict'][eachTag] = curTagSlug;
if(gVal['curItem']['tagSlugDict']) :
newTagsUni = generateTags(gVal['tagSlugDict']);
else:
newTagsUni = gVal['tagsUni'];
itemUni = generateSingleItem(infoDict);
newItemsUni = gVal['itemsUni'] + itemUni;
newGeneratedUni = gVal['wxrHeaderUni'] + newCategoriesUni + newTagsUni + gVal['generatorUni'] + newItemsUni + gVal['tailUni'];
newGeneratedSize = utf8Bytes(newGeneratedUni);
logging.debug("newGeneratedSize=%d", newGeneratedSize);
# check whether size exceed limit
# Note: 0 means no limit
if gCfg['maxXmlSize'] and (newGeneratedSize > gCfg['maxXmlSize']) : # if exceed limit
# create file for output
createNewOutputFile();
#write processed ones
newFile = openOutputFile();
newFile.write(gVal['curGeneratedUni']);
newFile.flush();
newFile.close();
# update something
gVal['nextCatId'] = 1;
itemCategoriyUni = generateCategories(gVal['curItem']['catNiceDict']);
gVal['categoriesUni'] = itemCategoriyUni;
if(gVal['curItem']['tagSlugDict']) :
itemTagsUni = generateTags(gVal['curItem']['tagSlugDict']);
else:
itemTagsUni = "";
gVal['tagsUni'] = itemTagsUni;
gVal['itemsUni'] = itemUni;
# reset something
gVal['tagSlugDict'] = {};
gVal['catNiceDict'] = {};
else : # if not exceed limit:
# update something
gVal['categoriesUni'] = newCategoriesUni;
gVal['tagsUni'] = newTagsUni;
gVal['itemsUni'] = newItemsUni;
# update something
gVal['curGeneratedUni'] = gVal['wxrHeaderUni'] + gVal['categoriesUni'] + gVal['tagsUni'] + gVal['generatorUni'] + gVal['itemsUni'] + gVal['tailUni'];
gVal['curGeneratedSize'] = utf8Bytes(gVal['curGeneratedUni']);
logging.debug("after process post, gVal['curGeneratedSize']=%d", gVal['curGeneratedSize']);
# clear something
gVal['curItem']['catNiceDict'] = {};
gVal['curItem']['tagSlugDict'] = {};
gVal['statInfoDict']['exportPostsTime'] += crifanLib.calcTimeEnd("export_posts");
logging.debug("gVal['statInfoDict']['exportPostsTime']=%f", gVal['statInfoDict']['exportPostsTime']);
#--------------------------- end --------------------------------
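The size-rollover logic above can be sketched in isolation: measure the UTF-8 byte length of the assembled WXR text and start a new output file once a configured limit is exceeded, with 0 meaning "no limit". The names `utf8_bytes` and `exceeds_limit` below are illustrative stand-ins for `utf8Bytes` and the inline check against `gCfg['maxXmlSize']`, not functions from this file.

```python
def utf8_bytes(text):
    # Length in bytes once encoded as UTF-8 (not the character count);
    # multi-byte characters make these differ.
    return len(text.encode("utf-8"))

def exceeds_limit(generated, max_xml_size):
    # 0 disables the limit, mirroring the gCfg['maxXmlSize'] convention.
    return bool(max_xml_size) and utf8_bytes(generated) > max_xml_size
```

When the limit is hit, the script writes the previously accumulated text and re-seeds the categories/tags/items buffers from the current item only.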
elif (gCfg['processType'] == "modifyPost") :
# 1. prepare new content
newPostContentUni = gVal['postModifyPattern'];
# replace the WordPress permanent-link slug, which here equals the title used for publishing
newPostContentUni = newPostContentUni.replace("${titleForPublish}", infoDict['titleForPublish']);
# replace title, infoDict['title'] must non-empty
newPostContentUni = newPostContentUni.replace("${originalTitle}", unicode(infoDict['title']));
titleUtf8 = infoDict['title'].encode("UTF-8");
#quotedTitle = urllib.quote_plus(titleUtf8);
quotedTitle = urllib.quote(titleUtf8);
newPostContentUni = newPostContentUni.replace("${quotedTitle}", quotedTitle);
# replace datetime, infoDict['datetime'] must non-empty
localTime = parseDatetimeStrToLocalTime(infoDict['datetime']);
newPostContentUni = newPostContentUni.replace("${postYear}", str.format("{0:4d}", localTime.year));
newPostContentUni = newPostContentUni.replace("${postMonth}", str.format("{0:02d}", localTime.month));
newPostContentUni = newPostContentUni.replace("${postDay}", str.format("{0:02d}", localTime.day));
# replace category
newPostContentUni = newPostContentUni.replace("${category}", infoDict['category']);
# replace content
newPostContentUni = newPostContentUni.replace("${originBlogContent}", infoDict['content']);
# 2. modify to new content
(modifyOk, errInfo) = modifySinglePost(newPostContentUni, infoDict, gCfg);
if(modifyOk) :
logging.debug("Modified %s successfully.", infoDict['url']);
else:
logging.error("Modifying %s failed: %s.", infoDict['url'], errInfo);
sys.exit(2);
#------------------------------------------------------------------------------
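The `modifyPost` branch above fills a template by chaining `.replace()` calls for each `${name}` placeholder. A minimal, hypothetical sketch of that substitution pattern, with the zero-padding used for `${postMonth}`/`${postDay}`:

```python
def fill_template(template, values):
    # Replace each ${name} placeholder with its value, mirroring the
    # chain of newPostContentUni.replace("${...}", ...) calls above.
    for name, value in values.items():
        template = template.replace("${" + name + "}", str(value))
    return template

def two_digits(n):
    # Zero-padded field, as in str.format("{0:02d}", localTime.month).
    return "{0:02d}".format(n)
```

Unlike `string.Template`, plain `.replace()` leaves unknown placeholders untouched, which is the behavior the branch relies on.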
#1. open current post item
#2. save related info into post info dict
#3. return post info dict
def fetchSinglePost(url):
global gVal;
global gConst;
global gCfg;
#update post id
gVal['postID'] += 1;
gVal['curPostUrl'] = url;
logging.debug("----------------------------------------------------------");
logging.info("[%04d] %s", gVal['postID'], url);
crifanLib.calcTimeStart("fetch_page");
# fetching the page sometimes fails due to network errors, so retry several times
for tries in range(gCfg['funcTotalExecNum']) :
try :
logging.debug("Begin to get url resp html for %s", url);
respHtml = crifanLib.getUrlRespHtml(url);
#logging.debug("Response html\n---------------\n%s", respHtml);
gVal['statInfoDict']['fetchPageTime'] += crifanLib.calcTimeEnd("fetch_page");
logging.debug("Successfully downloaded: %s", url);
break # successfully, so break now
except :
if tries < (gCfg['funcTotalExecNum'] - 1) :
logging.warning("Fetching page %s failed, doing retry %d", url, (tries + 1));
continue;
else : # last try also failed, so exit
logging.error("Tried %d times to fetch page %s, all failed!", gCfg['funcTotalExecNum'], url);
sys.exit(2);
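The retry loop above follows a common pattern: attempt the fetch up to a configured number of times, swallow the exception on all but the last attempt, and only then give up. A self-contained sketch (the script calls `sys.exit(2)` where this version re-raises):

```python
def fetch_with_retries(fetch, url, total_tries=3):
    # Call fetch(url) up to total_tries times; swallow failures on all
    # but the last attempt, then re-raise so the caller can decide.
    for tries in range(total_tries):
        try:
            return fetch(url)
        except Exception:
            if tries < total_tries - 1:
                continue
            raise
```

Catching bare `Exception` (rather than a bare `except:`) is slightly safer than the original, since it does not swallow `KeyboardInterrupt`.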
infoDict = {
'omit' : False,
'url' : '',
'postid' : 0,
'title' : '',
'nextLink' : '',
'type' : '',
'content' : '',
'datetime' : '',
'category' : '',
'tags' : [],
'comments' : [], # each one is a dict value
'respHtml' : '',
};
infoDict['url'] = url;
infoDict['postid'] = gVal['postID'];
infoDict['respHtml']= respHtml;
# extract title
(needOmit, infoDict['title']) = extractTitle(url, respHtml);
if(not infoDict['title'] ) :
logging.error("Cannot extract post title for %s !", url);
sys.exit(2);
else :
infoDict['title'] = crifanLib.repUniNumEntToChar(infoDict['title']);
# for later export to WXR, make sure it is XML-safe
infoDict['title'] = saxutils.escape(infoDict['title']);
logging.debug("Extracted post title: %s", infoDict['title']);
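The `saxutils.escape` step above makes the title safe to embed in the WXR (XML) output. A minimal demonstration of what it does to the XML metacharacters:

```python
from xml.sax import saxutils

def xml_safe(text):
    # escape() replaces &, < and > with their XML entities so the title
    # cannot break the surrounding WXR markup.
    return saxutils.escape(text)
```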
# extract the next (previously published) blog item link
# note: the next link must be extracted first, so the following call can still use it when omit=True
#logging.info("Begin to call findNextPermaLink");
infoDict['nextLink'] = findNextPermaLink(url, respHtml);
logging.debug("infoDict['nextLink']=%s", infoDict['nextLink']);
logging.debug("Extracted post's next permanent link: %s", infoDict['nextLink']);
isPrivate = isPrivatePost(url, respHtml);
if(isPrivate) :
infoDict['type'] = 'private';
logging.debug("Post type is private.");
else :
# temporarily do not handle the "friendOnly" type
logging.debug("Post type is public.");
infoDict['type'] = 'publish';
if(needOmit):
logging.info(" Omit processing current post: %s", infoDict['title']);
infoDict['omit'] = True;
elif((gCfg['postTypeToProcess'] == "privateOnly") and (not isPrivate)) :
logging.info(" Omit processing non-private post: %s", infoDict['title']);
infoDict['omit'] = True;
elif((gCfg['postTypeToProcess'] == "publicOnly") and isPrivate) :
infoDict['omit'] = True;
logging.info(" Omit processing private post: %s", infoDict['title']);
if (infoDict['omit']):
return infoDict;
else :
logging.info(" Title = %s", infoDict['title']);
# extract datetime
infoDict['datetime'] = extractDatetime(url, respHtml);
if(not infoDict['datetime'] ) :
logging.error("Cannot extract post publish datetime for %s !", url);
sys.exit(2);
else :
logging.debug("Extracted post publish datetime: %s", infoDict['datetime']);
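The extracted datetime string is later converted by `parseDatetimeStrToLocalTime` (defined elsewhere in this file). A hypothetical stand-in, assuming a `"YYYY-MM-DD HH:MM:SS"` string; the real format emitted by the blog platform may differ:

```python
from datetime import datetime

def parse_post_datetime(datetime_str, fmt="%Y-%m-%d %H:%M:%S"):
    # Hypothetical stand-in for parseDatetimeStrToLocalTime; it returns
    # a datetime whose .year/.month/.day feed the ${postYear} etc.
    # placeholders in the modifyPost branch.
    return datetime.strptime(datetime_str, fmt)
```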